From 7d632a2a3a298dde5f4042fae0b447b0850774f1 Mon Sep 17 00:00:00 2001
From: Harikrishnan
Date: Wed, 20 Feb 2019 17:34:20 +0100
Subject: [PATCH] Comprehensive article on Pandas DataFrame (#31801)

* Comprehensive article on Pandas DataFrame

I will add more operations and methods in the coming articles.

* Updated the comments on title.

reread and modified based on style guide.
---
 .../pandas/dataframe/index.md | 373 ++++++++++++++++++
 1 file changed, 373 insertions(+)
 create mode 100644 guide/english/data-science-tools/pandas/dataframe/index.md

diff --git a/guide/english/data-science-tools/pandas/dataframe/index.md b/guide/english/data-science-tools/pandas/dataframe/index.md
new file mode 100644
index 0000000000..d56f470938
--- /dev/null
+++ b/guide/english/data-science-tools/pandas/dataframe/index.md
@@ -0,0 +1,373 @@
+---
+title: pandas DataFrame
+---
+
+## DataFrame
+
+In this section you will take a detailed look at the other important pandas data type, the `DataFrame`. A DataFrame is mainly used to represent two-dimensional, tabular data with labeled rows and columns. It can also be thought of as a collection of `Series` objects that share the same index.
+
+A DataFrame can also represent three-dimensional data by using a `MultiIndex`. This approach replaces the old, now-deprecated `Panel` object. Three-dimensional data can then be viewed as multiple 2-D DataFrames grouped under an outer index level.
+
+### Basic syntax of DataFrame
+
+
+```python
+pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
+```
+
+`data` : ndarray, dict, Series, or another DataFrame.
+
+`index` : array-like or Index. Defaults to `RangeIndex(0, 1, 2, ..., n)`. Represents the row labels.
+
+`columns` : array-like or Index. Defaults to `RangeIndex(0, 1, 2, ..., n)`. Represents the column labels.
+
+`dtype` : dtype, default None.
+Data type to force on the DataFrame; inferred from the data when None.
+
+### Creating DataFrame in different ways:
+
+As a first step, import the pandas module:
+
+
+```python
+import pandas as pd
+```
+
+### Create an empty DataFrame:
+
+
+```python
+df = pd.DataFrame()
+print(df)
+```
+
+    Empty DataFrame
+    Columns: []
+    Index: []
+
+
+### Create using a list:
+
+
+```python
+input_data = [['Mark',87],['Tom',78],['Monika',97]]
+df = pd.DataFrame(data=input_data)
+print(df)
+```
+
+            0   1
+    0    Mark  87
+    1     Tom  78
+    2  Monika  97
+
+
+
+```python
+print('DataFrame with column and row index labels:')
+roll_no = [223,224,225]
+df = pd.DataFrame(data=input_data, index=roll_no,
+                  columns=['Name','Score'])
+print(df)
+```
+
+    DataFrame with column and row index labels:
+           Name  Score
+    223    Mark     87
+    224     Tom     78
+    225  Monika     97
+
+
+### Create using a dict:
+
+
+```python
+input_data = {'Name': ['Mark','Tom','Monika'],
+              'Score': [87,78,97]}
+df = pd.DataFrame(data=input_data).astype({'Score': float}) # Score is cast to float; dtype=float in the constructor would also target the Name strings, which newer pandas versions may reject.
+print(df)
+```
+
+         Name  Score
+    0    Mark   87.0
+    1     Tom   78.0
+    2  Monika   97.0
+
+
+### Create using a list of dict:
+
+
+```python
+input_data = [{'Name': 'Mark','Score': 87},
+              {'Name': 'Tom','Score': 78},
+              {'Name': 'Monika', 'Score': 97}]
+df = pd.DataFrame(data=input_data, index=roll_no)
+print(df)
+```
+
+           Name  Score
+    223    Mark     87
+    224     Tom     78
+    225  Monika     97
+
+
+### Create using a dict of Series:
+
+
+```python
+input_data = {'Name': pd.Series(['Mark','Tom','Monika','John']),
+              'Score': pd.Series([87,78,97])}
+df = pd.DataFrame(input_data)
+print(df)
+```
+
+         Name  Score
+    0    Mark   87.0
+    1     Tom   78.0
+    2  Monika   97.0
+    3    John    NaN
+
+
+Notice in the output above that John's score is `NaN` (not a number). pandas fills missing values with `numpy.nan` by default, which is also why the `Score` column is converted to float.
+
+### DataFrame Manipulations:
+
+Now that you know how to create a DataFrame from different kinds of input, let's move on to the manipulation operations you can perform on it.
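
Before manipulating a DataFrame, it can help to check what was actually created. The snippet below is a minimal sketch and not part of the original article: the names `names`, `scores`, and `students` are made up for illustration. It rebuilds the dict-of-Series example and inspects its shape, labels, and the effect of the missing score.

```python
import pandas as pd

# Hypothetical data mirroring the dict-of-Series example above.
names = pd.Series(['Mark', 'Tom', 'Monika', 'John'])
scores = pd.Series([87, 78, 97])  # one value fewer than `names`

students = pd.DataFrame({'Name': names, 'Score': scores})

# Basic structure: (rows, columns), column labels, and row labels.
print(students.shape)
print(list(students.columns))
print(list(students.index))

# The Series are aligned on their index labels, so the label without
# a score gets NaN, which in turn makes the Score column float.
print(students['Score'].dtype)
print(students['Score'].isna().sum())
```

Checking `shape`, `dtype`, and `isna()` right after construction is one simple way to catch misaligned input before it affects later calculations.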
+
+### Column Manipulation:
+
+The column-level operations discussed here are:
+* Column selection
+* Column addition
+* Column deletion
+
+
+```python
+score_sheet = {'Name': pd.Series(['Mark','Tom','Monika','Lilly','Sam']),
+               'Maths': pd.Series([89,87,83,78,77]),
+               'Science': pd.Series([78,88,66,0,88])}
+DF = pd.DataFrame(score_sheet)
+print(DF)
+```
+
+         Name  Maths  Science
+    0    Mark     89       78
+    1     Tom     87       88
+    2  Monika     83       66
+    3   Lilly     78        0
+    4     Sam     77       88
+
+
+### Column selection:
+
+
+```python
+DF['Name'] # Selecting a particular column
+```
+
+
+
+
+    0      Mark
+    1       Tom
+    2    Monika
+    3     Lilly
+    4       Sam
+    Name: Name, dtype: object
+
+
+
+
+```python
+type(DF['Maths'])
+```
+
+
+
+
+    pandas.core.series.Series
+
+
+
+Notice that each column of a DataFrame is a `Series`, so it supports all `Series` operations. For example:
+
+
+```python
+DF['Maths'].max() # Finding the max score in maths
+```
+
+
+
+
+    89
+
+
+
+
+```python
+math = DF[['Name','Maths']] # Selecting multiple columns
+print(math)
+```
+
+         Name  Maths
+    0    Mark     89
+    1     Tom     87
+    2  Monika     83
+    3   Lilly     78
+    4     Sam     77
+
+
+### Column addition:
+
+
+```python
+DF['English'] = pd.Series([88,89,98,88,0]) # Adding a new subject, English.
+print(DF)
+```
+
+         Name  Maths  Science  English
+    0    Mark     89       78       88
+    1     Tom     87       88       89
+    2  Monika     83       66       98
+    3   Lilly     78        0       88
+    4     Sam     77       88        0
+
+
+
+```python
+DF['Total Score'] = DF['Maths'] + DF['Science'] + DF['English']
+print(DF)
+```
+
+         Name  Maths  Science  English  Total Score
+    0    Mark     89       78       88          255
+    1     Tom     87       88       89          264
+    2  Monika     83       66       98          247
+    3   Lilly     78        0       88          166
+    4     Sam     77       88        0          165
+
+
+### Column deletion:
+
+
+```python
+# Using the del statement:
+del DF['English']
+print(DF)
+```
+
+         Name  Maths  Science  Total Score
+    0    Mark     89       78          255
+    1     Tom     87       88          264
+    2  Monika     83       66          247
+    3   Lilly     78        0          166
+    4     Sam     77       88          165
+
+
+
+```python
+# Using the pop method (it also returns the removed column):
+DF.pop('Total Score')
+print(DF)
+```
+
+         Name  Maths  Science
+    0    Mark     89       78
+    1     Tom     87       88
+    2  Monika     83       66
+    3   Lilly     78        0
+    4     Sam     77       88
+
+
+### Row Manipulation:
+
+As with columns, a `DataFrame` supports similar operations on rows. This section looks at those row-level operations in detail, using the same `DataFrame` `DF` created earlier.
+
+### Row selection:
+
+A DataFrame provides two indexers for selecting rows: `.iloc[]` and `.loc[]`. Both are used with square brackets rather than called like methods.
+
+* `.iloc[]` selects by integer position.
+* `.loc[]` selects by label value.
+
+The examples below use `.iloc[]`.
+
+
+```python
+DF.iloc[2] # Returns the row at position 2, i.e. the third row.
+```
+
+
+
+
+    Name       Monika
+    Maths          83
+    Science        66
+    Name: 2, dtype: object
+
+
+
+
+```python
+type(DF.iloc[3])
+```
+
+
+
+
+    pandas.core.series.Series
+
+
+
+The important thing to notice is that this also returns a `Series`: not only columns but also individual rows are returned as a `Series`.
+
+
+```python
+print(DF[2:4]) # Slicing the rows
+```
+
+         Name  Maths  Science
+    2  Monika     83       66
+    3   Lilly     78        0
+
+
+### Row addition:
+
+
+```python
+new_student = pd.DataFrame(data=[['Ben',79,89]],
+                           columns=['Name','Maths','Science'],
+                           index=[5])
+
+DF = pd.concat([DF, new_student]) # Adds the new row; DataFrame.append was removed in pandas 2.0, pd.concat works across versions.
+print(DF)
+```
+
+         Name  Maths  Science
+    0    Mark     89       78
+    1     Tom     87       88
+    2  Monika     83       66
+    3   Lilly     78        0
+    4     Sam     77       88
+    5     Ben     79       89
+
+
+### Row deletion:
+
+
+```python
+# drop identifies rows by index label and returns a new DataFrame, so assign the result back:
+DF = DF.drop(3)
+print(DF)
+```
+
+         Name  Maths  Science
+    0    Mark     89       78
+    1     Tom     87       88
+    2  Monika     83       66
+    4     Sam     77       88
+    5     Ben     79       89
+
+
+#### More Information:
+
+[DataFrame](http://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html)
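
As a closing sketch that ties row selection and row deletion together (not part of the original article; `scores`, `trimmed`, `by_position`, and `by_label` are illustrative names), the snippet below shows that `drop` returns a new DataFrame rather than modifying the original, and that `.iloc[]` and `.loc[]` can address the same row differently once labels and positions no longer line up.

```python
import pandas as pd

# Hypothetical score sheet, similar to the DF used in this article.
scores = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika', 'Lilly', 'Sam'],
                       'Maths': [89, 87, 83, 78, 77]})

# drop() returns a new DataFrame; `scores` itself is left unchanged.
trimmed = scores.drop(3)
print(len(scores), len(trimmed))

# After the drop, label 4 sits at position 3, so the two indexers
# reach the same row through different keys.
by_position = trimmed.iloc[3]   # fourth row by position
by_label = trimmed.loc[4]       # row whose index label is 4
print(by_position['Name'], by_label['Name'])

# Label 3 no longer exists, so a label lookup for it raises KeyError.
try:
    trimmed.loc[3]
except KeyError:
    print('label 3 was dropped')
```

Keeping the result of `drop` in a separate variable, as done here, leaves the original data available for comparison; reassigning to the same name is the usual choice when the old rows are no longer needed.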