| ---
 | |
| title: pandas DataFrame
 | |
| ---
 | |
| 
 | |
| ## DataFrame
 | |
| 
 | |
| This section takes a detailed look at the other important pandas data type, the `DataFrame`. In pandas, a DataFrame is mainly used to represent 2-dimensional, tabular data with rows and columns. It can also be thought of as a collection of `Series` that share the same index.
 | |
| 
 | |
| A DataFrame can also hold 3-dimensional data by using a `MultiIndex`. This is the replacement for the old `Panel` object, which was deprecated and then removed in pandas 0.25. A 3-dimensional data set can be stored as multiple 2-D tables stacked under an additional index level.
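To make the multi-index idea concrete, here is a minimal sketch. The two per-term tables and the level names `Term` and `Name` are assumptions made up for illustration, not part of the original article.

```python
import pandas as pd

# Two hypothetical 2-D score tables, one per school term
term1 = pd.DataFrame({'Maths': [89, 87], 'Science': [78, 88]},
                     index=['Mark', 'Tom'])
term2 = pd.DataFrame({'Maths': [91, 80], 'Science': [75, 90]},
                     index=['Mark', 'Tom'])

# Stack the tables under an outer index level to get a MultiIndex
stacked = pd.concat({'Term1': term1, 'Term2': term2},
                    names=['Term', 'Name'])
print(stacked)

# Select one 2-D "slice" of the 3-D data
print(stacked.loc['Term2'])

# Select a single value using (Term, Name) and the column label
tom_science = stacked.loc[('Term1', 'Tom'), 'Science']
print(tom_science)
```

Each outer label (`Term1`, `Term2`) behaves like one 2-D table, which is roughly what an item of the old `Panel` provided.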
 | |
| 
 | |
| ### Basic syntax of DataFrame
 | |
| 
 | |
| 
 | |
| ```python
 | |
| pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
 | |
| ```
 | |
| 
 | |
| `data`  : ndarray, dict, list, Series, or another DataFrame.
 | |
| 
 | |
| `index`  : array-like or Index. Defaults to RangeIndex (0, 1, 2, ..., n-1). Represents the row labels.
 | |
| 
 | |
| `columns`  : array-like or Index. Defaults to RangeIndex (0, 1, 2, ..., n-1). Represents the column labels.
 | |
| 
 | |
| `dtype`  : dtype, default None. Data type to force for the DataFrame. If None, the type is inferred from the data.
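The parameters above can be checked with a small sketch. The values and labels used here (`'a'`, `'b'`, `'x'`, `'y'`) are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Without index/columns, pandas falls back to a RangeIndex starting at 0
df_default = pd.DataFrame(data=[[1, 2], [3, 4]])
print(df_default.index)      # row labels
print(df_default.columns)    # column labels

# Explicit row labels, column labels, and a forced dtype
df_labeled = pd.DataFrame(data=[[1, 2], [3, 4]],
                          index=['a', 'b'],
                          columns=['x', 'y'],
                          dtype=float)
print(df_labeled)
print(df_labeled.dtypes)
```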
 | |
| 
 | |
| ### Creating DataFrame in different ways:
 | |
| 
 | |
| As a first step, import the pandas module:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| import pandas as pd
 | |
| ```
 | |
| 
 | |
| ### Create an empty DataFrame:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| df = pd.DataFrame()
 | |
| print(df)
 | |
| ```
 | |
| 
 | |
|     Empty DataFrame
 | |
|     Columns: []
 | |
|     Index: []
 | |
| 
 | |
| 
 | |
| ### Create using a list:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| input_data = [['Mark',87],['Tom',78],['Monika',97]]
 | |
| df = pd.DataFrame(data = input_data)
 | |
| print(df)
 | |
| ```
 | |
| 
 | |
|             0   1
 | |
|     0    Mark  87
 | |
|     1     Tom  78
 | |
|     2  Monika  97
 | |
| 
 | |
| 
 | |
| 
 | |
| ```python
 | |
| print('DataFrame with column and row index labels:')
 | |
| roll_no = [223,224,225]
 | |
| df = pd.DataFrame(data= input_data,index= roll_no,
 | |
|              columns=['Name','Score'])
 | |
| print(df)
 | |
| ```
 | |
| 
 | |
|     DataFrame with column and row index labels:
 | |
|            Name  Score
 | |
|     223    Mark     87
 | |
|     224     Tom     78
 | |
|     225  Monika     97
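The `data` parameter also accepts a NumPy ndarray. The sketch below assumes NumPy is installed (it is a pandas dependency) and uses made-up row and column labels.

```python
import numpy as np
import pandas as pd

# A 2-D array: one row per student, one column per subject
scores = np.array([[89, 78],
                   [87, 88],
                   [83, 66]])

df = pd.DataFrame(data=scores,
                  index=['Mark', 'Tom', 'Monika'],
                  columns=['Maths', 'Science'])
print(df)
print(df.shape)   # (number of rows, number of columns)
```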
 | |
| 
 | |
| 
 | |
| ### Create using a dict:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| input_data = {'Name': ['Mark','Tom','Monika'],
 | |
|               'Score': [87,78,97]}
 | |
| df = pd.DataFrame(data=input_data).astype({'Score': float})   # Convert only the numeric Score column to float; forcing dtype=float on the text column can fail in recent pandas versions.
 | |
| print(df)
 | |
| ```
 | |
| 
 | |
|          Name  Score
 | |
|     0    Mark   87.0
 | |
|     1     Tom   78.0
 | |
|     2  Monika   97.0
 | |
| 
 | |
| 
 | |
| ### Create using a list of dict:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| input_data = [{'Name': 'Mark','Score': 87},
 | |
|               {'Name': 'Tom','Score': 78},
 | |
|               {'Name': 'Monika', 'Score': 97}]
 | |
| df = pd.DataFrame(data= input_data, index=roll_no)
 | |
| print(df)
 | |
| ```
 | |
| 
 | |
|            Name  Score
 | |
|     223    Mark     87
 | |
|     224     Tom     78
 | |
|     225  Monika     97
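With a list of dicts, the dicts do not need to have identical keys. A minimal sketch, using a hypothetical `Grade` key that only one record has:

```python
import pandas as pd

# Records with different sets of keys
records = [{'Name': 'Mark', 'Score': 87},
           {'Name': 'Tom'},                               # no Score
           {'Name': 'Monika', 'Score': 97, 'Grade': 'A'}]

df = pd.DataFrame(records)
print(df)

# Every key becomes a column; missing entries are filled with NaN
print(df.isna())
```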
 | |
| 
 | |
| 
 | |
| ### Create using a dict of Series:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| input_data = {'Name': pd.Series(['Mark','Tom','Monika','John']),
 | |
|               'Score': pd.Series([87,78,97])}
 | |
| df = pd.DataFrame(input_data)
 | |
| print(df)
 | |
| ```
 | |
| 
 | |
|          Name  Score
 | |
|     0    Mark   87.0
 | |
|     1     Tom   78.0
 | |
|     2  Monika   97.0
 | |
|     3    John    NaN
 | |
| 
 | |
| 
 | |
| Notice in the output above that John's score is `NaN` (not a number). pandas aligns the two Series on their index and fills the missing value with `numpy.nan`, which is also why the `Score` column is converted to float.
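Missing values like this can be detected and handled explicitly. The sketch below rebuilds the same DataFrame; filling with `0` is only an illustrative choice, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({'Name': pd.Series(['Mark', 'Tom', 'Monika', 'John']),
                   'Score': pd.Series([87, 78, 97])})

# True where a score is missing
missing = df['Score'].isna()
print(missing)

# Replace missing scores with a default value (returns a new DataFrame)
filled = df.fillna({'Score': 0})
print(filled)

# Or keep only the rows that have a score
present = df.dropna(subset=['Score'])
print(present)
```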
 | |
| 
 | |
| ### DataFrame Manipulations:
 | |
| 
 | |
| You now have a good idea of how to create a DataFrame and of the different kinds of input you can use to create it. Next, we look at the manipulation operations that can be performed on a DataFrame.
 | |
| 
 | |
| ### Column Manipulation:
 | |
| 
 | |
| The following column-level operations are discussed here:
 | |
| * Column selection
 | |
| * Column addition
 | |
| * Column deletion
 | |
| 
 | |
| 
 | |
| ```python
 | |
| score_sheet = {'Name': pd.Series(['Mark','Tom','Monika','Lilly','Sam']),
 | |
|                'Maths': pd.Series([89,87,83,78,77]),
 | |
|                'Science': pd.Series([78,88,66,0,88])}
 | |
| DF = pd.DataFrame(score_sheet)
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science
 | |
|     0    Mark     89       78
 | |
|     1     Tom     87       88
 | |
|     2  Monika     83       66
 | |
|     3   Lilly     78        0
 | |
|     4     Sam     77       88
 | |
| 
 | |
| 
 | |
| ### Column selection:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| DF['Name']             # Selecting a particular column
 | |
| ```
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
|     0      Mark
 | |
|     1       Tom
 | |
|     2    Monika
 | |
|     3     Lilly
 | |
|     4       Sam
 | |
|     Name: Name, dtype: object
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| ```python
 | |
| type(DF['Maths'])
 | |
| ```
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
|     pandas.core.series.Series
 | |
| 
 | |
| 
 | |
| 
 | |
| Notice that each column in a DataFrame is a `Series`, so it supports all of the `Series` operations. For example:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| DF['Maths'].max()      # Finding the max score in maths
 | |
| ```
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
|     89
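Other Series operations work on a column in the same way. A short sketch on the same score sheet (rebuilt here so that the block is self-contained):

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika', 'Lilly', 'Sam'],
                   'Maths': [89, 87, 83, 78, 77],
                   'Science': [78, 88, 66, 0, 88]})

# Average Maths score
maths_mean = DF['Maths'].mean()
print(maths_mean)

# Index label of the highest Maths score, then the matching name
top_row = DF['Maths'].idxmax()
top_name = DF.loc[top_row, 'Name']
print(top_name)

# A boolean Series built from a column can filter the rows
above_80 = DF[DF['Maths'] > 80]
print(above_80)
```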
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| ```python
 | |
| math = DF[['Name','Maths']]    # Selecting multiple columns with a list of labels
 | |
| print(math)
 | |
| ```
 | |
| 
 | |
|          Name  Maths
 | |
|     0    Mark     89
 | |
|     1     Tom     87
 | |
|     2  Monika     83
 | |
|     3   Lilly     78
 | |
|     4     Sam     77
 | |
| 
 | |
| 
 | |
| ### Column addition:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| DF['English'] = pd.Series([88,89,98,88,0])   # Adding a new subject English.
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science  English
 | |
|     0    Mark     89       78       88
 | |
|     1     Tom     87       88       89
 | |
|     2  Monika     83       66       98
 | |
|     3   Lilly     78        0       88
 | |
|     4     Sam     77       88        0
 | |
| 
 | |
| 
 | |
| 
 | |
| ```python
 | |
| DF['Total Score'] = DF['Maths'] + DF['Science'] + DF['English']
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science  English  Total Score
 | |
|     0    Mark     89       78       88          255
 | |
|     1     Tom     87       88       89          264
 | |
|     2  Monika     83       66       98          247
 | |
|     3   Lilly     78        0       88          166
 | |
|     4     Sam     77       88        0          165
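Besides direct assignment, columns can be added with `assign` (which returns a new DataFrame) or `insert` (which places the column at a chosen position). A minimal sketch; the `Average` and `Roll No` columns are assumptions added for illustration.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika', 'Lilly', 'Sam'],
                   'Maths': [89, 87, 83, 78, 77],
                   'Science': [78, 88, 66, 0, 88]})

# assign() leaves DF untouched and returns a copy with the new column
with_avg = DF.assign(Average=(DF['Maths'] + DF['Science']) / 2)
print(with_avg)

# insert() modifies DF in place and puts the column at position 1
DF.insert(1, 'Roll No', [223, 224, 225, 226, 227])
print(DF)
```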
 | |
| 
 | |
| 
 | |
| ### Column deletion:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| # Using the del statement:
 | |
| del DF['English']
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science  Total Score
 | |
|     0    Mark     89       78          255
 | |
|     1     Tom     87       88          264
 | |
|     2  Monika     83       66          247
 | |
|     3   Lilly     78        0          166
 | |
|     4     Sam     77       88          165
 | |
| 
 | |
| 
 | |
| 
 | |
| ```python
 | |
| #Using the pop method:
 | |
| DF.pop('Total Score')
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science
 | |
|     0    Mark     89       78
 | |
|     1     Tom     87       88
 | |
|     2  Monika     83       66
 | |
|     3   Lilly     78        0
 | |
|     4     Sam     77       88
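Two details are worth a quick sketch: `pop` returns the removed column as a Series, and `drop(columns=...)` removes columns without modifying the original DataFrame. The `Total` column is recreated here only for the example.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                   'Maths': [89, 87, 83],
                   'Science': [78, 88, 66]})
DF['Total'] = DF['Maths'] + DF['Science']

# pop() removes the column in place and hands it back as a Series
total = DF.pop('Total')
print(total)

# drop() returns a new DataFrame; DF itself still has the Science column
trimmed = DF.drop(columns=['Science'])
print(trimmed)
print(DF.columns)
```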
 | |
| 
 | |
| 
 | |
| ### Row Manipulation:
 | |
| 
 | |
| As with columns, a `DataFrame` supports similar operations on rows. This section looks at those row-level operations in detail, using the same `DataFrame` `DF` created earlier.
 | |
| 
 | |
| ### Row selection:
 | |
| 
 | |
| There are two indexers available on a DataFrame for selection: `.iloc[]` and `.loc[]`. Both are used with square brackets rather than called like methods.
 | |
| 
 | |
| * `.iloc[]` selects based on integer position.
 | |
| * `.loc[]` selects based on the label value.
 | |
| 
 | |
| First, let us look at `.iloc[]`.
 | |
| 
 | |
| 
 | |
| ```python
 | |
| DF.iloc[2]              # Returns the row at position 2 (the third row, since positions start at 0).
 | |
| ```
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
|     Name       Monika
 | |
|     Maths          83
 | |
|     Science        66
 | |
|     Name: 2, dtype: object
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| ```python
 | |
| type(DF.iloc[3])
 | |
| ```
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
|     pandas.core.series.Series
 | |
| 
 | |
| 
 | |
| 
 | |
| The important thing to notice here is that this returns a `Series` again. A single row, just like a single column, is returned as a `Series`.
 | |
| 
 | |
| 
 | |
| ```python
 | |
| print(DF[2:4])           # Slicing the rows by position (the end position is excluded)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science
 | |
|     2  Monika     83       66
 | |
|     3   Lilly     78        0
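The difference between `.loc[]` and `.iloc[]` is easiest to see when the index is not the default 0, 1, 2, and so on. A minimal sketch, reusing roll numbers as row labels as in the creation examples:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                   'Score': [87, 78, 97]},
                  index=[223, 224, 225])

# Label-based: the row whose index label is 224
by_label = df.loc[224]

# Position-based: the row at position 1 (the same row here)
by_pos = df.iloc[1]
print(by_label['Name'], by_pos['Name'])

# Label slices include the end label; position slices exclude the end
label_slice = df.loc[223:224]
pos_slice = df.iloc[0:1]
print(len(label_slice), len(pos_slice))

# Row label and column label together select a single cell
cell = df.loc[225, 'Score']
print(cell)
```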
 | |
| 
 | |
| 
 | |
| ### Row addition:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| new_student = pd.DataFrame(data = [['Ben',79,89]], 
 | |
|                            columns=['Name','Maths','Science'], 
 | |
|                            index=[5])
 | |
| 
 | |
| DF = pd.concat([DF, new_student])       # Adds the new row. DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science
 | |
|     0    Mark     89       78
 | |
|     1     Tom     87       88
 | |
|     2  Monika     83       66
 | |
|     3   Lilly     78        0
 | |
|     4     Sam     77       88
 | |
|     5     Ben     79       89
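A single row can also be added by assigning to a new label with `.loc[]`, and several rows can be appended at once with `pd.concat`. A minimal sketch; the extra students `Amy` and `Raj` are made up for illustration.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom'],
                   'Maths': [89, 87],
                   'Science': [78, 88]})

# Assigning to a label that does not exist yet adds a row in place
DF.loc[len(DF)] = ['Ben', 79, 89]
print(DF)

# Appending several rows; ignore_index=True renumbers the result 0..n-1
more = pd.DataFrame({'Name': ['Amy', 'Raj'],
                     'Maths': [91, 68],
                     'Science': [85, 74]})
combined = pd.concat([DF, more], ignore_index=True)
print(combined)
```

Using `len(DF)` as the new label only makes sense with the default integer index; with custom labels, pick a label that is not already in use.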
 | |
| 
 | |
| 
 | |
| ### Row deletion:
 | |
| 
 | |
| 
 | |
| ```python
 | |
| # drop() deletes rows by index label. It returns a new DataFrame, so assign the result back:
 | |
| DF = DF.drop(3)
 | |
| print(DF)
 | |
| ```
 | |
| 
 | |
|          Name  Maths  Science
 | |
|     0    Mark     89       78
 | |
|     1     Tom     87       88
 | |
|     2  Monika     83       66
 | |
|     4     Sam     77       88
 | |
|     5     Ben     79       89
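Keep in mind that `drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back. The sketch below shows this, along with dropping several labels and filtering rows by a condition; the variable names are assumptions chosen for the example.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika', 'Lilly', 'Sam'],
                   'Maths': [89, 87, 83, 78, 77],
                   'Science': [78, 88, 66, 0, 88]})

# The result must be captured; DF itself still contains label 3
without_lilly = DF.drop(3)
print(len(DF), len(without_lilly))

# Drop several rows at once by passing a list of labels
without_two = DF.drop([0, 4])
print(without_two)

# Remove rows by condition: keep only rows with a non-zero Science score
passed = DF[DF['Science'] > 0]
print(passed)
```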
 | |
| 
 | |
| 
 | |
| #### More Information:
 | |
| 
 | |
| [DataFrame](http://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html)
 |