Harikrishnan 4fbc0440a2 DF Operation, follow up article on DataFrame (#32621 )

* DF Operation, follow up article on DataFrame

Comprehensive article on Pandas DataFrame basic operation and methods.

* Update index.md

2019-02-20 11:35:06 -05:00

DataFrame Operations and Methods

Basic DataFrame functionality:

In the previous DataFrame section we had a brief look at the different ways to create a DataFrame and the manipulation operations associated with it. In this section we will go a little further and perform some more operations on a DataFrame.

A DataFrame is a powerful object that comes with many built-in methods to make computation easier. We will look at some of those methods and get familiar with them.

Let's create a score sheet DataFrame.

import numpy as np
import pandas as pd

ran = np.random.randint
scores = {'Name': ['Mark','Tom','Lilly','Ben','Monika'],
          'Maths': ran(70,80,5),
          'Science': ran(70,90,5), 
          'English': ran(70,90,5),
          'Computer': ran(70,90,5),
          'History': ran(70,90,5)}
result = pd.DataFrame(scores)
print(result)

     Name  Maths  Science  English  Computer  History
0    Mark     74       76       80        82       87
1     Tom     72       73       75        82       81
2   Lilly     75       81       77        76       82
3     Ben     73       81       85        78       74
4  Monika     79       81       85        82       74
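Because the scores are drawn with np.random.randint, every run produces different numbers. As a minimal sketch, assuming you want to reproduce the exact table printed above, the same DataFrame can be built from literal values (copied by hand from that output):

```python
import pandas as pd

# Literal copy of the printed score sheet, so results are reproducible.
scores = {
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
    'English': [80, 75, 77, 85, 85],
    'Computer': [82, 82, 76, 78, 82],
    'History': [87, 81, 82, 74, 74],
}
result = pd.DataFrame(scores)
print(result)
```

With this version, the outputs shown in the rest of the article should match what you see on your own machine.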

Now our DataFrame result is ready. Let's look at some of its attributes and methods:

head(): returns the top 5 records. If an argument is passed, it returns that many records from the top.

print(result.head(3))

    Name  Maths  Science  English  Computer  History
0   Mark     74       76       80        82       87
1    Tom     72       73       75        82       81
2  Lilly     75       81       77        76       82

tail(): returns the bottom 5 records. If an argument is passed, it returns that many records from the bottom.

print(result.tail(2))

     Name  Maths  Science  English  Computer  History
3     Ben     73       81       85        78       74
4  Monika     79       81       85        82       74

columns: Notice that we are not calling a method here, because columns is an attribute of the DataFrame, not a method. It stores all the column names as an Index object.

result.columns

Index(['Name', 'Maths', 'Science', 'English', 'Computer', 'History'], dtype='object')

index: Like columns, a DataFrame has an index attribute that stores all the row labels.

result.index

RangeIndex(start=0, stop=5, step=1)

axes: axes is another attribute, which stores both index and columns together. The returned object is a list, with the row index first and the column index second. Use whichever of these attributes suits your need.

result.axes

[RangeIndex(start=0, stop=5, step=1),
 Index(['Name', 'Maths', 'Science', 'English', 'Computer', 'History'], dtype='object')]

dtypes: An attribute that gives the data type of each column.

result.dtypes

Name        object
Maths        int64
Science      int64
English      int64
Computer     int64
History      int64
dtype: object

shape: shape is a tuple that stores the dimensions of the DataFrame as (rows, columns).

result.shape

(5, 6)

size: size returns the total number of elements in a DataFrame. Our result DataFrame has 5 rows and 6 columns, so its size is 5 * 6 = 30.

result.size

30
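The relationship between shape, size and len() can be checked directly. This is a small sketch on a reduced, hand-written copy of the score sheet (three columns only, an assumption made to keep the example short):

```python
import pandas as pd

# Reduced copy of the score sheet: 5 rows, 3 columns.
result = pd.DataFrame({
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
})

rows, cols = result.shape           # shape unpacks into (rows, columns)
print(rows, cols)                   # 5 3
print(len(result))                  # len() counts rows only: 5
print(result.size == rows * cols)   # True
```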

values: The values attribute is useful when you want the DataFrame as an ndarray for computation. It returns the whole DataFrame as a NumPy ndarray. Because our DataFrame mixes strings and integers, the resulting array has dtype object.

result.values

array([['Mark', 74, 76, 80, 82, 87],
       ['Tom', 72, 73, 75, 82, 81],
       ['Lilly', 75, 81, 77, 76, 82],
       ['Ben', 73, 81, 85, 78, 74],
       ['Monika', 79, 81, 85, 82, 74]], dtype=object)
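For numeric work it usually helps to leave out the string column first, so that the array gets a numeric dtype. The sketch below uses to_numpy(), which the pandas documentation recommends over values, on a hand-written copy of part of the score sheet:

```python
import pandas as pd

result = pd.DataFrame({
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
})

mixed = result.to_numpy()                          # strings + ints -> object array
numeric = result.drop(columns='Name').to_numpy()   # ints only -> integer array

print(mixed.dtype)            # object
print(numeric.shape)          # (5, 2)
print(numeric.mean(axis=0))   # NumPy column means of the numeric part
```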

empty: empty is True if the DataFrame has no elements, otherwise False.

result.empty

False
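A typical use of empty is checking whether a filter matched anything. The following is a small sketch with a hypothetical threshold of 90 on a reduced copy of the score sheet; it also shows that a DataFrame holding only missing values is not considered empty:

```python
import pandas as pd

result = pd.DataFrame({
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
})

# No student scored above 90, so the filtered DataFrame has zero rows.
high_scorers = result[result['Maths'] > 90]
print(high_scorers.empty)   # True
print(result.empty)         # False

# A row of NaN still counts as an element, so this is not empty.
only_nan = pd.DataFrame({'A': [float('nan')]})
print(only_nan.empty)       # False
```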

A pandas DataFrame also has a special method, info, which provides most of the details from the above attributes in a single call.

info(): Prints a summary of the whole DataFrame, including the type of each column and the number of non-null values. It is very handy for a quick glance at the DataFrame.

result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
Name        5 non-null object
Maths       5 non-null int64
Science     5 non-null int64
English     5 non-null int64
Computer    5 non-null int64
History     5 non-null int64
dtypes: int64(5), object(1)
memory usage: 320.0+ bytes

Statistical methods in pandas

Pandas is widely used for mathematical computation, and it has many built-in methods for that purpose. We will now have a brief look at some of the methods used in statistical computations. Most of these methods are available on both Series and DataFrame. Let's look at them one by one.

count(): As the name says, count returns the number of non-null values in each column of a DataFrame, or in a Series.

result.count()

Name        5
Maths       5
Science     5
English     5
Computer    5
History     5
dtype: int64
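Our score sheet has no missing values, so every count is 5. To see count() skipping nulls, here is a sketch on a small hypothetical DataFrame with one missing Maths score:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second student has no Maths score.
df = pd.DataFrame({
    'Maths': [74, np.nan, 75],
    'Science': [76, 73, 81],
})

per_column = df.count()        # non-null values in each column
per_row = df.count(axis=1)     # non-null values in each row

print(per_column)
print(per_row)
```

Here Maths is counted as 2 and Science as 3, while the row counts are 2, 1 and 2.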

sum(): sum adds up the values in each column, including string values. As you can see in the result below, the names are concatenated. This happens because Python overloads the + operator, so for strings it means concatenation.

result.sum()

Name        MarkTomLillyBenMonika
Maths                         373
Science                       392
English                       402
Computer                      400
History                       398
dtype: object

A natural question arises here: what if I want to know the total score of each student? We can get that with the axis argument, which all the statistical methods accept. It specifies the axis (rows or columns) along which the values are summed. Passing numeric_only=True leaves out the Name column, which recent pandas versions no longer skip automatically. Example below:

result.sum(axis=1, numeric_only=True)

0    399
1    383
2    391
3    391
4    401
dtype: int64
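The totals above are labelled only by row number. One way to get them labelled by student, sketched here under the assumption that the Name column is moved into the index so that only numeric columns remain:

```python
import pandas as pd

result = pd.DataFrame({
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
    'English': [80, 75, 77, 85, 85],
    'Computer': [82, 82, 76, 78, 82],
    'History': [87, 81, 82, 74, 74],
})

by_name = result.set_index('Name')   # names become row labels
totals = by_name.sum(axis=1)         # sum across the columns of each row
print(totals)
```

The same pattern works for the other statistical methods that accept axis.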

mean(): returns the mean of the values. Like sum(), it can be applied at column or row level. It only makes sense for numeric columns, so we pass numeric_only=True to leave out Name.

result.mean(numeric_only=True)

Maths       74.6
Science     78.4
English     80.4
Computer    80.0
History     79.6
dtype: float64

median(): Like the mean, the median is another important statistical measure. median() returns the median of the values in each numeric column.

result.median(numeric_only=True)

Maths       74.0
Science     81.0
English     80.0
Computer    82.0
History     81.0
dtype: float64

mode(): The mode is the value that occurs most often in a set of values. Every Maths score below occurs exactly once, so all five values are returned, in sorted order.

print(result['Maths'].mode())

0    72
1    73
2    74
3    75
4    79
dtype: int64
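The Maths column is not a very telling example, since no score repeats. As a sketch on hand-copied values from the score sheet, the Science column has a single most frequent value:

```python
import pandas as pd

maths = pd.Series([74, 72, 75, 73, 79], name='Maths')
science = pd.Series([76, 73, 81, 81, 81], name='Science')

science_mode = science.mode()   # 81 occurs three times
maths_mode = maths.mode()       # every value occurs once: all are returned

print(science_mode.tolist())    # [81]
print(maths_mode.tolist())      # [72, 73, 74, 75, 79]
```

Note that mode() always returns a Series, because a set of values can have more than one mode.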

std(): std, or standard deviation, tells us how widely the values are spread around the mean. Like mean and median, it applies only to numeric columns.

result.std(numeric_only=True)

Maths       2.701851
Science     3.714835
English     4.560702
Computer    2.828427
History     5.594640
dtype: float64
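By default pandas computes the sample standard deviation (ddof=1), which differs from NumPy's np.std default (ddof=0). The sketch below recomputes the Maths figure by hand from hand-copied scores to show where 2.701851 comes from:

```python
import numpy as np
import pandas as pd

maths = pd.Series([74, 72, 75, 73, 79], name='Maths')

sample_std = maths.std()              # pandas default: divide by n - 1
population_std = maths.std(ddof=0)    # divide by n, like np.std's default

# Manual sample standard deviation for comparison.
squared_dev = (maths - maths.mean()) ** 2
manual = np.sqrt(squared_dev.sum() / (len(maths) - 1))

print(round(sample_std, 6))   # 2.701851
print(np.isclose(manual, sample_std))
print(np.isclose(population_std, np.std(maths.to_numpy())))
```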

max(): As the name says, max returns the largest value from a set of values. Let's find the highest score in each subject. For the Name column, strings are compared alphabetically, which is why Tom is returned.

result.max()

Name        Tom
Maths        79
Science      81
English      85
Computer     82
History      87
dtype: object

min(): Similar to max, min returns the smallest value from a set of values. Now let's find the lowest score of each student.

result.min(axis=1, numeric_only=True)

0    74
1    72
2    75
3    73
4    74
dtype: int64

describe(): describe is a very handy method that gives a summary of the whole data set. It gathers the results of most of the above methods (count, mean, std, min, the quartiles and max) for each numeric column. It is very useful for getting an outline of the data we have.

print(result.describe())

           Maths    Science    English   Computer   History
count   5.000000   5.000000   5.000000   5.000000   5.00000
mean   74.600000  78.400000  80.400000  80.000000  79.60000
std     2.701851   3.714835   4.560702   2.828427   5.59464
min    72.000000  73.000000  75.000000  76.000000  74.00000
25%    73.000000  76.000000  77.000000  78.000000  74.00000
50%    74.000000  81.000000  80.000000  82.000000  81.00000
75%    75.000000  81.000000  85.000000  82.000000  82.00000
max    79.000000  81.000000  85.000000  82.000000  87.00000
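The object returned by describe() is itself a DataFrame, so individual statistics can be picked out with loc. A short sketch on a hand-written copy of the score sheet:

```python
import pandas as pd

result = pd.DataFrame({
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
    'English': [80, 75, 77, 85, 85],
    'Computer': [82, 82, 76, 78, 82],
    'History': [87, 81, 82, 74, 74],
})

summary = result.describe()            # numeric columns only by default

print(summary.loc['mean', 'Maths'])    # one statistic for one subject
print(summary.loc[['min', 'max']])     # lowest and highest score per subject
```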

Applying functions in Pandas

In this section we will see how to apply a function to a DataFrame. This is very useful when you have to modify the values of a column, add some weight to the whole DataFrame, and so on. There are two commonly used methods for applying a function to your dataset. They differ in how the function is applied:

apply(): the apply method applies a function row-wise or column-wise.
applymap(): the applymap method applies the function element by element.

Let's check them in detail.

apply() method:

The apply method is available on both DataFrame and Series objects. Its behaviour differs depending on the object.

apply() method in DataFrame:

When apply is called on a DataFrame, it splits the DataFrame into Series and passes each one to the given function. We can specify the axis to work either row-wise or column-wise; by default each column is passed. Example below:

result[['Maths','Science','English']].apply(np.mean)

Maths      74.6
Science    78.4
English    80.4
dtype: float64
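To illustrate the row-wise case, here is a sketch that passes each student's row to a small function. The function (highest score minus lowest score) is a hypothetical example, and Name is moved into the index so that each row contains only numbers:

```python
import pandas as pd

result = pd.DataFrame({
    'Name': ['Mark', 'Tom', 'Lilly', 'Ben', 'Monika'],
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
    'English': [80, 75, 77, 85, 85],
    'Computer': [82, 82, 76, 78, 82],
    'History': [87, 81, 82, 74, 74],
})
subjects = result.set_index('Name')

def score_range(values):
    """Difference between the highest and lowest score in a Series."""
    return values.max() - values.min()

per_student = subjects.apply(score_range, axis=1)   # one row at a time
per_subject = subjects.apply(score_range)           # one column at a time (default)

print(per_student)
print(per_subject)
```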

apply() method in Series:

When we use the apply method on a Series, each element is passed as an input to the function, so you can perform only element-wise operations. For instance, suppose you have to give everyone 2 extra marks in Maths. We can use the apply method on the Maths score Series. Example:

print('Before giving the extra score: \n{}'.format(result['Maths']))

new_score = result['Maths'].apply(lambda x: x+2)
print('\nAfter giving the extra scores: \n{}'.format(new_score))

Before giving the extra score: 
0    74
1    72
2    75
3    73
4    79
Name: Maths, dtype: int64

After giving the extra scores: 
0    76
1    74
2    77
3    75
4    81
Name: Maths, dtype: int64
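For simple arithmetic like this, plain vectorized operators give the same result without calling a Python function per element. A minimal sketch comparing the two approaches on hand-copied Maths scores:

```python
import pandas as pd

maths = pd.Series([74, 72, 75, 73, 79], name='Maths')

via_apply = maths.apply(lambda x: x + 2)   # function called once per element
vectorized = maths + 2                     # arithmetic on the whole Series

print(via_apply.tolist())                  # [76, 74, 77, 75, 81]
print((via_apply == vectorized).all())     # True
```

apply remains the more flexible choice when the per-element logic cannot be expressed with operators.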

applymap() method:

applymap is a DataFrame method. It is used when you have to apply a function to every element in your DataFrame. It passes the elements one by one to the given function. In pandas 2.1 and later the same method is available as DataFrame.map, and applymap is deprecated. Example: Let's add 2 extra marks to all the subjects.

sub = result.iloc[:,1:]       #taking only the subject values not the Names

print('Before giving the extra score: \n{}'.format(sub))

new_score = sub.applymap(lambda x: x+2)
print('\nAfter giving the extra scores: \n{}'.format(new_score))

Before giving the extra score: 
   Maths  Science  English  Computer  History
0     74       76       80        82       87
1     72       73       75        82       81
2     75       81       77        76       82
3     73       81       85        78       74
4     79       81       85        82       74

After giving the extra scores: 
   Maths  Science  English  Computer  History
0     76       78       82        84       89
1     74       75       77        84       83
2     77       83       79        78       84
3     75       83       87        80       76
4     81       83       87        84       76
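Since the element-wise method has been renamed across pandas versions, a sketch that runs on both old and new releases can check which name exists. The hasattr test is an assumption about how you might handle this, not an official recipe:

```python
import pandas as pd

sub = pd.DataFrame({
    'Maths': [74, 72, 75, 73, 79],
    'Science': [76, 73, 81, 81, 81],
})

def add_two(x):
    """Give two extra marks to a single score."""
    return x + 2

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
if hasattr(sub, 'map'):
    new_score = sub.map(add_two)
else:
    new_score = sub.applymap(add_two)

print(new_score)
```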

More Information:

DataFrame
