Data cleaning (#33923)
* Data cleaning
* Updated with back ticks
committed by Christopher McCormack
parent 617502cc67
commit 30b7481f1c
@@ -214,6 +214,11 @@ df['col1'].apply(len)
```python
del df['col1']
```
## Data Cleaning
Data cleaning is an important step in data analysis. A common first check is for missing values: `df.isnull()` (or, equivalently, `pd.isnull(df)`) returns a boolean object of the same shape as the data, with `True` for missing values and `False` for non-missing ones. To count the missing values in each column, run `df.isnull().sum()`. `df.notnull()` is the opposite of `df.isnull()`. Once you have located the missing values, you can either drop them, with `df.dropna()` to drop the rows that contain them or `df.dropna(axis=1)` to drop the columns, or fill them in: `df.fillna(x)` replaces every missing value with `x` (any value you choose), and `s.fillna(s.mean())` replaces the missing values in a Series `s` with its mean (`mean` can be swapped for almost any function from the statistics section).
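The calls described above can be tried on a small frame. The following is only a sketch: the sample data and the column names `age`, `city`, and `score` are made up for illustration and do not come from the guide.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Oslo", "Lima", None, "Pune"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Boolean mask with the same shape as df: True where a value is missing.
mask = df.isnull()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Drop every row, or every column, that contains a missing value.
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# Fill the missing values of one column with that column's mean.
age_filled = df["age"].fillna(df["age"].mean())

print(missing_per_column)
print(rows_dropped.shape, cols_dropped.shape)
print(age_filled.tolist())
```

With this data, `rows_dropped` keeps only the two complete rows and `cols_dropped` keeps only the `score` column, which shows how much information `dropna` can discard compared with `fillna`.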
Sometimes you need to replace values with different ones. For example, `s.replace(1, 'one')` replaces every value equal to 1 with 'one'. You can replace several values at once: `s.replace([1, 3], ['one', 'three'])` replaces every 1 with 'one' and every 3 with 'three'. You can also rename specific columns with `df.rename(columns={'old_name': 'new_name'})`, or use `df.set_index('column_one')` to make an existing column the index of the data frame.
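As a rough illustration of the replace and rename calls above, the sketch below uses a made-up Series and a made-up two-column frame; only the method calls themselves come from the text. By default each of these methods returns a new object and leaves the original untouched, so the result is assigned to a new name.

```python
import pandas as pd

# Hypothetical Series used only for illustration.
s = pd.Series([1, 2, 3, 1])

# Replace a single value, then several values at once.
single = s.replace(1, "one")
multiple = s.replace([1, 3], ["one", "three"])

# Hypothetical frame whose column names mirror the ones in the text.
df = pd.DataFrame({"old_name": [10, 20], "column_one": ["a", "b"]})

# Rename one column; columns not listed in the mapping stay as they are.
renamed = df.rename(columns={"old_name": "new_name"})

# Use an existing column as the index.
indexed = df.set_index("column_one")

print(single.tolist())
print(multiple.tolist())
print(list(renamed.columns))
print(indexed.index.tolist())
```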
## Checking for missing values
```df.isnull()```