Wrangling Data In A Nutshell
First, we analyze the data.
1. Import Python Packages
import pandas as pd
import numpy as np
Let's not forget to import NumPy and pandas.
2. Importing Data in Python
data = "<data link or path in .csv, .json, .xlsx>"
We need a link or path to our data; assigning it to a variable name is optional but convenient.
3. Data Frame
df = pd.read_csv(data)
Convert the data to a DataFrame and assign it a variable name. The file can be in a format other than CSV, so be aware and use the matching reader.
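As a minimal, self-contained sketch (the file name `sample.csv` and its contents are made up for illustration), reading a CSV into a DataFrame looks like this; other formats have their own readers:

```python
import pandas as pd

# Write a tiny sample CSV so this example runs on its own
with open("sample.csv", "w") as f:
    f.write("name,age,city\nAda,36,London\nAlan,41,Cambridge\n")

df = pd.read_csv("sample.csv")

# Other formats use dedicated readers:
# df = pd.read_json("data.json")
# df = pd.read_excel("data.xlsx")   # needs the openpyxl package

print(df.shape)  # (2, 3)
```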
4. .head() and .tail()
df.head()
df.tail()
.head() displays the first five rows of the DataFrame, while .tail() displays the last five.
5. .info()
df.info()
.info() shows the column names, the total number of entries, the count of non-null values, and the data type of each column.
6. .describe()
df.describe() / df.describe(include='object') / df.describe(include='all')
.describe() summarizes only the numeric columns unless told otherwise: .describe(include='object') covers the object columns, and .describe(include='all') covers every column.
7. .columns
df.columns
Look at the columns you are dealing with and decide which to rename or drop.
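A quick sketch of renaming and dropping columns (the column names here are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ada", "Alan"],
                   "AGE": [36, 41],
                   "notes": ["", ""]})

# Rename columns to a consistent style
df = df.rename(columns={"First Name": "first_name", "AGE": "age"})

# Drop a column we do not need
df = df.drop(columns=["notes"])

print(list(df.columns))  # ['first_name', 'age']
```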
8. .dtype and .astype( )
df.dtypes
.dtypes checks the data type of each column, and .astype() changes a column's data type.
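For example, a numeric column read in as strings can be converted with .astype() (the `price` and `qty` columns below are made up):

```python
import pandas as pd

df = pd.DataFrame({"price": ["10.5", "20.0"], "qty": [1.0, 2.0]})
print(df.dtypes)  # price is object, qty is float64

# Convert the string column to float and the float column to int
df["price"] = df["price"].astype(float)
df["qty"] = df["qty"].astype(int)
```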
9. df.isnull().sum()
# count the missing values in each column
df.isnull().sum()
# convert the result to a DataFrame using .to_frame()
This gives you every column together with its count of null, i.e. missing, values.
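A self-contained sketch of counting missing values per column (the small DataFrame is invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3],
                   "b": [np.nan, np.nan, "x"]})

# Count missing values in each column
missing = df.isnull().sum()
print(missing["a"], missing["b"])  # 1 2

# .to_frame() turns the Series into a one-column DataFrame for display
missing_df = missing.to_frame(name="missing_count")
```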
We clean the data by dealing with these missing values.
10. Dealing with missing data:
i. drop data
a. drop the whole row
b. drop the whole column
ii. replace data
a. replace it by mean
b. replace it by frequency
c. replace it based on other functions
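The dropping and replacing options above can be sketched as follows (the DataFrame and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "city": ["NY", "LA", None]})

# i.a drop any row that contains a missing value
rows_dropped = df.dropna()

# i.b drop a whole column
col_dropped = df.drop(columns=["city"])

# ii.a replace missing values in a numeric column with the mean
df["age"] = df["age"].fillna(df["age"].mean())
```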
11. Various ways to replace
a. When working with a categorical column, we can replace missing values with the most frequent category, using .value_counts() to find it and .replace() (or .fillna()) to fill it in.
b. When working with continuous numbers, we can take the mean, or replace missing values with -99999 so they stand out as outliers.
N.B.: the -99999 trick comes from sentdex's YouTube videos.
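A minimal sketch of both replacement strategies (the `color` and `height` columns are made-up example data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", None],
                   "height": [1.7, np.nan, 1.8, 1.6]})

# Categorical: fill with the most frequent category
most_common = df["color"].value_counts().idxmax()
df["color"] = df["color"].fillna(most_common)

# Continuous: flag missing entries with a sentinel so they
# show up as obvious outliers
df["height"] = df["height"].fillna(-99999)
```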
12. Check for Duplicate Rows
duplicated_df = df[df.duplicated()]
Here all duplicate rows except their first occurrence are returned, because the default value of the keep argument is 'first'.
If we want to select all duplicate rows except their last occurrence, we pass keep='last' instead:
duplicated_df = df[df.duplicated(keep='last')]
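A runnable sketch of spotting and then removing duplicates (the tiny DataFrame is invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "a", "b"]})

# Rows flagged as duplicates (first occurrence is not flagged by default)
dupes = df[df.duplicated()]
print(len(dupes))  # 1

# Remove duplicates, keeping the first occurrence of each row
deduped = df.drop_duplicates()
```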
Some other moves include standardization, normalization, binning, and converting object columns to numerical ones. We might also have to source data ourselves, so knowing how to merge different datasets properly, and how to scrape the web, will go a long way.
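Those three transformations can be sketched in a few lines (the `score` column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30, 40]})

# Standardization: zero mean, unit variance
df["score_std"] = (df["score"] - df["score"].mean()) / df["score"].std()

# Min-max normalization: scale values into [0, 1]
rng = df["score"].max() - df["score"].min()
df["score_norm"] = (df["score"] - df["score"].min()) / rng

# Binning: group continuous values into labeled ranges
df["score_bin"] = pd.cut(df["score"], bins=3, labels=["low", "mid", "high"])
```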
Well, this wasn't much of a nutshell, but you get the whole idea: cleaning data can be a lot of work, but knowing a few concepts makes it easier.
Note that every dataset is different, and no two people clean data in exactly the same way.