
Wrangling Data In A Nutshell

Onyinye Iloanugo

--

First, we analyze the data

1. Import Python Packages

import pandas as pd
import numpy as np

Let's not forget to import NumPy and pandas.

2. Importing Data in Python

data = "<data link or path in .csv, .json, .xlsx>"

We need a link or path to our data. Assigning it to a variable is optional.

3. Data Frame

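df = pd.read_csv(data)

Read the file into a data frame and assign it to a variable. The file can be in a format other than CSV, so be aware: read_csv only handles CSV, while pandas has separate readers such as pd.read_json and pd.read_excel for the other formats.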
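As a minimal sketch of this step, the snippet below reads a small CSV from an in-memory string instead of a real file, so it runs without any download. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file path or link
csv_text = "name,age,city\nAda,36,Lagos\nTuring,41,London\n"

# read_csv accepts a path, a URL, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# For other formats, the matching readers are pd.read_json and pd.read_excel
print(df.shape)
```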

4. .head( ) and .tail( )

df.head( )

df.tail( )

head displays the first five rows of the data frame, while tail displays the last five rows.
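Both methods also take a number if five rows is not what you want. A small sketch, assuming a made-up ten-row data frame:

```python
import pandas as pd

# Hypothetical data frame with ten rows, values 0 to 9
df = pd.DataFrame({"value": range(10)})

first_five = df.head()    # default: the first five rows
last_three = df.tail(3)   # pass a number to change how many rows come back

print(first_five)
print(last_three)
```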

5. .info( )

df.info( )

info shows the column names, the total number of entries, the count of non-null values, and the data type of each column.

6. .describe( )

df.describe( ) / df.describe(include='object') / df.describe(include='all')

By default, describe summarizes only the numeric (non-object) columns unless stated otherwise in the parentheses: .describe(include='all') covers every column, and .describe(include='object') covers only the object columns.
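Here is a rough sketch of the three calls side by side. The data frame is invented for illustration, and the text column is given an explicit object dtype because newer pandas versions may infer a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical data: one numeric column and one object (text) column
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "color": pd.Series(["red", "blue", "red"], dtype="object"),
})

numeric_summary = df.describe()                 # numeric columns only: count, mean, std, ...
object_summary = df.describe(include="object")  # count, unique, top, freq
full_summary = df.describe(include="all")       # every column; rows that do not apply are NaN

print(full_summary)
```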

7. .columns

df.columns

Look at the columns you are dealing with so you know which ones to rename or drop.
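Once you know which columns need attention, renaming and dropping could look like the sketch below. The column names are hypothetical, chosen to resemble a messy export.

```python
import pandas as pd

# Hypothetical messy column names
df = pd.DataFrame({"Unnamed: 0": [0, 1], "cust name": ["Ada", "Bola"], "amt": [5, 7]})

print(list(df.columns))

# rename takes a mapping of old name -> new name
df = df.rename(columns={"cust name": "customer_name", "amt": "amount"})

# drop removes the columns we do not need
df = df.drop(columns=["Unnamed: 0"])

print(list(df.columns))
```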

8. .dtypes and .astype( )

df.dtypes

dtypes shows the data type of each column, and astype changes the data type of a column.
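A short sketch of changing a data type, assuming a hypothetical quantity column that was read in as text:

```python
import pandas as pd

# Hypothetical data: quantity arrived as text instead of numbers
df = pd.DataFrame({"quantity": ["1", "2", "3"], "price": [1.5, 2.0, 3.25]})

print(df.dtypes)

# astype returns a new column, so assign it back to keep the change
df["quantity"] = df["quantity"].astype(int)

print(df.dtypes)
```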

9. df.isnull().sum()

# count the missing values in each column
df.isnull().sum()
# optionally convert the result to a data frame using .to_frame()
df.isnull().sum().to_frame()

This gives you every column alongside its count of null (i.e. missing) values.
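To see what that output looks like, here is a small sketch on invented data. The column name missing_count is just a label chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Lagos", "Abuja", None, "Enugu"],
})

# One count per column
missing = df.isnull().sum()

# Same counts as a one-column data frame, which is easier to sort or display
missing_df = missing.to_frame(name="missing_count")

print(missing_df)
```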

Next, we clean the data by dealing with missing values

10. Dealing with missing data:

i. Drop data
a. Drop the whole row
b. Drop the whole column

ii. Replace data
a. Replace it with the mean
b. Replace it with the most frequent value
c. Replace it based on other functions
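The two drop options above can be sketched with dropna, which is one way (not the only way) to do it in pandas. The data frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has one gap, "b" has none, "c" has two
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, 9.0],
})

# Drop the whole row: keep only rows with no missing values
rows_dropped = df.dropna(axis=0)

# Drop the whole column: keep only columns with no missing values
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

Dropping is simple but throws information away, so it tends to suit cases where only a small share of rows or columns is affected.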

11. Various ways to replace

a. When working with a categorical column, we can replace missing values with the most frequent category, which we find using .value_counts( ) and fill in using .replace( ).

b. When working with a continuous number, we can replace missing values with the mean, or with a sentinel such as -99999 so that they stand out as outliers.

n.b.: the -99999 trick was learnt from sentdex's YouTube videos.
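Both replacement ideas can be sketched as follows. Note that this sketch uses fillna, the pandas method meant for filling missing values, where the text mentions replace. The data is invented, and the text column is given an explicit object dtype for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
df = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red"], dtype="object"),
    "height": [150.0, np.nan, 170.0, 160.0],
})

# Categorical column: value_counts ignores NaN, idxmax picks the most frequent category
most_common = df["color"].value_counts().idxmax()
df["color"] = df["color"].fillna(most_common)

# Continuous column, option 1: fill with the mean (NaN is skipped when computing it)
height_mean_filled = df["height"].fillna(df["height"].mean())

# Continuous column, option 2: fill with a sentinel so the gap shows up as an outlier
height_sentinel = df["height"].fillna(-99999)

print(df["color"].tolist())
print(height_mean_filled.tolist())
print(height_sentinel.tolist())
```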

12. Check for Duplicate Rows

duplicated_df = df[df.duplicated( )]

Here, all duplicate rows except their first occurrence are returned, because the default value of the keep argument is 'first'.

If we want to select all duplicate rows except their last occurrence, we need to pass the keep argument as 'last':

duplicated_df = df[df.duplicated(keep='last')]
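To make the effect of keep concrete, here is a small sketch on made-up data, plus drop_duplicates for actually removing the repeats once you have inspected them.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical
df = pd.DataFrame({
    "name": ["Ada", "Bola", "Ada", "Chi"],
    "score": [90, 85, 90, 70],
})

dup_first = df[df.duplicated()]             # default keep='first': flags row 2
dup_last = df[df.duplicated(keep="last")]   # keep='last': flags row 0
all_dups = df[df.duplicated(keep=False)]    # keep=False: flags every copy, rows 0 and 2

# Remove the repeats, keeping the first occurrence of each row
deduped = df.drop_duplicates()

print(deduped)
```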

Some other moves include:

standardization, normalization, binning, and converting object columns to numerical columns. We might also have to source data ourselves, so knowing how to merge different data sets properly and how to scrape the web will go a long way.
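For a feel of the first four moves, here is a rough sketch in plain pandas. The data, the bin edges, and the labels are all assumptions made for this example. Z-scores, min-max scaling, and one-hot encoding are just one common choice for each move.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"age": [20, 30, 40, 50], "size": ["S", "M", "L", "M"]})

# Standardization (z-score): subtract the mean, divide by the standard deviation
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Normalization (min-max): rescale to the range 0 to 1
df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Binning: group a continuous column into labelled ranges (0, 30] and (30, 60]
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60], labels=["young", "older"])

# Object to numerical: one-hot encode the categories into indicator columns
dummies = pd.get_dummies(df["size"], prefix="size")

print(df)
print(dummies)
```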

Well, this wasn't much of a nutshell, but you get the idea: cleaning data can be a lot of work, but knowing a few concepts makes it easier.

Note that every data set is different, and not everybody cleans data the same way.


Onyinye Iloanugo

Data scientist with machine learning || Front-end developer. Learning every day, and will be writing more on what took me time to learn and the concepts that helped.