Victor Omondi
June 11, 2021

import pandas as pd
import numpy as np

Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, data wrangling reportedly costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling.
Hello and welcome to Data Wrangling in Python. Depending on how you accessed it, this is a notebook, presentation, or blog post used in the Data Science CIT club event. Here is how to work with it to access the data sets and the entire project:
- Github - clone the entire project; instructions are in the README.md file
- Blog - click on any of the banners shown at the top of the project
We will use two data sets:
- client status: shows the client status and loan status at the end of the month their status remained the same
- kenya_subcounties: shows the county name and sub counties

Before starting data wrangling, we first need to install the necessary libraries:
- Pandas: the popular library used for data analysis and data transformation
- numpy: a popular library for scientific computing
We must import these libraries first. By convention, pandas is imported with the alias pd and numpy with the alias np.
Some datasets have very long values or very many columns, and pandas usually truncates the display when there are many columns, so for viewing purposes we need to make sure pandas shows all of them.
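A minimal sketch of relaxing these display limits (both are standard pandas display options; None means "no limit"):

```python
import pandas as pd

# Tell pandas not to truncate the display of wide dataframes.
pd.set_option("display.max_columns", None)   # show every column
pd.set_option("display.max_colwidth", None)  # show full cell contents
```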


Data comes in many formats, but CSV files are among the most popular with data scientists, and we will use a CSV file for our data. In pandas, we use pd.read_csv to create a dataframe from a CSV file; you pass the file location (a path or URL) as a string as the first argument, e.g. pd.read_csv("file.csv")
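As a sketch of pd.read_csv (using an in-memory buffer with made-up rows in place of a file path or the GitHub URL, so it runs offline):

```python
import io

import pandas as pd

# pd.read_csv accepts a file path, a URL, or any file-like object.
# Here an in-memory buffer stands in for kenya_subcounty.csv.
csv_text = "name,subCounty\nMombasa,Changamwe\nMombasa,Jomvu\n"
counties = pd.read_csv(io.StringIO(csv_text))
print(counties.shape)  # (2, 2)
```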
Create a dataframe known as counties, the csv location is https://raw.githubusercontent.com/VictorOmondi1997/data_wrangling_cit/master/data/kenya_subcounty.csv

Before doing data wrangling or any data cleaning, it is important as a data scientist/analyst to explore your data. This will help you know what the data looks like and how many rows and columns it has. To check the first few rows we use df.head(n), where df is the dataframe name and n is the number of rows you want returned. df.tail and df.sample work the same way.
By default, without passing any argument, head() returns the first 5 rows.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 0 | South Rift AgriBiz | Ndunyu Njeru | 85119.0 | New | P+I Current | NaN | 2021-02-28 |
| 1 | South Rift Region | Nakuru | 2005.0 | Dormant | Penalties | 44.83 | 2018-09-30 |
| 2 | Eastern Region | Wote | 69865.0 | Dormant | Penalties | 50.65 | 2020-11-30 |
| 3 | Western Region | Bungoma | 55339.0 | Dormant | Penalties | 60.71 | 2020-07-31 |
| 4 | South Rift Region | Kericho | 38049.0 | Dormant | Penalties | 0.15 | 2019-06-30 |
Passing the number of rows you want
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 0 | South Rift AgriBiz | Ndunyu Njeru | 85119.0 | New | P+I Current | NaN | 2021-02-28 |
| 1 | South Rift Region | Nakuru | 2005.0 | Dormant | Penalties | 44.83 | 2018-09-30 |
| 2 | Eastern Region | Wote | 69865.0 | Dormant | Penalties | 50.65 | 2020-11-30 |
| 3 | Western Region | Bungoma | 55339.0 | Dormant | Penalties | 60.71 | 2020-07-31 |
| 4 | South Rift Region | Kericho | 38049.0 | Dormant | Penalties | 0.15 | 2019-06-30 |
| 5 | South Rift Region | Narok | 33391.0 | Dormant | Penalties | 0.94 | 2019-04-30 |
| 6 | North Rift Region | Iten | 53675.0 | Dormant | Penalties | 35.24 | 2020-05-31 |
| 7 | North Rift Region | Eldoret3 | 16893.0 | Active | Penalties | 26.21 | 2020-12-31 |
| 8 | Nairobi Region | Kiambu | 35681.0 | Dormant | Penalties | 46.78 | 2019-05-31 |
| 9 | North Rift Region | Kitale | 24807.0 | Dormant | Penalties | 36.94 | 2019-02-28 |
For the dataframe we created (counties) return the first 15 rows.
The tail method returns the last rows of a dataframe; by default it returns the last 5 rows, but you can specify the number of rows you want returned.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
| 40020 | Report Title: Client Status (Monthly) | NaN | NaN | NaN | NaN | NaN | NaN |
| 40021 | Generated By: Victor Omondi | NaN | NaN | NaN | NaN | NaN | NaN |
| 40022 | Generated Time: 2021-03-26 02:44:32 | NaN | NaN | NaN | NaN | NaN | NaN |
| 40023 | Report Parameters:Office: Head Office month: 2 year: 2021 | NaN | NaN | NaN | NaN | NaN | NaN |
As you can see, the last 4 rows contain metadata on when the data was generated. We will handle this situation in the Reshaping data part.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 40014 | South Rift AgriBiz | Maili Nne | 48513.0 | Active | Penalties | 56.17 | 2021-01-31 |
| 40015 | North Rift Region | Iten | 42221.0 | Dormant | Penalties | 44.50 | 2019-08-31 |
| 40016 | Western Region | Mbale | 54391.0 | Active | P+I In Default | 56.98 | 2021-02-28 |
| 40017 | Coast Region | Mariakani | 67944.0 | Dormant | P+I In Default | 0.15 | 2020-10-31 |
| 40018 | North Rift Region | Kitale | 6888.0 | Dormant | Penalties | 32.42 | 2020-11-30 |
| 40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
| 40020 | Report Title: Client Status (Monthly) | NaN | NaN | NaN | NaN | NaN | NaN |
| 40021 | Generated By: Victor Omondi | NaN | NaN | NaN | NaN | NaN | NaN |
| 40022 | Generated Time: 2021-03-26 02:44:32 | NaN | NaN | NaN | NaN | NaN | NaN |
| 40023 | Report Parameters:Office: Head Office month: 2 year: 2021 | NaN | NaN | NaN | NaN | NaN | NaN |
For the dataframe (counties) we created return the last 15 rows.
df.sample is mainly used for sampling; it returns a random sample of items from an axis of the object.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 24131 | North Rift Region | Nandi Hills | 39393.0 | Dormant | Penalties | 43.14 | 2020-01-31 |
| 2638 | Western Region | Bungoma | 71754.0 | Dormant | Penalties | 55.37 | 2021-02-28 |
| 3398 | North Rift Region | Eldoret1 | 32182.0 | Dormant | Penalties | 4.35 | 2019-04-30 |
| 24100 | Coast Region | Voi | 73956.0 | Dormant | Penalties | 40.49 | 2021-02-28 |
| 31920 | North Rift Region | Kapsabet | 6664.0 | Dormant | Penalties | 34.83 | 2019-03-31 |
For the dataframe (counties) we created return a sample of 10 rows
In pandas, there are various ways to find out how many columns (variables) and rows (observations) a dataframe has:
1. len(df) - returns the number of rows
2. df.shape - returns the number of rows and columns as a tuple
3. df.info() - prints a summary of the dataframe: the columns, their non-null counts, and the data type of each column
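The three options above can be sketched on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

print(len(df))    # 3 -> number of rows
print(df.shape)   # (3, 2) -> (rows, columns)
df.info()         # prints column names, non-null counts and dtypes
```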
For the dataframe (counties) we created, how many rows and columns does it have?
df.info prints a summary of the data frame, i.e. the number of rows and columns, each column's non-null count, and its dtype.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40024 entries, 0 to 40023
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Region 39944 non-null object
1 Office 40020 non-null object
2 Client ID 40020 non-null float64
3 Client Status 40020 non-null object
4 Loan Status 40020 non-null object
5 Client Score 36720 non-null float64
6 Status Date 40020 non-null object
dtypes: float64(2), object(5)
memory usage: 2.1+ MB
For the dataframe (counties) we created, how many missing values are there in the name column?
We use df.describe() to get summary statistics, by default it returns the summary statistics of the numerical columns.
| | Client ID | Client Score |
|---|---|---|
| count | 40020.000000 | 36720.000000 |
| mean | 43747.996652 | 35.593555 |
| std | 24860.118889 | 20.538354 |
| min | 2.000000 | 0.150000 |
| 25% | 22291.750000 | 21.887500 |
| 50% | 44105.000000 | 36.700000 |
| 75% | 65170.500000 | 49.822500 |
| max | 86654.000000 | 92.540000 |
To show all summary statistics, including those for objects (strings), you can pass include="all" to df.describe.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| count | 39944 | 40020 | 40020.000000 | 40020 | 40020 | 36720.000000 | 40020 |
| unique | 12 | 57 | NaN | 4 | 4 | NaN | 67 |
| top | Nairobi Region | Embakasi | NaN | Dormant | Penalties | NaN | 2021-02-28 |
| freq | 9860 | 1561 | NaN | 30535 | 28071 | NaN | 8430 |
| mean | NaN | NaN | 43747.996652 | NaN | NaN | 35.593555 | NaN |
| std | NaN | NaN | 24860.118889 | NaN | NaN | 20.538354 | NaN |
| min | NaN | NaN | 2.000000 | NaN | NaN | 0.150000 | NaN |
| 25% | NaN | NaN | 22291.750000 | NaN | NaN | 21.887500 | NaN |
| 50% | NaN | NaN | 44105.000000 | NaN | NaN | 36.700000 | NaN |
| 75% | NaN | NaN | 65170.500000 | NaN | NaN | 49.822500 | NaN |
| max | NaN | NaN | 86654.000000 | NaN | NaN | 92.540000 | NaN |
For the dataframe (counties) we created, how many unique sub counties are there in Kenya?

From the tail we looked at above, we saw that the last 4 rows contain metadata about how the data was generated. That metadata is not important for our analysis, so we will remove it first.
Let’s check some of the irrelevant data.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
| 40020 | Report Title: Client Status (Monthly) | NaN | NaN | NaN | NaN | NaN | NaN |
| 40021 | Generated By: Victor Omondi | NaN | NaN | NaN | NaN | NaN | NaN |
| 40022 | Generated Time: 2021-03-26 02:44:32 | NaN | NaN | NaN | NaN | NaN | NaN |
| 40023 | Report Parameters:Office: Head Office month: 2 year: 2021 | NaN | NaN | NaN | NaN | NaN | NaN |
We can drop these last 4 rows by using df.drop and passing the index argument a list of the indices we want to drop.
If you want to drop columns, use df.drop(columns=cols) where cols is a list of column names you want to drop.
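Both forms of drop can be sketched on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": range(6), "b": range(6)})

trimmed = df.drop(index=[4, 5])   # drop rows by index label
no_b = df.drop(columns=["b"])     # drop a column by name

print(trimmed.shape)  # (4, 2)
print(no_b.shape)     # (6, 1)
```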
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 40015 | North Rift Region | Iten | 42221.0 | Dormant | Penalties | 44.50 | 2019-08-31 |
| 40016 | Western Region | Mbale | 54391.0 | Active | P+I In Default | 56.98 | 2021-02-28 |
| 40017 | Coast Region | Mariakani | 67944.0 | Dormant | P+I In Default | 0.15 | 2020-10-31 |
| 40018 | North Rift Region | Kitale | 6888.0 | Dormant | Penalties | 32.42 | 2020-11-30 |
| 40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
We can also drop in place, with no need to assign the result to a variable, by setting the inplace argument to True.
| | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 40015 | North Rift Region | Iten | 42221.0 | Dormant | Penalties | 44.50 | 2019-08-31 |
| 40016 | Western Region | Mbale | 54391.0 | Active | P+I In Default | 56.98 | 2021-02-28 |
| 40017 | Coast Region | Mariakani | 67944.0 | Dormant | P+I In Default | 0.15 | 2020-10-31 |
| 40018 | North Rift Region | Kitale | 6888.0 | Dormant | Penalties | 32.42 | 2020-11-30 |
| 40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
For our data frame, the Client ID column uniquely identifies rows. Since we won't be doing analysis on this column, we can set Client ID as the index of our dataframe.
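Setting a column as the index can be sketched with a couple of made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Client ID": [85119, 2005],
    "Region": ["South Rift AgriBiz", "South Rift Region"],
})

status = df.set_index("Client ID")  # Client ID becomes the index
print(status.index.name)            # Client ID
print(list(status.columns))         # ['Region']
```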
| | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|
| Client ID | ||||||
| 85119.0 | South Rift AgriBiz | Ndunyu Njeru | New | P+I Current | NaN | 2021-02-28 |
| 2005.0 | South Rift Region | Nakuru | Dormant | Penalties | 44.83 | 2018-09-30 |
| 69865.0 | Eastern Region | Wote | Dormant | Penalties | 50.65 | 2020-11-30 |
| 55339.0 | Western Region | Bungoma | Dormant | Penalties | 60.71 | 2020-07-31 |
| 38049.0 | South Rift Region | Kericho | Dormant | Penalties | 0.15 | 2019-06-30 |
Sorting is arranging the values or index of the data in either ascending or descending order. To sort by index we use df.sort_index; to sort by column values we use df.sort_values(cols), where cols is a column name or a list of column names whose values we want to sort.
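Both kinds of sorting can be sketched on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 1, 2]}, index=[10, 30, 20])

print(df.sort_index().index.tolist())             # [10, 20, 30]
print(df.sort_values("score")["score"].tolist())  # [1, 2, 3]
```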
| | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|
| Client ID | ||||||
| 2.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 37.98 | 2020-04-30 |
| 7.0 | NaN | Head Office | Inactive | Pending Renewal | NaN | 2015-02-28 |
| 10.0 | South Rift Region | Nakuru | Active | Penalties | 42.19 | 2020-12-31 |
| 12.0 | South Rift Region | Nakuru | Active | P+I Current | 65.34 | 2021-02-28 |
| 13.0 | South Rift Region | Nakuru | Dormant | Pending Renewal | 57.59 | 2016-02-29 |
| | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|
| Client ID | ||||||
| 7.0 | NaN | Head Office | Inactive | Pending Renewal | NaN | 2015-02-28 |
| 2598.0 | Nairobi Region | Rongai | Inactive | Pending Renewal | NaN | 2015-02-28 |
| 1201.0 | Nairobi Region | Ngong Road | Inactive | Pending Renewal | NaN | 2015-02-28 |
| 2859.0 | South Rift Region | Naivasha | Inactive | Pending Renewal | NaN | 2015-06-30 |
| 1058.0 | Nairobi Region | Kahawa | Dormant | Pending Renewal | NaN | 2015-10-31 |
By default, sorting is in ascending order. We can also sort in descending order by passing the ascending=False argument.
| | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|
| Client ID | ||||||
| 28468.0 | North Rift Region | Kapenguria | Active | P+I Current | 92.54 | 2021-02-28 |
| 74947.0 | South Rift AgriBiz | Ndunyu Njeru | Active | P+I Current | 92.17 | 2021-02-28 |
| 64738.0 | Eastern Region | Mwingi | Active | P+I Current | 92.02 | 2021-02-28 |
| 31851.0 | Nairobi Region | Rongai | Active | P+I Current | 91.34 | 2021-02-28 |
| 49324.0 | South Rift AgriBiz | Maili Nne | Active | P+I Current | 91.21 | 2021-02-28 |
| ... | ... | ... | ... | ... | ... | ... |
| 86301.0 | Nairobi Region | Ngong Road | New | P+I Current | NaN | 2021-02-28 |
| 86303.0 | Nairobi Region | Kahawa | New | Pending Renewal | NaN | 2021-02-28 |
| 86308.0 | Nairobi Region | Kahawa | New | P+I Current | NaN | 2021-02-28 |
| 86313.0 | North Rift Region | Eldoret2 | New | Pending Renewal | NaN | 2021-02-28 |
| 86654.0 | Western Region | Bondo | New | Pending Renewal | NaN | 2021-02-28 |
40020 rows × 6 columns
For the dataframe (counties) we created, sort the data frame as follows: 1. name: descending 2. subCounty: ascending
Hint: you can use ascending=[False, True].
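Mixed-direction sorting can be sketched with hypothetical name/subCounty values:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["B", "A", "B"],
    "subCounty": ["z", "y", "a"],
})

# name descending, subCounty ascending within each name
out = df.sort_values(["name", "subCounty"], ascending=[False, True])
print(out["subCounty"].tolist())  # ['a', 'z', 'y']
```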
Duplicates are rows that are identical to each other, i.e. they have the same values. To check for duplicates we use df.duplicated, which returns a boolean series; we can use that series to filter out the repeated rows, and we can use df.drop_duplicates to remove any duplicates from the data frame.
By default, the first occurrence won't be removed.
To check for duplicates, we will first reset the index; this makes Client ID a column again, as it was before.
0 False
1 False
2 False
3 False
4 False
...
40015 False
40016 False
40017 False
40018 False
40019 False
Length: 40020, dtype: bool
| | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 5051 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
| 5052 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
| 5053 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
| 5054 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
| 9624 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
| 9625 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
| 9626 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
| 9627 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
| 11194 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
| 11195 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
| 11196 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
| 11197 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
| 22397 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
| 22398 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
| 22399 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
| 22400 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
| 35401 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
| 35402 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
| 35403 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
| 35404 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
For the dataframe (counties), are there any duplicates?
df.drop_duplicates removes duplicates, keeping the first occurrence by default.
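Both calls can be sketched on a toy dataframe with one duplicated row:

```python
import pandas as pd

df = pd.DataFrame({"id": [11331, 11331, 21440],
                   "score": [23.84, 23.84, 16.01]})

print(df.duplicated().tolist())   # [False, True, False]
deduped = df.drop_duplicates()    # keeps the first occurrence by default
print(len(deduped))               # 2
```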
Handling missing values is a major part of data cleaning. In this session we won't go deeper into how to handle them. In pandas, to check for missing values we use df.isnull() or df.isna, and to remove missing values we use df.dropna.
| | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | True | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 40015 | False | False | False | False | False | True | False |
| 40016 | False | False | False | False | False | True | False |
| 40017 | False | False | False | False | False | True | False |
| 40018 | False | False | False | False | False | True | False |
| 40019 | False | False | False | False | False | True | False |
40000 rows × 7 columns
That returns a dataframe of booleans showing whether each value is null.
Client ID 0
Region 80
Office 0
Client Status 0
Loan Status 0
Client Score 3300
Status Date 0
dtype: int64
df.isnull().sum() is the most widely used way of checking the number of missing values per column.
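The pattern can be sketched on a toy dataframe with one missing value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Region": ["Nairobi Region", None],
    "Client Score": [44.83, np.nan],
})

# isnull() marks missing cells; sum() counts them per column
print(df.isnull().sum())  # Region 1, Client Score 1
```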

As a data scientist it is good to check the counts of unique values; this is mainly used to check for imbalance at later stages. To check the counts of unique values, we use df[col].value_counts(), where col is the column name.
df[col] selects a column and returns a series, where col is the column name.
Nairobi Region 9856
North Rift Region 8506
South Rift Region 6781
Coast Region 4661
South Rift AgriBiz 3848
Western Region 3360
Eastern Region 2252
Central AgriBiz 656
Name: Region, dtype: int64
By default, .value_counts drops missing values; we can include their count by setting dropna=False.
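The difference can be sketched on a toy series containing a missing value:

```python
import pandas as pd

s = pd.Series(["Nairobi Region", "Nairobi Region", None])

print(s.value_counts())              # NaN excluded: one entry
print(s.value_counts(dropna=False))  # NaN counted as well: two entries
```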
Nairobi Region 9856
North Rift Region 8506
South Rift Region 6781
Coast Region 4661
South Rift AgriBiz 3848
Western Region 3360
Eastern Region 2252
Central AgriBiz 656
NaN 80
Name: Region, dtype: int64
For the dataframe (counties) Show the value counts of the name variable
In pandas:
- we can use comparison operators to filter data meeting a certain condition
- to filter columns we can use df[[col1, ..., coln]]
- to filter rows based on their index we can use iloc[i] or loc[label] for integer-based or label-based indexing respectively
| | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 1 | 7.0 | NaN | Head Office | Inactive | Pending Renewal | NaN | 2015-02-28 |
| 1993 | 4700.0 | NaN | Head Office | Inactive | Pending Renewal | 15.15 | 2020-09-30 |
| 2156 | 5064.0 | NaN | Fraud | Dormant | Penalties | 23.38 | 2020-08-31 |
| 2279 | 5322.0 | NaN | Fraud | Dormant | Penalties | 23.96 | 2020-08-31 |
| 2407 | 5621.0 | NaN | Fraud | Dormant | Penalties | 13.77 | 2020-08-31 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 38004 | 82121.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
| 38100 | 82284.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
| 38227 | 82560.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
| 38296 | 82709.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
| 38503 | 83150.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
80 rows × 7 columns
We can use comparison operators to filter values that meet a certain condition:
- > - greater than
- < - less than
- >= - greater than or equal to
- <= - less than or equal to
- == - equal to
- != - not equal to
0 False
1 False
2 False
3 False
4 False
...
40015 False
40016 False
40017 False
40018 False
40019 False
Name: Client Score, Length: 40000, dtype: bool
df[col] > value returns a boolean series; we can use that series to filter out rows of the dataframe that do not meet the condition.
| | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date |
|---|---|---|---|---|---|---|---|
| 2641 | 6150.0 | South Rift Region | Kericho | Active | Pending Renewal | 90.59 | 2021-02-28 |
| 8535 | 19100.0 | South Rift Region | Litein | Active | P+I Current | 90.85 | 2021-02-28 |
| 12657 | 27981.0 | South Rift AgriBiz | Ndunyu Njeru | Active | Pending Renewal | 90.58 | 2021-02-28 |
| 12876 | 28468.0 | North Rift Region | Kapenguria | Active | P+I Current | 92.54 | 2021-02-28 |
| 14405 | 31851.0 | Nairobi Region | Rongai | Active | P+I Current | 91.34 | 2021-02-28 |
| 22443 | 49324.0 | South Rift AgriBiz | Maili Nne | Active | P+I Current | 91.21 | 2021-02-28 |
| 26438 | 57810.0 | South Rift AgriBiz | Maili Nne | Active | P+I In Default | 90.92 | 2021-02-28 |
| 29804 | 64738.0 | Eastern Region | Mwingi | Active | P+I Current | 92.02 | 2021-02-28 |
| 30564 | 66344.0 | Central AgriBiz | Githunguri | Active | P+I Current | 90.33 | 2021-02-28 |
| 32576 | 70685.0 | South Rift AgriBiz | Londiani | Active | P+I Current | 90.77 | 2021-02-28 |
| 34593 | 74947.0 | South Rift AgriBiz | Ndunyu Njeru | Active | P+I Current | 92.17 | 2021-02-28 |
| 35239 | 76301.0 | Nairobi Region | Kiambu | Active | P+I Current | 90.04 | 2021-02-28 |
We can use logical operators to combine two or more conditions. In pandas we can use the following operators:
- & for logical and
- | for logical or
- ~ for logical not

Parentheses are very important: enclose each expression in parentheses (...).
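Combining conditions can be sketched on a toy dataframe (made-up values, not the status data):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["Nairobi Region", "Coast Region", "Nairobi Region"],
    "Client Score": [60.0, 70.0, 20.0],
})

# parentheses around each condition are required
mask = (df["Region"] == "Nairobi Region") & (df["Client Score"] > 50)
print(df[mask]["Client Score"].tolist())  # [60.0]
```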
For the dataframe (counties) Filter the data to select dataset where name is equal to 'Nakuru' & subCounty is not equal to 'Naivasha'
Grouping data is a critical step in data wrangling. In pandas we can group a DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2757be5a00>
groupby returns an object; we can use that object to perform operations on the groups.
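Split-apply-combine can be sketched with made-up region scores:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["Eastern Region", "Eastern Region", "Western Region"],
    "Client Score": [10.0, 20.0, 30.0],
})

by_region = df.groupby("Region")

print(by_region["Client Score"].sum())   # Eastern 30.0, Western 30.0
print(by_region["Client Score"].mean())  # Eastern 15.0, Western 30.0
```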
| | Client ID | Client Score |
|---|---|---|
| Region | ||
| Central AgriBiz | 36163955.0 | 20355.84 |
| Coast Region | 223714866.0 | 157343.10 |
| Eastern Region | 151738957.0 | 61644.78 |
| Nairobi Region | 306914995.0 | 309115.63 |
| North Rift Region | 326377934.0 | 270228.72 |
| South Rift AgriBiz | 201143618.0 | 168818.01 |
| South Rift Region | 270779301.0 | 211088.41 |
| Western Region | 229155555.0 | 107269.71 |
| | Client Score |
|---|---|
| Region | |
| Central AgriBiz | 20355.84 |
| Coast Region | 157343.10 |
| Eastern Region | 61644.78 |
| Nairobi Region | 309115.63 |
| North Rift Region | 270228.72 |
| South Rift AgriBiz | 168818.01 |
| South Rift Region | 211088.41 |
| Western Region | 107269.71 |
| | Client ID | Client Score |
|---|---|---|
| Region | ||
| Central AgriBiz | 55127.980183 | 49.527591 |
| Coast Region | 47997.182150 | 35.792334 |
| Eastern Region | 67379.643428 | 31.244187 |
| Nairobi Region | 31139.914265 | 32.839226 |
| North Rift Region | 38370.319069 | 34.059582 |
| South Rift AgriBiz | 52272.250000 | 53.729475 |
| South Rift Region | 39932.060316 | 33.389499 |
| Western Region | 68201.058036 | 35.205025 |
To check for the size of each group, we can use size() method of a groupby object.
Region
Central AgriBiz 656
Coast Region 4661
Eastern Region 2252
Nairobi Region 9856
North Rift Region 8506
South Rift AgriBiz 3848
South Rift Region 6781
Western Region 3360
dtype: int64
Region
Central AgriBiz 2
Coast Region 8
Eastern Region 7
Nairobi Region 8
North Rift Region 9
South Rift AgriBiz 6
South Rift Region 8
Western Region 7
Name: Office, dtype: int64
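The group sizes above, and the distinct-office counts, can be sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["Eastern Region", "Eastern Region", "Western Region"],
    "Office": ["Wote", "Matuu", "Mbale"],
})

g = df.groupby("Region")

print(g.size())               # number of rows per region
print(g["Office"].nunique())  # number of distinct offices per region
```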
Pivot table (derived from excel) is an advanced pandas grouping method. Here you can decide on the index, columns, values and the aggregate functions. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
The aggfunc is mean by default. You can pass aggregates as strings, e.g. 'mean', 'sum', etc., or aggregate functions such as np.mean, np.sum.
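A pivot table can be sketched with made-up region/office scores:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Office": ["a", "b", "a", "b"],
    "Client Score": [10.0, 20.0, 30.0, 40.0],
})

# offices as rows, regions as columns, mean score in the cells
pt = pd.pivot_table(df, index="Office", columns="Region",
                    values="Client Score", aggfunc="mean")
print(pt.loc["a", "East"])  # 10.0
```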
| Region | Central AgriBiz | Coast Region | Eastern Region | Nairobi Region | North Rift Region | South Rift AgriBiz | South Rift Region | Western Region |
|---|---|---|---|---|---|---|---|---|
| Office | ||||||||
| Bomet | NaN | NaN | NaN | NaN | NaN | NaN | 33.836232 | NaN |
| Bondo | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 54.472500 |
| Bungoma | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32.724249 |
| Busia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 36.784428 |
| Changamwe | NaN | 36.532172 | NaN | NaN | NaN | NaN | NaN | NaN |
Setting margins=True adds a row aggregate and a column aggregate (the All row and column).
| Region | Central AgriBiz | Coast Region | Eastern Region | Nairobi Region | North Rift Region | South Rift AgriBiz | South Rift Region | Western Region | All |
|---|---|---|---|---|---|---|---|---|---|
| Status Date | |||||||||
| 2016-01-31 | NaN | NaN | NaN | 44.733333 | NaN | NaN | NaN | NaN | 44.733333 |
| 2016-02-29 | NaN | NaN | NaN | 35.430541 | NaN | NaN | 37.207143 | NaN | 35.713182 |
| 2016-03-31 | NaN | NaN | NaN | 34.614533 | NaN | NaN | 33.599565 | NaN | 34.376327 |
| 2016-04-30 | NaN | NaN | NaN | 30.816381 | NaN | NaN | 33.833704 | NaN | 31.433561 |
| 2016-05-31 | NaN | NaN | NaN | 37.960885 | NaN | NaN | 41.225000 | NaN | 38.681241 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-11-30 | 47.642000 | 34.752153 | 31.791702 | 34.263287 | 38.249737 | 50.445581 | 33.603857 | 32.730902 | 34.959187 |
| 2020-12-31 | 43.470714 | 37.190167 | 30.627154 | 35.245165 | 39.696990 | 49.850078 | 34.649405 | 38.350320 | 37.861864 |
| 2021-01-31 | 50.570377 | 36.696154 | 34.157852 | 38.617857 | 41.534072 | 54.294533 | 36.930952 | 35.335667 | 41.915679 |
| 2021-02-28 | 55.504430 | 48.855651 | 41.270897 | 47.413198 | 49.778757 | 56.811797 | 47.444950 | 45.280237 | 49.599350 |
| All | 49.527591 | 35.792334 | 31.244187 | 32.839226 | 34.059582 | 53.729475 | 33.389499 | 35.205025 | 35.642344 |
63 rows × 9 columns
There are various ways of combining data sets:
- append - add rows to the end of the caller
- concatenate
- merge
  - inner join
  - left join
  - right join
Type of merge to be performed:
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- cross: creates the cartesian product from both frames, preserves the order of the left keys.
We use df.merge/pd.merge to import columns from other datasets.
| | name | subCounty |
|---|---|---|
| 0 | Mombasa | Changamwe |
| 1 | Mombasa | Jomvu |
| 2 | Mombasa | Kisauni |
| 3 | Mombasa | Likoni |
| 4 | Mombasa | Mvita |
We can merge status and counties using Office and subCounty (because that is the key present in both data frames). Since the column names differ between the two data frames, we have to use left_on and right_on to specify the columns.
If the key column has the same name in both data frames, we can use the on argument instead.
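The left_on/right_on merge can be sketched with a couple of made-up rows:

```python
import pandas as pd

status = pd.DataFrame({"Client ID": [33.0, 34.0],
                       "Office": ["Naivasha", "Naivasha"]})
counties = pd.DataFrame({"name": ["Nakuru"],
                         "subCounty": ["Naivasha"]})

# join on Office (left) against subCounty (right)
combined = status.merge(counties, how="left",
                        left_on="Office", right_on="subCounty")
print(combined.shape)             # (2, 4)
print(combined["name"].tolist())  # ['Nakuru', 'Nakuru']
```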
| | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date | name | subCounty |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.0 | South Rift Region | Naivasha | Dormant | Penalties | 40.00 | 2020-06-30 | Nakuru | Naivasha |
| 1 | 34.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 40.23 | 2016-03-31 | Nakuru | Naivasha |
| 2 | 206.0 | South Rift Region | Naivasha | Active | P+I Current | 76.87 | 2021-02-28 | Nakuru | Naivasha |
| 3 | 615.0 | South Rift Region | Naivasha | Active | P+I Current | 58.43 | 2021-02-28 | Nakuru | Naivasha |
| 4 | 702.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 43.73 | 2016-08-31 | Nakuru | Naivasha |
Another method is using pd.merge.
| | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date | name | subCounty |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.0 | South Rift Region | Naivasha | Dormant | Penalties | 40.00 | 2020-06-30 | Nakuru | Naivasha |
| 1 | 34.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 40.23 | 2016-03-31 | Nakuru | Naivasha |
| 2 | 206.0 | South Rift Region | Naivasha | Active | P+I Current | 76.87 | 2021-02-28 | Nakuru | Naivasha |
| 3 | 615.0 | South Rift Region | Naivasha | Active | P+I Current | 58.43 | 2021-02-28 | Nakuru | Naivasha |
| 4 | 702.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 43.73 | 2016-08-31 | Nakuru | Naivasha |
For the dataframe (counties) we created, do a full outer join with the status dataframe. How many rows does the result have?
After finishing data wrangling, it is good to export the data to the right file format for further uses. Let us do some final tweaks and export the final dataset as a csv.
# We will use the newly combined dataset
# 1. Rename columns and remove unwanted columns
combined_2.rename(columns={'name':'county'}, inplace=True)
combined_2.drop(columns=['subCounty'], inplace=True)
combined_2.dropna(inplace=True)
combined_3 = pd.pivot_table(combined_2, index='county', columns='Loan Status', values='Client Score', aggfunc=np.median)
display(combined_3.head())
combined_3.to_csv('combined.csv', index=False)

| Loan Status | P+I Current | P+I In Default | Penalties | Pending Renewal |
|---|---|---|---|---|
| county | ||||
| Busia | 64.19 | 47.110 | 32.130 | 41.91 |
| Kericho | 64.92 | 61.170 | 45.910 | 62.42 |
| Kiambu | 60.76 | 48.570 | 32.665 | 50.66 |
| Kilifi | 67.01 | 54.220 | 36.005 | 52.41 |
| Laikipia | NaN | 44.435 | 26.275 | 52.40 |

Thank you for attending today's session.