
Encouraging Data Literacy - Data Cleaning

23 May 2018

This post is designed to improve data literacy by encouraging good data cleaning habits.

We will walk through a data cleaning process using data scraped from a public website.

Why is data cleaning important?

Clean datasets allow our data processing and visualisation tools to function properly, with improved performance and fewer errors.

So if you want to perform analysis on a dataset, you will find it much easier to work with a clean dataset than with one found in the “wild”.

The problem is that most datasets found on public websites, or even in organisations’ internal databases, consist of dirty data.

Some examples of dirty data might include:

- column headers that are missing or not at the top of the file
- date/time values stored as plain text
- empty or meaninglessly named columns

So before we can analyse or visualise our data properly, we first need to resolve these issues.

Let’s start by having a look at a public dataset found on GrantsConnect.

This website has a page which provides a handy list of downloadable Excel files per time period - thanks GrantsConnect!

We will be using the Pandas library to access and manipulate the dataset, so let’s pick one of the downloads from the list, copy its URL, and open it in Pandas.


import pandas as pd

# URL of the Excel file copied from the GrantsConnect downloads page
url = 'http://grants.gov.au/xxxxx77w3485w8tsgDownloadURLgoeshere'

# Read the Excel file into a DataFrame and display the first 5 rows
test = pd.read_excel(url)
test.head()

The above commands attempt to read the Excel file located at the address in the “url” variable, which we copied from the GrantsConnect website, and then display the first 5 rows of the file.

 

[Image: Raw dataset]

 

We can see that something does not look exactly right - for example, the column headers are labelled “Unnamed” instead of having meaningful names.

Looking at the values in the rows, it seems the column headers are not at the top of the file. In fact, they are on the 3rd row.

The first 2 rows of this file don’t seem to contain anything valuable, so we can tell Pandas to skip them and re-load the file.


# Skip the first 2 rows so the 3rd row becomes the column headers
test2 = pd.read_excel(url, skiprows=2)
test2.head()

 

[Image: Raw dataset with headers]

 

Now we can see that our column headers seem to be in the right place. Great!

The dataset is looking cleaner already, but let’s keep exploring because there may be other areas of the file which also need cleaning.

We can see some date/time columns such as “Approval Date” and “Start Date”, and it’s important that these columns are identified as date/times by Pandas so that it can handle them correctly.

Let’s have a look. The test2.dtypes command lists the columns and their datatypes from our DataFrame:
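test2.dtypes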

 

[Image: Raw dataset without datatypes]

 

Interestingly, the “Approval Date” column is set to an “object” datatype instead of a date/time datatype.

In fact, all of the date/time columns are set to the “object” datatype; for the sake of simplicity, let’s assume “object” means “text”.

We’ve found the second issue in our dirty data file that needs to be cleaned: date/time columns stored as text. Let’s re-load the data file and specify which columns should be parsed as dates.


# Tell Pandas which columns to parse as date/times while loading
test3 = pd.read_excel(url, skiprows=2, parse_dates=["Approval Date", "Variation Date", "Published Date/Time", "Start Date", "End Date"])
test3.dtypes

Great, now our date/time columns have changed from the “object” datatype to the “datetime” datatype!

 

[Image: Raw dataset with datatypes]
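As an aside, if you have already loaded a file without parse_dates, you can convert columns after the fact with pd.to_datetime. A minimal sketch, re-using our earlier test2 DataFrame:

# Convert an already-loaded text column into a datetime column
test2["Approval Date"] = pd.to_datetime(test2["Approval Date"])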

 

Having these columns correctly datatyped ensures we can perform calculations on the date/time values, such as sorting, filtering by date range, or computing durations.
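For example, subtracting one datetime column from another gives the duration of each grant. A quick sketch using the “Start Date” and “End Date” columns:

# Subtracting two datetime columns yields a Timedelta for each row
duration = test3["End Date"] - test3["Start Date"]
duration.head()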

Is there anything else left in our dataset to clean?

You may have noticed the “Unnamed: 39” column when we previously ran the test3.dtypes command.

 

[Image: Raw dataset empty column]

 

It’s odd that all of the columns have meaningful names except this one. Perhaps the column is not going to be valuable for our analysis.

Before we ignore it, let’s have a quick look at the values in the column by running the command test3.loc[:, "Unnamed: 39"].

 

[Image: Raw dataset empty column values]

 

Okay, the column seems to be empty, or at least doesn’t contain any valuable data, so let’s delete (drop) it from our dataset.
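If we wanted to confirm this programmatically rather than by eye, a quick check along these lines should return True for a completely empty column:

# True if every value in the column is missing (NaN)
test3["Unnamed: 39"].isna().all()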


# Drop the empty column; axis=1 tells Pandas to drop a column, not a row
test4 = test3.drop("Unnamed: 39", axis=1)
test4.dtypes

The above command drops (deletes) the column from our DataFrame, so we now have a relatively clean dataset for analysis, built from our original dirty data file scraped from the web!
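If a file contained several completely empty columns, we wouldn’t need to drop them one by one; Pandas can remove them all in a single call. A minimal sketch (the test4_alt name is just for illustration):

# Drop every column whose values are all missing
test4_alt = test3.dropna(axis=1, how="all")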

 

[Image: Raw dataset empty column dropped]

 

The DataFrame is ready to perform analysis to answer questions such as: which agencies have awarded the highest total grant value?

# Total grant value per agency, sorted from largest to smallest
test4.loc[:, ["Agency", "Value (AUD)"]].groupby("Agency").sum().sort_values(by="Value (AUD)", ascending=False)

 

[Image: Raw dataset top agencies]
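And because the date/time columns are now proper datetimes, time-based questions are just as easy, for example how many grants were approved each year. A quick sketch using the “Approval Date” column:

# Count grants by approval year using the datetime accessor
test4["Approval Date"].dt.year.value_counts().sort_index()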

 

By starting your analysis with a clean dataset you can minimise errors and troubleshooting time, and the practice will also improve your data literacy over time!