In the previous post in this series we examined a variety of ways that data can be considered dirty. In this post, we will look at ways to clean dirty data, specifically inaccurate data. We will look at cleaning other forms of dirty data in subsequent posts. The hardest part of cleaning inaccurate data is to identify it, so we will focus on identification in this article.
We need accurate data to make informed decisions. Inaccurate data is data that does not reflect reality. This could be due to issues gathering or recording the data. Sometimes data may be ‘filled in’ when it is not able to be collected directly; making incorrect assumptions when filling in this data can lead to inaccuracies. Inaccuracies can occur when data is migrated from one system to another or when the method of collecting data lends itself to errors.
Identifying Inaccurate Data
You can try the following methods to identify inaccurate data in a dataset:
- Taking a sample of the dataset and checking this data (e.g. against a known good dataset)
- Contacting students and ask them to provide their updated details, such as in a periodic email
- Identifying inaccuracies at the point of collection, such as checking an address against a database of known addresses
- Examining the data in Excel to look for possible inaccuracies.
We will look at using Excel in this article for identifying inaccuracies.
Using Excel to check for inaccuracies
There are a few tricks you can use to look for inaccuracies in data using Excel. We will use the following sample data from a Moodle course to illustrate these approaches.
The first thing to do would be to look at the data and see if anything stands out. Look at the dataset above and identify anything that looks out of place? You might notice:
- User Id 4 is missing their first name
- City for User Id 4 is XXx
- User Id 7 has — for their phone number.
Sorting on columns
Only some errors will jump out during your first look. Sorting a column can be a good way to sense-check the values. Sorting on the Date of Birth column shows the earliest date to be in 1886. We can be fairly certain there is no one on the course that is over 130 years of age if this is a current class. This must be an incorrect date.
A handy feature of Excel is conditional formatting. This can be used to identify cells that meet some criteria. Look at your dataset and determine what you expect to see in a column. Then create a rule that will format cells that fall outside of these expected values. This will show you values that are potentially inaccurate. Below we have applied formatting to the Email column to look for cells without an @ symbol. We know that all email addresses must contain the @ symbol, so any without it must be inaccurate.
Another option when you have data with numbers is to use some basic formulae to check for accuracy. In the dataset below, students need to weigh each of five columns up to a total of 100. Adding a column that sums the five columns for each student and then applying conditional formatting for values that are not 100 shows us any that are inaccurate.
Cleaning Inaccurate Data
Having completed your analysis of the dataset you now need to fix the errors. How you do this will depend on what you have found and what options you have available to you. You firstly want to try to identify what the correct values should be. This could be done by going back to the source (e.g. the student or the dataset it was migrated from). You might need to replace the value with an average, such as the average for the column in which the inaccurate value resides. Note that you should remove the inaccurate values before determining the average!
In some instances you would be better off deleting values than replacing them. It is always better to try to contact a user for their correct contact details rather than guess them. If you can’t track it down, leaving the value blank would probably be a better idea.
In this article we have examined inaccurate data and looked at some ways to identify it using Excel. What are some other ways that you have used Excel to find and clean dirty data?