Cleaning Inaccurate Data
In the previous post in this series we examined a variety of ways that data can be considered dirty. In this post, we will look at ways to clean dirty data, specifically inaccurate data. We will look at cleaning other forms of dirty data in subsequent posts. The hardest part of cleaning inaccurate data is identifying it, so we will focus on identification in this article.
Inaccurate Data
We need accurate data to make informed decisions. Inaccurate data is data that does not reflect reality. This could be due to issues gathering or recording the data. Sometimes data is ‘filled in’ when it cannot be collected directly; incorrect assumptions made while filling in this data can lead to inaccuracies. Inaccuracies can also occur when data is migrated from one system to another, or when the method of collecting data lends itself to errors.
Identifying Inaccurate Data
You can try the following methods to identify inaccurate data in a dataset:
- Taking a sample of the dataset and checking this data (e.g. against a known good dataset)
- Contacting students and asking them to provide their updated details, such as via a periodic email
- Identifying inaccuracies at the point of collection, such as checking an address against a database of known addresses
- Examining the data in Excel to look for possible inaccuracies.
We will look at using Excel in this article for identifying inaccuracies.
Using Excel to check for inaccuracies
There are a few tricks you can use to look for inaccuracies in data using Excel. We will use the following sample data from a Moodle course to illustrate these approaches.
Visual Check
The first thing to do is look at the data and see if anything stands out. Look at the dataset above: can you identify anything that looks out of place? You might notice:
- User Id 4 is missing their first name
- City for User Id 4 is XXx
- User Id 7 has — for their phone number.
Sorting on columns
Only some errors will jump out during your first look. Sorting a column can be a good way to sense-check the values. Sorting on the Date of Birth column shows the earliest date to be in 1886. We can be fairly certain there is no one on the course that is over 130 years of age if this is a current class. This must be an incorrect date.
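The same sorting sanity check can be done outside Excel. Here is a minimal Python sketch, using made-up dates of birth (the column values are assumptions, not the post's actual dataset), that sorts the values and flags anyone who would be over 120 years old:

```python
from datetime import date

# Hypothetical sample: dates of birth as ISO strings, one clearly wrong
dobs = ["1999-04-12", "2001-08-30", "1886-02-17", "2000-11-05"]

# Sorting surfaces outliers at either end, just like sorting the column in Excel
dobs_sorted = sorted(dobs)

# Flag anyone who would be more than 120 years old today
implausible = [d for d in dobs_sorted if date.today().year - int(d[:4]) > 120]
print(implausible)
```

Because ISO dates sort lexicographically, the suspect 1886 value lands at the top of the sorted list, exactly as it would in a sorted Excel column.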
Conditional Formatting
A handy feature of Excel is conditional formatting. This can be used to identify cells that meet some criteria. Look at your dataset and determine what you expect to see in a column. Then create a rule that will format cells that fall outside of these expected values. This will show you values that are potentially inaccurate. Below we have applied formatting to the Email column to look for cells without an @ symbol. We know that all email addresses must contain the @ symbol, so any without it must be inaccurate.
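The same rule can be expressed as a one-line filter in Python. This sketch uses invented email addresses to mirror the conditional-formatting check above:

```python
# Hypothetical email column; the rule mirrors the Excel conditional format:
# any address without an '@' symbol is potentially inaccurate
emails = ["ana@example.com", "ben.example.com", "cara@example.org"]

flagged = [e for e in emails if "@" not in e]
print(flagged)
```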
Using Calculations
Another option when you have numeric data is to use some basic formulae to check for accuracy. In the dataset below, students need to weight each of five columns so that they total 100. Adding a column that sums the five columns for each student, and then applying conditional formatting to highlight values that are not 100, shows us any rows that are inaccurate.
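This sum check translates directly into code. The sketch below uses hypothetical student weightings (not the post's actual dataset) and is the equivalent of adding a SUM column and highlighting anything that is not 100:

```python
# Hypothetical weightings: five columns per student that should total 100
weights = {
    "student_1": [20, 20, 20, 20, 20],
    "student_2": [30, 30, 20, 10, 5],   # sums to 95 -- inaccurate
    "student_3": [25, 25, 25, 15, 10],
}

# Keep only the rows whose five values do not sum to 100
bad = {name: sum(w) for name, w in weights.items() if sum(w) != 100}
print(bad)
```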
Cleaning Inaccurate Data
Having completed your analysis of the dataset, you now need to fix the errors. How you do this will depend on what you have found and what options are available to you. First, try to identify what the correct values should be. This could be done by going back to the source (e.g. the student, or the dataset the data was migrated from). Alternatively, you might replace the value with an average, such as the average of the column in which the inaccurate value resides. Note that you should remove the inaccurate values before determining the average!
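The remove-before-averaging step is easy to get wrong, so here is a short sketch of it in Python, using invented scores where -1 stands in for a known-inaccurate reading:

```python
# Hypothetical scores where -1 marks a known-inaccurate reading
scores = [72, 68, -1, 80, 76]

# Remove the inaccurate values BEFORE computing the replacement average
valid = [s for s in scores if s != -1]
avg = sum(valid) / len(valid)

# Replace the inaccurate entry with the average of the remaining values
cleaned = [s if s != -1 else avg for s in scores]
print(cleaned)
```

Had the -1 been included in the average, it would have dragged the replacement value down and introduced a second inaccuracy.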
In some instances you would be better off deleting values rather than replacing them. It is always better to try to contact a user for their correct contact details rather than guess them. If you can’t track down the correct value, leaving it blank is probably the better option.
In this article we have examined inaccurate data and looked at some ways to identify it using Excel. What are some other ways that you have used Excel to find and clean dirty data?
Hi Jeff, I have a question regarding inaccurate data. Why is it much better to leave it blank, and how will we interpret the blank data?
Hello, often it is better to have no data than to have incorrect data. If an email address was wrong, say missing the ‘@’ symbol, and you guessed what it was, you could be sending communications to the wrong person. If you are making decisions based on your data, the wrong data can cause you to make a poor decision. Removing the inaccurate data by setting it to blank means that it will not impact any further analysis you do on the data, such as calculating averages or finding the max value. It is also a good idea to remove these data points when you are trying to do trend analysis as they can disguise the true trend.
You can also set the value to something that represents an ‘unknown’. In python this could be np.nan or NaT for time data. For text it could be the empty string, or you could use None. This tells you that you have missing data rather than a data point of 0.
When performing the analysis you would normally remove the blank entries and then complete your analysis. This is often better than setting a numeric value to zero. If you want to know how many of each item you have in bands of 5 (0-5, 6-10, 11-15…) you would not want a lot of 0’s representing missing or inaccurate data. So it would be better to remove them first. If you are confident that you can use the average or mean in place of the value, you can do so, but if this does not make sense for your data it is much better to mark it as blank and remove that data point from your analysis. You can always have an ‘Unknown’ category with the number of blanks when you do your analysis.
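The banding example above can be sketched in Python. The values here are made up; NaN marks the missing or inaccurate entries, which are dropped before binning and reported as an ‘Unknown’ count instead:

```python
import math

# Hypothetical values where NaN marks missing or inaccurate data
values = [3.0, float("nan"), 12.0, 7.0, float("nan"), 4.0]

# Drop the NaN entries before binning so they don't inflate the 0-5 band
present = [v for v in values if not math.isnan(v)]
unknown = len(values) - len(present)  # report blanks as an 'Unknown' category

# Count the remaining values into bands of 5 (0-5, 6-10, 11-15)
bands = {"0-5": 0, "6-10": 0, "11-15": 0}
for v in present:
    if v <= 5:
        bands["0-5"] += 1
    elif v <= 10:
        bands["6-10"] += 1
    else:
        bands["11-15"] += 1
print(bands, "Unknown:", unknown)
```

Setting the missing entries to 0 instead would have put two extra counts in the 0-5 band and distorted the distribution.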
Great post Jeff !
I love the way you are using Excel, something that everyone has (or an equivalent), to review the data – and it’s surprising how much even a visual check can reveal 🙂