ElearningWorld.org

For the online learning world

AnalyticsTechnical

Cleaning Inaccurate Data

In the previous post in this series we examined a variety of ways that data can be considered dirty. In this post, we will look at ways to clean dirty data, specifically inaccurate data. We will look at cleaning other forms of dirty data in subsequent posts. The hardest part of cleaning inaccurate data is to identify it, so we will focus on identification in this article.

Inaccurate Data

We need accurate data to make informed decisions. Inaccurate data is data that does not reflect reality. This could be due to issues gathering or recording the data. Sometimes data may be ‘filled in’ when it is not able to be collected directly; making incorrect assumptions when filling in this data can lead to inaccuracies. Inaccuracies can occur when data is migrated from one system to another or when the method of collecting data lends itself to errors.

Identifying Inaccurate Data

You can try the following methods to identify inaccurate data in a dataset:

  • Taking a sample of the dataset and checking this data (e.g. against a known good dataset)
  • Contacting students and ask them to provide their updated details, such as in a periodic email
  • Identifying inaccuracies at the point of collection, such as checking an address against a database of known addresses
  • Examining the data in Excel to look for possible inaccuracies.

We will look at using Excel in this article for identifying inaccuracies.

Using Excel to check for inaccuracies

There are a few tricks you can use to look for inaccuracies in data using Excel.  We will use the following sample data from a Moodle course to illustrate these approaches.

Visual Check

The first thing to do would be to look at the data and see if anything stands out. Look at the dataset above and identify anything that looks out of place? You might notice:

  • User Id 4 is missing their first name
  • City for User Id 4 is XXx
  • User Id 7 has — for their phone number.

Sorting on columns

Only some errors will jump out during your first look. Sorting a column can be a good way to sense-check the values. Sorting on the Date of Birth column shows the earliest date to be in 1886. We can be fairly certain there is no one on the course that is over 130 years of age if this is a current class. This must be an incorrect date.

Conditional Formatting

A handy feature of Excel is conditional formatting. This can be used to identify cells that meet some criteria. Look at your dataset and determine what you expect to see in a column. Then create a rule that will format cells that fall outside of these expected values. This will show you values that are potentially inaccurate. Below we have applied formatting to the Email column to look for cells without an @ symbol. We know that all email addresses must contain the @ symbol, so any without it must be inaccurate.

Using Calculations

Another option when you have data with numbers is to use some basic formulae to check for accuracy. In the dataset below, students need to weigh each of five columns up to a total of 100. Adding a column that sums the five columns for each student and then applying conditional formatting for values that are not 100 shows us any that are inaccurate.

Cleaning Inaccurate Data

Having completed your analysis of the dataset you now need to fix the errors. How you do this will depend on what you have found and what options you have available to you. You firstly want to try to identify what the correct values should be. This could be done by going back to the source (e.g. the student or the dataset it was migrated from). You might need to replace the value with an average, such as the average for the column in which the inaccurate value resides. Note that you should remove the inaccurate values before determining the average!

In some instances you would be better off deleting values than replacing them. It is always better to try to contact a user for their correct contact details rather than guess them. If you can’t track it down, leaving the value blank would probably be a better idea.

In this article we have examined inaccurate data and looked at some ways to identify it using Excel. What are some other ways that you have used Excel to find and clean dirty data?

Jeff Mitchell
Latest posts by Jeff Mitchell (see all)
blank

Jeff Mitchell

Jeff is passionate about the role of learning and development, and has a specific interest in how people and organisations can be developed in order to achieve their potential. Jeff has a keen interest in information technology and specifically data analysis and the e-learning space.

4 thoughts on “Cleaning Inaccurate Data

  • Nor Hazirah Binti Mohd Zaki

    Hi Jeff, I have a question regarding the inaccurate value data. Why it is much better to leave it blank and how we will interpret the blank data?

    Reply
    • Hello, often it is better to have no data than to have incorrect data. If an email address was wrong, say missing the ‘@’ symbol, and you guessed what it was, you could be sending communications to the wrong person. If you are making decisions based on your data, the wrong data can cause you to make a poor decision. Removing the inaccurate data by setting it to blank means that it will not impact any further analysis you do on the data, such as calculating averages or finding the max value. It is also a good idea to remove these data points when you are trying to do trend analysis as they can disguise the true trend.

      You can also set the value to something that represents an ‘unknown’. In python this could be np.nan or NaT for time data. For text it could be the empty string, or you could use None. This tells you that you have missing data rather than a data point of 0.

      When performing the analysis you would normally remove the blank entries and then complete your analysis. This is often better than setting a numeric value to zero. If you want to know how many of each item you have in bands of 5 (0-5, 6-10, 11-15…) you would not want a lot of 0’s representing missing or inaccurate data. So it would be better to remove them first. If you are confident that you can use the average or mean in place of the value, you can do so, but if this does not make sense for your data it is much better to mark it as blank and remove that data point from your analysis. You can always have an ‘Unknown’ category with the number of blanks when you do your analysis.

  • Anonymous

    Hello, I want to ask regarding the value of inaccurate data, why would leaving it blank much be better? and how would we interpret or analyze the blank data.

    Reply
  • blank ElearningWorld Admin

    Great post Jeff !
    I love the way you are using Excel, something that everyone has (or an equivalent) to review the data
    – and it’s surprising how much ever a visual check can reveal 🙂

    Reply

Add a reply or comment...