Data is only useful for analysis if you can rely on it, and to rely on data it must be clean. Clean data is free from errors, is tidy and complete. Unfortunately, it is rare that data is entirely clean. In the next few articles we will look at the process of identifying and cleaning dirty data. In this article we will examine the types of dirty data, how they may arise, and some of their implications. We will consider these from the perspective of using data from a Moodle installation.
Dirty data is data that has one or more problems with it that prevent it being completely accurate and reliable for use. Examples of dirty data include data that is:
- Of the wrong type
- Missing values, and
We will examine each of these issues and relate them to data you could be working with within a Moodle environment.
Inaccurate data is data that does not reflect reality. The data says ‘X’ but really ‘Y’ is the case. This is problematic as any inferences or decisions you make based on inaccurate data are likely to be inaccurate themselves. Some examples of inaccurate data could include student names that are spelled wrong, old or inaccurate email addresses, and students that have been enrolled into the wrong class. Inaccurate data usually occurs when data is loaded into the system, either user-supplied, such as via a web form, or when entered by an administrator. You would be amazed how many times I have come across people who could not spell their own name or enter their address correctly!
Incomplete data is data that is partial. This may be for individual records or for a subset of the records. It can be a real problem when only part of the required information is present. Imagine trying to send an enrolment pack to a user when all you have is their street address but no city? Or you don’t have their street number? Incomplete data can also occur when you only have some of the records, even if the individual records are complete. An example of this could be a teacher loading grades into a course but missing out the last 10 students.
The data type can become an issue when you come to analyse data. Having strings (characters) in what should be a number field will lead to all sorts of problems if the field is used for calculations. As a quick example, if you were to calculate 4 * 4 in python you would get the number 16. But if instead you had ‘four’ in the field, you would get the string ‘fourfourfourfour’. Probably not what you were after! Types can also be an issue when working with date or categorical data such as gender.
Data can become garbled when it is loaded into a system en masse. This can often occur when working with data in Excel and then importing the file into Moodle. If you sort on a column but do not expand the selection, you can suddenly have contact details for students all mixed up. Import that into Moodle and you now have some very garbled data. Data can also become garbled when you have issues with a file. One example is when the encoding goes awry and you get lots of weird characters in place of your original data.
Data duplication can often occur when you are importing data from various systems. In Moodle a common issue is when a user that is already in the system is set up with a new account. Now there are two accounts for the same user. When you try to run a report on the courses the user has completed, only some of them appear as they are now split across the two accounts. Duplicated data can also lead to inaccurate analysis. In SQL you will often determine the amount of records using a GROUP BY and COUNT statement. Having duplicated records can lead to the COUNT for a given variable being a lot higher than it really is.
Data that is missing values can lead to problems when you come to analyse the data or wish to group it by a variable. Missing location data for some students would mean you could not place them into a group based on their city, which could result in them missing out on communication regarding an event in their area. Performing counts on a column will be inaccurate if there are values missing from that column. The more data that is missing, the less accurate the data will be.
Variations of data can be a problem when you are trying to categorise data, group users by a specific attribute, or analyse specific values in a column. A good example would be gender – ‘Female’ might also appear as ‘F’, ‘fem’, ‘fe’, ‘femal’ etc. which would make it likely that many female users would not get identified as female. You want your data to be consistent and you can achieve this by ensuring you do not have variations for the same piece of data.
We have seen in this article a number of examples of dirty data, how they can appear in a Moodle setting, and what they can result in. In the next article we will introduce some ways of dealing with dirty data.