You have questions you want to answer. You have some data that you have pulled out of your Learning Management System. What can you do when the dataset you have is not enough to answer your questions? Do you throw in the towel? No! What you do is combine your dataset with one or more other datasets to give you the full picture you need to answer your questions.
In this article we will work through the process for combining data from two or more datasets. The steps we will work through are as follows:
- Identifying missing data
- Identifying matching fields
- Combining datasets.
Identifying missing data
Your first step is to figure out what data is missing from your current dataset. What do you need to know to answer your questions? Look at our initial dataset:
What fields are missing from this dataset that will prevent you from answering the question ‘How many students enrolled in course SYS-200 during the period March – July 2020?’? Here you would need to have a field that includes the month and year that the student enrolled into the course.
Imagine you wanted to know the cities that students in SYS-200 lived in as you needed to book venues for a final practical assessment. What information is missing from the sample dataset that prevents this? You would need to know each student’s city of residence. You can source a dataset that holds the missing information once you identify what that missing information is. How will you then combine this new dataset with your existing dataset? By making sure the new dataset also contains a field that matches a field in your first dataset.
Identifying matching fields
Each dataset needs to contain at least one field that matches; you are unable to join two completely different datasets. This way each record (row in the dataset) can be matched up correctly. The matching field that we are looking for in our additional dataset is the unique identifier.
A unique identifier identifies a single entity in a dataset and only appears once in the dataset. Our sample dataset has a combination of fields as its identifiers. Each combination of Student ID and Course ID appears once, making them a combined identifier. Student ID and Course ID are themselves unique identifiers in their own datasets – the Students dataset and the Courses dataset.
Examining the Students dataset shows us that Student ID is the unique identifier – each Student ID appears once and the information for that student is contained only in their row. For the Courses dataset the Course ID is the unique identifier. Each course has a unique Course ID that appears only once in the dataset. You can see this in the two datasets below:
We can use the Student ID from the Students table to add any available student data to our sample dataset. Matching records by the Student ID ensures that the correct data goes with the correct student. Likewise, we could add course data from the Courses table using the Course ID field to ensure that we add course data to the correct courses, based on the Course ID. Matching fields between datasets allows us to join them.
We combine datasets by taking one or more fields from one dataset and adding them to another dataset. We match the datasets up on the matching fields (our unique identifiers) so that the correct information is added to the correct rows. The end result is that we have columns from each dataset in the combined dataset. We repeat this until we have a dataset that can answer our questions.
Here are two examples of combining our Students and Courses datasets with our sample dataset. These examples allow us to answer questions the sample dataset cannot answer on its own.
Combining Students data
To answer the question ‘What cities do students in SYS-200 reside in?’ we must add the student’s Country and City information from the Students table to our sample data. We match the data on the Student ID column to make sure we add the country and city data to the correct students. Note that we include the Country data in case the same city name is in multiple countries.
Below is how the sample data will now look. Note that we now have enough data to know which cities students in SYS-200 reside in.
Combining Courses data
The Head of Department for the Economics Department would like a list of each student that has completed a course from the department. We need to add the department to the sample dataset to answer this. How do we do this? By adding the Department column from the Courses table. We match the Course ID column in the sample data to make sure we add the correct department to each entry.
Below is how the sample data will now look. Note that we now have enough data to know which department each student is studying in which will allow us to identify all students that are studying in the Economics Department.
What other questions could you answer by combining the Students or Courses datasets with the sample dataset? What columns do you need to add, and what columns do you need to match on? Have a go at identifying some for yourself.
We have seen how we can combine different datasets on matching fields to answer questions that a single dataset cannot answer by itself. We do this by adding fields from additional datasets and matching records on a matching field that exists in both datasets. This greatly increases our ability to answer questions using data analysis. Data cannot answer everything though and in our next article in this series we will look at some of the things data cannot do.
You can view the previous posts in this series here: