In the third article in this series we looked at developing an understanding of your dataset. Understanding your dataset allows you to identify what the data can tell you. This in turn allows you to determine what questions the data could answer. It is essential that you understand how your dataset is structured and what it contains before you try to formulate questions.
Determining your questions
The first step when formulating questions is to summarise what your dataset tells you. Below is a summary statement of the sample dataset. Read this statement and then come up with some potential questions for the dataset based on this description.
This dataset tells us the status of each course a student is enrolled in and their completion percentage for that course. It shows us the student’s enrolment history along with their current workload. We could combine this dataset with a course dataset or access logs from the LMS to build a wider picture of the student’s study progress within the institute.
The dataset looks like this:
Here are some questions that you could ask of the dataset based on this description:
- Which students currently have a high workload?
- How far through a specific course is a specific student?
- Which courses has a specific student been enrolled in?
- Which courses are the most popular?
- Are there any courses that currently have no students enrolled in them?
- Which students are close to completing a course?
- Which students have not yet started a course?
Each of the questions above can be answered by examining the data contained in the dataset. A thorough examination of our dataset will tell us which questions we can answer with it. It will also highlight the questions that we cannot answer with the dataset in its current form.
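As a rough illustration, the sketch below answers a few of these questions with plain Python. The rows, the `Completion %` field name, the status values and both thresholds are assumptions made for the example; only Student ID, Course ID and Course Status are named in this article, so adjust the rest to match your own dataset.

```python
from collections import Counter

# Hypothetical rows: field names other than Student ID, Course ID and
# Course Status, and all values, are assumptions for illustration only.
enrolments = [
    {"Student ID": "S001", "Course ID": "SYS-200", "Course Status": "In Progress", "Completion %": 85},
    {"Student ID": "S001", "Course ID": "ECO-100", "Course Status": "In Progress", "Completion %": 40},
    {"Student ID": "S001", "Course ID": "SYS-300", "Course Status": "In Progress", "Completion %": 10},
    {"Student ID": "S002", "Course ID": "SYS-200", "Course Status": "Completed", "Completion %": 100},
    {"Student ID": "S002", "Course ID": "ECO-100", "Course Status": "Not Started", "Completion %": 0},
    {"Student ID": "S003", "Course ID": "SYS-200", "Course Status": "In Progress", "Completion %": 95},
]

HIGH_WORKLOAD = 3  # assumed: three or more active courses counts as "high"
NEARLY_DONE = 80   # assumed: 80% or more counts as "close to completing"

# Which students currently have a high workload?
active = Counter(
    row["Student ID"] for row in enrolments if row["Course Status"] == "In Progress"
)
high_workload = sorted(s for s, count in active.items() if count >= HIGH_WORKLOAD)

# Which courses are the most popular (by number of enrolment entries)?
popularity = Counter(row["Course ID"] for row in enrolments)
most_popular = popularity.most_common(1)[0][0]

# Which students are close to completing a course?
nearly_done = sorted(
    (row["Student ID"], row["Course ID"])
    for row in enrolments
    if row["Course Status"] == "In Progress" and row["Completion %"] >= NEARLY_DONE
)

# Which students have not yet started a course?
not_started = sorted(
    (row["Student ID"], row["Course ID"])
    for row in enrolments
    if row["Course Status"] == "Not Started"
)

print(high_workload, most_popular, nearly_done, not_started)
```

Each answer is a simple filter or count over existing fields, which is why these questions suit the dataset as it stands.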
The questions that you can answer with a dataset are limited. You can only answer a question if:
- The dataset has the fields required to answer the question
- The dataset contains enough data to answer the question
- The structure of entries allows the question to be answered
Let’s examine each of these limitations by considering some questions we cannot answer with our sample data.
The dataset has the fields required to answer the question
Consider the question ‘How many students enrolled in course SYS-200 during the period March – July 2020?’. To answer it, we need to know the month in which each student enrolled in the course. Our sample dataset does not contain a field for the month of enrolment, so we are unable to answer this question from this dataset. Similarly, we would not be able to answer the question ‘What is the average age of students enrolled in SYS-300?’ as we do not have a field that tracks the student’s age. The dataset lacks the fields required to answer these questions.
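One way to make this check explicit is to list the fields a question needs and compare them with the fields the dataset provides. The following is a minimal sketch; the helper `missing_fields` and the field names `Enrolment Month` and `Completion %` are hypothetical names chosen for the example.

```python
def missing_fields(rows, required):
    """Return the required field names that no row in the dataset provides."""
    available = set()
    for row in rows:
        available.update(row.keys())
    return sorted(set(required) - available)


# A single hypothetical row is enough to show which fields exist.
rows = [
    {"Student ID": "S001", "Course ID": "SYS-200",
     "Course Status": "In Progress", "Completion %": 85},
]

# Fields needed for: how many students enrolled in SYS-200 in a given period?
enrolment_period_question = ["Student ID", "Course ID", "Enrolment Month"]

# Fields needed for: how far through a specific course is a specific student?
progress_question = ["Student ID", "Course ID", "Completion %"]

print(missing_fields(rows, enrolment_period_question))  # ['Enrolment Month']
print(missing_fields(rows, progress_question))          # []
```

An empty result means the dataset has every field the question needs; anything else is a list of fields you would have to source elsewhere.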
The dataset contains enough data to answer the question
The amount of data within a dataset determines whether it can answer a question. You can only answer questions about an entire population if the whole population exists in the dataset. For example, the dataset would need to contain every course on offer to answer the question ‘Which of the courses offered has the most students enrolled?’. This question could not be answered if you had only extracted data for the ECO and SYS courses.
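If you have a list of everything the institute offers, a coverage check like the sketch below can flag this problem before you draw conclusions. The course catalogue, the course codes other than SYS-200 and SYS-300, and the helper name `uncovered_courses` are assumptions for illustration. A course with no enrolments would also show up as uncovered, so treat a gap as a prompt to investigate rather than proof that data is missing.

```python
def uncovered_courses(rows, offered_courses):
    """Return offered courses that have no entries at all in the dataset."""
    present = {row["Course ID"] for row in rows}
    return sorted(set(offered_courses) - present)


# Hypothetical catalogue of every course on offer.
offered = ["ECO-100", "MAT-100", "SYS-200", "SYS-300"]

# Hypothetical extract containing only ECO and SYS courses.
rows = [
    {"Student ID": "S001", "Course ID": "ECO-100"},
    {"Student ID": "S002", "Course ID": "SYS-200"},
    {"Student ID": "S003", "Course ID": "SYS-300"},
]

gaps = uncovered_courses(rows, offered)
covers_all_courses = not gaps

print(gaps)                # ['MAT-100']
print(covers_all_courses)  # False
```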
The structure of entries allows the question to be answered
The structure of the entries in a dataset determines the questions that can be answered with it. Having a Year column in the dataset would allow us to examine the data on a year-by-year basis. The same entry could then appear once per year with a different Year value, which would allow you to compare the data across years. In the same way, if the sample dataset had a Month column, each Student ID and Course ID combination could have a separate entry for each month, recording the Course Status in that specific month. Without this structure we are unable to answer the question ‘Which month has the highest number of students completing a course?’.
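To make the idea concrete, here is a sketch of what such a month-by-month structure might look like and how it would answer the question. The rows, the `YYYY-MM` month format and the status values are assumptions; the sample dataset in this article has no Month column.

```python
from collections import Counter

# Hypothetical structure: one entry per Student ID, Course ID and Month.
monthly = [
    {"Student ID": "S001", "Course ID": "SYS-200", "Month": "2020-03", "Course Status": "In Progress"},
    {"Student ID": "S001", "Course ID": "SYS-200", "Month": "2020-04", "Course Status": "Completed"},
    {"Student ID": "S002", "Course ID": "SYS-200", "Month": "2020-04", "Course Status": "Completed"},
    {"Student ID": "S003", "Course ID": "ECO-100", "Month": "2020-04", "Course Status": "In Progress"},
    {"Student ID": "S003", "Course ID": "ECO-100", "Month": "2020-05", "Course Status": "Completed"},
]

# Record the first month in which each student/course pair appears as Completed,
# so a course that stays Completed in later months is only counted once.
first_completed = {}
for row in sorted(monthly, key=lambda r: r["Month"]):
    if row["Course Status"] == "Completed":
        key = (row["Student ID"], row["Course ID"])
        first_completed.setdefault(key, row["Month"])

# Which month has the highest number of course completions?
completions_per_month = Counter(first_completed.values())
busiest_month, completions = completions_per_month.most_common(1)[0]

print(busiest_month, completions)  # 2020-04 2
```

The question only becomes answerable because the status is recorded per month; a single entry per student and course holds the latest status and nothing about when it changed.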
Take another look at the sample dataset. What are some questions that we would not be able to answer with this dataset? What would the dataset need to contain for us to answer these questions?
Combining datasets to answer questions
You will often find that the dataset you have is not enough to answer all the questions you have of it. This is especially true if you have only extracted data from a single table in the database. In that case you may want to join your dataset with another dataset that contains further information, so that together they paint a bigger picture. With our sample dataset, we could join it to another dataset that looks like the one below to answer questions around the demographics of students in specific courses:
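As a minimal sketch of such a join, assume a hypothetical demographics dataset keyed on Student ID with an invented `Age Group` field. All rows and values below are made up for illustration, and your real demographic data may be structured quite differently.

```python
from collections import Counter

# Hypothetical enrolment rows from the sample dataset.
enrolments = [
    {"Student ID": "S001", "Course ID": "SYS-200"},
    {"Student ID": "S002", "Course ID": "SYS-200"},
    {"Student ID": "S003", "Course ID": "SYS-200"},
    {"Student ID": "S004", "Course ID": "SYS-200"},
    {"Student ID": "S001", "Course ID": "ECO-100"},
]

# Hypothetical demographics dataset, keyed on Student ID.
demographics = {
    "S001": {"Age Group": "18-24"},
    "S002": {"Age Group": "25-34"},
    "S003": {"Age Group": "18-24"},
}

# Join on Student ID, keeping only enrolments with a matching demographic
# record (an inner join). S004 has no record, so that row is dropped.
joined = []
for row in enrolments:
    extra = demographics.get(row["Student ID"])
    if extra is not None:
        joined.append({**row, **extra})

# What is the age-group breakdown of students in SYS-200?
age_groups_in_sys200 = Counter(
    row["Age Group"] for row in joined if row["Course ID"] == "SYS-200"
)

print(dict(age_groups_in_sys200))
```

Note that an inner join silently drops students who are missing from the second dataset, so it is worth comparing row counts before and after the join.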
We can only answer questions that our dataset is structured to answer. When the dataset does not contain all the data required to answer a question, we can either change the question or combine the dataset with one or more additional datasets until it can answer it. In our next article we will examine the process of combining datasets to answer specific questions.