Writing documentation can be one of the least enjoyable aspects of working with data. It is also one of the most important. In this article we will look at why you should document your data tasks and how you can go about it.
Do your future self a favour
Documenting data tasks is something that you do for your future self and any others who will be working with your data. Six months from now, will you remember where you got that data from? What parameters were used? Why you dropped 70 records from it? These answers may be obvious to you when completing your data tasks but will probably be hazy a few months (and many data tasks) down the track. Documentation is critical if you (or someone else) need to repeat your data task or troubleshoot the data it produced.
What to document
You should consider documenting the following while completing a data task:
- Where you got your data from
- How you got your data, such as the report you ran or the SQL query you used
- Any parameters you used to generate the data, such as a specific date range
- Any cleaning steps you took including the dropping of records
- Any manipulation of the data you undertook such as mapping of categories
- Any assumptions you made when gathering the data
- The requirements of the data task – what is the intended output of the task?
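One way to make sure none of these items is forgotten is to capture them in a small structure as you work. The following is only a sketch: the `DataTaskRecord` class, its field names and all of the example values are made up for illustration, and you would adapt them to your own task.

```python
from dataclasses import dataclass, field


@dataclass
class DataTaskRecord:
    """Hypothetical container for the items in the checklist above."""

    requirement: str       # intended output of the task
    source: str            # where the data came from
    retrieval_method: str  # report that was run, SQL query used, etc.
    parameters: dict = field(default_factory=dict)
    cleaning_steps: list = field(default_factory=list)
    manipulations: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)

    def to_text(self) -> str:
        """Render the record as plain text suitable for a notes file."""
        lines = [
            f"Requirement: {self.requirement}",
            f"Source: {self.source}",
            f"How obtained: {self.retrieval_method}",
            "Parameters:",
        ]
        lines += [f"  - {name}: {value}" for name, value in self.parameters.items()]
        sections = [
            ("Cleaning steps", self.cleaning_steps),
            ("Manipulations", self.manipulations),
            ("Assumptions", self.assumptions),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            lines += [f"  - {item}" for item in items]
        return "\n".join(lines)


# All values below are illustrative, not taken from a real project.
record = DataTaskRecord(
    requirement="Monthly count of orders by category",
    source="Sales database",
    retrieval_method="SQL query saved as orders_by_month.sql",
    parameters={"start_date": "2024-01-01", "end_date": "2024-01-31"},
    cleaning_steps=["Dropped 70 records with no order date"],
    manipulations=["Mapped legacy category codes to current category names"],
    assumptions=["Orders without a category belong to 'Other'"],
)

print(record.to_text())
```

The printed text can be pasted into whatever documentation you keep alongside the data.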
Documenting within your analysis
When writing code to perform your data tasks, either in a script or a Jupyter Notebook, it is a good idea to provide comments where the code itself is not self-explanatory. Often this will be to explain why you are taking a specific action. For instance, it may be obvious from the code that you are dropping all records for a specific user. What you may want to know in a year’s time when you revisit this code is why you dropped only that user and whether you need to drop them again when repeating the analysis. A comment stating that the user was temporarily on leave tells you that the exclusion was specific to the time the code was first written and does not need to be repeated in future.
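As a minimal sketch of what such a comment might look like, the snippet below drops one user and records the reason next to the code. The user names, the `tickets_closed` field and the numbers are all invented for the example.

```python
# Illustrative records; the names and counts are made up.
records = [
    {"user": "alice", "tickets_closed": 12},
    {"user": "bob", "tickets_closed": 0},
    {"user": "carol", "tickets_closed": 9},
]

# Why: "bob" was temporarily on leave during this reporting period, so his
# zero count would distort the team average. This exclusion is specific to
# this run. Check whether it still applies before repeating the analysis.
EXCLUDED_USERS = {"bob"}

kept = [row for row in records if row["user"] not in EXCLUDED_USERS]
average = sum(row["tickets_closed"] for row in kept) / len(kept)

print(f"Average tickets closed: {average}")
```

The code alone shows what was dropped; only the comment tells a future reader that the exclusion was temporary.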
It can be a good idea to document your assumptions where these impact your analysis. You may have cleaned your data by using today’s date in place of any missing dates. Explaining the assumptions that led you to make this choice will help you to decide if that is an appropriate action to take when repeating the analysis. It will also help you to understand the data better if you are looking at it again in the future.
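The missing-date example could be documented along these lines. This is a sketch under stated assumptions: the `closed_on` field, the `fill_missing_dates` function and the reason given for the missing dates are hypothetical, and your own reasoning may well differ.

```python
from datetime import date


def fill_missing_dates(rows, fill_date):
    """Return copies of rows with a missing 'closed_on' replaced by fill_date.

    Assumption (illustrative): a missing closing date means the item is
    still open, so it is treated as closed on the day the analysis ran.
    If missing dates could instead mean a data-entry error, this step
    should be reconsidered before the analysis is repeated.
    """
    cleaned = []
    for row in rows:
        was_missing = row["closed_on"] is None
        cleaned.append({
            **row,
            "closed_on": fill_date if was_missing else row["closed_on"],
            # Keep a flag so filled values can be told apart later.
            "date_was_filled": was_missing,
        })
    return cleaned


# Made-up rows for illustration.
rows = [
    {"id": 1, "closed_on": date(2024, 3, 1)},
    {"id": 2, "closed_on": None},
]

cleaned = fill_missing_dates(rows, fill_date=date.today())
for row in cleaned:
    print(row)
```

Keeping a flag such as `date_was_filled` is a design choice rather than a requirement, but it means the assumption is recorded in the data as well as in the docstring.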
How to document data tasks
We’ve already discussed one way of documenting data tasks, which is in the form of comments within your code. If you are performing the data task in a Jupyter Notebook then you should make good use of markdown cells to document your task. The start of the notebook should explain the task requirements, inputs, outputs and how the data was gathered. Each section of the notebook should have a markdown cell that explains what follows and provides any information necessary to understand the code.
The most basic documentation would be a simple text file kept in the same folder as your data. Often people will include a README file which will explain the data task (and software involved) in more detail. In both approaches the file is stored with the rest of the data task files. This is important as it keeps your documentation within the project, so nobody has to track down the data files and the documentation separately.
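If you would rather generate that file from your script than write it by hand, something like the following could work. It is a rough sketch: the `write_readme` function, the file name `README.txt`, the section headings and the example content are all assumptions, and a temporary directory stands in for your project folder.

```python
from pathlib import Path
import tempfile


def write_readme(data_dir, title, sections):
    """Write README.txt inside data_dir and return its path."""
    lines = [title, "=" * len(title), ""]
    for heading, body in sections.items():
        lines += [heading, "-" * len(heading), body, ""]
    readme_path = Path(data_dir) / "README.txt"
    readme_path.write_text("\n".join(lines), encoding="utf-8")
    return readme_path


# A temporary directory stands in for the folder that holds your data.
project_dir = Path(tempfile.mkdtemp())

# Illustrative content only.
readme = write_readme(
    project_dir,
    "Monthly orders extract",
    {
        "Source": "Sales database, queried with orders_by_month.sql",
        "Parameters": "Orders dated 2024-01-01 to 2024-01-31",
        "Cleaning": "Dropped 70 records with no order date",
    },
)

print(readme.read_text(encoding="utf-8"))
```

Because the README is written into the same folder as the data, copying or archiving that folder takes the documentation with it.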
I want to finish by mentioning change logs. These are files that track the changes made across different versions of a piece of software. This is useful if you are writing scripts over time to perform the same data task. Seeing the changes to the script shows you how the data task has changed over time and can be useful when troubleshooting; understanding when certain changes took place in the script can help you to pinpoint when bugs were introduced or help you to understand why the output has changed.
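A change log does not need special tooling; a plain text file with dated entries is often enough. The sketch below shows one possible way to maintain it from Python. The `add_changelog_entry` function, the file name `CHANGELOG.txt`, the version numbers and the entries themselves are made up for illustration.

```python
from datetime import date
from pathlib import Path
import tempfile


def add_changelog_entry(changelog_path, version, changes, entry_date):
    """Insert a new entry at the top of the change log (newest first)."""
    path = Path(changelog_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    entry_lines = [f"{version} ({entry_date.isoformat()})"]
    entry_lines += [f"- {change}" for change in changes]
    entry = "\n".join(entry_lines) + "\n\n"
    path.write_text(entry + existing, encoding="utf-8")


# A temporary directory stands in for your project folder.
changelog = Path(tempfile.mkdtemp()) / "CHANGELOG.txt"

# Illustrative entries only.
add_changelog_entry(
    changelog, "1.0", ["First version of the extract script"], date(2024, 1, 5)
)
add_changelog_entry(
    changelog, "1.1", ["Excluded test accounts from the output"], date(2024, 2, 9)
)

print(changelog.read_text(encoding="utf-8"))
```

With dated entries like these, a change in the output can be matched against the version of the script that introduced it.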
Do yourself (and anyone else working with your data) a favour and leave good documentation in place. You will thank yourself for this.