This article is the first in a series looking at version control and how it can help you work effectively with data. Version control is used heavily in software development and can also be applied to documents, websites or any other files that change over time.
Version control is a way of tracking changes to files over time. It allows you to revert to a previous version of a file, or to keep separate versions of the same file in different states. In software development this could mean having separate versions of a piece of code. For data analysis it could mean having one version of your dataset that you use for analysis and a separate version that you use for your documentation. Version control manages these versions and keeps a history of the changes made to each file.
Version Control and Data Analysis
For these articles we will consider version control from the perspective of performing data analysis on some Moodle data. Your analysis process could involve three stages which you may want to keep separate:
- Cleaning your datasets
- Analysing your data
- Communicating your findings
Imagine that you are undertaking a project to determine how the length of courses in your Moodle installation relates to the completion rates of those courses. You have run some reports in Moodle and saved the data as CSV files which you will be working with. You also have some data from a separate database that holds student information beyond that which is contained within Moodle. These datasets need to be cleaned up before you analyse them and then prepare a presentation to communicate what you have found.
You completed a similar process a couple of years ago looking at student retention. This proved to be painful and you faced some major problems. You did a bunch of work cleaning up one of the files only to find that some of the data you had ‘cleaned’ (read ‘removed’) you actually needed. So you had to start over again.
Last time you did most of your analysis using Excel. This worked, but for this investigation you have decided to write some Python code instead. Writing code is a new skill for you and you know you will make plenty of errors. You want to be able to go back to a known good version of your code without saving multiple copies of the same file.
Another issue you faced last time was preparing the data for presentation. You would make changes to the Excel file so you could take screenshots to put into your PowerPoint and PDF files. Then when you went back to use the files you had to keep undoing these changes. There must be a better way.
Version Control to the Rescue
Version control can address each of the issues you faced last time. It will allow you to keep different versions of the same file, so you can format it for presentation in one version without affecting the version you use for analysis. It can keep previous versions of a file and a record of the changes made to the file. This will allow you to go back to a previous version if you introduce ‘bugs’ into your code and want to undo them. You can even use it as a form of backup by storing the files online as well as on another device, with all the different file versions maintained.
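To make the idea of keeping previous versions a little more concrete, here is a minimal Python sketch of a toy history for a single file. It is only an illustration and not how Git or any real version control tool works internally. The class name `TinyHistory`, its methods and the sample course data are all made up for this example.

```python
import hashlib


class TinyHistory:
    """Toy in-memory history of one file's contents (hypothetical, not Git)."""

    def __init__(self):
        # Each entry is (version_id, message, content), oldest first.
        self._snapshots = []

    def commit(self, content: str, message: str) -> str:
        """Store a snapshot of the content and return a short identifier."""
        version_id = hashlib.sha1(content.encode("utf-8")).hexdigest()[:8]
        self._snapshots.append((version_id, message, content))
        return version_id

    def log(self):
        """Return (version_id, message) pairs, oldest first."""
        return [(vid, msg) for vid, msg, _ in self._snapshots]

    def checkout(self, version_id: str) -> str:
        """Return the content that was stored under the given identifier."""
        for vid, _, content in self._snapshots:
            if vid == version_id:
                return content
        raise KeyError(version_id)


history = TinyHistory()

# Made-up sample data standing in for a CSV report exported from Moodle.
raw = "course,weeks,completed\nIntro,6,yes\nAdvanced,,no\n"
first = history.commit(raw, "Raw export from Moodle")

# 'Cleaning' that removes the row with a missing value.
cleaned = "course,weeks,completed\nIntro,6,yes\n"
second = history.commit(cleaned, "Drop rows with missing weeks")

# If the removed row turns out to be needed, the earlier version is still there.
restored = history.checkout(first)
```

Every change is stored alongside a short message, so `history.log()` gives a record of what was done and why, and `history.checkout(first)` brings back the data as it was before cleaning. Real version control tools do much more than this, but the idea of recording snapshots and returning to them is the same.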
Version control is a powerful tool and essential if you are writing code or developing software to perform data analysis. In the next article we will look at a piece of version control software called Git and see how it puts these ideas into practice.