Data analysis is an iterative multi-step process. The CRISP-DM model (cross-industry standard process for data mining) describes the process from business understanding - asking the right questions from the business perspective - to the deployment of the final product. This can be a simple report or a complex piece of software that solves a business problem using data.
This book is mostly concerned with the data preparation step, which is typically the most time-consuming step in the process. Data preparation consists of the following tasks:
- Collecting and gathering data
- Transferring data between file systems, databases and analysis tools
- Integrating data from different sources
- Filtering and aggregating data
- Transforming data and generating features or KPIs
Just like the overall CRISP-DM process, this happens iteratively and involves a lot of trial and error. It's a messy process with lots of dead-ends, throw-away code and quick and dirty workarounds. You will often need to go back to the beginning to collect and integrate some additional data that you didn't know was relevant. When you look at the features, you'll often find your expectations violated by freak cases that you should have dealt with earlier in the process.
An important consideration that many young data scientists and business analysts forget about is reproducibility. With simple spreadsheet tools like Excel, it is not uncommon for you to take the data and modify parts of it - not to manipulate the results, but simply to get the data into the format that you need to perform your analysis.
To someone new to the field, this may not sound like a big problem. Isn't it the end result that really counts? Maybe you are trying to build a recommendation engine or answer a question from a business analyst. Who cares how you do it, as long as you produce the desired result or answer? After all, it's often much faster to just do some transformation steps with Excel.
Obviously, it's not that simple. It is quite likely that you may have to either repeat the analysis due to some additional requirement, explain to someone else the steps that you took to get to the final result or perform a similar analysis a few weeks later. In the worst case, you may have to do all the work again because you're not sure what exact steps you took during the iterative trial-and-error process of preparing the data.
That sucks, but it's nothing compared to what can happen when you're working in the field of fraud detection, for example supporting an attorney in a court case that involves analyzing large amounts of data like bank transactions, accounting records or emails. Here, you will have to account for every record that you received, guarantee that you didn't alter the original data and be able to explain every single step you took from the raw data you received to the final answer. If there is any step in your data preparation and analysis that you cannot justify, your entire analysis can be torn to shreds and you may lose the case without a second chance. I have been part of a data analyst team that worked with attorneys on cases involving potential fines as big as several billion US dollars. Let's just say that seamless reproducibility is more than a nice-to-have in situations like this. However, working like this is a valuable experience because it teaches you the importance and the methods of reproducible research quite thoroughly.
How do you make your work reproducible? The first step of receiving data is to save a copy of the original data somewhere and never touch it again. This is your raw data and you have to keep it to be able to prove that you did not make any changes to the data. Document the number of directories and files that you received along with hash sums for every single file. If you need to be rigorous, prepare a data handover sheet with all this information and let the person who handed you the data confirm it. Do not alter this copy of the data in any way and make sure to back it up somewhere.
You will usually want to import your data into a database or directly into your analysis tool. When you are doing this, make sure to document every single step. Use command-line based tools over point-and-click tools because code is much easier reproducible than a set of mouse movements and clicks. If you absolutely must use a point-and-click tool like an import wizard, make screenshots of every dialog box on the screen and write down every parameter or option that you specified. When you are using code, write comments into your code explaining why you chose specific parameters.
After loading the data, you will need to check that everything has been correctly imported. Therefore, you need to save all the log files and error tables created by your load tool and by the database or analysis tool. Prepare this before you load the data, because some output may not be stored persistently on your computer but simply be displayed as console output during the import.
Avoid any kind of sampling. If you need to sample, use a seed for your random number generator to obtain a random but reproducible sample. Do not use non-deterministic transformations on your data. For example, when you sort your data by a column with non-unique values, you could get different results every time, depending on the tool you use.
Try to import every column as text data and convert it to its final type within your analysis tool. This will give you more control over the typecasting process and it will allow you to do some consistency checks, e.g. to make sure that the same date format is used in all records. The advantages of importing data as text before typecasting have been discussed in previous chapters.
Calculate check sums within the data to verify that no records have been deleted or tampered with. When you have sales data, calculate yearly sums and make sure that they are equal to the yearly sums in the origin system.
When you need to filter your data, count the number of included and excluded records. Depending on your situation, you may need to account for every single record, so make sure to not apply a filter without documenting why you applied it and how many records were included and excluded as a consequence.
These are just some of the precautions you can take. The most important thing, however, is that someone who wants to repeat your analysis can do so easily. He or she needs to be able to understand what steps you took to get from the original raw data to your results. If this includes more than one script of code, make sure to write some documentation detailing what scripts need to be executed in which order. You may also need to specify what manual steps were required or which parameters you chose.
For a cool story on the importance of reproducibility in biology, I really recommend a video called 'The Importance of Reproducible Research in High-Throughput Biology'  that shows a talk from Keith A. Baggerly. You should be able to find it on YouTube.
- Use command line tools and code over point-and-click processes to make your work more easily reproducible
- Keep the original data you received and never touch it. Use hash sums to make sure that the data is in the same state it was when you first got it.