It is common knowledge among data scientists that 80% of the time invested in a data analysis project is spent on the supposedly mundane task of preparing the data. Before any type of predictive model can be applied, you need to find the right data, clean it, reorganize it, integrate it with other data sources and generate relevant features.

While this is not a secret, it is frequently overlooked in data science textbooks, university programs, online courses and industry conferences. When people talk about data science, they focus on the glamorous side: Predictive models with accuracy that comes close to divination! Pattern recognition techniques uncovering hidden truths in the data that are invisible to the human eye! Algorithms that make better and more informed decisions at a speed a million times faster than any human being possibly could!

I'm not denying that this is in fact where the value of the data scientist's work lies. It is what makes a difference in the business and what you talk about when you want to convince your boss - or the boss of your boss - to invest in data science projects. But this is not where the actual battle is fought. In my experience, every project that involves some form of data analysis hinges on good data pre-processing and feature generation. Whether you use complex neural networks or a simple linear regression, you'll find that irrelevant features or features based on incomplete or erroneous data will always lead to poor results. On the other hand, when your data is clean and reliable and you generate the right features, most of the available models will do a fairly good job. In other words, the process of preparing the data is not only the biggest factor influencing the project duration, but also a key factor in determining whether your project will be successful.

Despite this simple fact, people tend to dismiss data pre-processing as simple and boring. This is understandable, given that it doesn't involve complex mathematical concepts or fancy modelling techniques. But that doesn't mean that it is easy. If it were, data scientists - usually fairly smart people - would not need to spend 80% of their time on it. So what makes data preparation so time-consuming?

As a matter of fact, most of the problems that come up when you prepare data are easily solved. The time-consuming part is the iterative process of trial and error that you will go through until you have resolved all major problems. Every iteration cycle costs time. That time goes into:

  • Wondering about weird, unexplainable results or errors

  • Finding the root cause of the problem

  • Communicating the problem, e.g. by creating an error ticket or talking to colleagues

  • The mental, physical and administrative overhead of switching tasks

  • Locating and fixing the problem

  • Checking if the problem has been fixed

  • Documenting the solution or the changes

  • Recalculating the dataset (potentially very costly!)

  • Repeating data quality checks

  • Repeating all subsequent steps - often including model training and evaluation

  • Deploying the fixed data product, e.g. the data pipeline

The problem itself can usually be solved quickly and with little effort, often in minutes. But the total time required to recover from it can add up to days! If you can spot a problem before it breaks something, you avoid an entire iteration cycle.

Let me repeat that, because it is important! Data pre-processing takes up 80% of your time and effort! Doing it right is the single most important factor determining the quality of your results! And yet, nobody ever talks about it. Textbooks, university programs, online courses and industry conferences largely ignore it! That's why I wrote this book.

I'm a mathematician by education and I started my data science career at the consulting branch of one of the big international accounting companies. What we pitched and sold to our customers was our expertise in the world of predictive models and data mining techniques. But to be honest, what we mostly needed to get the job done was the skill to obtain, clean and integrate the right data in the right way. Unfortunately, I had never been prepared for all the intricacies of data wrangling and had to learn a lot of things the hard way. There are hundreds of things that can go wrong between requesting data and plugging it into a model. Some problems will completely stop your progress until you can solve them. Some will lead to obviously incorrect results and send you searching for a needle in a haystack. And some will simply remain undiscovered and slowly degrade the performance of your model.

In this book, I want to share my experiences with you. If you're an aspiring data scientist like I was back in 2013, fresh out of university, you may not be aware of all the pitfalls waiting for you. The main difference between a beginner data scientist and an experienced one is that the beginner is unaware of the pitfalls and will run into every single one of them - like I did - while the experienced one will know what to look out for and quickly find their way around them. As mentioned before, most university courses will not prepare you for this. But given that you will spend 80% of your time on data preparation and that it is a key factor for a successful data project, I would say that it is worth investing some time in learning what to look out for.

If you have been working in data science for a while, you have probably had your own painful experiences. You will already have encountered many of the problems that I will describe in this book. Please use this book as an opportunity to sharpen your awareness and improve your ability to anticipate and avoid these pitfalls.

Since most of the pitfalls are easily dealt with, code examples are rarely helpful. That being the case, I have tried to include as little code as possible to make this book easier to read.

This book starts with a few thoughts on the nature of data, followed by the more technical problems like missing values or encoding problems that are typically easy to spot and solve. Towards the end, I will focus on more complex problems, like how to dip your toes into a huge new enterprise database when you start a new job or how to avoid large monolithic SQL scripts. The book is not necessarily written to be read in a straight line. Feel free to check out the table of contents and jump straight to any chapter that sounds interesting to you.

I hope that you have a good time reading this book. Feel free to contact me via email or social media.
