Checklist for new data
This is a checklist of points you should consider checking when you look at a new dataset. In this case, we're looking only at data that has already been transformed into a table format so that every row is a record and every column is a variable or attribute of that record.
Get the number of records and compare it to your assumptions. Is it significantly more or less data than you expected?
Check the marginal distribution of the data over different dimensions. Example: If you have a date/time dimension, check the number of records for every day/week/month. Are there any gaps in the data?
For every column, look at the most frequent values and their frequencies. Are there any unexpected values? Are there any values that occur unexpectedly often? Are there any constant columns?
How frequent is the single most frequent value? If it occurs in more than 50% of the records, is that in line with your expectations?
Get the number of missing/NULL values in every column. If the absolute numbers are hard to interpret, get the percentage of missing/NULL values instead. Is there a high number of missing values? Are there missing values in columns where you wouldn't expect them? Example queries for the checks in this list are sketched below.
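For the first two checks, the record count and the marginal distribution over a date dimension, a couple of simple queries are usually enough. A minimal sketch in PostgreSQL-flavored SQL, assuming a hypothetical orders table with a timestamp column created_at:

```sql
-- Total number of records: compare this against your expectations.
SELECT COUNT(*) AS n_records
FROM orders;

-- Records per day: look for gaps and for unexpected spikes or drops.
SELECT DATE_TRUNC('day', created_at) AS day,
       COUNT(*)                      AS n_records
FROM orders
GROUP BY 1
ORDER BY 1;
```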
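The checks on frequent values and missing values can be expressed just as directly. Another sketch on the hypothetical orders table, this time for a text column status:

```sql
-- Most frequent values of a column and their share of all records.
-- A single value covering more than 50% of the records deserves a closer look.
SELECT status,
       COUNT(*) AS n_records,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_total
FROM orders
GROUP BY status
ORDER BY n_records DESC
LIMIT 20;

-- Number and percentage of NULL values in a column.
SELECT COUNT(*)                 AS n_records,
       COUNT(*) - COUNT(status) AS n_null,
       ROUND(100.0 * (COUNT(*) - COUNT(status)) / COUNT(*), 1) AS pct_null
FROM orders;
```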
Text columns:
Get the minimum and maximum character length of the column. Are there unexpectedly short/long strings?
Check for strings of length zero. Are there more than expected? Are these supposed to be NULL values? Do they have a special meaning?
Check for string-encoded NULL values. Are there columns with string values that say 'NULL', 'null', 'NA', '-', 'empty', 'missing', 'not available' etc.?
How many distinct values do you expect? If it's a small number, GROUP BY this column and look at the distinct values. Are there any unexpected values? Are there any duplicates with different spelling?
Check for trailing or leading whitespace. Are there records that are the same, except for some trailing or leading whitespace? Could this cause problems for some system or process consuming the data?
Check for encoding errors. Remove all 'expected' characters such as alphanumeric characters, punctuation and whitespace, then do a GROUP BY on what's left. Are there any leftover characters that point to an encoding problem?
Are there spelling errors that you need to account for? Example queries for these text checks are sketched below.
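Several of the text checks also boil down to short queries. A minimal sketch, assuming a hypothetical text column city in a customers table (again PostgreSQL-flavored SQL):

```sql
-- Shortest and longest strings: unexpectedly short or long values stand out here.
SELECT MIN(LENGTH(city)) AS min_len,
       MAX(LENGTH(city)) AS max_len
FROM customers;

-- Strings of length zero: are these meant to be NULL, or do they have a special meaning?
SELECT COUNT(*) AS n_empty
FROM customers
WHERE city = '';

-- String-encoded NULL values hiding in the data.
SELECT city, COUNT(*) AS n_records
FROM customers
WHERE LOWER(TRIM(city)) IN ('null', 'na', 'n/a', '-', 'empty', 'missing', 'not available')
GROUP BY city
ORDER BY n_records DESC;
```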
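The remaining text checks (distinct values, stray whitespace and unexpected characters) can be sketched the same way on the hypothetical customers.city column; REGEXP_REPLACE with the 'g' flag is PostgreSQL syntax:

```sql
-- Distinct values: if only a handful are expected, eyeball the full list
-- for unexpected entries and duplicates with different spelling.
SELECT city, COUNT(*) AS n_records
FROM customers
GROUP BY city
ORDER BY n_records DESC;

-- Values that carry leading or trailing whitespace and may collide
-- with an otherwise identical, trimmed value.
SELECT city, COUNT(*) AS n_records
FROM customers
WHERE city <> TRIM(city)
GROUP BY city;

-- Encoding check: strip all 'expected' characters and group what is left.
-- Anything that survives hints at an encoding problem or stray characters.
SELECT leftover, COUNT(*) AS n_records
FROM (
    SELECT REGEXP_REPLACE(city, '[a-zA-Z0-9[:punct:][:space:]]', '', 'g') AS leftover
    FROM customers
) AS t
WHERE leftover <> ''
GROUP BY leftover
ORDER BY n_records DESC;
```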
Numeric columns:
Get the minimum and maximum value of each column. Does the range violate your assumptions, e.g. non-negativity or upper bounds that shouldn't be exceeded? Is it plausible from a non-technical point of view?
Get an idea of the distribution of the column. Look at a histogram or a plot of the sorted values. Does the distribution make sense based on your understanding of the column? Are there any unexpected peaks?
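For the range check, the minimum and maximum are a one-line query. A sketch, assuming a hypothetical numeric column amount in the orders table:

```sql
-- Range check: does the column respect assumptions such as non-negativity
-- or an upper bound that should never be exceeded?
SELECT MIN(amount) AS min_amount,
       MAX(amount) AS max_amount
FROM orders;
```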
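When no plotting tool is at hand, a rough feel for the distribution can be had by bucketing the values, for example with PostgreSQL's WIDTH_BUCKET function. A sketch on the same hypothetical amount column; the bucket bounds of 0 and 1000 are placeholders you would adapt to your data:

```sql
-- Ten equal-width buckets between 0 and 1000; values outside the bounds
-- land in buckets 0 and 11. The counts per bucket act as a coarse histogram.
SELECT WIDTH_BUCKET(amount, 0, 1000, 10) AS bucket,
       COUNT(*)                          AS n_records
FROM orders
GROUP BY bucket
ORDER BY bucket;
```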
If you follow this checklist, answer all the questions and don't encounter anything out of the ordinary, you have ensured a basic level of data quality. If there were any errors during data measurement or collection, data transfer or data import, you should have spotted them by now.
Of course, there's always the possibility that there are other issues with the data. If you are weighing yourself, you know that there is a problem with the data when the scale shows 'error' or -10 kg. But if the value is simply off by a few kilograms due to bad measurement or a systematic error, this will not lead to any red flags. Fortunately, most data quality problems or errors in a data pipeline will cause symptoms like those mentioned above and will be easy to spot - if you know what you are looking for.
So, this is the end of the book. My main goal was to close a gap in the textbook literature and provide practical, hands-on advice on how to spot and avoid data quality problems. I agree with the common notion that data scientists spend 80% of their time on data preparation; it matches my own experience. Data preparation is the most important step when building a data product, and undiscovered data quality problems during this phase are the main reason for errors, failures and poor results. No model and no algorithm at the end of the data pipeline can make up for problems introduced by low-quality, erroneous data from earlier, botched processing steps.
If you follow the advice from this book, you will spot data quality problems early and before they get the chance to break your product. This will help you work more efficiently and avoid costly iteration cycles. I hope you found the book useful and that it helps you in your daily work. If you liked the book or have some feedback, I would be glad to hear from you. Feel free to contact me via email or social media.
Twitter: @benjamingreve
LinkedIn: www.linkedin.com/in/bgreve
Email: greve.professional@gmail.com
Have a great day.