Handling missing values
Missing values are often just a symptom of an error somewhere in your data pipeline. But they are also a problem in and of themselves because most predictive models cannot handle them. For you as a data scientist, there are two ways to resolve this: discarding all observations with missing values or filling missing values with the most likely value, which is called imputation.
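The two strategies can be sketched in a few lines of plain Python. This is a toy illustration, not a recommendation of the median over other imputation methods; the column values are invented:

```python
from statistics import median

# Toy column with missing values (None marks a missing entry).
ages = [34, None, 41, 29, None, 52, 38]

# Strategy 1: discard observations with missing values.
complete = [x for x in ages if x is not None]

# Strategy 2: impute every missing value with the median of the observed values.
med = median(complete)
imputed = [x if x is not None else med for x in ages]

print(complete)  # [34, 41, 29, 52, 38]
print(imputed)   # [34, 38, 41, 29, 38, 52, 38]
```

Note that after strategy 2 the imputed column contains no trace of which entries were originally missing, which is exactly the loss of awareness discussed below.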
Imputation of missing values is a regular step in data pre-processing, before plugging the data into model training or model serving. Getting pre-processed data is very convenient, but it can lead to problems when you don't know exactly what the other person or the machine did. For example, the person who prepared the data might have done a simple imputation of a numeric column, replacing every missing value with the mean or the median. This means that you lose awareness of the number of missing values in your dataset and assume that the data quality is a lot higher than it actually is.
You can discover this problem by looking at sufficiently granular histograms or at the frequencies of the top 10 most frequent values. If a large number of missing values have been replaced, you will see an extreme frequency peak at a single value.
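A simple frequency check along these lines, using an invented column where an earlier step is assumed to have silently replaced missing values with 0:

```python
from collections import Counter

# Simulated column: an upstream step imputed missing values with 0.
values = [12.5, 13.1, 0, 11.8, 0, 0, 14.2, 0, 12.9, 0]

# Top 3 most frequent values -- an extreme peak is a red flag.
top = Counter(values).most_common(3)
print(top[0])  # (0, 5)

# The share of the single most frequent value is a cheap summary statistic.
most_common_value, count = top[0]
share = count / len(values)
print(most_common_value, share)  # 0 0.5
```

A continuous measurement where half of all entries are exactly 0 deserves a closer look, whether the cause is silent imputation or genuine zero inflation.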
This type of check is also helpful to identify a common problem with real data called zero inflation. It means that your data has a column where a large portion of the entries are zero. There's an entire branch of statistical theory dealing with this type of distribution, but the details on that are beyond the scope of this book.
At this point, I should also mention default values in databases. The SQL standard allows you to specify default values for a column in the CREATE TABLE statement. Using default values for columns that should not be NULL seems like a good idea at first. All subsequent ETL steps, analyses or data products further down the pipeline can rely on the data without having to worry about missing data. The problem is that you can end up shooting yourself in the foot because you lose awareness of missing values. When default values are automatically inserted, you will not notice problems earlier in the data preparation that may lead to an explosion in the number of missing values. Instead of breaking, your data product will continue to work, but with errors that you are not aware of. In many cases, this is a much more dangerous scenario. Errors in preceding ETL steps could go unnoticed for weeks, months or years!
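The failure mode is easy to reproduce. The sketch below uses SQLite via Python's `sqlite3` module as a stand-in for any SQL database, with an invented `orders` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL DEFAULT 0)")

# An upstream bug stops delivering the amount -- the default silently steps in.
conn.execute("INSERT INTO orders (id) VALUES (1)")
conn.execute("INSERT INTO orders (id) VALUES (2)")
conn.execute("INSERT INTO orders (id, amount) VALUES (3, 19.99)")

rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 0.0), (2, 0.0), (3, 19.99)] -- no NULLs, no error, no warning
```

Nothing breaks, no NULL ever appears, and any downstream aggregate over `amount` is now quietly wrong.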
Instead of default values, use the NOT NULL constraint, which ensures that you don't insert NULL values into your table. Of course, this means that you must do the imputation of potential missing values yourself. If you simply use an UPDATE statement to replace all missing values with a default value, you're in the same situation as before, unaware of any problems. A good idea would be to build checks into your pipeline that send an email when the number of missing values that had to be imputed exceeds a reasonable level. You can consider this another instance of a 'smoke test'. Exceeding a threshold of missing values may not be a complete failure on the part of the data pipeline, but it's like there's smoke coming out of it: Something seems to be broken and requires fixing.
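Such a smoke test can be as simple as counting NULLs before imputing and comparing against a threshold. The threshold, table, and data below are all invented for illustration; SQLite again stands in for the real database:

```python
import sqlite3

MAX_MISSING_SHARE = 0.1  # hypothetical threshold; tune it to your pipeline

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor_id INTEGER, reading REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 20.5), (2, None), (3, 21.1), (4, None), (5, None)],
)

# COUNT(reading) counts only non-NULL entries, so the difference is the
# number of missing values.
total, missing = conn.execute(
    "SELECT COUNT(*), COUNT(*) - COUNT(reading) FROM measurements"
).fetchone()

share = missing / total
if share > MAX_MISSING_SHARE:
    # In a real pipeline, this branch would send the alert email.
    print(f"smoke test failed: {missing}/{total} readings missing")
```

Only after this check passes (or the alert has fired) would the actual imputation step run.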
Another idea would be to create a report that keeps track of the number of imputed missing values, so you can look at a line or bar chart showing the number of missing values per day for every critical point in the ETL. You can extract this data directly in your ETL or separately using database metadata tables. In Teradata, for example, you can get the number of rows that a query has updated and/or inserted from the ROWCOUNT and ROWCOUNT2 columns of the table DBC.QRYLOGSTEPSV.
I strongly recommend using NOT NULL constraints every time you are sure that a certain column cannot be NULL under normal circumstances. This is risky, because it means that your data pipeline will stop working, even if just one out of a billion values violates the NOT NULL constraint. You may think that this sounds like a recipe for a brittle pipeline, and you would be right. However, it forces you to deal with the problem at its very root. Every time your pipeline breaks, you will implement a solution to the particular problem, which, step by step, improves the robustness of your process.
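The "break early" behaviour looks like this in practice. The sketch uses SQLite via `sqlite3` with an invented `customers` table; the same constraint works in any SQL database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER NOT NULL, email TEXT NOT NULL)")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")

# A single NULL violates the constraint, and the load fails immediately,
# right at the step that produced the bad row.
try:
    conn.execute("INSERT INTO customers VALUES (2, NULL)")
except sqlite3.IntegrityError as e:
    print("pipeline stopped:", e)
```

The failed insert never reaches the table, so the error surfaces here rather than three ETL steps later.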
Not using a constraint would mean ignoring the problem and passing it on. Passing on erroneous data will eventually break the pipeline somewhere further down the line, making it a lot harder to spot the root cause of the problem.
Takeaways:
  • Don't forward erroneous data with missing values. Let the pipeline break as early as possible. This will make it much easier to locate potential problems
  • On the other hand, don't simply impute NULL values blindly
  • Make sure that your workflow throws an alarm if too many missing values occur