Missing value representations
Identifying missing values is not always as easy as finding all NULL values in a database or all NA values in a data frame in R. Sometimes, empty fields are encoded by specific values that represent the missing data point. For example, an empty string, i.e. a string of length zero, can represent a missing string value, or it can simply be an empty string with no characters. To complicate things further, some databases automatically convert zero-length strings to NULL, while others treat them as distinct objects.
I have also encountered data where missing values in text columns were encoded with 'empty', among others. It's always a good idea to calculate counts for all values of a text column, look at the most frequent values, and check for patterns like the ones mentioned above. I also recommend checking how your database treats zero-length strings and doing a quick check on how often they occur in your dataset.
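Such a frequency check is quick to sketch. Here is a minimal example using Python's built-in `Counter`; the column values are hypothetical, but the pattern of listing the most frequent values and eyeballing them for suspicious entries is exactly the check described above:

```python
from collections import Counter

# Hypothetical sample of a text column with suspicious values mixed in
city = ["Berlin", "", "Paris", "empty", "", "Berlin", "N/A", "Paris", ""]

# Count every distinct value and inspect the most frequent ones
counts = Counter(city)
for value, n in counts.most_common():
    print(repr(value), n)

# Empty strings often stand out immediately in such a listing
print(counts[""])  # -> 3
```

Printing values with `repr` (instead of plain `print`) makes zero-length and whitespace-only strings visible in the output.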
Don't forget to run another check for strings that consist only of whitespace characters, like ' '. They are not of length zero, but they are effectively empty and could also be considered missing values.
I recommend running checks for any of the above missing value representations. If you're building a data product and you need clean data further down the line, replace them with actual NULL values to avoid confusion.
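The replacement step can be done in one pass. The following sketch maps whitespace-only strings and a set of assumed missing-value tokens (the tokens are illustrative, not from the source data) to Python's `None`, which plays the role of NULL:

```python
# Assumed set of missing-value representations seen in this column
MISSING_TOKENS = {"", "empty", "N/A", "-1"}

def normalize(value):
    """Map missing-value representations (including whitespace-only
    strings) to None; keep everything else unchanged."""
    if value is None:
        return None
    if value.strip() == "" or value in MISSING_TOKENS:
        return None
    return value

raw = ["Berlin", "   ", "empty", "Paris", None, "N/A"]
clean = [normalize(v) for v in raw]
print(clean)  # -> ['Berlin', None, None, 'Paris', None, None]
```

In a real pipeline you would apply the same mapping in SQL (e.g. with NULLIF/CASE) or in your data frame library, but the logic is the same: one explicit list of representations, one normalization step.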
A similar problem can occur in numeric columns. During data collection or data integration, missing values could have been replaced with fallback values. Popular fallback values for missing numeric values include 0, -1, 99, 999, -99, -999 and so on.
Sometimes, people become creative and replace missing numerical values with the mean of the column. This can be tricky because it's not obvious that data was missing. When this is done to a continuous variable, drawing a kernel density estimate or calculating counts for each value in the column can help you spot the unusual spike in frequency at that particular value.
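The spike is easy to detect programmatically. This sketch uses fabricated measurements in which missing values were (hypothetically) imputed with 50.0; an exact value repeating many times in a continuous variable, especially one close to the column mean, is the telltale sign:

```python
from collections import Counter
from statistics import mean

# Hypothetical continuous measurements where missing values were
# imputed with the column mean (here: exactly 50.0)
values = [12.3, 50.0, 47.1, 50.0, 50.0, 88.9, 50.0, 23.4, 50.0, 61.2]

counts = Counter(values)
most_common_value, frequency = counts.most_common(1)[0]

# A continuous variable should rarely repeat an exact value this often;
# a spike at (roughly) the column mean is a strong hint of imputation
print(most_common_value, frequency)  # -> 50.0 5
print(abs(most_common_value - mean(values)) < 5.0)
```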
In columns with dates, look out for dates that are too far in the past to make sense in your data. For example, since Unix time counts the seconds since January 1st, 1970, the date 1970-01-01 is a commonly used fallback value when a date field is empty. Another value that should catch your attention is January 1st, 1900 (1900-01-01), which is (or used to be) the smallest possible date in Microsoft Excel. When you encounter this value, you can almost always be sure that it is not the actual value of the field, but some fallback due to missing data.
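A simple scan for these two fallback dates can be sketched as follows; the birthday column is fabricated for illustration:

```python
from datetime import date

# Common fallback dates that usually signal a missing value
FALLBACK_DATES = {date(1970, 1, 1), date(1900, 1, 1)}

birthdays = [date(1985, 6, 12), date(1970, 1, 1), date(1992, 3, 3),
             date(1900, 1, 1), date(1970, 1, 1)]

# Count how often the suspicious fallback values occur
suspicious = [d for d in birthdays if d in FALLBACK_DATES]
print(len(suspicious))  # -> 3
```

If such dates occur far more often than plausible for your domain, treat them as missing values rather than real data.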
Sometimes, data records are only valid for a certain time period. For example, a product may have been listed on your web shop from START_DATE = 2016-03-01 to END_DATE = 2018-08-31. Another product may have a START_DATE = 2017-05-01 and still be sold in your shop, so END_DATE should be empty, i.e. NULL. The problem with this is that you may want to select all currently valid products using the condition WHERE CURRENT_DATE BETWEEN PRODUCT.START_DATE AND PRODUCT.END_DATE. This obviously doesn't work when END_DATE is NULL. Sometimes, people solve this problem by setting END_DATE to some unrealistically high value like 9999-12-31. Even though you may often see this in practice, it's not best practice. After all, this is how the millennium bug, also known as the Y2K bug, came to be.
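A NULL-aware condition avoids the magic END_DATE entirely. Here is a minimal, self-contained sketch using Python's built-in sqlite3 module and a hypothetical product table; the key is the extra `end_date IS NULL` branch in the WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product (
    name       TEXT,
    start_date TEXT,
    end_date   TEXT   -- NULL means "still valid today"
)""")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [("old product", "2016-03-01", "2018-08-31"),
     ("current product", "2017-05-01", None)],
)

# NULL-aware validity check: no magic 9999-12-31 needed.
# ISO-8601 date strings compare correctly as text in SQLite.
rows = conn.execute("""
    SELECT name FROM product
    WHERE DATE('now') >= start_date
      AND (end_date IS NULL OR DATE('now') <= end_date)
""").fetchall()
print(rows)  # -> [('current product',)]
```

The same pattern works in any SQL dialect; alternatively, `COALESCE(end_date, CURRENT_DATE)` keeps the BETWEEN syntax while still storing an honest NULL.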
In general, I strongly discourage you from ever using missing value representations other than NULL (or NA in the case of R). Using stand-in values for missing data introduces a level of implicit knowledge required to work with the data. While it may be obvious to you that CUSTOMER_AGE = -1 represents a missing value, it will be very confusing to someone working with that data for the first time.
On the other hand, when you work with data that you are not yet familiar with, always expect and check for potential missing value representations.
- Check not only for missing values, but also for typical missing value representations
- Don't mix NULL values with other missing value representations