Missing values
Missing data is the most common data quality problem that you will encounter. When there's a problem at any point in a data pipeline, it will often result in some form of missing data.
This makes checking for missing data the smoke test of data engineering. As the famous idiom goes: "Where there is smoke, there is fire." When you see smoke rising from a machine, you can be reasonably sure that something inside is broken. You don't know which part of the machine broke, but you know that some form of repair will be necessary.
Similarly, missing data typically tells you that something went wrong somewhere along your workflow or data pipeline. You need to spot it early, identify the root cause, and fix it; otherwise it will break even more parts further down the line.
Missing data will typically take one of the following forms:
Missing values: All the observations/records are available, but some of them are incomplete
Missing observations: All the available observations are complete, but some observations are missing entirely
While data can take many forms (tables, structured documents, text, binary files), it makes sense to start with by far the most common form: the data table. The rows of a data table represent observations, while the columns represent attributes of these observations. In this case, missing values are simply empty cells in this table:

    CUSTOMER_ID  FIRST_NAME  LAST_NAME
    1            Alice       Cooper
    2            NULL        Cumberbatch
    3            Charlie     Chaplin
Databases usually have a special value called NULL to represent empty fields. To check for NULL values, most databases use a special syntax. It can vary between different database vendors, but the ANSI standard that almost all database vendors adopt goes like this:
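For example, to find all rows with a missing first name - assuming the customer table above is stored as a table called CUSTOMER - you would write:

    SELECT *
    FROM CUSTOMER
    WHERE FIRST_NAME IS NULL;

On the table above, this returns the row for customer 2. The negated form, FIRST_NAME IS NOT NULL, works analogously.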
Please note that while we would usually write FIRST_NAME != 'Alice' or FIRST_NAME <> 'Alice' to confirm that the first name of the person we are looking for is not Alice, we DO NOT write FIRST_NAME != NULL or FIRST_NAME <> NULL. This is because NULL values have all kinds of funny behavior, especially when it comes to comparisons and joins, as we will explore later in this book.

They can also lead to problems when you aggregate the data. Calculating the sum over a column where half of the observations have empty fields will give you a result that is pretty much useless. Calculating the average can still work okay, but only if the observations that have non-missing values are a representative sample of all observations.
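To illustrate the comparison pitfall on the customer table above: any comparison with NULL evaluates to UNKNOWN for every row, so the first of the following queries returns nothing at all, not even customer 2.

    -- Returns no rows: FIRST_NAME <> NULL is never TRUE
    SELECT *
    FROM CUSTOMER
    WHERE FIRST_NAME <> NULL;

    -- Correct: returns customers 1 and 3
    SELECT *
    FROM CUSTOMER
    WHERE FIRST_NAME IS NOT NULL;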
When you look at an unknown dataset, one of your first actions should be to check for NULL values. You need to get a feeling for how complete the data is and which columns you need to be careful with due to missing data. For a single column, you can simply run the following query[1] to determine the number of missing values and compare it with the total number of values:
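A sketch of such a query, using the FIRST_NAME column of the CUSTOMER table above as a stand-in for whatever column you want to check; the NULL_RATIO column makes the comparison explicit (see footnote [1] for why the CAST is needed):

    SELECT
        SUM(CASE WHEN FIRST_NAME IS NULL THEN 1 ELSE 0 END) AS NULL_VALUES
        , COUNT(*) AS TOTAL_VALUES
        , CAST(SUM(CASE WHEN FIRST_NAME IS NULL THEN 1 ELSE 0 END) AS FLOAT)
            / COUNT(*) AS NULL_RATIO
    FROM CUSTOMER;

On the table above, this would report 1 missing value out of 3.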
Unfortunately, this is somewhat tedious when the table has a large number of columns. It would be more convenient to get the number of NULL values for all columns at once, without having to write and execute a separate query for each column. While you can easily do this in more advanced languages like R, it is somewhat difficult in SQL without diving into vendor-specific loop functions. However, with most databases, there is a way to use SQL to generate the SQL query that will get you the number of NULL values for each column, based on the database metadata. It goes like this:
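What follows is a sketch of such a generator query for Teradata, the system used throughout this example; MY_DB is a placeholder for your own database name (see footnote [2] for other systems):

    SELECT ', SUM(CASE WHEN ' || TRIM(COLUMNNAME)
        || ' IS NULL THEN 1 ELSE 0 END) AS NULL_VALUES_'
        || TRIM(COLUMNNAME) AS SELECT_STATEMENT
    FROM DBC.COLUMNSV
    WHERE DATABASENAME = 'MY_DB'
        AND TABLENAME = 'CUSTOMER';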
For the customer table above, the resulting rows look like this:

    SELECT_STATEMENT
    , SUM(CASE WHEN CUSTOMER_ID IS NULL THEN 1 ELSE 0 END) AS NULL_VALUES_CUSTOMER_ID
    , SUM(CASE WHEN FIRST_NAME IS NULL THEN 1 ELSE 0 END) AS NULL_VALUES_FIRST_NAME
    , SUM(CASE WHEN LAST_NAME IS NULL THEN 1 ELSE 0 END) AS NULL_VALUES_LAST_NAME

As you can see, this query uses the database's object catalog - specifically DBC.COLUMNSV[2], the table with all column names for all tables - to generate the SQL code that will count the number of missing values for each column in your table. More specifically, it gives you the SELECT part of your statement. Copy the resulting table into your SQL query tool, complete the statement, and execute it. Occasionally, when the SUM gets too big, you can get a numeric overflow. Just add a CAST(... AS BIGINT) around the SUM to avoid that problem.
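For example, the first generated line with the cast applied would look like this:

    , CAST(SUM(CASE WHEN CUSTOMER_ID IS NULL THEN 1 ELSE 0 END) AS BIGINT) AS NULL_VALUES_CUSTOMER_ID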
I always have a SQL cheat sheet with useful queries like this one, where I just replace the database name and the table name in the WHERE clause to apply it to any new table I might encounter.
Takeaways:
The occurrence of NULL values where there should be no NULL values is the ultimate indicator that there is some kind of problem somewhere in your data or data processing
Automate the checking of columns for NULL values. Don't write the same query for different columns over and over again
[1] When dividend and divisor are integers, it could be necessary to manually cast them to floating point numbers - e.g. via CAST(COUNT(*) AS FLOAT) - so that the result is also a floating point number with the exact quotient rather than the result of integer division.
[2] In this case, I am using Teradata-specific metadata tables. In a Teradata system, the table of all existing columns is called DBC.COLUMNSV. Usually, every database system will have such metadata tables. If you are working with another database system, just replace the metadata tables with the ones in your system.