Missing value patterns

Knowing the number of missing values per column will give you a feeling for the completeness of the data and will help you to quickly uncover potential problems. In order to get a better understanding of what exactly is missing, it also helps to look for missing value patterns, that is, groups of variables where, if one of the variables is missing, the others are usually missing as well.

In the below example, the field DEPARTMENT_NAME is always empty when EMPLOYEE_ID_OF_SUPERVISOR is empty and vice versa. This indicates that there might be a common reason.

Missing value patterns are very common when the data has been generated by joining multiple tables together. For example, if you join a table of employees with a table of departments, it's possible that one DEPARTMENT_ID is missing from the departments table. Thus, all the fields that you select from the departments table, i.e. DEPARTMENT_NAME or EMPLOYEE_ID_OF_SUPERVISOR will be empty as well.

A good way to identify missing value patterns is by replacing each column with an indicator column that is equal to 1 if the value is missing and 0 if the value is not missing. Each observation basically becomes a vector of ones and zeroes and if two of these vectors are identical, it means that the corresponding observations exhibit the same missing value pattern. If you then do a GROUP BY using all indicator columns, you will get a distinct list of all missing value patterns along with the number of observations in each pattern. You can sort them by the number of observations in descending order to get an idea of the most frequent missing value patterns.

SELECT
  CASE WHEN EMPLOYEE_ID IS NULL THEN 1 ELSE 0 END AS EMPLOYEE_ID
, CASE WHEN DEPARTMENT_ID IS NULL THEN 1 ELSE 0 END AS DEPARTMENT_ID
, CASE WHEN DEPARTMENT_NAME IS NULL THEN 1 ELSE 0 END AS DEPARTMENT_ID
, CASE WHEN EMPLOYEE_ID_OF_SUPERVISOR IS NULL THEN 1 ELSE 0 END AS EMPLOYEE_ID_OF_SUPERVISOR
, COUNT(*) AS N
FROM MYDATABASE.MYTABLE
GROUP BY 1, 2, 3, 4
;

In the above example, this doesn't make much of a difference because of how small the table is. You can simply look at it and see the missing value patterns with your own eyes. But when the table has a few million rows and/or dozens of columns, it will not be this obvious.

When K is the number of columns in the table, you will end up with K indicator columns and one count column. Writing the SQL that calculates the missing value patterns will become tedious if the number of columns K is large. Of course, you can use SQL to generate the SQL from the table metadata, using for example the table DBC.COLUMNSV in Teradata that contains a list of columns of each table in the database.

As for the runtime, you can make the query more efficient by concatenating the indicator columns to a string with K characters. That way, the intermediate result is typically smaller and you only need to group by a single column.

Takeaways:

  • When you have a lot of missing data, checking for missing value patterns can reveal more insights on the process that leads to the missing data

Last updated