Missing value patterns
Last updated
Last updated
Knowing the number of missing values per column will give you a feeling for the completeness of the data and will help you to quickly uncover potential problems. In order to get a better understanding of what exactly is missing, it also helps to look for missing value patterns, that is, groups of variables where, if one of the variables is missing, the others are usually missing as well.
In the below example, the field DEPARTMENT_NAME
is always empty when EMPLOYEE_ID_OF_SUPERVISOR
is empty and vice versa. This indicates that there might be a common reason.
Missing value patterns are very common when the data has been generated by joining multiple tables together. For example, if you join a table of employees with a table of departments, it's possible that one DEPARTMENT_ID
is missing from the departments table. Thus, all the fields that you select from the departments table, i.e. DEPARTMENT_NAME
or EMPLOYEE_ID_OF_SUPERVISOR
will be empty as well.
A good way to identify missing value patterns is by replacing each column with an indicator column that is equal to 1 if the value is missing and 0 if the value is not missing. Each observation basically becomes a vector of ones and zeroes and if two of these vectors are identical, it means that the corresponding observations exhibit the same missing value pattern. If you then do a GROUP BY
using all indicator columns, you will get a distinct list of all missing value patterns along with the number of observations in each pattern. You can sort them by the number of observations in descending order to get an idea of the most frequent missing value patterns.
In the above example, this doesn't make much of a difference because of how small the table is. You can simply look at it and see the missing value patterns with your own eyes. But when the table has a few million rows and/or dozens of columns, it will not be this obvious.
When K
is the number of columns in the table, you will end up with K
indicator columns and one count column. Writing the SQL that calculates the missing value patterns will become tedious if the number of columns K
is large. Of course, you can use SQL to generate the SQL from the table metadata, using for example the table DBC.COLUMNSV
in Teradata that contains a list of columns of each table in the database.
As for the runtime, you can make the query more efficient by concatenating the indicator columns to a string with K
characters. That way, the intermediate result is typically smaller and you only need to group by a single column.
Takeaways:
When you have a lot of missing data, checking for missing value patterns can reveal more insights on the process that leads to the missing data
EMPLOYEE_ID
DEPARTMENT_ID
DEPARTMENT_NAME
EMPLOYEE_ID_OF_SUPERVISOR
101
1
Finance
153
101
1
Finance
153
201
2
Accounting
290
501
5
NULL
NULL
502
5
NULL
NULL
EMPLOYEE_ID
DEPARTMENT_ID
DEPARTMENT_NAME
EMPLOYEE_ID_OF_SUPERVISOR
N
0
0
0
0
3
0
0
1
1
2