Automated testing
Poor data quality is one of the major risk factors in a data-driven project. This very book is all about how to identify and avoid potential data quality problems. Unfortunately, many of my suggestions require you to actively do something like writing a SQL query to check for cases where the data violates your assumptions. This can be very frustrating as you will repeatedly write similar types of queries for different tables. For example, I can't even count all the times I have written a simple query to count the number of NULL values in a specific column.
Normal people consider repeating tasks part of their job and accept them. Programmers typically don't, as they are used to automate these tasks. When you know how to code, you can stop doing the same thing repeatedly and write a piece of software that does it for you. If you want to take it even further, you can write a framework that allows you to apply the algorithm to other similar problems.
How would you go about writing a framework that does the data quality evaluation in your place? It's helpful to start with the simple tasks as a minimum viable product and expand it to the more complex tasks. You will find that by automating only a few of the most common types of data quality checks, you will be able to detect most cases of bad quality data.
You can think of your data quality framework as a set of tests. The most common type of test comes in the form of 'For column COLUMNNAME in table TABLENAME, does condition X apply?' Below, you can find a list of possible conditions to check for. In all cases, the parameters TABLENAME and COLUMNNAME will be required, so I have only highlighted the additional parameters needed for the test:
  • TEST_IS_NOT_NULL: Is the number of NULL values greater than zero? No additional parameters.
  • TEST_REASONABLE_NULL_FREQUENCY: Is the percentage of NULL values below UPPER_LIMIT_NULL_FREQUENCY, e.g. 5%?
For numeric columns, you can automatically test ranges of values and statistics:
  • TEST_IS _POSITIVE: Are the values always? No additional parameters.
  • TEST_IS_ABOVE_LIMIT: Are the values always greater than LOWER_LIMIT?
  • TEST_IS_BELOW_LIMIT: Are the values always smaller than UPPER_LIMIT?
  • TEST_IS_WITHIN_RANGE: Are the values always between LOWER_LIMIT and UPPER_LIMIT?
  • TEST_IS_AVG_WITHIN_RANGE: Is the average of the column between LOWER_LIMIT and UPPER_LIMIT?
For categorical columns, checking value ranges looks a bit different:
  • TEST_ALLOWED_VALUES: Are all the values in the set of ALLOWED_VALUES?
  • TEST_FORBIDDEN_VALUES: The column does not contain any values from the set of FORBIDDEN_VALUES?
  • TEST_MANDATORY_VALUES: Does the column cover all the mandatory values in MANDATORY_VALUES?
Some properties of the distribution can be checked for both numeric and categorical columns:
  • TEST_IS_NOT_CONSTANT: Is the column constant, i.e. does it only have a single distinct value? No additional parameters.
  • TEST_REASONABLE_MODE_FREQUENCY: Does the frequency of the most frequent value (mode) exceed UPPER_LIMIT_MODE_FREQUENCY, e.g. 95%?
In the case of columns containing text, you may want to check for any of the following conditions:
  • TEST_LENGTH_BELOW_LIMIT: Does the number of characters exceed UPPER_LIMIT_LENGTH?
  • TEST_LENGTH_OVER_LIMIT: Is the number of characters always above LOWER_LIMIT_LENGTH?
  • TEST_ALLOWED_CHARACTERS: Does any value contain any character not contained in the set of ALLOWED_CHARACTERS?
You may have noticed that some tests are just special cases of other tests. While testing if a column contains only positive values is the same as testing if all the values are greater than or equal to LOWER_LIMIT=0, it's often clearer and more concise to use TEST_IS_POSITIVE and avoid passing an additional parameter to TEST_IS_ABOVE_LIMIT.
In order to use a framework like this, you should only need to specify the name of the test and the required parameters. The framework should run the tests, summarize the results and notify you in the case that any of the tests failed. When you are working on a SQL database, you can write the tests so that they take the input parameters, compose the SQL to perform the test, run it on the database, format the results, judge if the test passed and notify you on some channel, e.g. email or persistent chat, when a test failed. There is still some work involved as you need to choose which tests you want to apply to which columns. You also need to specify things like upper limits, allowed values etc., which can require some preliminary analysis. But once you have done that, you can schedule the execution of the tests to run daily or weekly and lean back.
Copy link