Internal inconsistency
Checking a dataset for internal consistency is one of the best ways to uncover potential data quality problems. The basic idea is to look for contradictions between variables by setting them in relation to one another. For example, suppose you have the number of times a customer has placed an order with you, N_ORDERS, the total revenue generated by this customer, SUM_REVENUE, and the total number of packages shipped to the customer, N_PACKAGES_SHIPPED. These variables are bound by an underlying constraint: if any of the three numbers is greater than zero, then the same must be true for the other two; conversely, if any of them is equal to zero, then so are the other two.
A simple way to check for records that violate this constraint would look like this:
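As a minimal sketch in Python (the field names follow the example above, and the in-memory records stand in for whatever table or query your data actually lives in):

```python
# Hypothetical customer records with the three aggregated fields from the
# text. In practice these would come from a database or a DataFrame, but
# the consistency logic is the same.
customers = [
    {"customer_id": 1, "N_ORDERS": 3, "SUM_REVENUE": 120.0, "N_PACKAGES_SHIPPED": 3},
    {"customer_id": 2, "N_ORDERS": 0, "SUM_REVENUE": 0.0, "N_PACKAGES_SHIPPED": 0},
    {"customer_id": 3, "N_ORDERS": 1, "SUM_REVENUE": 0.0, "N_PACKAGES_SHIPPED": 0},
]

def violates_constraint(record):
    """True if the three fields are neither all zero nor all positive."""
    values = (record["N_ORDERS"], record["SUM_REVENUE"], record["N_PACKAGES_SHIPPED"])
    return not (all(v > 0 for v in values) or all(v == 0 for v in values))

# Collect the records that contradict the constraint for closer inspection.
suspicious = [r for r in customers if violates_constraint(r)]
for r in suspicious:
    print(r["customer_id"])  # customer 3 mixes zero and positive values
```

The same condition translates directly into a WHERE clause if your data lives in a database.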
In software engineering, programmers use a method called 'smoke testing' to check for indicators of errors. If there is smoke (inconsistency), there is fire (problems with the data or the data pipeline). I like this image because it captures exactly what we are doing here: we don't see the fire itself, but if we spot smoke, it is safe to assume that something in our machine is burning and needs to be fixed. If we ignore the smoke, there is no point in keeping the machine running, because it will either break down or produce broken output.
Let's go back to the above example for a moment. If there are any records that match the above condition, they need to be inspected more closely. Doing this often deepens your understanding of the business processes behind the data. There may be customers for whom the constraint is violated for completely valid reasons. In the above example, a customer may have placed an order but canceled it before it was processed and shipped, so N_ORDERS = 1 but N_PACKAGES_SHIPPED = 0. Or maybe a cancellation is recorded as a separate order with negative revenue, so N_ORDERS = 2 but SUM_REVENUE = 0.
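The iterative refinement this describes can be sketched as a whitelist of patterns that turned out to be valid. The records and the two exception rules below are hypothetical, modeled on the cancellation scenarios just mentioned:

```python
# Hypothetical records that failed the consistency check.
suspicious = [
    {"customer_id": 3, "N_ORDERS": 1, "SUM_REVENUE": 40.0, "N_PACKAGES_SHIPPED": 0},
    {"customer_id": 4, "N_ORDERS": 2, "SUM_REVENUE": 0.0, "N_PACKAGES_SHIPPED": 1},
    {"customer_id": 5, "N_ORDERS": 0, "SUM_REVENUE": 99.0, "N_PACKAGES_SHIPPED": 1},
]

def is_known_exception(record):
    """Patterns the business side has confirmed as valid."""
    n = record["N_ORDERS"]
    revenue = record["SUM_REVENUE"]
    shipped = record["N_PACKAGES_SHIPPED"]
    # Order placed but canceled before anything was shipped.
    canceled_before_shipping = n > 0 and shipped == 0
    # Cancellation booked as a separate order with negative revenue,
    # so the orders count is positive while revenue nets out to zero.
    cancellation_offsets_revenue = n > 0 and revenue == 0
    return canceled_before_shipping or cancellation_offsets_revenue

# Only the records that no known business process explains remain.
unexplained = [r for r in suspicious if not is_known_exception(r)]
```

After each round of questions, the whitelist grows and the list of unexplained records shrinks; whatever remains is either a genuine data problem or another business process you have not heard of yet.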
Your best course of action is to identify these unexpected records and show them to the person who knows the data best. They will often recognize immediately where your assumptions were wrong and how to interpret these cases correctly. This, like most steps in analyzing data, is an iterative process. After accounting for cancellations, there may still be some records left that don't make any sense. Perhaps the shipping partner changed in 2016, two years before you started working at the company, which would mean that all shipments before 2016 are recorded in a different table you were not aware of until now.
There are a million things like this that can happen, and most of the time it is impossible to anticipate them all in your first analysis. You are fighting 'known unknowns' as well as 'unknown unknowns'. But if you follow the advice in this book, you will reach a point where you are aware that problems like this can occur and you know how to spot them.
Unfortunately, there is no recipe for checking internal consistency. While checking for NULL values is a purely technical data quality analysis, checking for internal consistency is highly non-technical. As seen above, it requires detailed knowledge of the business side of the data: how the data is measured, what every variable means, and how the different variables should relate to each other. You need to ask yourself: 'Are there any implications between two or more variables?' or 'Is there a way in which two or more variables can contradict each other (based on my current understanding of the data)?'
  • Use internal consistency checks to spot data quality problems.
  • Often, internal inconsistencies are not real inconsistencies. They are just caused by your incomplete understanding of the way the data is being generated. Ask people on the business side for help.