Checking a dataset for internal consistency is one of the best ways to check for potential data quality problems. The basic idea is to look for contradictions in the variables by setting them in relation to one another. For example, if you have the number of times a customer has placed an order with you, N_ORDERS
, the total amount of revenue generated by this customer, SUM_REVENUE
, and the total number of packages shipped to the customer, N_PACKAGES_SHIPPED
, there are certain underlying constraints that you can check for. In this case, if any of these three numbers is greater than zero, then the same must be true for the other two. On the other hand, if any of these numbers is equal to zero, then so are the other two.