📈
beginners-guide-to-clean-data
  • A Beginner's Guide to Clean Data
  • Introduction
    • Foreword
    • The value of data
    • The intangible nature of data
  • Missing data
    • Missing values
    • Missing value patterns
    • Missing value representations
    • Missing observations
    • Truncated exports
    • Handling missing values
  • Data range problems
    • Unexpected values
    • Outliers
    • Freak cases
  • Common CSV problems
    • CSV basics
    • Quotation characters
    • Line breaks in text fields
    • Missing or insufficient headers
    • Trailing line breaks
    • Data export and import
    • Column type violations
    • Guidelines for working with CSV files
  • Text mining problems
    • Text mining basics
    • Encoding in your data and IDE
    • Special characters
    • Character entities
    • Lookalike characters
    • Dummy words
  • Type- and format-related problems
    • Inconsistent timestamp formats
    • Whitespace-padded strings
    • Binary data
    • Semi-structured log files
    • Proprietary data formats
    • Spreadsheets
  • Database-related problems
    • Numeric overflow
    • Duplicate rows
    • Table joins
    • Huge enterprise databases
    • Case sensitivity
    • Separating DDL and DML statements
    • Database performance considerations
    • Naming tables and columns
    • Poorly written SQL
    • Large monolithic SQL scripts
    • SQL orchestration
  • Data inconsistency
    • No single point of truth
    • Non-matching aggregated data
    • Internal inconsistency
  • Data modeling
    • Business concepts
    • Handling complexity
    • Interfaces
    • Generalized data models
    • Reproducibility
    • Feature stores and feature engines
    • Thinking pragmatic
  • Monitoring and testing
    • Automated testing
    • Measuring database load
  • Bonus content
    • Checklist for new data
Powered by GitBook
On this page

Was this helpful?

  1. Data inconsistency

Internal inconsistency

Checking a dataset for internal consistency is one of the best ways to check for potential data quality problems. The basic idea is to look for contradictions in the variables by setting them in relation to one another. For example, if you have the number of times a customer has placed an order with you, N_ORDERS, the total amount of revenue generated by this customer, SUM_REVENUE, and the total number of packages shipped to the customer, N_PACKAGES_SHIPPED, there are certain underlying constraints that you can check for. In this case, if any of these three numbers is greater than zero, then the same must be true for the other two. On the other hand, if any of these numbers is equal to zero, then so are the other two.

A simple way to check for records that violate this constraint would look like this:

SELECT *
FROM MYDATABASE.MYTABLE
WHERE
    SIGN(ZEROIFNULL(N_ORDERS)) +
    SIGN(ZEROIFNULL(SUM_REVENUE)) +
    SIGN(ZEROIFNULL(N_PACKAGES_SHIPPED)) NOT IN (0, 3)

In software engineering, programmers use a method they call 'smoke testing' to check for indicators of errors. If there is smoke (inconsistency), there is fire (problems with the data or the data pipeline). I like the visual of this idea, because it's exactly what we are doing here. We don't see any fire, but if we spot smoke, it's safe to assume that something in our machine is on fire and needs to be fixed. If we ignore the smoke, there's no point in keeping the machine running, because it will either break down or produce broken output.

Let's go back to the above example for a moment. If there are any records that satisfy the above condition, they need to be inspected more closely. Doing this often helps to further the understanding of the business processes behind the data. There might be customers where the constraint is violated for completely valid reasons. In the above example, a customer may have placed an order, but canceled it before the order was processed and sent to him. So N_ORDERS = 1, but N_PACKAGES_SHIPPED = 0. Or maybe a cancellation is recorded as a separate order with negative revenue, so N_ORDERS = 2 but SUM_REVENUE = 0.

Your best course of action is to identify these unexpected records and show them to the person who knows the data the best. He or she will often recognize immediately where your assumptions were wrong and how to interpret these cases correctly. This - as most of the steps in analyzing data - is an iterative process. After accounting for cancellations, there may still be some records left that don't make any sense. Perhaps, the partner handling the shipping has changed in 2016, two years before you started working at the company. This would mean that all the shipments before 2016 are being recorded in a different table that you were not aware of until now.

There are a million things like this that can happen. Most of the time, it is impossible to anticipate all these potential problems in your first analysis. You are fighting against 'known unknowns' as well as 'unknown unknowns'. But if you follow the advice in this book, you should come to a point where you are aware that problems like this can occur and you know how to spot them.

Unfortunately, there is no recipe for how to check for internal consistency. While checking for NULL values is a purely technical data quality analysis, checking for internal consistency is highly non-technical. As seen above, it requires detailed knowledge on the business-side of the data, specifically how the data is being measured, what every variable means and how the different variables should relate to each other. You need to ask yourself the question 'Are there any implications between two or more variables?' or 'Is there a way in which two or more variables can contradict each other (based on my current understanding of the data)?'

Takeaways:

  • Use internal consistency checks to spot data quality problems.

  • Often, internal inconsistencies are not real inconsistencies. They are just caused by your incomplete understanding of the way the data is being generated. Ask people on the business side for help.

PreviousNon-matching aggregated dataNextBusiness concepts

Last updated 5 years ago

Was this helpful?