Non-matching aggregated data

Data aggregation can be a big problem in terms of consistency. It's really frustrating when you have a table with aggregated data that contradicts the figures you get when you do the aggregation manually, and figuring out what led to the difference can be a very time-consuming process. From a certain point of view, aggregation is just another form of duplicating data, which is problematic in various ways as discussed in the previous chapter.

Your best protection against this type of inconsistency is to avoid aggregations and work with the raw data whenever possible. Instead of doing the aggregation once and taking the risk of inconsistencies later on, create a view on the raw data that does the aggregation.
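As a minimal sketch of this idea, here is how such a view could look in SQLite; the transactions table and its customer_id, amount and ts columns are made-up examples, not something prescribed by any particular system:

```python
import sqlite3

conn = sqlite3.connect("shop.db")

# Hypothetical raw data: one row per individual transaction.
conn.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        customer_id INTEGER,
        amount      REAL,
        ts          TEXT
    )
""")

# Instead of persisting an aggregated table, define a view that
# aggregates on read. Whoever queries it always sees figures derived
# from the current raw data, so they cannot drift out of sync.
conn.execute("""
    CREATE VIEW IF NOT EXISTS daily_revenue AS
    SELECT customer_id,
           date(ts)    AS day,
           SUM(amount) AS revenue
    FROM transactions
    GROUP BY customer_id, date(ts)
""")

rows = conn.execute("SELECT * FROM daily_revenue").fetchall()
```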

You can also write the pipeline, function, module or workflow that produces the aggregated data and save that instead of the aggregated data itself. Anybody who needs the aggregated data simply runs this pipeline to obtain it, a bit like just-in-time production[1]. But keep in mind that this code might be unavailable to someone who only looks at the tables in the database at some later point. It is also subject to change, which can make it difficult to tell which version of the code was used to generate a particular dataset.
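A minimal sketch of such an on-demand aggregation, assuming the raw purchases sit in a pandas DataFrame with hypothetical customer_id, amount and ts columns; carrying an explicit version string is one way to mitigate the "which code produced this dataset?" problem:

```python
import pandas as pd

AGGREGATION_VERSION = "1.0.0"  # bump whenever the aggregation logic changes

def weekly_revenue(raw: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw purchases to one row per customer and calendar week."""
    out = (
        raw.assign(week=raw["ts"].dt.to_period("W"))
           .groupby(["customer_id", "week"], as_index=False)["amount"]
           .sum()
           .rename(columns={"amount": "revenue"})
    )
    # Record which version of the code produced this result.
    out.attrs["aggregation_version"] = AGGREGATION_VERSION
    return out
```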

Naturally, starting from the raw data every time becomes a problem once the data is too large to handle efficiently. At some point the processing will take too long and you'll need to pre-aggregate the data. When you do, keep the aggregation as granular as possible and avoid applying filters. Aggregation always means a loss of information. For example, aggregating all purchases of a customer on a weekly basis will make it impossible to analyze on which weekdays the transactions occurred. The more you aggregate your data, the more information you lose and the fewer use-cases your data is good for.
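The weekday example can be made concrete with a few lines of pandas; the column names and values are again hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "amount": [10.0, 25.0, 5.0],
    # Monday, Wednesday and Saturday of the same calendar week
    "ts": pd.to_datetime(["2024-03-04", "2024-03-06", "2024-03-09"]),
})

# Daily totals can still be rolled up to weeks later on ...
daily = raw.groupby([raw["customer_id"], raw["ts"].dt.date])["amount"].sum()

# ... but the weekly total cannot be broken back down into weekdays.
weekly = raw.groupby([raw["customer_id"], raw["ts"].dt.to_period("W")])["amount"].sum()

print(daily)
print(weekly)
```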

Of course, the possible levels of aggregation depend strongly on your intended use-case. If you are in retail analyzing baskets, you don't want to aggregate the customers' transactions to a weekly level because you would lose information about the individual baskets.

Aggregated data often ages poorly. The underlying raw data can change occasionally due to errors or delays in upstream ETL jobs. Don't expect to build a data aggregation ETL once and then never look at it again. Every data pipeline needs a certain level of maintenance and monitoring, so try to automate some plausibility checks and run occasional cross-checks against the raw data. Transactional data, for example, is hard to inspect directly: you can only ever look at tiny excerpts of it, which makes it hard to tell when data is missing. I've worked with datasets that I assumed were complete, only to discover that entire weeks of data were missing.
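A sketch of what such automated checks could look like, reusing the hypothetical weekly_revenue() output from above: one cross-check of totals against the raw data and one completeness check that catches missing weeks:

```python
import pandas as pd

def check_totals(raw: pd.DataFrame, aggregated: pd.DataFrame) -> None:
    """The grand total must survive the aggregation unchanged."""
    raw_total = raw["amount"].sum()
    agg_total = aggregated["revenue"].sum()
    assert abs(raw_total - agg_total) < 1e-6, (
        f"aggregated total {agg_total} does not match raw total {raw_total}"
    )

def check_no_missing_weeks(aggregated: pd.DataFrame) -> None:
    """Fail when whole weeks are absent between the first and last week."""
    weeks = aggregated["week"].sort_values().unique()
    expected = pd.period_range(weeks[0], weeks[-1], freq="W")
    missing = set(expected) - set(weeks)
    if missing:
        raise ValueError(f"missing weeks in aggregated data: {sorted(missing)}")
```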

Takeaways:

  • Aggregated data ages poorly.

  • Provide the function that calculates the aggregated data rather than the data itself.

[1] https://en.wikipedia.org/wiki/Just-in-time_manufacturing
