Generalized data models

Large enterprise data warehouses can be quite intimidating at first. There is so much information from different systems, all of it linked in some way, representing actual business objects, but materialized into a huge interdependent data model - thousands of tables, linked by hundreds of different IDs and key columns.

Eventually, you will become familiar with the data and its intricacies. Here is an ID that is a fixed-length CHAR type even though it contains a number; there are tables that you need to join on four columns at once; and then there is that table where the customer ID isn't called CUSTOMER_ID but RK_ACC_NR_16, a name that nobody outside of your company will understand.

In a large organization, it may take a few months until you are familiar with all these unintuitive intricacies. It is a great feeling to finally develop an understanding of the data and all its weird quirks. You start to understand how these quirks came to be, and you even start to defend them against other people who encounter them for the first time. And this is where it's easy to fall into a trap!

When you are building a data product, you are usually looking at a small subset of the data, representing only some of the business objects. For example, for a recommendation engine, you'll need to consider the following entities: customers, products, offers - i.e. a set of discounted articles - and transactions that capture the customers' past purchasing behavior. What many data scientists do is simply take this data from the data warehouse, reusing all the data structures already present there. Unfortunately, this includes all the quirks and technical details of the data warehouse implementation. After all, you have invested so much time in understanding them. And of course, you don't want to move away from the standard table and column names used in the rest of the organization, right?

It is only natural to go this way. Unfortunately, this creates dependencies between the technical implementation of the data warehouse and that of the data product you are trying to build. If, at any point, a change is made to the tables in the data warehouse, it will interfere with the implementation of your data product. Changing the structure of a table or even simply changing a column name can break your entire product. People working on both sides need to be aware of the implementation of both the data warehouse and the data product. You may end up with a complicated mess that is hard to explain to any external person coming into the project.

As described above, clean interfaces can be used to reduce the number of dependencies between your data product and the source system of the data. However, if you reintroduce the old data structures with all their intricacies, you ruin your chance to reduce the complexity of the data in the source system to the level required by your product. The solution is to develop a generalized data model that captures exactly the business objects that you need within your product, but without unnecessary complexity.

EXAMPLE 1: In a loyalty program, a customer account may have more than one loyalty card linked to it, e.g. a main card and a secondary card for a partner. For a recommendation engine, this information is irrelevant. Building it into your data model will only make the model more complicated. Instead, design the interface to your product around account-level statistics and remove the card object from your data model entirely. Your product does not need to know what a customer card is.
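As a sketch of what such an account-level interface could look like - all table and column names here are hypothetical - the product could consume a view that aggregates card-level transactions up to the account, so cards never appear in the product's model:

```sql
-- Hypothetical warehouse tables: WAREHOUSE.CARD (card-to-account mapping)
-- and WAREHOUSE.CARD_TRANSACTION (purchases recorded per card).
-- The product only ever sees account-level statistics.
CREATE VIEW PRODUCT.ACCOUNT_PURCHASES AS
SELECT
    c.ACCOUNT_ID,
    COUNT(*)            AS PURCHASE_COUNT,
    SUM(t.AMOUNT)       AS TOTAL_AMOUNT,
    MAX(t.PURCHASED_AT) AS LAST_PURCHASE_AT
FROM WAREHOUSE.CARD_TRANSACTION t
JOIN WAREHOUSE.CARD c ON c.CARD_ID = t.CARD_ID
GROUP BY c.ACCOUNT_ID;
```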

EXAMPLE 2: In your data warehouse, the customer account ID could be a CHAR(16) field, left-padded with whitespace and named RK_ACC_NR_16 - all because of some stupid decision from 5 years ago that kind of made sense back then. Doesn't matter. You can build your customer master data table with an INTEGER identifier, call it CUSTOMER_ID, and build an ETL that makes sure that every new RK_ACC_NR_16 customer provided by your data warehouse receives a new, unique CUSTOMER_ID that is a 4-byte integer rather than a 16-byte CHAR(16).
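A minimal sketch of such an ETL step, assuming a hypothetical WAREHOUSE.CUSTOMER source table and a database that supports standard identity columns:

```sql
-- Mapping table: one surrogate CUSTOMER_ID per warehouse RK_ACC_NR_16.
CREATE TABLE PRODUCT.CUSTOMER_MAP (
    CUSTOMER_ID  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    RK_ACC_NR_16 CHAR(16) NOT NULL UNIQUE
);

-- ETL step: register every warehouse customer we haven't seen yet;
-- the identity column hands out the next free integer automatically.
INSERT INTO PRODUCT.CUSTOMER_MAP (RK_ACC_NR_16)
SELECT DISTINCT w.RK_ACC_NR_16
FROM WAREHOUSE.CUSTOMER w
WHERE NOT EXISTS (
    SELECT 1
    FROM PRODUCT.CUSTOMER_MAP m
    WHERE m.RK_ACC_NR_16 = w.RK_ACC_NR_16
);
```

From here on, every product table references CUSTOMER_ID, and only the mapping table ever touches RK_ACC_NR_16.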

EXAMPLE 3: I recently struggled with a data model chosen for promotional offers. In this case, an offer - basically a discount on a set of articles - was uniquely identifiable not by one but by four columns, because it was attached to a communication channel, a set of discounted products and other data. This meant that every time we wanted to join any offer-related information, we had to join on four different columns. This was a huge burden on the database: it had to store and move four key columns instead of one in every table, and every join and execution plan became much more complex. On top of that, we had to join this type of data a lot, so we wrote the four-line join condition into SQL statements dozens, probably hundreds of times. Typing all those join conditions again and again took a long time, the code base got unnecessarily bloated, and the code became hard to read.
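To make the pain concrete, here is roughly what every one of those joins looked like; the table and column names are invented for illustration, since the real ones are internal:

```sql
-- Every offer-related join had to repeat all four key columns.
SELECT r.CUSTOMER_ID, d.DISCOUNT_PCT
FROM OFFER_RESPONSE r
JOIN OFFER_DISCOUNT d
  ON  d.CAMPAIGN_ID    = r.CAMPAIGN_ID
  AND d.CHANNEL_CD     = r.CHANNEL_CD
  AND d.PRODUCT_SET_ID = r.PRODUCT_SET_ID
  AND d.VALID_FROM_DT  = r.VALID_FROM_DT;
```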

I solved the problem by writing an offer management workflow that automatically identified new offers by their four key columns, added them to a mapping table, and assigned a surrogate key OFFER_ID that was just an integer column counting upwards. That way, I could use OFFER_ID throughout the product code to uniquely identify offers and join tables on a single column instead of four. Only when I had to send data back to the data warehouse did I need to map the surrogate key back to the four key columns. For new team members, it is now obvious that this is how you identify individual offers.
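A minimal sketch of such a workflow, reusing the invented column names from above and again assuming standard identity columns:

```sql
-- Mapping table: the four warehouse key columns collapse into one
-- integer surrogate key (names and types are hypothetical).
CREATE TABLE PRODUCT.OFFER_MAP (
    OFFER_ID       INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    CAMPAIGN_ID    INTEGER NOT NULL,
    CHANNEL_CD     CHAR(4) NOT NULL,
    PRODUCT_SET_ID INTEGER NOT NULL,
    VALID_FROM_DT  DATE    NOT NULL,
    UNIQUE (CAMPAIGN_ID, CHANNEL_CD, PRODUCT_SET_ID, VALID_FROM_DT)
);

-- Workflow step: register any four-column combination not seen before.
INSERT INTO PRODUCT.OFFER_MAP
    (CAMPAIGN_ID, CHANNEL_CD, PRODUCT_SET_ID, VALID_FROM_DT)
SELECT DISTINCT
    o.CAMPAIGN_ID, o.CHANNEL_CD, o.PRODUCT_SET_ID, o.VALID_FROM_DT
FROM WAREHOUSE.OFFER o
WHERE NOT EXISTS (
    SELECT 1
    FROM PRODUCT.OFFER_MAP m
    WHERE m.CAMPAIGN_ID    = o.CAMPAIGN_ID
      AND m.CHANNEL_CD     = o.CHANNEL_CD
      AND m.PRODUCT_SET_ID = o.PRODUCT_SET_ID
      AND m.VALID_FROM_DT  = o.VALID_FROM_DT
);
```

Inside the product, the four-line join condition from above then shrinks to a single `ON d.OFFER_ID = r.OFFER_ID`; the mapping table is consulted only at the boundary, when data flows back to the warehouse.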

Takeaways:

  • Create your own generalized data model for your data product. Don't force yourself to stick to existing names and objects, e.g. from a production system or a data warehouse.
