When you build your first data product, you will typically build it as a monolithic piece of software. Unfortunately, a monolithic piece of software is hard to maintain. As you continue to develop it, you will recognize that there are actually different functional parts in your code. One part may be for reading in and transforming the data from a data warehouse, one for generating features and another one for training a model. This recognition typically leads to a certain level of modularization in your product. After a little bit of refactoring, you are able to separate the code into components that take care of different tasks.

Now, rather than the monolith doing all the things, you have a bunch of components (or modules), each doing only one specific thing. Every component in itself is now pretty simple and easy to understand. Unfortunately, you are adding a new layer of complexity because the components need to be able to talk to one another. When you don't have a clean interface between two components, changes in one component can break the other component. For example, when you change the name of a column from some table created in component A, suddenly component B stops working because some query now fails.
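The column-rename failure can be reproduced in a few lines. This is only a minimal sketch using an in-memory SQLite database; the table `a_output`, its columns, and the function `component_b` are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Component A writes its output table (hypothetical schema).
conn.execute("CREATE TABLE a_output (customer_id INTEGER, score REAL)")
conn.execute("INSERT INTO a_output VALUES (1, 0.8)")


def component_b(conn):
    """Component B reads A's table directly and hard-codes its column names."""
    return conn.execute("SELECT customer_id, score FROM a_output").fetchall()


rows_before = component_b(conn)  # works as long as A keeps its column names

# Component A is refactored: the column 'score' is now called 'model_score'.
conn.executescript(
    """
    DROP TABLE a_output;
    CREATE TABLE a_output (customer_id INTEGER, model_score REAL);
    """
)

try:
    component_b(conn)
    b_still_works = True
except sqlite3.OperationalError:
    # SQLite reports the missing column; component B is broken.
    b_still_works = False
```

Nothing in component B changed, yet it fails, because its query depends on a detail that component A considered internal.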

In the beginning, this doesn't seem like much of a problem. If you change something in one component, of course this also requires some changes in other components, right? You know how all the components work, so you can quickly fix any problems occurring in other parts of your code caused by the change. But as your product grows and people build other products or components on top of your output, it becomes difficult to keep track of all the complex interdependencies between the different components and products. At some point, you are too scared to touch any part of your code because you know that even a small change will have huge, seemingly unpredictable effects on the rest of the product.

I have been at this point. Unfortunately, my education as a mathematician and data scientist didn't teach me the importance of clean, well-defined interfaces. I was lucky to be surrounded by professional software developers who recognized what was going on and explained it to me. Here's the gist of it.

When you have two software components that need to talk to one another, e.g. for passing on data, each one should need to know as little as possible about how the other component works. As soon as one component contains any logic that is based on the inner workings of the other one, you create interdependencies, meaning that changes in one component have the potential to break the other component.

Instead, you want to design them in such a way that you can completely overhaul or even exchange one component without the other one even noticing. This is only possible when each component knows nothing about how the other components work. The only thing they need to know is the format of the interface between them. And this interface should be designed independently of any implementation details on either side.

I know this can be a little bit abstract, so here's a simple example. Let's say your data warehouse contains a customer ID called RK_ACC_NR_16 which is a CHAR(16). You are trying to build a recommendation engine, so naturally, you need to define a list of relevant customers in your product. This is done by a script that fetches a unique list of RK_ACC_NR_16 IDs from the data warehouse, perhaps including some additional filter conditions. This script is the basis for all other scripts to follow and the column RK_ACC_NR_16 is referenced everywhere in your code and in your table schemas.

At some point, a new team member joins your product team. They keep wondering why you are using a 16-byte CHAR data type to store the customer ID when a 4-byte INTEGER would have been the obvious and more efficient choice. They may also wonder why you chose a weird name like RK_ACC_NR_16 instead of the more obvious CUSTOMER_ID. Of course, you had a good reason for these two choices, but understanding it requires everybody who ever looks at your code to be familiar with the way customer IDs are implemented in your data warehouse. If the implementation in the data warehouse ever changes, for example because new data privacy laws require the customer ID to be encrypted, you will need to change the table schemas and queries in hundreds of different places across your code and database.

On the other hand, what if you had built a customer management component, one that monitors the RK_ACC_NR_16 in the data warehouse and assigns a unique CUSTOMER_ID to every new RK_ACC_NR_16? If changes are made to the data warehouse, you don't need to change every single part of your product. All you need to do is update the customer management component to deal with whatever the data warehouse engineers come up with. As long as whatever they develop can still be mapped to your integer-type CUSTOMER_ID, you're good.
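To make the idea more concrete, here is a rough sketch of what such a customer management component might look like, using an in-memory SQLite database as a stand-in for the real systems. The table names `warehouse_accounts` and `customer_map` and the function `sync_customer_ids` are assumptions made for this example; only RK_ACC_NR_16 and CUSTOMER_ID come from the text above.

```python
import sqlite3


def sync_customer_ids(conn: sqlite3.Connection) -> None:
    """Assign a CUSTOMER_ID to every warehouse key that does not have one yet."""
    conn.execute(
        """
        INSERT INTO customer_map (rk_acc_nr_16)
        SELECT DISTINCT w.RK_ACC_NR_16
        FROM warehouse_accounts AS w
        WHERE w.RK_ACC_NR_16 NOT IN (SELECT m.rk_acc_nr_16 FROM customer_map AS m)
        ORDER BY w.RK_ACC_NR_16
        """
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the data warehouse table (hypothetical name).
    CREATE TABLE warehouse_accounts (RK_ACC_NR_16 CHAR(16) NOT NULL);

    -- Owned by the customer management component. This is the only place
    -- in the product that knows about RK_ACC_NR_16.
    CREATE TABLE customer_map (
        customer_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        rk_acc_nr_16 CHAR(16) NOT NULL UNIQUE
    );

    INSERT INTO warehouse_accounts VALUES ('0000000000000017');
    INSERT INTO warehouse_accounts VALUES ('0000000000000042');
    """
)

sync_customer_ids(conn)

# A new customer appears in the warehouse; existing IDs must stay stable.
conn.execute("INSERT INTO warehouse_accounts VALUES ('0000000000000099')")
sync_customer_ids(conn)

mapping = conn.execute(
    "SELECT customer_id, rk_acc_nr_16 FROM customer_map ORDER BY customer_id"
).fetchall()
```

Every other component would then work with `customer_id` only. If the warehouse key changes its format, the mapping table and `sync_customer_ids` are the only things that need to be touched.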

Creating an interface between two components forces you to think in detail about what data needs to be exchanged in what format. Create an interface definition by documenting every single attribute, including all constraints. What is the data type? Which values are allowed? If it is a numeric attribute, what is the minimum and maximum allowed value? Can it be negative or zero? If it is a categorical attribute, what is the allowed set of values? Can the attribute be empty or NULL?
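One possible way to write such an interface definition down is as a small, machine-readable specification that can also be used to check records. The sketch below is an illustration only: the `Attribute` class, the `violations` function, and the attributes PRODUCT_ID and PRODUCT_CATEGORY are hypothetical, and a real project might prefer a schema library instead.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Set


@dataclass
class Attribute:
    """One documented attribute of the interface, including its constraints."""
    name: str
    dtype: type
    nullable: bool = False
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    allowed: Optional[Set[Any]] = None


# Hypothetical interface for exchanging product information.
PRODUCT_INTERFACE = [
    Attribute("PRODUCT_ID", int, minimum=1),
    Attribute("PRODUCT_PRICE", float, minimum=0.0),  # zero allowed, negative not
    Attribute("IS_PRODUCT_AVAILABLE", str, allowed={"y", "n"}),
    Attribute("PRODUCT_CATEGORY", str, nullable=True),
]


def violations(record: Dict[str, Any], spec: List[Attribute]) -> List[str]:
    """Return a description of every constraint the record violates."""
    problems = []
    for attr in spec:
        value = record.get(attr.name)
        if value is None:
            if not attr.nullable:
                problems.append(f"{attr.name}: must not be NULL")
            continue
        # Strict type check: an int is not accepted where a float is documented.
        if not isinstance(value, attr.dtype):
            problems.append(f"{attr.name}: expected {attr.dtype.__name__}")
            continue
        if attr.minimum is not None and value < attr.minimum:
            problems.append(f"{attr.name}: below minimum {attr.minimum}")
        if attr.maximum is not None and value > attr.maximum:
            problems.append(f"{attr.name}: above maximum {attr.maximum}")
        if attr.allowed is not None and value not in attr.allowed:
            problems.append(f"{attr.name}: not in allowed set")
    return problems


good = {"PRODUCT_ID": 1, "PRODUCT_PRICE": 9.99, "IS_PRODUCT_AVAILABLE": "y"}
bad = {"PRODUCT_ID": 2, "PRODUCT_PRICE": -1.0, "IS_PRODUCT_AVAILABLE": "maybe"}
```

Calling `violations(good, PRODUCT_INTERFACE)` yields an empty list, while the `bad` record is reported for its negative price and its unknown availability flag.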

This is already important and useful if you are the person on both sides of the interface, but if you are designing the interface between your component and one created by a different person or team, it is absolutely essential. Taking some time to do this will save you a lot of trouble later. When you are using a database table, you can already encode many of the constraints in the table DDL by choosing appropriate data types and using NOT NULL constraints.
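As a rough illustration of encoding constraints in the DDL, the following sketch creates an interface table in an in-memory SQLite database. The table and its columns are hypothetical. Besides data types and NOT NULL, it also uses CHECK constraints, which go one step beyond what is described above and are not enforced in the same way by every database system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_interface (
        PRODUCT_ID           INTEGER NOT NULL PRIMARY KEY,
        PRODUCT_PRICE        REAL    NOT NULL CHECK (PRODUCT_PRICE >= 0),
        IS_PRODUCT_AVAILABLE TEXT    NOT NULL
                             CHECK (IS_PRODUCT_AVAILABLE IN ('y', 'n')),
        PRODUCT_CATEGORY     TEXT    -- optional attribute, NULL is allowed
    )
    """
)


def try_insert(row) -> bool:
    """Try to insert a row; return False if the database rejects it."""
    try:
        conn.execute("INSERT INTO product_interface VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        # Raised by SQLite for NOT NULL, CHECK and PRIMARY KEY violations.
        return False


results = [
    try_insert((1, 9.99, "y", None)),    # valid row
    try_insert((2, None, "y", None)),    # missing price
    try_insert((3, -5.0, "y", None)),    # negative price
    try_insert((4, 1.0, "maybe", None)), # value outside the allowed set
    try_insert((5, 0.0, "n", "toys")),   # valid row
]
```

With this schema, the database itself refuses rows that do not match the interface definition, so the documentation and the enforcement cannot drift apart as easily.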

When you are working in a database and you want two components to exchange information, a simple way to reduce dependencies is to not let component B directly access the table created as the output of component A. Instead, you can create a view on the output table of A and use that as an interface. Now you can rename the underlying table and its columns or even completely switch it out with another one. You just need to make sure that the input of the view is changed accordingly and that the columns receive the names agreed upon in the definition of the interface. Make sure that every other component C that also uses the output from component A gets its own view. This not only allows you to reduce the number of columns to just those needed by each component, it also helps you avoid breaking component B when you're working on the interface between components A and C.
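The view-as-interface idea can be sketched with an in-memory SQLite database. All table, view, and column names below (`a_output_v1`, `if_a_to_b`, `if_a_to_c`, and so on) are made up for the example; the point is only that component B keeps working while the table behind the view is replaced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Internal output table of component A (hypothetical names).
    CREATE TABLE a_output_v1 (cust_no INTEGER, scr REAL, internal_flag TEXT);
    INSERT INTO a_output_v1 VALUES (1, 0.8, 'x');
    INSERT INTO a_output_v1 VALUES (2, 0.3, 'y');

    -- One interface view per consumer, exposing only the agreed columns.
    CREATE VIEW if_a_to_b AS
        SELECT cust_no AS CUSTOMER_ID, scr AS SCORE FROM a_output_v1;
    CREATE VIEW if_a_to_c AS
        SELECT cust_no AS CUSTOMER_ID FROM a_output_v1;
    """
)


def component_b(conn):
    """Component B only ever queries its interface view."""
    return conn.execute(
        "SELECT CUSTOMER_ID, SCORE FROM if_a_to_b ORDER BY CUSTOMER_ID"
    ).fetchall()


before = component_b(conn)

# Component A is overhauled: new table, new column names. Only the view
# definitions are adjusted; the agreed column names stay the same.
conn.executescript(
    """
    CREATE TABLE a_output_v2 (customer_key INTEGER, model_score REAL);
    INSERT INTO a_output_v2 SELECT cust_no, scr FROM a_output_v1;

    DROP VIEW if_a_to_b;
    CREATE VIEW if_a_to_b AS
        SELECT customer_key AS CUSTOMER_ID, model_score AS SCORE FROM a_output_v2;
    DROP VIEW if_a_to_c;
    CREATE VIEW if_a_to_c AS
        SELECT customer_key AS CUSTOMER_ID FROM a_output_v2;

    DROP TABLE a_output_v1;
    """
)

after = component_b(conn)
```

The function `component_b` was not modified, and `before` and `after` contain the same rows. Because component C has its own view, its column list can change without touching `if_a_to_b`.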

The interface view on the output table can already contain filter conditions in the WHERE clause that encode the constraints demanded by the interface definition. For example, if the interface is for exchanging product information, you can add a filter such as PRODUCT_PRICE >= 0 or IS_PRODUCT_AVAILABLE IN ('y', 'n'). This means that even if somebody sends you bullshit data, it will not enter your data pipeline where it can potentially break something. It will already be filtered out at the very beginning.
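A filtering interface view might look like the following sketch, again using an in-memory SQLite database. The table `raw_products`, the view `if_products`, and the column PRODUCT_ID are hypothetical; the two filter conditions are the ones from the text, combined here with AND so that a row must satisfy both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Raw output of the upstream component, including some invalid rows.
    CREATE TABLE raw_products (
        PRODUCT_ID INTEGER, PRODUCT_PRICE REAL, IS_PRODUCT_AVAILABLE TEXT
    );
    INSERT INTO raw_products VALUES (1, 19.99, 'y');
    INSERT INTO raw_products VALUES (2, -5.00, 'y');     -- negative price
    INSERT INTO raw_products VALUES (3, 4.50, 'maybe');  -- invalid flag
    INSERT INTO raw_products VALUES (4, NULL, 'n');      -- missing price
    INSERT INTO raw_products VALUES (5, 0.00, 'n');

    -- Interface view: only rows that satisfy the interface definition pass.
    CREATE VIEW if_products AS
        SELECT PRODUCT_ID, PRODUCT_PRICE, IS_PRODUCT_AVAILABLE
        FROM raw_products
        WHERE PRODUCT_PRICE >= 0
          AND IS_PRODUCT_AVAILABLE IN ('y', 'n');
    """
)

accepted = [
    row[0]
    for row in conn.execute("SELECT PRODUCT_ID FROM if_products ORDER BY PRODUCT_ID")
]

# Rows that were silently dropped are worth monitoring separately.
rejected = [
    row[0]
    for row in conn.execute(
        """
        SELECT PRODUCT_ID FROM raw_products
        WHERE PRODUCT_ID NOT IN (SELECT PRODUCT_ID FROM if_products)
        ORDER BY PRODUCT_ID
        """
    )
]
```

Note that the row with a NULL price is filtered out as well, because a comparison with NULL is never true. Since the view drops rows silently, a query like the one for `rejected` can help you notice when the upstream component starts sending invalid data.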


  • Don't build monolithic software. Split your scripts and workflows into independent components.

  • A component should not need to know how another component works internally. Define clear interfaces that stay the same, even when the internal implementation of any of the components changes.
