Large monolithic SQL scripts

A large part of building a data product is writing SQL to generate the relevant input data, i.e. the model features. Since much of the predictive power of any model lies in the use of the right features, this can be quite an involved process. After working on your feature generation SQL script for a while, you may end up with a huge monolith that is hundreds or thousands of lines long and creates dozens, sometimes hundreds, of tables as intermediate results. This is fragile: a single change can affect everything downstream of it, and a syntax error in one place halts the execution of the entire script. It also becomes nearly impossible to explain to a new team member what the script does.

The best solution is to modularize your script. Identify subsections that are - or should be - independent of the rest of the script and move them into separate SQL files. Splitting a large monolithic script into independent blocks is an art in and of itself. Don't expect to simply copy and paste blocks into separate scripts. You will likely have to rewrite large parts of the script to make it more modular.

There is very little generalizable advice I can give you to achieve this, as it depends greatly on your data. I find it helpful to identify groups of features that are based on the same input data and similar calculations. Other times, you can identify several different features that belong to the same dimension, for example features that are based on a customer's past behavior and that are independent of any other dimensions like the offer or the advertising channel.

A simple way to modularize your SQL scripts is to think of them as functions or transformations that take some input data, do something with it and then return the processed data. For example, you may pass in a list of customers and get back a table with a few metric columns per customer, such as their total spend during the last year or the average duration between purchases. All relevant parameters and input data can be passed as tables.
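To make that concrete, here is a minimal sketch in PostgreSQL-flavored SQL. The tables ORDERS and CUSTOMER_METRICS and all column names are hypothetical; the point is only the shape of the script: one input table in, one output table out.

```sql
-- Sketch: CUSTOMERS(customer_id) is the input table;
-- ORDERS(customer_id, order_date, amount) is a hypothetical raw purchase table.
CREATE TABLE CUSTOMER_METRICS AS
SELECT
    c.customer_id,
    COALESCE(SUM(o.amount), 0) AS total_spend_last_year,
    -- Average gap in days between purchases: date span divided by number of gaps.
    CASE WHEN COUNT(o.order_date) > 1
         THEN (MAX(o.order_date) - MIN(o.order_date))::numeric
              / (COUNT(o.order_date) - 1)
    END AS avg_days_between_purchases
FROM CUSTOMERS c
LEFT JOIN ORDERS o
       ON o.customer_id = c.customer_id
      AND o.order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY c.customer_id;
```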

Let's say you have a script that generates features for every combination of customers and offers, for example to serve as the input for a recommendation engine. Usually, you would have one long script that defines the time period of interest, derives from it the relevant set of customers and offers, applies some filters, calculates the features and writes them into the result table. The modular approach goes like this: one script specifies the time period of interest and writes it to the table TIME. TIME is the input for another script that derives the relevant set of customers and offers and writes them to the tables CUSTOMERS and OFFERS. A third, core script takes TIME, CUSTOMERS and OFFERS and generates the table CUSTOMER_OFFER_FEATURES, a cross join of CUSTOMERS and OFFERS that contains all feature columns, which at this point are still completely empty. Now you can write an arbitrary number of feature scripts, each of which calculates a single feature or a group of features and merges the result into CUSTOMER_OFFER_FEATURES.
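A minimal sketch of this pipeline might look as follows, again in PostgreSQL-flavored SQL. The source tables ORDERS and OFFER_CATALOG and all column names are assumptions for illustration; only TIME, CUSTOMERS, OFFERS and CUSTOMER_OFFER_FEATURES come from the structure described above.

```sql
-- Script 1: define the time period of interest.
CREATE TABLE TIME AS
SELECT DATE '2024-01-01' AS period_start,
       DATE '2024-12-31' AS period_end;

-- Script 2: derive the relevant customers and offers from TIME.
-- ORDERS and OFFER_CATALOG are hypothetical source tables.
CREATE TABLE CUSTOMERS AS
SELECT DISTINCT o.customer_id
FROM ORDERS o
JOIN TIME t ON o.order_date BETWEEN t.period_start AND t.period_end;

CREATE TABLE OFFERS AS
SELECT DISTINCT oc.offer_id
FROM OFFER_CATALOG oc
JOIN TIME t ON oc.valid_from <= t.period_end
           AND oc.valid_to   >= t.period_start;

-- Script 3: cross join CUSTOMERS and OFFERS;
-- the feature columns start out completely empty.
CREATE TABLE CUSTOMER_OFFER_FEATURES AS
SELECT c.customer_id,
       o.offer_id,
       CAST(NULL AS numeric) AS total_spend_in_period,
       CAST(NULL AS numeric) AS offer_affinity
FROM CUSTOMERS c
CROSS JOIN OFFERS o;
```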

This approach forces you to think about the underlying structure of your script. The advantage is that you can change the logic that defines the relevant customers and offers without having to touch the scripts that calculate the features. You have created blocks that are independent of one another and connected only through a fixed interface that is unlikely to change. If you want to calculate the features for a different time period of interest, for example a past period to generate training data, you simply make a copy of the first script and rewrite it so that it fills TIME with that period, as sketched below. You don't need to change a single thing about the other scripts because they only depend on the input table TIME.
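For instance, a training-data variant only needs a different first script (the dates are illustrative); everything downstream stays untouched:

```sql
-- Alternative Script 1: fill TIME with a past period to generate training data.
CREATE TABLE TIME AS
SELECT DATE '2022-01-01' AS period_start,
       DATE '2022-12-31' AS period_end;
```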

It gets even better. Imagine someone wants to write a new feature. You simply tell them that they can find the relevant time period in TIME, the relevant customers in CUSTOMERS and the relevant offers in OFFERS. They don't need to know anything about the rest of the scripts. They can simply write the code that generates the new feature for the specified customers and offers and merge the result into the feature table CUSTOMER_OFFER_FEATURES.
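A feature script then reads only the interface tables. The sketch below uses PostgreSQL's UPDATE ... FROM to merge one feature into the shared table; other dialects would use a MERGE statement instead, and ORDERS remains a hypothetical source table.

```sql
-- Feature script: total spend per customer within the period of interest.
UPDATE CUSTOMER_OFFER_FEATURES f
SET total_spend_in_period = s.total_spend
FROM (
    SELECT o.customer_id, SUM(o.amount) AS total_spend
    FROM ORDERS o
    JOIN TIME t ON o.order_date BETWEEN t.period_start AND t.period_end
    GROUP BY o.customer_id
) s
WHERE f.customer_id = s.customer_id;
```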

This way of structuring code is fairly common in computer science, where programmers create microservices that communicate with each other through standardized REST interfaces. The microservices can be developed by different programmers or teams who don't need to know anything about the internal workings of the other microservices, as long as they all agree on a standardized set of interfaces between them. The independence of the different components makes such a system less error-prone and easier to maintain and develop. It is a way to replace large, monolithic, highly interdependent software programs that have become impossible to maintain. I have never seen this approach used on large, monolithic, highly interdependent SQL scripts, but I applied it in a recent project and found it very helpful.
