Database performance considerations

Data engineers sometimes jokingly refer to themselves as data plumbers. This metaphor is not too far-fetched. The result of their work is a series of interconnected data pipelines that transport data from A to B while merging it with data from other sources and performing some transformations along the way. These tubes can leak (missing values or records), they can be blocked (crashes) or they can pollute the water (duplicates, incorrect transformations). On top of that, they must be able to withstand the pressure of the water flowing in from the source.

If you're working with data, a large part of your work will typically be dedicated to managing the pressure of the data flowing in. For a small dataset, it makes little perceptible difference whether your pipeline code is written efficiently or not. Whether a step takes a few milliseconds or a full second, you will barely notice. This changes when you process large volumes of data. Depending on how you handle the data as it moves through the different transformation and integration steps, the same process can take either seconds or hours.

The way you optimize for efficiency varies greatly between data platforms - even between supposedly similar relational database systems. Performance tuning looks very different for Teradata, MS SQL Server or Oracle. I won't go into the details of any one specific database system, but I will discuss a few universal approaches to reduce unnecessary calculations.

Intermediate results

Don't be afraid to create intermediate results. When I started writing my first database queries, I tried to put everything into A SINGLE statement with lots of subqueries. For the sake of your own sanity, do not do that. Instead, create intermediate result tables and use them as input for the next query so that every query does only one thing. When a query does only one thing, it's much easier for the internal optimizer of the SQL engine to find an efficient execution strategy. If storage space is a concern for you, make sure to delete every one of these temporary intermediate tables as soon as you don't need it anymore. A positive side effect of this method is the ability to check the data in the intermediate table for errors or incorrect results. When your result seems wrong, you can execute your script query by query and, at each stage, check if something went wrong. This will also help you to modularize your code, making it easier to maintain.
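Here is a minimal sketch of the idea in generic SQL. All table and column names (sales, tmp_recent_sales, tmp_customer_totals) are made up for illustration, and the exact CREATE TABLE AS syntax differs between database systems.

```sql
-- One query, one job: first restrict the raw data to the period of interest.
CREATE TABLE tmp_recent_sales AS
SELECT customer_id, sale_date, amount
FROM sales
WHERE sale_date >= DATE '2024-01-01';

-- Second query: aggregate the intermediate result.
CREATE TABLE tmp_customer_totals AS
SELECT customer_id,
       COUNT(*)    AS n_purchases,
       SUM(amount) AS total_amount
FROM tmp_recent_sales
GROUP BY customer_id;

-- If the final numbers look wrong, inspect tmp_recent_sales and
-- tmp_customer_totals stage by stage instead of debugging one giant query.

-- Drop each intermediate table as soon as it is no longer needed.
DROP TABLE tmp_recent_sales;
```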

Pre-calculated metrics

The next step after TEMPORARILY saving intermediate results is to PERMANENTLY save some preliminary calculations. In a retail context, you could build a customer datamart that contains statistics on each customer's past shopping behavior. One such feature would be the customer's purchase cycle, i.e. the average number of days between the customer's visits. Subsequent processes can then simply retrieve this feature from the customer datamart instead of having to descend into the usually large transaction table to calculate it on the spot. This avoids duplicating both code and calculations when the feature is needed in different places (single point of truth).
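As a sketch of what such a pre-calculated feature could look like: the table and column names (customer_dm, sales) are assumptions, and the date arithmetic is dialect-specific (the example assumes that subtracting two DATEs yields a number of days, as in e.g. Teradata).

```sql
-- Refresh the purchase cycle feature in a permanent customer datamart:
-- average days between visits = (last visit - first visit) / (visits - 1).
INSERT INTO customer_dm (customer_id, purchase_cycle_days)
SELECT customer_id,
       CAST(MAX(sale_date) - MIN(sale_date) AS FLOAT)
         / NULLIF(COUNT(DISTINCT sale_date) - 1, 0) AS purchase_cycle_days
FROM sales
GROUP BY customer_id;

-- Downstream processes read the feature instead of scanning the
-- transaction table again.
SELECT customer_id, purchase_cycle_days
FROM customer_dm
WHERE purchase_cycle_days <= 7;
```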

Preliminary calculations are a double-edged sword. They can lower the computational load of your workflows, but at the same time, they create dependencies. If you have a process A that pre-calculates sales statistics for every new offer and a process B that uses these sales statistics to calculate features for a predictive model, then process B depends on process A. If process A fails for some reason, so will process B – and that is the BEST case scenario. In the WORST case, process B will complete and calculate the required features, but they will be wrong because the input from process A was missing. When you use preliminary calculations, make sure to test what happens when the preceding process fails to complete.

In most companies, there will already be some kind of data warehouse with key performance indicators on the most important dimensions like customer, product, supplier, region, point of sale etc. It's very rare that you need to calculate everything from scratch. But as already mentioned above, be aware of the dependencies that this creates for your product. This is not only a dependency, but a dependency on something outside the control of your product team. Make sure to implement processes to monitor the quality and completeness of the incoming data and to be alerted when something is off.
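Such a monitoring process can be as simple as a scheduled completeness check. A minimal sketch (the table name, date arithmetic and thresholds are assumptions and depend on your system):

```sql
-- Did yesterday's load arrive, and is it roughly the usual size?
-- If rows_loaded is zero or far below the normal daily volume,
-- alert the team and hold dependent processes instead of
-- silently producing wrong features.
SELECT COUNT(*) AS rows_loaded
FROM sales
WHERE sale_date = CURRENT_DATE - 1;
```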

Another word of caution: Every datamart, every pre-calculated table, is a small data product on its own. Don't assume that you can just write the code once and never touch it again. You need to monitor the inputs and outputs for data quality issues and respond to changes in the source data.

Bulk-processing and partitioning

A big advantage of large-scale databases is that you can move around large amounts of data and perform transformations on the entire dataset without having to think about how to loop over all the records. Unfortunately, the data can become too large to be processed in one go. In this case, moving from processing all the data at once to iterative bulk-processing can be a solution. This is particularly effective when you can chunk the data you need to process in a way that is supported by the storage mechanisms of the database system. The default mechanism that many database systems like Teradata or Hive support is partitioning. It's often implemented in such a way that you only need to specify it in the CREATE TABLE statement. You don't need to change any of the SQL code because the query optimizer will automatically use it when possible.

A typical use case for partitioning is transaction data, e.g. sales records. These tables will often be accessed with a time filter. You typically don't want to look at the sales data from the last eight years, but at yesterday's or last week's sales. That means that it is useful to partition the table by day, week, month or year. This way, a query about last week doesn't trigger a full table scan that reads data from eight years ago. It just looks at the list of partitions, determines which ones it needs to access and scans only those records.
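A Hive-style sketch of such a table (the names are assumptions, and partitioning syntax differs considerably between systems):

```sql
-- Transaction table partitioned by day.
CREATE TABLE sales_partitioned (
    customer_id BIGINT,
    product_id  BIGINT,
    amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE);

-- The time filter lets the optimizer prune partitions: only last
-- week's partitions are scanned, not eight years of history.
SELECT SUM(amount) AS weekly_sales
FROM sales_partitioned
WHERE sale_date BETWEEN DATE '2024-05-06' AND DATE '2024-05-12';
```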

Infrastructure scaling

An obvious solution to performance problems is infrastructure scaling. Infrastructure can be scaled vertically by running your code on a larger, more powerful machine, or horizontally by adding additional machines to share the computational load. With parallel computation frameworks and cloud infrastructure, it's now easier than ever to just scale the hardware whenever the computational load gets too heavy. While this is fundamentally a good thing, it is - once again - a double-edged sword. Solving performance issues by just throwing better hardware at the problem is often easier than refactoring the code to avoid unnecessary calculations and use the existing resources more efficiently. Limitations focus the mind[1] and a constrained environment incentivizes creative and efficient solutions[2], while abundant resources create lazy developers.

In my opinion, scaling infrastructure should be the last resort for a performance problem. It may be personal preference, but inefficient code just seems wasteful and ugly to me.

Avoiding table JOINs

One thing that is universally true when working with databases is that you should avoid JOINs whenever you can. In particular, avoid performing the same costly JOIN more than once. When you have multiple queries performing an expensive JOIN to fetch sales data, try to fetch this data only once, store it in a pre-aggregated form in a temporary table and use this temporary table for subsequent calculations. This avoids expensive CPU work at the cost of some additional storage space. It's a trick I have frequently used to lower the database usage of my queries.
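For example (a generic sketch; all table and column names are assumptions):

```sql
-- Perform the expensive join once and keep the pre-aggregated result.
CREATE TABLE tmp_sales_by_product_group AS
SELECT s.customer_id,
       p.product_group,
       SUM(s.amount) AS total_amount
FROM sales s
JOIN products p ON p.product_id = s.product_id
GROUP BY s.customer_id, p.product_group;

-- Subsequent queries read the small aggregate instead of repeating the join.
SELECT product_group, SUM(total_amount) AS group_revenue
FROM tmp_sales_by_product_group
GROUP BY product_group;
```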

When you need to collect a series of statistics in order to calculate one or more features, try collecting them in the same table. I usually do this by calculating the input statistics and merging them into the same table one by one. In my experience, this is more efficient than calculating a separate table for each statistic and then joining them all together in one large SQL statement with half a dozen table joins at once.
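A sketch of that pattern (names are made up; the ALTER TABLE and correlated UPDATE syntax varies between systems, so treat this as illustrative rather than definitive):

```sql
-- Start the feature table with the first set of statistics.
CREATE TABLE tmp_customer_features AS
SELECT customer_id,
       COUNT(*)    AS n_purchases,
       SUM(amount) AS total_spend
FROM sales
GROUP BY customer_id;

-- Merge the next statistic into the same table in a separate step
-- instead of one giant statement with many joins.
ALTER TABLE tmp_customer_features ADD COLUMN n_returns INTEGER;

UPDATE tmp_customer_features
SET n_returns = (SELECT COUNT(*)
                 FROM returns r
                 WHERE r.customer_id = tmp_customer_features.customer_id);
```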

Takeaways:

  • Make sure that your database queries do only one thing. If you try to do too many things at once, the database system can't properly optimize your queries.

  • Build datamarts to pre-calculate costly features. Be aware of the dependencies this creates.

  • Use partitioning and try to reduce the number of expensive JOINs.

[1] Limitations focus the mind: Think of ten things. Now think of ten things in the fridge. Which was easier?

[2] One of many examples of a creative solution to limited resources: "How we fit an NES game into 40 Kilobytes", https://www.youtube.com/watch?v=ZWQ0591PAxM (source: https://www.youtube.com/user/miaumiaumiau).
