Measuring database load
Building and running a data product is often one of the most computationally intensive processes in an organization. This is especially true in organizations that don't have much experience with building and deploying data products - which is still most companies. When you prepare the data for your product on the database hosting the data warehouse, sooner or later someone will notice your CPU usage skyrocketing.
Even if no one comes knocking on your door to stop you from breaking the database for everyone else, I would still recommend monitoring at least the daily CPU usage of your project. It doesn't matter whether you are using your sandbox environment for free or not, someone IS paying for it and you should at least be aware of the cost of your calculations.
Most database software will have metadata tables containing the log of all queries, including their CPU and memory usage. This is convenient, because you can use SQL to summarize the CPU usage of your SQL processes. I recommend building a small report that contains the daily and weekly CPU and/or memory usage statistics of your product. This allows you to see when some change causes the CPU load to explode. It also allows you to compare your CPU load in relation to the rest of the processes running on the database. The query log metadata will typically have a column with the name of the user who ran a query, so you can distinguish between CPU usage from the team members and CPU usage from technical users used for recurring processes.
When you are trying to optimize the performance, try to figure out which of your processes are the most expensive ones. When you named your tables following the recommendations in previous chapters, you can use the unique prefixes corresponding to the sub-processes in your product to filter the query log to only those queries concerning a particular sub-process. Within every process, you can look at the most expensive queries. Ask yourself: Is this query supposed to be expensive due to high data granularity, or is it simply written in an inefficient way? Is there a way to do this calculation cheaper or even avoid it? Can you change the underlying tables to make the table joins more efficient?
Take into consideration that some queries are run multiple times a day. Don't just look at individual queries, but at groups of queries that are similar. You can filter on a given keyword, e.g. the name of a process that you hopefully used as part of your table names. You can also filter on all queries that reference a specific table. Sometimes, if you run the exact same query multiple times, it makes sense to group by query text - or at least a sufficiently long substring of it - before looking at the most expensive query groups.
If nobody looks at the cost of computations, people tend to care less about building efficient solutions. Being aware of the cost of your calculations, computationally as well as financially, will lead to more disciplined coding. People will write better, more efficient code, making better use of the resources provided to them.
Takeaways:
  • Use database metadata tables like query logs to analyze the CPU and memory usage of your data product
  • Look at your most expensive queries
Copy link