Poorly written SQL
When you're working as a data scientist, most of your data pre-processing and large parts of your feature generation will happen inside a database and thus be coded in SQL. I have worked on projects where 90% of the code base was written in SQL, even though this was 'only' the boring pre-processing and feature generation part. It's well known among data scientists that getting the right data in the right format for your analysis is often 80% of the work. Naturally, this means that writing 'good' SQL is crucial for the success of your project.
On the following pages, I will give a few recommendations, most of which I learned from not doing them and running into trouble because of it.
AVOID NESTED QUERIES. For some reason, every junior data scientist who starts writing SQL scripts will run into the trap of getting tangled in deeply nested SQL queries. The use of the WITH clause allows you to get a clearer structure by giving a sub-query a name and allowing it to be referenced in multiple places. While this is helpful, I recommend the use of persistent tables to save intermediate results. Instead of one query that does a dozen different things, you can create a sequence of queries, each of which does only one or two things and writes the result to a table with a meaningful name that can be used in several different places later in the script. Doing this is also helpful because it allows the database to optimize the execution plan of queries based on internal statistics.
USE CONTROL TABLES. Every time you process data, you are limiting your attention to a pre-defined set of objects in the real world. This can be a subset of customers, offers, products or transactions. This usually happens through filter conditions using the WHERE statement as well as INNER JOINs with tables that limit your number of records. These filters are often buried deep inside your SQL script, sometimes in multiple different places. Instead, try to create separate tables with the relevant records right at the beginning of your script. For example
  • one set with the relevant set of customers TEMP_SALESSTATS_RELEVANT_CUSTOMERS
  • one with the relevant set of offers TEMP_SALESSTATS_RELEVANT_OFFERS
  • and one with the relevant points of sales TEMP_SALESSTATS_RELEVANT_STORES
This brings clarity into your script because everybody reading it will be aware of how you define the scope of your analysis or process. If somebody ever wonders what filters you applied to the set of customers, he can simply search for the query that created TEMP_SALESSTATS_RELEVANT_CUSTOMERS and look at the source table and the filter conditions.
After executing the script, you can compare your result table CUSTOMER_SALESSTATS to the table TEMP_SALESSTATS_RELEVANT_CUSTOMERS from the beginning to see whether you lost any records along the way.
USE A GOOD AND CONSISTENT CODING STYLE. The importance of a clean and consistently applied coding style is obvious. It makes your code much easier to read and avoids confusion. The definition of a GOOD coding style is difficult because it's often very subjective. Unless your company provides a recommendation, simply chose what makes sense to you and apply it consistently. Many SQL tools like DBeaver allow the use of SQL formatters that are also adaptable to your preferences. If possible, select a style that can be created with an automatic formatter. If you work with a pre-existing code base, try to stick to the style used there unless you have a very good reason not to.
WRITE HELPFUL COMMENTS. A helpful comment tells you what a query is supposed to do. Do not simply describe what it does because everybody who can read SQL will see that. Instead, describe what you INTEND to do and WHY you are doing it. If there is an error in the query, having a comment that says what the author of the SQL actually WANTED to do is very helpful to fix it.
A word of caution: Comments are a double-edged sword. If somebody changes the code later but does not change the comment accordingly, you may up with a comment that lies about what the code actually does, which is confusing and misleading. Don't write too many comments, otherwise maintaining comments may end up taking just as much time as maintaining code. Good naming of tables and columns can make some comments obsolete.
Takeaways:
  • The majority of code in many data science projects is written in SQL, so writing good SQL is crucial to the success of the project
  • Avoid nested queries as they are confusing, hard to maintain, error-prone and inefficient.
  • Apply a consistent coding style. Try to use automatic formatting.
Copy link