Table joins
Joining tables is one of the most common tasks performed in a database. A lot of data preprocessing is about gathering all the relevant data by joining multiple tables together to get - for example - all the features that you want to use in a predictive model.
Common examples include joining product data to sales data. The sales data might only contain the ID of the sold product, while the product data contains the name, category and supplier among other things. Since you are reading this book, I assume that you are familiar with the basics of what a join is, why you need to do it, and how it works.
What I want to achieve in this chapter is to make you aware of the potential problems you can encounter when you carelessly join tables, namely duplicate rows and orphans.
Let's say you're joining two tables A and B with a left join. You've done your homework and checked both tables for duplicate rows - they didn't contain any. But when you do the join, you suddenly notice that the number of rows of A joined with B is larger than the number of rows of A. The reason why this can happen is when the join condition is met by multiple rows. Maybe table A is sales data and table B is product data. You expected that you can simply join the two tables by PRODUCT_ID
to get the additional product information along with the sales data. However, upon closer inspection, you discover that table B contains multiple rows with the same product ID. You look at one example and discover that there's another column COMPANY_ID
- and that one differs between the rows with the same product ID. Apparently, your product data contains products from multiple companies that occasionally have overlapping product IDs.
This example may sound a little far-fetched, but it happens all the time. Many times, you are not joining simple lookup tables, but several complex tables that you have constructed in SQL queries hundreds of lines long. Maybe you're joining a dozen tables and the join conditions of any join include multiple columns. Forget a single column in the join condition and you'll end up with duplicate rows in your result table.[1]
As always, the solution is to check your assumptions. If you think that rows X and Y are enough to uniquely identify every row in a table A, check it with SELECT COUNT(*), COUNT(DISTINCT CONCAT(X, '_', Y)) FROM A
. If both counts are the same, you are safe. Also, make sure to obtain table counts before and after the join. Did the number of records increase, even though it was not supposed to?
Takeaways:
When you think that a column or a set of columns uniquely identify a row in a table, make sure to check this assumption
[1] By the way, if you are joining more than half a dozen tables in a single query as described above, chances are that your query sucks. Not only is this barely maintainable, it's also likely to be inefficient due to the way most databases create optimized execution plans. More on that in a later chapter on how to write good SQL.
Last updated