Duplicate rows

When data is stored in a table format, the term primary key refers to one or more columns that uniquely identify each row, meaning that no two rows can have the same primary key. Typically, it's some form of unique identifier, like a CUSTOMER_ID in a customer table or a PRODUCT_ID in a product table. A duplicate row violates this rule: multiple records share the same primary key. The remaining columns may hold the same values, or they may differ.

The first case is the less harmful one. Since there is no disagreement between the duplicate rows, you can simply discard them. However, instead of just adding a deduplication step at the end of your pipeline, which only fights the symptoms, trace the duplicates back to the part of your pipeline where they originate and fix the source of the problem.
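Discarding exact copies is a one-liner in most tools. Below is a minimal pandas sketch; the table and the column names (CUSTOMER_ID, NAME) are made up for illustration, and the step only treats the symptom, not the root cause.

```python
import pandas as pd

# Hypothetical customer table containing one exact duplicate row.
customers = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 2, 3],
    "NAME": ["Alice GmbH", "Bob AG", "Bob AG", "Carol Ltd"],
})

# Exact duplicates carry no conflicting information, so they can simply be dropped.
deduplicated = customers.drop_duplicates()
print(deduplicated)
```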

The second case of duplicate rows is far worse. When the rows have the same primary key but differ in some of their attributes, you don't know which of them is telling the truth. You need to find a way to deal with the inconsistency between them.

The worst case of duplicates in data that I encountered was in the ERP system[1] of a client when I was working in consulting. It was a large international company with thousands of business partners, and they had increasing difficulties because the same business partner was often represented by multiple records. Partner names were entered manually into the accounting system, and for some partners there were dozens of records, often with variations in the name, address and other important fields. We had to apply clustering based on string similarity measures to detect similar names and addresses and then manually decide whether a cluster of records belonged to the same company or person.

Spotting duplicates is straightforward if your data has a primary key. You simply count the number of occurrences of every primary key and display only those that occur more than once. Ideally, the result of this query should be empty.
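In SQL, this is typically a GROUP BY on the primary key with a HAVING COUNT(*) > 1 filter. Here is the same check sketched in pandas; the table and the CUSTOMER_ID column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table whose primary key column is CUSTOMER_ID.
customers = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 2, 3],
    "NAME": ["Alice GmbH", "Bob AG", "Bob AG (variant)", "Carol Ltd"],
})

# Count how often each primary key occurs and keep only the offenders.
key_counts = customers["CUSTOMER_ID"].value_counts()
duplicated_keys = key_counts[key_counts > 1]

# Ideally this is empty; anything listed here violates the primary key.
print(duplicated_keys)
```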

As mentioned before, don't bother to build elaborate deduplication processes at the end of your data pipeline. The duplicates may already have caused incorrect calculations or other problems by then. Instead, make sure that the input tables at the beginning of the data processing are free from duplicates and that the processing steps are designed in a way that will not create duplicates.

In the above-mentioned ERP example, this doesn't work because the duplicates are different data objects with different primary keys that refer to the same business object: a specific company or person. In cases like these, defining a similarity measure between the objects and performing a clustering might be your only chance to get a handle on the situation. Just make sure you don't turn the problem into a PhD thesis. Clustering is hard to pull off because, unlike with supervised learning techniques, it is difficult to quantify how successful your method is. Use clustering as a pre-processing step before manually classifying duplicates, as sketched below. This may not be as satisfying as a completely automated workflow that you can simply repeat in the future, but sometimes you just need to solve the problem at hand. When the number of observations is small enough and it is unlikely that you will need to do this more than once, it's not worth spending more time trying to get the clustering just right.
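As a rough illustration of that pre-processing step, here is a minimal sketch using Python's standard-library SequenceMatcher as the string similarity measure and a simple greedy grouping. The partner names, the similarity function and the 0.7 threshold are all assumptions for the example, not the method we used on the client project.

```python
from difflib import SequenceMatcher

# Hypothetical partner names as they were typed into the system.
partners = [
    "ACME Industries GmbH",
    "Acme Industries",
    "ACME Industires GmbH",   # typo
    "Globex Corporation",
    "Globex Corp.",
]

def similarity(a, b):
    """Crude, case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Greedy single-pass grouping: attach each name to the first cluster
# whose representative is similar enough, otherwise start a new cluster.
THRESHOLD = 0.7
clusters = []
for name in partners:
    for cluster in clusters:
        if similarity(name, cluster[0]) >= THRESHOLD:
            cluster.append(name)
            break
    else:
        clusters.append([name])

# These candidate groups still need a manual decision on whether they
# really refer to the same company or person.
for cluster in clusters:
    print(cluster)
```

A real project would probably use a dedicated measure such as Levenshtein or trigram similarity and a proper clustering algorithm, but for a one-off cleanup a crude threshold like this is often enough to produce candidate groups for manual review.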

A good way to avoid problems is to simply put a uniqueness constraint on a table column, e.g. a UNIQUE PRIMARY INDEX in Teradata. This way, trying to insert non-unique values causes an error and the program or workflow fails. At first, this may make your workflow fail more often, but it prevents you from ignoring the problem and forces you to fix it. This is preferable to simply hoping that the problem never occurs, which is just wishful thinking. However, make sure to check what this kind of constraint costs you in terms of performance. A uniqueness constraint typically means that every INSERT includes a uniqueness check, which slows down this step, especially as the table grows. If your data product frequently writes data into the table, the uniqueness constraint may be too expensive.
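To see this failure mode in action, here is a small demonstration using an in-memory SQLite database as a stand-in for Teradata; the PRIMARY KEY constraint plays the role of the UNIQUE PRIMARY INDEX, and the table and column names are made up.

```python
import sqlite3

# In-memory SQLite database; the PRIMARY KEY constraint rejects duplicate keys.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (CUSTOMER_ID INTEGER PRIMARY KEY, NAME TEXT)")

conn.execute("INSERT INTO customers VALUES (1, 'Alice GmbH')")
try:
    # Second row with the same primary key: the constraint makes this fail loudly
    # instead of silently creating a duplicate.
    conn.execute("INSERT INTO customers VALUES (1, 'Alice GmbH (duplicate)')")
except sqlite3.IntegrityError as err:
    print(f"Rejected duplicate: {err}")
```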

Takeaways:

  • When you spot duplicate rows in your data, don't fight the symptoms by adding a deduplication step at the end. Fix the problem as early as possible in your workflow.

  • Use uniqueness constraints. They will force you to fix problems instead of ignoring them. It may be inconvenient now, but it will help you in the long run.

[1] https://en.wikipedia.org/wiki/Enterprise_resource_planning
