Unexpected values

Nominal or categorical data is non-numeric data that can only take a limited number of values. A few examples are the name of a department in your company, the name of a city or the color of a product.

When you take the latter, you will expect to find values like 'red', 'blue' and 'green'. What you will not expect are entries like 'Ramona', 'true', 'ALL' and '1.175'. But when you are working with real world data, you will occasionally find values like this in your data.

Most software programs that generate data try to make sure that people only enter valid values, but the real world is messy and there are always a few ways in which this type of constraint can be compromised. For example, a data entry form with a drop-down menu seems like a good way to ensure that people choose only one of the allowed values for the color of the product. But perhaps, at some point, one of the developers replaced the values in the form from lower case to upper case, e.g. 'RED' instead of 'red'. Now the table in the database has records with the color 'red', but also some with the color 'RED'.

This approach can also fail when the number of potential values is too large. In theory, you could have a complete list of all cities on the planet, but it would contain tens of thousands of entries, making it completely unusable. On top of that, the entries would need to be in different languages and the list would need to be constantly curated to account for newly created locations as well as locations that are being combined or split apart. It would also be full of duplicates because humans are not that creative when coming up with city names.

Sometimes, there is nothing wrong with the values, but instead with your expectations. This is common in old databases that have seen a lot of change in the business. Old products, categories and metadata have been retired, new ones have been established. Maybe there was a special product that has only sold a few units over a small period. Maybe there was a new product that required a different data model with new columns that are set to some default value for all previous products. Very often, stuff like this has faded out of peoples' memories, especially in small companies with a lot of fluctuation in personnel, and nobody on the business side is consciously aware of it anymore. You will often have to closely work together with people on the business side to adjust your expectations to the actual data and figure out how to interpret your findings.

If you have multiple columns with nominal data, you should also check if the combinations of the values follow your expectations. Maybe your product 'carbon-frame bike' doesn't have a color option because it's grey by default, so the attribute color should always have the value 'carbon-grey'. But when you look at the combinations of values in your product column and your color column, you see this:

None of the individual values are weird or unexpected, but this particular combination of columns is. That's why you should also check for unexpected combinations of values.

Here is another recent real-world example from my work at a large retail company. I was writing a data pipeline processing offer data. An offer is basically a discount on a list of products in the form of loyalty points. For every offer, there is a list of discounted articles, for example all chocolate bars of a certain brand. An offer can contain somewhere between one and a few hundred different articles (OFFER_TYPE = 'category'). But there are also offers where you get an increased number of loyalty points on your entire purchase (OFFER_TYPE = 'basket'). In this case, there is no list of discounted articles in the database, so the count of discounted articles (ARTICLE_COUNT) should be zero[1]. However, I discovered an offer with OFFER_TYPE = 'basket' that had ARTICLE_COUNT = 11. This was an inconsistency in the data because for this type of offer, there should be no list of discounted articles. Please note that neither OFFER_TYPE = 'basket' nor the ARTICLE_COUNT = 11 are problematic in and of themselves. It is the combination of both values that made this record stand out as an indicator of a data quality problem.

A quick note regarding problems like this: When you come across such a problem in an individual record, be wary. It could be a one-off thing that only concerns this one record, but if this error somehow DID make its way into your data, chances are that something similar happened in other cases as well.

In computer science, there is a term called 'code smell'. According to Wikipedia.org, 'Smells are certain structures in the code that indicate violation of fundamental design principles and negatively impact design quality. […] Code smells are usually not bugs; they are not technically incorrect and do not prevent the program from functioning. Instead, they indicate weaknesses in design that may slow down development or increase the risk of bugs or failures in the future' [2]. Inconsistencies that you discover in individual records of your data can be an indicator of problems at an earlier part of your data pipeline or in the systems from which you obtain your data. You could call it 'data smell': While your data processing workflow may not have crashed yet, it either creates or propagates inconsistent, low-quality data. This will lead to pipeline failures and necessitate error-handling processes further down the line.

Whatever the reason for the gap between expectation and reality may be, make sure to check your assumptions about the occurring values. If you have columns with nominal data, group by these columns to display all distinct values and compare them to what you expected. If you see an unexpected value, check how often this value occurs in the data. Is it a one-off thing, a single record in a million other ones that you can simply ignore or exclude, or is it a reoccurring problem? Check what the rest of the record, i.e. the other columns, looks like. Maybe these columns will give you a hint as to what's going on with these strange records. Check whether the combination of values in the other columns make sense. If there is only a single record or a small number of records with a particular combination of values, take a closer look. Is this a plausible combination of values?

In many cases, unexpected values are caused by not filtering your data enough. This often happens when you are unfamiliar with the data because you are new to the organization. For example, you may make the following assumptions in the first example with the bike shop:

  • ASSUMPTION: The point-of-sale table contains only the stores of your company. REALITY: It contains stores as well as warehouses. The latter may need to be excluded from your analysis.

  • ASSUMPTION: The product table contains only products that have actually been sold at some point. REALITY: There's a huge bunch of dummy entries that people created for tests and that do not represent actual products.

  • ASSUMPTION: The sales transactions table contains only product sales. REALITY: It includes sales as well as returns.

If you encounter unexpected values, it is often because you included data that should have been excluded from your analysis. When talking with the people from the business side to clarify how you should interpret this particular combination, do not just show them the unexpected combination, but the number of records for which they occur as well as examples from the complete records. When they see the other attributes, they will often quickly identify where you missed a filter condition due to not being aware of the business context.

Takeaways:

  • Make assumptions about the data and check them

  • Show the unexpected values to someone more familiar with the data or the business processes behind the data. They can often help interpreting them

[1] A point can be made that the count of discounted articles for an offer where everything is discounted should be NULL, which is less misleading than zero.

[2] https://en.wikipedia.org/wiki/Code_smell

Last updated