beginners-guide-to-clean-data


Outliers

In a sense, outlier and unexpected value are just different terms for the same thing. However, I prefer to reserve the term outlier for numeric values beyond the range of values that you would expect.

A tangible example is the height of a person in meters. For a grown-up, you would expect values somewhere between 1.60 and 2.10. Even a value of 2.15 might still be within the range of possibility, but values like 56.35 or -4.20 would not.

The way to check for outliers like that is fairly obvious. Define a plausible range for the variable and count the number of values beyond the limits. In cases like the above example of body height in meters, you have a vague idea about the plausible range of values. In these cases, the quick win solution would simply be to arbitrarily select two extremes and check for values outside of that range. I would set the lower boundary to 0.1 and the upper boundary to 3.0.
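A minimal sketch of such a check, using pandas and a few hypothetical height values (the data and the 0.1/3.0 bounds are just the illustrative numbers from above):

```python
import pandas as pd

# Hypothetical sample: body heights in meters, including two broken values
heights = pd.Series([1.72, 1.85, 56.35, 2.05, -4.20, 1.60])

# Generous, arbitrarily chosen plausibility bounds
LOWER, UPPER = 0.1, 3.0

# Count the values outside the plausible range
n_outliers = (~heights.between(LOWER, UPPER)).sum()
print(n_outliers)
```

Here, 56.35 and -4.20 fall outside the range, so the check reports two implausible values.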

Your first instinct might be to make these plausibility intervals as small as possible, but doing so doesn't make them more valuable the way it does with confidence intervals. The cost of making the interval a little too large is very small. The purpose of these intervals is to catch errors in the data, and most errors - especially those that will completely break your data product (analysis, report, recommendation engine, …) - are extreme cases like -136,000.0 m or 9,999.9 m. These can usually be handled automatically by some imputation method, or by excluding the corresponding record, because it is a safe bet that they are wrong. Nobody will blame you for not including them in your analysis. With smaller values like 2.25 m, automatic exclusion may even damage your analysis, because there is a chance that the value is correct, perhaps even a particularly interesting one.

If you don't know which values are plausible, you can use percentiles as a starting point. By definition, there are always values smaller or larger than a given percentile, so you need to add or subtract a sufficiently large margin. As a rule of thumb, you can take the 95th percentile and add the interquartile range (IQR) - the difference between the 75th and the 25th percentile - to obtain an upper limit. However, when you calculate these ranges dynamically as part of your pipeline, be aware that they depend heavily on the input data, so they will be different every time you run your pipeline with new data. While calculating the plausible range dynamically is convenient, I think it's too error-prone in most cases and would generally advise against it.
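The rule of thumb above can be written in a few lines. This sketch uses the integers 1 to 100 as stand-in data so the result is easy to follow; the symmetric lower limit is an assumption of mine, not part of the rule as stated:

```python
import pandas as pd

# Illustrative stand-in data: the integers 1..100
values = pd.Series(range(1, 101), dtype=float)

# Rule of thumb: 95th percentile plus the interquartile range
q25, q75, q95 = values.quantile([0.25, 0.75, 0.95])
iqr = q75 - q25
upper_limit = q95 + iqr

# A symmetric lower limit (my assumption): 5th percentile minus the IQR
lower_limit = values.quantile(0.05) - iqr
```

Remember the caveat from above: these limits are recomputed from whatever data the pipeline receives, so they shift with every new batch.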

When you discover outliers, the obvious follow-up question is how to handle them. I'm a fan of simple and efficient solutions, so my favorite approach is to simply exclude the corresponding records. The problem with exclusion is that there may actually be some degree of truth to these values. Of course, a person will not be 9,999.9 m tall, but when you look, for example, at the revenue of individual customers, there may actually be a single person who spent 200,000.00 EUR with your company in the last year, even though the average customer only spends 200.00 EUR. But even if the 200,000.00 EUR is accurate, including it in the training data for a model may hurt you, depending on whether your model is sensitive to outliers and which metric you use to assess model quality. Unfortunately, there are also situations where discarding a record is not an option. If you are using a model to predict something about this particular customer, you need to handle the outlier in some other way.
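Exclusion itself is a one-liner. A sketch with hypothetical revenue figures (the 5,000.00 EUR limit is the illustrative cap used below, not a recommendation):

```python
import pandas as pd

# Hypothetical customer revenue in EUR, with one extreme value
revenue = pd.Series([180.0, 220.0, 200000.0, 150.0])

# Drop records outside an assumed plausible range
plausible = revenue.between(0.0, 5000.0)
revenue_clean = revenue[plausible]
```

Note that this shrinks the dataset, which is exactly why it is not an option when every record must receive a prediction.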

That is why I often apply plausibility capping. By capping all values at 5,000.00 EUR, you may misrepresent the 200,000.00 EUR customer, but you save your data product. After all, the basic information - that this is a customer who spends much more than the average customer with your company - is still there.
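Capping is equally short; with pandas it is a single `clip` call on the same hypothetical revenue data:

```python
import pandas as pd

# Hypothetical customer revenue in EUR, with one extreme value
revenue = pd.Series([180.0, 220.0, 200000.0, 150.0])

# Cap everything above the plausibility limit instead of dropping it
CAP = 5000.0
revenue_capped = revenue.clip(upper=CAP)
```

Unlike exclusion, every record survives, so the capped series can still be fed into a model that must score each customer.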

Obviously, this type of automatic outlier handling is not recommended for reports where the accuracy of the numbers is essential. It is usually helpful in situations where you are preparing data to be used for the training or scoring of a predictive model that powers a recommendation engine or creates forecasts. In these cases, including the unaltered outlier value will have a far worse effect than including the capped value, even though it is not the truth.

When you handle these types of outliers automatically either by capping the value or by excluding the record, there is always the danger of excluding or altering too many records. I recommend having some kind of data quality check that counts the number of capped or excluded values and alerts you when it exceeds a certain threshold like 5% of the total number of records. When you are capping 5, 10 or 15 percent of all your records, you have a bigger underlying data quality issue that you need to be aware of.
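Such a guard can be a tiny helper that is run after each pipeline execution. The function name and the 5% default are my illustrative choices:

```python
def capping_alert(n_modified: int, n_total: int, threshold: float = 0.05) -> bool:
    """Return True when the share of capped or excluded records
    exceeds the alert threshold (default: 5% of all records)."""
    return n_modified / n_total > threshold
```

For example, `capping_alert(10, 100)` fires, while `capping_alert(3, 100)` does not; how the alert is delivered (log, e-mail, dashboard) is up to your monitoring setup.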

One problem that I didn't address here is outliers in a high-dimensional space. Up to this point, we have only looked at each dimension individually. A record can be an outlier even though the value of a certain dimension is completely within the expected range of its marginal distribution, i.e. the distribution of that value across all records. For example, a weight of 65 kg for a human body doesn't seem like an outlier. But when the individual in question is a two-month-old baby, it's very unlikely that this weight is accurate. This is because the conditional distribution of weight, given an age between 0 and 12 months, is very different from the marginal distribution[2].
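A pragmatic middle ground, when a conditioning variable is available, is to define a plausible range per group rather than one global range. A sketch with hypothetical age groups and weight ranges (all numbers are assumptions for illustration):

```python
import pandas as pd

# Hypothetical records: body weight in kg together with an age group
df = pd.DataFrame({
    "age_group": ["infant", "adult", "infant"],
    "weight_kg": [4.5, 65.0, 65.0],
})

# Assumed plausible weight ranges per age group
RANGES = {"infant": (2.0, 15.0), "adult": (30.0, 300.0)}

def conditional_outlier(row):
    low, high = RANGES[row["age_group"]]
    return not (low <= row["weight_kg"] <= high)

df["outlier"] = df.apply(conditional_outlier, axis=1)
```

Here the 65 kg adult passes, while the same 65 kg value is flagged for the infant - exactly the conditional-versus-marginal distinction described above, without any high-dimensional machinery.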

Identifying this type of outlier is much more complicated. Following the Pareto principle[1], using the methods described above to detect the most obvious outliers will suffice as a data quality check in 80% of the cases. Detecting seemingly plausible values as outliers, or detecting outliers in high-dimensional space, requires much more effort, is hard to automate, and is often not worth it. It is much more important in anomaly detection, e.g. when trying to detect fraud, but that's not within the scope of this book.

Takeaways:

- Define a plausible range and check for outliers. The plausible range can be very wide. You don't want to find interesting records, but records that are obviously incorrect
- Use capping to reduce the effect of extreme outliers, unless the accuracy of the numbers is critical. For most machine learning purposes, capping values will benefit the model accuracy
