As a data scientist, you will often encounter and work with people from the business side of your company. After all, your work is often aimed at making their lives easier and allowing them to make better decisions.
This is an interesting experience, because as you learn about the business side and explore the data, you will discover discrepancies between what you see and what you are being told. The reason for this is that people on the business side are concerned with running the company, which usually makes them focus on whatever is important at the moment. Thanks to reporting tools and their personal expertise, they are often well informed about the part of the company that they are responsible for. It is usually quite easy for them to give you answers about the typical values of their most important key metrics.
The discrepancies that I mentioned before are what people call 'freak cases'. Freak cases - in this context - are irregular business cases or transactions that deviate from the norm. People from the business side are often unaware of them because they are concerned with the 95% of regular cases that their day-to-day business is all about.
For you as a data scientist, however, these freak cases often make your life harder than it needs to be. Whether you are analyzing data, preparing a report or building a data product, you usually can't decide to ignore these cases. Outliers, NULL values in unexpected places, deviations from the default data format - freak cases come in many different shapes and sizes. The frustrating thing about them is that they are often completely irrelevant to the question you are trying to answer, but you still need to deal with them. When you are building a data product, it needs to be able to process them in some way. When you are building a report, they need to be included, otherwise the results would be incomplete and the figures incorrect.
I recently stumbled upon freak cases when I developed a simple heuristic to detect transactions with unusual prices. From historical transaction data, I calculated plausibility intervals for about 75,000 articles. The intervals were wide enough to allow for some volatility caused by seasonal changes in price, but sufficiently narrow to detect errors caused by people intentionally or unintentionally manipulating the price. The data was coming from hundreds of different warehouses, each one with different suppliers and different contracts for similar products. When someone entered a price that was significantly too high or too low, the system would require a clerk to check the transaction and either verify that the price was correct or stop the transaction.
One of the main concerns was the number of transactions that a clerk would have to inspect manually every day. If this number was too high, the clerks would not want to work with the system. My employer was willing to overlook a few minor errors as long as the most important - i.e. most expensive - errors were caught. I tested my solution on three months' worth of data, looking at the daily number of outliers detected by the system. Everything looked good. The number of cases to be handled every day was small enough to not overwhelm the clerks… except for a period of two weeks where the number of outliers suddenly exploded.
I had prepared a visualization that was supposed to show how convenient my solution would be for the clerks, but the values for these days stuck out like a sore thumb. I looked into the data and discovered that most of the outliers from these two weeks belonged to a specific warehouse. Later, I learned that this warehouse was new and that a software error had caused the transmission of incorrect prices. The error had remained undiscovered for days, with thousands of transactions with bad prices going unnoticed. The people on the business side were still busy cleaning up the mess.
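To make the idea more concrete, here is a minimal sketch of what such a heuristic could look like. It is not the solution I actually built: the interval rule (quartiles widened by a multiple of the interquartile range), the factor `k`, the minimum number of observations and all function names are assumptions chosen purely for illustration. The last function counts flagged transactions per day and warehouse, which is the kind of breakdown that makes a single misbehaving warehouse visible.

```python
from collections import defaultdict
from statistics import quantiles


def plausibility_intervals(history, k=3.0, min_obs=4):
    """Build a (low, high) price interval per article.

    history: iterable of (article_id, price) pairs from past transactions.
    The IQR-based rule and the defaults for k and min_obs are assumptions.
    """
    prices = defaultdict(list)
    for article_id, price in history:
        prices[article_id].append(price)

    intervals = {}
    for article_id, values in prices.items():
        if len(values) < min_obs:
            continue  # too little history to say what is plausible
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        # A larger k tolerates more volatility, e.g. seasonal price changes.
        intervals[article_id] = (max(0.0, q1 - k * iqr), q3 + k * iqr)
    return intervals


def needs_review(article_id, price, intervals):
    """Return True if a clerk should check this price."""
    bounds = intervals.get(article_id)
    if bounds is None:
        return False  # assumption: articles without an interval are not flagged
    low, high = bounds
    return not (low <= price <= high)


def daily_review_counts(transactions, intervals):
    """Count flagged transactions per (day, warehouse).

    transactions: iterable of (day, warehouse, article_id, price) tuples.
    """
    counts = defaultdict(int)
    for day, warehouse, article_id, price in transactions:
        if needs_review(article_id, price, intervals):
            counts[(day, warehouse)] += 1
    return dict(counts)
```

Running `daily_review_counts` over a test period gives an estimate of the daily workload for the clerks, and grouping by warehouse shows quickly whether a spike comes from one source or from everywhere at once.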
Entering extreme values or freak cases into your data product - for example entering a very exotic product that your company only sold for a short period of time into a recommendation engine - will often lead to bad results. When people are evaluating your work, this is often what they will look at, especially when they are skeptical or trying to challenge your work. Unfortunately, freak cases often appear as outliers in visualizations, so they will immediately attract the attention of your audience. People love to focus on outliers because they are so irregular, so weird, so interesting. They want to understand what causes them and what effects they have. Seeing the quality of your product crumble because of a few spurious data records can be disheartening for you and disappointing for the people you are working for. The important thing here is to keep the right perspective. If your product works great 99% of the time, don't throw it away or feel bad because of the 1% of the time where it fails. I am not telling you to ignore these cases, but to put them into perspective, both for yourself and when you are presenting your work.
- Have realistic expectations. Accept that no data product will work well for every type of record you plug in.
- Judge freak cases carefully. A product that works well for the majority of inputs can still be a good and valuable product, even if it fails hard in some situations.