Have you ever encountered SQL queries with huge, sometimes nested `CASE WHEN` statements, often involving multiple different combinations of variables? This is a common symptom of a dissonance between the data logic and the business logic: it results from attempting to capture business logic that is not directly and immediately represented in the columns of the data table.
A simple example would be the color of your product. Perhaps your company sells T-shirts and you want to find out which colors are the most popular with your customers. However, looking into the data, you discover that aside from plain one-colored shirts, you also have some with patterns and prints in various colors on them. Aside from the column `COLOR`, which you thought would give you all the information you need, you discover that there's another column called `PATTERN_COLOR` that contains a comma-separated list of the most notable colors in a shirt's pattern or print. This means that a T-shirt may qualify as 'blue' not only when `COLOR = 'blue'`, but also when 'blue' is contained as one of the entries in `PATTERN_COLOR`. This also means that a white-and-blue striped shirt can belong to the set of white shirts AND the set of blue shirts at the same time. The simple question has just gotten a lot more complex, and you need to figure out how you want to deal with this level of complexity.
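To make the problem concrete, here is a minimal sketch of the blue-shirt query, run against an in-memory SQLite database. The `shirts` table and its contents are hypothetical, but the two columns mirror the `COLOR` and `PATTERN_COLOR` situation described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shirts (id INTEGER, color TEXT, pattern_color TEXT)")
conn.executemany(
    "INSERT INTO shirts VALUES (?, ?, ?)",
    [
        (1, "blue", None),          # plain blue shirt
        (2, "white", "blue,red"),   # white shirt with a blue-and-red print
        (3, "red", "green"),        # no blue anywhere
    ],
)

# A shirt counts as 'blue' if COLOR is blue OR 'blue' appears as one entry
# in the comma-separated PATTERN_COLOR list. Padding both sides with commas
# avoids accidentally matching substrings such as 'lightblue'.
blue_ids = [row[0] for row in conn.execute(
    """
    SELECT id FROM shirts
    WHERE color = 'blue'
       OR ',' || COALESCE(pattern_color, '') || ',' LIKE '%,blue,%'
    """
)]
print(blue_ids)  # shirts 1 and 2 both qualify as blue
```

Note that shirt 2 would be counted in a query for white shirts as well, so the per-color counts no longer add up to the total number of shirts; that is exactly the kind of decision you have to make explicit.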
This happens a lot more often than you might think. I have encountered this type of problem when trying to classify job ads posted on a job board website, as well as when trying to classify different kinds of discount offers at a retail company.
Typically, someone from the business side asks a very simple but somewhat broad question, and as you dig into the data, you discover that the answer to this simple question is a lot more complex than you expected. Let's take the job posting example. You work for a company that runs a job board on the internet. Other companies pay you to post their job postings on your website. A manager from your company may ask you a question as simple as 'How many job postings do we have on our website?' How do you answer this question? Specifically:
- Should you include the internship and apprenticeship postings in your analysis?
- Should you only count the paid job postings and exclude the postings that you received from crawling company websites?
- What about external job ads that you list in your search results, but that link out to a partner site?
- Should you only count the regular job postings that are valid for 30 days or also the long-running job ads that are valid throughout the entire year?
- What about the dynamic slots that you sold to a bunch of large companies that allow them to fill in their own ads?
- When you have answered all of these questions, you may wonder if you should include the large batch of job ads that a sales person has just allowed a business partner to upload for free in order to test if your site can deliver them applicants. After all, they make up 25% of your current content, leading to a spike that is not representative of the usual number of job postings on your site.
Meanwhile, the manager who asked you about the number of job postings is getting pissed off because he or she doesn't understand how the simple task of counting the number of job postings on your site can take this long.
The above example is a typical case of complex business logic. When the business logic is this complex, the data logic is usually even more involved. The job postings obtained from crawling company websites may be stored in a completely different table than the regular 30-day job postings that you are selling to your customers. The 60-day and 90-day job postings may be in the same table as the latter, but identifiable via the attribute `NUMBER_OF_DAYS_VALID`. But there are also some postings that are valid for 15, 31 or 37 days, so those are somewhat exceptional. The apprenticeship postings may also be in the same table, but differ in another binary attribute called `IS_APPRENTICESHIP_AD`. That huge batch of 14-day test postings that you gave to a potential new customer? They may look exactly like regular job postings, so you need to access the accounting data to figure out which of the job postings belong to the test batch.
Navigating the complexity of business concepts and how they are mirrored in the data is one of the hardest and most frustrating things about the job of the data scientist. It's often frustrating because the business side is pushing towards an ever-increasing level of complexity in the product while the data side is trying to keep up. At the same time, improving the product and customizing it to the customers' needs is a worthwhile goal and an indispensable part of a successful business. When you are responsible for the data side of things, you don't want to be the person slowing down the company's ability to adapt to new circumstances.
Writing this chapter is a little bit unsatisfying for me. I wish I could give you some good advice on how to deal with this form of complexity, but there is no recipe to follow. The key takeaway from this chapter is that you should be aware that simple questions can become much more complex when you dig into the data and the business concepts. You need to work closely with the people from the business side to get their feedback on these issues. If they are hard to get hold of, decide these questions on your own and write down how you handled each complexity within your analysis. It is very useful to provide transparency about your assumptions and decisions: not just for the colleagues who receive your analysis, but also for your future self, who may wonder how you handled the different decisions that you had to make along the way.
It's also important to understand that it is YOUR task to keep the complexity from creeping into your data product. As mentioned above, the business side, in its quest to improve the product and customize it to the customers' needs, is pushing towards ever-increasing complexity. You should try to support them while at the same time figuring out which parts of the complexity to mirror in the data and which parts to ignore. For example, an algorithm predicting the number of clicks or page views a job posting will get could assume that such a posting has a runtime of 30, 60 or 90 days. Any deviation like 31 or 35 days could simply be rounded to the next-closest allowed value, in this case 30, to simplify the feature.
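The rounding described above fits in a few lines. The set of allowed runtimes and the function name are assumptions for illustration:

```python
# Snap an arbitrary posting runtime to the nearest 'allowed' value,
# so the prediction model only ever sees 30, 60 or 90 as a feature.
ALLOWED_RUNTIMES = (30, 60, 90)

def normalize_runtime(days: int) -> int:
    """Return the allowed runtime closest to the actual runtime."""
    return min(ALLOWED_RUNTIMES, key=lambda allowed: abs(allowed - days))

print(normalize_runtime(31))  # 31 is rounded down to 30
print(normalize_runtime(80))  # 80 is rounded up to 90
```

This is a deliberate loss of information: the exceptional runtimes still exist in the business, but you have decided not to mirror them in the feature.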
- Simple questions can become surprisingly complex when you start digging into the data.
- It's your task to keep the complexity of the outside world at bay.