The value of data

'Without data, you're just another person with an opinion' (W. E. Deming)

Data has been called 'the oil of the digital era'. Watching the rise of Google, Facebook, Amazon and other companies that are strongly data-driven, this statement does not seem too absurd. Even in companies that are not explicitly data-centered, data has always been a byproduct. It is produced and processed at numerous points within a company: sales, accounting, procurement, production and marketing, just to name a few. Collecting, storing, integrating and visualizing this data through data warehousing and reporting allows people in the company to make decisions informed by data. Such decisions are based on the real situation of the company rather than guesswork and gut instinct. This is a great step forward, but there are still many small to medium-sized companies that do not monitor their most important metrics. I have seen a company struggle to calculate the revenue from individual sales because of bad accounting tools and practices. If you cannot calculate the revenue from a sale, you cannot calculate your margin, one of the most important metrics of all. This may sound like an extreme case, but it is more common than you would think.

While good data warehousing and reporting are very important and should not be overlooked, these methods are not new. The technologies and concepts have been around for a few decades. You would not call this data science, even though many of the results produced by data scientists will eventually end up in this environment.

What most people associate with data science are buzzwords like data mining, pattern recognition, recommender systems, anomaly detection, predictive models, classification and automated decision-making, just to name a few. While each of these topics focuses on a different aspect of data analysis, they often use the same underlying methods and algorithms. Interestingly, one method can often be applied to many superficially different problems from different industries. For example, the techniques used for topic detection in text mining are the same techniques used to calculate product or category affinity measures in retail through collaborative filtering. A data scientist does not necessarily need to know all these methods in great detail, but he or she should be familiar with them, know them conceptually and be able to understand when and where they are applicable.

But what exactly is the job of the data scientist? Generally speaking, a data scientist is a problem-solver. He or she will work together with the experts from different departments, identify problems and figure out how his or her toolset can help to solve them. A common misconception - even among data scientists - is that they will work on completely new, previously unseen problems. This is rarely the case. Within every department, most problems already have a solution. After all, the people working in that department are not sitting there all day doing nothing. However, many of the things people come up with are simple makeshift solutions that involve manual tasks or simple heuristics. They work okay, but they don't make good use of the available data.

The value of the data scientist lies in his or her ability to spot where the above-mentioned methods can be applied to automate and improve an existing solution or replace it with a better one. Instead of manually selecting a segment of customers to receive a specific offer, a recommendation engine can select the next best offer for each individual customer from a pool of offers. Instead of manually checking accounting records for fraudulent transactions, an algorithm can identify irregular posting patterns on a statistical basis. Instead of manually classifying hundreds of thousands of bank transactions, a data scientist can use a classification algorithm and a small training set of manually labeled transactions to automatically classify the rest of the transactions and any new transactions within seconds.
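To make the last example more concrete, here is a minimal sketch of how a handful of manually labeled transactions could be used to classify new ones. It assumes that each transaction comes with a free-text description and uses a small naive Bayes classifier written in plain Python. The descriptions, category names and function names (`train`, `classify`) are made up for illustration; a real project would more likely rely on an established machine learning library and a much larger training set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split a transaction description into lowercase words."""
    return text.lower().split()


def train(labeled):
    """Count words per category from (description, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    class_counts = Counter()            # category -> number of examples
    vocab = set()
    for text, label in labeled:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, class_counts, vocab


def classify(text, model):
    """Return the most likely category (naive Bayes, add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the category
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually labeled training set (illustration only).
labeled = [
    ("supermarket card payment", "groceries"),
    ("organic supermarket purchase", "groceries"),
    ("monthly rent transfer landlord", "housing"),
    ("rent apartment march", "housing"),
    ("city transit monthly ticket", "transport"),
    ("fuel station card payment", "transport"),
]

model = train(labeled)
for description in ["supermarket purchase downtown", "rent payment april"]:
    print(description, "->", classify(description, model))
```

Once the model is trained, classifying an additional transaction is a single call to `classify`, which is what makes it feasible to process very large numbers of records.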

The output of a data scientist can take many different forms. This includes, for example:

  • a one-off 'ad-hoc' data analysis trying to answer a question from a product manager

  • a report or dashboard highlighting key performance indicators (KPIs) and exploring their development over time

  • a data integration pipeline extracting data from some production system, enriching it with data from additional sources and preparing it to be used in other applications

  • a recommendation engine providing users with personalized offers or relevant ads

  • targeting software that selects the right audience for a marketing campaign

  • a fraud detection system that identifies outliers by scoring records and ranking them so that the most suspicious records can be handed over to experts for a manual review

All of these products are centered around processing and analysing data, so I will simply refer to them as 'data products' for the rest of the book.
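As an illustration of what a very small data product might look like, the following sketch scores and ranks records in the spirit of the fraud detection item above. It assumes that each record has a numeric `amount` field and uses a robust z-score (the distance from the median, divided by the median absolute deviation) as the outlier score. The field names, the sample records and the choice of score are assumptions made for this example, not a recommended fraud model.

```python
from statistics import median


def score_records(records, key="amount"):
    """Attach an outlier score to each record based on a robust z-score."""
    values = [r[key] for r in records]
    med = median(values)
    # Median absolute deviation: a spread measure that is barely
    # affected by the outliers we are trying to find.
    mad = median([abs(v - med) for v in values])
    scale = mad if mad > 0 else 1.0  # avoid division by zero
    return [dict(r, score=abs(r[key] - med) / scale) for r in records]


def rank_suspicious(records, key="amount"):
    """Return the records sorted from most to least suspicious."""
    scored = score_records(records, key)
    return sorted(scored, key=lambda r: r["score"], reverse=True)


# Hypothetical accounting records (illustration only).
records = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 95.5},
    {"id": 3, "amount": 110.0},
    {"id": 4, "amount": 9800.0},
    {"id": 5, "amount": 101.25},
    {"id": 6, "amount": 130.0},
]

ranked = rank_suspicious(records)
for r in ranked[:3]:
    print(r["id"], r["amount"], round(r["score"], 2))
```

The ranking matters more than the absolute score: experts can start their manual review at the top of the list and stop wherever the records begin to look unremarkable.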
