Thinking pragmatically

The ability to get shit done - to solve a problem and implement a working solution - is one of the most important skills that one can possess. This is particularly true for data scientists working on data products.

Very often, data scientists are perfectionists by nature. Having roots in mathematics and computer science, we care about precise language, attention to detail and clean, structured thinking. This is one of our biggest strengths and one of the main reasons companies are willing to pay us good money, but it can also become one of our biggest weaknesses. Perfectionism, while highly regarded in the academic world, is diametrically opposed to getting shit done.

Data scientists strive to make workflows fully automated and reproducible. This means that every step along the way is performed by the machine and there are no manual pre- or post-processing steps aside from starting a program and a few simple parameter choices like an output directory. This is a worthwhile goal if you can achieve it. It is also very difficult to achieve. Sometimes, the time and effort required to achieve full automation and reproducibility are simply too high.
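As a minimal sketch of what such a hands-off workflow can look like - the file names, columns and processing steps here are hypothetical, not taken from any real project - the only manual action is starting the script and choosing an output directory:

```python
# run_pipeline.py - a hypothetical, fully automated end-to-end pipeline.
import argparse
from pathlib import Path

import pandas as pd


def main() -> None:
    parser = argparse.ArgumentParser(description="Example automated pipeline")
    parser.add_argument("--input", default="data/raw/transactions.csv")
    parser.add_argument("--output-dir", default="results")
    args = parser.parse_args()

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Every step - loading, cleaning, aggregating, writing - runs without
    # human intervention once the script is started.
    df = pd.read_csv(args.input, parse_dates=["booking_date"])
    df = df.dropna(subset=["amount"])
    summary = df.groupby("account_id")["amount"].sum().reset_index()
    summary.to_csv(out_dir / "summary.csv", index=False)


if __name__ == "__main__":
    main()
```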

I once developed a classification algorithm to support some colleagues who had to manually classify several hundred thousand bank transactions into a few dozen different classes. It worked surprisingly well with an accuracy of 80% on an out-of-sample test set. Unfortunately, as the result would be used in a court case, 80% accuracy was not enough - every single transaction needed to be correctly classified. I could have spent half a year improving the algorithm and might even have reached 90%. But no matter how much time I would have spent, I would never have reached 100%. Instead, we decided that I would apply my classification algorithm to the entire set of transactions and create a first 'pre-classification'. The people who manually classified the records would use this as a suggestion, keeping in mind that it may not always be correct.

As it turned out, reviewing these pre-classified records was much faster because it was easy to see - for a human - where the algorithm had made a mistake. Correcting the algorithm's mistakes was much more efficient than manually classifying every transaction from scratch.
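A sketch of how such a pre-classification step might look in practice - the model choice, feature names and file paths below are illustrative assumptions, not the ones from the actual project:

```python
# Pre-classify transactions and export them for human review.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

labelled = pd.read_csv("labelled_transactions.csv")      # already classified by hand
unlabelled = pd.read_csv("unlabelled_transactions.csv")  # still to be classified
features = ["amount", "weekday", "counterparty_id"]      # hypothetical numeric features

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(labelled[features], labelled["category"])

# Attach the suggested class and the model's confidence; reviewers can then
# sort by confidence and focus their attention on the uncertain cases.
unlabelled["suggested_category"] = clf.predict(unlabelled[features])
unlabelled["confidence"] = clf.predict_proba(unlabelled[features]).max(axis=1)
unlabelled.sort_values("confidence").to_csv("pre_classified_for_review.csv", index=False)
```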

Whenever you do something only once, or maybe only once a year, think about how much time you would need to spend to fully automate it versus how much time it would cost you to simply perform a few of the hard-to-automate steps manually.
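A quick back-of-the-envelope calculation - with entirely made-up numbers - illustrates the trade-off:

```python
# Hypothetical numbers: is full automation worth it for a yearly task?
hours_to_automate = 40        # one-off effort to script the hard-to-automate steps
manual_hours_per_run = 2      # doing those steps by hand instead
runs_per_year = 1

break_even_years = hours_to_automate / (manual_hours_per_run * runs_per_year)
print(f"Automation pays off after {break_even_years:.0f} years")  # -> 20 years
```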

An important concept for getting shit done efficiently is the idea of the minimum viable product[1], or MVP for short. According to Wikipedia.org, the MVP is 'a product with just enough features to satisfy early customers and to provide feedback for future product development'. Whatever you're trying to build, you probably have a picture of it in your head, along with a list of cool stuff that it should be able to do. But instead of building the fully developed product, ask yourself: 'How can I make this work as quickly as possible so that people can see it and use it?' Don't lose yourself in the bells and whistles of what you're trying to build; focus on the core functionality and keep everything else for later. Doing this is an art in and of itself, and there are entire books dedicated to the topic. But just thinking in terms of an MVP and identifying the core features and the nice-to-have features is often enough to speed up your work and deliver on your promises.

Don't make things unnecessarily complicated.[2] This happens to be an important life lesson that I had to learn in my 20s and that turned out to be applicable in many different areas of life and particularly helpful in data science.

In this context, it means, for example, not using overly complicated models for prediction, classification or other data science tasks. Most of the time, a simple model provides good enough results. A more complicated model may improve performance slightly, but at the price of higher complexity and slower calculations. Unless you have strong evidence that your simple model is unsuited for your data, stick to it. Don't forget that at some point, you may have to explain it to your team, your boss, your boss's boss and so on.
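A sketch of that sanity check, using scikit-learn on a synthetic dataset - swap in your own data and your own candidate models:

```python
# Compare a simple baseline against a more complex model before committing
# to the complexity. The data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient boosting", GradientBoostingClassifier()),
]

for name, model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
# If the gap is small, the simpler model is usually the better choice.
```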

Also, your first model does not need to take into account every possible feature. Pick one or two features that should have a strong effect on the target variable you're trying to predict and build a simple model with just these features. This allows you to make progress within the CRISP-DM process and start working on model deployment instead of getting stuck in the trial-and-error cycle of data preparation and modelling.
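One way to start, sketched here with placeholder column names standing in for your own data: pick the one or two features you expect to matter most and fit the simplest model that could plausibly work.

```python
# First-iteration model with just two hand-picked features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")        # hypothetical input
X = df[["age", "total_spend"]]           # the two features expected to matter most
y = df["churned"]                        # hypothetical target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Baseline accuracy: {model.score(X_test, y_test):.2f}")
# Good enough to move on towards deployment; more features can come later.
```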

Avoid costly features. Sometimes, calculating a particular feature can be very costly because you need to access the input data on a very granular level. While these features may be simple from the conceptual point of view, tuning your feature generation scripts to calculate them efficiently can become a very complex problem. Try to find similar features that capture the same type of property, but that can be calculated at a lower cost.
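For instance, a hypothetical sketch of trading a costly, granular feature for a cheaper proxy that captures a similar kind of behaviour - the table and column names are made up:

```python
import pandas as pd

# Costly: number of distinct merchants per customer, computed from hundreds
# of millions of raw transaction rows.
# distinct_merchants = raw_transactions.groupby("customer_id")["merchant_id"].nunique()

# Cheaper proxy: transaction counts from an already-existing monthly aggregate
# table capture a similar 'activity breadth' signal at a fraction of the cost.
monthly = pd.read_parquet("monthly_customer_aggregates.parquet")
activity_proxy = monthly.groupby("customer_id")["n_transactions"].sum()
```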

As for the business logic - keep it simple! In most large companies, you will meet - and probably collaborate with - people whose job it is to come up with new ideas to squeeze money out of the business. A company in the retail business may, for example, develop more and more complex ideas to acquire, retain and re-engage customers. This leads to 'complexity creep', since data structures have to evolve along with the business logic. It's hard to stop the constant influx of new business logic, but try to abstract away as much complexity as possible in your data product. Not every detail of the business logic needs to be encoded into the data.
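One way to keep that complexity out of the data, sketched with made-up campaign names: map the ever-growing list of business cases onto a small, stable set of categories and keep the detailed rules in a single lookup that is easy to change.

```python
# Map many evolving campaign types onto a few stable categories so the data
# model does not have to change every time the business invents a new scheme.
# All names here are hypothetical.
CAMPAIGN_CATEGORY = {
    "welcome_voucher_2023": "acquisition",
    "winback_email_spring": "reengagement",
    "loyalty_points_boost": "retention",
    # a new campaign only needs a new entry here, not a new column or table
}

def categorize(campaign: str) -> str:
    return CAMPAIGN_CATEGORY.get(campaign, "other")
```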

Stick to the above principles to avoid complexity building up in your product. Complexity will always slow down development and sometimes even bring it to a grinding halt. It requires more documentation. It makes it easier to break something unintentionally, which in turn makes team members more hesitant to touch parts of the product. New team members will need more time to ramp up and familiarize themselves with the intricacies of the product. I'm even willing to go so far as to say that simplicity is one of the most important virtues when building anything. You will be tempted to compromise it in favor of more complex business logic or new requirements. Don't give in, at least not easily. Always compare cost - including long-term cost - to benefit.

Takeaways:

  • When full automation is difficult to accomplish, manipulating the data by hand can be a valid option to save time and get shit done.

  • Keep things simple. Always ask yourself: 'Am I currently making something more complicated than it should be?'

[1] https://en.wikipedia.org/wiki/Minimum_viable_product

[2] In computer science, this is sometimes called the KISS principle. It stands for 'keep it simple, stupid'.
