Feature stores and feature engines

Another way to ensure consistency is to build a feature store or feature engine that everyone else can work with. A feature store is a table that contains a large number of precomputed features that other people can use for their models.[1] It's like a set of Lego(R) bricks that you provide to your team so that they can focus on model development and model management rather than on feature generation.

Unfortunately, this approach requires a lot of space to store every feature for every single observation, and you may need to recalculate a feature for every observation if the code ever changes - which it will, inevitably. On top of that, there is often a temporal dimension to consider. In retail, a feature may describe the engagement of a customer during the last year, where 'last year' depends on a reference point in time, e.g. a reference month or a reference week. For the feature store to be complete, you need the value of the feature not only for every customer but also for every reference point in time.
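A minimal sketch of this keying scheme, assuming an illustrative customer engagement feature and monthly reference points (all identifiers here are invented for the example):

```python
# A point-in-time feature store: each entry is keyed by both the
# entity (customer_id) and a reference point in time (reference_month).
feature_store = {
    # (customer_id, reference_month): engagement over the year ending that month
    ("c1", "2023-12"): 0.82,
    ("c1", "2024-01"): 0.79,
    ("c2", "2023-12"): 0.35,
}

def lookup_engagement(customer_id: str, reference_month: str) -> float:
    """Fetch the precomputed feature for one customer at one point in time.

    Raises KeyError if that (customer, month) combination was never
    materialized - one reason a complete store grows so large.
    """
    return feature_store[(customer_id, reference_month)]

print(lookup_engagement("c1", "2024-01"))  # 0.79
```

Note that the store must hold one row per customer *per reference month*; dropping the time component from the key would silently mix features computed at different points in time.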

Another problem with the feature store is that data is less flexible than code. With code, you can easily deploy different versions of the same software to different staging environments: a stable version runs in production while your development version runs in your test environment. However, as soon as you write the data into a table with a fixed schema, you create dependencies across the stages. Good luck replacing an erroneous prod version with an older release of your software. The old code will probably not work with the newer table schemas, so you need to run a series of DDL statements to get from one set of schemas to the other. And even then, you cannot 'undelete' a column or a table.
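A toy illustration of this one-way street, using Python's built-in SQLite driver (the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# v1 schema: the stable production release reads and writes these columns.
conn.execute("CREATE TABLE features (customer_id TEXT, engagement REAL)")
conn.execute("INSERT INTO features VALUES ('c1', 0.82)")

# v2 schema: the development version needs an extra column, so the shared
# table is migrated - and from this moment on, every stage that reads the
# table sees the new schema, including code that never expected it.
conn.execute("ALTER TABLE features ADD COLUMN churn_risk REAL")

# Rolling back to the v1 release now means running DDL in the other
# direction, and dropping the column would discard its data for good.
columns = [col[1] for col in conn.execute("PRAGMA table_info(features)")]
print(columns)  # ['customer_id', 'engagement', 'churn_risk']
```

With code alone, 'rolling back' is just deploying an older artifact; with a shared table, every rollback is a migration of its own.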

A nice solution to that problem is the feature engine. Here, you simply provide the calculation rule, that is, the code that calculates the feature. Your team members or users select parameters such as the observations they want the features to be calculated for, and your feature engine does the rest. The major advantage of this approach is that you don't need to worry about whether your data is up to date or whether it was calculated with a previous version of your code; after all, it is calculated fresh every time somebody uses it. The only interdependencies limiting your freedom in each stage are the underlying raw data from which you calculate the features and the application that consumes your output. But if you followed the recommendations from earlier and built clean and robust interfaces for the input and output data, these external dependencies don't pose a problem. The obvious disadvantage of the feature engine is that it is much slower than a quick lookup in an existing feature store table. When you are working with large amounts of data, calculating features on the fly can be very costly or even impossible.
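A minimal sketch of a feature engine, assuming hypothetical raw transaction data and two invented calculation rules - the point is that the engine stores code, not precomputed values:

```python
from typing import Callable, Dict, List

# Hypothetical raw data: transaction amounts per customer. In a real
# system this would come through your cleanly defined input interface.
RAW_TRANSACTIONS = {
    "c1": [120.0, 35.5, 60.0],
    "c2": [15.0],
}

# The engine holds calculation rules (code), not materialized values (data).
FEATURE_RULES: Dict[str, Callable[[str], float]] = {
    "total_spend": lambda cid: sum(RAW_TRANSACTIONS.get(cid, [])),
    "n_transactions": lambda cid: float(len(RAW_TRANSACTIONS.get(cid, []))),
}

def compute_features(
    customer_ids: List[str], feature_names: List[str]
) -> Dict[str, Dict[str, float]]:
    """Calculate the requested features fresh for the requested observations."""
    return {
        cid: {name: FEATURE_RULES[name](cid) for name in feature_names}
        for cid in customer_ids
    }

print(compute_features(["c1"], ["total_spend", "n_transactions"]))
# {'c1': {'total_spend': 215.5, 'n_transactions': 3.0}}
```

Because every call reruns the current rules against the raw data, there is no stale table to keep in sync - at the cost of recomputing the features on every request.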

[1] This is a little bit of a simplification. Recently developed feature store frameworks also offer tools for accessing and managing features in the feature store.
