Let's say you have a script that generates features for every combination of customers and offers, for example as input for a recommendation engine. Usually, this would be one long script that defines the time period of interest, derives the relevant set of customers and offers from it, applies some filters, calculates the features, and writes them to the result table.

The modular approach goes like this: a single script specifies the time period of interest and writes it to table TIME. TIME is the input for another script that derives the relevant set of customers and offers and writes them to tables CUSTOMERS and OFFERS. The core script then takes tables TIME, CUSTOMERS and OFFERS and uses them to generate table CUSTOMER_OFFER_FEATURES, which is a cross join of CUSTOMERS and OFFERS and contains all feature columns, which are still completely empty at this point. Now you can write an arbitrary number of feature scripts, each of which calculates a single feature or a group of features and merges the result into CUSTOMER_OFFER_FEATURES.
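The modular layout described above can be sketched in a few lines. This is only a minimal illustration, assuming Python with an in-memory SQLite database standing in for the real warehouse. The source tables raw_purchases and raw_offers, their columns, the two features purchase_count and offer_price, and all function names are hypothetical; only the table names TIME, CUSTOMERS, OFFERS and CUSTOMER_OFFER_FEATURES come from the text.

```python
import sqlite3

FEATURE_COLUMNS = ["purchase_count", "offer_price"]  # hypothetical features


def load_raw_data(con):
    """Create hypothetical source tables; a real pipeline would already have them."""
    con.execute("CREATE TABLE raw_purchases (customer_id INTEGER, offer_id TEXT, purchase_date TEXT)")
    con.execute("CREATE TABLE raw_offers (offer_id TEXT, price REAL, valid_from TEXT)")
    con.executemany("INSERT INTO raw_purchases VALUES (?, ?, ?)", [
        (1, "A", "2024-01-10"), (1, "A", "2024-01-20"),
        (2, "B", "2024-01-15"), (3, "A", "2023-12-01"),  # last row is outside the period
    ])
    con.executemany("INSERT INTO raw_offers VALUES (?, ?, ?)", [
        ("A", 9.99, "2023-11-01"), ("B", 19.99, "2024-01-05"),
        ("C", 4.99, "2024-03-01"),  # starts after the period
    ])


def write_time(con, start, end):
    """Script 1: store the time period of interest in TIME."""
    con.execute("CREATE TABLE TIME (period_start TEXT, period_end TEXT)")
    con.execute("INSERT INTO TIME VALUES (?, ?)", (start, end))


def write_customers_and_offers(con):
    """Script 2: derive CUSTOMERS and OFFERS from TIME."""
    con.execute("""
        CREATE TABLE CUSTOMERS AS
        SELECT DISTINCT p.customer_id AS customer_id
        FROM raw_purchases p, TIME t
        WHERE p.purchase_date BETWEEN t.period_start AND t.period_end
    """)
    con.execute("""
        CREATE TABLE OFFERS AS
        SELECT o.offer_id AS offer_id
        FROM raw_offers o, TIME t
        WHERE o.valid_from <= t.period_end
    """)


def create_feature_table(con):
    """Core script: cross join CUSTOMERS and OFFERS, feature columns still empty."""
    empty_cols = ", ".join(f"NULL AS {name}" for name in FEATURE_COLUMNS)
    con.execute(f"""
        CREATE TABLE CUSTOMER_OFFER_FEATURES AS
        SELECT c.customer_id AS customer_id, o.offer_id AS offer_id, {empty_cols}
        FROM CUSTOMERS c CROSS JOIN OFFERS o
    """)


def feature_purchase_count(con):
    """Feature script: purchases of this offer by this customer within TIME."""
    con.execute("""
        UPDATE CUSTOMER_OFFER_FEATURES
        SET purchase_count = (
            SELECT COUNT(*) FROM raw_purchases p, TIME t
            WHERE p.customer_id = CUSTOMER_OFFER_FEATURES.customer_id
              AND p.offer_id = CUSTOMER_OFFER_FEATURES.offer_id
              AND p.purchase_date BETWEEN t.period_start AND t.period_end
        )
    """)


def feature_offer_price(con):
    """Feature script: price of the offer."""
    con.execute("""
        UPDATE CUSTOMER_OFFER_FEATURES
        SET offer_price = (
            SELECT o.price FROM raw_offers o
            WHERE o.offer_id = CUSTOMER_OFFER_FEATURES.offer_id
        )
    """)


def run_demo():
    con = sqlite3.connect(":memory:")
    load_raw_data(con)
    write_time(con, "2024-01-01", "2024-01-31")
    write_customers_and_offers(con)
    create_feature_table(con)
    # Feature scripts are independent of each other and can be added freely.
    for feature_script in (feature_purchase_count, feature_offer_price):
        feature_script(con)
    rows = con.execute("""
        SELECT customer_id, offer_id, purchase_count, offer_price
        FROM CUSTOMER_OFFER_FEATURES ORDER BY customer_id, offer_id
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Each function plays the role of one script. Because the feature scripts only read the shared input tables and write their own columns in CUSTOMER_OFFER_FEATURES, adding a new feature means adding one more function to the loop without touching the others.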