This article explains the basics of model factories.
Note: Linked scripts were developed using Python 3.7.3 and DataRobot API version 2.19.0. Small adjustments might be needed depending on the Python version and DataRobot API version you are using.
Model Factory Definition
A model factory, in the context of data science, is a system or set of procedures that automatically generate predictive models with little or no human intervention. Model factories can have multiple layers of complexity often called modules. One module might be training models while other modules could be deploying or retraining the models.
Why build a model factory?
Consider the following scenarios:
You have 20,000 SKUs and you need to produce a sales forecast for each one of them.
You have multiple types of customers and you are trying to predict churners.
How would you tackle these? Would you build a single model? And would that single model (with a single preprocessing method) be enough?
Model Factory Architecture
If you wish to find the code to reproduce a model factory using DataRobot, use this notebook. For the purposes of this post, we will only take a look at the DataRobot model factory architecture:
Figure 1. Model factory architecture
You start by splitting the data based on a group column. The group column can be almost anything: a feature that differentiates your company's products, a customer segment identifier, or a feature that splits the data by geography.
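The splitting step can be sketched with pandas. This is a minimal illustration using made-up data; the group column name `segment` and the `sales` feature are hypothetical, not part of the linked notebook.

```python
import pandas as pd

# Toy dataset with a hypothetical group column, "segment"
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "sales":   [10, 12, 30, 28, 31],
})

# One DataFrame per group value; each becomes its own training dataset
subsets = {name: group.reset_index(drop=True)
           for name, group in df.groupby("segment")}

print(sorted(subsets))    # ['A', 'B']
print(len(subsets["B"]))  # 3
```

Each entry in `subsets` is then fed to its own modeling project in the next step.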
After splitting the data, you create a new DataRobot project for each one of the datasets. DataRobot will find the best algorithm and preprocessing technique for each one of them; then you can deploy the best model and make it ready to receive new data.
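The per-dataset loop might look like the sketch below, using the DataRobot Python client. Treat it as an outline rather than production code: the `project_name` scheme and the `sales` target are assumptions, and actually calling `train_factory` requires the `datarobot` package plus a configured `dr.Client` (endpoint and API token).

```python
def project_name(group):
    """Illustrative naming scheme: one project per data subset."""
    return f"factory_{group}"

def train_factory(subsets, target="sales"):
    """Create one DataRobot project per subset, run Autopilot on each,
    and return the top leaderboard model per group.

    `subsets` maps group name -> pandas DataFrame, as built in the
    splitting step. Requires a configured dr.Client connection.
    """
    import datarobot as dr

    best_models = {}
    for name, data in subsets.items():
        project = dr.Project.create(sourcedata=data,
                                    project_name=project_name(name))
        project.set_target(target=target)   # kicks off Autopilot
        project.wait_for_autopilot()        # block until modeling finishes
        best_models[name] = project.get_models()[0]  # leaderboard leader
    return best_models
```

From there, each model in `best_models` can be deployed so it is ready to receive new data.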
The above architecture is the absolute minimum requirement for a model factory. You could add another layer of automation in the form of automated retraining and redeployment based on accuracy and data drift, or you could also add your own custom functionality on top of the DataRobot models.
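The retraining layer mentioned above boils down to a per-deployment decision rule. Here is one hedged sketch: the accuracy and drift metrics, and the thresholds, are placeholders you would replace with whatever your monitoring actually reports.

```python
def needs_retraining(accuracy, drift_score,
                     min_accuracy=0.75, max_drift=0.2):
    """Flag a deployed model for retraining when its recent accuracy
    degrades below a floor or incoming data drifts beyond tolerance.
    Thresholds here are illustrative defaults."""
    return accuracy < min_accuracy or drift_score > max_drift

print(needs_retraining(0.80, 0.05))  # False: healthy deployment
print(needs_retraining(0.70, 0.05))  # True: accuracy dropped
print(needs_retraining(0.80, 0.30))  # True: data has drifted
```

A scheduled job could run this check over every deployment in the factory and re-run the training step only for the groups that trip it.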
The procedure described becomes seamless when you work with the DataRobot API in either Python or R, since you will not have to split the data and create multiple projects manually.
The real power of using model factories with DataRobot is that you can fit the best model for each subset of your observations while still automating out-of-sample validation, machine learning preprocessing, and deployment. On high-cardinality data, where accuracy matters, the model factory approach will almost always outperform the single-model approach, and that increase can translate into substantial business value.
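A toy example makes the intuition concrete: when groups behave very differently, one global model has to average across them, while per-group models fit each subset tightly. The numbers and the mean-based "models" below are made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A"] * 3 + ["B"] * 3,
    "sales":   [10, 11, 12, 30, 31, 32],
})

# "Single model": predict every row with the global mean
global_pred = df["sales"].mean()
# "Model factory": predict each row with its own group's mean
group_pred = df.groupby("segment")["sales"].transform("mean")

mae_single = (df["sales"] - global_pred).abs().mean()
mae_factory = (df["sales"] - group_pred).abs().mean()
print(mae_single > mae_factory)  # True: per-group fits are tighter
```

Real models are far richer than group means, of course, but the same effect drives the accuracy gains of a factory on heterogeneous data.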
You can find a Python notebook, media files, and sample training and test datasets for this model factory introduction in the DataRobot Community GitHub.