You can find the latest information on preparing training data in the DataRobot public documentation. Also, click ? in-app to access the full platform documentation for your version of DataRobot.
(Article updated September 2020)
This article showcases DataRobot’s Feature Discovery capability, which lets you combine datasets of different granularities to enable automatic feature engineering.
More often than not, features are split across multiple data assets, and considerable work goes into bringing those assets together: joining them and then running machine learning models on top. The problem becomes even harder when the datasets have different granularities; in that scenario, someone has to aggregate the data before it can be joined successfully.
Feature Discovery solves this problem by automating the joining and aggregation of these datasets. Once you define how the datasets should be joined, you can leave feature generation and modeling to DataRobot.
Data and Business Problem
For the purposes of this demonstration, we are going to use data taken from Instacart, an online grocery-shopping aggregator. The business problem we will try to solve: can we predict whether a customer is likely to purchase a banana?
We have three datasets that we want to use for this:
Users table: This table has information on users and whether or not they bought bananas on given order dates.
Orders table: This table has information on historical orders made by a user. A user will join with multiple orders from this table.
Transactions table: This table has information on the specific products bought by the customer during an order. An order will join with multiple records in this table.
From these descriptions, it is clear that each table has a different unit of analysis, and that we need to determine how to join the tables in a way that both makes sense and produces good results.
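To build intuition for what Feature Discovery automates, here is a minimal pandas sketch of doing this by hand: rolling transactions up to the order level, orders up to the user level, and joining everything back to the primary table. The table contents and column names below are invented for illustration and are not the actual Instacart schema.

```python
import pandas as pd

# Miniature stand-ins for the three tables (values are made up).
users = pd.DataFrame({
    "user_id": [1, 2],
    "time": pd.to_datetime(["2020-06-01", "2020-06-01"]),
    "bought_banana": [1, 0],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 1, 2],
    "order_time": pd.to_datetime(["2020-05-20", "2020-05-25", "2020-05-28"]),
})
transactions = pd.DataFrame({
    "order_id": [10, 10, 11, 12],
    "product": ["banana", "milk", "banana", "bread"],
})

# Aggregate transactions up to the order level...
items_per_order = transactions.groupby("order_id").size().rename("n_items")
orders_enriched = orders.join(items_per_order, on="order_id")

# ...then aggregate orders up to the user level, the primary table's
# unit of analysis.
per_user = orders_enriched.groupby("user_id").agg(
    n_orders=("order_id", "count"),
    n_items=("n_items", "sum"),
).reset_index()

# Finally, join the aggregates back onto the primary table.
training = users.merge(per_user, on="user_id", how="left")
```

Feature Discovery performs this kind of aggregation and joining automatically, across many more aggregation functions than the two shown here.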
Starting with Feature Discovery
To start using Feature Discovery, you first need to upload all of the datasets you want to use to the AI Catalog, which is exactly what we have done with the three datasets defined above.
Next, we create a new project using the dataset that has the target feature. In our case that is the Users table. Figure 1 shows you where to find the Create project button from within the AI Catalog.
Figure 1. Creating a project through AI Catalog with Users table
The Orders and Transactions tables will be added to the project later. We will refer to these as “secondary datasets” from now on.
Once the project has been created, all that remains is to define the feature engineering graph(s) that will be used for it.
Adding secondary datasets
You now need to define relationships between the different data assets so that DataRobot can use them to discover new features. Click Add datasets (Figure 2).
Figure 2. Adding Secondary Datasets
First, in the displayed popup window, specify the column that indexes your main table by time. In this dataset, that column happens to be named time, as shown in Figure 3.
Figure 3. Setting Time Index Column
Below are the steps you need to follow to successfully define a relationship:
Choose the secondary datasets from the AI Catalog, which can be accessed in the top-left corner.
Select Add relation, available through the hamburger menu on the primary dataset (Figure 4).
Select the Dataset to join with the primary dataset (Figure 5).
Select how these datasets should be joined (Figure 6).
Select the Time Index variable of the secondary dataset (Figure 7).
Define the Feature Derivation Window (Figure 7).
Figure 4. Add relation menu
Figure 5. Select Dataset to Join
Figure 6. Specify Columns for Joining
Keep in mind that instead of a single column, you can also join on a list of features for more complex joining operations.
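For intuition, a composite-key join of this kind is the pandas equivalent of merging on a list of columns. The tables and column names below are invented for illustration; rows whose composite key has no match simply receive missing values.

```python
import pandas as pd

# Invented tables keyed by the composite (user_id, region).
left = pd.DataFrame({
    "user_id": [1, 2],
    "region": ["US", "EU"],
    "target": [0, 1],
})
right = pd.DataFrame({
    "user_id": [1, 2],
    "region": ["US", "UK"],
    "score": [0.9, 0.4],
})

# Join on a list of columns instead of a single key; only rows where
# BOTH user_id and region match are linked.
joined = left.merge(right, on=["user_id", "region"], how="left")
```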
Figure 7. Specify Time Index and Feature Derivation Window
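The Feature Derivation Window restricts which secondary rows are eligible for aggregation: only rows falling in a fixed interval before each primary row's time index are used. A minimal pandas sketch of that filtering, with made-up dates and a hypothetical 30-day window, looks like this:

```python
import pandas as pd

# A single prediction point and a 30-day derivation window (illustrative).
prediction_point = pd.Timestamp("2020-06-01")
window = pd.Timedelta(days=30)

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_time": pd.to_datetime(["2020-03-15", "2020-05-10", "2020-05-30"]),
})

# Keep only orders inside the window ending at the prediction point;
# anything older is excluded from feature derivation.
in_window = orders[
    (orders["order_time"] >= prediction_point - window)
    & (orders["order_time"] < prediction_point)
]
```

Restricting derivation to a window before the prediction point is also what prevents information from after the prediction time from leaking into the features.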
Repeat the same procedure for the Transactions table. Figure 8 shows what the final relationship should look like:
Figure 8. Complete Relationship
Now that the secondary datasets are in place and DataRobot knows how to join them, we can return to the project by clicking Continue to project and then click Start.
The whole point of Feature Discovery is to generate features automatically, removing the manual, cumbersome work of engineering each one by hand.
DataRobot will automatically generate hundreds of these features and weed out the ones that do not add value to the modeling process. Figure 9 shows an example of a generated feature.
Figure 9. Histogram for 30 Day Sum of Reorders
This particular generated feature is the 30-day sum of the reordered products found in the Transactions table. If we want to take a closer look at this feature, we can also visit the Feature Lineage tab (Figure 10).
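Conceptually, this generated feature corresponds to a windowed group-by aggregation. The pandas sketch below reproduces the idea with invented data: a reordered flag per transaction, filtered to the 30 days before a prediction point and summed per user.

```python
import pandas as pd

# Hypothetical transaction-level data; "reordered" flags a repeat purchase.
transactions = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "order_time": pd.to_datetime(
        ["2020-05-05", "2020-05-20", "2020-03-01", "2020-05-25"]),
    "reordered": [1, 1, 1, 0],
})
prediction_point = pd.Timestamp("2020-06-01")

# Keep only transactions inside the 30-day derivation window, then sum
# the reordered flag per user -- the analogue of the generated feature.
recent = transactions[
    transactions["order_time"] >= prediction_point - pd.Timedelta(days=30)
]
sum_reordered_30d = recent.groupby("user_id")["reordered"].sum()
```

DataRobot derives features like this one automatically for every window, aggregation function, and column combination it explores, which is what makes the manual version impractical at scale.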
Figure 10. Feature Lineage tab
After DataRobot calculates the new features and removes those with no signal, everything is set for machine learning and model building.
Investigating Data in Depth
DataRobot gives you the option either to download the dataset with the new features created by Feature Discovery or to see the details in the Feature Derivation Log. Both options can be accessed by clicking the Feature Discovery tab under Data, as shown in Figure 11.
Figure 11. Download Dataset and Feature Derivation Log
Making Predictions
The process for scoring with models built with Feature Discovery enabled is a bit more involved than usual; this is because we need to ensure that the secondary datasets are up to date and that feature derivation will complete without problems.
To make predictions, go to your model of interest and click the Predict tab. You now have the option to upload a dataset by clicking Import data from (Figure 12).
Figure 12. Uploading Data for Scoring
The important thing to note here is that the dataset must have the same schema as the dataset you used to create the project. The Target column is optional, and the secondary datasets do not need to be uploaded at this step.
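A quick pre-flight check of that schema requirement can save a failed upload. The column names below are invented for illustration; substitute your project's actual training schema.

```python
import pandas as pd

# Hypothetical training schema; "bought_banana" is the target column.
training_columns = ["user_id", "time", "bought_banana"]
target = "bought_banana"

# A scoring file we are about to upload (values made up).
scoring = pd.DataFrame({
    "user_id": [101, 102],
    "time": pd.to_datetime(["2020-07-01", "2020-07-01"]),
})

# Every training column except the target must be present in the
# scoring file; the target itself is optional at prediction time.
required = set(training_columns) - {target}
missing = required - set(scoring.columns)
```

If `missing` is empty, the file matches the expected schema and is ready to upload.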
It takes a few seconds for the upload to finalize; you can then either click Compute Predictions to get predictions back, or change the default configuration for the secondary datasets. The latter is necessary when the scoring data you uploaded comes from a different time period and cannot be joined with the secondary datasets used during training. If that is the case, click Change (Figure 13) and then Create configuration (Figure 14) to define your new settings.
You will be asked to set the secondary datasets that you want to use together with this scoring dataset; make sure they were uploaded into the AI Catalog so that they appear here.
Note that secondary dataset configurations should be set prior to uploading your test set into DataRobot; if not, DataRobot will use the default settings to compute the joins and do feature derivation.