This notebook shows how you can combine automation and R for new use cases. In this case, you will perform feature selection by aggregating feature impact across models. You can find an R Markdown notebook containing this code here, and a Python version of this script here.
This is the procedure we are going to follow:
Calculate the feature impact for each trained model.
Get the feature ranking for each trained model.
Get the ranking distribution for each feature across models.
Sort by mean rank and visualize.
Create a new feature list.
R version 3.6.2
DataRobot API version 2.2.0
Small adjustments might be needed depending on the R version and DataRobot API version you are using.
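The original setup code is not reproduced here, but a minimal connection sketch looks like the following. The endpoint, token, and project ID are placeholders, not values from the notebook:

```r
library(datarobot)

# Connect to DataRobot (replace with your own endpoint and API token).
ConnectToDataRobot(endpoint = "https://app.datarobot.com/api/v2",
                   token = "YOUR_API_TOKEN")

# Retrieve an existing project and list all of its trained models.
project <- GetProject("YOUR_PROJECT_ID")
allModels <- ListModels(project)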
The next step is to choose which models to use for feature impact. In this example, only highly ranked models that are neither blenders nor auto-tuned are used, which keeps runtime and computational cost down; you can include those model types if runtime is not a concern for your project. You also do not want to select models that were trained on small samples of the data. Models trained on 64% and 80% of the data are a good guideline.
The next block of code filters that list down to just the models you want to focus on. For this example, we pick all of the models trained on more than 63% of the data; however, there are other approaches you can take, such as focusing only on models within one modeling family.
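A sketch of the filtering and aggregation steps, assuming `allModels <- ListModels(project)` has already run. The 63% threshold comes from the text; the variable names are illustrative:

```r
library(datarobot)

# Keep non-blender models trained on more than 63% of the data.
bestModels <- Filter(function(m) m$samplePct > 63 && m$modelCategory == "model",
                     allModels)

# For each model: compute feature impact, then rank features within the model
# (rank 1 = highest unnormalized impact).
impactList <- lapply(bestModels, function(m) {
  impact <- GetFeatureImpact(m)  # one row per feature
  impact$rank <- rank(-impact$impactUnnormalized)
  impact
})
allImpact <- do.call(rbind, impactList)

# Ranking distribution across models: sort features by mean rank.
meanRanks <- aggregate(rank ~ featureName, data = allImpact, FUN = mean)
meanRanks <- meanRanks[order(meanRanks$rank), ]
```
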
You can plot the features and their medians using ggplot2's geom_boxplot (Figure 1). Here we plot boxplots of the unnormalized feature impact; however, there are other ways to show this information, such as plotting the standard deviation.
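A plotting sketch in that spirit, assuming an `allImpact` data frame with one feature-impact row per model and feature:

```r
library(ggplot2)

# Boxplot of unnormalized feature impact per feature, ordered by median impact.
ggplot(allImpact,
       aes(x = reorder(featureName, impactUnnormalized, FUN = median),
           y = impactUnnormalized)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Feature", y = "Unnormalized feature impact")
```

Flipping the coordinates keeps long feature names readable on the y-axis.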
After you have assembled your master list of important features, you can create a new feature list that includes all the features from your top models. If you check the GUI, you will see the new feature list in the project.
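Creating the feature list is a single call to `CreateFeaturelist`. The list name and the choice of the top 10 features below are illustrative:

```r
library(datarobot)

# Take the top-ranked features (here: top 10 by mean rank) and register them
# as a new feature list in the project.
topFeatures <- head(meanRanks$featureName, 10)
newList <- CreateFeaturelist(project,
                             listName = "Aggregated Feature Impact",
                             featureNames = as.character(topFeatures))
```
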
In addition to running Autopilot on a single new feature list, you can do multiple runs of Autopilot and plot the performance for each feature list length; this can help you decide how many features to include in a model.
The code below creates a loop that runs Autopilot on the top 9, 6, 3, and 1 features.
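A sketch of that loop, assuming the `meanRanks` ranking from earlier; the feature-list names are arbitrary:

```r
library(datarobot)

# For each feature-list length, create the list and rerun Autopilot on it.
for (n in c(9, 6, 3, 1)) {
  fl <- CreateFeaturelist(project,
                          listName = paste("Top", n, "features"),
                          featureNames = as.character(head(meanRanks$featureName, n)))
  StartNewAutoPilot(project, featurelistId = fl$featurelistId)
  WaitForAutopilot(project)  # block until this run finishes before starting the next
}
```

Waiting for each run to finish keeps the runs sequential, so later model listings cleanly separate the results per feature list.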
The next step is to extract the best AUC for each feature list. The final code blocks use a custom function to pull this data from an updated model list and then graph it with ggplot2 (Figure 2). You can see that in this case, performance degrades as the feature list grows.
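One way to sketch that extraction, assuming each model's metrics list exposes a validation AUC under `metrics$AUC$validation` (the exact metric layout may differ by project):

```r
library(datarobot)
library(ggplot2)

# Refresh the model list and collect validation AUC per feature list.
models <- ListModels(project)
scores <- do.call(rbind, lapply(models, function(m) {
  data.frame(featurelistName = m$featurelistName,
             auc = m$metrics$AUC$validation,
             stringsAsFactors = FALSE)
}))

# Best AUC achieved on each feature list.
bestByList <- aggregate(auc ~ featurelistName, data = scores, FUN = max)

ggplot(bestByList, aes(x = featurelistName, y = auc, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "Feature list", y = "Best validation AUC")
```
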
Figure 2. Number of Features
If you have any questions, then feel free to click Comment and post them below.