Advanced Feature Selection with R


This notebook shows how to combine DataRobot automation with R. In this case, you perform feature selection by aggregating feature impact across models. You can find an R Markdown notebook containing this code here, and a Python version of this script here.

Background

This is the procedure we are going to follow:

  1. Calculate the feature importance for each trained model.
  2. Get the feature ranking for each trained model.
  3. Get the ranking distribution for each feature across models.
  4. Sort by mean rank and visualize.
  5. Create a new feature list.
  6. Rerun Autopilot.
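
Steps 2 through 4 can be tried on a toy table before touching a real project. The sketch below is illustration only: the model names, feature names, and impact values are hypothetical, and the data frame merely mimics the featureName and impactUnnormalized columns used later in this article. The code further down aggregates normalized impact directly; this sketch shows the rank-based variant that the steps describe.

```r
# Toy feature impact table: hypothetical values for illustration only.
toy_impact <- data.frame(
  modelname = rep(c("model_a", "model_b", "model_c"), each = 3),
  featureName = rep(c("age", "income", "region"), times = 3),
  impactUnnormalized = c(0.9, 0.5, 0.1,
                         0.4, 0.8, 0.2,
                         0.7, 0.6, 0.3),
  stringsAsFactors = FALSE
)

# Step 2: rank features within each model (rank 1 = highest impact).
toy_impact$rank <- ave(-toy_impact$impactUnnormalized,
                       toy_impact$modelname,
                       FUN = rank)

# Step 3: collect the rank distribution of each feature across models.
rank_distribution <- split(toy_impact$rank, toy_impact$featureName)

# Step 4: sort features by mean rank (lowest mean rank first).
mean_rank <- sort(sapply(rank_distribution, mean))
print(mean_rank)
```

A feature that ranks consistently high across models ends up at the front of mean_rank, even if one model disagrees.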

Requirements

  • R version 3.6.2
  • DataRobot API version 2.2.0

Small adjustments might be needed depending on the R version and DataRobot API version you are using.

Full documentation of the R package can be found here: https://cran.r-project.org/web/packages/datarobot/index.html

It is assumed you already have a DataRobot Project object and a DataRobot Model object.

Install Packages and Connect to DataRobot

 

library(datarobot)
library(dplyr)
library(stringr)
library(ggplot2)
library(purrr)

ConnectToDataRobot(endpoint = "YOUR ENDPOINT", 
                   token = "YOUR TOKEN")

 

List Models

The first step is to get all of the models in the project. The next chunk of code selects the project by its Project ID, lists all of its models, and picks out the best one by validation metric.

 

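# Select the project by its ID and list every model on the leaderboard.
project <- GetProject("YOUR_PROJECT_ID")
allModels <- ListModels(project)
modelFrame <- as.data.frame(allModels)
metric <- modelFrame$validationMetric

# AUC and Gini Norm are better when higher; any other metric is treated as better when lower.
if (project$metric %in% c('AUC', 'Gini Norm')) {
  bestIndex <- which.max(metric)
} else {
  bestIndex <- which.min(metric)
}

# Keep the best-scoring model and show its type.
model <- allModels[[bestIndex]]
model$modelType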
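
The metric-direction logic can be checked in isolation. The helper below is a hypothetical convenience function, not part of the datarobot package, and the scores are invented; it assumes, as the code above does, that only AUC and Gini Norm are better when higher.

```r
# Hypothetical helper: index of the best score given the metric's direction.
best_index <- function(scores, metric_name,
                       higher_is_better = c("AUC", "Gini Norm")) {
  if (metric_name %in% higher_is_better) which.max(scores) else which.min(scores)
}

toy_scores <- c(0.71, 0.78, 0.74)  # invented validation scores
auc_pick <- best_index(toy_scores, "AUC")          # largest score wins
logloss_pick <- best_index(toy_scores, "LogLoss")  # smallest score wins
print(c(auc_pick, logloss_pick))
```

If your project optimizes another metric where higher is better, it would need to be added to higher_is_better.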

 

Filter List

The next step is to choose which models to use for feature impact. In this example, only highly ranked models that are neither blenders nor auto-tuned are used, which saves time and compute. You can include blenders and auto-tuned models if runtime is not a concern for your project. You also do not want to select models that were trained on small samples of the data; using models trained on 64% to 80% of the data is a good guideline.

The next block of code filters the list down to the models you want to focus on. For this example, we pick all of the models trained on at least 63% of the data; however, there are other approaches, such as focusing only on models within one modeling family.

 

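models <- ListModels(project)
# Stricter alternative: Informative Features models trained on 64% to 80% of the data, excluding blenders and auto-tuned models.
strictmodels <- Filter(function(m) m$featurelistName == "Informative Features" & m$samplePct >= 64 & m$samplePct <= 80 & !str_detect(m$modelType, 'Blender') & !str_detect(m$modelType, 'Auto-Tuned'), models)
# Used in the rest of this example: every model trained on at least 63% of the data.
bestmodels <- Filter(function(m) m$samplePct >= 63, models)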
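
The filtering logic can be tried without a DataRobot connection. This sketch uses a hand-written list whose elements carry only the fields the filter reads (featurelistName, samplePct, modelType). The model entries are hypothetical, and base R's grepl stands in for str_detect so the block has no package dependencies.

```r
# Hypothetical stand-ins for model objects; only the fields used by the filter.
toy_models <- list(
  list(modelType = "Gradient Boosted Trees", samplePct = 64,
       featurelistName = "Informative Features"),
  list(modelType = "AVG Blender", samplePct = 64,
       featurelistName = "Informative Features"),
  list(modelType = "Elastic-Net Classifier", samplePct = 16,
       featurelistName = "Informative Features"),
  list(modelType = "Auto-Tuned K-Nearest Neighbors", samplePct = 80,
       featurelistName = "Informative Features")
)

# TRUE only for non-blender, non-auto-tuned models trained on 64% to 80% of the data.
keep_model <- function(m) {
  m$featurelistName == "Informative Features" &&
    m$samplePct >= 64 && m$samplePct <= 80 &&
    !grepl("Blender", m$modelType, fixed = TRUE) &&
    !grepl("Auto-Tuned", m$modelType, fixed = TRUE)
}

toy_best <- Filter(keep_model, toy_models)
kept_types <- vapply(toy_best, function(m) m$modelType, character(1))
print(kept_types)
```

Naming the predicate makes it easy to swap in a different rule, such as keeping only one modeling family.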

 

Get Feature Impact

Next, you want to get the feature impact for the models that you selected. This for loop goes through the list and pulls out all of the feature impact data.

 

all_impact <- NULL
# Collect feature impact for each selected model, tagged with the model type.
for (i in seq_along(bestmodels)) {
  featureImpact <- GetFeatureImpact(bestmodels[[i]])
  featureImpact$modelname <- bestmodels[[i]]$modelType
  all_impact <- rbind(all_impact, featureImpact)
}
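
An alternative to growing a data frame inside a loop is to build one data frame per model and bind them once at the end. The sketch below shows that pattern on toy data; fake_feature_impact is a hypothetical stand-in for GetFeatureImpact, and the models and impact values are invented.

```r
# Hypothetical stand-in for GetFeatureImpact(): returns a small data frame.
fake_feature_impact <- function(model) {
  data.frame(featureName = c("age", "income"),
             impactUnnormalized = model$impacts,
             stringsAsFactors = FALSE)
}

# Invented model objects carrying only what the sketch needs.
toy_models <- list(
  list(modelType = "model_a", impacts = c(0.9, 0.5)),
  list(modelType = "model_b", impacts = c(0.4, 0.8))
)

# Build one data frame per model, then bind them in a single call.
impact_list <- lapply(toy_models, function(m) {
  fi <- fake_feature_impact(m)
  fi$modelname <- m$modelType
  fi
})
toy_all_impact <- do.call(rbind, impact_list)
print(toy_all_impact)
```

With a handful of models the difference is negligible; the single bind mainly matters when many models are included.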

 

Plot

You can plot the features and their medians using ggplot2's geom_boxplot (Figure 1). Here the boxplots show the unnormalized feature impact, rescaled by its overall maximum; however, there are other ways to show this information, such as plotting the standard deviation.

 

all_impact <- all_impact %>% mutate(finalnorm = impactUnnormalized/max(impactUnnormalized))
p <- ggplot(all_impact, aes(x=reorder(featureName, finalnorm, FUN=median), y=finalnorm))
p + geom_boxplot(fill= "#2D8FE2") + coord_flip() + theme(axis.text=element_text(size=16),
        axis.title=element_text(size=12,face="bold"))

 

Figure 1. Boxplot

Create New Feature List

After you have made your master list of important features, you can create a new feature list that contains the top-ranked features across your selected models. If you check the GUI, you will see the new feature list in the project.

 

ranked_impact <- all_impact %>% group_by(featureName) %>% 
    summarise(impact = mean(finalnorm)) %>% 
    arrange(desc(impact))

topfeatures <- pull(ranked_impact,featureName)
No_of_features_to_select <- 10
listname = paste0("TopFI_", No_of_features_to_select)
Feature_id_percent_rank = CreateFeaturelist(project, listName= listname , featureNames = topfeatures[1:No_of_features_to_select])$featurelistId

 

Run Autopilot on New Feature List

Now you can use this new feature list to rerun Autopilot. This allows us to see how the performance changes with the new feature list.

 

StartNewAutoPilot(project,featurelistId = Feature_id_percent_rank)
WaitForAutopilot(project)

 

Determine Ideal Number of Features

In addition to running Autopilot on a single new feature list, you can do multiple runs of Autopilot and plot the performance for each feature list length; this can help you decide how many features to include in a model.

The code below creates a loop that does an Autopilot run using the top 9, 6, 3, and 1 features.

 

No_of_features_to_loop <- c(9, 6, 3, 1)

for(i in seq_along(No_of_features_to_loop)) {
  n_features <- No_of_features_to_loop[i]
  listname = paste0("TopFI_", n_features)
  # Select the top n_features features, not the first i features.
  Feature_id_percent_rank = CreateFeaturelist(project, listName= listname , featureNames = topfeatures[1:n_features])$featurelistId
  StartNewAutoPilot(project,featurelistId = Feature_id_percent_rank)
  WaitForAutopilot(project)
}
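
When looping over list sizes, the subset of features should depend on the requested size rather than on the loop counter. A minimal sketch of that indexing, using a hypothetical ranked feature vector in place of topfeatures:

```r
# Hypothetical ranked feature names, most important first.
toy_features <- paste0("feature_", 1:10)
sizes <- c(9, 6, 3, 1)

# One subset per requested size; head() takes the first n names and
# never pads with NA when fewer than n names are available.
subsets <- lapply(sizes, function(n) head(toy_features, n))
names(subsets) <- paste0("TopFI_", sizes)

print(lengths(subsets))
```

Each element of subsets could then be passed as the featureNames argument for the matching list name.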

 

The next step is to extract the best AUC for each feature list. The final code blocks use a custom function to pull this data from an updated model list and then graph it using ggplot2 (Figure 2). In this case, performance gets worse as the size of the feature list increases.

Figure 2. Number of Features
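
The code for this last step is not shown above. As a rough sketch of the idea, and not the original custom function, the block below takes a data frame with featurelistName and validationMetric columns and keeps the best (highest) AUC per feature list. All values are hypothetical; in a real project a similar frame would come from as.data.frame(ListModels(project)), assuming it exposes those two columns.

```r
# Hypothetical leaderboard rows; AUC values are invented for illustration.
toy_frame <- data.frame(
  featurelistName = c("TopFI_1", "TopFI_1", "TopFI_3",
                      "TopFI_3", "TopFI_6", "TopFI_9"),
  validationMetric = c(0.61, 0.64, 0.70, 0.68, 0.72, 0.71),
  stringsAsFactors = FALSE
)

# Best (maximum) AUC per feature list.
best_auc <- aggregate(validationMetric ~ featurelistName,
                      data = toy_frame, FUN = max)

# Recover the feature count from the list name, e.g. "TopFI_6" -> 6.
best_auc$n_features <- as.integer(
  sub("TopFI_", "", best_auc$featurelistName, fixed = TRUE)
)
best_auc <- best_auc[order(best_auc$n_features), ]
print(best_auc)
```

The resulting best_auc frame has one row per feature list, so n_features against validationMetric can be plotted directly with ggplot2.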

If you have any questions, then feel free to click Comment and post them below.

Last update:
‎05-20-2020 11:15 AM