Prediction Explanation Clustering with R


This post illustrates the technique of Prediction Explanation Clustering, as implemented in the datarobot.pe.clustering R package, hosted at the pe-clustering-R repository on the DataRobot Community GitHub.  An RMarkdown notebook with the code from this post is available here.

Prediction explanation clustering is a powerful technique for understanding the important patterns in your data in the context of a predictive model. It aggregates the row-by-row Prediction Explanations from a DataRobot model to produce clusters of observations with similar profiles of predictive factors for the target of interest. In this post, you’ll see how to run this methodology through the R package on an example dataset. For a more extensive discussion of the methodology, check out the model-building learning session “Explanation Clustering” in the DataRobot Community.

Requirements

This post assumes you are already set up with the datarobot R package (the DataRobot client for R). For more information about the datarobot R package, check out this DataRobot Community article.
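The client must be connected to your DataRobot instance before any of the calls below will work. A minimal connection sketch (the endpoint and token shown are placeholders for your own credentials):

library(datarobot)
# Hypothetical credentials; substitute your own endpoint URL and API token
ConnectToDataRobot(endpoint = "https://app.datarobot.com/api/v2",
                   token = "<your API token>")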

Installing the package

You can install the datarobot.pe.clustering package directly from GitHub:

 

if (!require("devtools")) { install.packages("devtools") }

if (!require("datarobot.pe.clustering")) { devtools::install_github("datarobot-community/pe-clustering-R", build_vignettes=TRUE) }

 

You will need a GitHub personal access token (PAT); export GITHUB_PAT=<token> in your shell before running install_github.
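If you prefer to stay inside R, you can set the token for the current session instead. A minimal sketch (the token value is a placeholder):

Sys.setenv(GITHUB_PAT = "<token>")  # placeholder; substitute your own PAT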

Loading libraries

In this example we’ll be using the datarobot.pe.clustering package itself, as well as a few other libraries to help illustrate the results.

 

library(datarobot.pe.clustering)
library(ggplot2)
library(dplyr)
library(tidyr)

 

Setting up the data

The data we will use for this example is the Pima Indians Diabetes dataset from the mlbench package. It contains health diagnostic measurements and diabetes diagnoses for 768 women of Pima Indian heritage. 

 

library(mlbench)
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)

 

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35       0 33.6    0.627  50      pos
2        1      85       66      29       0 26.6    0.351  31      neg
3        8     183       64       0       0 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74       0       0 25.6    0.201  30      neg

Obtaining a DataRobot model

To use prediction explanation clustering, we first need a relevant DataRobot model. For full documentation on fitting models with DataRobot, see the datarobot package.

For this example, we’ll start a new DataRobot project on the Pima Indians Diabetes dataset, training models to predict diabetes diagnosis. Then we’ll grab its top-performing model to use going forward.

 

# Create a project, run Autopilot in quick mode, and wait for it to finish
project <- StartProject(dataSource = PimaIndiansDiabetes,
                        projectName = "PredictionExplanationClusteringVignette",
                        target = "diabetes",
                        mode = "quick",
                        wait = TRUE)
# ListModels returns the leaderboard ranked by validation performance,
# so the first entry is the top-performing model
models <- ListModels(project$projectId)
model <- models[[1]]
summary(model)['modelType']

 

(Output)

                        modelType
"RandomForest Classifier (Gini)"

Running prediction explanation clustering

For full validity, we should run prediction explanation clustering on a separate dataset that was not used to train the model. For example purposes, however, we’ll simply reuse the training dataset.

 

scoring_df <- PimaIndiansDiabetes %>% select(-diabetes)
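If you do have data to spare, a minimal sketch of such a split (a hypothetical 80/20 partition, with train_df used for the DataRobot project and the held-out rows scored for explanations) might look like:

set.seed(42)  # reproducible example split
holdout_idx <- sample(nrow(PimaIndiansDiabetes), size = round(0.2 * nrow(PimaIndiansDiabetes)))
train_df   <- PimaIndiansDiabetes[-holdout_idx, ]                        # used to fit the models
scoring_df <- PimaIndiansDiabetes[holdout_idx, ] %>% select(-diabetes)   # scored for explanations

Here we proceed with the full dataset, so the object sizes below (768 observations) match.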

 

Next we run the prediction explanation clustering function. This computes the prediction explanations for each row and then performs the clustering routines on those explanations.

 

results <- cluster_and_summarize_prediction_explanations(
  model,
  scoring_df,
  num_feature_summarizations = 10,  # how many top features to keep when summarizing explanations
  num_neighbors = 50,               # neighborhood size for the dimensionality reduction
  min_dist = 10^-100,               # minimum distance for the dimensionality reduction
  min_points = 25                   # minimum number of points required to form a cluster
)

 

The results object captures the intermediate and final outputs of the prediction explanation clustering process. We can introspect these results in a variety of ways.

 

str(results, max.level = 1)

 

(Output)

List of 5
 $ plot_data      :'data.frame':   768 obs. of  3 variables:
 $ summary_data   :Classes 'tbl_df', 'tbl' and 'data.frame':   3 obs. of  10 variables:
 $ cluster_ids    : int [1:768] 3 2 3 2 3 2 3 2 3 3 ...
 $ pe_frame       :'data.frame':   768 obs. of  22 variables:
 $ strength_matrix:'data.frame':   768 obs. of  8 variables:
 - attr(*, "class")= chr "dataRobotPEClusterResults"
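The individual components can also be used directly. For example, the vector of cluster assignments can be tabulated; the counts match the n column of the cluster summary shown below:

table(results$cluster_ids)

(Output)

  1   2   3
 54 445 269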

Introspection of results

Summary and Plot

We can use summary() to view a summary of the clusters based on the features most important to the model’s predictive performance. Here we can see that the clusters differ, on average, across a wide array of features.

 

summary(results)

 

clusterID   glucose      age     mass  pedigree pregnant  triceps pressure   insulin   n
        1  91.59259 23.33333 21.60000 0.2937222 1.759259 14.29630 53.66667  28.16667  54
        2 111.59551 32.25169 31.32764 0.4473371 3.501124 20.24045 69.92584  73.42697 445
        3 142.15985 36.86617 35.17881 0.5482342 4.832714 22.27881 70.84758 100.70632 269

We can use plot() to see how the clusters are spread out in the reduced-dimensionality space, which gives a quick sense of how well they are separated from each other in prediction explanation space.

 

plot(results)

 


Figure 1. Results Plot

This same plotting data is available within the results, allowing for plotting through libraries like ggplot2:

 

ggplot(results$plot_data, aes(x = dim1, y = dim2, color = clusterID)) +
  geom_point() +
  theme_bw() +
  labs(title = 'Records by Prediction Explanation Cluster',
       x = 'Reduced Dimension 1', y = 'Reduced Dimension 2')

 

Figure 2. Results Plot via ggplot2

Characterizing clusters by prediction risk and feature values

By joining the cluster IDs and predicted scores supplied by the results back to the original dataset, we can get further insight into the patterns captured by the clusters.

 

scoring_df_with_clusters <- scoring_df
scoring_df_with_clusters$cluster <- factor(results$cluster_ids)                 # cluster assignment per row
scoring_df_with_clusters$predicted_risk <- results$pe_frame$class1Probability  # predicted probability of diabetes
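As a quick numeric check before plotting, we can average the predicted risk within each cluster (a small dplyr sketch over the columns just defined):

scoring_df_with_clusters %>%
  group_by(cluster) %>%
  summarise(mean_risk = mean(predicted_risk), n = n())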

 

For example, we can examine how the predicted risk of diabetes varies by cluster. Here we can see that one of our clusters has especially high diabetes risk, while the other two clusters have mostly lower levels of risk.

 

scoring_df_with_clusters %>%
    ggplot(aes(x = cluster, y = predicted_risk, fill = cluster)) +
    geom_violin() +
    labs(title = 'Predicted Diabetes Risk by Cluster') +
    theme_bw()

 

Figure 3. Diabetes Risk by Cluster

We can also look at how our clusters differ on the original feature values. Examining the distributions of these features, we can see that our clusters differ on a number of different features. Because these clusters are derived from the prediction explanation clustering, we can have more confidence that the differences between the clusters are associated with meaningful differences in the diabetes risk profile. 

 

scoring_df_with_clusters %>%
    gather(key = 'feature', value = 'value', -cluster, -predicted_risk) %>%  # keep only the original features
    ggplot(aes(x = value, group = cluster, color = cluster, fill = cluster)) +
    geom_density(alpha = 0.2) +
    facet_wrap(~feature, scales = 'free') +
    theme_bw()

 

Figure 4. Feature Values by Cluster

Characterizing clusters by prediction explanation strength

In addition to looking at the clusters based on the original features, we can also look at them based on the prediction explanation strengths. These give us insight into which features contributed most to the predicted diabetes risk profile of each cluster’s members, and whether a feature’s contribution increased or decreased risk.

 

strength_matrix_with_clusters <- results$strength_matrix
strength_matrix_with_clusters$cluster <- factor(results$cluster_ids)
head(strength_matrix_with_clusters)

 

         age    glucose insulin       mass   pedigree   pregnant pressure triceps cluster
1  0.5732024  0.5684304       0  0.0000000  0.4872620  0.0000000        0       0       3
2  0.4600908 -0.8918850       0 -0.6136407  0.0000000  0.0000000        0       0       2
3  0.0000000  0.8120229       0 -0.6083606  0.0000000  0.5067215        0       0       3
4 -1.9243175 -1.3742646       0  0.0000000  0.0000000 -1.3007276        0       0       2
5  0.0000000  0.5450452       0  0.5129875  0.3837956  0.0000000        0       0       3
6  0.0000000  0.0000000       0 -1.0686749 -0.5861849 -0.2575089        0       0       2

Examining the distribution of prediction explanation strengths by cluster, we can see that cluster 1 members tend to have a lower predicted diabetes risk due to their age, glucose levels, and mass. Looking back at the feature values by cluster above, we can see that cluster 1 tends to be younger, lighter, and have lower glucose levels.

In contrast, we can see that cluster 3 members often are predicted to have elevated risks due to age, glucose, and mass. Looking back at the feature values by cluster (above), we can see that cluster 3 members tend to be older, heavier, and have higher glucose values. 
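One way to quantify these tendencies is to average the explanation strengths within each cluster, a small dplyr sketch:

strength_matrix_with_clusters %>%
  group_by(cluster) %>%
  summarise_all(mean)

Features with positive averages pushed a cluster’s predictions toward higher risk on balance; negative averages pushed them lower. The density plots below show the full distributions: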

 

strength_matrix_with_clusters %>%
  gather(feature, strength, -cluster) %>%
  ggplot(aes(x = strength, group = cluster, color = cluster, fill = cluster)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~feature, scales = 'free') +
  xlab('Strength of prediction explanation') +
  theme_bw()

 

 

Figure 5. Prediction Explanation Strengths by Cluster

Where to go from here

Using the datarobot.pe.clustering R package (pe-clustering-R repository) and the techniques illustrated here, we encourage you to apply this methodology to your own datasets. Download the RMarkdown notebook here and adapt it to your own data. Uncover what clusters are present in your model’s prediction explanations, and explore the varied ways to characterize and interpret them. Use these clusters to inform further feature engineering, or incorporate cluster information into the output delivered to the consumers of your model predictions. Once you have a feel for how the package works, branch out, explore different variations on the methodology coded in the package, and discover what works best for your problems. Then share your experiences and insights with other users on the DataRobot Community. Together, we can find even more ways to leverage DataRobot for AI insights.

More Information

You can find a Python implementation for Prediction Explanation Clustering in the DataRobot Community GitHub.

Check out the community learning session Explanation Clustering and the article How Can I Explain a Prediction?

If you're a licensed DataRobot customer, search the in-app Platform Documentation for Prediction Explanations.
