Regarding partial dependency plots: the DataRobot documentation seems to define "actual" as the mean value of the target in the data, within the given fold, for each value of the feature of interest, and "predicted" as the mean value of the model's prediction, computed the same way.
What is the partial dependency line?
This source
https://christophm.github.io/interpretable-ml-book/pdp.html
seems to define it as the expected value varying the other features, which would be the actual or the predicted line depending on which function is involved. But the DataRobot documentation mentioned above indicates that the calculation of the partial dependency uses all values of the feature of interest, and then scales using only the values in the fold.
It is unclear to me what that means. Can someone clarify?
------------
Note: I am not looking for someone to explain the language in that paragraph to me. Rather, I hope someone knows the actual calculation in clearer terms, such as a direct summation formula. Thanks.
Hi @Bruce ,
Regarding the sampling, you are correct in assuming that it is done to reduce computational effort. Also, as I mentioned in my answer, the partial dependency plot is meant to show how the prediction would vary on average if we vary only the feature of interest and keep everything else constant. The plot provides only a high-level view of the relationship, and the sampling choices are based on tests DataRobot has run on numerous datasets.
Partitioning of data is done as a part of modeling to ensure that our model fits well on unseen data (by holding out a partition of the data during training). You can find more details on the partitioning here. In the context of the partial dependence calculations, the partition / fold is the one from which the sample of data is drawn.
Thanks @Vinay ,
I appreciate the effort you are putting in here. It is important to me to grok this plot. It impacts my work. I did read the literature you linked. And I think I am getting a better handle on the jargon (clearly we have different linguistic and technical backgrounds). But, could I get some direct feedback from you about my earlier response to the two of you?
Clearly such actions as choosing 1000 rows or 100 values of the feature of interest exist only to reduce the computational effort. The numbers are somewhat arbitrary and depend on the computational resources and the size and nature of the dataset.
Yes?
So, the partitioning (forming folds) and the selection of the data are essentially a sampling process to estimate the true value from a sample of the data.
Yes?
So, for me, to understand what the plots are, it is easier to forget the issue of sampling and understand what the plots would be if we were not limited by computational power.
The actual plot is the expected value of the target in the data given the value of the feature of interest?
actual(x) = expect( T | X=x )
The predicted plot is the expected value of the model's estimate of the target given the value of the feature of interest?
predicted(x) = expect( M(X,Y) | X=x )
The dependency plot is the expected value of the model fixing the value of the feature of interest?
dependency(x) = expect( M(x,Y) )
I now feel fairly sure that this is what is going on.
In the above I have used the terms of stochastics. In terms of statistics or machine learning, one is computing a mean value, and if there are very few data points that have X=x, then the calculation will need to involve folds of data so that a meaningful mean can be taken. But these are conceptually separable from the goal and meaning of the calculation, in my opinion and experience.
Hi @Bruce,
I used 5 values just as an example to explain how the computation is done. DataRobot uses the following logic for numerical features to calculate partial dependence:
If the value count of the feature in the entire dataset is greater than 99, DataRobot computes Partial Dependence on the percentiles of the distribution of the feature in the entire dataset.
If the value count is 99 or less, DataRobot computes Partial Dependence on all values in the dataset (excluding outliers).
The above can be found in our docs here.
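The numeric-feature rule above could be sketched roughly as follows. This is only an illustration, not DataRobot's code: the function name `pd_grid` is made up, "value count" is read here as the number of distinct values, the percentiles are assumed to be the 1st through 99th, and the outlier exclusion is left out because the thread does not say how outliers are detected.

```python
import numpy as np

def pd_grid(values, max_unique=99):
    """Pick the feature values at which partial dependence is evaluated.

    Hypothetical helper: the threshold of 99 comes from the post above,
    everything else is an assumption made for illustration.
    """
    values = np.asarray(values, dtype=float)
    unique = np.unique(values)
    if unique.size > max_unique:
        # Many distinct values: evaluate on percentiles of the distribution
        # (assumed here to be the 1st..99th), dropping duplicate percentiles.
        return np.unique(np.percentile(values, np.arange(1, 100)))
    # Few distinct values: evaluate on every value (outlier removal omitted).
    return unique

wide = pd_grid(np.arange(1000))      # more than 99 distinct values
narrow = pd_grid([1, 2, 2, 3])       # 99 or fewer distinct values
```

Either way the grid stays at roughly 100 points or fewer, which bounds how many times each sampled row has to be scored.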
The idea behind partial dependence is to understand how a change in a feature's value, while keeping all other features as they were, impacts a model's predictions. More detail on the partial dependency calculations can be found in our docs here.
Since we are varying one feature at a time, this method analyzes each feature effect independently. The objective of this chart is to understand how the feature impacts prediction at a high level. So for most use cases, 1-way partial dependence is useful to understand how the model captures the relationship between target and features.
The actual and predicted values in the plot are calculated separately from the partial dependence calculations. For numerical features, the calculations are done using this logic:
If the value count in the selected partition fold is greater than 20, DataRobot bins the values based on their distribution in the fold and computes Predicted and Actual for each bin.
If the value count is 20 or less, DataRobot plots Predicted/Actuals for the top values present in the fold selected.
You can find the above and more related details in our docs here.
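As a rough sketch of the Predicted/Actual logic just described, under stated assumptions: the name `predicted_vs_actual` is hypothetical, "value count" is read as the number of distinct values in the fold, the bins are assumed to be quantile bins (10 of them here, an arbitrary choice), and for 20 or fewer distinct values every value gets its own group. None of this is taken from DataRobot's implementation.

```python
import numpy as np

def predicted_vs_actual(x, actual, predicted, n_bins=10, max_unique=20):
    """Return rows of (mean x, mean actual, mean predicted) per group."""
    x = np.asarray(x, dtype=float)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    unique = np.unique(x)
    if unique.size > max_unique:
        # Many distinct values: quantile bin edges from the fold's distribution.
        edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
        labels = np.digitize(x, edges[1:-1])   # bin index per row
    else:
        # Few distinct values: one group per distinct value.
        labels = np.searchsorted(unique, x)
    rows = []
    for b in np.unique(labels):
        mask = labels == b
        rows.append((x[mask].mean(), actual[mask].mean(), predicted[mask].mean()))
    return np.array(rows)

# Toy fold: the "model" over-predicts by exactly 1 everywhere.
x = np.arange(100.0)
table = predicted_vs_actual(x, actual=x, predicted=x + 1.0)
```

A systematic gap between the second and third columns in some range of the feature is the kind of mis-prediction this part of the plot is meant to reveal.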
The idea of predicted vs actuals is to identify if there are parts of your data where the model is systematically mis-predicting. If the insight shows larger differences between predicted and actual for a specific feature, it may suggest you need additional data to help explain the discrepancy.
Is this the right idea ...
Let X be the feature of interest, Y be the other features, and m(x,y) be the model output when X=x and Y=y, where m(X,Y) is an estimate of T, the target feature.
actual(x) = expected( T given X=x)
predicted(x) = expected( m(X,Y) given X=x)
dependency(x) = expected( m(x,Y) )
This makes the most sense if X and Y are independent. If they are not, then expected(m(x,Y)) might not look like predicted(x) at all.
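A small simulation may make the last point concrete. It is a toy under assumed conditions: the data-generating process and the model `m` are invented, and `m` is simply set equal to the noise-free target function so that any gap between the curves comes from the dependence between X and Y rather than from model error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.integers(0, 5, size=n).astype(float)       # feature of interest X
y = x + rng.normal(scale=0.1, size=n)              # other feature Y, strongly tied to X
t = x + 2.0 * y + rng.normal(scale=0.1, size=n)    # target T

def m(x, y):
    # Hypothetical model: matches the noise-free data-generating function.
    return x + 2.0 * y

levels = np.unique(x)
# actual(x)     = mean of T over rows with X = x
actual = np.array([t[x == v].mean() for v in levels])
# predicted(x)  = mean of m(X, Y) over rows with X = x
predicted = np.array([m(x[x == v], y[x == v]).mean() for v in levels])
# dependency(x) = mean of m(x, Y) over ALL rows, with X forced to x
dependency = np.array([m(np.full(n, v), y).mean() for v in levels])

slope_predicted = np.polyfit(levels, predicted, 1)[0]
slope_dependency = np.polyfit(levels, dependency, 1)[0]
```

Here `actual` and `predicted` rise by about 3 per unit of x, because Y moves with X, while `dependency` rises by exactly 1, the direct effect of x inside `m` with the Y values held as observed.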
Sorry, not getting it.
As I understand you - the selection of the 1000 rows is only for computational speed. We sample the data to reduce the required resources. So, conceptually, I can separate that out.
The 5 values of the feature of interest - are these also sampled? Are we assuming that the feature of interest is categorical? If the feature of interest is not categorical do we group it into folds to make it categorical?
Is it correct that the "actual" plot is the mean of the target feature in the data for fixed feature of interest and varied other features?
Is it correct that the "predicted" plot is the mean of the target estimate from the model for fixed feature of interest and varied other features?
The "dependency" plot appears then to be the mean of the model output over the cartesian product of all the values of the other features and all the values of the feature of interest. So - how can we plot that against the feature of interest?
It would seem to make more sense if the partial dependency at a given feature value were the mean, over all the rows, of the prediction you get when you replace the feature of interest with that value.
dependency(x) = expected( model(x,y) ) over y in the data. But, if x and y were not independent, then this could be really messy.
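Written as plain sample averages, the three quantities might look like the following. The notation is an assumption for illustration: the data are n rows (x_i, y_i, t_i), m is the model, and I_x is the set of rows whose feature of interest equals x (or falls in the bin around x).

```latex
\widehat{\mathrm{dependency}}(x) = \frac{1}{n} \sum_{i=1}^{n} m(x, y_i),
\qquad
\widehat{\mathrm{predicted}}(x) = \frac{1}{|I_x|} \sum_{i \in I_x} m(x_i, y_i),
\qquad
\widehat{\mathrm{actual}}(x) = \frac{1}{|I_x|} \sum_{i \in I_x} t_i,
\qquad
I_x = \{\, i : x_i = x \,\}
```

The first sum runs over every row with the feature of interest overwritten by x, while the other two run only over the rows that actually have that value.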
Thanks for your response.
I hope you can clear some details up for me.
Are we selecting 1000 rows just to reduce the computation time? In principle, if we could do this computation with all the rows, would this be more correct?
Are the 100 values of F selected uniformly over all the possible values? Or do you mean that this should be done for each fold of F? If we could do this computation with all the values of F in the fold, would this be more correct?
Is the idea that the horizontal axis in the plot gives folds of the feature F?
------------------------
Where does the matter of scaling come in?
Let M be our model and G the features other than F. We seem to be computing the mean value of M(F,G) for fixed F and calling it the prediction from F. We also seem to be computing the mean value of M(F,G) for a fixed value of F and then calling it the partial dependency. But these things are so different that we have to scale them to make them look similar. What is the scaling?
Hi @Bruce,
The partial dependence plot shows the marginal effect a feature has on the predicted outcome of a machine learning model. To put it simply, the plot shows how the prediction varies if we change just one feature and keep everything else constant. This gives us insight into the nature of the relationship between the target and the feature, such as whether it is linear or monotonic.
Calculation in detail for one feature X1 on a sample of 1000 records of training data:
Let us assume the feature X1 has 5 different levels (like 0, 5, 10, 15, 20). For all the 1000 records, we create artificial datapoints by keeping all features constant except the feature X1, which translates to 5,000 records (each row duplicated 5 times, once for each level of X1). Now, we run predictions for all these 5,000 records and average the predictions separately for each of the 5 levels of feature X1. These 5 average predictions correspond to the marginal effect of feature X1, and this is what is displayed on the Partial Dependence Plot.
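The procedure in this paragraph can be sketched in a few lines. This is a minimal illustration, not DataRobot's implementation: `partial_dependence` and `toy_predict` are hypothetical names, and the toy model stands in for any fitted model that scores a 2-D array of rows.

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """Average prediction when column `feature_idx` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()                 # keep every other feature as observed
        X_mod[:, feature_idx] = value    # overwrite the feature of interest
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def toy_predict(X):
    # Hypothetical stand-in for a fitted model: prediction = 2*X1 + X2.
    return 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                      # sample of 1000 records
grid = np.array([0.0, 5.0, 10.0, 15.0, 20.0])       # the 5 levels of X1
pd_values = partial_dependence(toy_predict, X, feature_idx=0, grid=grid)
```

The loop scores 5 × 1000 = 5,000 artificial rows in total and returns one average per level, which are the points that would be drawn on the plot.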
If we have 10 features, each with 5 different values, in a training dataset of 10k records, computing the marginal effect using all the data would require predictions on 500k records (10 features × 5 values × 10k rows). Hence, using the whole dataset to compute the partial dependence becomes computationally expensive. We can also get very similar results from a representative sample of the data, so DataRobot uses only a sample to calculate partial dependence.
I hope this clarifies your questions.
Hi @Bruce
How to calculate partial dependence for a numerical feature F:
(Note that we are making 100*1000 predictions here - this is why the process takes a while)
With this, you now have an expected value for each s in S. If you plot that, it gives you the yellow partial dependence plot.
The interpretation would be: how does the model think this feature F influences my predictions, if I left everything else constant.
Hope that helps 🙂
Cheers,
Lukas