Regarding partial dependency plots. In the Datarobot documentation
It seems to define actual as the mean value of the target in the data for the given fold for the feature of interest and predicted as the mean value of the prediction from the model, likewise.
What is the partial dependency line?
Seems to define it as the expected value varying the other features. Which would be the actual or the predicted line depending on which function was involved. But the Datarobot documentation linked above indicates that the calculation of the partial dependency uses all values of the feature of interest, and then scales using only the values in the fold.
It is unclear to me what that means. Can someone clarify?
Note: I am not looking for someone to explain the language in the paragraph to me; rather, I hope someone knows the actual calculation in clearer terms, such as a direct summation formula. Thanks.
how to calculate partial dependence for a numerical feature F:
(Note that we are making 100*1000 predictions here - this is why the process takes a while)
With this, you now have an expected value for each s in S. If you plot those values, you get the yellow partial dependence plot.
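A minimal sketch of the calculation described here, assuming 1000 sampled rows and 100 candidate values S for the feature F. The function name `partial_dependence`, the toy model `toy_predict`, and the use of evenly spaced percentiles for S are assumptions made for illustration, not DataRobot's actual implementation:

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, n_rows=1000, n_values=100, seed=0):
    """Average prediction when column `feature_idx` is forced to each value in S."""
    rng = np.random.default_rng(seed)
    # Sample rows without replacement to keep the number of predictions manageable.
    n_rows = min(n_rows, X.shape[0])
    sample = X[rng.choice(X.shape[0], size=n_rows, replace=False)]
    # Candidate values S for the feature: evenly spaced percentiles (an assumption).
    grid = np.unique(np.percentile(X[:, feature_idx], np.linspace(0, 100, n_values)))
    pd_values = np.empty(len(grid))
    for k, s in enumerate(grid):
        modified = sample.copy()
        modified[:, feature_idx] = s              # overwrite F with s in every sampled row
        pd_values[k] = predict(modified).mean()   # average over the other features
    return grid, pd_values

# Toy model standing in for a fitted model: prediction = 2*F + G.
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2))
grid, pd_values = partial_dependence(toy_predict, X, feature_idx=0)
```

Plotting `pd_values` against `grid` would give the partial dependence line. For this toy model it is a straight line with slope 2, shifted by the mean of G in the sampled rows.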
The interpretation would be: how does the model think this feature F influences my predictions, if I left everything else constant.
Hope that helps 🙂
The partial dependence plot shows the marginal effect a feature has on the predicted outcome of a machine learning model. To put it simply, the plot gives us an understanding of how the prediction varies if we change just one feature and keep everything else constant. This gives us insight into the nature of the relationship between the target and the feature, such as linear, monotonic etc.
Calculation in detail for one feature X1 on a sample of 1000 records of training data -
Let us assume the feature X1 has 5 different levels (like 0,5,10,15,20). For all the 1000 records, we create artificial datapoints by keeping all features constant except the feature X1, which translates to 5,000 records (each row duplicated 5 times with one value of the different levels of X1). Now, we run predictions for all these 5,000 records and average the predictions for all the 5 different levels of feature X1. This average prediction now corresponds to the marginal effect of feature X1 and this is what is displayed on the Partial Dependence Plot.
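The 5-level example could be sketched as follows. The data, the second feature X2, and `toy_model` are hypothetical stand-ins; only the row duplication and the per-level averaging follow the description:

```python
import numpy as np

rng = np.random.default_rng(0)
n_records = 1000
levels = np.array([0, 5, 10, 15, 20])            # the 5 levels of X1

# Hypothetical data: column 0 is X1, column 1 is another feature X2.
data = np.column_stack([rng.choice(levels, size=n_records),
                        rng.normal(size=n_records)])

def toy_model(rows):
    # Stand-in for a fitted model: 0.5*X1 + X2.
    return 0.5 * rows[:, 0] + rows[:, 1]

# Duplicate every record once per level: 1000 * 5 = 5000 artificial rows.
artificial = np.repeat(data, len(levels), axis=0)
# Within each block of 5 copies, set X1 to each level in turn; X2 is unchanged.
artificial[:, 0] = np.tile(levels, n_records)

# One prediction per artificial row, arranged as (record, level).
predictions = toy_model(artificial).reshape(n_records, len(levels))
partial_dependence = predictions.mean(axis=0)    # one average per level of X1
```

Each entry of `partial_dependence` is the average of 1000 predictions, one per original record, with X1 forced to that level.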
If we have 10 features and each feature has 5 different values in a training dataset of 10k records, creating the marginal effect using all the data would require us to predict using 500k records (10 x 5 x 10,000). Hence, using the whole dataset to compute the partial dependence becomes computationally expensive. We can also get very similar results if we use a representative sample of the data. Hence, DataRobot only uses a sample of the data to calculate partial dependence.
I hope this clarifies your questions.
Thanks for your response.
I hope you can clear some details up for me.
Are we selecting 1000 rows just to reduce the computation time? In principle, if we could do this computation with all the rows, would this be more correct?
Are the 100 values of F selected uniformly over all the possible values? Or do you mean that this should be done for each fold of F? If we could do this computation with all the values of F in the fold, would this be more correct?
Is the idea that the horizontal axis in the plot gives folds of the feature F?
Where does the matter of scaling come in?
Let M be our model and G the features other than F. We seem to be computing the mean value of M(F,G) for fixed F and calling it the prediction from F. We seem to be computing the mean value of M(F,G) for a fixed value of F and then calling it the partial dependency. But these things are so different that we have to scale them to make them look similar. What is the scaling?
Sorry, not getting it.
As I understand you - the selection of the 1000 rows is only for computational speed. We sample the data to reduce the required resources. So, conceptually, I can separate that out.
The 5 values of the feature of interest - are these also sampled? Are we assuming that the feature of interest is categorical? If the feature of interest is not categorical do we group it into folds to make it categorical?
Is it correct that the "actual" plot is the mean of the target feature in the data for fixed feature of interest and varied other features?
Is it correct that the "predicted" plot is the mean of the target estimate from the model for fixed feature of interest and varied other features?
The "dependency" plot appears then to be the mean of the model output over the cartesian product of all the values of the other features and all the values of the feature of interest. So - how can we plot that against the feature of interest?
It would seem to make more sense if the partial dependency at feature value was the mean, over all the rows, of the prediction value if you replace the feature of interest with the given value.
dependency(x) = expected( model(x,y) ) over y in the data. But, if x and y were not independent, then this could be really messy.
Is this the right idea ...
Let X be the features of interest and Y be the other features and m(x,y) be the model output when X=x and Y=y. Where m(X,Y) is an estimate of T, the target feature.
actual(x) = expected( T given X=x)
predicted(x) = expected( m(X,Y) given X=x)
dependency(x) = expected( m(x,Y) )
Which makes the most sense if X and Y are independent. If they are not, then expected(m(x,Y)) might not look like predicted(x) at all.
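A small numerical sketch of the three definitions above, under the assumption of a made-up dataset in which Y depends strongly on X. The data-generating process and the model `m` are hypothetical and chosen only to show how predicted(x) and dependency(x) can diverge:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
X = rng.integers(0, 5, size=n).astype(float)     # feature of interest, values 0..4
Y = X + rng.normal(scale=0.1, size=n)            # other feature, strongly dependent on X
T = X + Y + rng.normal(scale=0.1, size=n)        # target

def m(x, y):
    # Hypothetical model; here it happens to match the true relationship.
    return x + y

values = np.unique(X)
# actual(x) = mean of T over rows with X = x
actual = np.array([T[X == v].mean() for v in values])
# predicted(x) = mean of m(X, Y) over rows with X = x
predicted = np.array([m(X, Y)[X == v].mean() for v in values])
# dependency(x) = mean of m(x, Y) over ALL rows, with X replaced by x
dependency = np.array([m(v, Y).mean() for v in values])
```

In this setup `predicted` rises by about 2 per unit of x, because Y moves with X, while `dependency` rises by exactly 1 per unit, because Y is averaged over its whole distribution regardless of x.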
I used 5 values just as an example to explain how the computation is done. DataRobot uses the following logic for numerical features to calculate partial dependence -
If the value count of the feature in the entire dataset is greater than 99, DataRobot computes Partial Dependence on the percentiles of the distribution of the feature in the entire dataset.
If the value count is 99 or less, DataRobot computes Partial Dependence on all values in the dataset (excluding outliers).
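The two rules above could be sketched roughly as below. This reads "value count" as the number of distinct values, which is an assumption; the function name `partial_dependence_grid`, the choice of the 1st to 99th percentiles, and the omission of outlier handling are also assumptions rather than documented DataRobot behaviour:

```python
import numpy as np

def partial_dependence_grid(values, max_distinct=99):
    """Choose the feature values at which partial dependence is evaluated."""
    values = np.asarray(values, dtype=float)
    distinct = np.unique(values)
    if len(distinct) > max_distinct:
        # Many distinct values: use percentiles of the feature's distribution.
        # Which percentiles are used is an assumption here (1st..99th).
        return np.unique(np.percentile(values, np.arange(1, 100)))
    # Few distinct values: use all of them (outlier exclusion is not sketched here).
    return distinct

many = partial_dependence_grid(np.arange(1000))  # more than 99 distinct values
few = partial_dependence_grid([1, 2, 2, 3])      # 3 distinct values
```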
The above can be found in our docs here.
The idea behind partial dependence is to understand how a change in a feature's value, while keeping all other features as they were, impacts a model's predictions. More detail on the partial dependency calculations can be found in our docs here.
Since we are varying one feature at a time, this method analyzes each feature's effect independently. The objective of this chart is to understand how the feature impacts prediction at a high level. So for most use cases, 1-way partial dependence is useful for understanding how the model captures the relationship between target and features.
The actual and predicted values in the plot are calculated separately from the partial dependence calculations. For numerical features, the calculations are done using this logic -
If the value count in the selected partition fold is greater than 20, DataRobot bins the values based on their distribution in the fold and computes Predicted and Actual for each bin.
If the value count is 20 or less, DataRobot plots Predicted/Actuals for the top values present in the fold selected.
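A rough sketch of this binning logic, again reading "value count" as the number of distinct values. The function name `predicted_vs_actual`, the use of 10 quantile bins, and treating "top values" as all distinct values present are assumptions made for illustration:

```python
import numpy as np

def predicted_vs_actual(feature, actual, predicted, max_distinct=20, n_bins=10):
    """Return (label, mean actual, mean predicted) per bin or per distinct value."""
    feature = np.asarray(feature, dtype=float)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    distinct = np.unique(feature)
    if len(distinct) > max_distinct:
        # Quantile-based bin edges follow the feature's distribution in the fold.
        edges = np.unique(np.quantile(feature, np.linspace(0, 1, n_bins + 1)))
        groups = np.digitize(feature, edges[1:-1])   # bin index per row
        labels = 0.5 * (edges[:-1] + edges[1:])      # bin midpoints as labels
    else:
        # Few distinct values: one group per value.
        groups = np.searchsorted(distinct, feature)
        labels = distinct
    rows = []
    for g, label in enumerate(labels):
        mask = groups == g
        if mask.any():
            rows.append((float(label), float(actual[mask].mean()),
                         float(predicted[mask].mean())))
    return rows

x = np.arange(100)
binned = predicted_vs_actual(x, actual=2.0 * x, predicted=2.0 * x + 1.0)
small = predicted_vs_actual([1, 1, 2], actual=[0, 2, 4], predicted=[1, 1, 3])
```

A consistent gap between the second and third entries of a row, as in `binned` here, is the kind of systematic mis-prediction the Predicted vs Actual view is meant to expose.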
You can find the above and more related details in our docs here.
The idea of predicted vs actuals is to identify if there are parts of your data where the model is systematically mis-predicting. If the insight shows larger differences between predicted and actual for a specific feature, it may suggest you need additional data to help explain the discrepancy.
Thanks @Vinay ,
I appreciate the effort you are putting in here. It is important to me to grok this plot. It impacts my work. I did read the literature you linked. And I think I am getting a better handle on the jargon (clearly we have different linguistic and technical backgrounds). But, could I get some direct feedback from you about my earlier response to the two of you?
Clearly such actions as choosing 1000 rows or 100 values of the feature of interest exist only to reduce the computational effort. The numbers are somewhat arbitrary and depend on the computational resources and the size and nature of the dataset.
So, the aspects of partitioning (forming folds) and selection of the data are essentially a sampling process to estimate the true value from a sample of the data.
So, for me, to understand what the plots are, it is easier to forget the issue of sampling and understand what the plots would be if we were not limited by computational power.
The actual plot is the expected value of the target in the data given the value of the feature of interest?
actual(x) = expect( T | X=x )
The predicted plot is the expected value of the target in the estimate given the value of the feature of interest?
predicted(x) = expect( M(X,Y) | X=x )
The dependency plot is the expected value of the model fixing the value of the feature of interest?
dependency(x) = expect( M(x,Y) )
I now feel fairly sure that this is what is going on.
In the above I have used the terms of stochastics. In terms of statistics or machine learning, one is computing a mean value, and if there are very few data points that have X=x, then the calculation will need to involve folds of data so that a meaningful mean can be taken. But these are conceptually separable from the goal and meaning of the calculation. IMHO and experience.
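As a sketch of the sample-mean versions of the three quantities above, assume n rows (x_i, y_i, t_i) and ignore sampling and binning. The index set I_x and the count n_x are notation introduced here for illustration. The last line has the form of the usual partial dependence estimator in the literature (Friedman; Molnar); whether DataRobot's implementation matches it exactly is not confirmed in this thread:

```latex
\begin{aligned}
\mathrm{actual}(x) &\approx \frac{1}{n_x}\sum_{i \in I_x} t_i, \\
\mathrm{predicted}(x) &\approx \frac{1}{n_x}\sum_{i \in I_x} M(x_i, y_i), \\
\mathrm{dependency}(x) &\approx \frac{1}{n}\sum_{i=1}^{n} M(x, y_i),
\end{aligned}
\qquad I_x = \{\, i : x_i = x \,\},\quad n_x = |I_x|.
```

The first two sums run only over rows whose feature value equals x, while the third runs over all rows with the feature value replaced by x.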
Hi @Bruce ,
Regarding the sampling, you are correct in assuming that it is done to reduce computational effort. Also, as I mentioned in my answer, the partial dependency plot is supposed to give us insight into how the prediction would vary on average if we vary just the feature of interest and keep everything else constant. This plot only provides a high-level view of the relationship, and the sampling choices are made based on the tests DataRobot has done on numerous datasets.
Partitioning of data is done as a part of modeling to ensure that our model fits well on unseen data (by not using a partition of the data while training). You can find more details on the partitioning here. In the context of the partial dependence calculations, the partitions / folds represent the partition from which the sample of data is picked.
@Vinay Thanks again.
Is it correct to say
dependency(x) = expected(M(x,Y) )
Given my earlier clarification of the meaning of terms. (how to sample or fold the data or interpret the plot is not part of my question).
Hi @Bruce ,
I think like you mentioned our technical backgrounds differ. I reread what you were confirming regarding dependency plots
The dependency plot is the expected value of the model fixing the value of the feature of interest?
This is incorrect. The correct explanation is provided in Christopher Molnar's book, which you referenced in your first question (link). An excerpt from the same is given below -
The concept of marginalization is important here. The average needs to be taken after marginalizing.
Hi @Bruce ,
I hope the answers my colleagues provided are clear, and since the Feature Effects (Partial Dependence) is also explained, along with many other topics, in the DataRobot University instructor-led AutoML I course ( https://university.datarobot.com/automl-i ) you may wish to sign up for it to get a better understanding of how the platform works.
@Vinay So, you are saying that the formula I referred to in the reference that I gave up front to the literature is correct and is the way that Datarobot does it? Why did you not just say this in the first place?
Is there some way to get direct technical answers without going through this kind of mess again?
I have been very frustrated by the answers I got here, especially since the eventual answer was simply that the literature I referred to is correct after all. In my opinion Vinay has been playing some silly game. He either is incapable or thinks that I am. Not good. He just belatedly looked in the reference I gave and found the formula I referred to and copied it into his answer. Because of this, I now have zero trust in his abilities and will consider the question unanswered.
Addendum: I am honestly and determinedly looking for an answer here. My background is very strong in technical systems identification in heavy industry from an engineering and mathematical point of view, which is relevant to my current role of data scientist - but, that together with my age means that I have very different cultural expectations. While I have no reason to suppose that you will grok that - I need to find someone who can translate. You said you hope that the answers from your colleagues were clear. Go back and look at the chain of posts - see just how far apart Vinay and myself are in the way in which we are talking and our focus and expectations. He kept answering the wrong question until I forced the issue and then he became curt. I do not want that. I want to come to this forum for a pleasant professional and informative interaction. If you can honestly assist - it would give me a better feeling about Datarobot than I currently have.
I have been answering all your questions on this chain. The reason Lukas and I were trying to explain the calculations behind the partial dependence charts, instead of just referring to the Christopher Molnar book you had provided, is that your interpretation of the formula in the book was incorrect ("Seems to define it as the expected value varying the other features").
I understand that your technical background is different from mine, and hence I was trying to explain the maths behind the calculations. Partial dependence plots are a common tool in machine learning and there is a lot of literature on them. The reason I explained in detail instead of just referring you to the link was to give you more detail on how some of the specifics of the calculations work in the background in DataRobot (sampling etc.).
I am sorry you feel that I am trying to play some game or waste your time here. My only intention was to help and I was never curt. I was trying to help you to the best of my ability based on my own data science experience and familiarity with DataRobot.
@Bruce I am so sorry you have been frustrated with our attempts at clarification on this.
Pardon my giving a simplistic analogy to this, it's like trying to describe a color verbally when it is much simpler to just see the color. Similarly trying to describe how to ride a bike is rather difficult instead of just sitting on it.
The reason I recommended one of our DataRobot University instructor-led courses is that the instructors provide a detailed explanation of partial dependence along with examples and applications that give the students a much better picture of partial dependence via that verbal and visual instructional mode.
Another alternative is to have one of our Customer Facing Data Scientists explain it to you via the platform with examples and perhaps even your own data for a better appreciation in your domain.
We do hope to be able to better serve you in your machine learning journey and appreciate your queries and consideration.
Thanks for following up on this.
I have been unable to continue for now due to work priorities. But, since some of these relate to our decisions about which data science platform to go with, I am still interested. And I am personally interested, as I usually make a point to understand all these details in a precise mathematical sense.
Can you suggest a good, proper mathematics article or book chapter? My background is that I have a doctorate in mathematics after doing degrees in engineering and then software and working on industrial stochastic control systems. As such, I am used to just getting down to the technical hard core. Probability spaces, stochastic processes, and so on. I found that people on this forum either can't or won't talk about that, so it is not clear to me that talking to a DataRobot Data Scientist will be any better.
An exact description of the computational process would work - and I know that they were attempting that, but we kept talking at cross purposes. The discussion got entangled with details of sampling that are orthogonal to the matter of what partial dependency is.
My actual problem was that it was very unclear what Datarobot was doing in precise terms. And the terms seem to be used differently in different references and to not be very common. I looked up several references including, for example Practical Statistics for Data Scientists by Peter Bruce - and no such term "partial dependency" appeared in the index. I also asked a colleague who is a career data scientist with a mathematics degree - who was unfamiliar with the term as well.
What I was hoping for when I posted originally was literally just the mathematical definition of what was being calculated. I never got that. The right 10 page article could clear the whole thing up for me in one reading.
Much appreciated as I professionally and personally want to clear this up.
Thank you for your clarification Bruce and I think it would be easier to define the partial dependence calculation as an algorithm rather than a mathematical expression.
I think part of the reason for the confusion is its simplicity, and once you have it explained to you verbally or visually it will probably become apparent why it is difficult to distill into the standard journal form.
Given this long discussion trail a verbal conversation via one of our instructor-led courses or discussion with one of our Customer Facing Data Scientists is probably the most efficient way of understanding this and many other topics it will probably lead into.
Thank you once again for your interest.