I asked this recently on the other forum but maybe this is more the right place.
A classification prediction comes with what looks like a number between 0 and 1, call it the certainty, and also a collection of feature-strength-value triples.
But, what exactly is strength?
If one is computing a continuous variable from continuous variables then this could be the partial derivative (marginal statistics). And since the certainty is a continuous variable between 0 and 1, this is a valid measure. But, is that what it is? At the very least it should be scaled by some measure of spread of the input variables. And what if the input variable is an integer or even categorical?
Can someone clarify this for me?
Hi Bruce, in a classification problem the output values are propensity scores, which can then be converted to discrete predicted values by applying threshold(s). A propensity score is not really the certainty, and I often caution folks not to (necessarily) think of it as a true real-world probability either - unless the model is very accurate and very well-calibrated. It can be more useful to think of propensity scores as relative, since many classification use cases end up being a ranking exercise: for example, ordering by descending propensity score to see which of the scored individuals have the highest likelihood or risk.
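To make the two uses of a propensity score concrete, here is a minimal sketch. The scores, the identifiers and the 0.5 threshold are made-up illustrative values, not output from any real model.

```python
# Hypothetical propensity scores for five scored individuals.
scores = {"a": 0.91, "b": 0.15, "c": 0.62, "d": 0.48, "e": 0.77}

def to_labels(scores, threshold=0.5):
    """Convert propensity scores to 0/1 predictions with one threshold."""
    return {key: int(score >= threshold) for key, score in scores.items()}

def rank_by_propensity(scores):
    """Order individuals from highest to lowest propensity score."""
    return sorted(scores, key=scores.get, reverse=True)

labels = to_labels(scores, threshold=0.5)
ranking = rank_by_propensity(scores)
print(labels)
print(ranking)
```

The ranking only depends on the order of the scores, so it stays meaningful even when the scores are not well calibrated as probabilities. The labels, by contrast, change as soon as the threshold moves.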
Re: your question on strength-feature-value triples, I understand this as referring to the prediction explanations which can be provided alongside the predictions. For models which support it, these are derived from SHAP values (Shapley Additive Explanations) - and for models which don't, from XEMP (Exemplar Based Explanations). Prediction Explanations are documented here:
To answer your question, integer and categorical features are catered for - the high-level interpretation is:
These are the (say top 3 or whatever was specified) feature-values, their direction positive or negative, and a simple granular representation of their magnitude, which contributed to the propensity score for this individual. So the 'strength' is the relative marginal influence on the predicted outcome, according to the feature value's numeric SHAP score for this individual. The SHAP values are ordered by descending magnitude and the top X are shown - and these will be different per row/individual.
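The selection step described above can be sketched in a few lines. This assumes the per-row SHAP values have already been computed by some other tool; the feature names and numbers are hypothetical, and the sketch does not reproduce how the product maps magnitudes to its qualitative symbols.

```python
def top_explanations(features, shap_values, k=3):
    """Return the k (feature, value, shap) triples with the largest |shap|."""
    triples = [(name, features[name], shap_values[name]) for name in features]
    triples.sort(key=lambda t: abs(t[2]), reverse=True)
    return triples[:k]

# Hypothetical row and hypothetical SHAP values for that row.
row = {"age": 42, "plan": "premium", "visits": 3, "region": "north"}
row_shap = {"age": 0.02, "plan": 0.21, "visits": -0.13, "region": -0.01}

top = top_explanations(row, row_shap, k=3)
for name, value, contribution in top:
    direction = "+" if contribution > 0 else "-"
    print(f"{name}={value!r}: {direction} ({contribution:+.2f})")
```

Note that the categorical feature ("plan") is handled exactly like the numeric ones: the SHAP value attaches to the feature value for this row, so no derivative with respect to the input is needed.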
This link to the docs goes into some detail on SHAP:
This general reference may also be useful:
Hope this helps.
@TravisB Thanks for the links that I have yet to digest.
I picked the word "certainty" because it has no common technical meaning - unlike probability or likelihood. However, you say propensity. What is intended by the use of that term? I found this link fairly quickly, Propensity Scores: A Primer - KDnuggets, which spoke of it being the result of a broken or incomplete experiment. But, is this the same thing to which you refer? If you have 5 classes, should the 5 propensity scores add to exactly 1? Or is that not a thing? Does propensity have any intuitive meaning you can hang a hat on - or is it a mostly meaningless number regarding the way the model has approximated the function - sort of like using logistic regression on a binary step function?
@Bruce Cool discussion. I usually associate 'certainty' with determinism and 'uncertainty' with stochastic/probabilistic outcomes. I believe propensity scores approach true probability when models approximate the function in question very well (highly accurate & calibrated). But no, I don't believe algorithms for multinomial classification guarantee that class propensities sum to 1.
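Whether class scores sum to 1 depends on how the model produces them. As a general illustration (not a statement about any particular product), a softmax output sums to 1 by construction, while independent per-class sigmoid scores, as in a one-vs-rest setup, need not. The logits below are arbitrary example numbers.

```python
import math

def softmax(logits):
    """Softmax with the usual max-shift for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function applied to a single raw score."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary raw scores for five classes.
logits = [2.0, 0.5, -1.0, 0.0, 1.0]

softmax_scores = softmax(logits)
ovr_scores = [sigmoid(z) for z in logits]  # each class scored independently

print(sum(softmax_scores))
print(sum(ovr_scores))
```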
Great point re: softmax @IraWatt
@TravisB Very interesting about propensity which I was not aware of and will have to read up on. Especially that they do not add to 1.
Regarding "certainty" I might be getting out of scope but I would like to explain my use of the word. In binary logic we give statements values 0 and 1. In probability values in [0,1]. In this sense probability theory can be seen as a generalization of binary logic, with binary logic reappearing for certainly true or certainly false statements. But, this is just one example of the idea of generalizing truth values. For example, we could use a standard trinary logic, true, false, or unknown. Or we could use fuzzy logic, which is a bit like probability, but the method of combination is different.
So, to me just as a person who has a weight of 0 is light and a person who has a strength of 0 is weak - a statement that has a certainty of 0 would be false. So, certainty is being used by me in the sense of determinism, but as a scale. So, yeah, that's the reason I picked it.
I have been doing a bit of looking around.
These guys say that propensity is a probability.
And these guys specifically say it is the conditional probability given the data.
These guys seem to mean a rate of change of the probability.
A lot of people refer to propensity estimation - which seems to me to imply that there is something that exists that is being estimated. The example of logistic regression comes up several times. Clearly, one can use something like logistic regression to approximate the characteristic function of each class - which is analogous to what DataRobot is doing.
I have not done the experiment yet to see whether DataRobot prediction scores add to unity - but since this could be done merely by normalizing them, it feels like something that they would be remiss not to do.
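The experiment itself is small once the per-class scores for a row are in hand. A minimal sketch, with hypothetical scores standing in for real model output:

```python
def sums_to_one(scores, tol=1e-6):
    """Check whether a row of class scores adds to unity within a tolerance."""
    return abs(sum(scores) - 1.0) <= tol

def normalize(scores):
    """Rescale non-negative class scores so that they add to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]

# Hypothetical scores for one row across five classes (they add to 1.30).
raw = [0.60, 0.30, 0.25, 0.10, 0.05]

print(sums_to_one(raw))
print(sums_to_one(normalize(raw)))
```

Dividing by the total keeps the order of the classes within a row, but it does not by itself make the numbers calibrated probabilities.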
I am a bit concerned, though, that some writers are mixing up probability and likelihood.
My current position is then that the DataRobot prediction scores, or propensities, are intended in principle to be conditional probabilities based on a model built from the statistical data - but it is unknown to me whether they are guaranteed to add to unity.
I would like to revisit Bruce's original question, which as I understand it has to do with the concept of strength reported by the XEMP Prediction Explanations.
The docs state as follows:
Each explanation is a feature from the dataset and its corresponding value, accompanied by a qualitative indicator of the explanation’s strength—strong (+++), medium (++), or weak (+) positive or negative (-) influence. If an explanation’s score is trivial and has little or no qualitative effect, the output displays three greyed out symbols (+++ or - - -).
I understand from the whitepaper that XEMP for each feature is calculated as the difference between Feature Effects (partial dependence) values and a weighted average of partial dependence values for the feature concerned. Therefore the basis for computing strength is the deviation in partial dependence from the 'usual value'. What is less clear perhaps is what basis is used for the 3 qualitative indicators of strength: strong, medium, weak (+ trivial).
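The calculation as I have described it can be sketched as follows. This follows my reading above and is not DataRobot's implementation: the model, the rows, and the choice of observed frequencies as the weights are all assumptions made for illustration.

```python
from collections import Counter

def partial_dependence(model, rows, feature, value):
    """Average prediction when `feature` is forced to `value` in every row."""
    preds = [model({**row, feature: value}) for row in rows]
    return sum(preds) / len(preds)

def deviation_strength(model, rows, feature, value):
    """Partial dependence at `value` minus its frequency-weighted average."""
    counts = Counter(row[feature] for row in rows)
    n = len(rows)
    usual = sum(
        partial_dependence(model, rows, feature, v) * c / n
        for v, c in counts.items()
    )
    return partial_dependence(model, rows, feature, value) - usual

def toy_model(row):
    """Hypothetical scoring function standing in for a fitted model."""
    score = 0.2 + 0.05 * row["visits"]
    if row["plan"] == "premium":
        score += 0.5
    return score

rows = [
    {"plan": "basic", "visits": 1},
    {"plan": "basic", "visits": 2},
    {"plan": "premium", "visits": 2},
    {"plan": "basic", "visits": 3},
]

strength_premium = deviation_strength(toy_model, rows, "plan", "premium")
strength_basic = deviation_strength(toy_model, rows, "plan", "basic")
print(strength_premium, strength_basic)
```

Because the feature is only ever set to a value and never differentiated, the same code works for categorical and integer features. How the numeric deviation is then bucketed into strong, medium and weak is the part that remains unclear.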
Apologies if I have misunderstood your question and hijacked your thread @Bruce .
No hijacking, you are on the right track. I am going to accept your answer.
In simple terms - the strength is a (local) partial derivative estimation, and the scaling of the number of plus or minus signs is not apparent. In principle they stand for -3,-2,-1,+1,+2,+3. If the value is 0 the item is not mentioned as an explanation. But, the details of the scaling elude me and seem complicated and arbitrary.
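For a continuous input, the kind of local estimate I have in mind can be sketched as a central finite difference scaled by the spread of the feature. The model, the column of ages and the step size are hypothetical, and this says nothing about how the plus and minus signs are actually assigned.

```python
import math
import statistics

def scaled_sensitivity(model, x, feature, column, rel_step=0.01):
    """Central-difference derivative of the score, times the feature's spread."""
    spread = statistics.pstdev(column)
    if spread == 0:
        raise ValueError("feature has no spread; sensitivity is undefined")
    h = rel_step * spread
    up = {**x, feature: x[feature] + h}
    down = {**x, feature: x[feature] - h}
    derivative = (model(up) - model(down)) / (2 * h)
    return derivative * spread  # score change per standard deviation

def toy_model(row):
    """Hypothetical logistic score that depends on age only."""
    return 1.0 / (1.0 + math.exp(-(0.04 * row["age"] - 2.0)))

ages = [30, 40, 50, 60, 70]  # hypothetical observed values of the feature
strength = scaled_sensitivity(toy_model, {"age": 50}, "age", ages)
print(strength)
```

Scaling by the standard deviation makes the numbers comparable across features measured in different units. For a categorical feature there is no derivative to take, so a difference between levels has to play that role instead.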
But, I also take a moment to warn anyone following this track that I had to do quite a lot of reading to make sense out of the Xemp white paper, and that IMHO that white paper is misleading and (naturally) rather biased in favor of DataRobot as a piece of commercial software.
As far as I can see the essence of the distinction between Lime and Xemp is mainly that Xemp uses values from the original data in order to produce a consistent explanation. However, this is essentially a modification of Lime that forces the explanation to be consistent - but does not stop it having an element of arbitrariness. And since this is supposed to explore the model rather than the data - it suffers from testing the model only where the original data exists. Thus, being inappropriately kind to the model.
The core of the Lime method is to find a model that has similar behaviour in a local region. This is the same idea as used in many other contexts - in particular the use of Taylor series or simply local affine approximants - which are all over the place in theory and practice.
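A minimal sketch of that idea, fitting a local affine approximant by weighted least squares around one point. It is written in plain Python and is not the lime package's API; the Gaussian sampling, the kernel width and the toy model are assumptions chosen for illustration.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def local_linear_surrogate(model, x0, scale=0.5, n_samples=500, seed=0):
    """Fit intercept and slopes of an affine model around x0."""
    rng = random.Random(seed)
    d = len(x0)
    xtx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xty = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Probe the model at a Gaussian perturbation of x0.
        z = [xi + rng.gauss(0.0, scale) for xi in x0]
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        w = math.exp(-dist2 / (2 * scale ** 2))  # nearer probes weigh more
        row = [1.0] + [zi - xi for zi, xi in zip(z, x0)]
        y = model(z)
        for i in range(d + 1):
            xty[i] += w * row[i] * y
            for j in range(d + 1):
                xtx[i][j] += w * row[i] * row[j]
    coef = solve_linear_system(xtx, xty)
    return coef[0], coef[1:]

def toy_model(z):
    """Hypothetical nonlinear score of two continuous inputs."""
    return 1.0 / (1.0 + math.exp(-(2.0 * z[0] - z[1])))

intercept, slopes = local_linear_surrogate(toy_model, [0.0, 0.0])
print(intercept, slopes)
```

The slopes play the role of the first-order Taylor coefficients, except that they are averaged over the sampled neighbourhood rather than taken at a single point.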
I was not convinced that Lime uses a surrogate model and Xemp uses the original. Lime could be said to be providing a simple approximation as a description of the original model. Xemp does not seem to spend any effort on the internals of the model so could be said to be using what amounts to an implicit surrogate model.
Thanks for accepting my solution Bruce.
I think it's important to keep in mind that choosing a model explanation method can be quite subjective. I cannot speak for DataRobot as to how they came to decide on their model explanation offerings, but here are some of my personal thoughts:
It is interesting to chat with you.
Unless you fundamentally support Xemp against Lime, we may well be in basic agreement. My own approach is that they are just part of the same general approach.
Although - I don't feel that Xemp applies a what-if analysis at all. It only uses the data that has already been processed. While Lime thinks up new scenarios and asks how the model would behave under those conditions. That sounds like Lime is doing the what-if analysis.
The thrust of my intention was to remove the idea that Xemp somehow wins hands-down against Lime. In fact, I see the two as essentially the same approach, differing only in how they select the data used to probe the model.
I am now in my official work not sticking to one or the other approach, but have explicitly chosen to see the whole thing in a more general light which I am calling Limesque.
I do not think that selecting data from the actual original sample used to train the model fundamentally uses more realistic data. All samples are biased. One could use further sampled data perhaps, and perhaps that would be more justified. And if you have an idea of the distribution of the data, you could generate potentially realistic data to test it on. I don't feel that one should specialize to a normal distribution - which Lime does. And, I don't feel that one should specialize to the already sampled data - which Xemp does.
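The difference in how the two approaches pick their probe points can be sketched like this. The function names are my own, the data is randomly generated, and the characterisation of each method follows my description in this thread rather than either tool's actual code.

```python
import random

def gaussian_probes(x0, scale, n, rng):
    """Lime-style: perturb the row of interest with Gaussian noise."""
    return [[xi + rng.gauss(0.0, scale) for xi in x0] for _ in range(n)]

def empirical_probes(data, n, rng):
    """Xemp-style as I read it: reuse rows that were actually observed."""
    return [list(rng.choice(data)) for _ in range(n)]

def fraction_outside(probes, low=0.0, high=1.0):
    """Share of probes with at least one coordinate outside [low, high]."""
    outside = sum(any(v < low or v > high for v in p) for p in probes)
    return outside / len(probes)

rng = random.Random(0)
# Hypothetical sample: two features, both confined to [0, 1).
data = [[rng.random(), rng.random()] for _ in range(200)]
x0 = [0.9, 0.9]  # row being explained, near the edge of the sample

synthetic = gaussian_probes(x0, scale=0.3, n=200, rng=rng)
observed = empirical_probes(data, n=200, rng=rng)
frac_synthetic = fraction_outside(synthetic)
frac_observed = fraction_outside(observed)
print(frac_synthetic, frac_observed)
```

The Gaussian probes wander outside the range the sample covers, while the observed probes never can. A sampler drawn from an assumed distribution of the data would sit between the two, and any of the three can be plugged into the same local fitting step.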
We are perhaps in some agreement here - as when you say that the critique cuts both ways, that was precisely the point I was making. The idea that Xemp uses more realistic data is the standard Xemp proponent justification. So, I gave the counter argument.
My own personal interest has moved into looking at the idea that an explanation is a theory. And that in a very real sense, the large and supposedly more accurate numerical and combinatorial models with thousands of terms should not be considered fundamentally a better theory. In fact from the very fact of being large and complicated - they already fail. And they typically need recalibrating, which emphasizes that failure.
The real Data Science task seems to cut very deeply into the practice of science itself. It is not something that is solved by throwing computational grunt at it the way we are doing today.
In my opinion - an explanation of a model should fundamentally come from the internals of the model. Models should be built from the ground up to produce explanations. After-market bolt-ons like Xemp and all Limesque approaches have a deep problem there.
Ultimately, an explanation should involve the ability to reverse the decision process. How can I change the data, possibly to something that has not been seen before, so that I can change the outcome? An explanation that does not involve control is a pretty poor explanation.
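The crudest form of that question can at least be posed from outside the model: search for the smallest change to one feature that flips the decision. The sketch below is a brute-force illustration with a hypothetical model, feature names and candidate grid; it treats the model as a black box, which is exactly the limitation I am complaining about.

```python
import math

def smallest_flip(model, x, feature, candidates, threshold=0.5):
    """Return the candidate value nearest to x[feature] that flips the decision."""
    original = model(x) >= threshold
    best = None
    for value in candidates:
        trial = {**x, feature: value}
        if (model(trial) >= threshold) != original:
            change = abs(value - x[feature])
            if best is None or change < best[1]:
                best = (value, change)
    return None if best is None else best[0]

def toy_model(row):
    """Hypothetical approval score from income and number of debts."""
    z = 0.1 * row["income"] - 1.0 * row["debts"] - 2.2
    return 1.0 / (1.0 + math.exp(-z))

applicant = {"income": 30, "debts": 2}
flip_income = smallest_flip(toy_model, applicant, "income", range(0, 101, 5))
print(flip_income)
```

Nothing here says whether the returned value is realistic or reachable, which is why I suspect a good answer has to come from the internals of the model.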
I suspect that a good answer to that only comes from an examination of the internals and probably requires the machine learning fitting method to have been designed from the ground up to admit this option. Neither Xemp nor Lime gets anywhere near doing this in their vanilla forms.