Solved: prediction values - DataRobot Community

cookie_yamyam · ‎05-25-2022

I built a Visual AI model for multilabel classification that predicts which number an image shows (classes 0, 1, 2, 3).

How should I interpret the "Prediction Explanations" values below?

Why does the prediction go from 3.781 to -0.823?

These values don't look like probabilities.

What are they?

Thanks in advance for your prompt reply.

jenD · ‎05-25-2022

Hi @cookie_yamyam. The metric shown in your screenshot is RMSE, which means the project ran as regression and not multilabel classification. So, although you may have encoded the target as 0, 1, 2, 3, you probably launched the project in regression mode (the default for numeric targets). Can you try again and click “switch to classification” during target selection? Here is the documentation for doing this.

Alternatively, you could encode the target as a categorical variable, not numeric (cat_0, cat_1, cat_2, instead of 0,1,2), which will make classification the default mode.
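As a rough illustration of that re-encoding, here is a minimal Python sketch that rewrites a numeric label column as text labels before upload. The CSV content, the column name `target`, and the helper `relabel` are all made up for this example; only the `cat_` prefix comes from the suggestion above.

```python
import csv
import io

# Hypothetical training file: image paths with numeric labels 0-3.
raw = "image,target\nimg_a.png,0\nimg_b.png,1\nimg_c.png,2\nimg_d.png,3\n"

def relabel(csv_text, column="target", prefix="cat_"):
    """Return CSV text with `column` rewritten as prefixed text labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # "0" becomes "cat_0", so the column can no longer be parsed as a number.
        row[column] = prefix + row[column]
        writer.writerow(row)
    return out.getvalue()

relabeled = relabel(raw)
print(relabeled)
```

Because the labels are no longer parseable as numbers, the column should be picked up as categorical rather than numeric.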

Let us know if that solves the problem....jen

cookie_yamyam · ‎05-25-2022

Oh, I missed the important point!

Thank you, jenD.

I changed the feature to a categorical variable and built the model again.

When I changed the feature type, DataRobot warned about leakage because of the association.

In this case, should I remove the original feature, or just build the model?

I think it doesn't matter if I leave all the data in, because DataRobot selects informative features. Am I right?

On a different topic, a feature's information sometimes overlaps in the display because of its length.

Also, I have some more questions about understanding the charts.

1) Why is the number of rows the same in every lift chart bin? This dataset has 18,509 data points. Where does the row count of 296 come from?

And how should I explain this lift chart in Visual AI (multiclass classification)?

What does the lift chart mean in this case?

2) In the multiclass confusion matrix, I can see a 'total' of 2962. Where does that number come from? I think it should be 11,846, because this model used 64% of the dataset, which has 18,509 data points.

cookie_yamyam · ‎05-25-2022

In the 'Image Embeddings' and 'Activation Maps' menus, I can see results only for 0 and 1.

They should show results for 2 and 3 as well.

I can find results for 0, 1, 2, and 3 in the 'Multiclass Confusion Matrix'.

jenD · ‎05-25-2022

OK, let's see. Easy one first. Thanks for the screenshot of the overlap issue, I sent it on to our developers.

As for the leaky feature: if you do not specify a custom feature list, then yes, DataRobot will use Informative Features (in this case, probably Informative Features - Leakage Removed). So the leaky feature would not be used during model training. See the feature list documentation.

For the confusion matrix, it is based on the validation partition, which is 16% of the data. All data minus 20% for holdout (80% left), split into 5 CV partitions (4 to train, 1 to validate), gives you 16%. So 2962 is 16% of your total rows. Because you have it set to "Global" (all observations) with no class selected, that's where the number comes from!
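The partition arithmetic can be checked in a few lines of Python. This is only a sketch: the 20% holdout and 5 CV folds are the values described in this reply, the 18,509 rows come from the question, and the variable names are made up for illustration.

```python
total_rows = 18_509        # dataset size quoted in the thread
holdout_fraction = 0.20    # held out before cross-validation
cv_folds = 5               # the remaining rows are split into 5 folds

# One fold validates, the other four train.
validation_fraction = (1 - holdout_fraction) / cv_folds
training_fraction = (1 - holdout_fraction) * (cv_folds - 1) / cv_folds

validation_rows = total_rows * validation_fraction
training_rows = total_rows * training_fraction

print(f"validation: {validation_fraction:.0%} of rows, about {validation_rows:.0f}")
print(f"training:   {training_fraction:.0%} of rows, about {training_rows:.0f}")
```

The validation figure comes out at about 2,961, within a row of the 2,962 shown in the confusion matrix; the exact count depends on how rows are assigned to folds. The 64% figure (about 11,846 rows) is the training portion, which is not what the confusion matrix is computed on.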

Regarding why only 296 rows are showing in the Lift Chart, it's the same calculation: the roughly 2,962 rows in the validation set, divided across 10 bins. In terms of how to interpret it, it is the same as for binary classification, only this time it's a "per class" Lift Chart (set the class at the bottom of the chart).
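To make the binning concrete, here is a generic sketch of how a per-class lift chart can be built: sort the validation rows by the predicted probability for one class, cut them into 10 equal bins, and compare the mean prediction with the mean actual rate in each bin. This is an illustration of the general technique, not DataRobot's implementation; the function name `lift_chart_bins` and the synthetic data are assumptions.

```python
import random

def lift_chart_bins(predicted, actual, n_bins=10):
    """Sort rows by predicted probability, split into n_bins, average each bin."""
    rows = sorted(zip(predicted, actual))
    n = len(rows)
    bins = []
    for b in range(n_bins):
        chunk = rows[b * n // n_bins:(b + 1) * n // n_bins]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_actual = sum(a for _, a in chunk) / len(chunk)
        bins.append((len(chunk), mean_pred, mean_actual))
    return bins

# Synthetic stand-in for one class: 2,962 validation rows, each with a made-up
# predicted probability and a 0/1 flag for "this row belongs to the class".
random.seed(0)
predicted = [random.random() for _ in range(2962)]
actual = [1 if random.random() < p else 0 for p in predicted]

for size, mean_pred, mean_actual in lift_chart_bins(predicted, actual):
    print(size, round(mean_pred, 3), round(mean_actual, 3))
```

With 2,962 rows and 10 bins, every bin holds 296 or 297 rows, which is why the chart reports the same row count throughout. A well-calibrated model shows the predicted and actual averages tracking each other from the lowest bin to the highest.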

dalilaB · ‎05-25-2022

What is the distribution of each target?

jenD · ‎05-25-2022

In terms of the Image Embeddings results, remember that the visualization shows a sample of 100 images. Is it possible that, due to unlucky sampling, all the images came from those two classes? (It looks like the other two may have been significantly less frequent.)

cookie_yamyam · ‎05-28-2022

Thank you for the clear explanation!

If I want to see the confusion matrix for the whole training dataset, what should I do?

cookie_yamyam · ‎05-28-2022

Class 0 has 9,191 rows.

Class 1 has 2,519 rows.

Class 2 has 3,194 rows.

Class 3 has 3,505 rows.

I think the logic for randomly choosing 100 images is not balanced.
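For reference, the class counts above can be turned into a quick estimate of what a 100-image sample would look like. This sketch assumes the sample is drawn uniformly at random from all rows, which the thread does not confirm, and it approximates the draw as sampling with replacement; the variable names are made up.

```python
# Class counts as listed in the post above.
class_counts = {0: 9191, 1: 2519, 2: 3194, 3: 3505}
sample_size = 100
total = sum(class_counts.values())

# Expected number of images per class in a uniform random sample.
expected = {label: sample_size * n / total for label, n in class_counts.items()}

# Chance that every sampled image is class 0 or class 1
# (with-replacement approximation).
share_0_1 = (class_counts[0] + class_counts[1]) / total
p_only_0_1 = share_0_1 ** sample_size

for label, count in expected.items():
    print(f"class {label}: about {count:.0f} of {sample_size} images")
print(f"P(only classes 0 and 1) is roughly {p_only_0_1:.1e}")
```

Under that assumption, each class would be expected to appear more than ten times in the sample, and a sample containing only classes 0 and 1 would be extremely unlikely.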

desmond_lim · ‎05-31-2022

@cookie_yamyam DataRobot's automatic target leakage detection is based on ACE importance scores with respect to the target; a feature is removed if its score exceeds a very high threshold (https://app.datarobot.com/docs/data/analyze-data/data-quality.html#target-leakage). However, target leakage is more nuanced than that, and it takes a domain expert to identify which features should be removed from the modelling exercise.

A good explanation of that is covered in the instructor-led AutoML class listed here:

https://university.datarobot.com/automl-i
