Hi team! A student in the DataRobot for Data Scientists class created a ModelID (Categorical Int) feature from the standard class file "fastiron 100k data.csv.zip", and it was flagged as Target Leakage on his first run in Manual mode.
When he tried again, the platform did not show the yellow triangle for target leakage, but the Data Quality Assessment box did flag a target leakage feature.
His questions are:
1 - Why is DR showing the target leakage intermittently?
2 - The original ModelID as a numeric int did not cause a target leakage flag, and when he included that parent feature together with the child feature (ModelID as categorical int), it was not flagged as Target Leakage either. Why is that?
At a quick glance, it sounds like the user created a new feature, ModelID (Categorical Int), via a Var Type transform and then kicked off Autopilot in Manual mode, during which ACE importance scores were calculated for the created feature. Those importance scores passed our target leakage threshold, and therefore Data Quality Assessment tagged the feature as potential leakage.
After looking at the project, I see that no feature list called "Informative Features - Leakage Removed" was created, meaning the feature didn't pass the "high-risk" leakage threshold value and was therefore tagged as a "moderate-risk" leakage feature.
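To make the two tiers concrete, here is a rough sketch of the tagging logic. The cutoffs of 0.85 (moderate-risk) and 0.975 (high-risk) are assumptions based on the publicly documented defaults, the function name `leakage_risk` is hypothetical, and whether the comparison is strict or inclusive at the boundary is a guess. This is not the platform's actual code.

```python
# Assumed cutoffs (not taken from this thread); check the current docs.
MODERATE_RISK = 0.85
HIGH_RISK = 0.975


def leakage_risk(importance: float) -> str:
    """Map a normalized importance score (0..1) to a leakage tier."""
    if importance >= HIGH_RISK:
        # High-risk features are the ones excluded from the
        # "Informative Features - Leakage Removed" list.
        return "high-risk"
    if importance >= MODERATE_RISK:
        # Moderate-risk features are flagged but not removed.
        return "moderate-risk"
    return "none"


if __name__ == "__main__":
    for score in (0.50, 0.86, 0.99):
        print(score, leakage_risk(score))
```

Under these assumed cutoffs, a score sitting just above 0.85 lands in the moderate-risk tier, which matches a project where the feature is flagged but no "Leakage Removed" list is created.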
I found the /eda/profile/ values in the Network Console for the project for the specific feature ModelID (Categorical Int). The calculated ACE importance score (Gini Norm metric) for that created feature is about 0.8501.
You can let the user know that changing a Numeric feature to the Categorical var type can lead to different univariate analysis results in our Data page Importance score calculations. In this case, the importance score only narrowly passed our moderate-risk target leakage threshold value. Hope that helps.
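For intuition on why the var type matters, here is a toy sketch comparing a normalized Gini score for the same ID column treated as a number versus as a category. It is not the ACE implementation: the data is made up, the categorical treatment (replacing each level with its in-sample mean target) is a simplifying assumption, and `gini`, `gini_norm`, and `target_mean_encode` are hypothetical helper names.

```python
from collections import defaultdict


def gini(actual, score):
    """Unnormalized Gini: how fast the target accumulates when rows are
    ranked by descending score (ties keep their original order)."""
    order = sorted(range(len(actual)), key=lambda i: (-score[i], i))
    total = float(sum(actual))
    running, area = 0.0, 0.0
    for i in order:
        running += actual[i]
        area += running / total
    n = len(actual)
    return (area - (n + 1) / 2.0) / n


def gini_norm(actual, score):
    """Gini of the score divided by the Gini of a perfect ranking."""
    return gini(actual, score) / gini(actual, actual)


def target_mean_encode(ids, target):
    """Replace each category level with its mean target (in-sample)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, y in zip(ids, target):
        sums[k] += y
        counts[k] += 1
    return [sums[k] / counts[k] for k in ids]


# Made-up data: the ID value has no monotonic relation to the price,
# but each individual ID maps to a tight price range.
model_id = [1, 1, 2, 2, 3, 3, 4, 4]
price = [10, 12, 50, 48, 11, 9, 52, 49]

as_numeric = gini_norm(price, model_id)
as_categorical = gini_norm(price, target_mean_encode(model_id, price))

if __name__ == "__main__":
    print("numeric:", round(as_numeric, 3))
    print("categorical:", round(as_categorical, 3))
```

In this toy setup the categorical treatment scores much higher than the numeric one, because an ID used as a number only helps if larger IDs mean larger prices, while an ID used as a category can carry a separate price level per value.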
Thanks for such detailed feedback!
Any time! If you have any further questions about Target Leakage, let the TREX (trust and explainability) team know (or ping me, since I helped work on it)!