Data Cheats: How Target Leakage Affects Models

Community Team

(Part of a model building learning session series.)

Target leakage, also known as data leakage, is one of the most challenging problems when building machine learning models. Without proper checks and guardrails, you may not realize you have target leakage until you deploy a model and notice that its performance in a production environment is worse than it was during development.

During this session, we cover conceptual definitions of target leakage and the ways it can arise before model building, in particular during the data engineering and project setup phases. Then we demonstrate how DataRobot's Data Quality Assessment performs Target Leakage Detection to help ensure that projects follow data science best practices and that the resulting models are robust to real-world data. Finally, we provide a handy checklist to help you evaluate your projects for target leakage.

Hosts

  • Yuriy Guts (DataRobot, Engineer)
  • Alex Shoop (DataRobot, Engineer)
  • Rajiv Shah (DataRobot, Data Scientist)
  • Jack Jablonski (DataRobot, AI Success Manager)

More Information

DataRobot Community:

DataRobot University

DataRobot.com:

More information for DataRobot users: search in-app Platform documentation for Data quality assessment, then locate the section "Target leakage."

Let us know what you think!

Have questions that were not answered during the learning session? Want to continue your conversation with Yuriy, Shoop, and Rajiv? You can email learning_sessions@datarobot.com or post a comment here. We're looking forward to hearing from you!

5 Comments
DataRobot Employee

Question: When DataRobot computes Feature Impact, if one feature dominates the importance plot and the others are each relatively unimportant (<15%), would you consider that target leakage?

Answer: 

    • Target leakage is typically defined as using a feature that is not available at the time of prediction. So in this case, it takes your domain knowledge to decide whether this is target leakage: is this feature available at the time of prediction?
    • It's not unusual to have one feature that dominates the importance plot. For example, suppose I am predicting a child’s weight from their height and their music preferences. My guess is that height will be an important feature and music preferences much less so. This isn’t target leakage; height is simply a strong predictor of weight for children.
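
To make the "available at the time of prediction" test concrete, here is a minimal sketch in Python. It assumes you can annotate each feature with when its value becomes known relative to the prediction moment. The `FeatureInfo` class, the `known_at_offset_days` field, and the `weight_at_followup_visit` feature are all hypothetical and only for illustration; this is not how DataRobot's Target Leakage Detection works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureInfo:
    name: str
    # Hypothetical metadata supplied by a domain expert: days relative to the
    # prediction moment at which the value becomes known.
    # <= 0 means known in time; > 0 means known only after the prediction.
    known_at_offset_days: int


def split_by_availability(features):
    """Return (usable, suspect) feature names based on availability at prediction time."""
    usable = [f.name for f in features if f.known_at_offset_days <= 0]
    suspect = [f.name for f in features if f.known_at_offset_days > 0]
    return usable, suspect


features = [
    FeatureInfo("height_cm", -1),                 # measured before predicting
    FeatureInfo("music_preference", -30),         # known well in advance
    FeatureInfo("weight_at_followup_visit", 14),  # recorded after the prediction
]

usable, suspect = split_by_availability(features)
print("usable: ", usable)
print("suspect:", suspect)
```

Note that `height_cm` stays in the usable list however strongly it predicts the target. Dominance in an importance plot is a reason to check a feature's timing, not evidence of leakage by itself.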
DataRobot Employee

Question: On DR, can you select and remove the leaky data directly from the dataset or do you have to create a new non-leaky feature set and use that for modeling?

Answer: If you have the subject matter expertise, yes: on the Data page you can manually create a new feature list that excludes the leaky feature(s) before kicking off Autopilot modeling. More information about Feature Lists

DataRobot Employee

Question: Do you have any advice on checking for performance degradation over time?

Answer: Yes, data drift detection and target drift detection can help with this. Take a look at MLOps (Machine Learning Operations) and our community walkthrough!
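
As a rough illustration of what a data drift check can look like, the sketch below computes a Population Stability Index (PSI) between a training sample and a production sample of one numeric feature. PSI is a commonly used drift statistic chosen here as an assumption; it is not necessarily what DataRobot MLOps computes, and the function name, bin edges, and sample data are hypothetical.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index of `actual` against `expected` over shared bins."""

    def shares(values):
        # Bin index = number of edges the value exceeds, giving len(bin_edges) + 1 bins.
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[sum(v > edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = list(range(100))               # feature values seen during development
production = [v + 40 for v in training]   # the same feature, shifted upward in production
edges = [25, 50, 75]                      # bin edges fixed from the training data

print("no drift:", psi(training, training, edges))
print("shifted: ", psi(training, production, edges))
```

Identical distributions give a PSI of zero, and the value grows as the production distribution moves away from the training one. Applying the same comparison to the target or to the model's predictions gives a simple view of target drift.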

DataRobot Employee

Question: In case of oversampling training, should the validation set always have the original proportions?

Answer:  I’d recommend validating on the original proportions since production data will have original proportions. However, in certain cases, I’ve had success with generating augmentations on test data as well and then averaging the predictions. E.g., if we have to predict if A is similar to B, we can also predict whether B is similar to A and average the predictions.
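
Both ideas in this answer can be sketched in a few lines of plain Python. The sketch assumes simple random oversampling of the minority class; the helper names (`stratified_split`, `oversample`, `symmetric_score`) and the toy scorer are hypothetical. The key point is the order of operations: split first, then oversample only the training part, so the validation set keeps the original class proportions.

```python
import random
from collections import Counter


def group_by_label(rows):
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups


def stratified_split(rows, valid_fraction, rng):
    """Split (id, label) rows so both parts keep the original class proportions."""
    train, valid = [], []
    for members in group_by_label(rows).values():
        members = members[:]
        rng.shuffle(members)
        n_valid = max(1, round(len(members) * valid_fraction))
        valid.extend(members[:n_valid])
        train.extend(members[n_valid:])
    return train, valid


def oversample(rows, rng):
    """Randomly duplicate rows of smaller classes until every class matches the largest."""
    groups = group_by_label(rows)
    target = max(len(m) for m in groups.values())
    out = []
    for members in groups.values():
        out.extend(members)
        out.extend(rng.choices(members, k=target - len(members)))
    return out


def symmetric_score(score, a, b):
    """Average a pairwise score over both argument orders."""
    return (score(a, b) + score(b, a)) / 2


rng = random.Random(0)
rows = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]  # 90% / 10%

train, valid = stratified_split(rows, valid_fraction=0.2, rng=rng)
train_balanced = oversample(train, rng)  # only the training part is resampled

print("validation classes:", Counter(label for _, label in valid))
print("training classes:  ", Counter(label for _, label in train_balanced))


def toy_score(a, b):
    # Deliberately order-dependent stand-in for a model's similarity prediction.
    return len(a) / (len(a) + 2 * len(b))


print(symmetric_score(toy_score, "apple", "fig"))
```

Oversampling before splitting would let duplicates of the same row land on both sides of the split, which is itself a form of leakage. The `symmetric_score` helper mirrors the "A similar to B" and "B similar to A" averaging described above, so the result no longer depends on argument order.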

Data Scientist

If you are interested in target leakage with images, take a look at our blog post on identifying leakage using computer vision on medical images.  
