Target leakage, also known as data leakage, is one of the most challenging problems when building machine learning models. Without proper checks and guardrails, you may not realize you have target leakage until you deploy a model and notice that its performance in a production environment is worse than it was during development.

During this session, we cover conceptual definitions of target leakage and the ways it can arise before model building, particularly during the data engineering and project setup phases. Then we demonstrate how DataRobot's Data Quality Assessment performs Target Leakage Detection to help ensure that projects follow data science best practices and that the resulting models are robust to real-world data. Finally, we provide a handy checklist to help you evaluate your projects for target leakage.


  • Yuriy Guts (DataRobot, Engineer)
  • Alex Shoop (DataRobot, Engineer)
  • Rajiv Shah (DataRobot, Data Scientist)
  • Jack Jablonski (DataRobot, AI Success Manager)

Question: When DataRobot computes Feature Impact, if one feature dominates the importance plot and the others are all relatively unimportant (<15%), would you consider that target leakage?


Answer:
    • Target leakage is typically defined as using a feature that is not available at the time of prediction. A dominant feature does not tell you that by itself, so you need your domain knowledge to decide: is this feature available at the time of prediction?
    • It's not unusual to have one feature that dominates the importance plot. For example, suppose I am predicting a child’s weight from their height and their music preferences. My guess is that height will be an important feature and music preferences much less so. This isn’t target leakage; height is simply a strong predictor of weight for children.
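
The reasoning in this answer can be sketched in a few lines of Python. This is only an illustration, not how DataRobot computes Feature Impact or detects leakage: the function `review_top_feature`, the importance scores, the feature names, the `dominance_ratio` cutoff, and the `known_at_prediction` set are all hypothetical and stand in for your own domain knowledge.

```python
def review_top_feature(importance, known_at_prediction, dominance_ratio=2.0):
    """Classify the most important feature (expects at least two features).

    importance: dict of feature name -> normalized importance (made-up values).
    known_at_prediction: set of features you know exist when predictions are made.
    dominance_ratio: assumed cutoff; the top feature counts as dominant if its
        importance is at least this many times the runner-up's.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    dominant = importance[top] >= dominance_ratio * importance[runner_up]

    # Availability at prediction time is the leakage test; dominance alone is not.
    if top not in known_at_prediction:
        verdict = "leakage suspect: not available at prediction time"
    elif dominant:
        verdict = "dominant but available: likely just a strong predictor"
    else:
        verdict = "no single dominant feature"
    return top, verdict


# The child-weight example from the answer, with invented scores.
scores = {"height_cm": 1.0, "age_months": 0.12, "music_preference": 0.03}
available = {"height_cm", "age_months", "music_preference"}
print(review_top_feature(scores, available))
```

The design point is that the availability check comes first: a feature that cannot be known when the prediction is made is a suspect regardless of how large or small its importance is.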

Question: In DataRobot, can you select and remove the leaky data directly from the dataset, or do you have to create a new non-leaky feature set and use that for modeling?

Answer: Yes. If you have the subject matter expertise, you can manually remove the leaky feature(s) by creating a new feature list without them, right on the Data page, before kicking off Autopilot modeling. See Feature Lists for more information.

Question: Do you have any advice on checking for performance degradation over time?

Answer: Yes, data drift detection and target drift detection can help with this. Take a look at MLOps (Machine Learning Operations) and our community walkthrough!
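
To make the idea of data drift concrete, here is a minimal sketch of one common drift statistic, the Population Stability Index (PSI), for a single numeric feature. It is an assumption chosen for illustration and is not presented as what MLOps computes; the function name, the equal-width binning, and the `eps` smoothing value are all hypothetical choices.

```python
import math


def population_stability_index(baseline, recent, n_bins=10, eps=1e-6):
    """PSI between a training-time sample and a recent production sample.

    Bins are equal-width over the baseline's range (an assumed binning scheme).
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range go into the first or last bin.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = bin_shares(baseline), bin_shares(recent)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


baseline = [i / 100 for i in range(1000)]   # synthetic training-time values
shifted = [x + 3 for x in baseline]         # synthetic drifted production values
print(population_stability_index(baseline, baseline))  # identical samples
print(population_stability_index(baseline, shifted))   # clearly larger
```

A PSI of zero means the two binned distributions match; larger values mean the recent data has moved away from the training data, which is a cue to check whether model accuracy has degraded as well.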

Question: When oversampling the training data, should the validation set always keep the original proportions?

Answer: I’d recommend validating on the original proportions, since production data will have the original proportions. However, in certain cases, I’ve had success generating augmentations of the test data as well and then averaging the predictions. For example, if we have to predict whether A is similar to B, we can also predict whether B is similar to A and average the two predictions.
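
Both suggestions can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `split_then_oversample` and `symmetric_similarity` are hypothetical helper names, the split is a simple random (not stratified) split, and oversampling is done by duplicating minority rows at random.

```python
import random


def split_then_oversample(rows, label_of, valid_fraction=0.2, seed=0):
    """Hold out validation rows first, then oversample only the training rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    valid, train = shuffled[:n_valid], shuffled[n_valid:]

    # Group the training rows by class.
    by_class = {}
    for row in train:
        by_class.setdefault(label_of(row), []).append(row)

    # Duplicate rows of smaller classes until every class matches the largest.
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced, valid  # validation keeps the original proportions


def symmetric_similarity(predict_pair, a, b):
    """Average the prediction for (A, B) with the prediction for (B, A)."""
    return (predict_pair(a, b) + predict_pair(b, a)) / 2


# Synthetic data: 10 positive rows out of 100, stored as (id, label) tuples.
rows = [(i, 1 if i < 10 else 0) for i in range(100)]
train, valid = split_then_oversample(rows, label_of=lambda r: r[1])
print(len(train), len(valid))
```

Splitting before oversampling matters for leakage too: if rows are duplicated first and split afterwards, copies of the same row can land in both partitions and inflate the validation score.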

If you are interested in target leakage with images, take a look at our blog post on identifying leakage using computer vision on medical images.  

