05 Target Leakage: how to recognize and prevent it


This is the 5th article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:

  • How to recognize Target Leakage and why it’s problematic.
  • Protips for avoiding Target Leakage.

Recognizing Target Leakage and why it’s problematic

Target Leakage means including in your dataset information that would not be known at prediction time, typically information from the future. When Target Leakage occurs, you are teaching your models with ‘contaminated’ data, which leads to overly optimistic expectations about how your model will perform in production. In other words, the performance you observe during the model building phase will not match what you see once the model is put into production, because the model learned to rely on information it will not have when it makes real predictions.
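To make the gap between observed and production performance concrete, here is a minimal toy sketch. Everything in it is hypothetical: the rows, the `account_closed_flag` column (assumed to be filled in only after a fraud event), and the trivial rule standing in for a trained model. It is an illustration of the effect, not a real modeling workflow.

```python
# Toy rows: (amount, account_closed_flag, is_fraud).
# ASSUMPTION: account_closed_flag is set only AFTER a fraud event, so it is
# present in the historical table but not yet known at prediction time.
historical = [
    (120.0, 0, 0),
    (950.0, 1, 1),
    (40.0, 0, 0),
    (15.0, 1, 1),
    (880.0, 0, 0),
    (60.0, 1, 1),
]


def leaky_model(amount, account_closed_flag):
    """Stand-in for a trained model that learned to rely on the leaked flag."""
    return account_closed_flag


def accuracy(rows, flag_known):
    """Share of rows predicted correctly.

    flag_known=True mimics the model building phase (leaked value present).
    flag_known=False mimics production, where the flag has not been set yet.
    """
    correct = 0
    for amount, flag, is_fraud in rows:
        visible_flag = flag if flag_known else 0
        correct += int(leaky_model(amount, visible_flag) == is_fraud)
    return correct / len(rows)


offline_accuracy = accuracy(historical, flag_known=True)
production_accuracy = accuracy(historical, flag_known=False)
print(offline_accuracy, production_accuracy)  # 1.0 0.5
```

The rule looks perfect on the historical data only because the flag encodes the outcome. Once the flag is unavailable, the same rule predicts "not fraud" for every row.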

The following visual example shows Target Leakage for interest rate data: the Learning Dataset includes information that is only available after prediction time.

[Image: interest rate data in which the Learning Dataset includes information available only after prediction time]

Protips to avoid Target Leakage

  • Create a prediction date feature for transactional data: if you’re using data from transactional tables, each observation must have a prediction date that serves as a cutoff date. The prediction date is a feature (column) that you create in your data; it marks the boundary in time beyond which no transaction data should be included.

  • Avoid having more than one time value in an observation (row): if a single row contains more than one time value, it is very easy to mistakenly run the prediction without accounting for both times.

  • Consider how critical data may be affected at prediction time: what if the data changed between the point in time a prediction is needed and the point in time the dataset is created (for example, today)? Suppose you are predicting whether a credit card transaction is fraudulent. When creating your Learning Dataset, keep in mind that after a fraud event the bank may automatically close an account until the card user is notified. If the transactional data you use has a “number of accounts” column that holds the number of accounts as of *today* instead of at the time of the transaction, then you have target leakage.
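The first and third tips can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the tables, the customer IDs, the `prediction_dates` mapping, and the helper names are all hypothetical, and treating the prediction date as an exclusive cutoff (strictly earlier records only) is a choice made here, not something the article prescribes.

```python
from datetime import date

# Hypothetical transactions: (customer_id, transaction_date, amount).
transactions = [
    ("c1", date(2020, 1, 5), 100.0),
    ("c1", date(2020, 2, 10), 250.0),
    ("c1", date(2020, 3, 20), 75.0),   # after c1's prediction date
    ("c2", date(2020, 1, 15), 40.0),
    ("c2", date(2020, 4, 1), 500.0),   # after c2's prediction date
]

# Hypothetical account history: (customer_id, effective_date, n_accounts).
account_history = [
    ("c1", date(2019, 6, 1), 2),
    ("c1", date(2020, 3, 1), 1),       # account closed after the prediction date
    ("c2", date(2019, 1, 1), 3),
]

# One prediction date per observation; it acts as the cutoff.
prediction_dates = {"c1": date(2020, 2, 15), "c2": date(2020, 3, 1)}


def total_spend_before(customer_id, cutoff):
    """Sum only the transactions dated strictly before the cutoff."""
    return sum(
        amount
        for cid, tx_date, amount in transactions
        if cid == customer_id and tx_date < cutoff
    )


def accounts_as_of(customer_id, cutoff):
    """Return the account count in effect before the cutoff, not today's value."""
    known = [
        (effective, n)
        for cid, effective, n in account_history
        if cid == customer_id and effective < cutoff
    ]
    if not known:
        return None
    return max(known, key=lambda record: record[0])[1]


learning_dataset = [
    {
        "customer_id": cid,
        "prediction_date": cutoff,
        "total_spend": total_spend_before(cid, cutoff),
        "n_accounts": accounts_as_of(cid, cutoff),
    }
    for cid, cutoff in prediction_dates.items()
]

for row in learning_dataset:
    print(row)
```

For customer `c1`, the row keeps the account count that was in effect at the prediction date (2) rather than the later value (1), and the March transaction is excluded from `total_spend`.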

Check out the following resources for a deeper dive on this topic:

Blog: What is Target Leakage and How Do I Avoid it?
DataRobot wiki on Target Leakage

More information for DataRobot users: search in-app Platform documentation for Data quality assessment, then locate the section "Target leakage."

Last update: 09-17-2020 12:39 PM