How to recognize Target Leakage and why it’s problematic.
Protips for avoiding Target Leakage.
Recognizing Target Leakage and why it’s problematic
Target leakage occurs when your dataset includes information that would not be known at prediction time. When Target Leakage occurs, you are training your models on ‘contaminated’ data, which leads to overly optimistic expectations about how your model will perform in production. In other words, the performance you observe during the model building phase will not match what you’ll see in production, because the model learned to rely on information that will not be available when it makes real predictions.
The following visual example illustrates Target Leakage for interest rate data: the Learning Dataset includes information that is only available after prediction time.
Protips to avoid Target Leakage
Create a prediction date feature for transactional data: if you’re using data from transactional tables, you must have a prediction date, which serves as a cutoff date, in the data. This prediction date is a feature (column) that you create in your data and it serves as a boundary in time beyond which you should not include additional transaction data.
Avoid having more than one time value in an observation (row): if a single row of data contains more than one time value, it is very easy to make a prediction without accounting for both times, and so to include values that were recorded after prediction time.
Consider how critical data may be affected at prediction time: what if the data changed between the point in time at which a prediction is needed and the point in time at which the dataset is created (for example, today)? Suppose you are predicting whether a credit card transaction is fraudulent. When creating your Learning Dataset, you need to be mindful that, after a fraud event, the bank may automatically close an account until the card user is notified. So if the transactional data you want to use has a column for “number of accounts” and it holds the number of accounts as of *today* rather than as of the time of the transaction, then you have target leakage.
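The tips above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a prescribed implementation. The record layout, the field names (`customer_id`, `transaction_date`, `amount`), the helpers `build_features` and `value_as_of`, and the sample values are all hypothetical. It also assumes that transactions dated strictly before the prediction date are known, and that the latest snapshot recorded on or before a given date is the value that was visible at that time.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical transactional rows (field names are assumptions for this sketch).
transactions = [
    {"customer_id": 1, "transaction_date": date(2023, 1, 5), "amount": 40.0},
    {"customer_id": 1, "transaction_date": date(2023, 2, 10), "amount": 25.0},
    {"customer_id": 1, "transaction_date": date(2023, 3, 20), "amount": 90.0},
    {"customer_id": 2, "transaction_date": date(2023, 1, 15), "amount": 10.0},
    {"customer_id": 2, "transaction_date": date(2023, 4, 1), "amount": 60.0},
]

# One prediction date (the cutoff) per observation.
prediction_dates = {1: date(2023, 3, 1), 2: date(2023, 2, 1)}


def build_features(transactions, prediction_dates):
    """Aggregate only the transactions dated strictly before each cutoff."""
    features = {}
    for customer_id, cutoff in prediction_dates.items():
        known = [
            t for t in transactions
            if t["customer_id"] == customer_id and t["transaction_date"] < cutoff
        ]
        features[customer_id] = {
            "prediction_date": cutoff,  # kept as an explicit column
            "txn_count": len(known),
            "txn_total": sum(t["amount"] for t in known),
        }
    return features


# Hypothetical snapshots of "number of accounts" for one customer, sorted by date.
account_history = [
    (date(2023, 1, 1), 3),
    (date(2023, 3, 5), 2),  # e.g. an account closed after a fraud event
]


def value_as_of(history, when):
    """Return the latest value recorded on or before `when`, or None if none exists."""
    dates = [d for d, _ in history]
    idx = bisect_right(dates, when)
    return history[idx - 1][1] if idx else None


features = build_features(transactions, prediction_dates)
accounts_at_prediction = value_as_of(account_history, date(2023, 3, 1))
accounts_today = account_history[-1][1]  # using this value instead would leak
```

Here customer 1's March 20 transaction and customer 2's April 1 transaction are excluded because they fall after the respective prediction dates. The account count is read as it stood at prediction time rather than from the most recent snapshot.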
Check out the following resources for a deeper dive on this topic: