We’re trying to make sure we understand how DataRobot guards against overfitting. I know the platform default is to run 5-fold cross-validation and that it also does some feature reduction.
But I’d like to hear from someone who can tell me all of the ways DataRobot makes sure models (auto TS) aren’t overfit.
We’re trying to move forward, so we’re really hoping someone can help us quickly. Appreciate the help!
For time series models, we use backtesting instead of cross-validation to validate our models while keeping our observations in chronological order. The idea is to train your model on a fixed-length period of observations, then validate your model on a time window following the training period - what we call a backtest. You then shift forward in time in your training data, re-train your model on a shifted training window of the same length, and validate the second model on the events following the second training period. And so on, repeating until you have enough backtests to be confident your model isn’t overfit.
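To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not DataRobot's implementation: the function name `backtest_windows`, its parameters, and the row counts are all made up for illustration, and it assumes rows are already sorted oldest to newest.

```python
def backtest_windows(n_rows, train_size, valid_size, n_backtests):
    """Return (train_rows, valid_rows) index ranges, oldest backtest first.

    Hypothetical helper for illustration only. Assumes rows are sorted
    chronologically, so a smaller index means an earlier observation.
    """
    needed = train_size + valid_size * n_backtests
    if needed > n_rows:
        raise ValueError("not enough rows for the requested backtests")

    windows = []
    for k in range(n_backtests):
        # The newest backtest ends at the last row; earlier backtests are
        # shifted back in time by one validation window each.
        valid_end = n_rows - (n_backtests - 1 - k) * valid_size
        valid_start = valid_end - valid_size
        train_start = valid_start - train_size  # fixed-length training window
        windows.append((range(train_start, valid_start),
                        range(valid_start, valid_end)))
    return windows


# Illustrative numbers: 100 rows, train on 50, validate on the next 10, 5 times.
windows = backtest_windows(n_rows=100, train_size=50, valid_size=10, n_backtests=5)
for train_rows, valid_rows in windows:
    print(f"train {train_rows.start}-{train_rows.stop - 1}, "
          f"validate {valid_rows.start}-{valid_rows.stop - 1}")
```

Every training window has the same length and ends exactly where its validation window begins, so each model is scored only on rows that come after everything it was trained on.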
For more detail, just search “Date/time partitioning” in the Platform Documentation.
This is great. Thanks for the reply, and now I see this in the documentation as well. Very helpful.
Unlike cross-validation, backtests allow you to select specific time periods or durations for your testing instead of random rows, creating in-sequence, instead of randomly sampled, “trials” for your data. So, instead of saying “break my data into 5 folds of 1000 random rows each,” with backtests you say “simulate training on 1000 rows, predicting on the next 10. Do that 5 times.” Backtests simulate training the model on an older period of training data, then measure performance on a newer period of validation data. After models are built, you can change the training range and sampling rate through the Leaderboard. DataRobot then retrains the models on the shifted training data.
That is correct.
Randomization works in typical modeling (when it is appropriate), because each row in the dataset is considered to be independent of the rows around it.
However, in a time-aware dataset, the order of the rows has meaning. If you randomized these rows into training and validation, you would risk training on "the future" relative to rows in the validation sample, which would introduce leakage. With backtesting, we guarantee that the sequence of the data is maintained and that the training data occurs before the validation data.
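A quick way to see the leakage point is to compare a random split with a chronological one on toy data. This is only a sketch under stated assumptions: the row index stands in for time order, and `leaks_future`, the 80/20 split, and the seed are hypothetical choices for illustration, not anything DataRobot exposes.

```python
import random


def leaks_future(train_idx, valid_idx):
    """True if any training row is later in time than the earliest validation row."""
    return max(train_idx) > min(valid_idx)


# Assumption: 100 rows sorted by time, so the index doubles as the timestamp.
rows = list(range(100))

# Random 80/20 split, as in ordinary (non-time-aware) validation.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
random_train, random_valid = shuffled[:80], shuffled[80:]

# Chronological 80/20 split: train on the oldest rows, validate on the newest.
chrono_train, chrono_valid = rows[:80], rows[80:]

print("random split trains on the future:", leaks_future(random_train, random_valid))
print("chronological split trains on the future:", leaks_future(chrono_train, chrono_valid))
```

The random split almost always puts some later rows into training than the earliest validation row, while the chronological split cannot, which is the property backtesting relies on.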