
datarobot downsampling - some questions

I'm going to be working with DataRobot's downsampling feature and have read through this post https://community.datarobot.com/t5/resources/how-does-datarobot-downsample/ta-p/805 and the documentation at https://app.datarobot.com/docs/reference/data-partitioning.html?highlight=cross%20validation#k-folds . I want to get a better understanding of how DataRobot handles downsampling of the majority class together with k-fold cross validation.

My questions:

  1. Is stratified downsampling applied only once, or on each k-fold CV run - for example, 5 times in the case of k=5?
  2. What are the differences between applying downsampling with the train-validation-holdout method versus with k-fold CV?
  3. During model tuning in Autopilot, how is downsampling applied? For example, would I get x different downsampled datasets if I do x runs of a specific model with x different (hyper-)parameters? Curious how that works.

Hope someone can get back to me - I appreciate your time.


4 Replies



DR's smart downsampling is a very powerful feature that can raise questions!
Please see some answers below.

1. Is stratified downsampling applied only once, or on each k-fold CV run - for example, 5 times in the case of k=5? In terms of the process, the full dataset is downsampled on the majority class based on the level you select. Then, and only then, are the k-folds built (if you do specify the use of k-folds). E.g., say you have a dataset with 1,000 rows (100 minority, 900 majority) - if you downsample to 33%, this will give you a dataset with 100 minority and roughly 300 majority rows.

If you then select, say, a 20% holdout with 5 folds, your holdout will have 80 rows and each fold will have 64 rows (all now maintaining a 1:3 minority-to-majority ratio).
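To make the order of operations concrete, here is a minimal sketch in plain pandas/scikit-learn that reproduces the arithmetic above: downsample the majority class first, and only then carve out the holdout and the five folds from the already-downsampled data. This is just an illustration of the process described, not DataRobot's actual implementation; the column names, random seeds, and the 33% rate are assumptions for the example.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Illustration of the process described above; not DataRobot's internal code.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "feature": rng.normal(size=1000),
        "target": [1] * 100 + [0] * 900,   # 100 minority rows, 900 majority rows
    })

    # Step 1: downsample ONLY the majority class (keep roughly 33% of it).
    majority = df[df["target"] == 0].sample(frac=0.33, random_state=0)
    minority = df[df["target"] == 1]
    downsampled = pd.concat([minority, majority])   # ~400 rows at a ~1:3 ratio

    # Step 2: only now build the 20% holdout and the 5 CV folds,
    # both stratified so every partition keeps that ~1:3 ratio.
    train_cv, holdout = train_test_split(
        downsampled, test_size=0.20, stratify=downsampled["target"], random_state=0
    )
    print("holdout rows:", len(holdout))            # ~80

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for _, val_idx in cv.split(train_cv, train_cv["target"]):
        fold = train_cv.iloc[val_idx]
        print("fold rows:", len(fold), "minority share:", round(fold["target"].mean(), 2))

Running it prints a holdout of roughly 80 rows and five validation folds of roughly 64 rows each, every one still close to a 25% minority share.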

2. What are the differences between applying downsampling with the train-validation-holdout method versus with k-fold CV? No different from how it would be without downsampling. The k-fold validation will be more robust if you have small samples post-downsampling, so just be aware of that. Naturally, model training will take longer if you select k-folds, though.

3. During model tuning in Autopilot, how is downsampling applied? For example, would I get x different downsampled datasets if I do x runs of a specific model with x different (hyper-)parameters? Curious how that works.

As described above, the same downsampled dataset (with the same folds and holdout) will be used for training all blueprints. I.e., if you change, say, the "max depth" parameter on an XGBoost blueprint and run the tuned blueprint, it will still be applied to the same downsampled dataset as the original (as will every other blueprint). Note, the only difference is that blueprints are ranked on the leaderboard by a weighted metric that accounts for the downsampling applied, e.g. Weighted LogLoss. FYI - if you search for "downsampling" in your model docs, you should come across a "Show Advanced Options link" page.
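On the weighted-metric point, the usual idea is that each retained majority row is given a weight equal to the inverse of the majority sampling rate, so metrics are computed as if the dropped rows were still present. Here is a rough sketch of that general technique using scikit-learn's log_loss; it illustrates the idea behind a Weighted LogLoss under that assumption and is not DataRobot's exact weighting scheme.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    # Sketch of the weighting idea only; DataRobot's exact weights may differ.
    rng = np.random.default_rng(1)

    # Downsampled training data: 100 minority rows (y=1) and the 300 retained
    # majority rows (y=0) that originally numbered 900.
    X = rng.normal(size=(400, 3))
    y = np.array([1] * 100 + [0] * 300)

    # Each retained majority row stands in for 900 / 300 = 3 original rows.
    weights = np.where(y == 0, 900 / 300, 1.0)

    model = LogisticRegression().fit(X, y, sample_weight=weights)
    proba = model.predict_proba(X)[:, 1]

    # "Weighted LogLoss": the same compensating weights are used when scoring,
    # so the metric reflects the original class balance rather than the
    # downsampled one.
    print(log_loss(y, proba, sample_weight=weights))

Because the same weights apply to every blueprint and every tuned run scored on that shared downsampled dataset, the leaderboard comparison stays consistent across models.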

Very helpful, thank you.
