I am working on a classification model with a 10% target rate and a relatively small modelling base (~2500 records). After Autopilot finished, the recommended model is shown as using 100% of the sample size. As I am quite new to DataRobot, I have three questions:
1) What does 100% of the sample size mean? Does it mean that the model was trained on 100% of the data and I should not upload the same set of data for prediction as this will not be indicative of how well the model is able to generalize to unseen data?
2) What kind of data do I need to load for an external test? I am unable to get the "external test" tag when I upload a dataset for prediction.
3) Under the Predict tab, am I able to upload a dataset of ~10k records (containing the ~2500 records used for Autopilot) for prediction using the recommended model in order to evaluate the model's performance?
1) What does 100% of the sample size mean? Does it mean that the model was trained on 100% of the data and I should not upload the same set of data for prediction as this will not be indicative of how well the model is able to generalize to unseen data?
Yes. Normally a model is trained on most of the data while a small subset is set aside for testing. When the model has been trained on the full 100%, no such subset remains, so any external data you use to evaluate it must be data the model has not seen.
2) What kind of data do I need to load for an external test? I am unable to get the "external test" tag when I upload a dataset for prediction.
The data must contain at least the same features that were used for training. If you would like to run an "External Test", then the target feature (column) must also be included.
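A quick way to check both requirements before uploading is to compare the file's columns against the training features. The following is only a sketch in pandas, not part of DataRobot; the function name `check_external_test` and the column names (`age`, `income`, `tenure`, `churned`) are made up for illustration.

```python
import pandas as pd

def check_external_test(df, training_features, target):
    """Return (missing training features, whether the target column is present)."""
    missing = [col for col in training_features if col not in df.columns]
    has_target = target in df.columns
    return missing, has_target

# Hypothetical external dataset: it has the target but lacks one training feature.
external = pd.DataFrame(
    {"age": [34, 51], "income": [42000, 58000], "churned": [0, 1]}
)

missing, has_target = check_external_test(
    external, training_features=["age", "income", "tenure"], target="churned"
)
print("Missing features:", missing)
print("Target column present:", has_target)
```

If the list of missing features is empty and the target column is present, the file meets the two conditions described above.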
3) Under the Predict tab, am I able to upload a dataset of ~10k records (containing the ~2500 records used for Autopilot) for prediction using the recommended model in order to evaluate the model's performance?
Yes, you can, as we don't check whether the uploaded dataset overlaps with the one used for Autopilot. However, to make sure that your model didn't just memorize the patterns, the dataset you use for prediction shouldn't contain the same 2500 rows that you used for training.
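One way to prepare such a file is to drop the training rows from the larger dataset before uploading it. This is a minimal pandas sketch that assumes both files share a unique identifier column; the name `record_id` and the tiny example frames are hypothetical stand-ins for the ~2500-row and ~10k-row datasets.

```python
import pandas as pd

# Stand-in for the ~2500 records used for Autopilot.
training = pd.DataFrame(
    {"record_id": [1, 2, 3], "x": [0.1, 0.2, 0.3], "target": [0, 1, 0]}
)

# Stand-in for the ~10k records, which include the training records.
full = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4, 5],
        "x": [0.1, 0.2, 0.3, 0.4, 0.5],
        "target": [0, 1, 0, 0, 1],
    }
)

# Keep only rows whose identifier does not appear in the training data.
unseen = full[~full["record_id"].isin(training["record_id"])].reset_index(drop=True)

print(len(full), "rows in total,", len(unseen), "rows not seen in training")
```

The `unseen` frame can then be saved with `unseen.to_csv(...)` and uploaded. If there is no identifier column, the same idea works by matching on the full set of feature columns, although exact duplicates between the two files would then also be dropped.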
Hey @data_rookie,
You can see what is meant by 100% of the sample by clicking the '+' next to the sample size.
It should show that all the data was used for training:
That blueprint was retrained on 100% of the data because it generalised/performed well with a smaller sample (64% of the data in the example Leaderboard below).
With that in mind, the 100% model would arguably also generalise well, as the smaller one did. However, any score computed on data the model was trained on would probably still be slightly inflated.
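To see why scores on training data come out inflated, here is a toy illustration in plain Python. It has nothing to do with DataRobot's blueprints: the "model" is a simple nearest-neighbour lookup and the labels are random noise, both chosen purely to make the effect obvious.

```python
import random

def nearest_label(x, train):
    """Predict the label of the training point closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(rows, train):
    """Share of rows whose predicted label matches the true label."""
    return sum(nearest_label(x, train) == y for x, y in rows) / len(rows)

rng = random.Random(0)

def make_rows(n):
    # One random feature and a random 0/1 label: there is no real pattern to learn.
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

train_rows = make_rows(200)
unseen_rows = make_rows(200)

train_acc = accuracy(train_rows, train_rows)    # scored on rows the model memorized
unseen_acc = accuracy(unseen_rows, train_rows)  # scored on rows it has never seen

print("Accuracy on training rows:", train_acc)
print("Accuracy on unseen rows:", unseen_acc)
```

Each training row is its own nearest neighbour, so the model looks perfect on the rows it was fitted on even though the labels are noise, while accuracy on the unseen rows falls back toward chance. Real models are far less extreme, but the direction of the bias is the same.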
To get predictions on your dataset you can use the option here:
If you want predictions on data that the model didn't train on, you can use the model trained on 64% of the data and then download its cross-validation and holdout predictions.
Hope that answered some of your questions.