Modelling targets and Prediction dataset

data_rookie
Blue LED

I am working on a classification model with a target rate of about 10% and a relatively small modelling base (~2500 records). After Autopilot finished, the recommended model is shown as using 100% of the sample size. As I am quite new to DataRobot, I have 3 questions:

1) What does 100% of the sample size mean? Does it mean that the model was trained on 100% of the data and I should not upload the same set of data for prediction as this will not be indicative of how well the model is able to generalize to unseen data? 

2) What kind of data do I need to load for an external test? I am unable to get the "external test" tag when I upload a set of data for prediction.

3) Under the Predict tab, am I able to upload a dataset of ~10k records (containing the ~2500 records used for Autopilot) for prediction using the recommended model in order to evaluate the model performance?

2 Replies
IraWatt
Laser

Hey @data_rookie,

You can see what is meant by 100% of the sample by clicking the '+' next to the sample size.

[Image: IraWatt_1-1639330544594.png]

It should show that all the data was used for training:

[Image: IraWatt_2-1639330565147.png]

That blueprint was retrained on 100% of the data because it generalised/performed well when trained on a smaller sample (64% of the data in the example leaderboard below).

[Image: IraWatt_0-1639330518149.png]
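
To make the percentages concrete, here is a rough sketch of the row counts behind them for a ~2500-record project. It assumes a 20% holdout and 5-fold cross-validation on the remainder, which is what would produce a 64% sample size; the thread does not state the partitioning actually used, so treat the constants as assumptions. The positive counts assume the 10% target rate is spread evenly across partitions.

```python
# Assumed partitioning (not stated in the thread): 20% holdout,
# then 5-fold cross-validation on the remaining rows.
TOTAL_ROWS = 2500
TARGET_RATE = 0.10
HOLDOUT_FRACTION = 0.20
CV_FOLDS = 5


def partition_sizes(total_rows, holdout_fraction, cv_folds):
    """Split a row count into training, validation and holdout sizes."""
    holdout = round(total_rows * holdout_fraction)
    non_holdout = total_rows - holdout
    validation = non_holdout // cv_folds  # one fold is held back for validation
    training = non_holdout - validation   # the other folds are used for training
    return {"training": training, "validation": validation, "holdout": holdout}


sizes = partition_sizes(TOTAL_ROWS, HOLDOUT_FRACTION, CV_FOLDS)
for name, rows in sizes.items():
    share = rows / TOTAL_ROWS
    positives = round(rows * TARGET_RATE)
    print(f"{name:<10} {rows:>5} rows ({share:.0%}), ~{positives} positive")
```

Under these assumptions the holdout contains only about 50 positive records, so scores on it will be fairly noisy for a dataset of this size.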

With that in mind, the model trained on 100% would arguably generalise as well as the 64% version did, though any score measured on data it has already seen would probably still be slightly inflated.
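
As a toy illustration of why scores on already-seen rows are inflated, the sketch below uses a deliberately extreme model that simply memorises its training rows (1-nearest-neighbour) on labels that are pure noise. It has nothing to do with DataRobot's blueprints, and the data is made up; it only shows that accuracy on the training rows says nothing about unseen rows.

```python
import random


def nearest_label(point, memory):
    """Predict the label of the closest memorised point (1-nearest-neighbour)."""
    closest = min(
        memory,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return closest[1]


def accuracy(rows, memory):
    """Fraction of rows whose label matches the memorising model's prediction."""
    hits = sum(nearest_label(point, memory) == label for point, label in rows)
    return hits / len(rows)


rng = random.Random(0)


def make_rows(n):
    # Labels are random, so there is no real pattern to learn.
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]


train_rows = make_rows(200)
unseen_rows = make_rows(200)

train_accuracy = accuracy(train_rows, train_rows)    # scored on rows it memorised
unseen_accuracy = accuracy(unseen_rows, train_rows)  # scored on fresh rows

print(f"accuracy on training rows: {train_accuracy:.2f}")
print(f"accuracy on unseen rows:   {unseen_accuracy:.2f}")
```

Real models memorise far less than this, so the gap is usually much smaller, but it is the same effect.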

To get predictions on your dataset, you can use the option here:

[Image: IraWatt_3-1639331385692.png]

If you want predictions on data that the model didn't train on, you can use the model trained on 64% of the data and download its cross-validation and holdout predictions.

Hope that answered some of your questions.

dalilaB
Data Scientist

1) What does 100% of the sample size mean? Does it mean that the model was trained on 100% of the data and I should not upload the same set of data for prediction as this will not be indicative of how well the model is able to generalize to unseen data? 

Yes. Models are usually trained on most of the data, with a small subset put aside for testing. If the model has been trained on the full 100%, any data you use to test it is assumed to be external data that the model has not seen.

2) What kind of data do i need to load in for an external test? I am unable to get the "external test" tag when i upload a set of data for prediction.

The data must have at least the same features that were used for training. If you would like to perform an "External Test", then the target feature (column) also has to be included.
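
Before uploading, it can help to check the file header against those two requirements. The sketch below is plain Python and not part of any DataRobot API; the feature names, the target name `churned` and the helper `check_external_test_columns` are hypothetical.

```python
def check_external_test_columns(columns, training_features, target):
    """Return (missing_features, has_target) for a candidate dataset header."""
    present = set(columns)
    missing = [feature for feature in training_features if feature not in present]
    return missing, target in present


# Hypothetical column names, for illustration only.
training_features = ["age", "tenure_months", "plan_type"]
target = "churned"

prediction_only = ["customer_id", "age", "tenure_months", "plan_type"]
external_test = prediction_only + ["churned"]

for label, header in [("prediction_only", prediction_only),
                      ("external_test", external_test)]:
    missing, has_target = check_external_test_columns(header, training_features, target)
    usable_for_predictions = not missing
    usable_as_external_test = usable_for_predictions and has_target
    print(f"{label}: predictions={usable_for_predictions}, "
          f"external test={usable_as_external_test}")
```

A file without the target column can still be scored, but there is nothing to compare the predictions against, which would explain the missing "external test" option.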

3) Under the predict tab, am i able to upload a dataset of ~10k records (containing the 2500~ records used for autopilot) for prediction using the recommended model in order to evaluate the model performance?

Yes, you can, as we don't check whether the uploaded dataset is the same as the one used for Autopilot. However, to make sure that your model didn't just memorize the patterns, the dataset you use for prediction shouldn't contain the same 2500 rows that you used for training.
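
One way to do that is to drop the training rows from the larger file before uploading it. The sketch below assumes every record carries a unique identifier; the column name `customer_id` and the helper `drop_training_rows` are hypothetical, and rows are shown as plain dictionaries (for example as read with `csv.DictReader`).

```python
def drop_training_rows(scoring_rows, training_rows, key):
    """Keep only the scoring rows whose key does not appear in the training rows."""
    seen = {row[key] for row in training_rows}
    return [row for row in scoring_rows if row[key] not in seen]


# Tiny made-up example: rows 1 and 2 were used for training.
training_rows = [
    {"customer_id": "1", "churned": "0"},
    {"customer_id": "2", "churned": "1"},
]
scoring_rows = [
    {"customer_id": "1", "churned": "0"},
    {"customer_id": "2", "churned": "1"},
    {"customer_id": "3", "churned": "0"},
    {"customer_id": "4", "churned": "1"},
]

unseen_rows = drop_training_rows(scoring_rows, training_rows, key="customer_id")
print([row["customer_id"] for row in unseen_rows])
```

If there is no reliable identifier, the same idea can be applied to a tuple of all feature values, at the cost of also dropping genuine duplicates.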
