We know DataRobot imputes missing values when it builds models. After model building, if I deploy the model as an API and use it to predict on a dataset that has missing values in the same column, will DataRobot impute them, or do I have to impute them manually?
The reason I ask is that when I use the "Predict" option under "Batch predictions", the predicted values are different from the predicted values I get from the API.
You do not need to impute missing values in the test set. How DataRobot will handle missing values in the test set depends on how they were dealt with in the training set.
DataRobot handles missing values in a number of ways in the training set:
For linear models DataRobot will impute missing values with the median.
For tree-based models, DataRobot will impute with an arbitrary value (e.g. -9999) rather than the median, if the feature is missing 10% or more of its values. This can be adjusted in advanced tuning after a model is run. If the feature is missing fewer than 10% of its values, median imputation will be used instead.
For categorical variables in all models, DataRobot will create a new category for the missing values.
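As a rough illustration of the rules above (a sketch only, not DataRobot's actual code), the fill value is learned from the training column and then reused on any data scored later. The function names, the `"linear"`/`"tree"` labels, and the `"MISSING"` category label are hypothetical; the 10% threshold and the -9999 value come from the description above.

```python
from statistics import median

SENTINEL = -9999  # arbitrary value mentioned above; illustrative only


def fit_numeric_fill(train_values, model_family, threshold=0.10):
    """Pick the fill value for a numeric column from the TRAINING data.

    Assumes the column has at least one non-missing training value.
    """
    present = [v for v in train_values if v is not None]
    missing_share = 1 - len(present) / len(train_values)
    if model_family == "tree" and missing_share >= threshold:
        return SENTINEL          # tree-based model, many missing values
    return median(present)       # linear model, or few missing values


def apply_fill(values, fill):
    """Replace missing entries (None) with the fill learned at training time."""
    return [fill if v is None else v for v in values]


def fill_categorical(values, missing_label="MISSING"):
    """Treat missing categorical values as their own category."""
    return [missing_label if v is None else v for v in values]


train_age = [20, None, 40, 30, None]   # 40% missing
score_age = [None, 55]                 # new data sent for prediction

linear_fill = fit_numeric_fill(train_age, "linear")
tree_fill = fit_numeric_fill(train_age, "tree")

print(apply_fill(score_age, linear_fill))
print(apply_fill(score_age, tree_fill))
print(fill_categorical(["red", None, "blue"]))
```

The point of the sketch is that the scoring data never needs to be imputed by hand: whatever was decided on the training column is applied again at prediction time.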
You can find what DataRobot did for each model under the "Missing Values" tab (see below).
I hope this helps. Thanks for reaching out on Community!
Thanks for your response. I have deployed the model as an API and am calling it through Alteryx. I am passing the training dataset back to the model to see how well it predicts on the training data. The predicted values I get are different from the values I get under "Batch prediction" on the "Predict" tab of the deployed model.
When I use "Predict", I get one set of predicted values.
After deploying the above model and calling the API from Alteryx, I sent the training data back to it to see its performance. The predicted values I get from the API don't match the ones from "Predict".
The two sets of predicted values don't match each other even though it is the same dataset.
You have found one of the guardrails in DataRobot. When you make predictions on the training data in the GUI, DataRobot uses a process called "stacked predictions". It builds multiple models on different subsets of the data, and the prediction for each row is made by a model that did not see that row during training. This way, each prediction is "out-of-sample" when you run the training predictions on the Predict tab, which prevents your model from looking more accurate than it is when you download the training predictions in the app.
When you send the data to the deployment, DataRobot doesn't know this is the training dataset and will score the data as normal.
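To see why the two sets of numbers differ, here is a toy sketch of the idea, not DataRobot's implementation. It uses a 1-nearest-neighbour "model" that simply memorises its training rows, with a hypothetical round-robin fold assignment. Scoring the training rows with a model that has already seen them (as a deployment would) returns the memorised targets, while out-of-fold scoring (as described for the Predict tab) does not.

```python
def fit_1nn(rows):
    """'Train' a 1-nearest-neighbour model: it just memorises (x, y) pairs."""
    memory = list(rows)

    def predict(x):
        # Return the target of the closest memorised x.
        nearest = min(memory, key=lambda row: abs(row[0] - x))
        return nearest[1]

    return predict


def in_sample_predictions(rows):
    """Score the training rows with a model that has seen all of them."""
    model = fit_1nn(rows)
    return [model(x) for x, _ in rows]


def stacked_predictions(rows, n_folds=3):
    """Score each row with a model trained WITHOUT that row's fold."""
    preds = [None] * len(rows)
    for fold in range(n_folds):
        # Hypothetical fold assignment: row i belongs to fold i % n_folds.
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        model = fit_1nn(train)
        for i, (x, _) in enumerate(rows):
            if i % n_folds == fold:
                preds[i] = model(x)
    return preds


training_rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)]

deployment_style = in_sample_predictions(training_rows)
predict_tab_style = stacked_predictions(training_rows)

print("scored as normal:   ", deployment_style)
print("stacked predictions:", predict_tab_style)
```

The in-sample predictions reproduce the targets exactly, which is the over-optimistic picture the guardrail is meant to avoid. To compare the API and the Predict tab like for like, score data the model has not been trained on, such as a holdout or a new dataset.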
You can find some documentation on stacked predictions here.
I hope this helps, please feel free to reach out with more questions.