Model Deployment

Hi 

 

I'm trying to deploy a model that I've chosen from the model registry in DataRobot, and I'm encountering a few issues that I've tried troubleshooting with the DataRobot docs as well as the resources available in the community. Some of the issues I'm having are:

 

Association ID:

  • When uploading a dataset at the start of the project, do you need to include a column for your association ID and then just deselect it as a feature in your feature list? I'm assuming this is true because you need to link the actuals with the predicted values, but I could be mistaken.
  • I'm running a time series model, and I saw in the docs that it's advised to create a separate column rather than using your date column, since the values in the date column may not be unique. But what if I had a d/m/y format, so the values are unique because the year is included?

 

Range in predictions tab:

  • Is this range just referring to the period from the day you deployed your model to the current day? And is there a reason I'm unable to change the intervals to monthly? (Is it because the model has only been in "deployment" for a few days?)

 

Making predictions in deployment tab:

  • Under the predictions tab in the deployment section, is the prediction dataset the one that was computed and downloaded under the 'predict' tab in the models section of the environment? (the output)
  • Or is this the original dataset that needs to be uploaded in order for the chosen model to make predictions?
  • Or is this the same dataset as the one you needed to upload to make predictions under the 'predict' tab - the dataset with rows left open for the target variable corresponding to future dates? Writing it out, this makes the most sense; however, I could be mistaken.
  • Without this dataset, I'm assuming model deployment won't run since there is no data to track the drift and health?

 

Sorry for the long post. Thanks for your time 🙂

1 Solution

Accepted Solutions

Hey @Shai, I think I can answer these for you. 

 

Association ID: There is no practical need to upload the association ID with your training data. The primary purpose of the association ID is to track model performance over time on new data. Out in production, you will feed your model new rows to predict every so often, but in practice you won't have the actual results yet. Over time, you'll observe actual results, and you'll want to map them back onto the predictions you made before. Because the training data already has the outcome/target column, there's no need to join actuals on later.
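
For concreteness, here's a minimal sketch of what a scoring file with an association ID might look like. The `order_id` column and feature names here are made up for illustration; substitute whatever key uniquely identifies your rows:

```python
import pandas as pd

# Hypothetical scoring data: "order_id" is the association ID.
# It travels with the rows but is not used as a model feature;
# DataRobot uses it later to join uploaded actuals back onto
# the predictions it made for these rows.
scoring = pd.DataFrame(
    {
        "order_id": ["A-1001", "A-1002", "A-1003"],  # association ID values
        "feature_1": [0.4, 1.7, 0.9],
        "feature_2": ["red", "blue", "red"],
        # no target column here -- these are the rows to be predicted
    }
)
scoring.to_csv("to_score.csv", index=False)
```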

 

As for the time series example, I'd be worried about a situation where you make multiple predictions each day, possibly one for series A and one for series B, or maybe you need to make predictions every few hours (for example). If you're confident that the date is a unique identifier, then I don't think it's a problem to use it. 
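
If you want to verify that uniqueness up front, a quick pandas check works. The file and column names below are assumptions; swap in your own:

```python
import pandas as pd

# Assumed file and column names -- substitute your own.
df = pd.read_csv("training_data.csv", parse_dates=["date"])

# If every date appears exactly once, it could double as the association ID;
# otherwise, build a separate key (e.g., date + series identifier).
if df["date"].is_unique:
    print("date column is unique -- usable as an association ID")
else:
    n_dupes = df["date"].duplicated().sum()
    print(f"{n_dupes} duplicated date values -- create a separate ID column")
```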

 

Range in predictions tab: You are correct. You can't view a granularity (like month) if you don't yet have a month of observations. You can adjust the slider to view different time periods, anything between the deployment date and the current date.

 

Making predictions in deployment tab: When you navigate to the Predictions --> Make Predictions subtab of a deployment, you can drop in a prediction dataset (e.g., a CSV) right there to be scored. This may be different from the one you dropped in on the Models tab (i.e., the model Leaderboard), and typically it will be. It sounds like your question might be specific to time series, in which case your guess is correct: input a scoring dataset with enough filled-in rows of data for the time series model to "look back" and the number of empty rows (with dates) you want to predict forward.
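
Outside the UI, you can send the same scoring file to a deployment programmatically. Here's a rough sketch against DataRobot's prediction API; treat the URL, headers, and IDs as placeholders to verify against the integration snippet shown on your own deployment's Predictions tab:

```python
import requests

# Placeholders -- copy the real values from your deployment's
# prediction API integration snippet.
API_TOKEN = "YOUR_API_TOKEN"
DATAROBOT_KEY = "YOUR_DATAROBOT_KEY"  # required on managed AI Cloud
PREDICTION_SERVER = "https://example.orm.datarobot.com"
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"

url = f"{PREDICTION_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions"

# Stream the CSV of rows to score (same shape as the file you'd
# drop into the Make Predictions subtab).
with open("to_score.csv", "rb") as f:
    resp = requests.post(
        url,
        data=f,
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Authorization": f"Bearer {API_TOKEN}",
            "DataRobot-Key": DATAROBOT_KEY,
        },
    )
resp.raise_for_status()
print(resp.json())  # one prediction per input row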

 

Generally, the deployment itself doesn't need new scoring data to "run." The deployment is active, waiting for that new scoring data. But yes, to monitor drift and health, you would need to input scoring data to be predicted. To monitor accuracy, you would have to separately submit actuals to the deployment. That will require an association ID field with a column name and values that match those in the scoring datasets that were already predicted.
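
As a sketch of that last step, the `datarobot` Python client exposes a `submit_actuals` method on deployments. The credentials, IDs, and values below are placeholders; the association_id values must match the ones sent with the scoring rows:

```python
import datarobot as dr
import pandas as pd

# Placeholder credentials and deployment ID -- substitute your own.
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
deployment = dr.Deployment.get("YOUR_DEPLOYMENT_ID")

# Each actual is keyed by the association ID sent with the
# corresponding scoring row that was already predicted.
actuals = pd.DataFrame(
    {
        "association_id": ["A-1001", "A-1002", "A-1003"],
        "actual_value": [12.0, 7.5, 9.1],
    }
)
deployment.submit_actuals(actuals)
```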

 

Let me know if there's anything I can clarify in there!


4 Replies


@matthias_kullowatz, thank you so much for this; I really appreciate it.

Everything is clear to me from your response, but if I get stuck on anything I'll just post another question 🙂

Again, thanks for the reply. It has cleared up the confusion I had.

Shai, here's an article on a few ways to upload actuals once you have them, as well as some considerations when choosing an association ID value: https://community.datarobot.com/t5/resources/measuring-prediction-accuracy-uploading-actual-results/...

Thanks so much! I will check it out
