I have a binary classification (0/1) Gradient Boosting model deployed in DataRobot. It runs predictions every hour on accounts created in the last two hours. So as you can see, the number of predictions made can vary, since the number of accounts created depends on the time of day.
Next, in order to track accuracy, I am uploading the actual data with the true target labels. But interestingly, the actual data has to be snapshotted to make this work. To my knowledge, snapshotting means a picture of the data is taken and stored; it is no longer live. This implies that the actual data will remain stale while the prediction data (from the model) keeps changing, which means the customer IDs in the two datasets may not be the same. In that case, I am not sure how accuracy can be tracked. Can I kindly get some help here? Thanks.
I have an article on uploading actuals here: Measuring Prediction Accuracy: Uploading Actual Results
I think the problem here is in the association id choice, something I touch on in the article. The association id is for a prediction; it is not simply your entity/object/surrogate key, unless only one prediction is ever made with that key. If my model runs every Monday morning and predicts whether a customer might churn that week, then the true association id is not simply CUSTOMER_ID, because that predicted value can change every Monday. My actual association id, which uniquely identifies a prediction record, is a concatenation of CUSTOMER_ID + Monday date. That gives me a unique prediction now and, a week later, a unique actual to tie back to it.
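To make the concatenation concrete, here is a minimal sketch of building such a composite association id. The function name and id format are illustrative, not a DataRobot API; any string that is unique per prediction record works.

```python
from datetime import date

def make_association_id(customer_id: str, run_date: date) -> str:
    """Combine the entity key with the prediction run date so each
    weekly prediction record gets its own unique association id."""
    return f"{customer_id}_{run_date.isoformat()}"

# Two Mondays produce two distinct prediction records for the same customer:
week1 = make_association_id("CUST_001", date(2023, 1, 2))  # "CUST_001_2023-01-02"
week2 = make_association_id("CUST_001", date(2023, 1, 9))  # "CUST_001_2023-01-09"
```

A week later, the actuals file uses the same `customer_id` and the same Monday date to regenerate the identical id, so each actual ties back to exactly one prediction.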
We can't reset the statistics once you upload an actual value. If the state changes from hour to hour, you may consider waiting, if that makes sense for your application. If you are looking to see how the actual value changes over time for a given customer, that is unsupported in DataRobot, but you could capture that data and plot it at the customer level on your end. I hope that helps.
Appreciate the response, Matt.
But what I am looking for is: can the actual data be dynamic, so that every hour when the job runs we get refreshed actual data? I have defined identical filters on the actual and prediction data, so that the same customer IDs are present in both.
For example, if the job runs at 1pm PST and scores customer_id1 at that time, then the filter is defined such that the actual data will also have the label for customer_id1 at that time. Therefore, I cannot see why it is not appropriate to join them on customer ID (i.e. the association id) in this case. In other words, I am looking for an accuracy calculation at a given point in time.
Hope my question/request makes sense.
If your actuals file has the association id to match to a prediction, then it doesn't matter whether the actuals data you have is a historical snapshot. If you upload another actual result with that same association id, we keep it, but the accuracy calculations will have used the first value (we'll be releasing a "refresh" option in an upcoming release).
Are you using the customer ID as the association id? That may not work well for you if you make multiple predictions with the same customer ID that reflect different states of the customer and prediction input data. Using a new, unique id just to match the prediction to the actual is best.