To keep a history of all BatchPredictions (scheduled jobs), we need metadata. Let's say I have a table of Metadata and a table of Prediction. How can I merge these two tables and get the related prediction_id from the metadata?
Is there one metadata record for each batch prediction? Does each batch prediction have a unique ID? For instance, batch prediction 1 would have batch_id 1 and batch prediction 2 would have batch_id 2. I'm assuming the metadata also has a field called batch_id that refers to the same batch_id in each batch. If so, you can just perform an inner join on the key batch_id.
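As a minimal sketch of the inner join described above, assuming both tables can be loaded as pandas DataFrames and share a batch_id column (the table contents and the other column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical metadata table: one row per batch prediction job.
metadata = pd.DataFrame({
    "batch_id": [1, 2],
    "scheduled_at": ["2022-01-01", "2022-01-02"],
})

# Hypothetical prediction output: several rows per batch.
predictions = pd.DataFrame({
    "batch_id": [1, 1, 2],
    "row_id": [0, 1, 0],
    "prediction": [0.91, 0.12, 0.55],
})

# Inner join keeps only rows whose batch_id appears in both tables.
# validate="many_to_one" raises if metadata has duplicate batch_id values.
merged = predictions.merge(
    metadata, on="batch_id", how="inner", validate="many_to_one"
)
print(merged)
```

Each prediction row then carries the metadata of the batch it came from.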
Where are the batches stored? Have you appended them to each other? If so, then you will just need an inner join; otherwise, you need to append them first and then perform an inner join.
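If the batches are still separate, the append-then-join step might look like the sketch below. It again assumes pandas and a shared batch_id column; the batch contents and the job_status column are hypothetical.

```python
import pandas as pd

# Hypothetical per-batch outputs that have not been combined yet.
batch_1 = pd.DataFrame({"batch_id": [1, 1], "prediction": [0.91, 0.12]})
batch_2 = pd.DataFrame({"batch_id": [2], "prediction": [0.55]})

# Step 1: append the batches into a single predictions table.
all_batches = pd.concat([batch_1, batch_2], ignore_index=True)

# Hypothetical metadata table with one row per batch.
metadata = pd.DataFrame({
    "batch_id": [1, 2],
    "job_status": ["COMPLETED", "COMPLETED"],
})

# Step 2: inner join the combined predictions with the metadata.
joined = all_batches.merge(metadata, on="batch_id", how="inner")
print(joined)
```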
If your batches and metadata reside in the AI Catalog, you can use Workspace to set up a pipeline with Spark SQL.
I hope this answers your question.
Thanks, Dalila, for your reply.
Yes, each batch prediction has a batchPredictions_id, but there is no batchPredictions_id in the prediction_output.
As you can see, we have metadata with BatchPrediction_id and, in a separate table, predictions_output (Binary Classification Models), but there is no unique ID such as BatchPrediction_id in the predictions output. The blue field comes from the input dataset; the rest is created automatically by DataRobot.
My question is how to join Predictions output with the related metadata?
I see that in Predictions_MetaData you have output_dataStoreId. One possible solution is to add this field (output_dataStoreId) to the Binary Classification Models table alongside the blue columns.
If the blue field comes from the input dataset and the rest is created automatically by DataRobot, then add that dataset ID to your prediction output.
Yes, that's a good idea, but output_dataStoreId is created at the same time as the predictions, so we need DataRobot to add this unique key to both (predictions and metadata) in order to establish a connection between them.
How would we find the output_dataStoreId for a specific prediction if we want to add it after both datasets have been generated?
Each batch prediction has a unique ID field, prediction_id, so if this field exists in the predictions CSV, then the metadata and predictions can be joined.
From the sheet, I can see the Input_dataStoreId but not the Output_dataStoreId. Can you please post a screenshot of your sheet again with output_dataStoreId included? Thanks.
I checked with the MLOps team. Here is their answer: "It seems like you want to include the batch prediction job ID itself in the output. That is not something that is supported."
That means getting BatchPrediction_id into the prediction results is not possible through the UI. My question is: if we use Zepl and create BatchPredictions with Python code, is it possible to add BatchPrediction_id to the prediction output?
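One possible workaround, sketched here under assumptions rather than confirmed by DataRobot: a script that starts the batch job already knows the job ID, so it could stamp that ID onto the downloaded output itself. The sketch below assumes the output is available as CSV text and that the job ID has been obtained as a string; tag_predictions, the sample CSV, and the ID value are all hypothetical, and no DataRobot API call is shown.

```python
import io

import pandas as pd


def tag_predictions(csv_text: str, batch_prediction_id: str) -> pd.DataFrame:
    """Load prediction output and add the job ID as a constant column."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Every row from this job gets the same ID, which makes it joinable
    # with a metadata table keyed on BatchPrediction_id.
    df["BatchPrediction_id"] = batch_prediction_id
    return df


# Hypothetical stand-ins for the downloaded output and the job's ID.
raw_output = "row_id,prediction\n0,0.91\n1,0.12\n"
job_id = "hypothetical-job-id-123"

tagged = tag_predictions(raw_output, job_id)
print(tagged)
```

The tagged table can then be joined to the metadata on BatchPrediction_id with an ordinary inner join.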