I am exploring an end-to-end solution for data pre-processing and ML modelling in a production scenario.
In a production environment, my system will generate an XML event that cannot be posted directly to the prediction API of a hosted ML model. The event requires data pre-processing at runtime before the prediction API is called.
How can Paxata help with pre-processing transactional events in near real time before calling the prediction API?
Are there any reference architecture documents in which Paxata performs data pre-processing in near real time for an incoming XML event?
Hi @BJ ,
Thank you for your interesting question. Yes, we can use Paxata to implement near-real-time processing using our orchestration capabilities.
There are a few ways you can implement this: 1) using the Paxata UI, or 2) using the Paxata REST API.
Let's detail the solution using the Paxata UI (the same can be implemented using the Paxata REST API if needed).
You could also perform all of these steps using our REST API from Java, Python, or any programming environment that supports REST. The only difference is that your Java or Python code runs outside Paxata.
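For readers taking the "code outside Paxata" route, here is a minimal Python sketch of the kind of pre-processing involved: it flattens an XML event into a flat record and builds (without sending) a JSON POST request for a prediction endpoint. This is only an illustration under assumptions: the sample event, the `flatten_xml` and `build_prediction_request` helpers, the URL, and the bearer-token header are all hypothetical and do not represent the Paxata or DataRobot APIs.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def flatten_xml(xml_text: str) -> dict:
    """Flatten an XML event into one flat record keyed by element path."""
    root = ET.fromstring(xml_text)
    record = {}

    def walk(node, prefix):
        path = f"{prefix}.{node.tag}" if prefix else node.tag
        # Attributes are stored under "path@attribute" keys.
        for name, value in node.attrib.items():
            record[f"{path}@{name}"] = value
        children = list(node)
        if not children:
            # Leaf element: keep its text as the value.
            record[path] = (node.text or "").strip()
        for child in children:
            walk(child, path)

    walk(root, "")
    return record


def build_prediction_request(record: dict, url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying one record.

    The URL and the bearer-token header are placeholders (assumptions).
    """
    body = json.dumps([record]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical event, for illustration only.
    event = '<order id="42"><customer><age>37</age></customer><amount>19.90</amount></order>'
    print(flatten_xml(event))
```

Note that this simple flattening overwrites repeated sibling elements that share a tag, so events with repeating groups would need an indexing scheme on top of it.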
Please let me know if you need more details.
With Best Regards
Sudheer Kumar
Thanks @Anonymous for the solution approach.
A few more points on it:
Our source system, which generates the XML files, is on premises.
We move the XMLs to Azure Storage in near real time.
The Paxata data prep and the DR Prediction API are called on Azure.
The target variable is then persisted back to Azure Data Lake.
The target variable is updated back in the source system on premises.
Does the above process add complexity to your proposed solution, and how would the solution cover these points as well?
Hi @BJ ,
Thank you so much for the additional details and your kind appreciation of my previous reply!
In fact, the additional details make it simpler. Since you are writing the XMLs to Azure storage (Azure Blob Storage or ADLS Gen? - we have connectors for both), the integration becomes a lot simpler.
Here is the updated flow:
The Paxata data prep and DR Prediction API is called on Azure:
The only thing we need to keep in mind is that Paxata Automation (APF) is not yet an event-driven system; triggering is time-based. Also, you need a procedure to remove the processed XML files from Azure storage, since the connector won't delete the imported files.
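As a rough illustration of such a clean-up procedure, the Python sketch below moves processed files out of an inbox folder into an archive folder so that the next scheduled run does not import them again. It uses the local filesystem purely as a stand-in for the Azure container (an assumption); the `archive_processed` helper, the folder names, and the `*.xml` pattern are hypothetical, and a real implementation would use the storage service's own client instead.

```python
import shutil
import tempfile
from pathlib import Path


def archive_processed(inbox: Path, archive: Path, pattern: str = "*.xml") -> list:
    """Move files matching `pattern` from `inbox` to `archive`.

    Returns the names of the files that were moved, in sorted order.
    """
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(inbox.glob(pattern)):
        if path.is_file():
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return moved


if __name__ == "__main__":
    # Local temp folders stand in for the storage container (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        inbox = Path(tmp) / "inbox"
        inbox.mkdir()
        (inbox / "event-001.xml").write_text("<event/>")
        print(archive_processed(inbox, Path(tmp) / "processed"))
```

Moving rather than deleting keeps the raw events available for reprocessing or audit; whether that is desirable depends on your retention requirements.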
I think this provides a high-level, end-to-end flow that can be implemented with relative ease. Please let me know if you need more details or run into any issues. Happy to help and collaborate.
With Best Regards
Sudheer Kumar
Thanks Sudheer.
Can I also call the Predict tool in Paxata for custom ML models deployed on DataRobot?
If not, what are the alternative options?
The goal is end-to-end orchestration, from the generation of the XML event in the source system to updating the source system with the prediction value.
Hi @BJ ,
Good question. I will check with a couple of my colleagues and get back to you. In theory, the Predict step in Paxata should be able to invoke custom ML models deployed on DataRobot (MLOps), since it goes through the same REST interface.
@shyam , please share your thoughts.
With Best Regards
Sudheer Kumar
Hi @BJ ,
Yes, you should be able to use the Predict step in Paxata to invoke custom ML models deployed on DataRobot (MLOps), since it goes through the same REST interface. I would highly recommend thorough testing when a custom model is deployed.
I hope this helps.
With Best Regards
Sudheer Kumar
Hi @Anonymous ,
I have a further question on this scenario.
We have raw XML/JSON events posted to an API on Azure. The events are posted to the API by the production source system in near real time. We can store these events in near real time on Azure Service Bus using Azure Functions.
Can Paxata read these events from Azure Service Bus and pre-process them before prediction? Can Paxata trigger the process when an event lands on Azure Service Bus?
Once the event is pre-processed in Paxata, can we then call the Predict function for the ML model for scoring?
Once predicted, can we store the target outcome and all the API metadata in Azure Storage?
We are looking at Paxata as part of the solution for building an end-to-end, real-time prediction pipeline.
Thanks,
Bhanu
Hi @BJ ,
Paxata supports neither streaming data sources nor event-based triggering. That said, the solution can be accomplished if micro-batching is acceptable.
1) Micro-batch the messages on Azure Service Bus so that they are delivered once every minute.
2) Write each minute's micro-batched messages to ADLS Gen2 or Azure Blob Storage. Name the files so that they are easy to glob with a pattern.
3) Schedule the Paxata job to run every minute; it picks up the micro-batched messages, runs them through the pre-processing, and then calls the Predict step for scoring.
4) Perform any additional data prep steps and then publish the data to your desired Azure storage (ADLS Gen2 or Azure Blob Storage).
5) I would suggest micro-batching for improved processing efficiency.
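Steps 1 and 2 can be sketched in Python as follows: messages are grouped into one batch per UTC minute, and each batch gets a timestamped filename that a glob pattern can match. This is a minimal sketch under assumptions, not a Paxata or Azure API: the `batch_filename` and `micro_batch` helpers, the `events_` prefix, and the `.jsonl` extension are hypothetical, and reading from Service Bus and writing to storage are left out.

```python
from collections import defaultdict
from datetime import datetime, timezone
from fnmatch import fnmatch


def batch_filename(ts: datetime, prefix: str = "events") -> str:
    """Name the batch file after the UTC minute the message falls in."""
    minute = ts.astimezone(timezone.utc).strftime("%Y%m%dT%H%M")
    return f"{prefix}_{minute}.jsonl"


def micro_batch(messages) -> dict:
    """Group (timestamp, payload) pairs into one batch per UTC minute.

    Timestamps are expected to be timezone-aware datetimes.
    """
    batches = defaultdict(list)
    for ts, payload in messages:
        batches[batch_filename(ts)].append(payload)
    return dict(batches)


if __name__ == "__main__":
    # Hypothetical messages, for illustration only.
    msgs = [
        (datetime(2021, 3, 1, 12, 0, 5, tzinfo=timezone.utc), "<event id='1'/>"),
        (datetime(2021, 3, 1, 12, 0, 50, tzinfo=timezone.utc), "<event id='2'/>"),
        (datetime(2021, 3, 1, 12, 1, 2, tzinfo=timezone.utc), "<event id='3'/>"),
    ]
    for name, payloads in micro_batch(msgs).items():
        # A scheduled job could select these files with a glob pattern.
        print(name, len(payloads), fnmatch(name, "events_20210301T12*.jsonl"))
```

Putting the minute in the filename means a scheduled job can select exactly the batches it has not yet processed by adjusting the glob pattern.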
Please let me know if you need more details.
With Best Regards
Sudheer Kumar