Paxata for pre-processing an XML event before prediction API

BJ

I am exploring an end-to-end solution for data pre-processing and ML modelling in a production scenario.

In a production environment, my system will generate an XML event which cannot be directly posted to the Prediction API of any hosted ML model. The event requires data pre-processing at runtime before calling the Prediction API.

How can Paxata help with pre-processing transactional events in near real time before calling the Prediction API?

Are there any reference architecture documents where Paxata performs data pre-processing in near real time on an incoming XML event?

8 Replies
Anonymous

Hi @BJ ,

Thank you for your interesting question. Yes, we can use Paxata to implement near-real-time processing using our orchestration capabilities.

There are a few ways you can implement this: 1) using the Paxata UI; 2) using the Paxata REST API.

Let's detail the solution using the Paxata UI (the same can be implemented using the Paxata REST API if needed):

  • The XML events can be batched and written to a file (or files) on the server for a defined micro-batch duration.
  • We need to get the micro-batch file of XML events to the Paxata server. The easiest approach is to network-mount the folder and use our Network File Share connector to read the files (remember to glob the files if there are multiple).
  • Our NFS connector parses the XML and imports it into the Library, where it can be fed to a project that performs the required pre-processing. In the same project you could submit these processed events to DataRobot-deployed models to generate the prediction score (the same as calling the Prediction API).
  • Alternatively, you could wrap a REST endpoint that returns the batched XML events and use the Paxata REST API connector (a similar approach works with the SFTP connector as well).
  • This entire flow can be automated using Paxata APF.
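To make the "parse and flatten the XML" step above concrete, here is a minimal sketch in plain Python. The event shape and field names are hypothetical (your real schema will differ); it simply turns the leaf elements of an XML event into the flat record a prediction API typically expects:

```python
import xml.etree.ElementTree as ET

def flatten_event(xml_text: str) -> dict:
    """Flatten an XML event into a flat dict of feature name -> value."""
    root = ET.fromstring(xml_text)
    record = {}
    for elem in root.iter():
        # Keep only leaf elements that carry a text value.
        if len(elem) == 0 and elem.text and elem.text.strip():
            record[elem.tag] = elem.text.strip()
    return record

# Hypothetical event shape -- adjust tags to your real schema.
event = "<event><amount>120.50</amount><channel>web</channel></event>"
print(flatten_event(event))  # {'amount': '120.50', 'channel': 'web'}
```

Nested structures, attributes, and repeated elements would need extra handling, which is exactly the kind of shaping a data-prep project takes care of.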

You could also perform all of these steps using our REST API from Java, Python, or any programming environment that supports REST. The only difference is that your Java or Python code runs outside Paxata.

Please let me know if you need more details.

With Best Regards

Sudheer Kumar

Thanks @Anonymous for the solution approach.

A few more points on it:

  • Our source system generating XML files is on premise.
  • We move the XMLs to Azure Storage in near real time.
  • Paxata data prep and the DR Prediction API are called on Azure.
  • The target variable is persisted back to Azure Data Lake.
  • The target variable is updated back to the source system on premise.

Does this process add complexity to your proposed solution, and how will it achieve the points above as well?

 

Anonymous

Hi @BJ ,

Thank you so much for the additional details and your kind appreciation of my previous reply!

In fact, the additional details make it simpler. Since you are writing the XMLs to Azure storage (Azure Blob Storage or ADLS Gen? - we have connectors for both), the integration becomes a lot simpler.

Here is the updated flow:

  • Our source system generating XML files is on premise.
  • We move the XMLs to Azure Storage in near real time (Azure Blob or ADLS Gen1/2).

Paxata data prep and the DR Prediction API run on Azure:

  • The Paxata WASB (Azure Blob) or ADLS Gen1/2 connector reads the XML files (remember to glob the files if there are multiple), parses and flattens the data, and imports the consolidated data into the Paxata Library.
  • The imported data is fed to a data-prep project which does the required pre-processing. In the same project you could submit these processed events to DataRobot-deployed models to generate the prediction score (the same as calling the Prediction API). This project will also publish the target variable (the prediction score and details) to Azure Data Lake.
  • The target variable is updated back to the source system on premise. This can also be achieved through the project, if we can write back to a staging area of the source system through a connector, or we can take advantage of your existing process/procedure.
  • The entire flow can be automated using Paxata automation or APF (please see the note below).

The only thing we need to keep in mind is that Paxata automation/APF is not yet event-driven; it uses time-based triggering. Also, you need a procedure to remove the processed XML files from Azure storage, since the connector won't delete the imported files.

I think this provides a high-level end-to-end flow that can be implemented with relative ease. Please let me know if you need more details or run into any issues. Happy to help and collaborate.
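On the cleanup point, one common pattern is to name each micro-batch file after its minute window, so that files from completed windows can be globbed and deleted after import. A minimal sketch (the `xml-events/batch-` prefix and naming convention are assumptions, not a Paxata feature; the actual delete would go through your Azure storage SDK or CLI):

```python
from datetime import datetime, timezone

PREFIX = "xml-events/batch-"  # hypothetical naming convention

def batch_name(ts: datetime) -> str:
    """Name a micro-batch file after its minute window, e.g. batch-202406011230.xml."""
    return f"{PREFIX}{ts.strftime('%Y%m%d%H%M')}.xml"

def processed_files(listing, cutoff: datetime):
    """Files from minute windows strictly before the cutoff are safe to delete
    after import (the connector itself won't delete them).
    Fixed-width timestamps make lexicographic comparison correct."""
    cut = batch_name(cutoff)
    return [name for name in listing if name.startswith(PREFIX) and name < cut]

names = [batch_name(datetime(2024, 6, 1, 12, m, tzinfo=timezone.utc)) for m in (28, 29, 30)]
print(processed_files(names, datetime(2024, 6, 1, 12, 30, tzinfo=timezone.utc)))
# The two files for 12:28 and 12:29 are returned; 12:30 is still the current window.
```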

With Best Regards

Sudheer Kumar

Thanks Sudheer.

Can I call the Predict tool in Paxata for custom ML models also deployed on DataRobot?

What are the alternative options in that case?

The idea is end-to-end orchestration, from generation of the XML event by the source system to updating the source system with the prediction value.

Anonymous

Hi @BJ ,

Good question. I will check with a couple of my colleagues and get back to you. In theory, the Predict step in Paxata should be able to invoke custom ML models deployed on DataRobot (MLOps), since it goes through the same REST interface.

@shyam, please share your thoughts.

With Best Regards

Sudheer Kumar

Hi @BJ ,

Yes, you should be able to use the Predict step in Paxata to invoke custom ML models deployed on DataRobot (MLOps), since it goes through the same REST interface. I would highly recommend thorough testing whenever a custom model is deployed.
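For reference, that REST interface boils down to posting the pre-processed records to the deployment's prediction endpoint. Here is a rough sketch of the request shape in Python (standard library only); the host, deployment ID, token, and key values are placeholders, and you should verify the exact URL path and headers against your own deployment's integration snippet:

```python
from urllib.request import Request

# Illustrative values only -- take the real prediction host, deployment ID,
# API token, and DataRobot key from your deployment's Integrations tab.
HOST = "https://example.datarobot.com"
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"

def build_scoring_request(csv_payload: str) -> Request:
    """Build (but do not send) a scoring request for a DataRobot deployment."""
    return Request(
        url=f"{HOST}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions",
        data=csv_payload.encode("utf-8"),
        headers={
            "Content-Type": "text/csv; charset=UTF-8",
            "Authorization": "Bearer YOUR_API_TOKEN",
            "DataRobot-Key": "YOUR_DATAROBOT_KEY",
        },
        method="POST",
    )

req = build_scoring_request("amount,channel\n120.50,web\n")
print(req.get_method(), req.full_url)
# Sending it (urllib.request.urlopen(req)) would return the prediction scores.
```

Whether the model behind the deployment is DataRobot-built or custom, the request looks the same, which is why the Predict step works for both.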

I hope this helps.

With Best Regards

Sudheer Kumar

Hi @Anonymous ,

I have a further question on the scenario.

We have raw XML/JSON events posted to an API on Azure. The events are posted to the API by the production source system in near real time. We can store these events in near real time on Azure Service Bus using Azure Functions.

 

Can Paxata read these events from Azure Service Bus and pre-process them before prediction? Can Paxata trigger the process based on an event landing on Azure Service Bus?

Once an event is pre-processed in Paxata, can we then call the Predict function on the ML model for scoring?

Once predicted, can we store the target outcome and all the API metadata in Azure Storage?

We are looking at Paxata as part of the solution for building an end-to-end real-time prediction pipeline.

 

Thanks,

Bhanu

Anonymous

Hi @BJ ,

 

Paxata doesn't support streaming data sources or event-based triggering. Having said that, one could accomplish the solution if micro-batching is acceptable:

 

1) Micro-batch the messages on Azure Service Bus so they are delivered once every minute.

2) Write each minute's micro-batched messages to ADLS Gen2 or Azure Blob storage. Choose the filenames so that the files are easy to glob with a pattern.

3) Schedule the Paxata job to run every minute; it will grab the micro-batched messages, run them through pre-processing, and then call the Predict step for scoring.

4) Perform any additional data-prep steps and then publish the data to your desired Azure storage (ADLS Gen2 or Azure Blob storage).

5) Micro-batching also improves processing efficiency compared to scoring one event at a time.
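The micro-batching in steps 1 and 2 can be sketched as a simple grouping of messages by the minute they arrived in. A minimal illustration in plain Python (the message tuples stand in for whatever your Azure Function receives from Service Bus):

```python
from collections import defaultdict
from datetime import datetime, timezone

def micro_batch(messages):
    """Group (timestamp, payload) messages into per-minute batches,
    keyed by the minute window they arrived in."""
    batches = defaultdict(list)
    for ts, payload in messages:
        key = ts.replace(second=0, microsecond=0)
        batches[key].append(payload)
    return dict(batches)

# Hypothetical messages -- two land in the 12:30 window, one in 12:31.
msgs = [
    (datetime(2024, 6, 1, 12, 30, 5, tzinfo=timezone.utc), "<event id='1'/>"),
    (datetime(2024, 6, 1, 12, 30, 40, tzinfo=timezone.utc), "<event id='2'/>"),
    (datetime(2024, 6, 1, 12, 31, 2, tzinfo=timezone.utc), "<event id='3'/>"),
]
for minute, batch in micro_batch(msgs).items():
    print(minute.isoformat(), len(batch))
```

Each batch would then be written to one timestamped file in ADLS Gen2 or Blob storage for the scheduled Paxata job to pick up.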

 

Please let me know if you need more details. 

 

With Best Regards

Sudheer Kumar
