Running Batch Prediction Jobs to R/W from Azure Blob Storage


The DataRobot Batch Prediction API allows users to take in large datasets and score them against deployed models running on a Prediction Server. The API also provides flexible options for intake and output of these files.

Using the DataRobot Python client package, which calls the Batch Prediction API, we will walk through how to set up a batch prediction job that reads a file for scoring from Azure Blob storage and then writes the results back to Azure Blob storage. If you are using an Azure Data Lake Storage Gen2 account, this method will also work because the underlying storage is the same.

All the code snippets that you see in this tutorial are part of a Jupyter Notebook that you can download from here to get started.

Requirements

In order to run this tutorial code, you will need the following (a short client-connection sketch follows this list):

  • Python 2.7 or 3.4+
  • DataRobot Python Package (2.21.0+) (pypi)(conda)
  • DataRobot deployment
  • Azure storage account
    • Azure storage container
    • Scoring dataset to use for scoring with your DataRobot deployment that lives in the storage container
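
Before running any of the code below, the DataRobot Python client also needs to be connected to your DataRobot instance. The following is a minimal connection sketch (not part of the original notebook); the endpoint and API token are placeholders that you must replace with your own values.

import datarobot as dr

# Connect the client to DataRobot.
# Replace the endpoint with your instance's API endpoint and the token
# with your own DataRobot API token.
dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="YOUR_DATAROBOT_API_TOKEN"
)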

Creating Stored Credentials within DataRobot

The batch prediction job needs credentials in order to read from and write to Azure Blob storage. This requires the name of the Azure storage account and an access key.

You can get these by opening the Access keys page for your storage account in the Azure portal.

[Screenshot: the storage account's Access keys page in the Azure portal]

Click the Show keys button to get the value for your access keys. You can use either of the keys shown (key1 or key2).


[Screenshot: Show keys revealing the key1 and key2 values]
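
Optionally, you can sanity-check the account name and key before storing them in DataRobot. The sketch below is a supplemental addition (not part of the original walkthrough): it assumes the azure-storage-blob package is installed and simply lists the blobs in your container, using the same placeholder values that appear in the snippets that follow.

from azure.storage.blob import BlobServiceClient

AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
AZURE_STORAGE_ACCESS_KEY = "AZURE STORAGE ACCOUNT ACCESS KEY"
AZURE_STORAGE_CONTAINER = "YOUR AZURE STORAGE ACCOUNT CONTAINER"

connection_string = "DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
    AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
)

# Listing the blobs in the container confirms the account name and key are valid
service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = service_client.get_container_client(AZURE_STORAGE_CONTAINER)
for blob in container_client.list_blobs():
    print(blob.name)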

Next, use the following code to create a new credential object within DataRobot that can be used in the batch prediction job to connect to your Azure storage account.

 

import datarobot as dr

AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
AZURE_STORAGE_ACCESS_KEY = "AZURE STORAGE ACCOUNT ACCESS KEY"

DR_CREDENTIAL_NAME = "Azure_{}".format(AZURE_STORAGE_ACCOUNT)

# Create an Azure-specific credential.
# The connection string is also shown below the access keys in the Azure portal
# if you want to copy it directly.
credential = dr.Credential.create_azure(
    name=DR_CREDENTIAL_NAME,
    azure_connection_string="DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
        AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
    )
)

# Look up the ID of the credential object just created
credential_id = None
for cred in dr.Credential.list():
    if cred.name == DR_CREDENTIAL_NAME:
        credential_id = cred.credential_id
        break

print(credential_id)

 

Setting up and Running the Batch Prediction Job

Now that a credential object has been created, it’s time to set up the batch prediction job. Set the type of both intake_settings and output_settings to azure. In each, provide the URL of the file in Blob storage that you want to read from or write to (the output file does not need to exist already), along with the ID of the credential object created earlier. The code below creates and runs the batch prediction job and, when finished, reports the job's status. It also demonstrates how to configure the job to return Prediction Explanations and passthrough columns from the scoring data.

 

DEPLOYMENT_ID = 'YOUR DEPLOYMENT ID'
AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
AZURE_STORAGE_CONTAINER = "YOUR AZURE STORAGE ACCOUNT CONTAINER"
AZURE_INPUT_SCORING_FILE = "YOUR INPUT SCORING FILE NAME"
AZURE_OUTPUT_RESULTS_FILE = "YOUR OUTPUT RESULTS FILE NAME"

# Set up our batch prediction job
# Input: Azure Blob Storage
# Output: Azure Blob Storage
job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'azure',
        'url': "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_INPUT_SCORING_FILE
        ),
        'credential_id': credential_id
    },
    output_settings={
        'type': 'azure',
        'url': "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_OUTPUT_RESULTS_FILE
        ),
        'credential_id': credential_id
    },
    # Return up to 5 Prediction Explanations with each row of results
    # (remove this line if explanations are not needed)
    max_explanations=5,

    # Pass these scoring-data columns through to the results file
    # (remove this line if passthrough columns are not needed)
    passthrough_columns=['column1', 'column2']
)

# Wait for the job to finish, then check its status
job.wait_for_completion()
job.get_status()

 

When the job has successfully completed, you should see your output file in your Blob storage container.
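
If you would like to pull the scored results back into your notebook for inspection, a sketch like the one below works. It is a supplemental addition (not shown in the original article) and assumes the output file is a CSV, that pandas and azure-storage-blob are installed, and that the connection_string, container, and output file-name variables from the earlier snippets are still defined.

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Download the results file from Blob storage and load it into a DataFrame
service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(
    container=AZURE_STORAGE_CONTAINER, blob=AZURE_OUTPUT_RESULTS_FILE
)
results = pd.read_csv(io.BytesIO(blob_client.download_blob().readall()))
print(results.head())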

And with that, you have successfully set up a batch prediction job that reads from and writes to Azure Blob Storage via the DataRobot Python client package and the Batch Prediction API.

Have questions? 

Please let me know if you have questions about what I've presented here. You can click Comment (below) or send a PM to @stretch.
