Running Batch Prediction Jobs to Read and Write from GCS

The DataRobot Batch Prediction API lets you score large datasets against deployed models running on a prediction server. The API also provides flexible options for intake and output of these files.

In this tutorial, we use the DataRobot Python Client package, which calls the Batch Prediction API, to set up a batch prediction job that reads a file for scoring from Google Cloud Storage (GCS) and then writes the results back to GCS.

All the code snippets that you see in this tutorial are part of a Jupyter notebook that you can download from here to get started.

Requirements

To run the code in this tutorial, you will need the following:

  • Python 2.7 or 3.4+
  • DataRobot Python Client package (2.21.0+) (pypi)(conda); installation and connection are sketched just after this list
  • A DataRobot deployment
  • A Google Cloud Storage (GCS) bucket
    • A scoring dataset, stored in that bucket, to score with your DataRobot deployment
  • A GCP service account with access to the GCS bucket
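
If you have not installed or configured the DataRobot Python client yet, the snippet below is a minimal setup sketch; the endpoint and API token values are placeholders that you will need to replace with your own.

# Install the client first if needed: pip install "datarobot>=2.21.0"
import datarobot as dr

# Connect to DataRobot. The endpoint shown is the default for the managed
# cloud; replace it and the token placeholder with your own values.
dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="YOUR DATAROBOT API TOKEN"
)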

Creating the GCP Service Account

The batch prediction job needs credentials in order to read from and write to Google Cloud Storage. To provide them, create a service account within GCP that has access to the GCS bucket, then download a key for that account to use in the batch prediction job.

You can set this up by logging into the GCP console and selecting IAM & Admin and then Service Accounts from the left-side menu.

(Screenshot: the Service Accounts page under IAM & Admin in the GCP console)

Click the Create Service Account button. Give your account a name and description, click the Create button, and then click Done.

On the main Service Accounts page, find the account that you just created and click on it. On the details page, click Keys, then click the Add Key menu and select Create new key. The key type defaults to JSON; confirm that it is selected before clicking the Create button.

(Screenshot: creating a new JSON key for the service account)

This will generate a key and download a JSON file with the key information that you will need for your batch prediction job.

Granting the GCP Service Account Access to the GCS Bucket

Go back to your GCS bucket and click on the Permissions tab. Click the Add button, enter the email address for the service account user that you created, and give the account the “Storage Admin” role. Click the Save button to confirm the changes.

(Screenshot: adding the service account with the Storage Admin role on the bucket's Permissions tab)
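
If you would rather grant the role programmatically than through the console, the sketch below uses the google-cloud-storage package; this package is not needed for the rest of the tutorial, and the bucket name and service account email shown are placeholders.

from google.cloud import storage

# Placeholders; replace with your bucket name and service account email
GCP_BUCKET_NAME = "YOUR GCS BUCKET NAME"
SERVICE_ACCOUNT_EMAIL = "your-service-account@your-project.iam.gserviceaccount.com"

# Uses your own GCP credentials (for example, application default credentials),
# which must be allowed to change the bucket's IAM policy
client = storage.Client()
bucket = client.bucket(GCP_BUCKET_NAME)

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.admin",
    "members": {"serviceAccount:{}".format(SERVICE_ACCOUNT_EMAIL)}
})
bucket.set_iam_policy(policy)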

Creating Stored Credentials within DataRobot

Once you have the JSON key downloaded, you can use the following code to create a new credential object within DataRobot that can be used in the batch prediction job to connect to your GCS bucket. Open the JSON key file and copy its contents into the key variable. The DataRobot Python client will read the JSON data as a dictionary and parse it accordingly.

 

# Set the name for the GCP credential in DataRobot
DR_CREDENTIAL_NAME = "YOUR GCP DATAROBOT CREDENTIAL NAME"

# Create a GCP-specific credential
# NOTE: This cannot be done from the UI
#
# The key can be generated and downloaded, ready to drop in, from within GCP:
# 1. Go to IAM & Admin -> Service Accounts
# 2. Search for the service account you want to use (or create a new one)
# 3. Go to Keys
# 4. Click Add Key -> Create new key
# 5. Select the JSON key type
# 6. Copy the contents of the JSON file into the key dictionary below
key = {
    "type": "service_account",
    "project_id": "**********",
    "private_key_id": "***************",
    "private_key": "-----BEGIN PRIVATE KEY-----\n********\n-----END PRIVATE KEY-----\n",
    "client_email": "********",
    "client_id": "********",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/*********"
}

credential = dr.Credential.create_gcp(
    name=DR_CREDENTIAL_NAME,
    gcp_key=key
)

# Look up the ID of the credential object that was just created
credential_id = None
for cred in dr.Credential.list():
    if cred.name == DR_CREDENTIAL_NAME:
        credential_id = cred.credential_id
        break
print(credential_id)
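
As an alternative to pasting the key contents by hand, you could load the downloaded key file directly into the key variable; this small sketch assumes the file was saved locally under a hypothetical name.

import json

# Hypothetical local path to the key file downloaded from the GCP console
with open("your-service-account-key.json") as f:
    key = json.load(f)  # same dictionary structure as the key variable above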

 

Setting Up and Running the Batch Prediction Job

Now that a credential object has been created, it's time to set up the batch prediction job. Set the type in both intake_settings and output_settings to 'gcp', and provide each with the gs:// URL of the file in GCS to read from or write to (the output file does not need to exist already) along with the ID of the credential object that you created above. The code below creates and runs the batch prediction job and, once it completes, reports the job's status. It also shows how to configure the job to return Prediction Explanations and passthrough columns from the scoring data.

 

DEPLOYMENT_ID = 'YOUR DEPLOYMENT ID'

# Set GCP info
GCP_BUCKET_NAME = "YOUR GCS BUCKET NAME"
GCP_INPUT_SCORING_FILE = "YOUR INPUT SCORING FILE NAME"
GCP_OUTPUT_RESULTS_FILE = "YOUR OUTPUT RESULTS FILE NAME"

# Set up the batch prediction job
# Input: Google Cloud Storage
# Output: Google Cloud Storage

job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'gcp',
        'url': "gs://{}/{}".format(GCP_BUCKET_NAME, GCP_INPUT_SCORING_FILE),
        'credential_id': credential_id
    },
    output_settings={
        'type': 'gcp',
        'url': "gs://{}/{}".format(GCP_BUCKET_NAME, GCP_OUTPUT_RESULTS_FILE),
        'credential_id': credential_id
    },
    # Request up to 5 Prediction Explanations per row; remove this line if
    # explanations are not needed
    max_explanations=5,

    # Columns from the scoring data to pass through to the output; replace the
    # placeholder names or remove this line if passthrough columns are not needed
    passthrough_columns=['column1', 'column2']
)

job.wait_for_completion()
job.get_status()

 

Once the job has successfully completed, you will see the output file in your GCS bucket.
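
If you would like to verify or download the results programmatically instead of through the GCS console, a minimal sketch using the google-cloud-storage package (reusing the placeholder names from the code above and a hypothetical key file path) could look like this:

from google.cloud import storage

# Authenticate with the service account key file downloaded earlier (hypothetical path)
client = storage.Client.from_service_account_json("your-service-account-key.json")
bucket = client.bucket(GCP_BUCKET_NAME)

# Download the scored results that the batch prediction job wrote to GCS
blob = bucket.blob(GCP_OUTPUT_RESULTS_FILE)
blob.download_to_filename("scored_results.csv")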

And, with that, you have successfully set up a batch prediction job that can read from and write to Google Cloud Storage via the DataRobot Python Client package and the Batch Prediction API.

More information

  • Get the example code for this tutorial.
  • See the DataRobot Python Client reference, Batch Prediction Methods.
  • If you're a licensed DataRobot customer, search in-app Platform Documentation for Batch Prediction API.