DataRobot + Snowflake: Large Batch Scoring & Object Storage


With DataRobot's Batch Prediction API, jobs can be constructed to mix and match scoring data sources and destinations across local files, JDBC databases, and cloud object storage options such as AWS S3, Azure Blob Storage, and Google Cloud Storage (GCS). This other community article provides multiple examples of leveraging the Batch Prediction API via the GUI and raw HTTP endpoint requests. You can also reference the in-app platform documentation (“Batch Prediction API”) as well as the Python SDK library functions for creating batch scoring jobs.

A new Snowflake adapter type is now available. The critical path in a scoring pipeline is typically the compute available to run the deployed machine learning model itself: data can usually be extracted from a database quickly, but scoring throughput is limited by the available prediction compute. Inserts into sharded, columnar cloud databases (e.g., Snowflake, Synapse) are also most efficient when performed as native bulk load operations from object storage, such as COPY INTO from a Snowflake stage. An added benefit, particularly on Snowflake, is that warehouse billing can be limited to a single bulk load rather than a continual stream of JDBC inserts for the duration of a job; this reduces warehouse running time and therefore warehouse compute cost. The new Snowflake and Synapse adapters leverage bulk extract and load operations against object storage, as well as object storage scoring pipelines.
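For context, the bulk load the adapter relies on is Snowflake's own COPY INTO from an external stage. The sketch below is a minimal illustration of that operation run directly through the Snowflake Python connector; the connection parameters are placeholders, and the stage and table names are borrowed from the examples later in this article rather than anything DataRobot executes on your behalf.

import snowflake.connector

# Placeholder connection parameters; substitute your own account details.
con = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="MY_WH",
    database="MY_DB",
    schema="PUBLIC",
)

# Bulk load scored CSV files from an external S3 stage in a single operation,
# instead of issuing a continual stream of row-by-row JDBC inserts.
con.cursor().execute("""
    COPY INTO PASSENGERS_SCORED_BATCH_API
    FROM @S3_SUPPORT/scored/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
con.close()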

Batch Prediction API / Snowflake Adapter Integration

Examples provided below leverage some of the credential management helper code presented in this earlier community article. Rather than using the Python SDK (which may be preferred for simplicity), this article demonstrates how to use the raw API with minimal dependencies. As scoring datasets grow larger, the object storage approach described here can be expected to reduce both the end-to-end scoring time and the database write time.

Since the Snowflake adapter type leverages object storage as an intermediary, batch jobs require two sets of credentials: one for Snowflake and one for the object storage layer, such as S3. As with intake and output types, adapters can be mixed and matched within a job.
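The credential IDs referenced in the job definitions below come from DataRobot's credentials API; the helper code in the article linked above wraps this more robustly. As a rough, hedged sketch of registering the two credential sets with plain requests (the endpoint path, field names, and auth header follow the public /api/v2/credentials/ documentation, but verify them against your DataRobot release):

import requests

API_HOST = "https://app.datarobot.com"   # adjust for your installation
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Snowflake username/password credential (assumed "basic" credential type).
snow_resp = requests.post(
    f"{API_HOST}/api/v2/credentials/",
    headers=headers,
    json={
        "credentialType": "basic",
        "name": "snowflake_scoring_user",
        "user": "SNOW_USER",
        "password": "SNOW_PASSWORD",
    },
)
# Field name assumed from the public credentials API response.
snow_credentials_id = snow_resp.json()["credentialId"]

# S3 keys for the external stage (assumed "s3" credential type).
s3_resp = requests.post(
    f"{API_HOST}/api/v2/credentials/",
    headers=headers,
    json={
        "credentialType": "s3",
        "name": "s3_stage_keys",
        "awsAccessKeyId": "AKIA...",
        "awsSecretAccessKey": "SECRET...",
    },
)
s3_credentials_id = s3_resp.json()["credentialId"]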

Snowflake JDBC to DataRobot to S3 Stage to Snowflake


This first example leverages the existing JDBC adapter intake type together with the new Snowflake adapter output type, which performs the bulk load through a Snowflake stage object. Only the job details are specified below; you can see the full code on the DataRobot Community GitHub. The job explicitly provides all values, although many have defaults and could be omitted. It scores Titanic passengers against a survival model, specifying the input by table name.

 

job_details = {
    "deploymentId": DEPLOYMENT_ID,
    "numConcurrent": 16,
    "passthroughColumns": ["PASSENGERID"],
    "includeProbabilities": True,
    "predictionInstance": {
        "hostName": DR_PREDICTION_HOST,
        "datarobotKey": DATAROBOT_KEY
    },
    # Intake: read the source table over JDBC.
    "intakeSettings": {
        "type": "jdbc",
        "dataStoreId": data_connection_id,
        "credentialId": snow_credentials_id,
        "table": "PASSENGERS_6M",
        "schema": "PUBLIC"
    },
    # Output: bulk load scored results back into Snowflake via the S3 external stage.
    "outputSettings": {
        "type": "snowflake",
        "externalStage": "S3_SUPPORT",
        "dataStoreId": data_connection_id,
        "credentialId": snow_credentials_id,
        "table": "PASSENGERS_SCORED_BATCH_API",
        "schema": "PUBLIC",
        "cloudStorageType": "s3",
        "cloudStorageCredentialId": s3_credentials_id,
        "statementType": "insert"
    }
}
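Once job_details is assembled, the job can be submitted with nothing more than the requests library. The sketch below is a minimal illustration, assuming the public POST /api/v2/batchPredictions/ endpoint, a bearer-token header, and the documented job status fields; the polling loop is simplified relative to the full example in the Community GitHub repository.

import time
import requests

API_HOST = "https://app.datarobot.com"   # adjust for your installation
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Create the batch prediction job from the job definition above.
resp = requests.post(
    f"{API_HOST}/api/v2/batchPredictions/",
    headers=headers,
    json=job_details,
)
resp.raise_for_status()
job = resp.json()
job_url = job["links"]["self"]

# Poll until the job reaches a terminal state (status names assumed from the public docs).
while True:
    job = requests.get(job_url, headers=headers).json()
    status = job["status"]
    print(f"Job status: {status}")
    if status not in ("INITIALIZING", "RUNNING"):
        break
    time.sleep(30)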

 

Snowflake to S3 Stage to DataRobot to S3 Stage to Snowflake


This second example uses the Snowflake adapter for both intake and output: data is dumped to an object storage stage, scored through an S3 pipeline, and bulk loaded back from the stage. For both performance and cost, this is the recommended flow.

  • The stage pipeline (from S3 to S3) keeps a constant flow of scoring requests against the Dedicated Prediction Engine (DPE) scoring resources and fully saturates their compute.
  • No matter how long the scoring component takes, the Snowflake compute resources only need to run for the duration of the initial extract and, once all data is scored, for a single final bulk load of the scored data. This maximizes the efficiency of the load itself and minimizes the cost of running Snowflake compute resources.

This job is defined much like the first example. To illustrate the option, a SQL query is provided here as input rather than a source table name.

 

job_details = {
    "deploymentId": DEPLOYMENT_ID,
    "numConcurrent": 16,
    "chunkSize": "dynamic",
    "passthroughColumns": ["PASSENGERID"],
    "includeProbabilities": True,
    "predictionInstance": {
        "hostName": DR_PREDICTION_HOST,
        "datarobotKey": DATAROBOT_KEY
    },
    # Intake: unload the query results to the S3 external stage, then score from S3.
    "intakeSettings": {
        "type": "snowflake",
        "externalStage": "S3_SUPPORT",
        "dataStoreId": data_connection_id,
        "credentialId": snow_credentials_id,
        "query": "select * from PASSENGERS_6M",
        "cloudStorageType": "s3",
        "cloudStorageCredentialId": s3_credentials_id
    },
    # Output: bulk load scored results back into Snowflake via the S3 external stage.
    "outputSettings": {
        "type": "snowflake",
        "externalStage": "S3_SUPPORT",
        "dataStoreId": data_connection_id,
        "credentialId": snow_credentials_id,
        "table": "PASSENGERS_SCORED_BATCH_API",
        "schema": "PUBLIC",
        "cloudStorageType": "s3",
        "cloudStorageCredentialId": s3_credentials_id,
        "statementType": "insert"
    }
}

 

Again, the full example code is available in the DataRobot Community GitHub.

Going forward, customers running high-volume scoring jobs with Snowflake or Azure Synapse are generally advised to take advantage of the related adapters. Using one of the new adapters for both intake and output ensures the scoring pipelines scale as data volumes increase in size.
