DataRobot & Snowflake: Project Creation

This is the first in a series of articles leveraging the DataRobot platform with the cloud database Snowflake. The series will progress from connectivity to supporting data ingest into DataRobot, to creating projects and machine learning models based on historic training sets, and finally to scoring new data through deployed models via several deployment methodologies. (Look for the next article in this series, "DataRobot & Snowflake: Client-Request Model Scoring.")

The usage examples provided are not exclusive to Snowflake and can be applied, in part or in whole, to other databases as well.

To create a project in DataRobot, a set of training data is needed. This dataset may or may not go through data engineering or feature engineering processes before being used for modeling in DataRobot. At a high level, there are two approaches for getting this data into DataRobot.

  1. PUSH—DataRobot is sent data and told to create a project with it. Examples include dragging a supported file type into the GUI or leveraging the DataRobot API.
  2. PULL—DataRobot is told to create a project by pulling data from somewhere, such as a URL to a dataset, or via a database connection.

Both approaches are demonstrated below. A single tabular dataset will be created with one new feature-engineered column, using the well-known Kaggle Titanic Survival dataset. The new feature is:

total_family_size = sibsp + parch + 1
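The derivation can be sanity-checked on a toy slice of the data (the rows here are hypothetical, not actual Titanic records):

```python
import pandas as pd

# toy rows standing in for the Titanic training data (values are hypothetical)
df = pd.DataFrame({
    'sibsp': [1, 0, 3],   # siblings/spouses aboard
    'parch': [0, 2, 1],   # parents/children aboard
})

# total_family_size counts the passenger plus every family member aboard
df['total_family_size'] = df['sibsp'] + df['parch'] + 1

print(df['total_family_size'].tolist())  # [2, 3, 5]
```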

PUSH—DataRobot Modeling API

DataRobot can be interacted with via either the GUI or programmatically through a REST API. The latest documentation for the endpoints can be found on the DataRobot Support site. This API is wrapped by an available R SDK and Python SDK, which simplifies calls and workflow with common multistep and asynchronous processes.

This process will leverage Python 3 and the DataRobot SDK package for project creation. The data will be obtained via the Snowflake Connector for Python. For ease of use in data manipulation, the installation option citing the Pandas-compatible driver will be used to accommodate feature engineering using dataframes.
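Assuming a pip-based environment, the packages used below can be installed as follows; the extras syntax is what pulls in the pandas-compatible pieces of the Snowflake connector:

```shell
# Snowflake connector with pandas support, plus the DataRobot SDK
pip install "snowflake-connector-python[pandas]" datarobot pandas
```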

First, import the necessary libraries and set the connection parameters; for convenience, placeholder credentials are hardcoded into the script in this example.

 

import snowflake.connector
import datetime
import datarobot as dr
import pandas as pd

# snowflake parameters (replace the placeholders with your own credentials)
SNOW_ACCOUNT = 'YOUR_SNOWFLAKE_ACCOUNT'
SNOW_USER = 'YOUR_SNOWFLAKE_USER'
SNOW_PASS = 'YOUR_SNOWFLAKE_PASSWORD'
SNOW_DB = 'TITANIC'
SNOW_SCHEMA = 'PUBLIC'

# datarobot parameters
DR_API_TOKEN = 'YOUR API TOKEN'
# replace app.datarobot.com with application host of your cluster if installed locally
DR_MODELING_ENDPOINT = 'https://app.datarobot.com/api/v2'
DR_MODELING_HEADERS = {'Content-Type': 'application/json', 'Authorization': 'token %s' % DR_API_TOKEN}

 

The training dataset for this example was loaded into the table TITANIC.PUBLIC.PASSENGERS_TRAINING. It is retrieved and brought into a pandas DataFrame.

 

# create a connection
ctx = snowflake.connector.connect(
          user=SNOW_USER,
          password=SNOW_PASS,
          account=SNOW_ACCOUNT,
          database=SNOW_DB,
          schema=SNOW_SCHEMA,
          protocol='https',
)

# create a cursor
cur = ctx.cursor()

# execute sql
sql = "select * from titanic.public.passengers_training"
cur.execute(sql)

# fetch results into dataframe
df = cur.fetch_pandas_all()

df.head()

 

[Screenshot: df.head() output of the raw training data]

Feature engineering can then easily be done within Python (in this case, using the pandas library). Feature names are uppercase because Snowflake follows the ANSI SQL standard: unquoted column names are treated as case-insensitive and stored in uppercase.

 

# feature engineering a new column for total family size
df['TOTAL_FAMILY_SIZE'] = df['SIBSP'] + df['PARCH'] + 1

df.head()

 

[Screenshot: df.head() output with the new TOTAL_FAMILY_SIZE column]

The data is then submitted to DataRobot to start a new machine learning project.

 

# create a connection to datarobot
dr.Client(token=DR_API_TOKEN, endpoint=DR_MODELING_ENDPOINT)

# create project
now = datetime.datetime.now().strftime('%Y-%m-%dT%H:%M')
project_name = 'Titanic_Survival_{}'.format(now)
proj = dr.Project.create(sourcedata=df,
    project_name=project_name)

# further work with project via the python API, or work in GUI (link to project printed below)
print(DR_MODELING_ENDPOINT[:-6] + 'projects/{}'.format(proj.id))

 

The project can be further interacted with using the SDK, or through the GUI with the provided hyperlink.

PULL—Snowflake JDBC SQL

The DataRobot platform supports installing JDBC drivers to establish database connectivity. For the Managed AI Cloud offering to reach a locally hosted database, firewall ports must be opened to let DataRobot in, and its IP addresses whitelisted for incoming traffic. (If this is required, contact DataRobot Support.) Snowflake is cloud-native and publicly reachable by default; similar steps may be needed if a service like AWS PrivateLink sits in front of a Snowflake instance.

The DataRobot Managed AI Cloud has a JDBC driver installed and available; customers with an installation of their own DataRobot cluster will need to add it. (This is a simple task but out of scope for the purpose of this article.) Extensive exploration of the capabilities provided by the DataRobot AI Catalog will also be left to other articles.

Establishing a JDBC connection and initiating a project from the source via SQL can be completed as follows. The AI Catalog will be used to set up objects connected to Snowflake, with the JDBC driver backing a data connection.

Navigate to AI Catalog -> Add to catalog -> Add data from -> Data Connection.

[Screenshot: add-connection workflow in the AI Catalog]

Then, select Add a new data connection.

[Screenshot: new data connection dialog]

The format of the JDBC driver connection string can be found in the Snowflake documentation. In this example the database is named titanic; any parameters left unspecified fall back to the defaults associated with the account logging into Snowflake.
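The general shape of the connection string looks like the following (an illustrative sketch; the account identifier, warehouse, and other parameters will vary by deployment):

```
jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?db=titanic&schema=public&warehouse=<warehouse_name>
```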

Provide database credentials on the next screen and choose whether DataRobot should save them securely or prompt for credentials upon connection attempts to Snowflake.

Now that the data connection has been made, it can be leveraged to bring items into the catalog. Choose Add to catalog -> Data Connection and then choose the newly created connection. A prompt will appear to enter credentials or reuse an existing saved set. Once connected, DataRobot captures metadata for the objects accessible to the database user so they can be browsed. For this example, the new column will be feature-engineered in SQL.

Choose the SQL query option rather than the object-browsing options. The SQL to extract the data and create the new feature can be written and tested here.
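One possible query for this step (a sketch assuming the standard Titanic column names) selects the raw columns and derives the new feature directly in SQL:

```sql
select passengerid,
       survived,
       pclass,
       sex,
       age,
       sibsp,
       parch,
       fare,
       embarked,
       sibsp + parch + 1 as total_family_size
from titanic.public.passengers_training
```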

[Screenshot: SQL query editor in the AI Catalog]

Note the Create Snapshot checkbox: when checked, the query is run once and the extracted data is materialized in the catalog. A snapshot dataset can be shared among users and used to create any number of projects without reaching out to the database again; it is labeled Snapshot within the catalog. Alternatively, a dataset can be registered as Dynamic: each use in a subsequent project re-executes the query against the database and pulls the latest data. Even for a dynamic dataset, DataRobot executes the query once and pulls a small amount of data to assess column data types. When registered successfully, the dataset is published and available for use.

Some additional considerations with this method:

  • The SQL for a dynamic dataset cannot be edited.
  • A best practice for a dynamic dataset is to list each column of interest rather than using "*" for all columns.
  • Ordering the data may be worthwhile, since row order can affect training dataset partitions from one project to the next.
  • Ordering the data is a requirement for time series, because data in a series must be grouped together and sorted on the time element.
  • Users may want to connect to views whose underlying logic changes during project iterations. That workflow can make it difficult to associate a given project with the view logic in effect at the time of extraction.
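Taken together, the points about explicit columns and ordering suggest a dynamic-dataset query pattern like the following (columns and sort key are illustrative):

```sql
-- explicit column list and deterministic ordering for repeatable partitions
select passengerid,
       survived,
       sibsp + parch + 1 as total_family_size
from titanic.public.passengers_training
order by passengerid
```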

Code for this article is available on GitHub.

The next article in the series, "DataRobot & Snowflake: Client-Request Model Scoring," will present some client-side model scoring techniques using DataRobot and data from the cloud database Snowflake.
