This article shows how to ingest data from an Amazon S3 bucket into DataRobot.
Identify the object
To start using an object saved in an S3 bucket, first navigate to the dataset you want to use. Then copy the object’s URL (Figure 1).
Figure 1. Identifying object URL
Next, select AI Catalog from the DataRobot GUI.
Figure 2. DataRobot GUI
Now click Add to catalog and select “URL” (Figure 3).
Figure 3. URL add to catalog option
In the URL box, paste the URL of the object and click Save. DataRobot automatically reads the data and infers its data types and schema. This works the same way as uploading a CSV file from your local machine.
You can also ingest data into DataRobot from private S3 buckets. For example, a pre-signed S3 URL creates a temporary link that DataRobot can use to retrieve the file. One of the easiest ways to create one is with the AWS Command Line Interface (CLI). After the CLI has been installed and configured, a command similar to the following may be used:
aws s3 presign s3://bucket-name/file.csv --expires-in 600
The URL produced in this example allows anyone who has it to read the private file.csv from the private bucket bucket-name, and the signed link remains valid for 600 seconds after it is created.
If you have your own DataRobot installation, you have the following additional options:
The datarobot service account that the application runs as can be granted IAM privileges to read private S3 buckets. DataRobot can then ingest from any S3 location that the account has privileges to access.
S3 impersonation of the user logging in to DataRobot can also be implemented to give more limited access to S3 data. This requires that LDAP be used for authentication, with the authorized roles for the user specified in LDAP attributes.
Both of these options accept an s3:// URI path.
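Because the URL option in the AI Catalog takes an HTTPS object URL while these options take an s3:// URI, it can help to see how the two forms relate. The sketch below converts between them in Python. It assumes the virtual-hosted-style global endpoint (bucket-name.s3.amazonaws.com); regional or path-style endpoints are not handled, and keys containing "?" or "#" are out of scope. The function names are hypothetical.

```python
from urllib.parse import quote, unquote, urlparse

# Assumption: the global virtual-hosted-style endpoint suffix.
S3_HOST_SUFFIX = ".s3.amazonaws.com"


def s3_uri_to_https(uri: str) -> str:
    """Turn s3://bucket/key into https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a complete s3:// URI: {uri!r}")
    # Percent-encode the key but keep "/" so folder-like prefixes survive.
    return f"https://{parsed.netloc}{S3_HOST_SUFFIX}/{quote(key)}"


def https_to_s3_uri(url: str) -> str:
    """Turn https://bucket.s3.amazonaws.com/key back into s3://bucket/key."""
    parsed = urlparse(url)
    host = parsed.netloc
    if parsed.scheme != "https" or not host.endswith(S3_HOST_SUFFIX):
        raise ValueError(f"not a recognized S3 object URL: {url!r}")
    bucket = host[: -len(S3_HOST_SUFFIX)]
    key = unquote(parsed.path.lstrip("/"))
    if not bucket or not key:
        raise ValueError(f"bucket or key missing in: {url!r}")
    return f"s3://{bucket}/{key}"
```

For the bucket and file used earlier, s3_uri_to_https("s3://bucket-name/file.csv") gives the HTTPS form, and https_to_s3_uri reverses it. The conversion is purely textual and does not check that the object exists or that it is readable.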
Figure 4. Pasting URL
Initiating DataRobot Project
Now that your data has been successfully uploaded, click Create project in the upper right corner (Figure 5).
Figure 5. Created AI Catalog table
From here, you can start a project as you normally would in DataRobot.
If you’re a licensed DataRobot customer, search the in-app documentation for Non-catalog import methods, then see the section “Importing files from S3” for more information.