Reading an AI Catalog .csv file from within a Notebook

I am trying (unsuccessfully) to load data from a CSV file in the AI Catalog (it has a single column containing a full address: address | city | state | zip) with the following syntax:

 

addresses = pd.read_csv("https://app.datarobot.com/usecases/6467ed0c26e748....dc4df6fc/prepare/6467e45f7f....0f934df6b2")

If I run an addresses.head(6) command, the output looks like the table's metadata, not the first few rows of the CSV as I was expecting.

 

Does anyone have guidance on referencing files already loaded into the AI Catalog? It seems I'm not using the correct ID references.

 

PS> I am using information from this page https://docs.datarobot.com/en/docs/dr-notebooks/dr-notebook-ref.html which says:

"How can I access datasets in my Use Case that I have not yet loaded into my notebook?"

Access the dataset you want to include in the notebook from the Use Case dashboard. The ID is included in the dataset URL (after /prepare/); it is the same ID stored for the dataset in the AI Catalog.
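Since the docs say the ID sits right after /prepare/ in the Use Case URL, a quick way to pull it out programmatically is a simple string split. A small sketch (the URL below is a made-up example reusing an ID that appears elsewhere in this thread):

```python
# Extract the dataset ID that follows "/prepare/" in a Use Case URL.
# The URL here is a placeholder, not a real dataset.
url = "https://app.datarobot.com/usecases/6467ed0c26e748/prepare/641e1ee2cc9ba01fc1dab737"
dataset_id = url.rstrip("/").split("/prepare/")[1]
print(dataset_id)  # → 641e1ee2cc9ba01fc1dab737
```

That dataset_id is what you would then hand to the DataRobot client rather than passing the full URL to pd.read_csv().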


Accepted Solutions

Hi Craig,

Glad to hear you got what you needed.

 

I'm curious if you'd like to iterate on this a bit more.

 

I see in that last section of code:

## write to internal AI Catalog object (note, this unexpectedly puts the results into a file called data.csv. I could not figure out how to name the new data file)
addresses.to_csv('address_coords.csv')

Since addresses is a pandas DataFrame, the .to_csv() method (docs here) wouldn't actually put anything into the DataRobot AI Catalog.

Most likely it created a CSV file on the notebook's local filesystem.

 

If you wanted to upload that addresses DataFrame to AI Catalog with that name I'd suggest this code snippet:

from datarobot.models.dataset import Dataset

Dataset.create_from_in_memory_data(
    data_frame=addresses,
    fname="address_coords.csv",
)

The .upload() method I mentioned before is just a wrapper around that method anyway.

And for extra clarity, here are the docs for create_from_in_memory_data(), which show that fname defaults to "data.csv".

 

Hope that's helpful,

Chris


Hi,

I hope I can help.

 

In a notebook cell you should be able to do the following:

 

 

from datarobot.models.dataset import Dataset

# This ID will be seen here:
# https://app.datarobot.com/ai-catalog/641e1ee2cc9ba01fc1dab737
my_ds_id = ''  # Replace me with something like '641e1ee2cc9ba01fc1dab737'

dataset = Dataset.get(my_ds_id)

ds_as_df = dataset.get_as_dataframe()

ds_as_df.head()

 

Also, in a notebook cell, once you've imported the `Dataset` class, you can list all of your datasets like so:

 

Dataset.list()
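If the catalog holds many datasets, a small helper to map names to IDs saves copying IDs out of URLs by hand. A sketch using a stand-in: the `DatasetStub` below only mimics the `name`/`id` attributes that the objects returned by `Dataset.list()` carry, so this runs without a configured client; with a live client you would pass `Dataset.list()` directly.

```python
from collections import namedtuple

# Stand-in for the objects Dataset.list() returns; each carries .name and .id.
DatasetStub = namedtuple("DatasetStub", ["name", "id"])

def id_by_name(datasets, name):
    """Return the ID of the first dataset whose name matches, else None."""
    return next((ds.id for ds in datasets if ds.name == name), None)

catalog = [
    DatasetStub("addresses.csv", "641e1ee2cc9ba01fc1dab737"),
    DatasetStub("sales.csv", "641e1ee2cc9ba01fc1dab738"),
]
print(id_by_name(catalog, "addresses.csv"))  # → 641e1ee2cc9ba01fc1dab737
```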

This works great! Thank you. 

 

Is there a related function that allows me to write results back to my dataset?

 

For example, after I process all of the geocodes, I have a new structure that includes two new columns: lat & long that I need to save to either the existing dataset or a new one. 


Glad to hear it!

I'm not sure how to update an existing dataset in place so that the changes persist in the AI Catalog (which isn't to say there's no way), but what I can suggest is something like the following:

 

 

my_updated_dataframe = ...

new_dataset = Dataset.upload(my_updated_dataframe)

 

 

And the docs for that method can be found here:

 

https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.h...


Thank you @crussellwalker 

Here is my final code, although processing my full dataset using Google's API cost me more $ than I was expecting (lesson learned: test first with a subset of your data).
I also found this documentation helpful: https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.h...

## Import required libraries (pandas, Google Maps, time)
from googlemaps import Client as GoogleMaps
import pandas as pd
import time

## Create a GoogleMaps client using the API key
gmaps = GoogleMaps('{GoogleAPIkey}')

## Import the CSV file whose column holds the complete addresses to convert
addresses = pd.read_csv("https://app.datarobot.com/usecases/{yourDRvalue}/prepare/{yourDRvalue}")
addresses.head(6)

## Add two empty columns that will hold the longitude and latitude data
addresses['long'] = ""
addresses['lat'] = ""

## Generate the longitude and latitude coordinates
for x in range(len(addresses)):
    try:
        time.sleep(1)  # add a delay in case of large DataFrames
        geocode_result = gmaps.geocode(addresses['FullAddress'][x])
        addresses.loc[x, 'lat'] = geocode_result[0]['geometry']['location']['lat']
        addresses.loc[x, 'long'] = geocode_result[0]['geometry']['location']['lng']
    except IndexError:
        print("Address was wrong...")
    except Exception as e:
        print("Unexpected error occurred.", e)
addresses.head()

## Write the results out (note: this unexpectedly puts the results into a file
## called data.csv; I could not figure out how to name the new data file)
addresses.to_csv('address_coords.csv')

Ideally this last step could be rewritten to create a named AI Catalog item rather than data.csv, but it did work.
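As a footnote on the cost lesson above: a cheap way to dry-run a paid geocoding loop is to slice off a small subset first and only process the full DataFrame once the output looks right. A minimal sketch in pure pandas, with a stand-in geocode function in place of the paid API call (the column names mirror the script above):

```python
import pandas as pd

# Stand-in for a paid geocoding call; the real loop would call gmaps.geocode().
def fake_geocode(address):
    return {"lat": 0.0, "lng": 0.0}

addresses = pd.DataFrame({"FullAddress": [f"{i} Main St" for i in range(1000)]})

# Dry-run on the first 5 rows only, so a bug doesn't burn 1000 paid calls.
sample = addresses.head(5).copy()
for x in sample.index:
    result = fake_geocode(sample.loc[x, "FullAddress"])
    sample.loc[x, "lat"] = result["lat"]
    sample.loc[x, "long"] = result["lng"]

print(len(sample))  # → 5
```

Once the sample's lat/long values look sane, swap `addresses.head(5)` for the full frame.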


Thank you @crussellwalker for following up. My code did create a file, but I couldn't name it correctly; your suggestion is better. However, I noticed that the performance of the Dataset.create_from_in_memory_data function is atrocious: I had to slim down the number of attributes considerably and increase the session timeout for the write to the AI Catalog to complete. To those who follow: YMMV.

PS> I discovered that your code snippets, along with one to read from the AI Catalog, are available in the new Workbench area, which is super cool and handy.😀

@crussellwalker I can confirm this works; here is my proof (I swapped your ID in place of my original one, you know the drill):

[screenshot: Sylvester_0-1696593433618.png]

After you have uploaded your file to the AI Catalog, click the file so it opens; the "id" we are talking about is the one shown in that tab's URL.