I am trying (unsuccessfully) to load data from a CSV file in the AI Catalog (it has only one column, a full address field: address | city | state | zip) with the following syntax:
addresses = pd.read_csv("https://app.datarobot.com/usecases/6467ed0c26e748....dc4df6fc/prepare/6467e45f7f....0f934df6b2")
If I run addresses.head(6), the output looks like the table's metadata, not the first few rows of the CSV as I was expecting.
Does anyone have guidance on referencing files already loaded into the AI Catalog? It seems I'm not using the correct hash references.
PS> I am using information from this page https://docs.datarobot.com/en/docs/dr-notebooks/dr-notebook-ref.html which says:
"How can I access datasets in my Use Case that I have not yet loaded into my notebook?"
Access the dataset you want to include in the notebook from the Use Case dashboard. The ID is included in the dataset URL (after /prepare/); it is the same ID stored for the dataset in the AI Catalog.
Thank you @crussellwalker
Here is my final code, although processing my full dataset using Google's API cost me more $ than I was expecting (lesson learned - test first with a subset of your data).
I also found this documentation helpful: https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.h...
## Import required libraries (Pandas and Google Maps)
from googlemaps import Client as GoogleMaps
import pandas as pd
import time

## Create a GoogleMaps client using the API key
gmaps = GoogleMaps('{GoogleAPIkey}')

addresses = pd.read_csv("https://app.datarobot.com/usecases/{yourDRvalue}/prepare/{yourDRvalue}")
addresses.head(6)

## Add two empty columns that will hold the longitude and latitude data
addresses['long'] = ""
addresses['lat'] = ""

## Generate the longitude and latitude coordinates
for x in range(len(addresses)):
    try:
        time.sleep(1)  # add a delay in case of large DataFrames
        geocode_result = gmaps.geocode(addresses['FullAddress'][x])
        addresses.loc[x, 'lat'] = geocode_result[0]['geometry']['location']['lat']
        addresses.loc[x, 'long'] = geocode_result[0]['geometry']['location']['lng']
    except IndexError:
        print("Address was wrong...")
    except Exception as e:
        print("Unexpected error occurred.", e)

addresses.head()

## Write to internal AI Catalog object (note: this unexpectedly puts the results into a file called data.csv; I could not figure out how to name the new data file)
addresses.to_csv('address_coords.csv')
Ideally this last step could be rewritten to create a named AI Catalog item rather than data.csv, but it did work.
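If anyone else hits this, the client docs suggest something like the following might handle that last step (an untested sketch; `create_from_in_memory_data` and `modify` are my reading of the datarobot client docs, so double-check the exact signatures):

from datarobot.models.dataset import Dataset

## Upload the geocoded DataFrame as a new AI Catalog item...
new_dataset = Dataset.create_from_in_memory_data(data_frame=addresses)
## ...then rename it so it doesn't land under the default data.csv name
new_dataset.modify(name='address_coords')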
Glad to hear it!
I'm unsure how to update an existing dataset so that the alterations persist in the AI Catalog (which isn't to say there's no way), but what I can suggest is something like the following:
my_updated_dataframe = ...
new_dataset = Dataset.upload(my_updated_dataframe)
And the docs for that method can be found here:
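For example, with the geocoded `addresses` DataFrame from earlier in the thread (a minimal sketch, assuming `upload` accepts an in-memory DataFrame as the docs describe):

from datarobot.models.dataset import Dataset

## Create a brand-new AI Catalog dataset from the in-memory DataFrame
new_dataset = Dataset.upload(addresses)
print(new_dataset.id, new_dataset.name)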
This works great! Thank you.
Is there a related function that allows me to write results back to my dataset?
For example, after I process all of the geocodes, I have a new structure that includes two new columns (lat and long) that I need to save to either the existing dataset or a new one.
Also, in a notebook cell, once you've imported the `Dataset` class you can list all of your datasets like so:
Dataset.list()
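And if you need the ID of a particular dataset, you can scan that list (a quick sketch; `Dataset` objects expose `id` and `name` attributes):

from datarobot.models.dataset import Dataset

## Print the ID of any catalog dataset whose name mentions 'address'
for ds in Dataset.list():
    if 'address' in ds.name.lower():
        print(ds.id, ds.name)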
Hi,
I hope I can help.
In a notebook cell you should be able to do the following:
from datarobot.models.dataset import Dataset
# This ID will be seen here:
# https://app.datarobot.com/ai-catalog/641e1ee2cc9ba01fc1dab737
my_ds_id = '' # Replace me with something like '641e1ee2cc9ba01fc1dab737'
dataset = Dataset.get(my_ds_id)
ds_as_df = dataset.get_as_dataframe()
ds_as_df.head()