Databricks batchpredictionjob.score_to_file help

I need help accessing the local file system on Azure Databricks.

I tried to use this as an example:

https://github.com/datarobot-community/examples-for-data-scientists/blob/master/Making%20Predictions...

and get stuck when trying to access the local filesystem -- in this case, DBFS.

I get the error below:

InputNotUnderstoodError: sourcedata parameter not understood. Use pandas DataFrame, file object or string that is either a path to file or raw file content to specify data to upload
 
import datarobot as dr

job = dr.BatchPredictionJob.score_to_file(
    deploymentId,
    intake_path='dbfs:/pathToRequest.csv',
    output_path='dbfs:/pathToResponse.csv',
    passthrough_columns_set='all'
)

Accepted Solutions

I have not used DBFS myself. However, since it is an abstraction layer over the underlying storage, and per some of the examples in this article, it seems that using /dbfs/<path>/input.csv and /dbfs/<path>/output.csv may work. The DataRobot SDK does not understand the dbfs: reference.
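For example, the call from the original post might look like this when pointed at the /dbfs/ mount (a rough sketch only, since I have not used DBFS; deploymentId and the file names are placeholders carried over from the question):

import datarobot as dr

# Use the /dbfs/ local mount rather than the dbfs: URI so the SDK
# receives an ordinary file path it can open.
job = dr.BatchPredictionJob.score_to_file(
    deploymentId,
    intake_path='/dbfs/pathToRequest.csv',
    output_path='/dbfs/pathToResponse.csv',
    passthrough_columns_set='all'
)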

Note also that there are scenarios where a model can be brought into a Spark environment to score data through a Spark DataFrame as well.

Also note that I typically advise passing through only a surrogate key column (or columns) and then joining the predictions back to the original dataset if desired. Passing through all columns takes extra compute time and, certainly, network time moving the additional data around, even though in many cases that data already exists in the client/source.
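As a rough sketch of that pattern (again untested on DBFS, and assuming a hypothetical 'record_id' key column exists in the request file):

import datarobot as dr
import pandas as pd

# Pass through only a surrogate key column instead of every column.
job = dr.BatchPredictionJob.score_to_file(
    deploymentId,
    intake_path='/dbfs/pathToRequest.csv',
    output_path='/dbfs/pathToResponse.csv',
    passthrough_columns=['record_id']
)
job.wait_for_completion()  # wait for the job; score_to_file may already block until the download finishes

# Join the predictions back to the original dataset on the surrogate key.
original = pd.read_csv('/dbfs/pathToRequest.csv')
scored = pd.read_csv('/dbfs/pathToResponse.csv')
result = original.merge(scored, on='record_id', how='left')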

