Solved: Re: Databricks batchpredictionjob.score_to_file he... - DataRobot Community

chhay · ‎11-13-2020

I need help accessing the local file system on Azure Databricks.

I tried to use this as an example:

https://github.com/datarobot-community/examples-for-data-scientists/blob/master/Making%20Predictions...

and I get stuck when trying to access the local filesystem (in this case, DBFS).

I get the error below:

InputNotUnderstoodError: sourcedata parameter not understood. Use pandas DataFrame, file object or string that is either a path to file or raw file content to specify data to upload

job = dr.BatchPredictionJob.score_to_file(
    deploymentId,
    intake_path='dbfs:/pathToRequest.csv',
    output_path='dbfs:/pathToResponse.csv',
    passthrough_columns_set='all'
)

doyouevendata · ‎11-14-2020

I have not used DBFS myself. However, it is an abstraction layer, and per some of the examples in this article, using /dbfs/<path>/input.csv and /dbfs/<path>/output.csv may work. The DataRobot SDK does not understand the dbfs:/ reference, so it needs an ordinary local file path instead.
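As a rough sketch of that suggestion (untested on Databricks, and assuming the cluster exposes DBFS at the local /dbfs mount), the dbfs:/ URI can be rewritten to a local path before it is handed to score_to_file. The helper names dbfs_uri_to_local_path and score_dbfs_file are made up for illustration and are not part of the DataRobot SDK:

```python
import os

DBFS_URI_PREFIX = "dbfs:/"   # URI form used by Spark and dbutils
DBFS_LOCAL_MOUNT = "/dbfs/"  # assumed local mount point on the driver


def dbfs_uri_to_local_path(path: str) -> str:
    """Map 'dbfs:/x/y.csv' to '/dbfs/x/y.csv'; leave other paths untouched."""
    if path.startswith(DBFS_URI_PREFIX):
        return DBFS_LOCAL_MOUNT + path[len(DBFS_URI_PREFIX):].lstrip("/")
    return path


def score_dbfs_file(deployment_id, intake_uri, output_uri, **kwargs):
    """Hypothetical wrapper: convert both paths, then call score_to_file."""
    import datarobot as dr  # imported lazily so the helper above works without it

    intake_path = dbfs_uri_to_local_path(intake_uri)
    output_path = dbfs_uri_to_local_path(output_uri)
    # Fail early with a clear message if the local mount is not available.
    if not os.path.exists(intake_path):
        raise FileNotFoundError(f"Intake file not found: {intake_path}")
    return dr.BatchPredictionJob.score_to_file(
        deployment_id, intake_path, output_path, **kwargs
    )
```

With this, the call from the question would become score_dbfs_file(deploymentId, 'dbfs:/pathToRequest.csv', 'dbfs:/pathToResponse.csv', passthrough_columns_set='all').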

Note also that in some scenarios a model can be brought into a Spark environment and used to score data directly from a Spark DataFrame.

I also typically advise passing through only a surrogate key column (or columns) and joining the predictions back to the original dataset if desired. Passing through all columns costs compute time and, even more so, network time to move the additional data around, even though in many cases you already have that data on the client/source side.
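A minimal sketch of that join-back pattern with pandas follows. The column names id, feature, and prediction are placeholders rather than the names DataRobot actually writes, and the scoring call in the comment assumes the client's passthrough_columns option is used with a single key column:

```python
import pandas as pd

# Scoring would keep only the key column, for example (not executed here):
#   dr.BatchPredictionJob.score_to_file(
#       deployment_id, intake_path, output_path, passthrough_columns=["id"]
#   )


def join_predictions(original: pd.DataFrame,
                     predictions: pd.DataFrame,
                     key: str = "id") -> pd.DataFrame:
    """Attach prediction columns to the original rows via the surrogate key."""
    # validate= raises if the key is duplicated on either side.
    return original.merge(predictions, on=key, how="left", validate="one_to_one")


# Tiny illustrative frames; the predictions arrive in a different row order.
original = pd.DataFrame({"id": [1, 2, 3], "feature": ["a", "b", "c"]})
predictions = pd.DataFrame({"id": [3, 1, 2], "prediction": [0.9, 0.1, 0.5]})

joined = join_predictions(original, predictions)
```

The left join keeps the row order of the original dataset, so the result lines up with the source data even if the scored file comes back in a different order.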

