What is the suggested way to score (using Databricks) a dataset that has many rows (100+ million)? We're hoping the best practice you suggest is faster than what we're doing now with distributed scoring -- it takes almost 3 hours.
We're using Databricks, yes.
Looks like that article fills in the gaps for us.
Thanks for your help, doyouevendata!
The best way to score 100 million rows can depend a lot on the technical stack and options you have available, as well as where the data is coming from and going to.
If you're already on Databricks and using it to prep a large amount of data, you can bring a model from DataRobot into the Databricks environment. This leverages the exportable Scoring Code option: the model is exported as a compiled Java JAR, which can be attached to your cluster and used to score a Spark DataFrame directly.
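For concreteness, here is a minimal sketch of what that scoring step can look like in a Scala notebook cell. It assumes the exported Scoring Code JAR (plus its Spark API wrapper) is attached to the cluster; the `Predictors.getPredictor` entry point, the `transform` call, and the paths shown are assumptions based on DataRobot's Scoring Code Spark API, so check the documentation for your JAR version for the exact names.

```scala
// Minimal sketch: score a large dataset with an attached DataRobot Scoring Code JAR.
// Class/method names are assumptions and may differ by Scoring Code / Spark API version.
import com.datarobot.prediction.spark.Predictors

// Read the large input dataset as a Spark DataFrame (100M+ rows parallelize across the cluster).
val inputDf = spark.read.parquet("/mnt/landing/to_score")   // hypothetical input path

// Load the model packaged inside the attached JAR and wrap it as a Spark transformer.
val model = Predictors.getPredictor()

// Score in a distributed fashion; prediction columns are appended to the DataFrame.
val scoredDf = model.transform(inputDf)

scoredDf.write.mode("overwrite").parquet("/mnt/output/scored")   // hypothetical output path
```

Because the scoring happens inside Spark tasks on your own cluster, throughput scales with the cluster size rather than with round-trips to a prediction server, which is typically where the speedup over other distributed scoring setups comes from.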
We have an example article of this in the community: How to Monitor Spark Models with DataRobot MLOps. It also covers creating an external deployment in DataRobot so the model can be monitored and tracked for things like data drift.
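As a similarly hedged sketch of that monitoring step, the reporting call below shows the general shape of sending scored rows back to an external deployment. The `MLOpsSparkUtils.reportPredictions` name, its parameter list, and the placeholder IDs are assumptions here; the linked article has the authoritative example for your library version.

```scala
// Hypothetical sketch: report scored rows to an external deployment so DataRobot MLOps
// can track service health and data drift. Signature and parameters are assumptions.
import com.datarobot.mlops.spark.MLOpsSparkUtils

val scoringSeconds = 0.0   // replace with measured scoring time if you want throughput stats

MLOpsSparkUtils.reportPredictions(
  scoredDf,                     // DataFrame produced by the scoring step above
  "<external-deployment-id>",   // external deployment created in DataRobot for this model
  "<model-package-id>",         // model package registered with that deployment
  null,                         // monitoring agent channel/spooler config (environment-specific)
  scoringSeconds,
  Array("prediction")           // prediction column(s) in scoredDf
)
```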