best practice for scoring over 100 million rows?

What is the suggested way to score (using Databricks) a dataset that has many rows (100+ million)? We're hoping the best practice you suggest is faster than what we're doing now with distributed scoring -- it takes almost 3 hours.

2 Replies

We're using Databricks, yes.

Looks like that article fills in the holes for us.

Thanks for your help, doyouevendata!

The best way to score 100 million rows can depend a lot on the technical stack and options you have available, as well as where the data is coming from and going to.

If you're already on Databricks and using it to prep a large amount of data, you can bring a DataRobot model into the Databricks environment. This leverages the exportable Scoring Code option, which packages the model as a compiled Java JAR file; that JAR can be loaded in Databricks and used to score a Spark DataFrame directly.
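On Databricks, Spark handles the distribution for you, but the core pattern — splitting the rows into partitions and scoring each partition independently, then combining the results — can be sketched with just the Python standard library. This is a conceptual stand-in, not the DataRobot or Spark API: `score_partition` is a placeholder for invoking the compiled Scoring Code JAR on one partition of rows.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def score_partition(rows: List[float]) -> List[float]:
    # Placeholder scoring function. In the real setup, this is where the
    # DataRobot Scoring Code JAR would score one partition of the data.
    return [2 * r + 1 for r in rows]


def score_dataset(rows: List[float], n_partitions: int = 4) -> List[float]:
    # Split the rows into roughly equal partitions, score each partition
    # concurrently, then concatenate the results in original order.
    size = max(1, len(rows) // n_partitions)
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        scored = pool.map(score_partition, partitions)
    return [s for part in scored for s in part]


if __name__ == "__main__":
    print(score_dataset(list(range(8))))
```

The win over single-node scoring is that each partition is scored in parallel on a different worker, which is why keeping the model inside the Spark cluster (rather than calling out to an external prediction API row by row) tends to be much faster at the 100M+ row scale.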

We have an example article of this in the community: How to Monitor Spark Models with DataRobot MLOps. It additionally covers creating an external deployment in DataRobot so the model can be monitored and tracked for things like data drift.