What is the suggested way to score a dataset with many rows (100+ million) using Databricks? We're hoping the best practice you suggest is faster than our current distributed scoring, which takes almost 3 hours.
The best way to score 100 million rows depends heavily on your technical stack and the options you have available, as well as where the data is coming from and where the results need to go.
If you're already on Databricks and using it to prep a large amount of data, we can bring a model from DataRobot into the Databricks environment. This uses the exportable Scoring Code option, in which the model is exported as a compiled Java JAR file. The JAR can be brought into the Databricks environment and used to score a Spark DataFrame.
We have an example of this in the community article How to Monitor Spark Models with DataRobot MLOps. It also covers creating an external deployment in DataRobot so that the model can be monitored and tracked for things like data drift.
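To illustrate the general shape of scoring inside the cluster, here is a minimal sketch of the "load the model once per partition, then score rows in batches" pattern. It is not DataRobot's API: `StubModel`, `load_model`, and `score_partition` are hypothetical names, and the stub's arithmetic stands in for whatever the exported Scoring Code JAR would compute. The sketch is plain Python so that it runs anywhere; the comment at the end shows how a function like this could be handed to Spark, assuming rows are first converted to dictionaries.

```python
from typing import Dict, Iterable, Iterator, List


class StubModel:
    """Hypothetical stand-in for a model loaded from an exported scoring artifact."""

    def predict_batch(self, rows: List[Dict]) -> List[float]:
        # Placeholder logic only; a real model would compute its own predictions.
        return [row["amount"] * 2.0 for row in rows]


def load_model() -> StubModel:
    # Assumption: loading is expensive, so it should happen once per partition,
    # not once per row.
    return StubModel()


def _score_batch(model: StubModel, batch: List[Dict]) -> Iterator[Dict]:
    predictions = model.predict_batch(batch)
    for row, prediction in zip(batch, predictions):
        # Keep the input columns and append the prediction.
        yield {**row, "prediction": prediction}


def score_partition(rows: Iterable[Dict], batch_size: int = 10_000) -> Iterator[Dict]:
    """Score one partition's rows in fixed-size batches, preserving row order."""
    model = load_model()
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield from _score_batch(model, batch)
            batch = []
    if batch:
        # Score whatever is left over after the last full batch.
        yield from _score_batch(model, batch)


# In Spark, a function like this could be applied per partition, for example:
#   scored = df.rdd.map(lambda r: r.asDict()).mapPartitions(score_partition)
```

Because the function only consumes and produces iterators, each executor holds at most one batch in memory at a time, and `batch_size` can be tuned to the cluster's memory.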