This article shows how you can score large datasets stored in HDFS using an in-memory model created by DataRobot. (Note that in-place scoring on Hadoop is not available for Managed AI Cloud deployments.)
Scoring Data on Hadoop
DataRobot allows you to perform distributed scoring using a DataRobot-built model from within the Deploy to Hadoop tab.
Figure 1. Deploy to Hadoop tab
As shown in Figure 1, DataRobot asks for the input and output files and then generates a datarobot-scoring command on a specified Hadoop host. This command lets you run the model on very large datasets without the network congestion that would occur if you moved the data across your network or sent it through a POST request.
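The exact syntax of the generated command depends on your DataRobot release; the flag names in the sketch below are illustrative assumptions, not the documented interface. In practice you would copy the actual command that the Deploy to Hadoop tab generates:

```
# Hypothetical invocation -- the flag names here are assumptions for
# illustration only; use the command DataRobot generates for you.
datarobot-scoring \
    --model /path/to/model.drx \
    --input hdfs:///data/scoring/input.csv \
    --output hdfs:///data/scoring/output.csv
```

The key point is that both input and output are HDFS paths, so the data never leaves the cluster during scoring.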
If you want to change the Spark job settings, click the Advanced options toggle. This lets you manually tune the resources the job will request.
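These advanced options correspond to standard Spark resource settings. As a rough sketch (whether the UI exposes exactly these property names is an assumption), the equivalent spark-submit configuration would look like:

```
# Standard Spark resource properties; the values are placeholders to
# illustrate the kind of tuning available, not recommendations.
--conf spark.executor.instances=10 \
--conf spark.executor.memory=8g \
--conf spark.executor.cores=4 \
--conf spark.driver.memory=4g
```

Tuning executor count and memory is mainly useful when the scoring dataset is large enough to compete with other jobs for cluster resources.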
Finally, you can use the datarobot-scoring command directly from the command line, or set up an Oozie job to schedule time-based execution of the model. Be aware that when you use the GUI, DataRobot handles downloading the model file for you; running the command directly from the command line involves a few extra steps, such as downloading the .drx file from the Downloads tab.
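For time-based scheduling, a minimal sketch of an Oozie coordinator that could drive such a workflow is shown below. The app name, paths, and dates are placeholders; the referenced workflow is assumed to contain a shell action that runs the datarobot-scoring command:

```xml
<!-- Hypothetical coordinator: name, paths, and dates are placeholders. -->
<coordinator-app name="datarobot-scoring-daily"
                 frequency="${coord:days(1)}"
                 start="2020-01-01T02:00Z" end="2021-01-01T02:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Workflow whose shell action invokes datarobot-scoring -->
      <app-path>hdfs:///apps/oozie/datarobot-scoring-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Here the coordinator triggers the workflow once a day; the workflow itself is where the scoring command and its input/output paths live.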
If you’re a licensed DataRobot customer, search the in-app Platform Documentation for "Deploy to Hadoop tab" and "Using Hadoop Scoring from the command line."