Deploying a Model to Hadoop

(Updated May 2020)

This article shows how to score large datasets stored in HDFS using an in-memory model created by DataRobot. (Note that in-place scoring on Hadoop is not available for Managed AI Cloud deployments.)

Scoring Data on Hadoop

DataRobot allows you to perform distributed scoring using a DataRobot-built model from within the Deploy to Hadoop tab.

Figure 1. Deploy to Hadoop tab

As you can see in Figure 1, DataRobot will ask for the input and output files and will then generate a datarobot-scoring command on a specified Hadoop host. This command allows you to run the model on huge datasets without worrying about the network congestion that would occur if you were to move this data around your network or send it through a POST request.

Advanced Options

If you want to change the Spark job settings, click the Advanced options toggle. This will give you the opportunity to manually tune which resources this job will require.

Finally, you can run the datarobot-scoring command directly from the command line or set up an Oozie job to schedule time-based execution of the model. Be aware that when you use the GUI, DataRobot handles downloading the model file for you. Running the command directly from the command line involves a few extra steps, such as downloading the .drx file from the Downloads tab.

More Information

If you’re a licensed DataRobot customer, search the in-app Platform Documentation for Deploy to Hadoop tab and Using Hadoop Scoring from the command line.

Yes, for data extraction, feature generation for scoring, and scoring your dataset (the actual reason you trained a model!), you would still need the platform, as that data is also likely to be huge.

One of the key requirements for a model is that it needs to be portable. You are likely to want to use the model in different places in the business, and there is a need for a mechanism that contains all the details necessary to make a new prediction in a different environment. This needs to include the type of algorithm used and the coefficients and other parameters calculated during training. For Spark and scikit-learn the formats supported are:

  • Spark MLWritable: the standard model storage format included with Spark, but limited to use only within Spark
  • Pickle: a standard Python serialisation library used to save models from scikit-learn
  • PMML (Predictive Model Markup Language): a standardised language used to represent predictive analytic models in a portable text format
  • ONNX (Open Neural Network Exchange): provides a portable model format for deep learning, using Google Protocol Buffers for the schema definition
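
To make the portability idea concrete, here is a minimal sketch of the Pickle round trip using only the Python standard library. The model is a hypothetical plain dict holding the algorithm type, coefficients, and intercept; it is an illustrative stand-in, not a DataRobot or scikit-learn object, and the `predict` helper is likewise an assumption made for this example.

```python
import pickle


def predict(model, row):
    """Score one row with a linear model described by a plain dict (hypothetical helper)."""
    # Intercept plus the dot product of the coefficients and the feature values.
    return model["intercept"] + sum(c * x for c, x in zip(model["coefficients"], row))


# "Training" environment: everything needed for a new prediction is captured,
# namely the algorithm type and the parameters calculated during training.
trained = {
    "algorithm": "linear_regression",
    "coefficients": [2.0, -1.0],
    "intercept": 0.5,
}
payload = pickle.dumps(trained)  # bytes that can be written to a file or shipped elsewhere

# "Scoring" environment: restore the model and make a new prediction.
restored = pickle.loads(payload)
prediction = predict(restored, [3.0, 4.0])
print(restored["algorithm"], prediction)
```

With scikit-learn, the same `pickle.dumps` and `pickle.loads` calls are typically applied to the fitted estimator itself. Keep in mind that Pickle files are Python-specific and should only be loaded from trusted sources, because unpickling can execute arbitrary code.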
Last update: 03-23-2021 01:16 PM