Optimizing Real-Time Model Scoring Request Speed

  • How fast will a model score?
  • How many records can be scored through the model in a given amount of time?
  • Is there a limit to the number of requests that can be scored?

These are some of the many questions we receive about model scoring performance. This article focuses on using the Prediction API to query models deployed within DataRobot. (Alternative options may be available; if needed, contact your DataRobot account team to explore them.) It covers scoring requests for single records or small on-demand payloads; it does not address batch processing.

Customers may come to DataRobot with a required SLA for model scoring. However, scoring data through the model itself is only one part of the end-to-end scoring lifecycle, and improvements can be made all along the scoring path. To maximize performance, evaluate and optimize each element so that every scoring request is as fast as possible.
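
Finding where the time goes starts with measuring each part of the path separately. The following is a minimal sketch, not a DataRobot utility: the hypothetical StageTimer class accumulates wall-clock time per named stage (for example payload assembly, the Prediction API call, and post-processing) so that the largest contributors stand out.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time (ms) per named stage of a scoring request."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add this stage's elapsed time, even if the stage raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def report(self):
        """(stage, ms, share of total) tuples, slowest stage first."""
        total = sum(self.durations_ms.values())
        ranked = sorted(self.durations_ms.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, ms, ms / total if total else 0.0) for name, ms in ranked]
```

In an application, each step would be wrapped as `with timer.stage("build_payload"): ...`, with stage names chosen by the caller; report() then shows which element of the scoring path deserves attention first.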

Request Pre- and Post-Processing

Assembling the scoring payload sent to the Prediction API can often be sped up through logical, technical, and design improvements. For example, retrieving data from a remote disk or an untuned database is slower than reading it from an in-memory caching database, or than using data already held in application server memory. Convoluted or overly complex processing logic can also push end-to-end response time past acceptable limits. Carefully measure the time spent in these areas and review the design for improvements.

Network Performance

An application running in a company's on-premise server room in the Philippines and connecting to DataRobot's US-East-1 cloud offering may achieve acceptable, though not optimal, performance because of network latency. The office's link to its main network provider may have limited bandwidth or speed. Even with a fast connection, the traffic still makes many hops as it is routed from machine to machine overseas before it reaches the DataRobot server in AWS US-East-1.

Colocating the requesting server with the prediction server reduces scoring time by improving network performance. For DataRobot prediction servers hosted in US-East-1, customers would ideally host the application servers that make the calls in the same AWS US-East-1 environment. Customers with their own installations benefit from the same approach: even with a fast link between their own datacenter X and datacenter Y, performance is best when the requesting application and the DataRobot server are hosted next to one another.
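
To compare candidate hosting locations, one rough measure is how long it takes to open a TCP connection to the prediction server from each location. The sketch below is an approximation under stated assumptions: TCP connection setup costs about one network round trip, the figure also includes DNS resolution, and it excludes the TLS handshake and the scoring itself. The function name is hypothetical.

```python
import socket
import statistics
import time


def median_connect_ms(host, port, attempts=5, timeout=3.0):
    """Median time (ms) to open a TCP connection to host:port.

    Connection setup takes roughly one network round trip, so this is a
    rough proxy for network latency between this machine and the server.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Running, for example, median_connect_ms("prediction-host.example.com", 443) from both the on-premise server and a colocated instance gives a like-for-like comparison; the host name here is a placeholder.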

Keep in mind that the scoring payload needs to include only the features the model uses for scoring. Additional features can be sent (and also requested back in the response from DataRobot), but the most network-efficient approach avoids shipping unnecessary data over the network in either direction.
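
A minimal sketch of trimming a payload to the model's features before it is sent. The feature list is assumed to come from the deployed model (for example, the feature list it was trained on); here it is passed in as a plain list, and the record fields are hypothetical.

```python
import json


def trim_payload(records, model_features):
    """Return records restricted to the model's features, plus any features missing."""
    wanted = list(model_features)
    trimmed, missing = [], set()
    for rec in records:
        trimmed.append({name: rec[name] for name in wanted if name in rec})
        missing.update(name for name in wanted if name not in rec)
    return trimmed, sorted(missing)


# Hypothetical record carrying fields the model never uses.
record = {"age": 34, "income": 52000, "notes": "long free-text field", "crm_id": "A-17"}
trimmed, missing = trim_payload([record], ["age", "income"])
print(len(json.dumps([record])), "->", len(json.dumps(trimmed)), "bytes")
```

Reporting missing features on the client side also catches payload construction bugs before a request is spent on them.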

Persistent HTTP/S Connections

The DataRobot API supports HTTP Keep-Alive, which keeps a connection open after it is established so that subsequent calls can reuse it. This is especially beneficial for secure HTTPS connections, which carry extra overhead (the TLS handshake) every time a new connection to the server is created. The technique helps whenever a single session is expected to make multiple model calls.
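
A minimal sketch of reusing one connection for many scoring calls with Python's standard http.client module, which speaks HTTP/1.1 and keeps the connection open between requests unless the server closes it. The PredictionClient class is hypothetical, and the host, path, and headers are placeholders rather than DataRobot's documented values; take those from your deployment. (Libraries such as requests offer the same behavior through requests.Session.)

```python
import http.client
import json


class PredictionClient:
    """Sends many scoring requests over one persistent (keep-alive) connection."""

    def __init__(self, host, port=None, use_tls=True, headers=None, timeout=10):
        conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
        # Opened lazily on the first request, then reused for later requests.
        self._conn = conn_cls(host, port, timeout=timeout)
        self._headers = {"Content-Type": "application/json", **(headers or {})}

    def _send(self, path, body):
        self._conn.request("POST", path, body=body, headers=self._headers)
        response = self._conn.getresponse()
        # Read the whole body so the connection is free for the next request.
        return response.status, response.read()

    def score(self, path, records):
        body = json.dumps(records)
        try:
            status, raw = self._send(path, body)
        except (ConnectionResetError, BrokenPipeError):
            # The server closed an idle connection (http.client.RemoteDisconnected
            # is a ConnectionResetError); reconnect once and retry.
            self._conn.close()
            status, raw = self._send(path, body)
        return status, (json.loads(raw) if raw else None)

    def close(self):
        self._conn.close()
```

Keeping one client object alive for the life of the application, rather than creating one per request, is what lets later calls skip connection and TLS setup.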

Blueprint Selection

DataRobot models include preprocessing steps as well as the trained algorithm that scores the processed data. Each algorithm, and each preprocessing technique applied, has its own speed characteristics. DataRobot provides data and graphs that show relative performance among models; evaluate these when deciding which models to deploy for production use.

[Figure: Speed vs Accuracy graph]

At the bottom left of this Speed vs Accuracy graph, the Gradient Boosted Trees Classifier M34 delivers strong accuracy (a low LogLoss) while also scoring quickly. Do not extrapolate the X-axis times or the sample size scored in this graph to draw conclusions about scoring in a production environment; use them only to compare relative performance among models within a project.

Blenders / Ensembles

Multiple models can be combined to create a single score, often increasing accuracy. This is the case for the point with the lowest LogLoss, on the right side of the Speed vs Accuracy graph above: the Advanced AVG Blender. However, the greater accuracy comes at a cost, because multiple models are scored and their results merged. Is the additional time worth the additional accuracy? Consider this question for each particular use case.
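
One way to make the tradeoff explicit is to pick the most accurate model whose measured latency fits the budget. A minimal sketch, assuming latency figures were measured in an environment representative of production (the graph's own timings should not be extrapolated); Candidate and pick_model are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    logloss: float      # lower is better
    latency_ms: float   # measured scoring time for a typical request


def pick_model(candidates, latency_budget_ms):
    """Lowest-LogLoss candidate whose latency fits the budget, or None."""
    within_budget = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if not within_budget:
        return None  # nothing meets the target; revisit the budget or the blueprints
    return min(within_budget, key=lambda c: c.logloss)
```

A blender wins only when its extra scoring time still fits the budget; otherwise the fastest sufficiently accurate single model is chosen.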

Feature Lists

DataRobot tries multiple feature lists and lets users experiment with modifying them within a project. Although a dataset may have 300 features, a subset of 100 may carry all of the useful signal and even give the model an additional lift in accuracy. A further reduced list of the top 25 of those features may perform nearly as well while being much less costly to compute. Reducing the number of features a model requires typically speeds up scoring.
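
In DataRobot the reduced list would normally be created as a new feature list in the project. The sketch below only illustrates the ranking step, with importance scores supplied as a plain dictionary (for example, exported Feature Impact values); top_k_features is a hypothetical helper.

```python
def top_k_features(impact, k):
    """Names of the k most impactful features, highest impact first.

    `impact` maps feature name -> importance score. Ties are broken by
    name so the result is deterministic.
    """
    ranked = sorted(impact.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]
```

A model retrained on the shorter list should still be compared against the original on accuracy before it replaces it in production.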

Prediction Explanations

Aside from providing the score, DataRobot can also explain why it made each prediction for each record. For example, a given Titanic passenger had an increased chance of survival, and the values female, young, and first class moved the prediction the most toward survival. This is very valuable information about both the record and the model; however, calculating the top contributing features for a prediction is computationally expensive. The burden grows further when the feature list is very wide or the model is a blender.

There are ways to ease this burden when Prediction Explanations are needed only some of the time, or are needed later (e.g., for analytic reporting) rather than right away for taking action.

In the first scenario, the endpoint can be asked to return explanations only for predictions that cross a threshold. For example, if a bank automatically denies loans with a score <= 0.5, the API can be asked to return explanations only for scores in that range:


# Parametrize Prediction Explanations with the query parameters listed in the docs:
# https://app.datarobot.com/docs/users-guide/predictions/api/new-prediction-api.html#request-pred-explanations
params = {
    'maxCodes': 3,            # at most 3 explanations per prediction
    # 'thresholdHigh': 0.5,   # would explain only predictions above 0.5
    'thresholdLow': 0.5,      # explain only predictions below 0.5
}

The second scenario suits a hybrid approach in which records are scored twice: once in real time, so that scoring results can drive actions as quickly as possible, and again later in batch. For example, a customer-facing application uses the score immediately to drive decisions; the same data is then extracted from the application as part of a nightly batch, scored with Prediction Explanations, and loaded into a data warehouse for analytic reporting the next day. Because each record is scored twice, take care not to skew the data drift or model drift statistics in model management. You can do this by (1) excluding the real-time scores from drift tracking statistics, or (2) deploying the same model twice and using one endpoint for the web application and the other for the data warehouse ETL.

Service-Level Agreements (SLAs)

SLAs should be realistic and monitored over time. DataRobot provides data drift and model drift statistics (for monitoring the health and relevance of the model's training data compared with incoming scoring requests), as well as endpoint statistics on the health and performance of a deployed model endpoint.
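
Alongside the endpoint statistics DataRobot reports, it can help to measure latency from the client, which covers the whole path including payload assembly and the network. A minimal sketch, assuming the scoring call is wrapped in a zero-argument function; the helper names and the choice of the 95th percentile are illustrative. A percentile is used rather than the mean because an average can hide a slow tail of requests.

```python
import statistics
import time


def measure_latencies_ms(call, n):
    """Time n invocations of call() and return the latencies in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def sla_report(latencies_ms, sla_ms, percentile=95):
    """Compare a latency percentile (1-99) with an SLA threshold; needs >= 2 samples."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    value = cuts[percentile - 1]
    return {"percentile": percentile, "value_ms": value, "sla_ms": sla_ms, "met": value <= sla_ms}
```

Repeating the measurement during the busiest scoring intervals shows whether the SLA still holds under peak load.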

Prescored Lookup Table

Depending on use-case volumes and needs, one way to accelerate real-time, on-demand model scoring is to complete the scoring beforehand. This approach suits use cases where the inputs take a limited, known set of combinations, so that most, if not all, of the possible combinations can be scored before any call is made. A scoring call from the application then becomes a simple lookup of the model output based on the input values provided. Note that this method hides the volume and distribution of the inputs actually coming in, so model management cannot capture a clear picture of any data drift or model drift that may be occurring.
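
A minimal sketch of building and querying a prescored table, assuming every input takes a small, finite set of values. The score argument stands in for whatever produces predictions offline (for example, a wrapper around a batch scoring job); build_prescored_table and lookup are hypothetical names.

```python
from itertools import product


def build_prescored_table(input_space, score):
    """Score every combination of the known input values ahead of time.

    input_space maps each input name to its finite list of allowed values;
    score takes a dict of inputs and returns a prediction.
    """
    names = sorted(input_space)
    table = {}
    for combo in product(*(input_space[n] for n in names)):
        table[combo] = score(dict(zip(names, combo)))
    return names, table


def lookup(names, table, request):
    """Real-time path: a dictionary lookup instead of a model call (None if unseen)."""
    return table.get(tuple(request[n] for n in names))
```

The table size is the product of the number of values per input, which is why this only works for small, known input spaces; a None result marks an unseen combination that could fall back to a live model call.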

A real-world example is a website product recommendation system. Ideally, the system would use all the information about the exact customer browsing the site to recommend, from 50 available products, the one that best complements an item they just added to their cart. But what about a scenario where 100,000 products are available?

In this kind of scenario, it may make sense to assign each customer to a cluster or customer type instead of sending all of their exact information in a prediction request; simple business rules can then place each customer into one of these types. A model can prescore combinations of customer types and product pairs, populating a file or table with product affinity and the likelihood that the customer will purchase. The product pairings are thus explored in a batch exercise that builds a recommendation table, and real-world requests are fulfilled by simply looking up the real-time customer interaction in the prescored table. The lookup keys, customer type and the product added to the cart, make it quick and simple to find the complementary product(s) to recommend.
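
A minimal sketch of the recommendation variant, assuming pair scores have already been produced by a batch scoring job. The customer-type rule, the product names, and every function here are hypothetical illustrations of the approach, not a DataRobot feature.

```python
from collections import defaultdict


def customer_type(customer):
    """Hypothetical business rule mapping a customer to a coarse segment."""
    orders = customer.get("orders_last_year", 0)
    if orders >= 10:
        return "frequent"
    return "new" if orders == 0 else "occasional"


def build_recommendations(pair_scores, top_n=3):
    """Keep the top_n complementary products per (customer type, cart product).

    pair_scores: iterable of (customer_type, cart_product, candidate, score).
    """
    grouped = defaultdict(list)
    for ctype, cart_item, candidate, score in pair_scores:
        if candidate != cart_item:  # never recommend the item already in the cart
            grouped[(ctype, cart_item)].append((score, candidate))
    return {
        key: [c for _, c in sorted(items, key=lambda sc: (-sc[0], sc[1]))[:top_n]]
        for key, items in grouped.items()
    }


def recommend(table, customer, cart_product):
    """Real-time path: derive the key and look it up; empty list if unknown."""
    return table.get((customer_type(customer), cart_product), [])
```

Only the lookup runs at request time; the expensive pair scoring happens in the nightly batch that rebuilds the table.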

Additional Considerations

Optimizing the elements presented above should yield significant gains in single-request performance and, in turn, in overall workload performance. Additional techniques may be available if further improvement is required; contact your DataRobot account team as needed. Be ready to share the results of applying the techniques above, your desired performance, and the peak workloads expected during the most demanding scoring intervals.

Last update: 10-26-2020 01:49 PM