Balancing Worker Allocation for Throughput and User Experience
Image credit: XKCD
Within the DataRobot architecture, separate hardware resources are used for building models and scoring prediction requests through models. Both of these functions can be scaled out horizontally. This article will focus on the efficient usage of the model training nodes (workers) and considerations when allocating them among a user base.
DataRobot installed in a customer's VPC or on customer hardware works similarly to DataRobot's SaaS cloud offering. The license for an install includes a pool of workers (for example, 16) that users can access. Within a VPC install, this pool can be divided into separate sub-organizational pools; for example, Data Scientists can use up to 8 workers from the pool at a time, whereas Data Analysts may only be allowed to use up to 4 workers at a time. In addition to these subset pools, a ceiling can also be set on an individual user basis. How can the available workers best be used? The answer depends on organizational projects and needs, as well as user behavior and experience expectations.
Worker Queue Behavior
Note that the current work queue is a FIFO queue: work is simply executed as items come in. Most often these queue jobs relate to some form of training and testing for a new model. Evaluating a model can also create queue jobs that utilize the workers; for instance, running a Feature Impact request on a model. Because such a request is a user-demanded action rather than an automated insert of an Autopilot job, it is the exception to FIFO ordering: it jumps to the front of the queue, ahead of any backlog of items. Currently no users or user groups have queue priority over any other, so on a user basis, items are simply first come, first served.
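The queue behavior described above can be sketched with a double-ended queue. This is a toy model, not DataRobot's implementation: the class name `WorkQueue`, its method names, and the job labels are all hypothetical. The article does not say how several user-demanded requests are ordered among themselves, so the sketch does not try to model that case.

```python
from collections import deque


class WorkQueue:
    """Toy model of the work queue (hypothetical, for illustration only)."""

    def __init__(self):
        self._jobs = deque()

    def add_autopilot_job(self, job):
        # Automated Autopilot jobs join the back of the queue (plain FIFO).
        self._jobs.append(job)

    def add_user_request(self, job):
        # User-demanded requests, e.g. Feature Impact, jump to the front.
        self._jobs.appendleft(job)

    def next_job(self):
        # A free worker always takes whatever sits at the front.
        return self._jobs.popleft() if self._jobs else None


queue = WorkQueue()
queue.add_autopilot_job("alice: train model 1")
queue.add_autopilot_job("bob: train model 1")
queue.add_user_request("bob: feature impact")

order = [queue.next_job() for _ in range(3)]
print(order)
```

Bob's Feature Impact request is served first even though it arrived last, while the two training jobs keep their arrival order regardless of which user submitted them.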
Queue Behavior During an Autopilot Machine Learning Session
During an Autopilot session, queue items are added according to the current round of the model training competition. Thirty models may be added to the queue in round 1, training on 16% of the data. Those that perform best, perhaps fifteen models, will be promoted to round 2 with 32% of the data and continue onward. Queue items for round 2 are not added until round 1 completes; thus, a single Autopilot project will naturally have lulls in the queue when workers are idle and free to accept new tasks.
Below is an example of a simulated single project load with 3 rounds of a project across five available workers. Note the free space between the rounds.
Stuffing the Queue
If a user has access to idle workers, those resources can be put to use by adding another process to fill the queue, e.g., kicking off another project. For example, if I have a use case for predicting whether a customer might damage my equipment, and a separate use case for predicting the severity/cost of that damage, one option would be to run the projects in series, with just one being worked on at a time. Alternatively, if both projects are run concurrently, more total items will be inserted into the queue. This increases the likelihood of filling in any idle gaps, as tasks from one project can fill in the lulls during another. This will reduce the overall time to complete building models for both use cases.
In this example, the same user initiates 3 projects concurrently, saturating the queue with their workload. P1 R1 M1 = Project 1, Round 1, Model 1 in this simulated view.
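The effect of round barriers and of running projects concurrently can be approximated with a small simulation. This is a rough sketch under simplifying assumptions that are not from the article: every model takes exactly one time unit ("tick") on one worker, the per-round model counts (8, 4, 2) are made up, and the function name `simulate` is hypothetical. It keeps the five workers and three rounds of the simulated view.

```python
from collections import deque


def simulate(projects, workers):
    """Return the number of ticks needed to finish every project.

    projects[p][r] is the number of models in round r + 1 of project p.
    A project's next round is queued only after its current round finishes.
    """
    queue = deque()
    next_round = [0] * len(projects)   # index of the next round to enqueue
    outstanding = [0] * len(projects)  # unfinished models in the current round
    ticks = 0
    while True:
        # Enqueue the next round of any project whose current round is done.
        for p, rounds in enumerate(projects):
            while outstanding[p] == 0 and next_round[p] < len(rounds):
                count = rounds[next_round[p]]
                queue.extend([p] * count)
                outstanding[p] = count
                next_round[p] += 1
        if not queue:
            return ticks
        # Each worker takes one job from the front of the FIFO queue.
        for _ in range(min(workers, len(queue))):
            outstanding[queue.popleft()] -= 1
        ticks += 1


project = [8, 4, 2]  # made-up model counts for rounds 1 to 3
serial_ticks = sum(simulate([project], workers=5) for _ in range(3))
concurrent_ticks = simulate([project] * 3, workers=5)
print(serial_ticks, concurrent_ticks)
```

With these made-up numbers, three projects run one after another take 12 ticks, while the same three projects run concurrently take 9, because models from one project occupy workers that would otherwise sit idle at another project's round boundary.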
Throughput and User Experience
Viewed purely as a throughput machine, the platform does best when the queue is filled with as much pending work as possible, as soon as possible, and any user is allowed to use the maximum number of workers. This results in the highest number of models completed per window of time.
A pure throughput approach with a total pool of 16 workers would allow every user to use up to 16 workers. This way the jobs they insert into the queue will be completed as quickly as possible. Not only will the user complete their requests sooner, but since their project ended earlier it's less likely to run concurrently with another user's project. This fully saturated environment (illustrated in the following table) would churn out 96 models.
However, there are a couple of sticking points to this approach. One is the aforementioned queue stuffing: if a greedy user comes in and stuffs the queue with many projects, then other users will have a difficult time obtaining resources to run their own projects. As the above scenario is fully saturated, new user Dan will have 0 available workers when starting his new project.
Also, DataRobot recommends setting each user's ceiling to at least 4 workers for a good user experience. What might the same workload look like when every user is limited to 4 workers?
In the above, Dan can come in at any point in time to begin a new project, immediately leveraging 4 of the “Free” workers. But what if Dan doesn't start a new project? Then those workers sit idle when they could have been helping everyone else's work complete more quickly. For example, only 60 models will finish in this scenario (the above table) during the same time interval as the prior scenario, in which 96 models would be complete.
Finally, users are more likely to remember the times they occasionally received 2 or 0 workers than the times they had 10, 12, or 16 workers. Different projects have different demands as well; datasets of different sizes (in length and/or width) differ in training fit times, and projects that discover aggregate features (like time series) are generally more demanding.
Power users and analysts may be given both different sized group pools and different user limits within that pool. For example, data scientists may be allowed to use up to 12 of 16 workers, with any individual user allowed to use up to 8. Meanwhile, analysts may be allowed to use up to 8 of the 16 workers, with any individual user allowed to use up to 4.
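One way to reason about nested limits like these is that a user can claim only as many workers as the tightest of the three constraints allows: the user's own ceiling, the group's pool, and the total pool. The function below is an illustrative sketch of that arithmetic, not a description of how the platform enforces limits; `workers_available` and its parameters are hypothetical names.

```python
def workers_available(user_limit, user_in_use,
                      group_limit, group_in_use,
                      pool_size, pool_in_use):
    """Workers a user could still claim under user, group, and pool limits.

    group_in_use and pool_in_use are assumed to include the user's own workers.
    """
    headroom = min(
        user_limit - user_in_use,    # room left under the user's own ceiling
        group_limit - group_in_use,  # room left in the user's group pool
        pool_size - pool_in_use,     # room left in the whole licensed pool
    )
    return max(0, headroom)


# A data scientist (user ceiling 8, group pool 12, total pool 16) starts a
# project while other data scientists hold 6 workers and analysts hold 4.
print(workers_available(8, 0, 12, 6, 16, 10))
```

In this example the user's ceiling of 8 is not the binding constraint: only 6 workers remain in the data scientist group pool, and only 6 remain in the total pool.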
Dynamic Worker Allocation
It is possible to adjust workers dynamically, in an attempt to maximize all resources at all times; for example, a script could be run to employ various types of logic to address worker needs at any given point in time. A sample script is provided in the Community GitHub here as a Python Jupyter notebook. This script would ideally be run as a service account, and would require admin abilities to manipulate user worker limits as well as access to the Resource Monitor to query current worker state. Note that this script and option are available for customer installs of the platform only and are currently not an option for the SaaS cloud.
This simple script uses very straightforward logic which may be easily overridden or edited. A minimum set of workers is specified, along with a ceiling of up to the entire available pool. If 0 or 1 users are on the system, all users are able to access workers up to the ceiling amount, which is potentially every worker on the system. As more concurrent users perform some type of training task on the system, the pool is evenly divided among the number of users, until a minimum is hit.
The following chart shows the number of workers available for every user, based on the number of users trying to do training related tasks, given a minimum of 4 workers, a hard-coded ceiling of 18 workers, and a total available worker pool of 20.
Number of Users    Ceiling of Workers for each User
2                  (20/2) = 10
3                  ceiling(20/3) = 7
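The allocation rule described above can be written as a few lines of arithmetic. The sketch below is not the code from the sample notebook; the function name `per_user_ceiling` and its parameters are hypothetical, and it assumes the even split is rounded up, as the ceiling(20/3) = 7 entry suggests. It does not call any DataRobot API; it only computes the limit that such a script would then apply.

```python
import math


def per_user_ceiling(active_users, pool_size, min_workers, max_workers):
    """Worker ceiling to give each user for a given number of active users."""
    if active_users <= 1:
        # With 0 or 1 users training, everyone may use up to the ceiling.
        return max_workers
    # Split the pool evenly (rounded up), clamped to [min_workers, max_workers].
    share = math.ceil(pool_size / active_users)
    return max(min_workers, min(max_workers, share))


# Minimum of 4 workers, ceiling of 18, total pool of 20, as in the chart.
for users in range(8):
    ceiling = per_user_ceiling(users, pool_size=20, min_workers=4, max_workers=18)
    print(users, ceiling)
```

Under these assumptions the per-user ceiling is 18 for 0 or 1 users, 10 for 2, 7 for 3, 5 for 4, and stays at the minimum of 4 from 5 users onward. Because of the rounding up and the minimum, the sum of all per-user ceilings can exceed the pool size; the ceilings cap what each user may request, they do not reserve workers.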
The sample script could be scheduled to run every five minutes and constantly adjust ceilings based on platform usage. More complex logic could be coded and the script enhanced as desired.
Ultimately, an organization will need to assess its project demands and user needs with the above considerations, to balance both throughput and user experience while building models on the platform. DataRobot can assist in examining usage patterns of customer environments, both on the SaaS cloud offering and on on-premises installations.