With COVID-19 spreading across the US and the world, DataRobot has used its enterprise AI platform to develop models that predict which US counties are likely to report their first confirmed COVID-19 cases next. The goal is to help federal, state, and local governments use this information to budget resources and take preemptive measures, and to help citizens take preventive measures. The information would also be useful to healthcare providers, who need accurate information to prepare their staff.
Since releasing this model, we have had many requests to explain our approach and share the code so it can be replicated. In this post, we explain the methodology and results of our model. The dataset we used, along with a Python notebook covering the modeling shown here in the GUI, is available in the DataRobot Community GitHub; you can use them to replicate and build upon this work.
The modeling is explained in this video.
After studying the needs of the government and the available data, we decided to focus on identifying counties in the US that are likely to have a COVID case. While we provide a step-by-step description of how to build the model, like most science, the actual path was a bit more of a zig-zag with some double-backs. This section goes through how we built the model, the features or variables our models found important for predicting COVID, and how we are continuing to improve the model with new data.
The methodology we used is known as a look-alike model. This is a common approach in marketing, where a data scientist may be presented with data on 10,000 homeowners and asked to identify 50,000 more homeowners with the same characteristics. The approach also goes by other names, such as PU (positive-unlabeled, also called positive-unknown) learning or one-class classification.
Building the Model
As with all prediction models, the starting point is gathering historical data. We identified which counties currently have confirmed cases of the virus and which do not. As an example, the map below shows confirmed cases by March 16th (Figure 1).
Figure 1. Map
The blog post we shared was based on data from Johns Hopkins. However, you can find an equally useful dataset in the New York Times GitHub repository. The New York Times updates this data daily and aggregates it at both the county and the state level. If you are getting started with modeling COVID-19, this is a great first step.
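As a rough illustration of working with county-level case data, the sketch below parses rows in the column layout of the New York Times `us-counties.csv` file (`date,county,state,fips,cases,deaths`) and records the first date on which each county reported a case. The sample rows, county names, and the `first_case_dates` helper are made up for this example; a real run would read the downloaded file instead of the inline string.

```python
import csv
import io

# Tiny inline sample in the us-counties.csv column layout.
# The rows are invented for illustration only.
SAMPLE_CSV = """date,county,state,fips,cases,deaths
2020-03-10,Alpha,Exampleland,00001,0,0
2020-03-12,Alpha,Exampleland,00001,2,0
2020-03-11,Beta,Exampleland,00003,1,0
2020-03-14,Beta,Exampleland,00003,4,0
"""


def first_case_dates(csv_text):
    """Return {fips: earliest date with at least one confirmed case}."""
    first_seen = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["fips"] or int(row["cases"]) < 1:
            continue  # skip rows without a county code or without cases
        fips, date = row["fips"], row["date"]
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings
        if fips not in first_seen or date < first_seen[fips]:
            first_seen[fips] = date
    return first_seen


first_dates = first_case_dates(SAMPLE_CSV)
print(first_dates)
```

Keying on the FIPS code rather than the county name avoids collisions between counties that share a name across states.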
Now, let's take a look at the dataset that our data scientists used to model COVID-19 (Figure 2). You can see that our data scientists aggregated the data over time as well as by county. Each row in this dataset represents a different county and each column represents a feature of that county. We included demographic and population statistics from census resources in this dataset and combined them with the COVID-19 data. You can see the target feature highlighted in yellow. It indicates whether or not the county has COVID-19 cases. This is a binary true/false variable, so we are ultimately solving a binary classification problem. The exact dataset used for this modeling is included in this post.
Figure 2. Dataset
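A minimal sketch of how a table with this shape could be assembled: one row per county, census-style features as columns, and a true/false target as of a cutoff date. The feature names, values, FIPS codes, and the `build_rows` helper are hypothetical and are not taken from the published dataset.

```python
# Hypothetical county-level census features keyed by FIPS code.
census = {
    "00001": {"population": 120000, "bachelors_degrees": 30000},
    "00003": {"population": 45000, "bachelors_degrees": 6000},
    "00005": {"population": 8000, "bachelors_degrees": 700},
}

# Hypothetical first-confirmed-case dates; counties with no cases are absent.
first_case = {"00001": "2020-03-12", "00003": "2020-03-20"}


def build_rows(census, first_case, cutoff):
    """One row per county: features plus a True/False target as of `cutoff`."""
    rows = []
    for fips, features in sorted(census.items()):
        row = {"fips": fips, **features}
        # Target is True only if a case was confirmed on or before the cutoff.
        row["has_covid"] = fips in first_case and first_case[fips] <= cutoff
        rows.append(row)
    return rows


rows = build_rows(census, first_case, cutoff="2020-03-16")
for row in rows:
    print(row)
```

Fixing the cutoff date makes the target reproducible: a county whose first case arrives after the cutoff is still labeled False, which is exactly the group the look-alike model is later asked to score.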
Once you have your dataset constructed, the next step is to import that data into DataRobot and set the target (Figure 3). You can also customize the modeling settings and explore the data at this stage. Once you indicate the target, you can confirm that you have a binary variable by looking at the distribution that appears underneath the target box.
Figure 3. Importing data
Once you have your data imported, your target set and your options customized, the next step is to push the Start button. This will kick off Autopilot and begin the partitioning, second Exploratory Data Analysis (EDA), and blueprint building. If you click on the Models tab, you can see DataRobot populate the Leaderboard with completed models (Figure 4).
Figure 4. Completed and still building models
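The same import, target, and Start steps can also be scripted. The sketch below assumes the `datarobot` Python client is installed and that you have an endpoint URL and API token; the function name `run_autopilot`, the default project name, and the arguments are placeholders for this example, and method names may differ between client versions, so treat it as an outline rather than the exact code in the notebook.

```python
def run_autopilot(csv_path, target, endpoint, token,
                  project_name="covid-county-lookalike"):
    """Upload a dataset, set the target, run Autopilot, and return the models.

    Assumes the `datarobot` client package; all argument values are supplied
    by the caller and the project name is a placeholder.
    """
    import datarobot as dr  # imported lazily so this sketch loads without the package

    # Authenticate against the DataRobot API.
    dr.Client(endpoint=endpoint, token=token)

    # Equivalent of importing the data in the GUI.
    project = dr.Project.create(sourcedata=csv_path, project_name=project_name)

    # Equivalent of choosing the target and pressing Start.
    project.set_target(target=target, mode=dr.AUTOPILOT_MODE.FULL_AUTO)

    # Block until Autopilot has finished building models.
    project.wait_for_autopilot()

    # Models built for the project, as shown on the Leaderboard.
    return project.get_models()
```

Nothing runs until the function is called with real credentials, so the definition itself is safe to load in any environment.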
Once Autopilot is complete, you can see all of the models that were built, ranked on the Leaderboard. This is a good time to evaluate the different models, starting at the top of the Leaderboard.
The top model in this case is an XGBoost model. If you click on it, the blueprint drops down and you can see the steps that were taken to build it (Figure 5).
Figure 5. Blueprint for this XGBoost model
If we look at Feature Impact, we can see that the top two features driving our model are the number of people with bachelor's degrees and the number of people with associate degrees (Figure 6).
Figure 6. Feature Impact
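One common way to measure this kind of impact is permutation importance: shuffle one feature column at a time and see how much the model's accuracy drops. The sketch below is an illustrative stand-in written with the standard library, not a description of how DataRobot computes Feature Impact internally; the toy data and the `threshold_model` function are invented for the example.

```python
import random


def accuracy(model, X, y):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(baseline - accuracy(model, shuffled, y))
    return drops


# Made-up data: feature 0 drives the label, feature 1 is unrelated noise.
X = [[v, (v * 7) % 5] for v in range(0, 100, 5)]
y = [row[0] > 50 for row in X]


def threshold_model(row):
    """Toy model that looks only at feature 0."""
    return row[0] > 50


drops = permutation_importance(threshold_model, X, y, n_features=2)
print(drops)
```

Because the toy model never reads feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero; a feature the model relies on shows a positive drop.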
If we look at Feature Effects (Figure 7), we can see that as the number of bachelor's degrees within a county increases, so does the likelihood of there being COVID-19 cases. The same is true for associate degrees. This is interesting because it suggests that populations with more education, and perhaps higher socioeconomic status, are more likely to be infected with the virus. This makes sense from a travel and mobility perspective: people with more education might be more able to travel around the country or the world and bring back outside pathogens.
Figure 7. Feature Effects
Finally, we can look at Prediction Explanations to examine the local effects of features on the target (Figure 8). Counties indicated in red are very likely to have COVID-19 infections. In these counties, the number of people with bachelor's degrees is very high, as is the population change, while counties with a low probability of COVID-19 have far fewer people with bachelor's degrees as well as a smaller median income as a percentage.
Figure 8. Prediction Explanations
This approach gives us a model that learns the common patterns between socioeconomic factors and COVID. The next step is to use the model to get predictions for all the counties that do not have COVID. The counties with the highest predictions are known as look-alike counties, because they share characteristics with the counties that have COVID.
You can do this by downloading the predictions. You don’t need a separate test/train dataset with DataRobot because it takes care of the validation, cross-validation, and holdout scoring for you. Simply go to the Predictions tab and compute the training predictions (Figure 9).
Importantly, these training predictions are stacked predictions, meaning that they are out of sample. This is a critical data science step, because we do not want to train a model with data from a county and then make a prediction for the same county.
Figure 9. Predictions
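The idea of out-of-sample training predictions can be illustrated with plain k-fold cross-validation: every row is scored by a model that was trained without it. This is a simplified sketch of the general technique, not DataRobot's internal implementation of stacked predictions; the one-feature data and the nearest-neighbor toy learner `fit_1nn` are invented to make the difference visible.

```python
def fit_1nn(X, y):
    """Toy learner: copy the label of the closest training row (it memorizes its data)."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda i: abs(X[i] - x))
        return y[nearest]
    return predict


def out_of_fold_predictions(X, y, fit, k=5):
    """Predict every row with a model trained on the other folds only."""
    n = len(X)
    preds = [None] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = model(X[i])  # row i was never seen by this model
    return preds


# Made-up single-feature data with a noisy label.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

in_sample = [fit_1nn(X, y)(x) for x in X]   # model has seen every row
oof = out_of_fold_predictions(X, y, fit_1nn)  # model has not seen the row it scores

print("in-sample matches labels:", in_sample == y)
print("out-of-fold predictions: ", oof)
```

The in-sample predictions simply echo the labels, which is why scoring a county with a model trained on that same county tells you nothing about counties the model has not seen.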
You can then look at the counties that do not already have COVID-19 but have a high predicted probability of having cases of the disease. If you rank these counties and take the top 50, you have predictions for the areas most likely to see a COVID-19 case next.
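The ranking step can be sketched in a few lines: keep only the counties without confirmed cases, sort by predicted probability, and take the top N. The county names, probabilities, and the `top_lookalikes` helper below are invented for illustration.

```python
def top_lookalikes(predictions, n=50):
    """Return the n highest-scoring counties that have no confirmed cases.

    `predictions` is a list of (county, has_covid, predicted_probability).
    """
    candidates = [(county, prob) for county, has_covid, prob in predictions
                  if not has_covid]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up downloaded predictions.
scored = [
    ("County A", True, 0.95),   # already has cases, so it is excluded
    ("County B", False, 0.81),
    ("County C", False, 0.12),
    ("County D", False, 0.64),
]

top = top_lookalikes(scored, n=2)
print(top)
```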
The DataRobot COVID county model identifies patterns in demographic and socioeconomic data in counties that have reported cases of COVID-19 and uses those patterns to identify similar counties that have not. For researchers who want to build upon this approach, this post describes the methodology and shares the dataset as well as code to reproduce the model. Finally, we have shared thoughts on how this model can be improved. We are still working with partners to improve this model, so please reach out if you have suggestions or data that could be added.