In this article we discuss time series anomaly detection and show how to use DataRobot Automated Time Series to train a model that detects anomalies.
This use case identifies when a motor is about to experience a failure. The dataset (Figure 1) contains various IoT sensor readings from six motors inside a manufacturing plant. Our goal is to build an anomaly detection model. Our test data contains some labeled anomalies, which we'll use to help select and verify the best model.
Figure 1. Dataset
Uploading dataset and setting options
To create an anomaly detection model using DataRobot, you first need to upload the dataset into DataRobot (new project screen). After the dataset has been uploaded, you need to tell DataRobot that this is actually an unsupervised time series project. To do that, select No target? and Proceed in unsupervised mode.
Figure 2. Unsupervised modeling
Next we set up time aware modeling, choose our date/time feature, and then select Time Series Modeling.
Figure 3. Set up time aware modeling
We need to tell DataRobot how far into the past to go to create lag features and rolling statistics. For this example we will use the default feature derivation window, from 120 minutes before each forecast point to 0.
Figure 4. Feature Derivation window for time aware modeling
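DataRobot derives these features automatically, but it helps to see what a lag feature and a rolling statistic actually are. The sketch below uses pure Python and a hypothetical per-minute `motor_1_rpm` series; the function names and readings are illustrative, not DataRobot's internals.

```python
# Sketch of the kinds of features a feature derivation window produces.
# `readings` is a hypothetical per-minute sensor series (motor_1_rpm, say);
# a 120-minute window would allow lags and rolling stats up to 120 steps back.

readings = [10.0, 10.2, 9.8, 10.5, 11.0, 10.9, 10.4, 10.1]

def lag(series, k):
    """The value k steps in the past (None where history is unavailable)."""
    return [series[i - k] if i - k >= 0 else None for i in range(len(series))]

def rolling_mean(series, window):
    """Mean over the trailing `window` values, including the current one."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

lag_1 = lag(readings, 1)            # e.g. "motor_1_rpm (1 minute ago)"
mean_3 = rolling_mean(readings, 3)  # e.g. "motor_1_rpm (3-minute mean)"
```

Multiplied across 74 original features, many lag offsets, and several statistics, this is how a few dozen inputs become hundreds of derived features.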
Automated Time Series has a number of modeling options that can be configured. We'll look at the most commonly used options.
With time series, we can’t just randomly sample data into partitions. The correct approach is called backtesting, and DataRobot performs it automatically. Backtesting trains on historical data and validates on more recent data, repeating the process multiple times to ensure the model is stable. You can adjust the validation periods and the number of backtests to suit your needs.
To see the backtesting settings, navigate to Show Advanced Options > Partitioning > Date/Time. (For this article, we left the defaults for the dataset as shown in Figure 5.)
Figure 5. Backtesting and validation length options
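Conceptually, backtesting is a set of rolling-origin train/validation splits. The sketch below shows the idea in pure Python; the row counts, window length, and number of backtests are illustrative and not DataRobot's defaults.

```python
# Rolling-origin backtesting sketch: each backtest trains on earlier rows and
# validates on the window that immediately follows. Illustrative sizes only.

def backtest_splits(n_rows, validation_len, n_backtests):
    """Yield (train_indices, validation_indices) pairs, most recent first."""
    splits = []
    end = n_rows
    for _ in range(n_backtests):
        val = list(range(end - validation_len, end))
        train = list(range(0, end - validation_len))
        splits.append((train, val))
        end -= validation_len
    return splits

splits = backtest_splits(n_rows=100, validation_len=10, n_backtests=3)
# Backtest 1 validates on rows 90-99, backtest 2 on 80-89, backtest 3 on 70-79;
# each trains only on rows that come before its validation window.
```

The key property is that validation rows are always strictly later than training rows, so the model is never evaluated on data it could have "seen from the future."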
DataRobot also allows you to provide an event calendar, which it uses to generate forward-looking features that help the model capture special events. A calendar file consists of two fields: the date and the name of the event (Figure 6).
Figure 6. Event Calendar
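A calendar file is just a small two-column CSV. The snippet below builds one with the standard library; the dates and event names are made up for illustration.

```python
# Build a minimal event calendar: two columns, date and event name.
# The dates and event names below are hypothetical examples.
import csv
import io

calendar_rows = [
    ("Date", "Event"),
    ("2020-12-25", "Christmas"),
    ("2021-01-01", "New Year's Day"),
    ("2021-03-15", "Planned maintenance shutdown"),
]

buf = io.StringIO()
csv.writer(buf).writerows(calendar_rows)
calendar_csv = buf.getvalue()  # this text, saved as a .csv, is the calendar file
```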
To add an event calendar, select Time Series (under Advanced Options), scroll down to Calendar of holidays and special events, and drag and drop your calendar file as shown in Figure 7.
Figure 7. Adding the event calendar
There are many more options we could experiment with, but for now this is enough to get started.
Once we hit Start, DataRobot will take the original features we gave it (74 for our dataset), and create hundreds of derived features for the numeric, categorical, and text variables. For the dataset we used for this article, Automated Time Series created 456 new time series features as shown in Figure 8.
Figure 8. DataRobot created many derived features from the original features
After Autopilot completes, we can examine the results on the Leaderboard (Models tab) and evaluate the top-performing model across all backtests.
Leaderboard and Synthetic AUC
The Leaderboard sorts models by Synthetic AUC. This metric lets you compare models even when you don’t have an external test set that identifies the anomalies.
Figure 9. Leaderboard sorted by Synthetic AUC
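AUC itself has a simple interpretation: it is the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal observation. The sketch below computes it directly from that definition; the scores and labels are made up.

```python
# Rank-based AUC: the probability that a random anomaly outscores a random
# normal point (ties count as half). Scores and labels here are made up.

def auc(scores, labels):
    """AUC for binary labels (1 = anomaly, 0 = normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.2, 0.8, 0.1, 0.4]
labels = [1,   0,   1,   0,   0]
print(auc(scores, labels))  # 1.0: every anomaly outranks every normal point
```

Synthetic AUC applies this idea to anomalies DataRobot synthesizes itself, which is why a model that scores well on it may still rank differently against your real labeled anomalies.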
Synthetic AUC is a good metric for identifying the best model(s) for your dataset; however, the anomalies it finds might be different from the actual anomalies in your data. For this article we are using an external test set. Select a model on the Leaderboard, navigate to the Predict tab, and drag and drop the test dataset into the Prediction Datasets section.
Once the test dataset is uploaded, we go to Forecast settings.
Figure 10. Upload External Test Set
Select Forecast Range Predictions, check the Use known anomalies column to generate scores checkbox, and select the name of the column to include when generating scores. Now we can click Compute Predictions.
Once the scores are computed, go to Menu and select Show External Test Column; a new column with that information appears in the Leaderboard. We can compute external test set scores for the remaining blueprints in the same way. Once they finish, the Leaderboard will look similar to Figure 11.
Figure 11. Leaderboard with External Test Set column
Anomaly Over Time
One of the most popular visualizations for a time series anomaly detection project is the Anomaly Over Time chart (under the Evaluate tab). Here we can see the anomaly scores plotted over time. We can also change the backtest so that we can evaluate the anomaly scores across the validation periods.
Figure 12. Anomaly Over Time
On the Anomaly Assessment tab (under the Evaluate tab), we can see which features contribute to the anomaly score via SHAP values. This is incredibly useful for gaining additional insight into your data and for explaining high scores.
Figure 13. Anomaly Assessment
On the ROC Curve tab (under the Evaluate tab) we can check how well the model separates anomalous from normal observations. In the Selection Summary box, you can find the F1 score, recall, precision, and other metrics. At the top right we have the well-known confusion matrix.
Now let’s examine the graphs at the bottom of the tab. The first graph on the left is the ROC curve. This is followed by the Prediction Distribution, where you can try out different probability thresholds for your target. Lastly, we have the Cumulative Charts (gain and lift charts), which tell you how much more effective this model is than a naive method.
Figure 14. ROC Curve tab
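The Selection Summary metrics all follow from the confusion matrix at a chosen threshold. Here is the arithmetic with made-up counts (not values from this project):

```python
# Precision, recall, and F1 derived from a confusion matrix.
# tp/fp/fn/tn counts are hypothetical, chosen to make the arithmetic clear.

tp, fp, fn, tn = 40, 10, 20, 930   # at some chosen probability threshold

precision = tp / (tp + fp)          # of flagged points, how many were anomalies
recall    = tp / (tp + fn)          # of true anomalies, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, round(recall, 3), round(f1, 3))
```

Moving the probability threshold in the Prediction Distribution chart trades these off: a lower threshold flags more points, raising recall at the cost of precision.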
In Figure 15 you can see the relative impact of each feature on this model, including the derived features (Understand > Feature Impact tab).
Figure 15. Feature Impact
Here we can see how changes to the value of each feature change model predictions (Understand > Feature Effects tab). Figure 16 shows that as motor_2_rpm (actual) increases or decreases, the anomaly score increases.
Figure 16. Feature Effects
Prediction Explanations (from the Understand tab) tell you why your model assigned a value to a specific observation (Figure 17).
Figure 17. Prediction Explanations
Now that we have built and selected our anomaly detection model, we want to get predictions. There are three ways to get time series predictions from DataRobot.
The first is the simplest: use the GUI to drag-and-drop a prediction dataset (Figure 18). This method is typically used for testing, or for small, ad-hoc forecasting projects that don’t require frequent predictions.
Figure 18. Predictions, drag and drop
The second method is to create a deployment. This connects the model to a dedicated prediction server, creates a dedicated deployment object, and exposes a REST endpoint so you can request predictions via the API.
Figure 19. Deploy model to prediction server
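Requesting predictions from a deployment is an authenticated HTTP POST. The sketch below builds such a request with the standard library but does not send it; the host, deployment ID, token, key, and the example row are all placeholders, and you should take the exact URL and headers from your own deployment's integration snippet in DataRobot.

```python
# Sketch of a request to a deployed model's REST endpoint (not sent here).
# Host, deployment ID, credentials, and the data row are placeholders.
import json
import urllib.request

host = "https://example.orm.datarobot.com"   # hypothetical prediction server
deployment_id = "YOUR_DEPLOYMENT_ID"
url = f"{host}/predApi/v1.0/deployments/{deployment_id}/predictions"

rows = [{"timestamp": "2021-03-15T10:00:00", "motor_2_rpm": 1480.0}]  # made up

req = urllib.request.Request(
    url,
    data=json.dumps(rows).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_TOKEN",
        "DataRobot-Key": "YOUR_DATAROBOT_KEY",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send the request and return the scores.
```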
The third method is to deploy your model via Docker. This lets you put the model closer to the data to reduce latency, and to scale the scoring workload as needed, as shown in Figure 20.