Time Series Anomaly Detection


In this article we discuss time series anomaly detection and show how to use DataRobot Automated Time Series to train a model that detects anomalies in time series data.

This use case identifies when a motor is about to experience a failure. The dataset (Figure 1) contains various IoT sensor readings from six different motors inside a manufacturing plant. Our goal is to build an anomaly detection model. Through our testing data, we'll be able to use some labeled anomalies to help select and verify the best model.

Figure 1. Dataset
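DataRobot builds the anomaly detection model automatically, but the core idea, scoring each observation by how unusual it looks, can be sketched with scikit-learn's IsolationForest. The sensor readings below are made up for illustration; they are not the motor dataset from this article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Simulated motor RPM readings: mostly normal values around 1500,
# plus three injected anomalous spikes (illustrative data only).
normal = rng.normal(1500, 20, size=(200, 1))
spikes = np.array([[2400.0], [600.0], [2600.0]])
X = np.vstack([normal, spikes])

model = IsolationForest(contamination=0.02, random_state=0).fit(X)
scores = -model.score_samples(X)  # higher score = more anomalous

# The injected spikes (rows 200-202) rank as the most anomalous
top3 = np.argsort(scores)[-3:]
print(sorted(top3.tolist()))  # [200, 201, 202]
```

An external test set with labeled anomalies, as used later in this article, plays the same role here that the injected spikes do: it lets you check that the highest scores land on the known anomalies.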

Uploading dataset and setting options

To create an anomaly detection model using DataRobot, you first need to upload the dataset into DataRobot (new project screen). After the dataset has been uploaded, you need to tell DataRobot that this is an unsupervised time series project. To do that, select No target? and Proceed in unsupervised mode.

Figure 2. Unsupervised modeling

Next we set up time aware modeling, choose our date/time feature, and then select Time Series Modeling.

Figure 3. Set up time aware modeling

We need to tell DataRobot how far into the past to go when creating lag features and rolling statistics. For this example we will use the default feature derivation window of 120 minutes to 0.

Figure 4. Feature Derivation window for time aware modeling
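To make the feature derivation window concrete, here is a small pandas sketch of the kinds of features it produces: each row can draw on values from up to 120 minutes in its past. The column names and readings are hypothetical; DataRobot derives far more features than these two.

```python
import pandas as pd

# Hypothetical one-minute sensor readings (illustrative values)
ts = pd.DataFrame(
    {"rpm": [1500, 1510, 1490, 1505, 1495]},
    index=pd.date_range("2024-01-01 00:00", periods=5, freq="min"),
)

# Two examples of derived features built from past values only:
ts["rpm_lag1"] = ts["rpm"].shift(1)                 # value 1 minute ago
ts["rpm_roll_mean3"] = ts["rpm"].rolling(3).mean()  # 3-minute rolling mean

print(ts.loc["2024-01-01 00:04", "rpm_lag1"])  # 1505.0
```

Note that both features look strictly backward in time, which is what keeps the derived dataset free of leakage from the future.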

Automated Time Series has a number of modeling options that can be configured. We'll look at the most commonly used options. 


With time series, we can’t just randomly sample data into partitions. The correct approach is called backtesting, and DataRobot does this automatically. Backtesting ensures we train on historical data and validate on recent data, and then repeat that multiple times to ensure we have a stable model. You can adjust the validation periods and the number of backtests to suit your needs. 

To see the backtesting settings, navigate to Show Advanced Options > Partitioning > Date/Time. (For this article, we left the defaults for the dataset, as shown in Figure 5.)

Figure 5. Backtesting and validation length options
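The backtesting idea, train on the past, validate on the period that follows, then slide forward and repeat, can be illustrated with scikit-learn's TimeSeriesSplit. This is only a sketch of the partitioning scheme, not DataRobot's actual implementation.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 sequential observations; 3 backtests, each validating on the
# 2 observations that immediately follow its training window.
X = np.arange(12).reshape(-1, 1)
splits = [
    (train.tolist(), val.tolist())
    for train, val in TimeSeriesSplit(n_splits=3, test_size=2).split(X)
]
for i, (train, val) in enumerate(splits, 1):
    print(f"Backtest {i}: train up to obs {train[-1]}, validate on {val}")
```

Each backtest's validation window sits strictly after its training data, so no fold ever "sees the future", which is exactly why random sampling is wrong for time series.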

Event Calendar

DataRobot also allows you to provide an event calendar, which it uses to generate forward-looking features so that the model can better capture special events. A calendar file consists of two fields: the date and the name of the event (Figure 6).

Figure 6. Event Calendar

To add an event calendar, select Time Series (under Advanced Options), scroll down to Calendar of holidays and special events, and drag and drop your calendar file as shown in Figure 7.

Figure 7. Adding the event calendar
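A calendar file is easy to generate programmatically. The sketch below writes a minimal two-field CSV; the column names and events are illustrative, since the article only specifies that the file needs a date and an event name.

```python
import pandas as pd

# Minimal event calendar: one date column, one event-name column
# (column names here are hypothetical examples)
calendar = pd.DataFrame(
    {
        "Date": ["2024-07-04", "2024-09-02", "2024-11-28"],
        "Name": ["Independence Day", "Labor Day", "Thanksgiving"],
    }
)
calendar.to_csv("event_calendar.csv", index=False)
print(calendar.shape)  # (3, 2): three events, two fields
```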

There are many more options we could experiment with, but for now this is enough to get started.


Once we hit Start, DataRobot takes the original features we gave it (74 for our dataset) and creates hundreds of derived features for the numeric, categorical, and text variables. For the dataset used in this article, Automated Time Series created 456 new time series features, as shown in Figure 8.

Figure 8. DataRobot created many derived features from the original features

After Autopilot completes, we can examine the results on the Leaderboard (Models tab) and evaluate the top-performing model across all backtests.

Leaderboard and Synthetic AUC

The Leaderboard sorts the models by Synthetic AUC. This metric enables you to evaluate models even when you don't have an external test set that identifies the anomalies.

Figure 9. Leaderboard sorted by Synthetic AUC

Synthetic AUC is a good metric for identifying the best model(s) for your dataset; however, the anomalies it finds might differ from the actual anomalies in your data. For this article we are using an external test set. We select a model on the Leaderboard, navigate to the Predict tab, and drag and drop that dataset into the Prediction Datasets section.

Once the test dataset is uploaded, we go to Forecast settings.

Figure 10. Upload External Test Set

Select Forecast Range Predictions, check the Use known anomalies column to generate scores checkbox, and select the name of the column to include when generating scores. Now we can click Compute Predictions.

Once the scores are computed, go to Menu and select Show External Test Column; a new column with that information appears in the Leaderboard. We can then compute the external test set scores for the other blueprints. Once they are finished, the Leaderboard will look similar to Figure 11.

Figure 11. Leaderboard with External Test Set column
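Conceptually, the external test set score is an AUC computed from the model's anomaly scores against the known anomaly labels. The scikit-learn sketch below shows that computation on made-up scores and labels (Synthetic AUC itself is DataRobot's internal metric and is computed differently).

```python
from sklearn.metrics import roc_auc_score

# Hypothetical anomaly scores and the known labels from an
# external test set (1 = labeled anomaly); illustrative only.
scores = [0.05, 0.10, 0.92, 0.08, 0.88, 0.12]
labels = [0,    0,    1,    0,    1,    0]

auc = roc_auc_score(labels, scores)
print(auc)  # 1.0: both labeled anomalies outscore every normal row
```

An AUC of 1.0 means the model ranks every labeled anomaly above every normal observation; 0.5 would mean the ranking is no better than chance.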

Anomaly Over Time

One of the most popular visualizations for a time series anomaly detection project is the Anomaly Over Time chart (under the Evaluate tab). Here we can see the anomaly scores plotted over time. We can also change the backtest so that we can evaluate the anomaly scores across the validation periods. 

Figure 12. Anomaly Over Time
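As a rough intuition for what an anomaly-over-time series looks like, the sketch below computes a rolling z-score on simulated RPM readings, a simple stand-in for a model's anomaly score, and confirms that an injected spike produces the highest score. The data and the scoring rule are illustrative, not DataRobot's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=100, freq="min")
rpm = pd.Series(rng.normal(1500, 10, size=100), index=idx)
rpm.iloc[70] = 1700  # inject an obvious anomaly

# Rolling |z-score| as a simple anomaly score plotted over time
roll = rpm.rolling(30)
score = ((rpm - roll.mean()) / roll.std()).abs()
print(score.idxmax() == idx[70])  # True: the spike scores highest
```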

Anomaly Assessment

On the Anomaly Assessment tab (under the Evaluate tab), we can see which features are contributing to the anomaly score via the SHAP values. This is incredibly useful for gaining additional insight into your data and for explaining high scores.

Figure 13. Anomaly Assessment

ROC Curve

On the ROC Curve tab (under the Evaluate tab), we can check how well the prediction distribution separates the two classes. In the Selection Summary box, you can find the F1 score, recall, precision, and other metrics. At the top right is the well-known confusion matrix.

Now let's examine the graphs at the bottom of the tab. The first graph on the left is the ROC curve. This is followed by the Prediction Distribution, where you can adjust and try out different probability thresholds for your target. Lastly, we have the Cumulative Charts (gain and lift charts), which tell you how much your effectiveness increases by using this model instead of a naive method.

Figure 14. ROC Curve tab
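The threshold-dependent metrics on this tab can be reproduced by hand: pick a threshold, binarize the scores, and compute the confusion matrix. The scores and labels below are hypothetical.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical anomaly scores and known labels (1 = anomaly)
scores = [0.05, 0.10, 0.92, 0.08, 0.88, 0.45]
labels = [0,    0,    1,    0,    1,    0]

# Binarize at a chosen probability threshold
threshold = 0.5
preds = [int(s >= threshold) for s in scores]

tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
print(tp, fp, fn, tn)                    # 2 0 0 4
print(precision_score(labels, preds))    # 1.0
print(recall_score(labels, preds))       # 1.0
```

Raising the threshold trades recall for precision (fewer flagged rows, but each flag is more trustworthy); lowering it does the opposite. That trade-off is exactly what the Prediction Distribution chart lets you explore interactively.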

Feature Impact 

In Figure 15 you can see the relative impact of each feature on this model, including the derived features (Understand > Feature Impact tab).

Figure 15. Feature Impact

Feature Effects 

Here we can see how changes to the value of each feature change model predictions (Understand > Feature Effects tab). Figure 16 shows that as motor_2_rpm (actual) increases or decreases, the anomaly score increases. 

Figure 16. Feature Effects

Prediction Explanations 

Prediction Explanations (from the Understand tab) tell you why your model assigned a value to a specific observation (Figure 17). 

Figure 17. Prediction Explanations


Now that we have built and selected our anomaly detection model, we want to get predictions. There are three ways to get time series predictions from DataRobot.

The first is the simplest: use the GUI to drag-and-drop a prediction dataset (Figure 18). This method is typically used for testing, or for small, ad-hoc forecasting projects that don’t require frequent predictions.

Figure 18. Predictions, drag and drop

The second method is to create a deployment, which exposes a REST endpoint so that you can request predictions via API. This connects the model to a dedicated prediction server and creates a deployment object.

Figure 19. Deploy model to prediction server

The third method is to deploy your model via Docker. This allows you to put the model closer to the data to reduce latency, as well as scale scoring as needed, as shown in Figure 20.

Figure 20. Portable Prediction Server

If you want to try this out for yourself, go to DataRobot University and register for the Time Series Anomaly Detection Lab.

More Information

Check out these resources for more information on the various features we discussed here.

If you’re a licensed DataRobot customer, search the in-app Platform Documentation for Unsupervised learning (anomaly detection) and Time series modeling.
