In this tutorial, we will learn how to use DataRobot’s unsupervised learning capabilities to train a model that can detect transactions tied to money laundering. If you want to see how to reproduce the same outcome using the Python API, click here.
Use Case Overview
Money laundering is the act of making illicitly obtained funds appear legitimate. It is estimated that the money laundered each year represents 2–5% of global GDP (source), or around 800 billion to 2 trillion dollars.
Detecting transactions that are tied to money laundering is not trivial. Even with modern detection systems, most money laundering transactions go undetected. One realistic way to identify cases that would otherwise go uncaught is to train an unsupervised model that detects anomalies and flags transactions for further investigation.
For this tutorial we are going to use two distinct CSV files:
aml_train is an unlabeled dataset of alarms raised by a transactional risk engine. The team that reviews these alarms is too small to investigate all of them.
aml_test is a smaller, labeled dataset containing the alarms the team did investigate. Its target column, SAR (Suspicious Activity Report), indicates whether a report was generated, that is, whether the transaction was judged to be a money laundering attempt. For more information on SAR, visit Wikipedia.
The rest of the features describe the characteristics of the transaction and provide some information about the person making the transaction.
If we tried to train a supervised model to detect money laundering using only the aml_test dataset, it would likely not be very accurate, because we only have a small sample of labeled observations. Instead, our approach will be to train unsupervised anomaly detection models on the larger dataset and see whether, by identifying anomalous values, we can actually detect money laundering attempts.
Figure 1. Data
Starting the Project
To initiate an unsupervised modeling project, first upload the aml_train.csv dataset to the DataRobot platform.
Figure 2. Upload Data
Click No target? (which indicates unsupervised mode) and then click Proceed to unsupervised mode (Figures 3 and 4).
Figure 3. No Target? button
Figure 4. Proceed to Unsupervised Mode button
Finally, click Start to initiate Quick Autopilot modeling mode.
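For readers who prefer scripting, the same three steps (upload, unsupervised mode, Quick Autopilot) can be sketched with the `datarobot` Python client. This is a minimal sketch rather than the tutorial's official script: it assumes a recent client version in which `Project.analyze_and_model` accepts `unsupervised_mode=True` (older versions exposed the same option on `Project.set_target`), and the function name, project name, and the optional `dr` parameter (used only to allow a stand-in during testing) are hypothetical.

```python
def start_unsupervised_project(csv_path, project_name="AML anomaly detection", dr=None):
    """Upload csv_path and run Quick Autopilot in unsupervised mode (sketch)."""
    if dr is None:
        # Imported lazily so the sketch can be read without the package installed.
        # Assumes credentials were already configured, for example with
        # dr.Client(token=..., endpoint=...) or a client config file.
        import datarobot as dr

    # Upload the unlabeled training data (the "Upload Data" step).
    project = dr.Project.create(sourcedata=csv_path, project_name=project_name)

    # No target is passed: this mirrors "No target?" followed by
    # "Proceed to unsupervised mode" and "Start" in the UI.
    project.analyze_and_model(
        unsupervised_mode=True,
        mode=dr.AUTOPILOT_MODE.QUICK,
    )

    # Block until Autopilot finishes, then fetch the Leaderboard models.
    project.wait_for_autopilot()
    return project, project.get_models()


# Example call (requires a DataRobot account and configured credentials):
# project, models = start_unsupervised_project("aml_train.csv")
```

The returned list corresponds to the models shown on the Leaderboard, so it can be inspected in code instead of in the UI.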
Once modeling is complete, you should be able to see a view with multiple anomaly detection blueprints trained on the aml_train dataset (Figure 5).
Figure 5. Model Leaderboard
The models here are ordered by the Synthetic AUC metric. This tutorial does not cover Synthetic AUC in detail (you can get more information here). In brief, DataRobot generates two synthetic datasets out of the validation sample: one that is more normal and one that is more anomalous. The datasets are labeled accordingly, and the model's predictions on them are used to calculate the Synthetic AUC.
To take a look at the results, navigate to the Insights > Anomaly Detection tab (Figure 6).
Figure 6. Anomaly Detection Insight
Figure 7. Anomaly Scores
From the Model dropdown (Figure 7) you can change the model that the current insight is using. The best practice is to use a small subset of labeled data to gauge how accurately the model detects anomalies. While the Synthetic AUC metric and anomalyScore provide some indication of anomalies, the models may be flagging anomalies that are unrelated to your business problem. The best way to ensure the anomalies align with your business problem is to test them against a few known anomalies. This is what we will do in the next step.
Uploading a Labeled Testing Set
Since we want to evaluate the actual accuracy of our models, we need to upload the smaller labeled testing set, aml_test.csv, into the platform. Navigate to the Predict > Make Predictions tab of any DataRobot model and upload the dataset (Figure 8).
Figure 8. Testing Data Upload
Once you have uploaded the data, click Run external test and set SAR as your Actuals column.
Figure 9. Run External Test
This process will kick off real accuracy calculations. To see them, select Menu > Show External Test Column.
Figure 10. Show External Test Column
Now you should be able to see a slightly changed Leaderboard with the actual AUC being calculated instead of the Synthetic AUC.
Figure 11. Updated Leaderboard
To finalize this process of identifying the best model, click Run (in the External test column). DataRobot then calculates the actual accuracy of the model on the external test set.
Evaluating the Models
When the calculations complete (which may take several minutes), you will see the actual AUC for all of the unsupervised models that you trained (Figure 12).
Figure 12. Final Leaderboard
The AUC values here range from 0.5 to 0.82, which tells us that some anomaly detection models are better than others at detecting our target outcome (an AUC of 0.5 is no better than random ranking). It’s important to note that the best model is not the model that had the best Synthetic AUC.
Finally, now that the labeled dataset is uploaded, the ROC Curve tab is available for inspection. From this page (Figure 13), you can see the optimal prediction threshold and exactly how your models are performing.
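To make the metric concrete, here is a small, dependency-free sketch of how an AUC can be computed from anomaly scores and SAR labels. It uses the standard pairwise definition (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one). The labels and scores are invented for illustration and are not taken from aml_test.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The pairwise loop is quadratic,
    which is fine for a small labeled test set.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labeled alarms: SAR = 1 means a report was filed.
sar = [0, 0, 1, 0, 1, 0]
anomaly_score = [0.10, 0.35, 0.80, 0.20, 0.30, 0.05]

auc = roc_auc(sar, anomaly_score)
print(round(auc, 3))  # 0.875: 7 of the 8 positive/negative pairs are ordered correctly
```

A model that scored transactions at random would land near 0.5 on this measure, which is why values close to 0.5 on the Leaderboard indicate a model that is not useful for this target.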
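The text does not say how the optimal threshold is chosen, so the sketch below simply assumes one common criterion: pick the anomaly score cutoff that maximizes the F1 score on the labeled set. Both the criterion and the sample data are assumptions made for illustration.

```python
def best_f1_threshold(labels, scores):
    """Return (threshold, f1) for the cutoff that maximizes F1.

    A transaction is flagged when its score is >= threshold. Every distinct
    score is tried as a candidate; on ties the lowest threshold is kept.
    """
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Hypothetical labeled alarms: SAR = 1 means a report was filed.
sar = [0, 0, 1, 0, 1, 0]
anomaly_score = [0.10, 0.35, 0.80, 0.20, 0.30, 0.05]

threshold, f1 = best_f1_threshold(sar, anomaly_score)
print(threshold, round(f1, 2))  # 0.3 0.8
```

Other criteria (for example, fixing a minimum recall and minimizing false alarms) would give a different cutoff, so the choice should follow how costly a missed case is compared with an unnecessary investigation.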
Figure 13. ROC Curve tab
The models trained in unsupervised learning mode can be deployed using the same methods as other (supervised) DataRobot models.
Note that the prediction returned is not a probability but an anomalyScore, which takes values from 0 to 1.
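Because the investigation team can only review a limited number of alarms, one way to use the scores from a deployed model is to rank transactions and pass on only as many as the team can handle. The sketch below assumes that a higher anomalyScore means a more anomalous transaction. The `transaction_id` field, the sample rows, and the `top_alerts` helper are hypothetical.

```python
def top_alerts(scored_transactions, capacity):
    """Return the `capacity` transactions with the highest anomalyScore."""
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(
        scored_transactions,
        key=lambda row: row["anomalyScore"],
        reverse=True,  # assumption: higher score = more anomalous
    )
    return ranked[:capacity]


# Hypothetical scored output, one dict per transaction.
alerts = [
    {"transaction_id": "T1", "anomalyScore": 0.12},
    {"transaction_id": "T2", "anomalyScore": 0.91},
    {"transaction_id": "T3", "anomalyScore": 0.47},
    {"transaction_id": "T4", "anomalyScore": 0.78},
]

queue = top_alerts(alerts, capacity=2)
print([row["transaction_id"] for row in queue])  # ['T2', 'T4']
```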