Anti-Money Laundering with Outlier Detection


In this tutorial, we are going to learn how to use DataRobot’s unsupervised learning capabilities to train a model that can detect transactions tied to money laundering. If you want to see how you can simulate the same outcome using the Python API, click here.

Use Case Overview

Money laundering is the act of trying to legitimize illicitly obtained funds. It is estimated that the money laundered each year represents 2–5% of global GDP (source), or around 800 billion to 2 trillion US dollars.

Detecting transactions that are tied to money laundering is not trivial; even with modern detection systems, most money laundering transactions go undetected. The only realistic way to expand coverage and identify uncaught cases is to train an unsupervised model that detects anomalies and flags suspicious transactions for further investigation.

Data

For this tutorial we are going to use two distinct CSV files:

  • aml_train.csv
  • aml_test.csv

aml_train.csv is an unlabeled dataset of alarms raised by a transactional risk engine. The team that looks into these alarms is too small to investigate all of them; the alarms the team does investigate are saved to aml_test.csv. There, the target column, SAR, tells us whether the specific transaction was a money laundering attempt. (SAR stands for Suspicious Activity Report; the column indicates whether such a report was generated.) For more information on SARs, visit Wikipedia.

The rest of the features describe the characteristics of the transaction and provide some information about the person making the transaction. 

Our Approach

If we tried to train a model to detect fraud based on the aml_test dataset, our model likely would not be very accurate as we only have a small sample of observations for training. Instead, our approach will be to train unsupervised anomaly detection models on the larger dataset and see if, by identifying anomalous values, we can actually detect money laundering attempts.
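DataRobot builds and trains the anomaly detection blueprints automatically, but the core idea can be illustrated outside the platform. The sketch below is a rough, hand-rolled stand-in (using scikit-learn's IsolationForest on simulated data, not the actual AML dataset) showing how an anomaly detector is fit on a large unlabeled set and produces a score in the 0–1 range, similar in spirit to DataRobot's anomalyScore:

```python
# Conceptual sketch of the approach, using scikit-learn's
# IsolationForest as a stand-in anomaly detector. The data below is
# simulated: mostly "normal" transactions plus a few injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))  # typical alarms
odd = rng.normal(loc=5.0, scale=1.0, size=(20, 4))       # unusual alarms
X_train = np.vstack([normal, odd])                        # no labels needed

# Fit on the large unlabeled set; no target column is required.
detector = IsolationForest(random_state=0).fit(X_train)

# score_samples is higher for normal points; negate so that a higher
# value means "more anomalous", then min-max scale to [0, 1].
raw = -detector.score_samples(X_train)
anomaly_score = (raw - raw.min()) / (raw.max() - raw.min())
print(anomaly_score.min(), anomaly_score.max())
```

With this framing, the most anomalous alarms (highest scores) are the candidates to route to investigators first.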

Figure 1. Data

Starting the Project

To initiate an unsupervised modeling project, first upload the aml_train.csv dataset to the DataRobot platform.

Figure 2. Upload Data

Click No target? (which indicates unsupervised mode) and then click Proceed to unsupervised mode (Figures 3 and 4).

Figure 3. No Target? button

Figure 4. Proceed to Unsupervised Mode button

Finally, click Start to initiate Quick Autopilot modeling mode.

Model Leaderboard

Once modeling is complete, you should see a view with multiple anomaly detection blueprints trained on the aml_train dataset (Figure 5).

Figure 5. Model Leaderboard

The models here are ordered by the Synthetic AUC metric. For the purposes of this tutorial we won't go into detail on how Synthetic AUC works (you can get more information here). In short, to calculate Synthetic AUC, DataRobot generates two synthetic datasets from the validation sample: one that is more normal and one that is more anomalous. The two datasets are labeled accordingly, the model scores them, and the AUC of those predictions is the Synthetic AUC.
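The exact construction DataRobot uses is not described here, but the idea can be sketched with simulated data. In the hedged example below, the validation rows themselves play the "more normal" set and a deliberately wider-spread sample plays the "more anomalous" set; the AUC of the detector's scores over the two labeled sets is the synthetic-AUC-style figure:

```python
# Hedged sketch of the Synthetic AUC idea on simulated data.
# This only illustrates the concept, not DataRobot's actual procedure.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_val = rng.normal(size=(500, 4))             # stand-in validation sample

# Synthetic "more normal" set: the validation rows (label 0).
# Synthetic "more anomalous" set: rows drawn with a much wider
# spread than the validation distribution (label 1).
X_anom = rng.normal(scale=3.0, size=(500, 4))

detector = IsolationForest(random_state=1).fit(X_val)
scores = -detector.score_samples(np.vstack([X_val, X_anom]))
labels = np.concatenate([np.zeros(500), np.ones(500)])
synthetic_auc = roc_auc_score(labels, scores)
print(round(synthetic_auc, 3))
```

A value well above 0.5 means the model separates the constructed "anomalous" rows from the "normal" ones, even though no real labels were used.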

To take a look at the results, navigate to the Insights > Anomaly Detection tab (Figure 6).

Figure 6. Anomaly Detection Insight

Figure 7. Anomaly Scores

From the Model dropdown (Figure 7) you can change the model that the current insight uses. A best practice is to use a small subset of labeled data to gauge how accurately the model detects anomalies. While the Synthetic AUC metric and the anomalyScore provide some indication of anomalies, they may be flagging anomalies that are unrelated to your business problem. The best way to ensure the anomalies align with your business problem is to test them against a few known anomalies, which is what we will do in the next step.
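The idea of gauging an unsupervised model against a small labeled subset can be sketched as follows. This is a simulated stand-in, not the actual AML data: the unlabeled training set mimics aml_train, and a small labeled set with a hypothetical sar array plays the role of aml_test's SAR column:

```python
# Sketch: scoring an unsupervised model against a small labeled set.
# All data here is simulated; 'sar' is a stand-in for the SAR target.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X_train = rng.normal(size=(2000, 5))                # unlabeled alarms
X_test = np.vstack([rng.normal(size=(180, 5)),      # investigated, SAR = 0
                    rng.normal(loc=4.0, size=(20, 5))])  # investigated, SAR = 1
sar = np.concatenate([np.zeros(180), np.ones(20)])

# Fit on the unlabeled data, then check how well the anomaly scores
# rank the known SAR cases above the rest.
model = IsolationForest(random_state=2).fit(X_train)
auc = roc_auc_score(sar, -model.score_samples(X_test))
print(round(auc, 3))
```

A high AUC here means the anomalies the model finds actually line up with the business problem (confirmed money laundering attempts), which is exactly what the external test in the next step measures inside DataRobot.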

Uploading a Labeled Testing Set

Since we want to evaluate the actual accuracy of our models, we need to upload the smaller, labeled testing set aml_test.csv to the platform. Navigate to the Predict > Make Predictions tab of any DataRobot model and upload the dataset (Figure 8).

Figure 8. Testing Data Upload

Once you have uploaded the data, click Run external test and set SAR as your Actuals column.

Figure 9. Run External Test

This process will kick off real accuracy calculations. To see them, select Menu > Show External Test Column.

Figure 10. Show External Test Column

Now you should see a slightly changed Leaderboard, with the actual AUC calculated instead of the Synthetic AUC.

Figure 11. Updated Leaderboard

To finalize this process of identifying the best model, click Run in the External test column; DataRobot then calculates the actual accuracy values.

Evaluating the Models

When the calculations complete (which may take several minutes), you will see the actual AUC for all of the unsupervised models that you trained (Figure 12). 

Figure 12. Final Leaderboard

We see the AUC values here range from 0.82 down to 0.5, which tells us that some anomaly detection models are better than others at detecting our target outcome. It is important to note that the best model is not the one that had the best Synthetic AUC.

Finally, now that the labeled dataset is uploaded, the ROC Curve tab is available for inspection. From this page (Figure 13), you can see the optimal prediction threshold and exactly how your models are performing.

Figure 13. ROC Curve tab
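One common way to pick an "optimal" threshold from an ROC curve is Youden's J statistic (the threshold that maximizes TPR minus FPR). The snippet below is a hedged sketch of that idea on hypothetical anomaly scores and SAR labels; DataRobot's own threshold recommendation may use a different criterion:

```python
# Sketch: choosing a prediction threshold from the ROC curve using
# Youden's J statistic (TPR - FPR). Scores and labels are hypothetical.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])          # SAR stand-in
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.35, 0.6,
                   0.55, 0.7, 0.8, 0.9])                    # anomaly scores

fpr, tpr, thresholds = roc_curve(labels, scores)
best = np.argmax(tpr - fpr)   # index of the threshold maximizing J
print(thresholds[best])
```

Transactions whose anomalyScore exceeds this threshold would then be routed to the investigation team.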

Model Deployment

The models trained in unsupervised learning mode can be deployed using the same methods as other (supervised) DataRobot models. 

The result is not a probability but an anomalyScore, which takes values from 0 to 1.
