Predicting financial delinquency using credit scoring data

Originally posted on 11/14/13

----------

Predicting future financial distress and understanding the factors that cause it are critical to how banks decide who can get financing and on what terms. Credit scoring algorithms, which predict the probability of default, are the primary method banks use to determine whether or not a loan should be granted to a given applicant.

In this Eureqa tutorial, we’ll examine how Eureqa can be used to predict whether somebody will experience financial distress in the next two years using anonymous credit-scoring data provided by Kaggle.com.

 


The original competition can be found here.

Let’s get started!

The training data

After downloading the training data from Kaggle.com, we can look at the variables and their values in Excel. At first glance, we can see that we’re going to be working with characteristics that are commonly used in assessing creditworthiness.

Predicting Financial Delinquency Using Credit Scoring Data 1.png

The full list of variables includes:

  • Serious delinquency in 2 years (SeriousDlqin2yrs)
  • Revolving Utilization Of Unsecured Lines
  • Age
  • Number Of Time 30–59 Days Past Due Not Worse
  • Debt Ratio
  • Monthly Income
  • Number Of Open Credit Lines And Loans
  • Number Of Times 90 Days Late
  • Number of Real Estate Loans Or Lines
  • Number Of Time 60–89 Days Past Due Not Worse
  • Number of Dependents

You might suspect that many of these variables interact to affect someone’s likelihood of delinquency, but how exactly do they interact, and what do they let us predict? We’ll use Eureqa to answer these questions.

Preparing the data for modeling

Eureqa includes several data preparation tools for your convenience; however, we’ll try working with the raw data, without any special pre-processing or preparation, just to get started. Once we complete our first attempt at modeling the data, we can go back and consider options like removing outliers and smoothing.

Predicting Financial Delinquency Using Credit Scoring Data 2.png

Predicting financial delinquency using credit scoring characteristics

We want to predict the variable SeriousDlqin2yrs, which is a variable containing 0’s for no delinquency and 1’s for serious delinquency. Since this variable has only values of 0 and 1, we’re going to use a special target expression that will provide a similar constraint on the resulting models. We recommend using the logistic function (see below) which squashes values to be between 0 and 1. We choose the logistic function (as opposed to a step function) because it provides a better search gradient. For more information, see our tutorial on modeling binary values.
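To make the choice of the logistic function concrete, here is a minimal Python sketch of the standard logistic function next to a step function. It is an illustration only: the function names are ours, and it does not reproduce Eureqa's target expression syntax, which appears in the screenshot. The logistic output changes smoothly with its input, which is what gives a search something to follow, while the step function is flat everywhere except at the threshold.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function: squashes any real number to between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that avoids overflow in exp(-x).
    z = math.exp(x)
    return z / (1.0 + z)


def step(x: float) -> float:
    """Hard threshold, shown only for comparison with the logistic function."""
    return 1.0 if x >= 0 else 0.0


# The logistic value moves gradually from near 0 to near 1;
# the step value jumps at 0 and carries no information about distance.
for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"x={x:+.1f}  logistic={logistic(x):.4f}  step={step(x):.0f}")
```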

Your target expression should now look something like the following:

Predicting Financial Delinquency Using Credit Scoring Data 3.png

We’re also going to make a few changes to the model building blocks based on some assumptions that can be made about the data. Since the data is not related to engineering and is not cyclical or seasonal in nature, we can go ahead and uncheck the ‘Sine’ and ‘Cosine’ building blocks. We also recommend enabling the ‘Logistic’ building block, which gives a model an easy way to threshold input variables or values.

Predicting Financial Delinquency Using Credit Scoring Data 4.png

Next, we’re going to enable row weighting on our target variable by clicking Row Weight and selecting 1/occurrences(SeriousDlqin2yrs). We do this because positive occurrences are sparse in the training data (roughly 10,000 out of 150,000 records, or about 1 in 15). Weighting each row by the inverse of how often its outcome occurs keeps the far more frequent non-delinquent cases from dominating the search.
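The effect of inverse-frequency weighting can be sketched in a few lines of Python. This assumes that occurrences(SeriousDlqin2yrs) counts how many rows share the same target value; that reading is our assumption, and the helper name inverse_frequency_weights is hypothetical. The toy labels are made up and only mirror the roughly 1-in-15 ratio described above.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each row by 1 / (number of rows that share its label)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Toy data (not the Kaggle file): 10 delinquent rows among 150, about 1 in 15.
labels = [1] * 10 + [0] * 140
weights = inverse_frequency_weights(labels)

# After weighting, each class contributes the same total weight.
pos_total = sum(w for w, y in zip(weights, labels) if y == 1)
neg_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(f"weight per positive row: {weights[0]:.4f}")
print(f"weight per negative row: {weights[-1]:.4f}")
print(f"class totals: positive={pos_total:.2f}, negative={neg_total:.2f}")
```

Under this assumption, a rare delinquent row counts for more than a common non-delinquent row, so a model cannot score well simply by predicting 0 for everyone.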

Predicting Financial Delinquency Using Credit Scoring Data 5.png

Now that we’ve set our target expression, selected the appropriate model building blocks, and enabled row weighting, we’re ready to start our search. From within Eureqa, select the Start Search tab and click the Run button.

The results

From the ‘View Results’ tab, we can get a digest view of all the solutions generated by Eureqa thus far. For this tutorial, we ran Eureqa on a 72-core private cloud for about five hours.

As with our predicting insurance claim payments tutorial, we’re going to judge the predictive accuracy of the solutions using Mean Absolute Error (MAE). This metric is the average size of the error, in either direction, that you can expect from the predictions generated by our models.

Looking at the solutions Eureqa has generated, we can see that the top four models offer similar predictive accuracy, with MAE ranging between 0.2245 and 0.2249, while differing substantially in complexity. Choosing the simplest of the four would cost about 0.2% in predictive accuracy while reducing the number of terms by nearly 50%. This trade-off is clearly illustrated in Eureqa’s built-in Pareto Front display.

Predicting Financial Delinquency Using Credit Scoring Data 6.png
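The accuracy-versus-complexity trade-off can be sketched in Python. Only the two MAE endpoints (0.2245 and 0.2249) come from this tutorial; the complexity scores, the other MAE values, and both helper functions are hypothetical and are not Eureqa's own implementation. The idea is to keep only models that no simpler model beats, then pick the simplest one whose error is within a chosen tolerance of the best.

```python
def pareto_front(models):
    """Return the (complexity, error) pairs not beaten by any simpler model."""
    front = []
    best_error = float("inf")
    for complexity, error in sorted(models):
        # Walking from simplest to most complex, keep a model only if it
        # improves on every simpler model seen so far.
        if error < best_error:
            front.append((complexity, error))
            best_error = error
    return front


def simplest_within(front, tolerance):
    """Pick the simplest front model whose error is within `tolerance`
    (as a relative fraction) of the lowest error on the front."""
    best = min(error for _, error in front)
    for complexity, error in front:  # front is ordered by complexity
        if error <= best * (1.0 + tolerance):
            return complexity, error


# Hypothetical (complexity, MAE) candidates for illustration.
candidates = [(10, 0.2400), (21, 0.2249), (28, 0.2247),
              (35, 0.2246), (38, 0.2260), (40, 0.2245)]

front = pareto_front(candidates)
choice = simplest_within(front, tolerance=0.002)  # allow about 0.2% more error
print("front:", front)
print("choice:", choice)
```

With these made-up numbers, the model with complexity 38 drops off the front because a simpler model already has a lower error, and the selection settles on the simplest of the four near-best models.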

Since the outputs and the predictions are nearly all 0’s or 1’s, we can interpret the mean absolute error statistic as the fraction of the time a model found by Eureqa will make an incorrect prediction. In the case of our most accurate solution, with an MAE of 0.2245, Eureqa could correctly predict whether or not someone would have financial distress 77.55% of the time.
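A small Python sketch shows why this interpretation holds. When both the actual values and the predictions are exactly 0 or 1, each absolute error is either 0 (correct) or 1 (wrong), so the mean absolute error equals the share of wrong predictions. The eight example rows are made up for illustration; only the 0.2245 figure comes from the results above.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up 0/1 outcomes and predictions: 2 of the 8 predictions are wrong.
actual = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 1, 0, 0, 0, 0, 1]

mae = mean_absolute_error(actual, predicted)
print(f"MAE = {mae:.4f}, share correct = {1 - mae:.4f}")

# The same arithmetic applied to the best model's reported MAE.
reported_mae = 0.2245
print(f"share correct for the best model = {1 - reported_mae:.4f}")
```

The interpretation is only approximate when a model outputs values between 0 and 1, which is why the text says the predictions are "nearly all" 0’s or 1’s.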

Even more interestingly, because the output of Eureqa is an analytical model, we can easily identify what characteristics are indicative of future financial delinquency. Our most accurate model includes Revolving Utilization Of Unsecured Lines, Number Of Times 90 Days Late, Number Of Time 60–89 Days Past Due Not Worse, and Number Of Time 30–59 Days Past Due Not Worse as the most important factors in determining future delinquency. Given these variables, other variables like age, monthly income, number of dependents, real estate loans, and debt ratio—while possibly important indirectly—do not significantly improve accuracy and are not used in the best models. The variables related to being overdue appear to drive nearly all delinquency outcomes.

Summary

In about five hours, we went from a raw dataset of credit scoring characteristics to an analytical model of financial delinquency that predicts outcomes correctly nearly 78% of the time. We also discovered that, in the best models, the variables related to overdue payments drive these outcomes.

For real-world applications, we would likely want to improve on the predictive accuracy of our results by more thoroughly preparing our sample data, adding new building blocks, letting Eureqa search for a longer period of time, and leveraging additional computational resources such as Amazon EC2 or a private cloud using a dedicated Eureqa server. Now that we have a model to work with, we can also use it almost anywhere, including software such as Excel, R, SAS, or MATLAB, to do additional analysis.

Ready to try for yourself? Go ahead and download the Eureqa project file to get started.

Comments
Blue LED

@tessgdavies, @BobF 

Hello,

I am a researcher at Thompson Rivers University, in Canada. Two years ago, I had a free trial of Eureqa, and my co-author (Barry Smith, York University, Canada) and I are currently finalizing the resulting paper. We showed that symbolic regression (as implemented in Eureqa) can be used to find a closed-form approximation to the cdf of the sum of two independent and identically distributed (i.i.d.) log-normal random variables. The study focused on the portion of the body of the distribution that contains 0.999 of the cdf (rather than on the tails of the distribution). For values of the standard deviation σ = 0.1, 0.2, 0.3, …, 1.9, 2.0, and for values of the mean µ = –1.5, –1.25, –1.0, …, 1.25, 1.5, this method allowed us to obtain a closed-form approximation with a maximum absolute error of less than 0.0001.

We have many more ideas for future papers involving Eureqa.  When I contacted your Sales department asking about the cost of Eureqa's license, I was told "Eureqa is now part of the DataRobot enterprise platform which does not offer individual licenses,  however I would recommend posting your inquiry in the Eureqa section of the DataRobot Community as the team might be able to assist you."

Is there anything that could be done to allow us to buy an individual license for Eureqa?

Thank you,

 

Stan

 

DataRobot Alumni

Hi Stan,  

Thanks for reaching out. 

I am connecting you with @JessLin, one of our Data Scientists, who may be able to assist you.

Thanks,

Bob.

Community Team

In the meantime: we've been able to point other community members to academic licenses. Perhaps that would be useful to you too? https://community.datarobot.com/t5/ai-ml-general-discussions/inquiry/m-p/873#M49
The site for academic licenses is:
 https://sites.fastspring.com/nutonian/product/eureqa-formulize-academic-7cwpUmNsnyNN.

DataRobot Alumni

Agreed, @JessLin suggests an academic license also: https://community.datarobot.com/t5/ai-ml-general-discussions/inquiry/m-p/873#M49.

Hope this helps.

Blue LED

Thank you!

Blue LED

@BobF, @JessLin How many computers does my academic license allow me to run Eureqa on?

Do you still offer a product called Eureqa Server?

Thank you for the clarification,

 

Stan

DataRobot Alumni

Hi Stan,

So we only allow one machine per academic license. We are no longer selling any standalone Eureqa products, but Eureqa is available as a blueprint within the DataRobot platform.

Hope this helps.

Thanks,

Bob.

Blue LED

@BobF 

Thank you for getting back to me.

Is there an academic license for DataRobot?  Would such a license allow me to use the functionality of Eureqa Server, where the code could run on more than one machine?

Thank you for the clarification,

 

Stan

 

DataRobot Alumni

Hi @StanMiles ,

I do not believe there is an academic license for DataRobot and @JessLin can confirm.

Thanks,

Bob.

Blue LED

@BobF @JessLin My research involves showing that Eureqa can achieve a better return volatility forecast than the standard GARCH model. Using historical 5-minute data for 25 stocks, for each stock I would like to simulate making one-day volatility forecasts every day for two months. Assuming there are 21 trading days per month, this means that I will need to run Eureqa 25*21 = 525 times.  I would also like to run 10 trials for each of the times above, so ideally I would run Eureqa a total of 5,250 times. 

Would it be possible to make an exception in this case and to allow me to run Eureqa on multiple computers for at most three months?  If I get your permission to run it on multiple computers, I believe I can get my Dean's permission to deploy Eureqa on the computers at several computer labs at my university.

Thank you for considering this request.

 

Stan

Version history: revision 7 of 7, last updated 12-16-2019 01:42 PM