cancel
Showing results for 
Search instead for 
Did you mean: 

Multivariate adaptive regression splines

Multivariate adaptive regression splines

Part of a series of educational articles about data science.

Multivariate adaptive regression splines (MARS,  http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines) is an algorithm for regression analysis. It is based on linear regression with the following differences:

  • it is a non-parametrics technique
  • it models non-linearities and interactions between variables automatically

MARS models build the following estimation for the resulting function:

$$\hat f(x) = \Sigma c_i B_i(x),$$

 where:

  • $c_i$ — coefficients
  • $B_i$ — basis functions

$B_i$ can take the following values:

  • $B_i$ = 1 — We need this for inteceptions
  • $B_i$ = \max(0, x - const)$ or $B_i = \max(0, const - x)$ — We need this for segments. This type of function is called "hinge."
  • $B_i$ = The product of two or more hinge (functions). We need this for non-linearities and interactions.

In this article, we demonstrate how MARS can be performed in Python using the package, pyearth.

To run this code for our demonstration, we need the following packages:

import urllib2  #need to read url
import sys  #need to read url
from bs4 import BeautifulSoup #need to read url
import pandas as pd #need for data preprocessing
import numpy as np #need for data preprocessing
from pyearth import Earth #need for MARS


Note: Not all of these packages are required for MARS. In fact, most are needed for other purposes, such as data transformations (as explained in the import comments).

For data, we're creating a new dataset from information in the page,  http://www.databasebaseball.com. (UPDATE: The new URL for this page is https://www.rotowire.com/.)

The new dataset we create contains information about seasons played by Red Sox and their results.

page = urllib2.urlopen('http://www.databasebaseball.com/teams/teampage.htm?franch=BOS').read()
soup = BeautifulSoup(page)
soup.prettify()

table = soup.findAll('table')

A = []

rows = soup.findAll('tr')    
for tr in rows:
    cols = tr.findAll('td')
    row = [ele.text.encode('latin1') for ele in cols]
    if len(row) < 10:
        row.append('')
    A.append(row)

del A[0:7]  #remove header

 

At this point we're going to transform our data to do some preprocessing. As is often the case, this step takes quite a bit of time and effort; this is important to keep in mind when you create your own machine learning projects.

The first step is to define the target variable. In our case, the target we select identifies if the Red Sox won.

print type(A)

from numpy  import asarray
X = asarray(A)
print type(X)


resp = (X[:,[9]] == '')
y = resp.astype(int)

Output:

<type 'list'>
<type 'numpy.ndarray'>

Now, we will do some data analysis to transform string variables to numeric. In addition, we'll parse string variables that contain number of wins and losses into two numeric columns.

to_parse = X[:,[2]].astype('str')

print "to_parse:", to_parse[6]  # use this row to determine the original data format

win = []
lose = []

for i in range(0, len(to_parse)):
    string = to_parse[i].tostring()
    parse = string.split(" - ",1)
    win.append(filter(None, parse[0].split('\x00')))
    lose.append(filter(None, parse[1].split('\x00')))
    
from numpy  import array
w = array(win, dtype=float)
l = array(lose, dtype=float)
x_0 = array(X[:,[0]], dtype=float)
x_3 = array(X[:,[3]], dtype=float)
x_4 = array(X[:,[4]], dtype=float)
x_5 = array(X[:,[5]], dtype=float)
x_6 = array(X[:,[6]], dtype=float)
x_7 = array(X[:,[7]], dtype=float)

print x_0.dtype
print x_3.dtype
print x_4.dtype
print x_5.dtype
print x_6.dtype
print x_7.dtype
print l.dtype

Output:

to_parse: ['95 - 67']
float64
float64
float64
float64
float64
float64
float64

Then, we stack all of the variables together in the format required by MARS algorithm:

variables = np.hstack((w, l, x_0, x_3, x_4, x_5, x_6, x_7))
#variables

The model by itself is just the following two lines of code:

model = Earth()
model.fit(variables,y)

Output:

Earth(penalty=None, min_search_points=None, endspan_alpha=None, check_every=None, max_terms=None, max_degree=None, minspan_alpha=None, thresh=None, minspan=None, endspan=None, allow_linear=None)

 At this point, we can see the main results of the model, including variables, their coefficients, and main metrics:

  • MSE
  • $R^2$
print model.summary()

 Output:

Earth Model
-------------------------------------
Basis Function  Pruned  Coefficient  
-------------------------------------
(Intercept)     No      1.73036      
h(x7-3)         No      0.691092     
h(3-x7)         No      -0.839318    
h(x2-1992)      No      -0.0402292   
h(1992-x2)      Yes     None         
h(x7-2)         No      -0.698699    
h(2-x7)         Yes     None         
x0              Yes     None         
x1              Yes     None         
-------------------------------------
MSE: 0.0270, GCV: 0.0378, RSQ: 0.8267, GRSQ: 0.7612

 

We can see that five features were included in the final model, and that the overall perfomance in quite good: $R^2 = 0.8267$ is a good indicator together with $MSE = 0.027$

We can compare this fit with OLS. Here we see that $R^2_{OLS} = 0.87$, which is even better then $R^2_{MARS}$; however, but the values are still fairly comparable.

In addition, we can see that $MSE_{OLS} = 0.80$ is much larger than $MSE_{MARS}$. Thus, overall the MARS model shows better performance than OLS.

import statsmodels.api as sm

model = sm.OLS(y.ravel(), variables)
results = model.fit()

print "MSE = ",results.mse_total
print "R-squared = ",results.rsquared

Output:

MSE = 0.807339449541
R-squared = 0.870987820528

 

As another check on the model's performance, we can determine how accurate it is for a particular threshold. The following example uses the threshold $0.4$.

resp_hat = model.predict(variables)

results = ((resp_hat > 0.4).astype(int).reshape(len(resp_hat),1) == y).astype(int)
accuracy = float(results.sum()) / len(results)

print "accuracy =", accuracy

Output:

accuracy = 0.963302752294


In this post, we showed how a MARS model can be easily applied in Python, and how this modeln be highly accurate.

For more information about MARS, see: http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

(The example in this article used the following packages: urllib2, sys, bs4, pandas, numpy, pyearth.)

Labels (2)
Contributors