Multivariate adaptive regression splines

**Part of a series of educational articles about data science.**

Multivariate adaptive regression splines (MARS, http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines) is an algorithm for regression analysis. It is based on linear regression with the following differences:

- it is a non-parametric technique
- it models non-linearities and interactions between variables automatically

MARS models build the following estimation for the resulting function:

$$\hat f(x) = \sum_i c_i B_i(x),$$

where:

- $c_i$: coefficients
- $B_i$: basis functions

Each $B_i$ can take one of the following forms:

- $B_i = 1$: the constant term, needed for the intercept.
- $B_i = \max(0, x - const)$ or $B_i = \max(0, const - x)$: needed for piecewise-linear segments. This type of function is called a "hinge."
- $B_i$ = the product of two or more hinge functions: needed for non-linearities and interactions.
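
As a minimal sketch of these three kinds of basis function, the plain-Python snippet below evaluates a small MARS-style sum. The knots, coefficients, and function names are made up for illustration and do not come from any fitted model.

```python
def hinge_pos(x, knot):
    """max(0, x - knot): zero left of the knot, linear to the right."""
    return max(0.0, x - knot)


def hinge_neg(x, knot):
    """max(0, knot - x): linear left of the knot, zero to the right."""
    return max(0.0, knot - x)


def f_hat(x1, x2):
    """Illustrative MARS-style model: a sum of c_i * B_i(x)."""
    return (
        2.0                                               # B = 1 (intercept)
        + 1.5 * hinge_pos(x1, 3.0)                        # single hinge
        - 0.5 * hinge_neg(x1, 3.0)                        # mirrored hinge
        + 0.25 * hinge_pos(x1, 3.0) * hinge_pos(x2, 1.0)  # interaction term
    )


print(f_hat(5.0, 2.0))  # 5.5
print(f_hat(1.0, 0.0))  # 1.0
```

Each hinge is zero on one side of its knot, so every term only affects the prediction in part of the input space. This is how a sum of simple terms can follow a piecewise-linear shape.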

In this article, we demonstrate how MARS can be performed in Python using the pyearth package.

To run this code for our demonstration, we need the following packages:

```
# Note: the code in this article targets Python 2 (urllib2, print statements)
import urllib2 #need to read url
import sys #not used directly in the code shown
from bs4 import BeautifulSoup #need to parse the HTML page
import pandas as pd #need for data preprocessing
import numpy as np #need for data preprocessing
from pyearth import Earth #need for MARS
```

*Note: Not all of these packages are required for MARS. In fact, most are needed for other purposes, such as data transformations (as explained in the import comments).*

For data, we create a new dataset from information on the page http://www.databasebaseball.com. (UPDATE: The new URL for this page is https://www.rotowire.com/.)

The new dataset contains information about the seasons played by the Red Sox and their results.

```
page = urllib2.urlopen('http://www.databasebaseball.com/teams/teampage.htm?franch=BOS').read()
soup = BeautifulSoup(page)
soup.prettify()
table = soup.findAll('table')
A = []
rows = soup.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    row = [ele.text.encode('latin1') for ele in cols]
    if len(row) < 10:
        row.append('') #pad short rows with an empty last column
    A.append(row)
del A[0:7] #remove header
```

Next, we preprocess the data. As is often the case, this step takes quite a bit of time and effort, which is worth keeping in mind when you plan your own machine learning projects.

The first step is to define the target variable. In our case, the target identifies whether the Red Sox won. In the code, it is 1 where column 9 of the scraped table is empty and 0 otherwise.

```
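print type(A)
from numpy import asarray
X = asarray(A) # convert the list of rows to an array of strings
print type(X)
resp = (X[:,[9]] == '') # True where column 9 is empty
y = resp.astype(int) # binary target as a column of 0/1 values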
```

Output:

```
<type 'list'>
<type 'numpy.ndarray'>
```
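
The padding rule in the scraping loop and the target definition work together: a row with fewer than 10 cells gets an empty string appended, and the target is 1 exactly where column 9 is empty. The sketch below shows this on made-up rows in plain Python 3 (unlike the Python 2 listings in this article). The cell values are placeholders, not data from the site.

```python
rows = [
    ["c%d" % i for i in range(9)],             # 9 cells: last column missing
    ["c%d" % i for i in range(9)] + ["note"],  # 10 cells: last column filled
]

padded = []
for row in rows:
    row = list(row)
    if len(row) < 10:
        row.append("")  # same padding rule as the scraping loop
    padded.append(row)

# Target: 1 where column 9 is empty, 0 otherwise
y = [int(row[9] == "") for row in padded]
print(y)  # [1, 0]
```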

Now we convert the string variables to numeric ones. In addition, we parse the string column that contains the number of wins and losses into two numeric columns.

```
to_parse = X[:,[2]].astype('str') # column 2 holds the "wins - losses" strings
print "to_parse:", to_parse[6] # use this row to determine the original data format
win = []
lose = []
for i in range(0, len(to_parse)):
    string = to_parse[i].tostring()
    parse = string.split(" - ",1) # split into the wins part and the losses part
    win.append(filter(None, parse[0].split('\x00'))) # drop null-byte padding
    lose.append(filter(None, parse[1].split('\x00')))
from numpy import array
w = array(win, dtype=float)
l = array(lose, dtype=float)
x_0 = array(X[:,[0]], dtype=float)
x_3 = array(X[:,[3]], dtype=float)
x_4 = array(X[:,[4]], dtype=float)
x_5 = array(X[:,[5]], dtype=float)
x_6 = array(X[:,[6]], dtype=float)
x_7 = array(X[:,[7]], dtype=float)
print x_0.dtype
print x_3.dtype
print x_4.dtype
print x_5.dtype
print x_6.dtype
print x_7.dtype
print l.dtype
```

Output:

```
to_parse: ['95 - 67']
float64
float64
float64
float64
float64
float64
float64
```
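
The win-loss strings can also be split more directly. The sketch below is written for Python 3 (unlike the Python 2 listings in this article). `parse_record` is a hypothetical helper, not part of the original code, and the second sample record is made up in the same format as `'95 - 67'`.

```python
def parse_record(record):
    """Split a 'wins - losses' string into two floats."""
    wins, losses = record.split(" - ", 1)
    return float(wins), float(losses)


records = ["95 - 67", "86 - 76"]  # sample values in the article's format
parsed = [parse_record(r) for r in records]
w = [p[0] for p in parsed]  # wins column
l = [p[1] for p in parsed]  # losses column
print(w, l)  # [95.0, 86.0] [67.0, 76.0]
```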

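Then, we stack all of the variables together in the format required by the MARS algorithm. The column order matters when reading the model output: pyearth names unlabeled inputs by position, so x0 and x1 are `w` and `l`, x2 is `x_0`, and x7 is `x_7`.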

```
variables = np.hstack((w, l, x_0, x_3, x_4, x_5, x_6, x_7))
#variables
```

The model by itself is just the following two lines of code:

```
model = Earth()
model.fit(variables,y)
```

Output:

`Earth(penalty=None, min_search_points=None, endspan_alpha=None, check_every=None, max_terms=None, max_degree=None, minspan_alpha=None, thresh=None, minspan=None, endspan=None, allow_linear=None)`

At this point, we can see the main results of the model, including variables, their coefficients, and main metrics:

- MSE
- $R^2$

`print model.summary()`

Output:

```
Earth Model
-------------------------------------
Basis Function  Pruned  Coefficient
-------------------------------------
(Intercept)     No      1.73036
h(x7-3)         No      0.691092
h(3-x7)         No      -0.839318
h(x2-1992)      No      -0.0402292
h(1992-x2)      Yes     None
h(x7-2)         No      -0.698699
h(2-x7)         Yes     None
x0              Yes     None
x1              Yes     None
-------------------------------------
MSE: 0.0270, GCV: 0.0378, RSQ: 0.8267, GRSQ: 0.7612
```
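
To show how the summary table maps back to the formula for $\hat f(x)$, the sketch below evaluates the unpruned terms by hand in plain Python 3. The coefficients and knots are copied from the table above. The function names and the input values are assumptions for illustration. Pruned terms are left out because they contribute nothing to the prediction.

```python
def h(z):
    """Hinge function h(z) = max(0, z), as printed in the summary."""
    return max(0.0, z)


def mars_predict(x2, x7):
    """Sum the unpruned basis functions times their coefficients."""
    return (
        1.73036                       # (Intercept)
        + 0.691092 * h(x7 - 3)        # h(x7-3)
        - 0.839318 * h(3 - x7)        # h(3-x7)
        - 0.0402292 * h(x2 - 1992)    # h(x2-1992)
        - 0.698699 * h(x7 - 2)        # h(x7-2)
    )


# Made-up inputs, only to show which hinges are active
print(round(mars_predict(1992, 3), 6))  # only h(x7-2) is non-zero here
print(round(mars_predict(2000, 1), 6))  # h(3-x7) and h(x2-1992) are non-zero
```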

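We can see that five basis functions (including the intercept) were kept in the final model, built from just two input variables, x2 and x7. The overall performance on the training data is quite good: $R^2 = 0.8267$ together with $MSE = 0.027$.

We can compare this fit with OLS. Here we see that $R^2_{OLS} = 0.87$, which is even higher than $R^2_{MARS}$, although the two values are fairly comparable. Note that the OLS model is fit without a constant column, in which case statsmodels reports the uncentered $R^2$, so the two figures are not measured on quite the same scale.

In addition, the OLS code reports $MSE_{OLS} = 0.80$, which is much larger than $MSE_{MARS}$. This figure comes from the `mse_total` attribute, which in statsmodels is the total mean square of the target rather than the residual error (the residual counterpart is `mse_resid`). It should therefore not be read as evidence that OLS fits worse. On these numbers the two models look broadly comparable, and a like-for-like check would compare residual MSE.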

```
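import statsmodels.api as sm
# use separate names so the fitted Earth model in `model` is not overwritten
ols_model = sm.OLS(y.ravel(), variables) # no constant column is added
ols_results = ols_model.fit()
print "MSE = ",ols_results.mse_total # total mean square; residual MSE is mse_resid
print "R-squared = ",ols_results.rsquared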
```

Output:

```
MSE = 0.807339449541
R-squared = 0.870987820528
```
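
To make the two metrics concrete, the sketch below computes the residual MSE and the centered $R^2$ by hand in plain Python 3. The toy data and the helper names are assumptions for illustration. They are not the article's data and do not reproduce the numbers above.

```python
def mse(y_true, y_pred):
    """Mean squared residual: sum((y - y_hat)^2) / n."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true, y_pred):
    """Centered R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1, 0, 1, 1]          # toy binary target
y_pred = [0.9, 0.2, 0.8, 0.7]  # toy predictions
print(round(mse(y_true, y_pred), 4))        # 0.045
print(round(r_squared(y_true, y_pred), 4))  # 0.76
```

Both quantities are built from the same residual sum of squares, so comparing two models on either metric is only meaningful when it is computed the same way for both.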

As another check on the MARS model's performance, we can measure how accurately it classifies the target at a particular threshold on the predicted value. The following example uses the threshold $0.4$; here `model` is the fitted `Earth` model.

```
resp_hat = model.predict(variables)
results = ((resp_hat > 0.4).astype(int).reshape(len(resp_hat),1) == y).astype(int)
accuracy = float(results.sum()) / len(results)
print "accuracy =", accuracy
```

Output:

`accuracy = 0.963302752294`
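
The thresholding step can be sketched without NumPy as follows, in plain Python 3. The scores and labels are made up for illustration, and `accuracy_at_threshold` is a hypothetical helper rather than part of pyearth.

```python
def accuracy_at_threshold(scores, labels, threshold=0.4):
    """Fraction of rows where (score > threshold) matches the 0/1 label."""
    predicted = [int(s > threshold) for s in scores]
    hits = sum(int(p == y) for p, y in zip(predicted, labels))
    return hits / len(labels)


scores = [0.95, 0.10, 0.45, 0.30]  # made-up model outputs
labels = [1, 0, 1, 1]              # made-up targets
print(accuracy_at_threshold(scores, labels))  # 0.75
```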

In this post, we showed how a MARS model can be easily applied in Python, and how this model can be highly accurate.

For more information about MARS, see: http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

(The example in this article used the following packages: urllib2, sys, bs4, pandas, numpy, pyearth, statsmodels.)

© 2020 DataRobot, Inc.