it models non-linearities and interactions between variables automatically
A MARS model estimates the target function as a weighted sum of basis functions:
$$\hat f(x) = \sum_i c_i B_i(x),$$
$c_i$: coefficients
$B_i$: basis functions
Each $B_i$ takes one of the following forms:
$B_i = 1$: a constant term. We need this for the intercept.
$B_i = \max(0, x - \text{const})$ or $B_i = \max(0, \text{const} - x)$: We need this for the piecewise-linear segments. This type of function is called a "hinge."
$B_i$ is the product of two or more hinge functions. We need this for non-linearities and interactions between variables.
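To make the formula concrete, here is a minimal sketch (written for Python 3) that builds an intercept and a pair of hinge functions by hand and solves for the coefficients $c_i$ with ordinary least squares. The knot location, the toy data, and the helper names `hinge` and `design_matrix` are all assumptions made for illustration. A real MARS fit also searches for the knots and prunes terms, which this sketch does not attempt.

```python
import numpy as np

def hinge(x, knot, direction=+1):
    """Return max(0, x - knot) for direction=+1, max(0, knot - x) for -1."""
    return np.maximum(0.0, direction * (x - knot))

def design_matrix(x, knot):
    """Columns: B_0 = 1 (intercept), right hinge, left hinge at one knot."""
    return np.column_stack([
        np.ones_like(x),
        hinge(x, knot, +1),
        hinge(x, knot, -1),
    ])

# Made-up toy data: a piecewise-linear function with a kink at x = 2.
x = np.linspace(0.0, 4.0, 9)
y = 1.0 + 3.0 * np.maximum(0.0, x - 2.0) + 0.5 * np.maximum(0.0, 2.0 - x)

# With the knot fixed (an assumption), the c_i are a least-squares solution.
B = design_matrix(x, knot=2.0)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)

# f_hat(x) = sum_i c_i * B_i(x)
y_hat = B @ coef
```

Because the toy data were generated from exactly these three basis functions, the fitted curve passes through every point. On real data the fit is only approximate, and choosing good knots is the hard part.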
In this article, we demonstrate how to fit a MARS model in Python using the pyearth package.
To run this code for our demonstration, we need the following packages:
import urllib2 #need to read url
import sys #need to read url
from bs4 import BeautifulSoup #need to read url
import pandas as pd #need for data preprocessing
import numpy as np #need for data preprocessing
from pyearth import Earth #need for MARS
Note: Not all of these packages are required for MARS. In fact, most are needed for other purposes, such as data transformations (as explained in the import comments).
The dataset we create contains information about the seasons played by the Red Sox and their results.
page = urllib2.urlopen('http://www.databasebaseball.com/teams/teampage.htm?franch=BOS').read()
soup = BeautifulSoup(page)
table = soup.findAll('table')
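A = []
rows = soup.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    row = [ele.text.encode('latin1') for ele in cols]
    if len(row) < 10:
        continue  # assumed: skip rows too short to be a season record
    A.append(row)  # assumed: collect the remaining rows
del A[0:7]  # remove header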
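The row-collecting step can be tried offline. The sketch below is written for Python 3 and uses the standard library's `html.parser` in place of BeautifulSoup, purely so that it runs without a network connection or extra packages. The HTML snippet, its values, the `RowCollector` class, and the `MIN_COLS` threshold are all made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up sample markup; the values are placeholders, not real records.
SAMPLE_HTML = """
<table>
  <tr><td>Team history</td></tr>
  <tr><td>2001</td><td>AL</td><td>10 - 5</td></tr>
  <tr><td>2002</td><td>AL</td><td>7 - 8</td></tr>
</table>
"""

MIN_COLS = 3  # hypothetical threshold; the article's code uses 10

collector = RowCollector()
collector.feed(SAMPLE_HTML)

# Keep only rows wide enough to be data records.
season_rows = [row for row in collector.rows if len(row) >= MIN_COLS]
```

Filtering on the number of cells is a cheap way to drop title and spacer rows, but it depends on the page layout staying the same, so it is worth printing a few rows before trusting the result.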
Next, we preprocess the data. As is often the case, this step takes quite a bit of time and effort, which is important to keep in mind when you plan your own machine learning projects.
The first step is to define the target variable. In our case, the target indicates whether the Red Sox won.
from numpy import asarray
X = asarray(A)
resp = (X[:,] == '')
y = resp.astype(int)
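The pattern used here, comparing a string column against a marker value and casting the resulting booleans to integers, can be sketched on a tiny made-up array. The column index `FLAG_COL`, the marker `"Y"`, and the data are hypothetical and do not reflect the layout of the scraped table.

```python
import numpy as np

# Made-up rows: (year, outcome flag). "Y" is a hypothetical marker value.
X_demo = np.asarray([
    ["2001", "Y"],
    ["2002", ""],
    ["2003", "Y"],
])

FLAG_COL = 1      # hypothetical column holding the outcome flag
FLAG_VALUE = "Y"  # hypothetical marker for a positive outcome

# Element-wise comparison gives a boolean array, one entry per row.
resp_demo = (X_demo[:, FLAG_COL] == FLAG_VALUE)

# Cast True/False to 1/0 to get a numeric target.
y_demo = resp_demo.astype(int)
```

The result is a 0/1 vector with one entry per season, which is the shape a regression or classification target needs.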
Next, we convert the string variables to numeric ones. In addition, we parse the string variable that contains the number of wins and losses into two numeric columns.
to_parse = X[:,].astype('str')
print "to_parse:", to_parse # use this row to determine the original data format
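win = []
lose = []
for i in range(0, len(to_parse)):
    string = to_parse[i].tostring()
    parse = string.split(" - ", 1)
    win.append(parse[0])   # assumed: text before " - " is the number of wins
    lose.append(parse[1])  # assumed: text after " - " is the number of losses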
from numpy import array
w = array(win, dtype=float)
l = array(lose, dtype=float)
x_0 = array(X[:,], dtype=float)
x_3 = array(X[:,], dtype=float)
x_4 = array(X[:,], dtype=float)
x_5 = array(X[:,], dtype=float)
x_6 = array(X[:,], dtype=float)
x_7 = array(X[:,], dtype=float)
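The win-loss parsing and float conversion above can be checked on a couple of made-up strings. This is a minimal Python 3 sketch that assumes each record has the form `"W - L"`. The helper name `parse_record` and the sample values are hypothetical.

```python
import numpy as np

def parse_record(text):
    """Split a 'W - L' string into (wins, losses) as floats."""
    wins, losses = text.split(" - ", 1)
    return float(wins), float(losses)

# Made-up records in the assumed "W - L" format.
record_strings = ["10 - 5", "7 - 8"]

pairs = [parse_record(s) for s in record_strings]

# One float array per column, mirroring the w and l arrays in the article.
w_demo = np.array([p[0] for p in pairs], dtype=float)
l_demo = np.array([p[1] for p in pairs], dtype=float)

# Stack the numeric columns side by side into a feature matrix.
features_demo = np.column_stack([w_demo, l_demo])
```

If a string does not contain the `" - "` separator, the tuple unpacking raises a `ValueError`, which is a useful early warning that the source format differs from what we assumed.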