Introduction to Eureqa

This article provides a quick tour of Eureqa models within DataRobot and their origin and purpose, and explains how to build a project within DataRobot that utilizes these models.


If you look at the Leaderboard (Figure 1), you will see many models based on open source libraries such as XGBoost, TensorFlow, and scikit-learn. However, some models on the Leaderboard are not open source.

Figure 1. Leaderboard showing a number of open source models

Eureqa models are denoted by a blue EQ symbol (Figure 2) and are not based on open source. 

Figure 2. Leaderboard showing a Eureqa model and its blueprint

Eureqa algorithms were developed by Nutonian, a company DataRobot acquired in 2017. Nutonian's idea was to develop a genetic algorithm that fits different analytic expressions to training data and returns a formula as the machine learning model. This is a fundamentally different approach from traditional supervised machine learning models such as tree-based, regression, or deep learning models. The approach has since been cited in over 800 peer-reviewed publications and used in applications ranging from finance to neuroscience.

In essence, Eureqa models are trained just like any other supervised machine learning algorithm. You provide the algorithm with labeled training data representing historic information and the algorithm will fit an analytic expression to that training data. Similar to other models on the Leaderboard, that expression is tested on both validation and holdout data.

Eureqa fits an analytic expression to your data in 3 steps:

  • Step 1: Eureqa takes in mathematical building blocks: simple operations such as addition, subtraction, and multiplication, or more complex functions such as natural logarithms and cosines.
  • Step 2: Eureqa conducts an evolutionary model search to find the combination of the given building blocks that best fits your data. Starting with a series of random expressions, the algorithm combines the best-fitting expressions with each other until it gradually finds a formula that fits your data.
  • Step 3: Eureqa applies a penalty in proportion to the complexity of the formula so as to prevent overfitting.
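
The three steps above can be pictured with a toy evolutionary search. This is a minimal Python sketch for illustration only, not Eureqa's actual algorithm: the tuple-based expression encoding, the node-count complexity measure, the penalty weight, and every function name below are assumptions made for this example.

```python
import math
import random

# Step 1: building blocks, each mapped to (arity, function). Illustrative set only.
BLOCKS = {
    "add": (2, lambda a, b: a + b),
    "sub": (2, lambda a, b: a - b),
    "mul": (2, lambda a, b: a * b),
    "cos": (1, math.cos),
}
LEAVES = ("x", "const")

def random_expr(rng, depth=2):
    """Build a random expression tree from the blocks, the input x, and constants."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.uniform(-2.0, 2.0))
    name = rng.choice(sorted(BLOCKS))
    arity = BLOCKS[name][0]
    return (name,) + tuple(random_expr(rng, depth - 1) for _ in range(arity))

def evaluate(expr, x):
    """Evaluate an expression tree at the input value x."""
    if expr[0] == "x":
        return x
    if expr[0] == "const":
        return expr[1]
    fn = BLOCKS[expr[0]][1]
    return fn(*(evaluate(child, x) for child in expr[1:]))

def complexity(expr):
    """Assumed complexity measure: the number of nodes in the tree."""
    if expr[0] in LEAVES:
        return 1
    return 1 + sum(complexity(child) for child in expr[1:])

def score(expr, data, penalty=0.01):
    """Step 3: mean squared error plus a penalty proportional to complexity."""
    mse = sum((evaluate(expr, x) - y) * (evaluate(expr, x) - y) for x, y in data) / len(data)
    return mse + penalty * complexity(expr)

def random_subtree(expr, rng):
    """Pick a random subtree of expr."""
    if expr[0] in LEAVES or rng.random() < 0.3:
        return expr
    return random_subtree(expr[rng.randrange(1, len(expr))], rng)

def crossover(a, b, rng):
    """Graft a random subtree of b into a random position of a."""
    if a[0] in LEAVES or rng.random() < 0.3:
        return random_subtree(b, rng)
    i = rng.randrange(1, len(a))
    return a[:i] + (crossover(a[i], b, rng),) + a[i + 1:]

def mutate(expr, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if expr[0] in LEAVES or rng.random() < 0.3:
        return random_expr(rng, depth=2)
    i = rng.randrange(1, len(expr))
    return expr[:i] + (mutate(expr[i], rng),) + expr[i + 1:]

def evolve(data, generations=30, pop_size=60, seed=0, max_nodes=25):
    """Step 2: keep the best-scoring expressions and recombine them each generation."""
    rng = random.Random(seed)
    population = [random_expr(rng) for _ in range(pop_size)]
    history = []  # best penalized score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda e: score(e, data))
        history.append(score(population[0], data))
        parents = population[: pop_size // 4]  # survivors are carried over unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if rng.random() < 0.3:
                child = mutate(child, rng)
            if complexity(child) <= max_nodes:  # discard oversized formulas
                children.append(child)
        population = parents + children
    best = min(population, key=lambda e: score(e, data))
    history.append(score(best, data))
    return best, history

# Synthetic data from y = x*x + 1, used only to exercise the sketch.
data = [(i / 10.0, (i / 10.0) * (i / 10.0) + 1.0) for i in range(-20, 21)]
best, history = evolve(data, generations=10, pop_size=30, seed=1)
print(best, history[-1])
```

Because the best expressions survive each generation unchanged, the best penalized score can only stay the same or improve over time. Whether the search recovers the exact generating formula depends on the seed, the number of generations, and the penalty weight.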


Figure 3. Eureqa builds models from training data in three steps

In order to demonstrate how to build a Eureqa model in DataRobot, we will predict the temperature of a motor given predictors like ambient temperature, speed and torque of the motor, etc. (Figure 4). Eureqa will fit an expression to the predictors and find the simplest analytical expression that predicts the target.

Figure 4. Data used to predict the temperature of a motor

As shown in Figure 5, DataRobot displays a number of Eureqa models on the Leaderboard after fitting the data described above.

Figure 5. DataRobot Eureqa models on the Leaderboard

On the Describe > Eureqa Models page, DataRobot also plots the formulas each model found in terms of their complexity (X-axis) and out-of-sample error (Y-axis). These formulas, shown in the Models by Error vs. Complexity graph, are the most accurate (lowest error) models with the least complexity (a measure of the size and mathematical complexity of the analytical model) that a given Eureqa model found. You can click any of the circles in that graph to see the corresponding analytic expression in the Selected Model Detail graph.

Figure 6. Sample Eureqa solution for the motor temperature prediction problem

In the motor example, the first and simplest model predicted the average motor temperature (leftmost red circle in the Models by Error vs. Complexity graph). Eureqa gradually fitted more complex formulas until it landed at the most complex model with the lowest error (rightmost green circle in Models by Error vs. Complexity graph). If you click on any of the circles in this graph, the Selected Model Detail graph shows the corresponding analytical expression and how well the data fits it. You can see that each model generated a simple, human-readable and human-interpretable analytical expression.
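
The trade-off plotted in the Models by Error vs. Complexity graph can be thought of as a Pareto front: a formula is kept only if no simpler formula has a lower error. The Python sketch below illustrates that idea under stated assumptions. The `pareto_front` function and the candidate list are hypothetical, the numbers are made up for the example, and none of this is taken from DataRobot's implementation.

```python
def pareto_front(models):
    """Return the models not dominated on (complexity, error), simplest first.

    Each model is a (complexity, error, formula) tuple. A model is kept only
    if its error is lower than that of every model of equal or lower complexity
    that precedes it in sorted order.
    """
    front = []
    best_error = float("inf")
    for complexity, error, formula in sorted(models, key=lambda m: (m[0], m[1])):
        if error < best_error:
            front.append((complexity, error, formula))
            best_error = error
    return front

# Made-up candidates for illustration; not real results from the motor dataset.
candidates = [
    (1, 4.0, "mean"),
    (3, 2.5, "a + b*ambient"),
    (5, 3.0, "a + b*speed"),
    (7, 1.0, "a + b*ambient + c*torque"),
    (7, 1.2, "a + b*ambient + c*speed"),
]
front = pareto_front(candidates)
for complexity, error, formula in front:
    print(complexity, error, formula)
```

In this picture, the leftmost point is the simplest formula (for example, a constant mean) and the rightmost point is the most complex formula with the lowest error.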

Let’s look at another example. We will model the acceleration of the lower bar of a double pendulum. The data covers the position, velocity, and acceleration of the ends of both the upper bar and the lower bar (as shown in Figure 7).

Figure 7. The motion of a double pendulum

To get an idea of just how complex this motion is, observe the video below showing the motion of the double pendulum.

As the pendulum swung back and forth, the camera was logging the location and movement of each point. We imported that data into DataRobot, selected the target variable (i.e., the acceleration of the lower bar), and fit a Eureqa model to the recorded data.

For this task, we also gave DataRobot Eureqa models a series of mathematical building blocks to use. The models searched for different possible combinations of predictors and different combinations of building blocks, to fit the acceleration of the lower bar. Given enough time to train, the Eureqa models were able to find the real physical formula for the acceleration of the lower bar of a double pendulum (Figure 8).

Figure 8. Eureqa’s analytical expression for the acceleration of the lower bar of the double pendulum

There are a number of advantages to using Eureqa models:

  • They return human-readable and interpretable analytic expressions, which are easily reviewed by subject matter experts. They also tend to deploy easily.
  • They are very good at feature selection because they are forced to reduce complexity during the model building process. For example, if the data had 20 different columns used to predict the target variable, the search for a simple expression would result in an expression that only uses the strongest predictors.
  • They work really well on small datasets and so are very popular with scientific researchers who gather data from physical experiments that don’t produce massive amounts of data. (In such situations, traditional supervised machine learning models may be unable to learn.)
  • They provide an easy way to incorporate domain knowledge. If you know the underlying relationship in the system that you're modeling, you can actually give Eureqa a hint, e.g., the formula of the heat transfer or how house prices work in a particular neighborhood. You can give Eureqa that known relationship as a building block or a starting point to learn from. Eureqa will build machine learning corrections from there.
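
To make the last point concrete, here is one way to picture "corrections on top of a known relationship". This is a minimal Python sketch and an assumption about the general idea, not a description of how Eureqa uses hints internally: the helper names `fit_line` and `fit_with_hint` are hypothetical, and the correction is restricted to a straight line fitted to the residuals of the known formula.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) * (x - mean_x) for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_with_hint(hint, xs, ys):
    """Start from a known relationship and learn a linear correction to it."""
    residuals = [y - hint(x) for x, y in zip(xs, ys)]
    slope, intercept = fit_line(xs, residuals)
    return lambda x: hint(x) + slope * x + intercept

# Synthetic example: the expert knows the quadratic term, the data adds a linear drift.
xs = [i / 2.0 for i in range(-10, 11)]
ys = [x * x + 0.5 * x + 2.0 for x in xs]
model = fit_with_hint(lambda x: x * x, xs, ys)
print(model(3.0))
```

The known formula carries most of the structure, so the part that has to be learned from data stays small, which is helpful when only a few observations are available.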

You can find a Jupyter notebook and dataset for this example in the DataRobot Community GitHub.

More Information

If you’re a licensed DataRobot customer, search the in-app documentation for Eureqa, then locate “Eureqa advanced tuning” for more information.

Comments
Jumper Wires

I wanted to experiment with a model of the signal:

tsig(t)=1.2*T(t-1)-4.6*T(t-3)

1)   I started a new project

a.   I uploaded a CSV file (two columns: the first is tsig and the second is T) as a local file

b.   I chose tsig as the target

c.   I picked Manual modeling

d.   Since I was using manual mode, I was directed to the repository, where I chose the following:
    d_1:   Eureqa Generalized Additive Model
    d_2:   EureqaRegressor (Default search 3000 Generations)
 
Neither Eureqa model produced a good expression (solution) for tsig as a function of T!
 
I did not change the building blocks, since the building block reference shows that there are no "History functions" (delay, sma, wma, mma, smm), which used to be a standard part of Eureqa for dealing with delayed signals.
 
Please tell me how to deal with delayed signals. (I know that this is a trivial task for Eureqa!!)
 
(I wanted to attach the mentioned csv file, but this messaging does not allow that option.  It is clear that the target tsig is a linear combination of delayed versions  of signal T)
 
Thank you.
 
Version history
Revision 12 of 12, last updated 05-15-2020 01:15 PM