Symbolic Regression from Scratch with Python


Introduction

DataRobot supports many different modeling approaches. One of the more interesting ones is symbolic regression, which is used in Eureqa models within DataRobot. The Eureqa algorithm uses evolutionary methods to search through many different types of models to find one with high accuracy and low complexity. (DataRobot customers: to learn more about this algorithm, search the in-app platform documentation for Eureqa Models.) 

What is symbolic regression?

Regression is the process of figuring out the relationship between some variable (called the “target”) and one or more other variables (usually called “predictors” or “features”). More traditional machine learning techniques, such as linear regression or decision trees, require that you first choose the general form of this relationship and then use statistical techniques to estimate a set of parameters for a model of that form. Symbolic regression is different in that it attempts to determine both the form of the model and its parameters1; the result might be a complex non-linear equation or even actual code in some programming language.

Creating a symbolic regression system

In this article we will learn how symbolic regression works by building a symbolic regression system from scratch. The only prerequisite is a basic working knowledge of Python; some experience with recursive functions will also be helpful, as we will be working with tree-based data structures. Of course, there are several freely available symbolic regression packages well suited to this task; however, building one ourselves not only satisfies curiosity about the algorithm’s inner workings but also gives insight into the various tunable parameters used in these systems.

Using genetic programming for symbolic regression

Perhaps the most common technique used in symbolic regression is genetic programming2 (GP). This process starts with an initial set of random programs (the first “generation”) and makes small changes to the best programs from that generation, and from each subsequent generation, to produce new programs. Repeating this process usually produces generations whose programs are, on average, more accurate than those of previous generations.
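
At a high level, the loop looks something like the following sketch (the helper names here are placeholders for pieces we will build concretely over the rest of the article):

# Sketch of the overall GP loop; random_program, compute_fitness, and
# get_offspring are placeholders for functions we will build below
def run_gp(random_program, compute_fitness, get_offspring,
           pop_size, generations):
    population = [random_program() for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [compute_fitness(p) for p in population]
        population = [get_offspring(population, fitness)
                      for _ in range(pop_size)]
    return population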

An example use case for symbolic regression

Let’s use the “Auto MPG” dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Auto+MPG) to understand symbolic regression. This dataset contains fuel consumption information for a variety of vehicles. We’ll attempt to find a formula that estimates “mpg” (miles per gallon) from other attributes of the vehicle, such as weight and horsepower.

First, let’s read in the data and do a little preprocessing:

 

import pandas as pd                                                                                  
                                                                                                     
# Read in data                                                                                       
data = pd.read_csv(                                                                                  
    "auto-mpg.data", delim_whitespace=True, na_values="?", header=None,                              
    names=["mpg", "cylinders", "displacement", "horsepower", "weight",                               
           "acceleration", "model year", "origin", "car name", ])                                    
# Drop string feature                                                                                
data.pop("car name")                                                                                 
# Replace N/A values in horsepower with the median
data["horsepower"] = data["horsepower"].fillna(data["horsepower"].median())
# Separate target feature                                                                            
target = data.pop("mpg")

 

The target we’re trying to predict for our experiment is “mpg.” We remove the non-numeric field “car name” and replace missing values in “horsepower” so that only non-missing numeric features remain for the equations we will be generating.
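
As a quick sanity check (an extra step, not part of the original walkthrough), we can confirm that only numeric, non-missing features remain; the values below assume the standard UCI copy of the file:

print(data.shape)
>>> (398, 7)
print(data.isna().sum().sum())
>>> 0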

Deciding on a programming “language”

One of the key decisions in symbolic regression is how to represent the programs we are creating3. This can be anything from a simple mathematical equation to a full-featured programming language with loops, functions, and other control structures. To keep things simple we are just going to stick to mathematical equations involving addition, subtraction, multiplication, division, and negation.

Program representation

Equations (or programs) in our system will be represented by trees since this will make genetic operations such as mutation and crossover easier than if they were in more traditional infix form4. For example, the equation (horsepower + cylinders) * weight can be viewed as a tree:

[Figure: the equation (horsepower + cylinders) * weight represented as a tree, with the multiplication at the root]

These trees will be implemented as Python dictionaries:

 

import operator                                                                                      
                                                                                                     
val1 = {"feature_name": "horsepower"}                                                                
val2 = {"feature_name": "cylinders"}                                                                 
val3 = {"feature_name": "weight"}                                                                    
node1 = {                                                                                            
    "func": operator.add,                                                                            
    "children": [val1, val2],                                                                        
    "format_str": "({} + {})",                                                                       
}                                                                                                    
program = {                                                                                          
    "func": operator.mul,                                                                            
    "children": [node1, val3],                                                                       
    "format_str": "({} * {})",                                                                       
}

 

There are two node types in our dictionary-based program representation: function nodes, which execute a function with some number of arguments taken from child nodes and return a value, and feature nodes, which simply return a value from our data. In addition, we have a “format_str” field to help produce a representation of the program in a human-readable form:

 

def render_prog(node):                                                                               
    if "children" not in node:                                                                       
        return node["feature_name"]                                                                  
    return node["format_str"].format(*[render_prog(c) for c in node["children"]])                    
                                                                                                     
                                                                                                     
print(render_prog(program))
>>> ((horsepower + cylinders) * weight)

 

Of course we will also need a way to run these programs, so let’s create a function called evaluate() which executes a program against a single row from the data:

 

def evaluate(node, row):                                                                             
    if "children" not in node:                                                                       
        return row[node["feature_name"]]                                                             
    return node["func"](*[evaluate(c, row) for c in node["children"]])        
 
print(evaluate(program, data.iloc[0]))
>>> 483552.0
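
Since evaluate() works on a single row, producing a full prediction vector means looping over the data. A small convenience wrapper (hypothetical; the training loop later in the article simply inlines this) could look like:

# Hypothetical helper: evaluate a program on every row of a DataFrame
# to get a full prediction vector
def predict(node, df):
    return [evaluate(node, row) for _, row in df.iterrows()]

predictions = predict(program, data)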

 

Creating the first generation of programs

Now that we have our basic infrastructure in place, we can start work on the actual GP implementation. The first thing we need to do is generate an initial population of random programs. These random programs will be composed of arbitrarily chosen functions and values taken from the data. For demonstration purposes, we’ll just limit available functions to basic arithmetic operations:

 

def safe_div(a, b):                                                                                  
    return a / b if b else a                                                                         
                                                                                                     
operations = (                                                                                       
    {"func": operator.add, "arg_count": 2, "format_str": "({} + {})"},                               
    {"func": operator.sub, "arg_count": 2, "format_str": "({} - {})"},                               
    {"func": operator.mul, "arg_count": 2, "format_str": "({} * {})"},                               
    {"func": safe_div, "arg_count": 2, "format_str": "({} / {})"},                                   
    {"func": operator.neg, "arg_count": 1, "format_str": "-({})"},                                   
)

 

safe_div() is necessary because randomly generated programs will inevitably attempt to divide by zero, a common hazard in genetic programming5; rather than raising an error, it falls back to returning the numerator.
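
For example (a quick demonstration, not part of the system itself):

print(safe_div(10.0, 2.0))
>>> 5.0
print(safe_div(10.0, 0.0))
>>> 10.0

Next, let’s add some code to create random program trees: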

 

from random import randint, random, seed                                                             
                                                                                                     
seed(0)                                                                                              
                                                                                                     
def random_prog(depth):                                                                              
    # favor adding function nodes near the tree root and 
    # leaf nodes as depth increases                           
    if randint(0, 10) >= depth * 2:                                                                  
        op = operations[randint(0, len(operations) - 1)]                                             
        return {                                                                                     
            "func": op["func"],                                                                      
            "children": [random_prog(depth + 1) for _ in range(op["arg_count"])],                    
            "format_str": op["format_str"],                                                          
        }                                                                                            
    else:                                                                                            
        return {"feature_name": data.columns[randint(0, data.shape[1] - 1)]}                         
                                                                                                     
                                                                                                     
POP_SIZE = 300                                                                                       
population = [random_prog(0) for _ in range(POP_SIZE)]             
 
print(render_prog(population[0]))   
>>> ((((acceleration - (model year * displacement)) * -(-(model year))) - (cylinders / acceleration)) / -(-(model year)))

 

We’ve just introduced our first genetic programming parameter, POP_SIZE, which determines how many programs are in each generation. The first generation is composed entirely of randomly generated programs. The random program generator creates program trees by starting from the root node and arbitrarily choosing to create either a function node or a feature node. In order to create relatively balanced trees of limited depth, the probability of creating a function node is highest near the tree root and decreases as we move deeper into the tree, where more feature nodes are created.
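
To see how this schedule behaves, we can print the probability that random_prog() creates a function node at each depth (a quick illustration, not part of the system):

# randint(0, 10) >= depth * 2 succeeds in max(0, 11 - 2 * depth)
# of 11 equally likely cases
for depth in range(6):
    p = max(0, 11 - 2 * depth) / 11
    print(f"depth {depth}: P(function node) = {p:.2f}")
>>> depth 0: P(function node) = 1.00
>>> depth 1: P(function node) = 0.82
>>> depth 2: P(function node) = 0.64
>>> depth 3: P(function node) = 0.45
>>> depth 4: P(function node) = 0.27
>>> depth 5: P(function node) = 0.09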

Creating offspring

The next piece we will need is code to produce new programs (offspring) from the current generation of programs. We’ll implement the two most common genetic operations used in GP: crossover and mutation. Mutation involves choosing a random point in a selected program and replacing it with randomly generated code:

[Figure: mutation replaces a randomly chosen subtree of a program with newly generated random code]

Crossover involves selecting two programs from the population, choosing a random node in each program, and then replacing part of the first program with part of the second:

[Figure: crossover replaces a randomly chosen subtree of the first program with a subtree from the second]

Both operations require selecting a random node in the program tree:

 

def select_random_node(selected, parent, depth):
    if "children" not in selected:
        return parent
    # the deeper we descend, the more likely we are to stop
    # and return the current node
    if randint(0, 10) < 2 * depth:
        return selected
    child_count = len(selected["children"])
    return select_random_node(
        selected["children"][randint(0, child_count - 1)],
        selected, depth + 1)
 
print(render_prog(select_random_node(program, None, 0)))
>>> ((horsepower + cylinders) * weight)

 

Now that we have code for random program creation and random node selection, we can create the mutation operation:

 

from copy import deepcopy
 
def do_mutate(selected):
    offspring = deepcopy(selected)
    mutate_point = select_random_node(offspring, None, 0)
    child_count = len(mutate_point["children"])
    mutate_point["children"][randint(0, child_count - 1)] = random_prog(0)
    return offspring
 
 
print(render_prog(do_mutate(program)))
 
>>> ((((origin - (((weight + origin) * model year) + horsepower)) / (acceleration * -(horsepower))) - ((-(acceleration) + (model year / horsepower)) * (-(-(displacement)) - ((origin - (acceleration - origin)) + weight)))) * weight)

 

Also, we can code the crossover operation:

 

def do_xover(selected1, selected2):
    offspring = deepcopy(selected1)
    xover_point1 = select_random_node(offspring, None, 0)
    xover_point2 = select_random_node(selected2, None, 0)
    child_count = len(xover_point1["children"])
    # graft the subtree from selected2 into a copy of selected1; the subtree
    # is shared by reference, which is safe because programs are always
    # deep-copied before being modified
    xover_point1["children"][randint(0, child_count - 1)] = xover_point2
    return offspring
                                                                                                     
                                                                                                     
print(render_prog(do_xover(population[0], population[1])))                                           
                                                                                                     
>>> ((((acceleration - (model year * displacement)) * -(-(model year))) - (cylinders / acceleration)) / -(-(((weight / (origin - weight)) - origin))))

 

Selecting “parents”

The next piece we will need is the ability to select programs from the current generation to serve as the basis for the next generation. We would like to choose programs which have high fitness (i.e., they provide a good estimate of the target we are predicting). However, if we simply rank the programs and select the best performers, then after a few generations we will have a population with low diversity: all of the programs will be very similar. This tends not to produce good results in genetic algorithms. Instead we will use what is known as “tournament selection”6: choosing a specific number of programs at random from the population (the “tournament size”) and then selecting the one with the highest fitness. This way every program has a chance to produce offspring, which maintains diversity in the next generation while still ensuring that higher-fitness programs are selected more often.

The code for using tournament selection to select one parent looks like:

 

TOURNAMENT_SIZE = 3                                                                                  
                                                                                                     
def get_random_parent(population, fitness):                                                          
    # randomly select population members for the tournament                                          
    tournament_members = [
        randint(0, POP_SIZE - 1) for _ in range(TOURNAMENT_SIZE)]                  
    # select tournament member with best fitness                                                     
    member_fitness = [(fitness[i], population[i]) for i in tournament_members]                       
    return min(member_fitness, key=lambda x: x[0])[1]

 

Here we see a new parameter, TOURNAMENT_SIZE. Lower values make parent selection more random and result in more population diversity, while higher values skew selection more heavily toward high-fitness programs.
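
As a rough illustration of this effect (an aside, not part of the system), the probability that the single best program in the population appears at all in a tournament of k independently drawn members is 1 - (1 - 1/POP_SIZE) ** k:

for k in (2, 3, 7):
    print(k, round(1 - (1 - 1 / POP_SIZE) ** k, 4))
>>> 2 0.0067
>>> 3 0.01
>>> 7 0.0231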

Finally we have the code we need to create the offspring that will make up new generations of programs. To create a new child program we do the following:

 

XOVER_PCT = 0.7

def get_offspring(population, fitness):
    parent1 = get_random_parent(population, fitness)
    # use crossover XOVER_PCT of the time, otherwise mutation
    if random() < XOVER_PCT:
        parent2 = get_random_parent(population, fitness)
        return do_xover(parent1, parent2)
    else:
        return do_mutate(parent1)

 


This selects one parent (for mutation) or two (for crossover) using the tournament selection code above. The parameter XOVER_PCT controls the fraction of children produced by crossover rather than mutation. Genetic programming researchers have found that favoring crossover tends to produce better results, so we’ll set it to 70%.7

Program fitness

The final piece we will need is a way to determine the fitness of programs. Since we are performing regression, we will use the accuracy of a program’s output compared to the actual target as our measure of fitness. Since our target (mpg) is fairly close to normally distributed, we will use Mean Squared Error (MSE) to score our predictions. We will also need some form of regularization, as there is no limit on how complex the generated programs can get. As with most machine learning techniques, overly complex models tend to fit training data very well but generalize poorly to unseen data.8 Therefore we will adjust the MSE score with a penalty which increases with the size of the program tree:

 

REG_STRENGTH = 0.5                                                                                   
                                                                                                     
def node_count(x):
    # count leaf nodes; this is our measure of program size
    if "children" not in x:
        return 1
    return sum([node_count(c) for c in x["children"]])
                                                                                                     
                                                                                                     
def compute_fitness(program, prediction):                                                            
    mse = ((pd.Series(prediction) - target) ** 2).mean()                                             
    penalty = node_count(program) ** REG_STRENGTH                                                    
    return mse * penalty  

 

Increasing the parameter REG_STRENGTH increases the penalty applied to more complex programs, which may be overfitting the training data. (Note: a higher adjusted score corresponds to lower fitness, since MSE is a “lower is better” metric.)
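
To get a feel for this penalty (a quick illustration, not part of the system), here is the multiplier applied to the MSE for programs of various sizes at REG_STRENGTH = 0.5:

for n in (1, 4, 16, 64):
    print(n, n ** REG_STRENGTH)
>>> 1 1.0
>>> 4 2.0
>>> 16 4.0
>>> 64 8.0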

Putting it all together

Now that we have all of the pieces in place, we can start the actual symbolic regression process. We begin with the initial random population of programs we have already created; for subsequent generations we select parents via tournament selection and produce offspring using mutation or crossover.

 

MAX_GENERATIONS = 10                                                                                 
                                                                                                     
global_best = float("inf")                                                                           
for gen in range(MAX_GENERATIONS):                                                                   
    fitness = []                                                                                     
    for prog in population:                                                                          
        prediction = [                                                                               
            evaluate(prog, row) for _, row in data.iterrows()]                                       
        score = compute_fitness(prog, prediction)                                                    
        fitness.append(score)                                                                        
        if score < global_best:                                                                      
            global_best = score                                                                      
            best_pred = prediction                                                                   
            best_prog = prog                                                                         
    print(                                                                                           
        "Generation: %d\nBest Score: %.2f\nMedian score: %.2f\nBest program: %s\n"                   
        % (                                                                                          
            gen,                                                                                     
            global_best,                                                                             
            pd.Series(fitness).median(),                                                             
            render_prog(best_prog),                                                                  
        )                                                                                            
    )                                                                                                
    population = [                                                                                   
        get_offspring(population, fitness)                                                           
        for _ in range(POP_SIZE)]                                                                    
                                                                                                     
print("Best score: %f" % global_best)                                                                
print("Best program: %s" % render_prog(best_prog))                                                   
output = {"target": target, "pred": best_pred}                                                       
pd.DataFrame(output).to_csv("best_pred.csv")  

 

Running the above code produces:

 

Generation: 0
Best Score: 148.07
Median score: 245717.86
Best program: -((weight / -(horsepower)))
 
Generation: 1
Best Score: 140.85
Median score: 11499.62
Best program: (-(-(acceleration)) - (-(acceleration) - ((origin - acceleration) + cylinders)))
 
Generation: 2
Best Score: 140.85
Median score: 2331.44
Best program: (-(-(acceleration)) - (-(acceleration) - ((origin - acceleration) + cylinders)))
 
Generation: 3
Best Score: 109.10
Median score: 2323.00
Best program: ((acceleration + origin) + cylinders)
 
Generation: 4
Best Score: 109.10
Median score: 2455.49
Best program: ((acceleration + origin) + cylinders)
 
Generation: 5
Best Score: 109.10
Median score: 2377.74
Best program: ((acceleration + origin) + cylinders)
 
Generation: 6
Best Score: 109.10
Median score: 2296.93
Best program: ((acceleration + origin) + cylinders)
 
Generation: 7
Best Score: 109.10
Median score: 2454.95
Best program: ((acceleration + origin) + cylinders)
 
Generation: 8
Best Score: 109.10
Median score: 2604.31
Best program: ((acceleration + origin) + cylinders)
 
Generation: 9
Best Score: 109.10
Median score: 2531.03
Best program: ((acceleration + origin) + cylinders)
 
Best score: 109.099400
Best program: ((acceleration + origin) + cylinders)

 

There are a few interesting things to note in the above output. The first is that several of the evolved programs (for example, the best program of generations 1 and 2) contain a double negative, which is not something normally found in mathematical equations. Most complete genetic programming systems include a program simplification step to deal with any redundant or meaningless code that gets generated.9 Also note that the median population score decreases and then levels off; since a lower score means higher fitness, this steady improvement in population fitness is the main mechanism by which all types of genetic algorithms reach their goal.10 In practice, there is always a parameter such as MAX_GENERATIONS because eventually population fitness stops improving, although allowing the process to continue may still produce better programs due to the randomness of the process.
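
For example, a minimal simplification pass handling just the double-negation case seen above might look like this (a sketch under our dictionary representation, not a feature of the system built in this article):

# Sketch of one simplification rule: collapse -(-(x)) to x,
# applied bottom-up over the program tree
def simplify(node):
    if "children" not in node:
        return node
    node["children"] = [simplify(c) for c in node["children"]]
    if node.get("func") is operator.neg:
        child = node["children"][0]
        if child.get("func") is operator.neg:
            return child["children"][0]
    return node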

Summary

We’ve shown that it is possible to create a symbolic regression system with a small amount of Python code. There are plenty of improvements that can be made to this basic system, such as adding a program simplifier or supporting more math functions or loops/conditionals. Working through this process has hopefully also provided some insight into how these systems work and the trade-offs involved in designing them.

Code for the examples presented in this article can be found in the DataRobot Community GitHub.


1 "Symbolic regression." Wikipedia, https://en.wikipedia.org/wiki/Symbolic_regression

2 Koza, John R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.

3 Oltean, Mihai, and Crina Grosan. "A comparison of several linear genetic programming techniques." Complex Systems 14.4 (2003): 285-314.

4 Cramer, Nichael Lynn. "A representation for the adaptive generation of simple sequential programs." Proceedings of the First International Conference on Genetic Algorithms. 1985.

5 Janikow, Cezary Z. "A methodology for processing problem constraints in genetic programming." Computers & Mathematics with Applications 32.8 (1996): 97-113.

6 Fang, Y., and J. Li. "A review of tournament selection in genetic programming." Advances in Computation and Intelligence (ISICA 2010), Lecture Notes in Computer Science, vol. 6382. Springer, 2010.

7 Luke, Sean, and Lee Spector. "A revised comparison of crossover and mutation in genetic programming." Genetic Programming 98 (1998): 208-213.

8 Schaffer, Cullen. "Overfitting avoidance as bias." Machine Learning 10.2 (1993): 153-178.

9 Blickle, Tobias, and Lothar Thiele. "Genetic programming and redundancy." (1994).

10 Greenhalgh, David, and Stephen Marshall. "Convergence criteria for genetic algorithms." SIAM Journal on Computing 30.1 (2000): 269-282.
