Investors use predicted bond trade prices to inform their trading decisions throughout the day. While bond yield curves can be used to help make decisions, there are many other factors that could help predict a bond’s next trading price.
In this tutorial, we’ll review how Eureqa and the power of symbolic regression can be used to predict the next trading price of a US corporate bond. We use bond price data provided through Benchmark Solutions and Kaggle.com, which includes variables such as current coupon, time to maturity, and details of the previous 10 trades, among others.
The original competition and data, hosted by Kaggle, can be found here.
After downloading and viewing the data, we see that this dataset comprises 61 columns of parameters. These parameters include the row ID (can be used for time series analysis), the bond ID (there is data for almost 8,000 different bonds), current coupon, previous trade prices, and more. The most important column for us is the trade_price column, as this is the value that we are trying to solve for.
This dataset also includes over 700,000 rows of data. To start, we’re only going to take the first 200,000 rows of data for learning the model. Later, you can use the rest of this data for more training or validation, but let’s stick with 200,000 for now. To do this, you can extract the rows you want manually from the data, or create a new, smaller file by using the command line to run the following command:
head -n 200000 originalFileNameAndLocation.csv > newFileName.csv
For more details, please see a previous tutorial that covered this step.
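If you would rather not use the command line, the same extraction can be sketched in Python. This is a minimal sketch, not part of the original workflow: the function name `copy_first_rows` is made up for illustration, and it assumes the CSV has a single header row. Note one small difference from `head`: the command above counts the header as one of its 200,000 lines, while this sketch keeps the header and then copies the requested number of data rows.

```python
import csv
import io
from itertools import islice


def copy_first_rows(src, dst, n_rows):
    """Copy the header row plus the first n_rows data rows from src to dst.

    src and dst are open text file objects. Returns the number of data
    rows actually copied (fewer than n_rows if the input is shorter).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:  # empty input: nothing to copy
        return 0
    writer.writerow(header)
    copied = 0
    for row in islice(reader, n_rows):
        writer.writerow(row)
        copied += 1
    return copied


# Small in-memory demonstration with made-up values.
sample = io.StringIO("id,trade_price\n1,100.5\n2,101.0\n3,99.8\n")
out = io.StringIO()
copied = copy_first_rows(sample, out, 2)

# With real files (the paths are the placeholders from the command above):
# with open("originalFileNameAndLocation.csv", newline="") as src, \
#         open("newFileName.csv", "w", newline="") as dst:
#     copy_first_rows(src, dst, 200_000)
```

Reading row by row keeps memory use flat, which matters with a file of this size.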
Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, let’s move on to the Prepare Data tab. This tab has options for you to do further pre-processing with your data, such as handling missing values or smoothing the data points. For this initial exploration, we will not choose any of those options, but you can return to these later to improve on the performance of your model.
Finally, let’s give our search a target expression and choose its building blocks. Since we want to solve for the trade price of a particular bond given the other variables, the target expression should be set so that the trade_price variable is modeled as a function of all other variables:
trade_price = f(weight, current_coupon, time_to_maturity, ..., curve_based_price_last10)
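With 61 columns, typing the target expression by hand is tedious. The sketch below builds the expression string from a list of column names. The helper `build_target_expression` is hypothetical, the column list is a shortened illustration rather than the full header, and the names `id` and `bond_id` (and the choice to leave them out of the inputs) are assumptions based on the expression above starting at `weight`.

```python
def build_target_expression(columns, target, exclude=()):
    """Return 'target = f(col1, col2, ...)' using every column except
    the target itself and any names listed in exclude."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    inputs = [c for c in columns if c != target and c not in exclude]
    return f"{target} = f({', '.join(inputs)})"


# Shortened, illustrative column list; the real file has 61 columns.
columns = ["id", "bond_id", "trade_price", "weight",
           "current_coupon", "time_to_maturity"]
expression = build_target_expression(
    columns, "trade_price", exclude=("id", "bond_id")
)
```

In practice you would read the column names from the first line of your CSV file and paste the resulting string into the target expression field.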
The building blocks define the mathematical operations that Eureqa will attempt to combine in your final model. We prefer to start with fewer building blocks to speed up the search, then add more later to bring more sophistication to subsequent models. In this case, we leave only the basic building blocks checked and uncheck the two trigonometry building blocks.
For this dataset, we will leave all other options set to their defaults. Eureqa includes many other options to further refine your search; although we don’t need them for this example, they can be very useful in more targeted searches.
At this point, we are finally ready to move on to the Start Search tab and begin to run the formula search. You can run your search on your local computer, just using the cores you have on your laptop or desktop, or you can speed up your searches by using either your own dedicated private cloud or leveraging the cloud with Amazon EC2. For this search, we ran it on 72 cloud cores for just 20 minutes.
Eureqa gives you a few different methods for assessing the progress of your search. On the Start Search tab, you can monitor the confidence metrics in the ‘Progress and performance’ view as well as the ‘Progress over time’ chart. Alongside those two, the Pareto front display gives another visual indication of the performance of the generated equations.
In our case, Eureqa went through nearly 250,000 generations of equations in just 20 minutes, resulting in 11 equations. The four most accurate equations (as judged by the Pareto front display) differ widely in complexity, ranging in size from 14 to 20. The remaining, simpler models show a steep decrease in accuracy, but it is up to you to determine the right trade-off between simplicity and accuracy.
The current most accurate solution has a mean absolute error of 0.547, meaning that this model predicts the next trading price of a bond with an average error of about $0.55. A less complex model with 20% fewer terms gives us a mean absolute error of 0.554. Given that the average future trading price across the entire training dataset is $105, an average error of about $0.55 shows that both formulas model the data very closely. Giving up only about 1% in accuracy for 20% fewer terms seems like the ideal trade-off in this scenario.
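To make the arithmetic behind that comparison concrete, here is a small sketch. The `mean_absolute_error` function uses the textbook definition, which we assume matches what Eureqa reports; the helper names are ours, and the numbers are the ones quoted above.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_change(new, old):
    """Fractional change going from old to new."""
    return (new - old) / old


mae_most_accurate = 0.547  # most accurate solution
mae_simpler = 0.554        # simpler solution
average_price = 105.0      # average trade price in the training data

# Accuracy given up by choosing the simpler model: about 1.3%.
accuracy_cost = relative_change(mae_simpler, mae_most_accurate)

# The simpler model's error as a share of the average price: about 0.53%.
error_share = mae_simpler / average_price
```

Seen this way, the simpler model's typical miss is roughly half a percent of a typical bond price.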
trade_price = 0.6964*trade_price_last1 + 0.3026*curve_based_price + 0.1059/(trade_type - 2.759)
In just 20 minutes we were able to discover a formula to predict the next price that a corporate bond will trade at, with a mean absolute error of only 0.554. In addition to just making pure predictions, this formula found the relationships within the data, allowing us to understand what factors are truly driving these prices. In this example, we found that the last trading price, the curve based price, and the trade type are the factors that are most important to what price the bond will trade at next. Out of the 61 variables that we began with, Eureqa was able to identify the 3 variables that have the most impact on the future trading price.
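If you want to apply this formula outside of Eureqa, it translates directly into a few lines of code. The sketch below simply transcribes the equation; the function name is ours, and the example inputs are made up rather than taken from the dataset.

```python
def predict_trade_price(trade_price_last1, curve_based_price, trade_type):
    """Evaluate the three-variable model found by the search.

    The last term is undefined at trade_type == 2.759. We assume
    trade_type only takes whole-number codes, so that value never occurs.
    """
    return (0.6964 * trade_price_last1
            + 0.3026 * curve_based_price
            + 0.1059 / (trade_type - 2.759))


# Made-up inputs: the last trade and the curve-based price are both 100.
example = predict_trade_price(100.0, 100.0, 3)  # roughly 100.34
```

Notice that the two price coefficients sum to almost exactly 1, so the model is essentially a weighted average of the last trade price and the curve-based price, nudged up or down by the trade type.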
Before we continue: want to try for yourself? Go ahead and download the Eureqa project file to get started.
Throughout this example, we took a variety of shortcuts to reach an initial assessment quickly. Now that we have a sense for what this data has to offer and what we’re looking for, there are many opportunities to expand this model to reach even greater accuracy by doing additional data preparation, choosing different formula building blocks, or even just letting the search run for a longer amount of time. However, it is important to keep in mind that with this first investigation, we were able to quickly get visually intuitive results at a high level of accuracy without any deep technical knowledge.
Now we can go a little more in depth. Starting with a massive spreadsheet with >760,000 rows and 61 columns, we were able to generate 11 equations to describe the data in 20 minutes. While I focused on just one of the equations, there is still more we can learn from Eureqa.
I walked you through my thought process of how I chose a single equation out of the 11 that Eureqa generated. This equation had a size of 14, with only 4 parameters and 3 terms. Of all the equations, it seemed to best balance both accuracy and complexity, being able to predict the next price a bond will trade at within $0.55 based on only 3 variables. However, there’s far more information here in this tab—what else can we learn?
First, let’s talk a little more about the equation we chose:
trade_price = 0.6964*trade_price_last1 + 0.3026*curve_based_price + 0.1059/(trade_type - 2.759)
When you click on that specific solution, you will see details about that solution directly below. Eureqa provides details on 8 different error metrics for each solution, ranging from Mean Absolute Error to Hybrid Correlation/Error. I used MAE to judge accuracy in this case, but different datasets may require different error metrics.
While I didn’t previously touch upon R^2 Goodness of Fit, it can provide a meaningful way to evaluate your overall search. This metric tells you how much of the variance in your target variable is captured by each of your solutions. In this case, the R^2 value tells us that our solution captures 98.9% of the variance in the trade price. With this equation under our belt, let’s dig a little deeper.
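For reference, here is a minimal sketch of how R^2 is commonly computed: one minus the ratio of the residual sum of squares to the total sum of squares. This is the standard coefficient of determination; we are assuming Eureqa’s R^2 Goodness of Fit follows the same definition, and the function name is ours.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    if ss_total == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# A perfect model scores 1.0; always predicting the mean scores 0.0.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

An R^2 of 0.989 therefore means the squared error left over by the model is about 1.1% of the total squared variation in the trade price.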
Even though we chose this specific equation as the best for now, what can the other equations tell us about this data? There are three different ways of ranking solutions—by size only, by fit only, or by a combination of size and fit. The third is what Eureqa defaults to, but you can still find valuable data by ranking by the other two methods.
Specifically, let’s look at what happens when you rank by size, looking at the simplest solutions first. By doing this, you can see which single variable Eureqa believes to be the most crucial to understanding the target variable. Then going through each successively more complicated solution, you can see which other variables begin appearing in what order. The simplest solution here is just:
trade_price = trade_price_last1
When you look at the R^2 value for this solution, it actually shows us that this one variable captures 98.4% of the variance of the target variable. What does this mean for us? While we can (and did) find more sophisticated models that get us closer to modeling the future trade price, the last price that the bond traded at is by far the best indicator of the future price.
Finally, let’s focus on this trade_price_last1 variable. As we just discovered, it captures 98% of the variance in our target variable—trade_price. It could be interesting to look at what drives differences between the two variables, and Eureqa lets us do that extremely easily. All we need to do is set up a new search, and modify the target expression to find the difference between trade_price and trade_price_last1, as modeled by the rest of the variables:
trade_price - trade_price_last1 = f(weight, current_coupon, time_to_maturity, ..., curve_based_price_last10)
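Eureqa evaluates the left-hand side of the target expression itself, so no new column is needed for the search. If you want to inspect the price change outside of Eureqa, though, a sketch like the following computes it per row. The function names and the column name `price_change` are our own, and the rows are made-up values in the shape `csv.DictReader` would produce.

```python
def with_price_change(rows, new_column="price_change"):
    """Return copies of rows with trade_price - trade_price_last1 added."""
    result = []
    for row in rows:
        new_row = dict(row)
        new_row[new_column] = (float(row["trade_price"])
                               - float(row["trade_price_last1"]))
        result.append(new_row)
    return result


def mean_absolute_change(rows, column="price_change"):
    """Average size of the price change, ignoring direction."""
    return sum(abs(row[column]) for row in rows) / len(rows)


# Made-up rows; CSV readers deliver values as strings.
rows = [
    {"trade_price": "101.5", "trade_price_last1": "100.0"},
    {"trade_price": "99.0", "trade_price_last1": "100.0"},
]
changed = with_price_change(rows)
typical_move = mean_absolute_change(changed)
```

The mean absolute change is a useful yardstick: it is the error you would get by simply predicting that the price does not move.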
After running this for almost 7 hours on 72 cores, the most accurate solution I could generate was:
trade_price - trade_price_last1 =
(trade_type_last3 + 1.342*time_to_maturity)
/(2.819*curve_based_price_last1 - curve_based_price*trade_type_last1)
+ (trade_type_last3 + 1.342*time_to_maturity)
/(trade_type*curve_based_price - 2.819*curve_based_price)
As you can see from the Pareto front display, solutions with much more complexity are being introduced. Keep in mind that the average difference between the trade price and the last price is 0.607; against that baseline, our most accurate equation here has an MAE of 0.52. While this solution is the most accurate, you can choose for yourself which solution has a better balance of accuracy and complexity, such as the one with an equation size of 13, using only 2 parameters. Additionally, doing more pre-processing on the dataset or choosing different building blocks may lead you to improved searches.
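The most accurate difference equation above is easier to read once you notice that both fractions share the same numerator. The sketch below transcribes it with that numerator factored out; the function name is ours, and the example inputs are made up rather than taken from the dataset.

```python
def predict_price_change(trade_type, trade_type_last1, trade_type_last3,
                         time_to_maturity, curve_based_price,
                         curve_based_price_last1):
    """Evaluate the model for trade_price - trade_price_last1.

    Either denominator can be zero for some inputs, in which case Python
    raises ZeroDivisionError; this sketch does not handle that case.
    """
    numerator = trade_type_last3 + 1.342 * time_to_maturity
    first_denominator = (2.819 * curve_based_price_last1
                         - curve_based_price * trade_type_last1)
    second_denominator = (trade_type * curve_based_price
                          - 2.819 * curve_based_price)
    return numerator / first_denominator + numerator / second_denominator


# Made-up inputs for illustration only.
change = predict_price_change(
    trade_type=3, trade_type_last1=2, trade_type_last3=3,
    time_to_maturity=5.0, curve_based_price=100.0,
    curve_based_price_last1=100.0,
)  # roughly 0.655
```

Adding the result to trade_price_last1 gives the predicted next trade price.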
The goal of this tutorial was to show you how easy and intuitive it is to use Eureqa to quickly come up with incredibly accurate results, and then to expose you to some of the hidden power behind Eureqa that allows you to accomplish far more.
Of course, this is still only the tip of the iceberg of Eureqa’s abilities. Using the fxp file I mentioned earlier, go ahead and try it yourself! If you run into any questions, don’t hesitate to leave a comment.