Eureqa automatically splits your data into groups: training and validation datasets. The training data is used to optimize models, whereas validation data is used to test how well models generalize to new data. Eureqa also uses the validation data to filter out the best models to display in the Eureqa user interface. This post describes how to use and control these datasets in Eureqa.
By default, Eureqa randomly shuffles your data and then splits it into training and validation datasets based on the total size of your data. Eureqa colors these points differently in the user interface and reports separate error statistics for each dataset, for example:
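Conceptually, the default shuffle-then-split behavior works like the following sketch. This is a hypothetical Python illustration, not Eureqa's actual implementation; the function name and the 75/25 default fraction are assumptions for the example.

```python
import random

def split_train_validation(rows, train_frac=0.75, shuffle=True, seed=0):
    """Shuffle the rows (optionally), then split them into training and
    validation sets. Illustrative only; fractions are assumptions."""
    indices = list(range(len(rows)))
    if shuffle:
        # Shuffle a copy of the row indices so the original data order
        # is preserved elsewhere.
        random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in indices[:cut]]
    validation = [rows[i] for i in indices[cut:]]
    return train, validation

# 100 rows -> 75 training rows, 25 validation rows
train, validation = split_train_validation(list(range(100)))
print(len(train), len(validation))  # 75 25
```

Because the indices are shuffled before the cut, both sets are random samples of the data rather than contiguous blocks.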
All other error metrics shown in Eureqa, like the “Fit” column and “Error” shown in the Accuracy/Complexity plot, use the metric calculated with the validation dataset.
Validation data settings
You can modify how Eureqa chooses the training and validation datasets in the Options | Advanced Genetic Program Settings menu, shown below:
Here you can change the portion of the data that is used for the training data, and the portion that goes into the validation data. The two sets are allowed to overlap, but can also be set to be mutually exclusive as shown above.
For very small datasets (under a few hundred points), it is usually best to use almost all of the data for both training and validation. In these cases, model selection can be based on model complexity alone.
For very large datasets (over 1,000 rows), it is usually best to use a smaller fraction of the data for training. Choose a fraction such that the training dataset is approximately 10,000 rows or less, and use all of the remaining data for validation.
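That rule of thumb can be expressed as a small calculation. The helper below is hypothetical (not part of Eureqa); it just shows how the training fraction shrinks as the dataset grows, with the 10,000-row target taken from the guidance above.

```python
def training_fraction(n_rows, target_train_rows=10_000):
    """Pick a training fraction that caps the training set at roughly
    target_train_rows; everything else goes to validation.
    Illustrative helper only."""
    if n_rows <= target_train_rows:
        # Small enough to train on everything.
        return 1.0
    return target_train_rows / n_rows

# 250,000 rows -> train on 4% (10,000 rows), validate on the rest
print(training_fraction(250_000))  # 0.04
```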
Finally, you can control whether Eureqa randomly shuffles the data before splitting it. One reason to disable shuffling is to reserve specific rows at the end of the dataset for validation.
Using validation data to test extrapolating future values
If you are using time series data and are trying to predict future time series values, you may want a validation split that tests the models' ability to predict future values that were never used to optimize them.
To do this, you need to disable the random shuffling in the Options | Advanced Genetic Programming Options menu, and optionally make the training and validation datasets mutually exclusive (as shown in the options above). For example, you could set the first 75% of the data to be used for training, and the last 25% to be used for validation. After starting the search, you will see your data split like below:
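The chronological split described above, first 75% for training and last 25% for validation with no shuffling, can be sketched as follows (hypothetical Python, illustrative only):

```python
def chronological_split(rows, train_frac=0.75):
    """Split time-ordered rows without shuffling: the earliest rows
    train the models, the most recent rows test extrapolation.
    Illustrative only; the 75/25 fraction matches the example above."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# 2,000 time-ordered rows -> rows 0..1499 train, rows 1500..1999 validate
series = list(range(2000))
train, validation = chronological_split(series)
```

Because the validation set contains only the most recent rows, its error metrics measure how well each model extrapolates beyond the time range it was fit on.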
Now, the list of best solutions will be filtered by their ability to predict only future values—the last rows in the dataset which were not used to optimize the models directly.