Classifying evergreen content with machine learning and Eureqa
Originally posted on 1/16/14
As part of a recent Kaggle competition, StumbleUpon (a website discovery platform) challenged users to build a machine learning model that classifies whether a webpage should be considered evergreen or ephemeral. The ability to better classify and understand evergreen content would allow StumbleUpon to greatly improve the performance of its recommendation engine.
Evergreen content, for those not familiar with the term, signifies content that remains relevant, valuable, and authoritative year after year. Evergreen content is of immense value to marketers, as it continually generates traffic and leads season after season. HubSpot has a great introductory article on the subject.
In this tutorial blog, we’ll review how Eureqa can be used to predict whether a webpage is evergreen or non-evergreen, using both structured and unstructured data provided by Kaggle and StumbleUpon.
The original competition and data, hosted by Kaggle, can be found here.
Examining the data
After downloading the training dataset, we can see that we will be working with 27 variables and 7,395 records, with each record (or row) corresponding to a given webpage. For the purposes of this tutorial, we are not going to use the ‘raw content’ file. Some of the variables we will be working with include:
Link Word Score
Avg Link Size
Number of Links
Number of Spelling Errors
At first glance, it does look like we’ll need to do a little bit of preparation to get our data properly set up for Eureqa.
Preparing the data
The first thing we’re going to do is parse the 'boilerplate' variable so that 'title', 'body', and 'url' are all in separate columns. This will let us examine not only which words are potentially indicative of evergreen content, but also the impact of their placement in the ‘title’, ‘body’, or ‘url’ sections. We also parsed the 'url' variable so that the domain sits in its own column. All of this can be accomplished within Excel or any other stats program.
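If you prefer to script this step, the same preparation can be sketched in Python. The sample boilerplate string and URL below are made up for illustration; the field names ('title', 'body', 'url') match those in the Kaggle data.

```python
import json
from urllib.parse import urlparse

# Illustrative sample row: the 'boilerplate' field is a JSON string
# containing 'title', 'body', and 'url' keys.
boilerplate = ('{"title": "6 Tips for Evergreen Content", '
               '"body": "Recipes and guides that stay useful...", '
               '"url": "example com evergreen tips"}')
url = "http://www.example.com/6-tips-for-evergreen-content"

# Split the JSON blob into three separate columns.
record = json.loads(boilerplate)
title = record.get("title", "")
body = record.get("body", "")
url_text = record.get("url", "")

# Pull the domain out of the full URL into its own column.
domain = urlparse(url).netloc

print(title)   # 6 Tips for Evergreen Content
print(domain)  # www.example.com
```

Applied across every row of the training file, this yields the separate 'title', 'body', 'url', and domain columns described above.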
Now that we’ve completed our initial adjustments to the training data, we can go ahead and import it into Eureqa. To do this, save your XLS file (or equivalent) as a CSV and, from within Eureqa, click Import Data. After it’s done loading, your worksheet should look similar to the screenshot below:
Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, we can move on to the Prepare Data tab. This tab has options to further pre-process your data, including handling missing values and smoothing the data points. For this initial analysis, we will not choose any of those options, but you can return to them later to improve the performance of your model.
Before we move on and begin our model search, you may have noticed that several new columns were appended to your data. Eureqa uses a basic ‘bag of words’ implementation for handling text data, which takes the most frequently used words and appends them to your data as columns with boolean values. As an example, if the 'title' of a webpage was ‘6 Tips for Evergreen Content,’ the column title_Evergreen would have a value of ‘True’ or 1. For more information, take a look at Wikipedia’s article on Bag-of-Words.
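To make the expansion concrete, here is a minimal sketch of a binary bag-of-words over a few made-up titles; the column-naming convention (title_Word) mirrors the one described above.

```python
# Minimal binary bag-of-words sketch, similar in spirit to what Eureqa
# does automatically. Titles are invented for illustration.
titles = [
    "6 Tips for Evergreen Content",
    "Chocolate Cupcake Recipe",
    "Evergreen Marketing Ideas",
]

# Build the vocabulary across all titles.
vocab = sorted({word for t in titles for word in t.split()})

# One boolean column per word: True if that word appears in the title.
rows = []
for t in titles:
    words = set(t.split())
    rows.append({f"title_{w}": (w in words) for w in vocab})

print(rows[0]["title_Evergreen"])  # True
print(rows[1]["title_Evergreen"])  # False
```

A production implementation would also lowercase, strip punctuation, and keep only the most frequent words, but the boolean-column idea is the same.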
How to classify evergreen content with machine learning and Eureqa
Let’s go ahead and click the tab labeled Set Target. From this tab, we can tell Eureqa the variable we wish to model as well as what mathematical building blocks should be used during the model search.
We want to predict the variable ‘Label’, which signifies whether or not a given webpage is considered evergreen or ephemeral. Since this variable only contains values of 0 or 1, we’re going to use a special target expression that will provide a similar constraint on the resulting model. For this tutorial, we used the logistic function, which squashes values to be between 0 and 1. We choose the logistic function (as opposed to a step function) because it provides a better search gradient. For more information, see our tutorial on modeling binary values.
We’re also going to make a couple of changes to the building blocks Eureqa will use during the search. Go ahead and enable the Logistic building block, as well as all of the Logical operators.
Also, since the Kaggle competition uses the AUC (area under curve) error metric, we should select AUC from the Error Metric dropdown list at the bottom of the Set Target screen.
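AUC has a handy interpretation: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The labels and scores below are made-up values just to illustrate the computation.

```python
# Pairwise computation of AUC on a tiny made-up example.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.5]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Count pairs where the positive outranks the negative (ties count half).
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(round(auc, 4))
```

A perfect ranking gives an AUC of 1.0 and random guessing gives 0.5, which is why it is a natural metric for a ranking-style classification task like this one.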
At this point, Eureqa should look something like the screenshot below:
Now that we have set our target expression and selected what we believe are good ‘starter’ building blocks, we can begin our search.
The View Results tab offers a digest view of the top solutions Eureqa has generated over the course of a search. For the purposes of this tutorial, we ran Eureqa on a 72-core private cloud for the better part of 3 hours, which generated 697,254 models.
At first glance, we can see that the top two models are very close in predictive accuracy and complexity, with AUC error ranging between .2239 and .2276 (Eureqa minimizes the error metric, so lower is better) and using 29 and 27 terms, respectively.
Perhaps most importantly, because the output of Eureqa is an analytical model, we can easily identify which characteristics are most indicative of evergreen content. Our most accurate model includes URL_cake, URL_chicken, URL_chocolate, URL_cupcakes, URL_kitchen, URL_make, URL_Recipe, and URL_Recipes. This also passes a sanity check: recipes are exactly the kind of content that stands the test of time.
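To see how such a model reads, here is a hypothetical sketch: boolean word indicators feed a simple expression that the logistic function squashes into a probability. The coefficients and feature weights below are invented for illustration and are not taken from the actual Eureqa output.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_evergreen(features):
    """Hypothetical Eureqa-style model: a weighted sum of boolean
    word indicators, squashed to a probability by the logistic function.
    Weights are made up for illustration only."""
    score = (2.0 * features.get("URL_Recipe", 0)
             + 1.5 * features.get("URL_chocolate", 0)
             - 1.0)
    return logistic(score)

# A page with 'Recipe' in its URL gets a high evergreen probability.
p = predict_evergreen({"URL_Recipe": 1, "URL_chocolate": 0})
print(round(p, 3))
```

Because the model is an explicit formula rather than a black box, each term's contribution to the prediction can be read off directly.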
Given these results, other variables such as domain, embed ratio, number of links, and link word score, while possibly important indirectly, do not significantly improve accuracy and, as a result, are not used in the best models.
In just over three hours, we were able to go from a training dataset containing the characteristics of a given webpage, to an analytical model that predicts evergreen content correctly 78% of the time and offers us a much deeper understanding of the characteristics and relationships that are most indicative of evergreen content.
For real-world applications, we would likely want to improve the predictive accuracy of our results by leveraging the 'raw content' ZIP file provided by StumbleUpon, more thoroughly preparing our training data, adding or removing building blocks, letting Eureqa search for a longer period of time, and leveraging additional computational resources.