I'm a guy that does a lot of data visualization work who is relatively new to ML.
I've loaded a data set with ~15 features, and it looks like DR has created about 40 models. I know when looking at the metrics defining a 'good' model, you can't necessarily take just one. But what are a few good metrics to start with to pick a reasonably good model? AUC?
Related question, is there any guidance on what a 'reasonable' score is for the different metrics?
Hi @ml-noob - welcome to the DataRobot Community! The in-app Platform Documentation has some pretty comprehensive explanations of the various optimization metrics and guidance for understanding them. Here's a quick link for you, for the trial: Optimization Metrics doc link.
As for understanding which of the models DataRobot created is the "best" (based on your needs and use case), you should have a look at the various visualizations, such as Feature Impact, Feature Effects, and Prediction Explanations -- as explained here.
Also, see this article for info on how DataRobot creates the Leaderboard. And if you want to compare some of the built models based on speed vs accuracy, lift charts, ROC Curves, etc, see this article for help.
Does this help? Do you have more specific questions?
Hi @ml-noob ,
Welcome to the world of machine learning, the water's warm... come on in. You ask a really good question and, as I see it, it has two separate parts:
1. What metrics are 'best' for different types of machine learning problems?
2. What constitutes 'good' for a metric/model?
Let me add some context here, and I'll try to keep it to plain language (so there may be some technical nuances that I gloss over).
What metrics are 'best' for different types of machine learning problems?
When thinking about metrics there are actually a few ways to think about model performance. "How well does my model 'fit' the data?" and "What sort of mistakes does the model make with predictions?" are related, but separate questions. Different metrics will be better at answering one vs. the other. The relevance of a metric also differs between a classification problem (the target can be 'A' or 'B'... or perhaps any letter of the alphabet) and a regression problem (the target is a continuous numerical value), but there are some metrics that can span both types of problems.
Metrics that really focus on "How well does my model 'fit' the data?" include 'RMSE' ('Root Mean Squared Error') and 'MAE' ('Mean Absolute Error'), which tell you by how much the model 'misses' on average. Other metrics like 'MAPE' ('Mean Absolute Percentage Error') express each 'miss' as a percentage of the actual value, so they can account for differences in scale if there are multiple series in your dataset (for example in time series).
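To make the three regression metrics above concrete, here is a minimal sketch in plain Python. The function names and the sample numbers are made up for illustration; this is not how DataRobot computes them internally, just the textbook definitions.

```python
import math

def mae(actual, predicted):
    # Mean Absolute Error: average size of the miss, in the target's units.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root Mean Squared Error: like MAE, but large misses are penalized more.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

def mape(actual, predicted):
    # Mean Absolute Percentage Error: each miss relative to the actual value.
    # Undefined when an actual value is 0.
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(actual)

# Hypothetical actual values and model predictions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("MAPE:", mape(actual, predicted), "%")
```

Note that the misses of 10 (on 100) and 20 (on 200) differ in absolute size but are both 10% misses, which is the scale effect MAPE is meant to capture.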
A metric like 'logloss' is an extremely good 'general' metric that describes how well the model fits the data and also includes some insights into 'correct' or 'incorrect' predictions. Cross-entropy loss, or 'logloss', measures the performance of a classification model whose output is a probability value between 0 and 1. Logloss increases as the predicted probability diverges from the actual label. So for example, predicting a probability of .12 when the actual observation label is 1, or predicting .91 when the actual observation label is 0, would be bad and result in a higher loss value. A perfect model would have a logloss of 0. But it can be difficult to know if a 'logloss' of 0.2 is good or bad.
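A small sketch of the logloss behavior described above, using the same .12 and .91 examples. The helper name is hypothetical and this is just the standard binary cross-entropy formula for a single prediction. One assumption-free reference point that can help with the "is 0.2 good?" question: always predicting 0.5 gives a logloss of ln(2), about 0.693.

```python
import math

def log_loss_single(label, prob):
    # Binary cross-entropy for one prediction.
    # label is 0 or 1; prob is the predicted probability that label == 1.
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Confident and wrong: high loss.
print(log_loss_single(1, 0.12))  # about 2.12
print(log_loss_single(0, 0.91))  # about 2.41

# Confident and right: low loss.
print(log_loss_single(1, 0.91))  # about 0.09

# Always predicting 0.5 costs ln(2) per row, whatever the label is.
print(log_loss_single(1, 0.5))   # about 0.693
```

The logloss reported for a model is the average of this value over all rows.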
On the flip side, we might look at a metric like 'Accuracy', which describes how often the model makes a correct prediction and, through the counts behind it, what sorts of incorrect predictions are made. Let's use a classification problem as an example. Here, Accuracy as a metric captures the ratio of the total count of correct predictions over the total count of all predictions, based on a given threshold. True positives (TP) and true negatives (TN) are correct predictions, false positives (FP) and false negatives (FN) are wrong predictions. You calculate Accuracy as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
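Here is a rough sketch of Accuracy as described above (correct predictions over all predictions, at a given threshold). The labels, probabilities, and function names are invented for illustration.

```python
def confusion_counts(labels, probs, threshold=0.5):
    # Turn probabilities into 0/1 predictions at the threshold, then count
    # true positives, true negatives, false positives, and false negatives.
    tp = tn = fp = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Correct predictions divided by all predictions.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical actual labels and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

print(accuracy(*confusion_counts(labels, probs, threshold=0.5)))   # 0.75
print(accuracy(*confusion_counts(labels, probs, threshold=0.65)))  # 0.625
```

The same model gets a different Accuracy at a different threshold, which is why the threshold is part of the definition.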
The advantage here is that we know the scale for Accuracy (0-1) and can clearly interpret it: it is how often the model makes correct predictions (either TP or TN) relative to all of the predictions that it makes. Sometimes you really care about reducing certain types of incorrect predictions (ex. minimize the FP) or increasing a certain type of correct prediction (ex. maximize the TN) because of the real world implications of correct or incorrect predictions. There are different metrics to focus on different aspects of predictions, such as: True Positive Rate (Sensitivity), False Positive Rate (Fallout), True Negative Rate (Specificity), Positive Predictive Value (Precision). Understanding how often a model makes correct Negative (TN) predictions isn't available using a metric like 'logloss', but would be incorporated into 'Accuracy' or a focused metric like 'Specificity'. The list goes on, but 'AUC' and 'Gini Norm' are quite interpretable metrics for 'how good are the predictions overall' because they give you a clear range (AUC: 0.5 for random guessing up to 1 for a perfect ranking, Gini Norm: 0-1 on the same basis).
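The focused metrics listed above all come from the same four counts, so a short sketch may help. The counts and data are hypothetical, the function names are mine, and the AUC here is the plain pairwise definition (how often a randomly chosen positive is scored above a randomly chosen negative). For the Gini value I'm assuming the usual binary classification relationship Gini = 2 * AUC - 1; check the docs for exactly how Gini Norm is defined in the app.

```python
def rates(tp, tn, fp, fn):
    return {
        "sensitivity": tp / (tp + fn),  # True Positive Rate
        "fallout": fp / (fp + tn),      # False Positive Rate
        "specificity": tn / (tn + fp),  # True Negative Rate
        "precision": tp / (tp + fp),    # Positive Predictive Value
    }

def auc(labels, probs):
    # Fraction of (positive, negative) pairs where the positive row gets the
    # higher score; ties count as half. No threshold is involved.
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts at some threshold, and hypothetical scored rows.
r = rates(tp=2, tn=4, fp=1, fn=1)
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

model_auc = auc(labels, probs)
gini = 2 * model_auc - 1  # assumed relationship, see note above

print(r)
print("AUC:", model_auc)  # 0.8
print("Gini:", gini)      # about 0.6
```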
What constitutes 'good' for a metric/model?
Ok, so there are lots of metrics. But what constitutes 'good' on any metric? The answer here is 'it depends' because the answer relies on what is 'valuable performance' for that specific problem. Let us use the example of a binary classification ('A' or 'B'). If there is no existing model to predict 'A' or 'B' and they both occur with the same frequency, then just beating a 50/50 guess (Accuracy > 0.5) is an improvement. That might make a model with an 'Accuracy' of 0.6 - 0.8 valuable because it makes a meaningful improvement over the status quo. Or perhaps there is actually a 20/80 ratio of A/B (and not a 50/50 ratio). In that case a 50/50 guess isn't a very good baseline, because always predicting 'B' is already right 80% of the time. We'd only view a model as an improvement if it beats an Accuracy of 0.8. So 'good' can depend on the distribution of your data and how well an existing or 'simple model' (i.e. a guess) performs.
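As a quick illustration of the baseline idea above, this sketch computes the Accuracy you would get by always guessing the most common class. The function name and the datasets are made up.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy of the simplest possible 'model': always predict the
    # most frequent class.
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

balanced = ["A"] * 50 + ["B"] * 50
imbalanced = ["A"] * 20 + ["B"] * 80

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(imbalanced))  # 0.8
```

A model with an Accuracy of 0.75 would look good against the first baseline and worse than doing nothing against the second.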
Or maybe you're in a situation where the types of prediction mistakes are really important. False Positive (FP) or False Negative (FN) predictions are especially bad in some situations: think cancer diagnosis. What is the relative 'bad' of incorrectly diagnosing someone as having cancer? Or missing that they have cancer and giving an incorrect cancer-free diagnosis? And how does this balance against the 'good' from correctly predicting when someone does or does not have cancer? In this case, we'd probably assess 'good performance' as something that reduces the number of 'bad' diagnoses (FP or FN) at the cost of potentially not being as good with some correct predictions (TP or TN). (Hint: we can help you make that decision using the 'Profit Curve' functionality, but that introduces the idea of a 'threshold' and we can tackle that if you have questions.)
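To show how the relative cost of FP and FN mistakes can change what 'good' means, here is a toy sketch that picks a threshold by total cost. This is not the Profit Curve feature or its implementation; the cost values, data, candidate thresholds, and function names are all hypothetical.

```python
def error_counts(labels, probs, threshold):
    # Count false positives and false negatives at the given threshold.
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    return fp, fn

def total_cost(labels, probs, threshold, fp_cost, fn_cost):
    fp, fn = error_counts(labels, probs, threshold)
    return fp * fp_cost + fn * fn_cost

def best_threshold(labels, probs, thresholds, fp_cost, fn_cost):
    # Pick the candidate threshold with the lowest total cost of mistakes.
    return min(thresholds,
               key=lambda t: total_cost(labels, probs, t, fp_cost, fn_cost))

labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
thresholds = [0.25, 0.5, 0.75]

# Missing a positive is 10x worse than a false alarm: prefer a low threshold.
fn_costly = best_threshold(labels, probs, thresholds, fp_cost=1, fn_cost=10)
# A false alarm is 10x worse than a miss: prefer a high threshold.
fp_costly = best_threshold(labels, probs, thresholds, fp_cost=10, fn_cost=1)

print(fn_costly)  # 0.25
print(fp_costly)  # 0.75
```

The model and its predictions are identical in both cases; only the costs attached to each kind of mistake changed.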
The point here is that there is no single answer to 'what is good performance?'. We'd all love to build models that give perfect predictions, but unfortunately we don't often see that. Either the data or the problem doesn't support perfect models. The definition of 'good' depends on the data, the value of making correct predictions, and the cost of making different types of incorrect predictions.
@Anonymous linked to some good content for thinking through these issues, and hopefully it helps. If you have a specific situation and can share details, then perhaps I can offer some advice on what 'good' might look like in that situation.
There is also a course over in DataRobot University on Evaluating your Models: https://university.datarobot.com/evaluating-your-model. This is another place to get a deeper understanding of how to think about this.