When your company’s data scientists say that they’re going to “build an ML model,” they really mean that they’re going to “build several models in search of the best one for this project.” That winner is what you’ll embed in your product, or use to operate your business, or what have you.
Before they build their first model, though, they’ll establish a couple of baselines to guide them. What does a baseline look like for an ML model? What are some common baselines one should develop as part of a modeling exercise? I’ll explain that and more in this post.
How do you determine which is the “best” model? You start by picking one or more metrics relevant to the model’s intended use case. Just like any other business metric, ML model metrics help you compare two entities according to their performance.
The most common ML metric is accuracy, which reflects how often a model made the correct prediction. Accuracy is one of many possible metrics, and it’s not always the most appropriate one to use.
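To make "how often a model made the correct prediction" concrete, here is a minimal Python sketch of an accuracy calculation. The function name and the sample labels are hypothetical, chosen only for illustration:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
actual = ["A", "B", "A", "C", "B"]
predicted = ["A", "B", "C", "C", "A"]

# Three of the five predictions match the true labels.
score = accuracy(actual, predicted)
print(f"accuracy: {score:.0%}")
```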
(Choosing a metric is a topic unto itself. I’ll do a separate post on that at some point and link to it here. For the purposes of this post, I’ll keep things simple and use accuracy for the examples.)
A metric has no meaning in isolation, though. “The model performs at 90% accuracy” may sound impressive, but the question to ask is: 90% compared to what? If you’ve built other models that score 60% or 80% accuracy, then yes, 90% is quite a step up. But if a previous model demonstrated 89% accuracy, then this latest result is only a slight improvement. And if the previous model exhibited 98% accuracy, then 90% is a notable step down.
This gets to why baselines are so important in ML modeling: before you compare your models to each other, you compare them to the baseline performance. Any model – the combination of training data, algorithm choice, and tuning parameters – that doesn’t exceed the baseline isn’t a serious candidate for running in the real world.
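As a rough sketch of that screening step, the Python below keeps only the candidates that beat every baseline. All names and scores here are hypothetical placeholders, not results from a real project:

```python
def beats_baselines(model_score, baseline_scores):
    """Return True only if the model's score exceeds every baseline score."""
    return all(model_score > score for score in baseline_scores.values())


# Hypothetical baseline and candidate accuracies, for illustration only.
baselines = {"baseline_1": 0.33, "baseline_2": 0.70}
candidates = {"model_1": 0.60, "model_2": 0.80, "model_3": 0.90}

# Only the models that clear every baseline are worth comparing to each other.
serious_candidates = {
    name: score
    for name, score in candidates.items()
    if beats_baselines(score, baselines)
}
print(serious_candidates)
```

A strict "greater than" is a simplification. In practice you would likely want a model to beat the baseline by some meaningful margin, and what counts as meaningful depends on the project.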
Here are my top baselines for an ML modeling effort:
1/ Pure-random: What if, instead of using a model for this project, you were to just flip a coin or roll some dice?
Since this system is just tossing out predictions at random, it should be wrong a good deal of the time. And if your new model can’t beat a system that’s not even trying to find the correct answer, that should raise an eyebrow.
How do you calculate accuracy for this case? The quick answer is “the inverse of the number of classes.” So if your model labels incoming support requests as either A, B, or C, you have three classes. One divided by three gives a baseline accuracy of about 33% for random choice.
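Here is a minimal Python sketch of that arithmetic, with a quick simulation as a sanity check. The dataset is hypothetical and perfectly balanced, and the guesser picks each class with equal probability; the function name and seed are my own assumptions:

```python
import random


def uniform_random_baseline(labels, classes, seed=0):
    """Accuracy of a guesser that picks each class with equal probability."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    guesses = [rng.choice(classes) for _ in labels]
    hits = sum(1 for truth, guess in zip(labels, guesses) if truth == guess)
    return hits / len(labels)


classes = ["A", "B", "C"]

# Hypothetical, perfectly balanced dataset: 10,000 requests per class.
labels = ["A"] * 10_000 + ["B"] * 10_000 + ["C"] * 10_000

expected = 1 / len(classes)  # the "inverse of the number of classes"
simulated = uniform_random_baseline(labels, classes)

print(f"expected:  {expected:.3f}")
print(f"simulated: {simulated:.3f}")
```

The simulated figure should land close to one-third, though it will not match exactly because the guesses are random.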
But that quick answer assumes a fair coin or fair dice, where every class is equally likely to be guessed. If your random guesser instead mirrors the class mix in your training dataset, and one class is vastly under- or over-represented, then the math gets trickier. And that takes us to the next baseline:
2/ Always choose the most-represented class: If your training dataset is imbalanced in favor of one class, you can calculate a baseline for “just choose whatever class is most popular.”
Let’s say that your training dataset consists of 90% Class A and 10% Class B. A system that always chooses Class A will score 90% accuracy! If you hadn’t taken the time to calculate this baseline, you might be impressed when you hear that an ML model performed at 92%.
And assuming the always-choose-A approach will harm your business – maybe you’re predicting credit card fraud, so you can’t just assume that all transactions are legitimate – then this baseline will set the bar for what your ML models will have to deliver.
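The always-choose-the-most-popular-class baseline takes only a few lines to compute. Here is a sketch in Python using the 90/10 split from the example above; the function name and the 100-record dataset are assumptions made for illustration:

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most common class and the accuracy of always predicting it."""
    counts = Counter(labels)
    top_class, top_count = counts.most_common(1)[0]
    return top_class, top_count / len(labels)


# Hypothetical dataset matching the example: 90% Class A, 10% Class B.
labels = ["A"] * 90 + ["B"] * 10

top_class, baseline_accuracy = majority_class_baseline(labels)

# How many of the Class B cases does "always choose A" catch? None.
predictions = [top_class] * len(labels)
b_total = sum(1 for truth in labels if truth == "B")
b_caught = sum(
    1 for truth, guess in zip(labels, predictions) if truth == "B" and guess == "B"
)
minority_caught = b_caught / b_total

print(f"always choose {top_class}: {baseline_accuracy:.0%} accuracy")
print(f"share of Class B cases caught: {minority_caught:.0%}")
```

The second number is why this baseline is a bar to clear and not a solution: it scores well on accuracy while never flagging a single Class B case.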
3/ The incumbent solution: In many cases you’re building a model because you want to replace some existing system. Maybe this incumbent solution is based on an older ML model, or on manual effort. How does the new model stack up against what you’re already doing?
As a side note: I’m oversimplifying here. It’s possible that an ML-based solution performs slightly worse than the manual approach, but at a much lower cost. While the other baselines are more clear-cut as to which model “wins,” comparing the new model to the incumbent solution will require some nuance to account for things like total cost of ownership (TCO) and staffing matters.
4/ The out-of-the-box, no-tuning model: Most algorithms have tuning parameters that you can adjust to influence model performance. And most of those tuning parameters have default values. What happens if you run your data through the algorithm and accept the defaults? How well does that resultant model perform?
This is an interesting baseline because it represents the case of (near-)zero effort from the data science team. If they keep adjusting the parameters but performance doesn’t improve, that’s a sign. It’s probably time to swap in a different algorithm, or change the training dataset, or scrap this modeling project altogether.
Baselines help set your perspective of model performance. A model’s metrics may sound impressive until you compare them to that random-choice or always-choose-class-A baseline.
Baselines also set your expectations for an ML modeling effort. If your new models can’t meaningfully outperform the baselines – especially those baselines that are effectively a single line of code – it’s time to reconsider your approach.