Knowing what to measure
(Photo by Nick Brunner on Unsplash)

When you're training an ML/AI model, how do you know when it's good enough?

In ML/AI we often talk about building "a" model. It's really a matter of building several candidate model variants, then promoting the best of the bunch to production. Metrics – gauges of model performance – play a key role in deciding which variant is best.

There are many metrics from which to choose, each one measuring some slightly different quality of a candidate model's performance. Comparing model variants along the wrong metric can give you a false sense of security and possibly harm your business. So which metric(s) should you use, and how would you decide?

I'll briefly walk through the different types of metrics and offer some thoughts on how to decide.

Standard metrics

Accuracy is the most common industry-standard metric. Even if you've never heard the name, you're familiar with the concept: it's the percentage of correct answers issued by the model during a test. Accuracy has a wide audience, too. It's the first metric that entry-level data scientists encounter, as well as a common first-look or gut-check metric for experienced practitioners.
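To make that concrete, here's a minimal sketch of the calculation. It assumes the test labels and the model's answers are sitting in two plain Python lists. The accuracy helper and the sample data are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical test set: what each photo really shows vs. what the model said.
y_true = ["dog", "cat", "bird", "cat", "dog"]
y_pred = ["dog", "cat", "cat", "cat", "bird"]

score = accuracy(y_true, y_pred)
print(f"accuracy: {score:.0%}")  # 3 of 5 correct -> accuracy: 60%
```

Note that the score says nothing about which answers were wrong, only how many.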

Accuracy is simple to calculate and intuitive to understand, which is good. It also lacks context for the business problem you're trying to solve, which is not so good. Frankly, all of the industry-standard metrics – including F1, ROC AUC, and so on – suffer from that same problem. For all but the simplest of problems, you'll want to consider other metrics.

Custom metrics

You can define custom metrics that account for your business use case. For example, let's say your model classifies pictures into buckets of "dog," "cat," and "bird." Assuming all buckets are of equal importance, you might be able to get by using plain old accuracy. But what if your clients are most interested in identifying cat photos? You'd need to define a custom metric that puts extra weight on incorrect predictions of "cat."
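One possible shape for such a metric is a weighted version of accuracy. The sketch below is an assumption on my part, not a standard formula: any photo where "cat" is either the true label or the predicted label counts three times as much as the others. The function name, the weight of 3, and the sample data are all hypothetical.

```python
def cat_weighted_score(y_true, y_pred, cat_weight=3.0):
    """Accuracy where examples involving "cat" carry extra weight.

    An example "involves" cat if cat is the true label or the predicted
    label, so missed cats and false cat alarms both cost more.
    """
    total = 0.0
    earned = 0.0
    for t, p in zip(y_true, y_pred):
        weight = cat_weight if "cat" in (t, p) else 1.0
        total += weight
        if t == p:
            earned += weight
    if total == 0:
        raise ValueError("need at least one example")
    return earned / total


# Two hypothetical model variants, each with 4 of 6 answers correct.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
model_a = ["cat", "cat", "dog", "bird", "bird", "dog"]  # errors avoid cats
model_b = ["dog", "cat", "dog", "dog", "bird", "cat"]   # errors involve cats

score_a = cat_weighted_score(y_true, model_a)
score_b = cat_weighted_score(y_true, model_b)
print(f"model_a: {score_a:.2f}, model_b: {score_b:.2f}")
```

Plain accuracy calls these two variants a tie. The weighted score prefers model_a, because its mistakes land on the buckets the clients care less about.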

The key to developing a custom metric is to understand the model's purpose. What is it intended to do in the real world? Why is that important? And how would you measure the business value of correct or incorrect answers? If you're having trouble sussing out a metric, that's a sign your model's purpose is not clear.

Considerations

Whether you're going with tried-and-true standard metrics, or something custom, here are a few ideas to keep in mind:

It's important to choose a metric before you start evaluating candidate models. Otherwise it's too tempting to go metric-fishing – looking for the metric that yields the best score, instead of trying to tune the model to improve performance – because you want to wrap up and move on to the next project.

If you establish multiple metrics, understand that a new model variant may improve along some metrics yet weaken on others. You'll want to decide up front which one matters most for your use case.

Runtime cost is an important metric. Sometimes model performance and cost go hand in hand. A model variant that exhibits greater analytic or predictive performance than the others, but also costs much more to operate, is probably not the "best" model for your business.

Using a third-party model? You're still on the hook to track model performance. Some of you have outsourced the modeling work to an external party (what I call AI as a Service – AIaaS) such as an LLM provider or fraud detection service. Even though you didn't build the model, you still need to confirm that it's up to snuff for what you're doing. Establish your metrics early and be prepared to make some hard decisions if the AIaaS doesn't perform.

Make sure you're actually moving the needle. By keeping an eye on the metrics, you'll see when your tuning has reached a point of diminishing returns. That's when you'll need to decide whether to move forward with the current state of the model, change tactics, or scrap the project altogether.
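If you log your chosen metric after every tuning round, you can check for diminishing returns mechanically. This is a rough sketch under my own assumptions: higher scores are better, and "stalled" means the last few rounds failed to beat the earlier best by some minimum gain. The has_plateaued name, the thresholds, and the score history are hypothetical.

```python
def has_plateaued(scores, min_gain=0.005, patience=3):
    """True if the last `patience` rounds did not beat the earlier best
    score by at least `min_gain`. Assumes higher scores are better."""
    if len(scores) <= patience:
        return False  # not enough history to judge yet
    best_before = max(scores[:-patience])
    recent_best = max(scores[-patience:])
    return recent_best - best_before < min_gain


# Hypothetical metric values, one per tuning round.
history = [0.71, 0.78, 0.82, 0.84, 0.842, 0.843, 0.841]

stalled_now = has_plateaued(history)          # last three rounds gained ~0.003
stalled_earlier = has_plateaued(history[:5])  # still improving at round five
print(stalled_now, stalled_earlier)
```

A True here is a prompt for the decision described above, not a decision in itself. What counts as a meaningful gain depends on what the metric is worth to the business.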

Most of all…

Choosing a metric for your model is important. But remember that metrics only tell part of the story. Once you factor in other business concerns, you may determine that the top-performing model is actually not the "best" one for your use case.
