TCM: Total Cost of (ML/AI) Model
2020-09-08 | tags: AI data literacy

That ML/AI model costs more than you think.

The business world likes hard numbers. They're easy to see on a balance sheet, and they form the basis of many decisions.

Businesses also like ML/AI models, but those carry hidden costs that rarely make it to the accounting system. It's tougher to see when that model actually takes more of your money than it brings in, and when you should have taken a different approach.

In this piece I'll shed light on those hidden costs. That should add structure to your decisions on when, and whether, to employ an ML/AI model.

Introducing TCM: Total Cost of Model

The business world uses the term Total Cost of Ownership (TCO) to describe the entire price of something. That name holds a subtle message, that there's more to this price than you initially see. In some cases it even means: you'll pay for this more than once.

The total cost of your car, for example, extends beyond the sticker price on the lot. You're also subject to the recurring costs of fuel, insurance, and periodic upkeep. Don't forget one-off costs such as repairs and traffic tickets. For those who live in dense urban neighborhoods, there's the emotional cost of looking for street parking on a Friday night, or having to dig your car out from a snow storm. While it's tougher to translate those emotional costs into dollars, you can still see how they might push someone over the threshold into no-car territory.

Calculating TCO is helpful because it leads to more informed decisions: is having your own car really cheaper than the alternatives, such as "splitting a car with your spouse" or "public transit plus the occasional ride share service"? Looking at just the sticker price can lead you to the wrong decision here. The key point about TCO is that you always pay it in the long run, even if you fail to calculate it up-front.

The same goes for employing ML/AI models. I call this Total Cost of Model (TCM) and it's a mix of up-front costs, upkeep, and intangibles. Some of these are within your control but many are not. If you haven't been through the full lifecycle of an ML/AI model -- and several times, so you can see all of the parts that can go awry -- it's easy to assume that you only pay for the initial development (equivalent to your car's sticker price) and then reap the benefit.

Mathematically, we can express that naive view of TCM as:

model development - benefit

Small development costs, paired with large benefit, imply a negative TCM. You're practically printing money! But you're not.

In reality, the full TCM involves a lot more. Consider:

( planning + development + deployment + maintenance + late fees ) - benefit

You can see how the first, naive attempt at defining TCM seems so much better. There's only one variable that increases TCM -- model development -- and it's easy to convince yourself that it's a fixed cost, or at least a contained cost. By comparison the second, thorough definition of TCM leaves a lot more room for the various cost factors to outweigh the benefit. And some of those cost factors influence each other.
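To make the gap between the two definitions concrete, here's a minimal sketch in Python. The names (ModelCosts, naive_tcm, full_tcm) are hypothetical and every dollar figure is made up for illustration; plug in your own estimates.

```python
from dataclasses import dataclass


@dataclass
class ModelCosts:
    """Cost factors from the full TCM formula, all in the same currency."""
    planning: float
    development: float
    deployment: float
    maintenance: float
    late_fees: float


def naive_tcm(costs: ModelCosts, benefit: float) -> float:
    """Naive view: model development - benefit."""
    return costs.development - benefit


def full_tcm(costs: ModelCosts, benefit: float) -> float:
    """Full view: (planning + development + deployment + maintenance + late fees) - benefit."""
    total = (costs.planning + costs.development + costs.deployment
             + costs.maintenance + costs.late_fees)
    return total - benefit


# Hypothetical figures, chosen only to illustrate the gap between the two views.
costs = ModelCosts(planning=20_000, development=100_000, deployment=60_000,
                   maintenance=90_000, late_fees=50_000)
benefit = 250_000

print(naive_tcm(costs, benefit))  # negative: looks like printing money
print(full_tcm(costs, benefit))   # positive: the model costs more than it brings in
```

With these invented numbers the naive TCM is negative while the full TCM is positive. Same model, same benefit, opposite conclusion.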

(Economies of scale are on your side here: the more models you employ, the better you get at each stage of the process, which can shrink your marginal costs of employing new models. Keep that in mind as we walk through the cost factors in detail.)

Planning Costs

When planning your ML/AI model, you're trying to understand the future state of things. You'll need a realistic picture of what ML/AI can do in this situation, in this company -- not just what you hear about in the press -- to confirm that a model is suitable for the problem at hand.

This is the time to understand all of the ramifications of employing a model: "How would it help us in this case? How would it hurt us? How could it hurt someone else? What are the different ways things can go wrong?" The best way to answer these questions is to bring in all of the different teams -- leadership, product, data science, and IT -- to weigh in with their concerns. With a sufficiently diverse set of perspectives, you can catch a lot of corner cases and blind spots, all of which contribute to TCM. (Some of the problems you uncover during planning are really part of the deployment-time risks, which we'll get to in a moment.)

Planning costs mostly boil down to people time. Specifically, people-thinking-and-researching time. That can slow the momentum when you're excited to get moving. There's no Ministry of ML/AI Models that requires you to plan, though, so it's easy to skip past that part and assume your cost of planning is zero.

Model Development Costs

On paper, creating an ML/AI model is a straightforward, one-shot affair: prepare training data, feed that data to an algorithm, and collect your model on the other end.

It's very rare that the first model you create is the one you actually deploy, though. True model development is a create-evaluate-tune cycle that runs until some mix of "it performs well enough" and "we're out of time and money." Since there's no true "end" of model development, cost containment boils down to your decision of when to stop working on it.

For purposes of uncovering TCM, then, you set aside some amount of time and money for model development and hope that your team is able to achieve decent results before those budgets run out.
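One way to picture that budgeted cycle is the sketch below. It is an illustration under stated assumptions, not a real training loop: train_and_evaluate is a hypothetical stand-in that returns a random score, and the budget, per-iteration cost, and target score are invented.

```python
import random


def train_and_evaluate(rng: random.Random) -> float:
    """Hypothetical stand-in for one create-evaluate-tune iteration.

    Returns a made-up evaluation score between 0.5 and 0.9.
    """
    return rng.uniform(0.5, 0.9)


def develop_model(target_score: float, budget: float, cost_per_iteration: float,
                  seed: int = 0) -> tuple[float, float, str]:
    """Iterate until the model is good enough or the budget runs out.

    Returns (best score, money spent, reason for stopping).
    """
    rng = random.Random(seed)
    best_score, spent = 0.0, 0.0
    # Only start an iteration if we can still afford it.
    while spent + cost_per_iteration <= budget:
        spent += cost_per_iteration
        best_score = max(best_score, train_and_evaluate(rng))
        if best_score >= target_score:
            return best_score, spent, "performs well enough"
    return best_score, spent, "out of time and money"


# The stand-in never scores above 0.9, so this run exhausts the budget.
print(develop_model(target_score=0.95, budget=10_000, cost_per_iteration=2_500))
```

The point of the sketch is that the budget, not the algorithm, sets the ceiling on development cost. The loop stops on whichever condition comes first.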

Model Deployment Costs

That model you've built has to live somewhere, so that it can take in new data from the outside world and return predictions. Assuming you already have an infrastructure to host your home-built applications, or you're using one of the popular cloud services, hosting costs should be a straightforward calculation. Those costs may even be a low number, to boot. The real cost of model deployment comes from handling your risk:

Model Risk reflects the cost to your business when the model is wrong. I emphasize "when," not "if," since every model will be wrong sometimes. How many times and how many ways can your model give incorrect answers before it takes a bite out of your bottom line?

Operational Risk reflects the money that is at stake due to the operating environment, which is everything around the model. If the model goes offline, or it receives inputs that throw it off-balance, that will cost you.

You can implement monitors to reduce the impact of model and operational risks. Handling operational risks will also require that you have tools and procedures to make sure new versions of the model are properly deployed (lest you have your own Knight Capital moment, where a botched software deployment cost the trading firm hundreds of millions of dollars in under an hour).

All of that padding around the model will involve some amount of human labor to define what should trigger an alarm and then to respond to those alarms. This increases TCM. At least doing it early is the cheaper route. You can add monitoring and padding before your first incident, or after. But you will add it.
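As a rough idea of what such a monitor might look like, here's a minimal sketch. The ModelMonitor class, its thresholds, and the alert messages are all hypothetical; a production setup would route alerts to whatever alerting tools you already use.

```python
from collections import deque


class ModelMonitor:
    """Raise alerts on out-of-range inputs and on a high recent error rate."""

    def __init__(self, input_min, input_max, max_error_rate, window=100):
        self.input_min = input_min
        self.input_max = input_max
        self.max_error_rate = max_error_rate
        # Keep only the most recent outcomes; True means the prediction was wrong.
        self.outcomes = deque(maxlen=window)

    def check_input(self, value):
        """Operational risk: flag inputs outside the range the model expects."""
        if not (self.input_min <= value <= self.input_max):
            return [f"input {value} outside expected range"]
        return []

    def record_outcome(self, was_wrong):
        """Model risk: flag when the recent error rate crosses the threshold."""
        self.outcomes.append(was_wrong)
        error_rate = sum(self.outcomes) / len(self.outcomes)
        if error_rate > self.max_error_rate:
            return [f"error rate {error_rate:.0%} exceeds {self.max_error_rate:.0%}"]
        return []


# Hypothetical thresholds, for illustration only.
monitor = ModelMonitor(input_min=0, input_max=100, max_error_rate=0.2, window=5)
print(monitor.check_input(250))
for was_wrong in [False, False, True]:
    alerts = monitor.record_outcome(was_wrong)
print(alerts)
```

Someone still has to pick those thresholds and answer the alerts, which is the human labor that adds to TCM.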

There's also an element of Reputation Risk if you use a model where it isn't appropriate. This overlaps with Model Risk and Operational Risk, but it's still its own beast. If the model's purpose will make people very uncomfortable -- say, with certain forms of predictive policing, or creepy advertising -- expect public backlash. This counts even if the model's predictions are correct most of the time. That could lead you to decommission an (otherwise successful) model, which means you lose the benefits of having employed an ML/AI model and you have to clean up a PR mess to boot.

Some people say that you can reduce Reputation Risk by keeping your activities secret. I'd say that this pushes the problem down the road, since the truth eventually comes out. The best way to handle Reputation Risk is to stay out of sketchy business activities.

Maintenance Costs

You can think of maintenance as a continuation of the create-evaluate-tune cycle from the development stage. A model's view of the world is based on its training data, and over time that data no longer reflects the state of the world. Model maintenance involves rebuilding and updating the model with new data, such that it matches how the world looks today.

Maintenance costs can vary. The biggest contributors are usually in the data preparation and the exploratory analysis required to compare the old data to the new. For example, does the new state of the world (ergo, the new data) reflect a new range of values (for a regression model) or a change in classes (for a classifier)? That will take some R&D time to sort out, which increases TCM.
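The two checks in that example can be sketched in a few lines of Python. The function names and sample data below are hypothetical, and real exploratory analysis would go well beyond comparing minimums, maximums, and label sets.

```python
def range_shift(old_values, new_values):
    """Report whether the new data falls outside the range of the old data."""
    return {
        "below_old_min": min(new_values) < min(old_values),
        "above_old_max": max(new_values) > max(old_values),
    }


def class_changes(old_labels, new_labels):
    """Report classes that appeared in, or vanished from, the new data."""
    old, new = set(old_labels), set(new_labels)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical target values for a regression model...
old_prices = [120_000, 250_000, 410_000]
new_prices = [135_000, 300_000, 620_000]
print(range_shift(old_prices, new_prices))

# ...and hypothetical labels for a classifier.
old_labels = ["ok", "spam", "ok", "abuse"]
new_labels = ["ok", "spam", "scam", "ok"]
print(class_changes(old_labels, new_labels))
```

A flag from either check is only the start of the R&D work: someone still has to work out why the data moved and what the model should do about it.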

Late Fees

When you don't invest in planning, those costs come back to you down the road as late fees. Waiting to discover the true costs leads to making decisions in haste and cleaning up after problems have gained momentum. Late fees grow the longer you wait to address them. They can also compound and cascade in odd ways, such as team members leaving in frustration or your PR group working overtime.

The more you plan up-front, the less you'll pay in late fees.

How to Use TCM

TCM exists and you will eventually pay it, even if you choose to ignore it early on. By evaluating TCM up-front, you can see what you're getting into before you've invested too much time, money, and even reputation in an ML/AI model.

As a concrete example, let's say that you plan to deploy a model to moderate comments on your website. On the plus side, a model will tirelessly operate 24x7. You can also convince yourself that the model will scale far better than human labor because the marginal cost of executing predictions is near-zero. This is our original, naive calculation of TCM.

For the more honest look at TCM, let's also factor in the myriad ways things can go wrong. Besides the model's built-in error rate that you learned during development, you have the people factor. When people are terrible online, they can be doggedly persistent in trying to circumvent any moderation you have in place. You can expect them to find ways to fool the model, which means their horrible comments get through (a reputational risk for your company plus a poor experience for your site's visitors). You'll then wind up in a cat-and-mouse game as you keep re-training the model to close its loopholes (which pushes your maintenance costs beyond what you had planned).

You then decide to employ more human moderators to compensate for the model's weak spots. With enough up-front planning -- most notably, learning early on that people can be terrible online -- you could have factored in this cost from the start and possibly even changed how you would use the model. But since you didn't, you wind up paying late fees on the TCM because you have to do all of this work in the heat of the moment.

In the end, even if the model saves you some money over using human moderators, it won't save you nearly as much as you'd originally thought. Wouldn't you rather have sorted this out early on? That's the value in calculating TCM up-front.

Getting Beyond the Sticker Price

A well-worn statistic in the software development field is that 75-80% of the TCO of custom software is in the long-term maintenance. Companies that only look at the cost of initial development are bound for a painful sticker shock later.

The same holds true for the world of ML/AI. Before rolling out a model, take the time to do your homework and calculate your true TCM.
