
2020-12-21 | tags: AI data literacy

Understand which aspects of your ML/AI shop can (and cannot) give you an edge over the competition.

Businesses desire a "moat," or competitive advantage, in order to stay ahead of their peers. This usually translates to something that is meaningfully unique: "unique" in that you have exclusive access to some widget or invention, or at least your competitors will have a tough time building an equivalent; "meaningful" in that you're able to use this unique something to improve your situation.

In most fields, what constitutes a competitive advantage is very clear: a patented device, say, or an exclusive supplier relationship. This is less obvious in ML/AI -- perhaps outright counterintuitive compared to what gets covered in the tech press -- so I'll take this opportunity to clear up some confusion.

To sort this out, I'll first have to walk through some terminology:

The ingredients of an ML/AI model

Companies typically focus their ML/AI efforts on developing machine learning models. A model attempts to predict customer activity, forecast prices, or segment consumers into groups for further analysis.

A model is the result of using code to feed training data into an algorithm, which we can express as:

code( algorithm + training data ) → model

(The popular press often uses the term "algorithm" when it really means "model." The algorithm is the series of steps, expressed as a mix of code and matrix algebra, that looks for generalizations in your training data. The model is what actually performs the predictions.)
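
To make that formula concrete, here is a minimal sketch. It uses a deliberately tiny, hand-written algorithm (nearest centroid) so that it runs on Python's standard library alone. The function name, the segment labels, and the numbers are all hypothetical; a real shop would call an off-the-shelf library here.

```python
from statistics import mean

def nearest_centroid_algorithm(training_data):
    """The "algorithm": steps that look for generalizations in training data.

    training_data is a list of (feature_value, label) pairs.
    Returns the "model": a function that maps a new value to a label.
    """
    # Group the feature values by label, then summarize each group by its mean.
    by_label = {}
    for value, label in training_data:
        by_label.setdefault(label, []).append(value)
    centroids = {label: mean(values) for label, values in by_label.items()}

    def model(value):
        # Predict the label whose centroid sits closest to the new value.
        return min(centroids, key=lambda label: abs(centroids[label] - value))

    return model

# Hypothetical training data: (monthly spend, customer segment)
training_data = [
    (10, "casual"), (15, "casual"), (20, "casual"),
    (90, "loyal"), (100, "loyal"), (110, "loyal"),
]

# code( algorithm + training data ) -> model
model = nearest_centroid_algorithm(training_data)

# The model, not the algorithm, is what performs the predictions.
print(model(12))
print(model(95))
```

Note that the algorithm is the reusable recipe, while the model is the specific thing that comes out once the recipe has seen your data.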

When we consider the elements of that formula -- algorithm, code, training data, model -- which are unique?

It's probably not the algorithm ...

Your company most likely uses off-the-shelf, open-source algorithms (such as those implemented in TensorFlow or scikit-learn) to build your models. Every other company has access to these same tools because they are publicly available. Therefore, these algorithms do not constitute a competitive advantage.

What if you've come up with a novel algorithm? This might grant you a competitive advantage. The catch is that someone else, with no knowledge of what you have built, may develop a very similar algorithm because they are facing a similar problem. Be careful about assuming -- and then relying on -- a uniqueness that you cannot prove.

Since the algorithm is probably not your source of competitive advantage, that leaves us the code, the training data, and the model. Any uniqueness in there?

It's not the code, either ...

Your data scientists and machine learning engineers write code (usually Python or R) to build the models. This code transforms your data into the format the algorithm needs and then feeds it to the algorithm for training. Both that data format and the API calls that invoke training are dictated by the algorithm.

When you use off-the-shelf algorithms, that code will, by necessity, look the same wherever it is written. Most ML/AI code looks the same from project to project and from company to company, even if those companies are in different industry verticals with different problems. Field names may change, or maybe this company pulls data from a database while that company loads data from flat files ... but that's about it.

All of this is because the code's general shape is determined by the idioms of the programming language at hand and the algorithms you employ. This will be a close relation to -- in many cases, "a modified form of" -- the algorithm's (public) API documentation. Copy, paste, small-tweaks-to-fit-my-situation. This code you write to build your models is, therefore, not your competitive advantage.
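
A rough sketch of what that sameness looks like in practice. Everything here is hypothetical: the two companies, their field names, the made-up records, and `fit_nearest_mean`, which is a toy stand-in for whatever off-the-shelf algorithm a team would really call.

```python
import csv
import io
from statistics import mean

def fit_nearest_mean(values, labels):
    """Toy stand-in for an off-the-shelf algorithm (an assumption, not a real API).

    Predicts 1 when a value is at least as close to the mean of the
    positive examples as it is to the mean of the negative examples.
    """
    pos_mean = mean(v for v, y in zip(values, labels) if y == 1)
    neg_mean = mean(v for v, y in zip(values, labels) if y == 0)
    return lambda value: 1 if abs(value - pos_mean) <= abs(value - neg_mean) else 0

def build_model(rows, feature_field, label_field):
    """The part every shop writes: reshape the data, then call the algorithm."""
    values = [float(row[feature_field]) for row in rows]
    labels = [int(row[label_field]) for row in rows]
    return fit_nearest_mean(values, labels)

# Company A: a retailer loading a flat file (inlined here as a string).
retail_csv = "basket_total,churned\n12.5,1\n15.0,1\n80.0,0\n95.5,0\n"
retail_rows = list(csv.DictReader(io.StringIO(retail_csv)))
churn_model = build_model(retail_rows, "basket_total", "churned")

# Company B: a lender whose rows came from a database query.
loan_rows = [
    {"debt_ratio": 0.9, "defaulted": 1},
    {"debt_ratio": 0.8, "defaulted": 1},
    {"debt_ratio": 0.2, "defaulted": 0},
    {"debt_ratio": 0.1, "defaulted": 0},
]
default_model = build_model(loan_rows, "debt_ratio", "defaulted")

print(churn_model(20.0), default_model(0.3))
```

Different industries, different data sources, different field names, and yet `build_model` is called the same way both times.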

This flies in the face of conventional wisdom, which tells companies that "any code written on our behalf counts as extremely valuable intellectual property (IP)," or that "any code written here can't possibly look like code written elsewhere, because our situation is unique!" (That's also the case for, say, application development code ... but that's a story for another day.) In ML/AI, as in many other fields, "conventional wisdom" does not always live up to its name.

Maybe it's the training data?

Simplified for brevity, a dataset can be either accessible or proprietary.

Accessible data is available to pretty much anyone, such as that pulled from a government data portal. It therefore fails the uniqueness challenge. You can't really claim (and, in a legal sense, defend) ownership of it. Accessible data does not constitute a competitive advantage in the world of ML/AI.

Even if you've recorded some data yourself -- say, "daily rainfall in my city" or "number of cars that pass my street every day" -- if it is something that someone else could have been recording all these years, it's accessible data.

Proprietary data is the data to which you have exclusive access. You've probably created it yourself as a side-effect of your business activity: recording your customers' interactions with your website, data from your sensors, or whatever.

Your proprietary data belongs to you, so it's unique in a sense, but there is a catch. If a competitor can get access to an analog of that dataset, something that is sufficiently similar to serve as a stand-in for your proprietary dataset, then you lose the competitive advantage here.

In short, your training dataset might be a competitive advantage. But don't rely on that.

It has to be the model, right?

Remember our formula for creating an ML/AI model:

code( algorithm + training data ) → model

If you're using off-the-shelf algorithms (as mentioned earlier, you probably are) then the uniqueness of your model depends solely on the uniqueness of your training data. If only you have access to your training data, and if no one else has access to a meaningful analog of that training data, then your model will be unique.

Before you celebrate, there's a caveat: your model may be unique, yes. But remember that your goal is not the model itself. Your goal is the model's predictions, which you then apply to some business problem. It's too easy to get wrapped up in the model-development exercise and treat the model's deployment as the finish line. It most certainly is not. And developing that kind of tunnel vision will cost you.

You and your competitors all have the same business problem, and you're all applying ML/AI to that problem. So while your model is still unique to you, the overall solution ("generate predictions for Situation XYZ") is not. It's entirely possible for those peer companies to independently develop models (using different training data). It's also possible that those models perform as well as yours. In that case, you're all using different tools but to the same effect.

The real advantage

Where does this all lead? Aside from possibly your training data, your company's greatest competitive advantage in the ML/AI space is a combination of ... speed and retention. If you and your peer companies are all using the same tools and writing very similar code to tackle the same problem, then you need to move quickly to stay ahead of them. First-mover advantage is not a guarantee of long-term success, but it is quite a head start.

My definition of "speed" here is not so much "shorter times to process data," but "the ability to quickly turn an idea into a trained, performant model." In turn, that is a function of how you've built your ML/AI shop:

Do you have the infrastructure in place so that researchers can quickly test ideas? If your researchers can't reliably access training data, or if the fields aren't properly documented, your team will waste a lot of time before they actually get to developing a model.
Have you built a team of experienced data scientists who understand your business? That will reduce the time that business stakeholders spend explaining ideas to your data science team. As a bonus, it increases the chances that your data scientists will come up with useful ideas on their own, instead of being prompted by someone else.
Did your company develop (and periodically refresh) a data strategy that outlines the ways in which this company can apply data to its business challenges? If you do that up-front, your data team will be able to get to work faster because they'll already have ideas on where to start.

Above and beyond any ML/AI you can build, the real moat for your business is a loyal customer base for your company's products. If a customer doesn't want to leave you, it doesn't matter what technology your competitors have built.

Consider eBay: anyone can build an auction site, but eBay has become the trusted name in online auctions due to its millions of reviews of customers and merchants. This isn't just a rich, proprietary dataset on which to build fraud-detection systems; the reviews make people feel safer and more confident when using eBay to buy and sell goods. In turn, this encourages adoption and continued use of the platform.

Finding signal in the noise

It's too easy to chase after ML/AI issues that don't truly matter. Knowing what your true competitive advantage is will help you understand where to focus your company's ML/AI energy and keep you on track for success.
