New Zealand supermarket Pak’nSave has developed an AI bot to help customers use up their leftovers. Tell it what you have, and it will offer ideas to turn it into a meal.
The bot has offered up a few less-than-stellar ideas: recipes that range from "probably not going to work" ("Oreo vegetable stir-fry," anyone?) to "actively dangerous" (chlorine gas). These goofs are bringing the store the wrong kind of publicity.
Why does the Pak’nSave bot churn out the occasional terrible idea? And what lessons can you apply to your company’s AI plans?
First, let’s talk about some of the technology at play.
While the linked Guardian article doesn’t specifically say that the Pak’nSave bot is built on a large language model – or LLM, the kind of AI that powers generative tools like ChatGPT – the description sure sounds that way. So that’s what I’ll run with here.
The Pak’nSave bot is a form of recommendation system, the same sort of tool that helps Netflix and Amazon present you with what you’re likely to watch or buy next. Recommenders have gone hand-in-hand with e-commerce since the early days, and they also appear on some news sites and social media platforms.
More recently, companies as varied as French grocery chain Carrefour and food delivery service DoorDash have been using LLMs to drive recommenders. The LLM-based systems use more powerful technology and more training data than their predecessors, so they should give better results. Sometimes, they do. Other times, they go very far afield. Why is that?
The magic of LLMs is rooted in the idea that human language follows certain probabilistic patterns. Some words are very likely to follow other words, and some phrases commonly follow other phrases. If you feed an LLM enough data, it can appear to “know” a lot about various subjects. It can also make linguistic connections that exhibit proper grammar but are socially inappropriate or factually incorrect. And since the bots lack any concept of facts or social context, they have no problem emitting complete garbage.
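To make the "probabilistic patterns" idea concrete, here is a toy sketch. (Illustrative only: real LLMs use neural networks trained on vast corpora and predict over subword tokens, not bigram counts over a handful of sentences.) The point is that the model only tracks which words tend to follow which – it has no notion of whether a continuation is a recipe or a hazard.

```python
from collections import Counter, defaultdict

# A toy "training corpus" (purely illustrative).
corpus = (
    "mix the flour with the eggs . "
    "mix the bleach with the ammonia . "
    "bake the cake in the oven ."
).split()

# Count which word follows which: a crude stand-in for P(next | current).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def most_likely_next(word):
    """Return the highest-count continuation for `word`."""
    return following[word].most_common(1)[0][0]

# The model continues either pattern with equal confidence; nothing in
# the counts distinguishes the edible mixture from the dangerous one.
print(most_likely_next("mix"))     # -> "the"
print(most_likely_next("bleach"))  # -> "with"
```

Scale that up to billions of parameters and trillions of tokens and you get fluent, grammatical text – but the underlying mechanism still has no concept of facts or safety.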
When the LLM emits text, as when generating an essay, we call this garbage output a hallucination.
A grocery LLM providing a recipe for chlorine gas? That’s the recommendation system equivalent of a hallucination: an idea based on patterns that the LLM uncovered in its training data, but which makes no sense in the real world. I don’t know a fancy term of art for this phenomenon, but that’s what it is.
All of this should give you pause before rolling out that LLM-based recommender project in your company. These four steps should help you avoid the common problems:
1/ Be mindful of building on a public, general-purpose LLM.
While OpenAI has been cagey about where it got the training data for ChatGPT, it’s likely that a good deal of it came from crawling the public internet.
That’s all well and good for a general-purpose conversational bot. It doesn’t work so well for a system with a very narrow focus, like picking recipes. So if you choose to extend ChatGPT or some other bot for your own purposes, understand that your system will have a lot of general-purpose knowledge under the hood: knowledge from sources you may not know about, and from domains that may be unsuitable for what you’re trying to present to your end-users. Like, say, explaining how to make napalm.
That takes me to the next point:
2/ If you build your own LLM from scratch, be sure to use your own (thoroughly curated) data.
The safest way to address the general-knowledge issue is to limit the bot’s training data to a very narrow, purpose-specific domain. This serves you in two ways:
- The bot will only “know” what you tell it. If the training dataset only includes edible items, say, it will only use edible ingredients in its suggestions. (No guarantee that the suggestions will be tasty, but that’s another story.)
- You’ll know exactly where the dataset came from and what it contains. You won’t get any pesky questions about copyright infringement. And you can make sure the dataset doesn’t contain any sensitive information like HR records, secret product designs, or credit card numbers.
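One lightweight guardrail along these lines is to validate every suggestion against the curated vocabulary before it reaches a customer. This is a sketch only – the `CURATED_INGREDIENTS` set and the list-of-ingredients recipe format are hypothetical stand-ins for whatever your catalog and bot output actually look like:

```python
# Hypothetical curated vocabulary: only items from the store's own
# edible-product catalog make it into this set.
CURATED_INGREDIENTS = {"flour", "eggs", "milk", "butter", "sugar", "rice"}

def check_suggestion(ingredients):
    """Flag any recipe that mentions an ingredient outside the curated
    set - including anything the model hallucinated."""
    unknown = set(i.lower() for i in ingredients) - CURATED_INGREDIENTS
    return len(unknown) == 0, unknown

ok, bad = check_suggestion(["flour", "eggs", "bleach"])
# ok is False and bad contains "bleach", so this recipe never ships.
```

Training on curated data reduces the odds of garbage; a post-hoc check like this catches what slips through anyway. Belt and suspenders.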
3/ Review the chatbot prompts and their outcomes. Look for trends.
The queries people enter into LLM chatbots are known as prompts. When you deploy a chatbot system, you can save both the prompts and the bot’s outputs for further analysis. (Just make sure that your terms of service allow you to do this.)
You should be inspecting those anyway, for quality control. As a bonus, a periodic review will help you spot interesting ideas and trends. A grocery store bot could yield popular ideas that are suitable for including in a static recipe catalog.
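As a sketch of what that review might look like – the log format and catalog here are made up – a few lines of analysis over saved prompts can surface which ingredients customers ask about most:

```python
from collections import Counter

# Hypothetical saved prompt log; a real deployment would read these
# from a database, collected in line with the terms of service.
prompt_log = [
    "what can I make with leftover rice and eggs",
    "ideas for leftover rice and chicken",
    "use up eggs and spinach",
]

# Tally mentions of catalog items to spot trends worth turning
# into hand-picked entries in a static recipe catalog.
catalog = {"rice", "eggs", "chicken", "spinach"}
mentions = Counter(
    word
    for prompt in prompt_log
    for word in prompt.lower().split()
    if word in catalog
)

print(mentions.most_common())  # rice and eggs lead this tiny sample
```

The same pass that flags bad outputs for quality control can feed this kind of trend report for free.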
In fact, such a grocery store could let employees test a chatbot internally to generate recipe ideas and post the best, hand-picked ones to its public website. That would spare them the trouble described in my next point:
4/ Expect people to tamper with the bot.
The harsh reality is that some people will feed a bot malicious prompts or otherwise try to misuse it. Others will innocently, accidentally induce the system to hallucinate. Either way, a public-facing, LLM-based AI chatbot exposes your company to a variety of risks.
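A first line of defense is to screen prompts before they ever reach the LLM. This is a sketch only – keyword matching is trivially easy to evade, so production systems typically pair a denylist like this (the terms here are hypothetical) with a dedicated moderation model:

```python
# Hypothetical denylist of terms that should never reach a recipe bot.
BLOCKED_TERMS = {"bleach", "ammonia", "napalm", "poison"}

def screen_prompt(prompt):
    """Return True if the prompt looks safe to forward to the LLM."""
    words = set(prompt.lower().split())
    return words.isdisjoint(BLOCKED_TERMS)

screen_prompt("what goes well with bleach")  # -> False: rejected
screen_prompt("leftover rice and eggs")      # -> True: forwarded
```

Screening inputs, filtering outputs, and logging both won’t stop a determined adversary, but each layer shrinks the window for both malicious misuse and innocent accidents.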
You’d do well to perform a thorough risk assessment of your company’s AI chatbot plans and then keep an eye on the bot once it is released into the wild. (My O’Reilly Radar piece, “Risk Management for AI Chatbots,” goes into detail on technical and non-technical steps you can take to reduce your exposure.)
What happened to Pak’nSave could happen to any company that deploys an LLM-driven chatbot. Pak’nSave is certainly not the first company to experience such an incident, nor will it be the last.
The important thing is for us to learn from these incidents. While there’s no such thing as a perfectly safe AI system, we can all take steps to reduce those systems’ potential to cause harm.