AI chatbots like ChatGPT and Bard have earned a reputation for saying outrageous, offensive things.
These statements might be shocking to experienced AI practitioners, but they’re hardly a surprise. When you build models for a living, you quickly learn that every model will be wrong now and then. That’s just par for the course.
The wider public has been sold on a different idea. They’ve been told that AI is infallible, even magical. The high-profile failures of generative, LLM-based chatbots make for a harsh reality check.
And maybe that’s a good thing? The weaknesses in these chatbots stem from issues that are common to all AI models, not just the generative variety. Watching ChatGPT stumble paves the way for increased understanding of AI truths.
I’ll walk through six of those truths here. But first, I’ll explain how this all happened.
AI, in its various incarnations and rebrandings, had long been the domain of corporate entities. Building a predictive model required that you assemble a team of high-priced data scientists, hand them a mountain of data, and patiently wait for results.
Public-facing generative models have demolished that barrier to entry. DALL-E, soon followed by Stable Diffusion, ChatGPT, and a host of other services, put this powerful AI in the hands of anyone who could type into a web form. No data science team needed.
Democratizing access in this fashion has led to two shifts in the public perception of AI. For one, the term “AI” now refers, in popular usage, almost solely to generative AI. It’s not unlike how neural networks came to be synonymous with “machine learning” a few years back, relegating their forebears to the moniker “classical” ML.
Two, this has stress-tested the myth of AI being magical:
- People can now see, with their own eyes, the model’s outputs. They don’t have to hear about the results secondhand, filtered through some hand-waving corporate entity that glosses over model performance metrics and test procedures.
- Millions of people are able to poke at the models. If enough people use a system, they’ll collectively uncover the weak points and corner cases. This is why we see so many “I can’t believe what ChatGPT said!” articles alongside the “Here’s how ChatGPT does my job for me” variety.
- Some of these people want to see the models slip. They aren’t so much stumbling across the corner cases as they’re trying to induce them. A clever idea here, a few tweaks there, and you get screen caps that feed the aforementioned “I can’t believe what ChatGPT said!” hand-wringing.
These problems are not unique to generative AI. Every AI model, whether it’s predicting house prices or classifying documents, is going to goof now and then. Some will goof more than others.
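To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The traffic volume and error rate are made-up figures for illustration, not measurements of any real system, and `expected_errors` is a hypothetical helper name.

```python
def expected_errors(requests: int, error_rate: float) -> int:
    """Estimate how many wrong outputs to expect for a given volume of requests."""
    if requests < 0 or not 0.0 <= error_rate <= 1.0:
        raise ValueError("requests must be >= 0 and error_rate between 0 and 1")
    return round(requests * error_rate)


# Assumed figures: a model that is wrong 1% of the time, serving one million requests.
print(expected_errors(1_000_000, 0.01))  # 10000
```

Even a model that is right 99% of the time would produce thousands of goofs at that scale, which is plenty of raw material for screen caps.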
These flaws have their roots in six key truths about how AI models work.
An AI model represents a set of patterns that were uncovered in a set of training data. The more training data you throw into that model, the greater the chances it will grasp meaningful patterns. That’s part of why the first “L” in the term LLM stands for “large”: it reflects the sheer amount of training data that went into the model, along with the size of the model itself.
You don’t get that eerily humanlike, ChatGPT level of performance from a tiny dataset. If you want a model that sounds like it’s speaking clear, essay-quality language, you need a sizable training dataset to match. That leads me to the next AI truth:
The lucky companies have developed a proprietary dataset for model training. The rest have to purchase the data from third parties, or build a system to collect that data, or scrape it from various websites. (This is an especially acute issue for startups, which usually start with no data at all.)
- Building your own system means having to wait for it to populate with data. That slows your time-to-market.
- Purchasing data comes with a healthy dose of third-party vendor risk: you rely on the supplier to be reputable, to give you quality data, and to not go out of business. Oh, and you hope that their data collection methods aren’t riding on the fine edge of data privacy laws. A small shift in regulation can ruin them – and by association, you – overnight.
- Scraping carries a host of legal issues, since it usually means that you’re gathering the data against the wishes of the person or company that has put it online. That injects uncertainty into your plans because you’re building on data that might get you in trouble. Some courts have sided with the scraper, at least on certain claims (see the LinkedIn/hiQ affair), sure. Other cases, most notably the several lawsuits currently being waged against ChatGPT parent OpenAI, are still pending.
When it comes to datasets, remember the three Ps: “proprietary” beats “public-domain” beats “pilfered.”
Training data is the lifeblood of an AI project. It represents a model’s entire “knowledge” and “world view.” Using bad data will lead to subpar model results.
That said, the definition of “bad data” is rather broad. (I coordinated an entire book on that theme some years ago…) You can have a dataset that is clean and complete, for example, yet entirely unsuitable for the task at hand.
For a predictive model, that would mean data that has no connection to whatever you’re trying to predict – no “signal.” In a chatbot, bad data would include factually incorrect statements (which help the model emit nonsense), dangerous know-how (so the model can teach people how to make napalm), or disturbing views (like Bard touting the “benefits” of slavery). A little curation to weed out the bad data will go a long way.
Speaking of curation:
The popular move these days is for companies to toss all of their data into a data lake. The hope is that mixing all of their data together will yield some novel insights. This isn’t necessarily a bad idea, but it’s possible for a model to see too much data during training.
Let’s say your company’s fancy new generative AI model has hoovered up the entire data lake. It now has knowledge of HR records, trade secrets, and pending deals. It also has no qualms about sharing those details with anyone who asks. That’s probably not what you want, but it’s what you’ve built.
It’s also possible to train a model on data that is simply off-limits. Companies in the consumer lending space, for example, are forbidden from using attributes such as race or gender to evaluate an applicant’s creditworthiness. If your company’s data scientists are unaware of these restrictions and use those features in a predictive model, you could land in serious legal trouble.
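One simple guardrail is to strip off-limits attributes before the data ever reaches the training step, and to fail loudly if one slips through. The Python below is a minimal sketch of that idea. The column names, the `PROTECTED_ATTRIBUTES` set, and both helper functions are hypothetical, and the real list of forbidden attributes depends on the regulations that apply to your business.

```python
# Hypothetical column names; the actual off-limits list depends on
# the laws and regulations that apply to your business.
PROTECTED_ATTRIBUTES = {"race", "gender"}


def select_training_features(record: dict) -> dict:
    """Return a copy of the record without any protected attributes."""
    return {
        name: value
        for name, value in record.items()
        if name.lower() not in PROTECTED_ATTRIBUTES
    }


def assert_no_protected_features(feature_names) -> None:
    """Raise an error if a protected attribute is about to reach training."""
    leaked = {name for name in feature_names if name.lower() in PROTECTED_ATTRIBUTES}
    if leaked:
        raise ValueError(f"Protected attributes in training data: {sorted(leaked)}")


# Made-up applicant record, for illustration only.
applicant = {"income": 52000, "debt_ratio": 0.31, "race": "...", "gender": "..."}
features = select_training_features(applicant)
assert_no_protected_features(features)  # passes: only income and debt_ratio remain
```

Dropping the columns is a starting point, not a guarantee. Other fields can act as stand-ins for a protected attribute, so a check like this complements a proper compliance review rather than replacing it.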
Most data teams use the same well-known, open-source toolkits to train and invoke their AI models. The code you write for TensorFlow, Torch, or spaCy will therefore look very much the same across companies and even across industry verticals. The code is certainly important – you need it to generate the models, after all – but it’s hardly special.
How do you develop a meaningful, game-changing, moat-quality AI asset? A proprietary training dataset is your best shot. If only you have access to that data, only you can build that particular model.
(A competitor may still build a similar model from a completely different dataset, but that’s a story for another day.)
AI’s bread and butter is pattern-matching. But those patterns need to be present in its training dataset. If a task requires additional context or nuance, the model is of no use. You’ll need a human in the loop.
Content moderation is one such example. Running a large platform means you pretty much have to rely on AI to catch unsuitable content because there’s just too much for a team of people to sift through by hand; but human expression leaves a lot of room to confuse a model. Video game maker Activision’s solution is to use AI to flag questionable material, which a human moderator can then review and decide on a course of action.
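The general flag-then-review pattern can be sketched in a few lines of Python. This is a generic illustration and not a description of Activision’s system: the `triage` function, the `REVIEW_THRESHOLD` value, and the idea that the model emits a score between 0 and 1 are all assumptions made for the example.

```python
from collections import deque

# Hypothetical threshold; a real system would tune it against reviewed samples.
REVIEW_THRESHOLD = 0.20

# Items waiting for a human moderator, stored as (item_id, score) pairs.
review_queue = deque()


def triage(item_id: str, score: float) -> str:
    """Route content using the model's estimated probability that it is unsuitable.

    The model only flags. It never removes content on its own.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= REVIEW_THRESHOLD:
        review_queue.append((item_id, score))
        return "human_review"
    return "allow"


print(triage("post-1", 0.03))  # allow
print(triage("post-2", 0.71))  # human_review
```

The design choice here is that the model narrows the pile while a person makes the final call on anything it flags.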
LLM chatbots, since they’re typically intended to operate in real-time, interactive fashion, don’t leave that kind of room for human intervention. You’ll want to have a long think before letting a chatbot handle sensitive tasks, such as helping people navigate eating disorders.
I’ve often joked that some people will curse weather forecasts for being wrong so often, just seconds before they proclaim their faith in AI. The two are based on the same principles – finding patterns in data to predict future activity – and are subject to the same flaws.
We can draw a similar parallel between generative models and their predictive ML model siblings. Now that the wider public has experienced generative AI up close, they are in a better position to understand the benefits, drawbacks, capabilities, and limits of all AI.
My hope is that this provides a boost to data literacy, leading individuals and companies alike to make better decisions about when to use AI models.