ML models and why chatbots don't really "think"


2023-07-03 | tags: data literacy generative AI ChatGPT

Image of a robot. Photo by Arseny Togulev on Unsplash

I've often said that generative AI tools like ChatGPT and Midjourney are just "replaying their training data." What does that mean? How should that knowledge set your expectations of what these tools can (and cannot) do?

Thinking like a model

Under the hood, those chatbots use what are called large language models (LLMs). That word "model" is a hint: an LLM is an oversized version of your standard machine learning (ML) model.

A quick primer on ML models, then, will lay the foundation for understanding chatbots. Let's start with some everyday pattern-finding:

You've just moved to a new town. After a few weeks of commuting to work, you get a feel for when traffic is heavier. You can use this knowledge – this understanding of the past and present – to make some educated guesses about the future. "If I leave a little later on Friday, traffic will be lighter, and I'll have a less-stressful drive." There are no guarantees that you're correct, mind you, but it's a pretty safe bet.

Congratulations: you have just simulated the work of a machine learning model. You did this by:

collecting some data points … and then …
backing them into a formula that you can use to …
fill in data points you don't have (from the future)

An ML model takes a more formal path than a person, and it can sift through a lot more attributes ("features") in search of patterns, but it's the same overall process:

collect data points: Those data points take the form of training data, records of past events that you will use to take a stab at seeing the future.
find the formula: An ML algorithm inspects the training data for (statistical) patterns, and saves those patterns into a model. You can then make new requests of the model (the formula).
fill in new data points: Now that we have a formula, we can use it to make predictions.

A prediction is another way of saying "a synthetic data point that we didn't have before." Specifically, an ML model's prediction gives you points that were not in the original training dataset but – here's the important bit – look like they could have been.
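To make those three steps concrete, here's a minimal sketch in plain Python. The commute numbers are made up for illustration, and the "algorithm" is a simple least-squares line, which is an assumption chosen for brevity rather than what any particular product uses. The names `training_data`, `fit_line`, and `predict` are hypothetical.

```python
# Step 1: collect data points.
# Hypothetical commute log: (minutes after 8:00 you left, minutes the drive took).
training_data = [(0, 42), (10, 40), (20, 35), (30, 31), (40, 28)]


def fit_line(points):
    """Step 2: find the formula (least-squares fit of y = slope * x + intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Step 3: fill in a data point we don't have."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(training_data)

# We never recorded a departure at 8:25, so this is a synthetic data point.
prediction = predict(model, 25)
print(f"Leaving at 8:25: roughly {prediction:.1f} minutes")
```

The predicted drive time was never in the log, but it sits comfortably among the values that were. That's the "looks like it could have been in the training data" property at a very small scale.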

Back to chatbots

Those LLM chatbots? They're oversized ML models. They're a larger version of the "collect training data, let an algorithm find the formula, replay that formula to fill in new data points" steps I described above. So when you use an LLM you are asking it to fill in a new data point based on the patterns it found in its training data.

The LLM won't return the exact documents on which it was trained (well, not usually -- specific text or images do indeed come up now and then, which adds fuel to some ongoing lawsuits), but you do get documents that look like they could have come from the training data.

Why this understanding matters

Why is it important to understand the mechanics of these AI chatbots?

1/ While a chatbot's underlying LLM is indeed "creating" text or images, it won't stray too far from its training data. It can't. Its entire job is to return documents or images that (on a statistical level) look like they could have been in the training data.

That said…

2/ The LLM doesn't see "facts" or "logic" or "train of thought" when it generates text. It only sees the linguistic patterns found in its training data. When I said that the ML algorithm looks for patterns in the data, I glossed over an important step: it first transforms all of that text data into mountains of numbers.

I'll spare you the technical details. You can do a web search for "vectorization" or "embeddings" if you're curious. But as far as the LLM is concerned: "Word 142 is usually followed by word 435345 or maybe 324, which is sometimes followed by word 798 or even 32498384." It has no idea of what combination of words might be considered awkward, controversial, or socially acceptable.
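Here's a toy sketch of the "word 142 is usually followed by word 435345" idea. It's heavily simplified: real LLMs work on sub-word tokens and learn their patterns with neural networks, not with a lookup table of counts. The tiny corpus and the names `vocab` and `followers` are assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, standing in for the training text.
corpus = "the cat sat on the mat . the dog sat on the rug ."
words = corpus.split()

# Turn each distinct word into a number (assigned in first-seen order).
vocab = {}
for word in words:
    vocab.setdefault(word, len(vocab))
ids = [vocab[word] for word in words]

# Count which number follows which. Note that only numbers are involved here.
followers = defaultdict(Counter)
for current_id, next_id in zip(ids, ids[1:]):
    followers[current_id][next_id] += 1

the_id = vocab["the"]
print(f"word {the_id} is followed by: {dict(followers[the_id])}")
```

Nothing in `followers` records what a cat or a rug is. It only records which numbers tend to come after which other numbers.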

Most of all …

3/ This explains why chatbots seem to make things up or "hallucinate." (I prefer the terms "fabricate" or "lie" but … fine.) Since the underlying LLM has no concept of facts or logic, and since it's trying to replay the grammatical patterns from its training data, it will sometimes create text that is grammatically correct but complete and utter nonsense.

It's possible to train an LLM chatbot on text that is factually incorrect, and it will emit a lot of text that is similarly incorrect. But it's also possible for it to hallucinate when it's been fed a diet of plain truthful material. "John Smith drove his tractor to the moon" could easily come from an LLM that was trained on documents about space exploration and farming.
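The tractor example can be sketched with the same kind of toy word-pair model. This is an illustration under strong assumptions, not how a production LLM works: the two training sentences are invented, and "plausible" here simply means "every adjacent pair of words appeared somewhere in the training text." The names `word_pairs` and `looks_plausible` are hypothetical.

```python
# Two made-up training sentences. Treat both as truthful.
training_sentences = [
    "John Smith drove his tractor to the barn",
    "the astronauts flew the rocket to the moon",
]


def word_pairs(sentence):
    """Return the set of adjacent word pairs in a sentence."""
    words = sentence.split()
    return set(zip(words, words[1:]))


# The only "knowledge" this toy model keeps: which word pairs it has seen.
seen_pairs = set()
for sentence in training_sentences:
    seen_pairs |= word_pairs(sentence)


def looks_plausible(sentence):
    """True if every adjacent word pair was seen in the training text."""
    return word_pairs(sentence) <= seen_pairs


candidate = "John Smith drove his tractor to the moon"
print("plausible by pattern:", looks_plausible(candidate))
print("present in training data:", candidate in training_sentences)
```

The candidate sentence passes the pattern check even though neither training sentence says it. Both sentences share the pair "to the", and that shared pattern is enough to stitch a farm trip onto a moon landing.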

What this means for you

The sum total is that an AI chatbot (and its underlying LLM) doesn't really "know" anything. It doesn't have opinions and certainly isn't trying to convince you of any political view. The chatbot is simply giving you a block of text that could have plausibly fit into its training dataset. But the notion of "plausible" is based on grammatical patterns and not logical trains of thought.
