What you see here is the last week’s worth of links and quips I have shared on LinkedIn, from Monday through Sunday.
For now I’ll post the notes as they appeared on LinkedIn, including hashtags and sentence fragments. Over time I might expand on these thoughts as they land here on my blog.
AI chatbots represent a spectrum of reputation risk. Listed in order of brand safety, from least risky to most:
- your bot, available for general question-answering and conversation (ChatGPT)
- your bot, speaking on behalf of your company’s brand (customer service bots)
- your bot, speaking as you
No one wants the bot, wearing their logo, to say terrible things (see the recent issue with Google Bard expressing extremist views) but you might be able to wave that off as a technical glitch. It’s tougher to do that when your grocery store’s recipe bot is telling customers to serve up a dish of poison.
So what about when the bot isn’t just wearing your brand, but also your face?
The AI chatbot mentioned in this article hasn’t said anything objectionable (at least, not as far as I know). But, given the crazy things AI bots have already said over the past several months, it’s entirely possible.
If and when it does, I expect the fallout will get ugly.
I’ve been thinking a lot about risks in generative AI. Michele Wucker kindly offered me a spot to talk about a rather subtle one. My guest post on The Gray Rhino blog:
I’ve been in the tech industry long enough to have seen some recurring patterns.
In my latest piece for O’Reilly Radar, “Structural Evolutions in Data,” I explore the different waves of the field we currently call “AI.” I also offer ideas on what the next wave might be.
(Huge thanks to the (anonymous) friend who gave me the term “structural evolutions” to describe the fractal nature of this phenomenon.)
Are you building an AI model? A robust, high-quality training dataset will play a key role in that model’s performance. If you do it right, it’ll also keep you out of trouble.
How do you develop a suitable training dataset, then? Over the coming days I’ll offer seven questions you can use to guide your efforts.
Today we’re starting with:
How well can we match the real world?
A training dataset should represent what the model will see when it’s operating “in the wild.” But your definition of that operating environment matters.
Let’s say you’re building a model to predict customer churn. You might say, “Customers are customers, right? Easy.” Maybe. But probably not. If you’ve trained the model on customer data from your American market, then it will make its best predictions on customers in the American market. I doubt this model will do as well at predicting churn in your European or Asian markets.
If you think through the specifics of what the model is intended to do, and where it’s meant to do it, you’ll stand a chance of developing a suitable training dataset.
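One way to make this concrete is to compare the segments in your training data against the segments you expect to see once the model is live. The following is only a minimal sketch under assumptions of my own: the `market` field, the sample rows, and the helper names are all hypothetical, and a real check would look at far more than one field.

```python
from collections import Counter


def segment_shares(records, key):
    """Return each segment's share of the records, e.g. {"US": 1.0}."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def missing_segments(training, deployment, key):
    """Segments the model will meet in deployment but never saw in training."""
    seen = set(segment_shares(training, key))
    expected = set(segment_shares(deployment, key))
    return sorted(expected - seen)


# Hypothetical data: every training row comes from the American market.
training_rows = [{"market": "US", "churned": c} for c in (0, 1, 0, 0)]
deployment_rows = [{"market": m} for m in ("US", "EU", "EU", "APAC")]

print(segment_shares(training_rows, "market"))
print(missing_segments(training_rows, deployment_rows, "market"))
```

An empty result from `missing_segments` doesn’t prove the dataset is suitable. It only tells you that no market is entirely absent.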
I’m sharing seven questions that will help you to build a robust AI training dataset and stay out of trouble. Today is Day 2:
What aspects of “the real world” matter for the problem at hand?
A model looks for patterns in the features (attributes) of each entity represented in its training data. You might use the features of square footage, number of bedrooms, and last year’s property taxes to predict a home’s sale price. A mobile phone carrier could try to predict subscriber churn based on a customer’s monthly spend, how many people they call, and how long they speak on each call.
This process of choosing features is called feature engineering and it is both an art and a science. You won’t know which features have meaningful predictive power until you’ve trained and tested a model; but you still need to ask yourself which features might make sense for the question at hand. Choose wisely.
(You’ll also need to think about which features you can actually access, and which are off-limits for use in models. I’ll cover both later in this series.)
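To illustrate the mobile carrier example, here is a rough sketch of writing down candidate features and keeping only the ones you can both obtain and are allowed to use. The `CandidateFeature` class, the field names, and the two excluded features are hypothetical and exist only for illustration. This does not tell you which features are predictive; you still need to train and test a model for that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateFeature:
    name: str
    accessible: bool  # can we actually obtain this field?
    permitted: bool   # are we allowed to use it in a model?


def usable_features(candidates):
    """Keep the names of features that are both accessible and permitted."""
    return [c.name for c in candidates if c.accessible and c.permitted]


def to_row(entity, feature_names):
    """Turn one entity (a dict) into an ordered list of feature values."""
    return [entity[name] for name in feature_names]


# Hypothetical candidates for a subscriber-churn model.
candidates = [
    CandidateFeature("monthly_spend", accessible=True, permitted=True),
    CandidateFeature("distinct_numbers_called", accessible=True, permitted=True),
    CandidateFeature("avg_call_minutes", accessible=True, permitted=True),
    CandidateFeature("household_income", accessible=False, permitted=True),
    CandidateFeature("postal_code", accessible=True, permitted=False),
]

feature_names = usable_features(candidates)
subscriber = {
    "monthly_spend": 55.0,
    "distinct_numbers_called": 12,
    "avg_call_minutes": 3.5,
    "postal_code": "60614",
}
row = to_row(subscriber, feature_names)
print(feature_names)
print(row)
```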
I’m sharing seven questions that will help you to build a robust AI training dataset and stay out of trouble. Today is Day 3:
How do we make this dataset more robust?
This is another aspect of “matching the real world.” One way to make a dataset more robust is to represent the full range of values that each feature can take on.
In our house-pricing example, consider a training dataset that only includes two-bedroom houses. The model built from this dataset will work best at predicting prices … of two-bedroom houses. In other words, your model will express unintentional bias in its predictions because it’s only seen a narrow sliver of what’s out in the real world. It will likely struggle to predict prices for three-bedroom houses, high-rise condo units, or ten-bedroom mansions.
Unintentional bias often crops up in image-recognition models. You may have collected tens of thousands of training images, sure. But do those images reflect the full range of human hair styles, skin tones, headdress, and eyewear, as observed under various lighting conditions and camera angles? No? Then a model trained on them is probably of limited use in a general-purpose smartphone photo app or videoconferencing system.
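For numeric features like those in the house-pricing example, a simple range check can flag inputs the training data never covered. This is a sketch with made-up rows and hypothetical helper names. It only looks at minimum and maximum values, so it will miss gaps inside a range, and it says nothing about image data.

```python
def feature_range(rows, feature):
    """Smallest and largest value of one feature across the training rows."""
    values = [row[feature] for row in rows]
    return min(values), max(values)


def out_of_range(rows, candidate, features):
    """List the features where `candidate` falls outside what training covered."""
    flagged = []
    for feature in features:
        low, high = feature_range(rows, feature)
        if not low <= candidate[feature] <= high:
            flagged.append(feature)
    return flagged


# Hypothetical training data: two-bedroom houses only.
training_rows = [
    {"square_feet": 850, "bedrooms": 2},
    {"square_feet": 1200, "bedrooms": 2},
    {"square_feet": 990, "bedrooms": 2},
]
mansion = {"square_feet": 9000, "bedrooms": 10}
similar_house = {"square_feet": 1000, "bedrooms": 2}

print(out_of_range(training_rows, mansion, ["square_feet", "bedrooms"]))
print(out_of_range(training_rows, similar_house, ["square_feet", "bedrooms"]))
```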
I’m sharing seven questions that will help you to build a robust AI training dataset and stay out of trouble. Today is Day 4:
Where’s the leak?
A feature leak is a situation in which the thing you’re trying to predict (or a very close proxy thereof) creeps into the features you’re using to make the predictions. Not only does this give you false confidence in your model’s performance – you’ve literally handed it the answer! – but this feature won’t be present when the model operates in the real world.
Say, for example, you’re trying to predict today’s interest rates and your training dataset includes tomorrow’s interest rates as a feature. The model will pass internal tests with flying colors but then fail to work in the real world. It will ask you for tomorrow’s interest rate as a feature. And if you had that, you wouldn’t need a model, would you?
Feature leaks don’t always manifest in obvious ways. One model, trained to spot cancer, picked up on a dermatologist’s ruler, which was only present in images of tumors. Was the ruler a useful feature for distinguishing healthy tissue from tumors? Absolutely. Would this ruler be present in a real-world scenario? Absolutely not.
It helps to perform a thorough review of your training data to limit your chances of a feature leak.
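As one small part of such a review, you can flag any numeric feature that tracks the target almost perfectly. The sketch below uses invented interest-rate numbers, a hypothetical `inflation_pct` column, and an arbitrary threshold of 0.98. Treat it as a heuristic only: it would not have caught something like the ruler, and a flagged feature still needs a human to decide whether it is a leak.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot track the target
    return cov / sqrt(var_x * var_y)


def suspicious_features(rows, target, threshold=0.98):
    """Names of features whose correlation with the target looks too good."""
    ys = [row[target] for row in rows]
    flagged = []
    for name in rows[0]:
        if name == target:
            continue
        xs = [row[name] for row in rows]
        if abs(pearson(xs, ys)) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical rows: rate_tomorrow is the leak, inflation_pct is not.
today = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
tomorrow = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
inflation = [2.0, 3.1, 2.4, 3.0, 2.2, 3.3]
rows = [
    {"rate_today": t, "rate_tomorrow": n, "inflation_pct": i}
    for t, n, i in zip(today, tomorrow, inflation)
]

print(suspicious_features(rows, target="rate_today"))
```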
I’m sharing seven questions that will help you to build a robust AI training dataset and stay out of trouble. Today is Day 5:
Where will this data come from?
You can talk about data all you like; but until you can actually get access to that data, you can’t build a model from it.
Do you already have this data in-house? If not, will you have to purchase it from an upstream data vendor? Will they charge you a small fortune for the privilege?
When you can’t get access to the data outright, sometimes you can get creative. Are there proxies for a given feature or dataset that are easier to acquire? Or can you simply do without a certain feature and see how far you get?
On a related note, you’ll likely have to combine data from several sources to make a single training dataset. Do yourself a favor and keep track of where each field comes from. When a customer request, regulator, or verdict compels you to remove certain data from your model, you want that to be as simple as possible: “OK, we’ve removed those records from the source data. We can rerun the data pipelines to generate the training dataset, then rebuild the model.”
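Here is a minimal sketch of that idea: build the training rows from named sources, record which source supplied each field, and handle a removal request by dropping the records upstream and rerunning the build. The source names (`crm`, `vendor_feed`), fields, and customer IDs are hypothetical, and a real pipeline would also track provenance per record and per version.

```python
def build_training_rows(sources):
    """Merge per-customer fields from several sources.

    Returns the merged rows plus a map of field name -> source name.
    If two sources supply the same field, the later source wins.
    """
    rows, provenance = {}, {}
    for source_name, records in sources.items():
        for customer_id, fields in records.items():
            rows.setdefault(customer_id, {}).update(fields)
            for field in fields:
                provenance[field] = source_name
    return rows, provenance


def without_customer(sources, customer_id):
    """Copy of the sources with one customer's records removed everywhere."""
    return {
        name: {cid: f for cid, f in records.items() if cid != customer_id}
        for name, records in sources.items()
    }


# Hypothetical source data keyed by customer ID.
sources = {
    "crm": {"c1": {"monthly_spend": 42.0}, "c2": {"monthly_spend": 18.5}},
    "vendor_feed": {"c1": {"credit_band": "B"}, "c2": {"credit_band": "A"}},
}

rows, provenance = build_training_rows(sources)
print(provenance)

# A removal request for c1: drop the records upstream, then rebuild.
rebuilt_rows, _ = build_training_rows(without_customer(sources, "c1"))
print(rebuilt_rows)
```

After the rebuild you would retrain the model from `rebuilt_rows`, so the removed records no longer influence it.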