Building a training dataset is hard

Image of athletic gear (sneakers, dumbbells, and a jump rope) on a mat. Photo by Alexandra Tran on Unsplash.

When you build a machine learning (ML) model, you’re asking a machine to look for patterns in a training dataset. The model later uses those patterns to make predictions, generate text, or perform some similar task.

The training dataset therefore represents all of the model’s “knowledge” and “experience” – its entire world view. The better the training data, the better the model will perform, and the better off you’ll be. Building a good training dataset reduces various forms of AI-related risk.

What makes for a good training dataset, then? The specific answer depends on the problem at hand. Work through the following seven questions to find an answer for your situation:

1/ How well can we match the real world?

A training dataset should represent what the model will see when it’s operating “in the wild.” But your definition of that operating environment matters.

Let’s say you’re building a model to predict customer churn. You might say, “Customers are customers, right? Easy.” Maybe. But probably not. If you’ve trained the model on customer data from your American market, then it will make its best predictions on customers in the American market. I doubt this model will do as well at predicting churn in your European or Asian markets.

If you think through the specifics of what the model is intended to do, and where it’s meant to do it, you’ll stand a chance of developing a suitable training dataset.
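One rough way to test the "customers are customers" assumption is to compare a feature's distribution in the training data against a sample from the market where the model will run. The sketch below is my own illustration, not a prescribed method. The spend figures are invented, the function name is hypothetical, and the 0.25 cutoff is a judgment call rather than a standard.

```python
from statistics import mean, stdev

def standardized_mean_difference(train_values, target_values):
    """Gap between the two means, in units of pooled standard deviation."""
    gap = abs(mean(train_values) - mean(target_values))
    pooled_sd = ((stdev(train_values) ** 2 + stdev(target_values) ** 2) / 2) ** 0.5
    if pooled_sd == 0:
        # No spread in either sample: identical means match, anything else does not.
        return 0.0 if gap == 0 else float("inf")
    return gap / pooled_sd

# Hypothetical monthly spend per customer, invented for illustration.
us_monthly_spend = [42.0, 55.0, 48.0, 61.0, 50.0, 46.0]
eu_monthly_spend = [23.0, 31.0, 27.0, 35.0, 29.0, 25.0]

smd = standardized_mean_difference(us_monthly_spend, eu_monthly_spend)
needs_review = smd > 0.25  # assumed cutoff; tune it for your own data
print(f"standardized mean difference: {smd:.2f}, needs review: {needs_review}")
```

A large gap doesn't prove that the model will fail in the new market. It does tell you that the training data and the operating environment differ on that feature, which is worth investigating before you deploy.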

2/ What aspects of “the real world” matter for the problem at hand?

A model looks for patterns in the features (attributes) of each entity represented in its training data. You might use the features of square footage, number of bedrooms, and last year’s property taxes to predict a home’s sale price. A mobile phone carrier could try to predict subscriber churn based on a customer’s monthly spend, how many people they call, and how long they speak on each call.

This process of choosing and shaping features is called feature engineering, and it is both an art and a science. You won’t know which features have meaningful predictive power until you’ve trained and tested a model, but you still need to ask yourself which features might make sense for the question at hand. Choose wisely.

(You’ll also need to stick to features you can actually access, and steer clear of those that are off-limits for use in models. I’ll cover both in a moment.)

3/ How do we make this dataset more robust?

This is another aspect of “matching the real world.” One way to make a dataset more robust is to represent the full range of values that each feature can take on.

In our house-pricing example, consider a training dataset that only includes two-bedroom houses. The model built from this dataset will work best on predicting prices … of two-bedroom houses. In other words, your model will express unintentional bias in its predictions because it’s only seen a narrow sliver of what’s out in the real world. It will likely struggle to predict prices for three-bedroom houses, high-rise condo units, or ten-bedroom mansions.
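A simple coverage check can surface this kind of gap before you train anything. The following is a minimal sketch that assumes you can list the values you expect each feature to take in production. The function name, the rows, and the expected values are all hypothetical.

```python
def coverage_gaps(rows, expected_values):
    """Return, per feature, the expected values that never appear in rows."""
    gaps = {}
    for feature, expected in expected_values.items():
        seen = {row[feature] for row in rows if feature in row}
        missing = set(expected) - seen
        if missing:
            gaps[feature] = sorted(missing)
    return gaps

# Hypothetical training rows: nothing but two-bedroom houses.
training_rows = [
    {"bedrooms": 2, "property_type": "house"},
    {"bedrooms": 2, "property_type": "house"},
    {"bedrooms": 2, "property_type": "house"},
]

# Values we assume the model will meet once it is in use.
expected_values = {
    "bedrooms": [1, 2, 3, 4, 5],
    "property_type": ["house", "condo"],
}

gaps = coverage_gaps(training_rows, expected_values)
print(gaps)
```

This only works for features with a short, known list of values. For continuous features such as square footage, you would compare ranges or histograms instead.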

Unintentional bias often crops up in image-recognition models. You may have collected tens of thousands of training images, sure. But do those images reflect the full range of human hair styles, skin tones, headdress, and eyewear, as observed under various lighting conditions and camera angles? No? Then a model trained on them is probably of limited use in a general-purpose smartphone photo app or videoconferencing system.

4/ Where’s the leak?

A feature leak is a situation in which the thing you’re trying to predict (or a very close proxy thereof) creeps into the features you’re using to make the predictions. Not only does this give you false confidence in your model’s performance – you’ve literally handed it the answer! – but this feature won’t be present when the model operates in the real world.

Say, for example, you’re trying to predict today’s interest rates and your training dataset includes tomorrow’s interest rates as a feature. The model will pass internal tests with flying colors but then fail to work in the real world. It will ask you for tomorrow’s interest rate as a feature. And if you had that, you wouldn’t need a model, would you?

Feature leaks don’t always manifest in obvious ways. One model, trained to spot cancer, picked up on a dermatologist’s ruler, which was only present in images of tumors. Was the ruler a useful feature, as far as discerning healthy tissue from tumors? Absolutely. Would this ruler be present in a real-world scenario? Absolutely not.

It helps to perform a thorough review of your training data to limit your chances of a feature leak.
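Part of that review can be automated. One crude screen, sketched below as an assumption on my part rather than an established recipe, flags any numeric feature that tracks the target almost perfectly. The data, the feature names, and the 0.95 threshold are invented for illustration.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def suspected_leaks(features, target, threshold=0.95):
    """Names of features whose correlation with the target looks too good."""
    return sorted(
        name for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    )

# Hypothetical churn data: 1 means the customer left.
churned = [0, 1, 0, 1, 0, 0, 1, 0]
features = {
    "monthly_spend": [40, 35, 60, 50, 45, 30, 55, 65],
    "calls_per_week": [12, 9, 15, 7, 11, 14, 10, 8],
    # Only recorded once a customer has already left: the answer in disguise.
    "closure_form_submitted": [0, 1, 0, 1, 0, 0, 1, 0],
}

flagged = suspected_leaks(features, churned)
print(flagged)
```

Treat this as a first pass. A high correlation is a prompt to ask where the feature comes from and when it is recorded, not proof of a leak. A subtle leak like the ruler will sail straight past a check like this, so you still need a human review.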

5/ Where will this data come from?

You can talk about data all you like; but until you can actually get access to that data, you can’t build a model from it.

Do you already have this data in-house? If not, will you have to purchase it from an upstream data vendor? Will they charge you a small fortune for the privilege?

When you can’t get access to the data outright, sometimes you can get creative. Are there proxies for a given feature or dataset that are easier to acquire? Or can you simply do without a certain feature and see how far you get?

On a related note, you’ll likely have to combine data from several sources to make a single training dataset. Do yourself a favor and keep track of where each field comes from. When a customer request, regulator, or verdict compels you to remove certain data from your model, you want that to be as simple as possible: “OK, we’ve removed those records from the source data. We can rerun the data pipelines to generate the training dataset, then rebuild the model.”
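Here is a minimal sketch of that bookkeeping. It assumes each source is keyed by customer ID and that a plain dictionary records which source supplies each field. The source names, field names, and the `build_training_rows` helper are all hypothetical.

```python
def build_training_rows(sources, field_sources, removed_ids=frozenset()):
    """Join fields from each source by customer ID, skipping removed customers."""
    customer_ids = set()
    for records in sources.values():
        customer_ids.update(records)

    rows = []
    for customer_id in sorted(customer_ids - set(removed_ids)):
        row = {"customer_id": customer_id}
        for field, source in field_sources.items():
            # A field is None when its source has no record for this customer.
            row[field] = sources.get(source, {}).get(customer_id, {}).get(field)
        rows.append(row)
    return rows

# Hypothetical source systems, keyed by customer ID.
sources = {
    "billing_db": {"c1": {"monthly_spend": 40}, "c2": {"monthly_spend": 55}},
    "call_records": {"c1": {"calls_per_week": 12}, "c2": {"calls_per_week": 7}},
}

# The provenance record: which source supplies each field.
field_sources = {
    "monthly_spend": "billing_db",
    "calls_per_week": "call_records",
}

rows_before = build_training_rows(sources, field_sources)
# Customer c2 asks to be removed: rerun the same pipeline without them.
rows_after = build_training_rows(sources, field_sources, removed_ids={"c2"})
print(rows_after)
```

Because the mapping lives in one place, you can answer "which fields came from this vendor?" with a lookup, and a removal request becomes a rerun instead of an archaeology project.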

6/ How can we get more data?

You never know how much data you’ll need until you start training the model. Even with a robust, realistic dataset, the model may require more data in order to uncover meaningful patterns.

How do you handle this, then? When my clients undertake a modeling exercise, I always tell them to pick a source that they can revisit in a hurry.

Your ideal situation is to work in a company that generates a lot of data every day (think: telcos, web hosts, social media platforms), because you’ll effectively have an unlimited supply. Need more training data? Just wait a day or two and it will come to you for free.

By comparison, be wary of one-shot sources of data. You may find yourself in a pickle if you need more of that data later but the proverbial well has run dry.

7/ Are we allowed to use this data?

Let’s say you have an idea for a feature that might improve your model’s performance. That feature isn’t part of your proprietary dataset, and none of your upstream data vendors have it. What do you do?

Maybe you could work with a shady data broker. Or collect it yourself in a way that violates some other company’s terms of service (TOS), like scraping images from a social media site.

You could do those things. But do you really want to? As noted in “Our Favorite Questions,” attorney Shane Glynn reminds us:

The safest course of action is also the slowest and most expensive: obtain your training data as part of a collection strategy that includes efforts to obtain the correct representative sample under an explicit license for use as training data.

The next best approach is to use existing data collected under broad licensing rights that include use as training data even if that use was not the explicit purpose of the collection.

A training dataset that leads to legal action is not a training dataset that you want. At best, you’ll win the court case but still lose time, money, and effort along the way. At worst, you’ll be forced to remove the offending data from your model. And if your business relies on that model to generate revenue, you may have to close up shop.

Stick to data that you’re actually permitted to use, and you’ll avoid this trap.

Reducing your exposure

No model is perfect. But you can influence your model’s performance by building a realistic and robust training dataset that represents the outside world.

Developing practices around how you build a training dataset is a form of risk management for your AI models. By carefully choosing your sources and curating your features, you can reduce your exposure to model risk and reputation risk down the road.