How Much Data Is Enough?
2019-12-16 | tags: AI data literacy

This post is part of a series in which I address common questions I get from data-interested executives and data practitioners.

"So ... How much data do we need in order to build a predictive model?"

I often get this question from executives who are planning their company's data efforts.

It's a very good question, but the frustrating truth is that there's no quick, specific answer. The answer depends on several factors.

I can, however, offer three guidelines:

1. More data is better, but ...

2. ... you won't know "how much data" until after the fact.

3. Make sure that it's a certain kind of "more."

Before I explain those in detail, let's first tackle some terminology:

Building a predictive model

I define a predictive model as the output of an algorithm (such as K-Means or SVM) that has been fed some particular set of training data. Building a predictive model is also known as training a model.

Companies develop predictive models to answer specific questions: "Will this customer churn?" "Is this a fraudulent transaction?" "Will this flight arrive on-time?"

The algorithm develops some generalized understanding based on patterns it finds in the data, and captures that understanding in a model. The model can then look at new data from the real world and reach a conclusion: "While I haven't seen this exact data record before, it's very similar to records for which the customer cancelled their service."
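To make the "similar to records I've seen" idea concrete, here is a minimal sketch using a nearest-neighbor approach. The feature names (monthly logins, support tickets), the records, and the `train`/`predict` helpers are all made up for illustration; a real project would use far more data and most likely an established library.

```python
import math

# Hypothetical training records: (monthly_logins, support_tickets) -> churned?
TRAINING_DATA = [
    ((25, 0), False),
    ((30, 1), False),
    ((22, 2), False),
    ((3, 4), True),
    ((1, 6), True),
    ((5, 5), True),
]


def train(records):
    """'Training' for 1-nearest-neighbor simply stores the labeled records."""
    return list(records)


def predict(model, features):
    """Return the label of the stored record closest to `features`."""
    _, label = min(model, key=lambda rec: math.dist(rec[0], features))
    return label


model = train(TRAINING_DATA)
# Neither record below appears in the training data.
print(predict(model, (4, 3)))   # True: closest to customers who churned
print(predict(model, (27, 1)))  # False: closest to customers who stayed
```

The model never saw `(4, 3)`, yet it reaches a conclusion because that record sits close to the ones labeled as churned.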

More data is better

Data for a model is like experience for a person. The more experience a person has, the better they are at spotting certain situations.

If you feed an algorithm a lot of training data, the resultant model embodies a more robust picture of the world. If you instead train on a small amount of data (I call this "starving the model"), it will have very little experience, so it will produce a lot of wrong answers.

You won't know "how much data" until after the fact

It's hard to know a priori how much training data you'll need, because the only way to find out is empirically: you go through several iterations of training and evaluating your model in a controlled environment. Once it performs well enough for your needs (it gets the right answers, enough of the time), you probably have enough training data.
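One way to picture that train-and-evaluate loop is a learning curve: train on progressively larger slices of your data and score each model against the same held-out records. The sketch below is an illustration under stated assumptions, not a recipe. The data is synthetic, the model is a simple nearest-centroid classifier, and `TARGET` is a hypothetical "good enough" bar that you would set from your own business needs.

```python
import random


def make_records(n, rng):
    """Generate n synthetic (features, label) records for two overlapping classes."""
    records = []
    for _ in range(n):
        label = rng.random() < 0.5
        center = 2.0 if label else 0.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        records.append((features, label))
    return records


def train_centroids(records):
    """Nearest-centroid 'training': average the features of each class."""
    sums = {}
    for (x, y), label in records:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, features):
    """Pick the class whose centroid is closest to the features."""
    x, y = features
    return min(model, key=lambda lbl: (model[lbl][0] - x) ** 2 + (model[lbl][1] - y) ** 2)


def accuracy(model, records):
    """Fraction of records for which the model predicts the true label."""
    hits = sum(predict(model, feats) == label for feats, label in records)
    return hits / len(records)


def learning_curve(train_pool, held_out, sizes):
    """Train on growing slices of the pool; score each model on the same held-out set."""
    return [(n, accuracy(train_centroids(train_pool[:n]), held_out)) for n in sizes]


rng = random.Random(42)  # fixed seed so repeated runs match
train_pool = make_records(2000, rng)
held_out = make_records(500, rng)
TARGET = 0.90  # hypothetical "good enough" bar

for n, acc in learning_curve(train_pool, held_out, [5, 20, 100, 500, 2000]):
    status = "meets target" if acc >= TARGET else "below target"
    print(f"{n:>5} training records -> accuracy {acc:.2f} ({status})")
```

If accuracy is still climbing at your largest training size, more data may help. If it has flattened out below your target, more of the same data probably won't.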

Emphasis on "probably." There's one more guideline:

Just make sure that it's a certain kind of "more."

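Even if you have a lot of data, it may not be the right data. It needs to be balanced, relevant to the question you want the model to answer, and acquired through above-board means.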
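One way a large dataset can still fall short is balance: it may hold plenty of records overall but very few examples of the outcome you care about. A quick check of label shares will surface that. The labels and counts below are hypothetical, and `class_balance` is a helper written for this illustration.

```python
from collections import Counter


def class_balance(labels):
    """Return each label's share of the training set, as a fraction of 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical fraud labels: 10,000 records, but only 10 examples of fraud.
labels = ["legit"] * 9_990 + ["fraud"] * 10
shares = class_balance(labels)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Ten thousand records sounds like a lot, but a model trained on this set has seen only ten examples of the thing it is supposed to detect.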

The Wrap-Up

Technically, you need just a few records of training data to build a predictive model. If you want a model that actually works, you'll need a lot of balanced, relevant training data. And if you want to stay out of trouble, you'll make sure that you acquired that training data through above-board means.
