This post is part of a series in which I address common questions I get from data-interested executives and data practitioners.
“So … How much data do we need in order to build a predictive model?”
I often get this question from executives who are planning their company’s data efforts.
It’s a very good question, but the frustrating truth is that there’s no quick, specific answer. The answer depends on several factors, such as:
- The details of what you’re trying to predict: some predictions require less data than others.
- The specifics of your data: some datasets have stronger “signal” related to your questions.
- How precise an answer you need: how much tolerance do you have for wrong answers?
I can, however, offer three guidelines:
1. More data is better, but …
2. … you won’t know “how much data” until after the fact.
3. Make sure that it’s a certain kind of “more.”
Before I explain those in detail, let’s first tackle some terminology:
I define a predictive model as the output of an algorithm (such as logistic regression or SVM) that has been fed some particular set of training data. Building a predictive model is also known as training a model.
Companies develop predictive models to answer specific questions: “Will this customer churn?” “Is this a fraudulent transaction?” “Will this flight arrive on-time?”
The algorithm develops some generalized understanding based on patterns it finds in the data, and captures that understanding in a model. The model can then look at new data from the real world and reach a conclusion: “While I haven’t seen this exact data record before, it’s very similar to records for which the customer cancelled their service.”
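To make the “it’s very similar to records I’ve seen” idea concrete, here is a minimal sketch of a nearest-neighbor classifier in Python. Everything in it is an assumption for illustration: the two features (support calls per month, months as a customer), the six training records, and the name `predict_churn` are all hypothetical, and nearest-neighbor is just one simple technique picked for brevity.

```python
import math

# Hypothetical training records:
# (support_calls_per_month, months_as_customer) -> did the customer churn?
TRAINING = [
    ((8, 3), True),
    ((6, 5), True),
    ((7, 2), True),
    ((1, 24), False),
    ((0, 36), False),
    ((2, 18), False),
]

def predict_churn(record, training=TRAINING, k=3):
    """Predict churn by majority vote among the k most similar known records."""
    # Rank the training records by straight-line distance to the new record.
    # (A real model would rescale the features first; this sketch does not.)
    nearest = sorted(training, key=lambda item: math.dist(item[0], record))[:k]
    votes = sum(1 for _, churned in nearest if churned)
    return votes > k / 2

# Neither of these records appears in the training data.
print(predict_churn((7, 4)))    # resembles the customers who cancelled
print(predict_churn((1, 30)))   # resembles the customers who stayed
```

The model has never seen either record, but each one sits close to a group of records it has seen, and that is what drives the prediction.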
Data for a model is like experience for a person. The more experience a person has, the better they are at spotting certain situations.
If you feed an algorithm a lot of training data, the resultant model embodies a more robust picture of the world. If you instead train on a small amount of data (I call this “starving the model”), it will have very little experience, so it will produce a lot of wrong answers.
This is why it’s so hard to know a priori how much training data you’ll need: you can only find out by going through several iterations of training and evaluating your model in a controlled environment. Once it performs well enough for your needs – it gets the right answers, enough of the time – then you probably have enough training data.
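Practitioners often call the result of that train-and-evaluate loop a learning curve. The sketch below shows the idea in Python under loud assumptions: the data is synthetic, the “algorithm” is a toy threshold rule, and the names `make_records`, `learning_curve`, and `smallest_sufficient_size` are hypothetical. It is meant to show the shape of the process, not a recipe for a real project.

```python
import random

def make_records(n, rng, noise=1.5):
    """Synthetic, balanced two-class data: one noisy numeric feature per record."""
    records = []
    for i in range(n):
        label = (i % 2 == 0)              # alternate labels so both classes appear
        center = 1.0 if label else -1.0
        records.append((rng.gauss(center, noise), label))
    return records

def train(records):
    """Toy 'algorithm': place a threshold midway between the two class means."""
    pos = [x for x, y in records if y]
    neg = [x for x, y in records if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, records):
    """Share of records for which the threshold rule gives the right answer."""
    hits = sum((x > threshold) == y for x, y in records)
    return hits / len(records)

def learning_curve(sizes, trials=200, seed=0):
    """Mean held-out accuracy for each training-set size, averaged over trials."""
    rng = random.Random(seed)
    test_set = make_records(2000, rng)    # held out: never used for training
    curve = {}
    for n in sizes:
        scores = [accuracy(train(make_records(n, rng)), test_set)
                  for _ in range(trials)]
        curve[n] = sum(scores) / len(scores)
    return curve

def smallest_sufficient_size(curve, target):
    """First training-set size whose mean accuracy meets the target, else None."""
    for n in sorted(curve):
        if curve[n] >= target:
            return n
    return None

curve = learning_curve([2, 8, 32, 128, 512])
for n, acc in curve.items():
    print(f"{n:>4} training records -> mean accuracy {acc:.3f}")
```

Calling `smallest_sufficient_size(curve, target)` with your own accuracy target tells you which of the sizes you tried was enough, and you only learn that after running the experiment. If it returns `None`, either you need more data or the signal in the data cannot support that target.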
Emphasis on “probably.” There’s one more guideline:
Even if you have a lot of data, it may not be the right data:
- Data that is dirty, inconsistent, or incomplete will confuse the model while it is training. If you are lucky, you will catch this during the train/evaluation cycle. If you are unlucky, the model will (by coincidence) produce the right answers during evaluation but fail when deployed to the real world.
- Even if you have a lot of pristine, consistent data, it may have no connection to the question you’re trying to answer. You will usually spot this while trying to train the model (as it will fail to produce suitable answers), but you still have to go through the effort of trying to build a model to reach that conclusion.
- Your training data may not align with what the model will see in the real world. Statisticians would say that your training data is not a representative sample of the real world, which is a nice way of saying that your training data is biased. Your model will try to answer questions for which it doesn’t have enough experience to make a proper judgement. Say that you want to predict customer churn, but your training data came only from your North American market. The model is only suitable for predicting churn in that one market; it is not generally applicable and will give the wrong answers when applied to your European or Asian markets.
- You may have a large amount of clean, applicable, unbiased data … that you acquired through unscrupulous means. While your model may produce favorable results, you will eventually get into trouble. (See my series “Data Ethics: A Risk Perspective” for more detail.)
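The representativeness problem in the churn example above is one you can check for before you train anything. As a rough sketch, assuming each record carries a hypothetical `market` field and that you have a sample of the records the model will face once deployed, you can compare how the two datasets are distributed. The function names and the 10-point tolerance are illustrative choices, not a standard.

```python
from collections import Counter

def share_by_group(records, key):
    """Fraction of records that fall into each group (for example, each market)."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def coverage_gaps(train, live, key, tolerance=0.10):
    """Groups whose share of the training data differs from their share of
    live data by more than `tolerance` (positive = over-represented in training)."""
    train_share = share_by_group(train, key)
    live_share = share_by_group(live, key)
    gaps = {}
    for group in sorted(set(train_share) | set(live_share)):
        diff = train_share.get(group, 0.0) - live_share.get(group, 0.0)
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 2)
    return gaps

# Hypothetical data: training came from one market, live traffic from three.
train = [{"market": "North America"}] * 1000
live = ([{"market": "North America"}] * 500
        + [{"market": "Europe"}] * 300
        + [{"market": "Asia"}] * 200)

gaps = coverage_gaps(train, live, "market")
print(gaps)
```

Any group that shows up in `gaps` is one where the model’s experience does not match the world it will be asked about. A check like this only covers the fields you think to compare, so treat a clean result as encouraging and nothing more.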
Technically, you need just a few records of training data to build a predictive model. If you want a model that actually works, you’ll need a lot of balanced, relevant training data. And if you want to stay out of trouble, you’ll make sure that you acquired that training data through above-board means.