This post is part of a series in which I address common questions I get from data-interested executives and data practitioners.
“More data is better” is a common refrain in this field. The more data you use to train a predictive model, the more “experience” it will have, and the better its predictions will be.
Still, “more data is better” is easier said than done. If you find yourself short on data, I have four ideas on how to get more:
If you have the time, you can start collecting data now and wait until you have enough to build a performant machine learning model.
You can craft your collection methods to capture exactly the data of interest, and in the format that works best for you, which means that you can skip a lot of data engineering work later on. (One could argue that you’re just front-loading the process with the data engineering work, but I digress…) Most of all, you can trust the data because you know exactly how it came to be. This gives you the most control compared to the other methods.
The greatest drawback here is time to market: you need to develop the procedures and infrastructure to capture the data, and then you need to wait weeks or even months to collect a useful amount.
Most companies don’t have that kind of time, so they shop around for their data.
There are plenty of vendors who will sell you data if you lack the time or technical resources to collect it yourself. Buying from a third party gives you the shortest time-to-market: swipe your credit card, download the data, and you’re off to the races.
This method isn’t a total win, though. For one, you’ll still have some data engineering work ahead of you. It’s unlikely the upstream vendor’s data format will be a drop-in for your needs. Doubly so when the vendor changes something in the data format and you have to scramble to update your code accordingly.
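One common way to contain that data engineering work is to put a thin adapter between the vendor's format and your own, so a vendor-side change only forces you to update one function. Here is a minimal sketch; the vendor field names (`obs_date`, `tempF`) and the internal schema are invented for illustration:

```python
# Hypothetical adapter mapping a vendor's field names and units onto an
# internal schema. If the vendor renames a field or changes units, only
# this function needs to change.
def from_vendor(record: dict) -> dict:
    return {
        "date": record["obs_date"],
        # vendor reports Fahrenheit; internal schema uses Celsius
        "temp_c": round((record["tempF"] - 32) * 5 / 9, 1),
    }

vendor_row = {"obs_date": "2024-06-01", "tempF": 68.0}
print(from_vendor(vendor_row))  # {'date': '2024-06-01', 'temp_c': 20.0}
```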
Also, a shady vendor may sell you patchy, dirty, or fabricated data. Or maybe they built that dataset through unscrupulous means, such as tricking people or outright stealing it. It’s only a matter of time before they (and, by extension, you) get caught. Caveat emptor.
The biggest risk of buying a dataset is that your data source can dry up overnight. Maybe the vendor closes up shop, or they’ve come to see you as a threat and they cut you off. If you’ve built your entire business on data that you don’t own, you are effectively renting your business from your data vendor.
Sometimes the data is within reach, but not in machine-readable form. This is common when data is published on websites, in electronic documents, or in anything else designed for human consumption. People need tables, headings, and other formatting to guide their reading, but all of that just confuses machines.
If you have the right technical talent in-house, you can build scrapers to parse those websites or documents. That will extract the data into the neat rows and columns that machine learning models prefer. In some cases you can even coordinate large groups of people to perform manual data entry.
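To make the scraping idea concrete, here is a minimal sketch that pulls the cells of an HTML table into rows using only Python's standard-library `html.parser`; a production scraper would need fetching, error handling, and far more robust parsing, and the sample HTML is invented for illustration:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = None      # cells of the row currently being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

html = "<table><tr><th>city</th><th>temp</th></tr><tr><td>Oslo</td><td>12</td></tr></table>"
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['city', 'temp'], ['Oslo', '12']]
```

The human-friendly formatting is discarded, and what remains is exactly the rows-and-columns shape a model can consume.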
Besides the engineering or human labor costs, this method runs the risk of you violating someone else’s terms of service. They may subject you to legal action, or find a way to block your scraping efforts, or both.
Do you really need this particular dataset to answer your questions? Or is there some other dataset, one that you already have on hand, that is a reasonable proxy for the one you want? Sometimes you can be creative, and get more data by getting different data.
For example, say that you want historical temperature data for a certain city, but they’ve only recently started collecting that information. If you have access to historical weather data for a nearby city, the numbers won’t be exactly the same, but they should be suitable for a rough test of a model.
Your predictive models will perform better as you feed them more training data. Consider the four ideas outlined above the next time you need to grow your training datasets.