The Importance of Simulating Data

The what and why of synthetic datasets.

(This post is based on a guest lecture I delivered to a graduate class in algorithmic trading.)

A common refrain in the ML/AI world is that data scientists spend most of their time cleaning up messy data. While data cleaning is no doubt a valuable skill, equally valuable is the ability to create synthetic data.

Why would I need to create data?

Compared to the topics of building models and analyzing data, people rarely discuss creating data. There are several situations in which this need may arise in the workplace:

You have data, but not enough: It’s not uncommon to develop proof-of-concept systems in anticipation of collecting enough real data later on. Similarly, stress-testing a data pipeline may require orders of magnitude more data than you have on-hand.

You have a lot of data, but it’s uneven: Data that reflects real-world processes is often messy. That means data scientists don’t always have the luxury of building a classifier on perfectly-balanced training data.

You have the data, but you can’t share it: Privacy rules often limit a researcher’s access to real data. You want to create a modified version of a dataset that’s sufficiently realistic for analysis or to validate someone else’s results, but vague enough to keep private details private.

You need different data: Larger systems are a breeding ground for edge cases and blind spots. Generating data lets you test various what-if scenarios, ranging from “vastly different arrangements of model inputs” to “malformed inputs.”

To better understand your data: Learning how to generate synthetic data is a great way to test your understanding of the underlying data-generating process, summary statistics, and boundary conditions.

How to create synthetic data

Creating data requires that you understand how the original data was (or will be) created in the wild and how realistic the synthetic data needs to be.

Simple use cases – say, testing a new data visualization system or load-testing a data pipeline – are the data equivalent of lorem ipsum: you’re just filling space to see how everything else around it looks. In some cases you can solve this by generating a lot of random numbers, or by duplicating your existing dataset several hundreds or thousands of times over.
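For the lorem-ipsum case, a few lines of the standard library go a long way. Here’s a minimal sketch; the column names, symbols, and value ranges are all made up, so swap in whatever schema your system expects:

```python
import random

def make_filler_rows(n_rows, seed=42):
    """Generate placeholder records -- the data equivalent of lorem ipsum.

    The schema below (symbol/price/volume) is purely illustrative; the
    point is just to fill space with plausible-looking values.
    """
    rng = random.Random(seed)  # seeded so the filler data is reproducible
    symbols = ["AAA", "BBB", "CCC"]  # made-up ticker symbols
    return [
        {
            "symbol": rng.choice(symbols),
            "price": round(rng.uniform(10.0, 500.0), 2),
            "volume": rng.randint(100, 10_000),
        }
        for _ in range(n_rows)
    ]

rows = make_filler_rows(1_000)
```

To stress-test a pipeline, crank `n_rows` up by a few orders of magnitude; the values don’t need to be realistic, only shaped like the real thing.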

If you’re testing how a system reacts to certain conditions, you’ll need to generate more realistic variants of existing data records. For imbalanced datasets you can borrow oversampling techniques such as SMOTE and ADASYN. With time series data, you can shuffle the values (or their differences) such that you retain the movements that actually happened, but in a different order. For a twist, you can treat the values of the original time series as a statistical distribution from which you can sample. (This is a more complicated version of the “marbles in urns” problems that are common in statistics textbooks.)
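The shuffle-the-differences trick is easy to sketch. The idea: take the first differences of the series, shuffle them, and rebuild a new series from the same starting point. The reconstructed series contains exactly the moves that actually happened, just reordered, so the distribution of the moves is preserved:

```python
import random

def shuffle_returns(prices, seed=7):
    """Rebuild a series from a shuffled copy of its first differences.

    The output starts at the same value and ends at the same value as
    the input (the sum of the differences is unchanged); only the order
    of the moves differs.
    """
    diffs = [b - a for a, b in zip(prices, prices[1:])]
    rng = random.Random(seed)
    rng.shuffle(diffs)
    out = [prices[0]]
    for d in diffs:
        out.append(out[-1] + d)
    return out
```

Sampling the differences *with replacement* instead of shuffling them gives you the bootstrap-style variant: new series of arbitrary length drawn from the empirical distribution of moves.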

In more advanced cases, you can turn to genetic algorithms, fractals, and even generative tools such as GANs to create your synthetic data. These techniques require a deeper knowledge of the original data, not to mention additional time and effort to implement, but they can prove useful when you’re trying to induce corner cases in your code or your ML/AI model. It’s much better to find those during a test scenario, when there’s no real-world money on the line.
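As a taste of the genetic-algorithm route, here’s a small, deliberately generic sketch: it evolves a population of synthetic records toward whatever a caller-supplied `fitness` function rewards (for example, “how close does this input come to breaking my parser?”). Everything here – the operator choices, rates, and names – is a hypothetical illustration, not a production implementation:

```python
import random

def evolve_edge_cases(fitness, seed_pop, generations=50, seed=0):
    """Tiny genetic-algorithm sketch for hunting corner cases.

    `seed_pop` is a list of candidate records (each a list of numbers);
    `fitness` scores a record, higher meaning "closer to the edge case
    we want to trigger." Keeps the top half each generation (elitism),
    then refills via crossover and occasional mutation.
    """
    rng = random.Random(seed)
    pop = list(seed_pop)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: max(2, len(pop) // 2)]
        children = []
        while len(survivors) + len(children) < len(pop):
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(a))
            child = a[:cut] + b[cut:]            # single-point crossover
            if rng.random() < 0.3:               # occasional mutation
                i = rng.randrange(len(child))
                child = child[:i] + [child[i] + rng.uniform(-1, 1)] + child[i + 1:]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

Because the best record always survives to the next generation, the result is never worse than the best member of the starting population – which is exactly the property you want when hunting for inputs that stress your model.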

Another way to catch corner cases is to create mostly-realistic-yet-slightly-weird data by generating, then intentionally corrupting, a synthetic dataset. For example, you could shuffle a time series, double the value at every third prime number, then randomly pick ten values to zero out.
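The corruption recipe above translates almost line-for-line into code. This sketch interprets “every third prime number” as every third prime-numbered index, which is one reasonable reading; adjust the corruptions to match whatever weirdness you want to induce:

```python
import random

def corrupt(series, n_zeros=10, seed=3):
    """Intentionally corrupt a synthetic series to probe edge cases.

    Mirrors the recipe in the text: shuffle the series, double the
    values at every third prime-numbered index, then zero out a
    handful of randomly chosen entries.
    """
    rng = random.Random(seed)
    data = list(series)
    rng.shuffle(data)

    def primes_up_to(n):
        return [p for p in range(2, n)
                if all(p % d for d in range(2, int(p ** 0.5) + 1))]

    for p in primes_up_to(len(data))[::3]:   # every third prime index
        data[p] *= 2
    for i in rng.sample(range(len(data)), min(n_zeros, len(data))):
        data[i] = 0
    return data
```

The exact corruptions matter less than their variety: each one exercises a different failure mode (reordering, outliers, missing-as-zero values) in whatever consumes the data.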

Knowing “when” is as important as knowing “how”

Generating synthetic data is a useful skill. As a general rule, though, treat this as test data for test systems. You don’t want synthetic data to wind up in a production ML/AI model, or in any analysis on which real-world decisions will be made. In that case, its name changes from “synthetic data” to “faked data,” and that’s usually not good for your career.

There are, understandably, exceptions. Synthetic records that come from rebalancing techniques, such as the aforementioned ADASYN and SMOTE, are intended to be included in a model. In those cases it’s understood that you’re weighing the risk of building a model on synthetic training data against the risk of not being able to build a model at all. So long as everyone is aware of this – be sure to note this in your product discussions and MLOps notes – you can all set expectations accordingly.