This is only tangential to Frank Wiles's point, but ... not only is generating synthetic data fun/funny, it's a hell of a way to stretch your brain as a data practitioner or software developer.
In part because the first step involves understanding the problem you're trying to solve.
How so? Because you're not just "synthesizing data"; you're playing the role of a data-generating process.
Which means you have to understand that data-generating process, and also understand the downstream consumer(s) of that process relevant to the task at hand.
Digging deeper, there are several angles to "understanding the problem." Consider:
1/ "Why do we need this data?" (How will it be used?)
2/ "How realistic does it have to be?" (Do we just need characters/digits to fill slots, like a domain-specific Lorem Ipsum? Or will the downstream process be more discerning?)
3/ "How much data do we need?" (Is this a quick one-off where we don't need to worry about reproducibility? Will this require a steady stream of synthetic data for ongoing work?)
... and so on.
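To make questions 2 and 3 concrete, here's a minimal sketch in Python, assuming the low-realism end of the spectrum (filler IDs with no domain meaning) and a need for reproducibility. The names `filler_id` and `make_batch` are hypothetical, invented for this illustration.

```python
import random
import string


def filler_id(rng: random.Random, length: int = 8) -> str:
    """Lorem-Ipsum-style filler: random uppercase letters and digits, no domain meaning."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def make_batch(seed: int, n: int) -> list[str]:
    """Build n filler IDs from a dedicated, seeded generator."""
    # Same seed in, same batch out. That is the reproducibility part.
    rng = random.Random(seed)
    return [filler_id(rng) for _ in range(n)]


batch_a = make_batch(seed=42, n=5)
batch_b = make_batch(seed=42, n=5)
print(batch_a == batch_b)
```

If this were a true one-off, you could skip the seed entirely. For a steady stream of synthetic data, recording the seed alongside each batch is one way to regenerate it later.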
Your answers to those questions will guide your choice of tools. Here are just a few questions to sort through:
random number generator? (If so: which distribution?)
random word choice? (Which words, then? Any old words, or domain-specific?)
GAN? (If so: what's the training data?)
LLM? (Can a general-purpose, public genAI bot mimic this domain well enough for the task at hand?)
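As a rough illustration of the first two options, here's a sketch in Python that pairs a random number generator with domain-specific word choice. Everything specific in it is an assumption made up for the example: the word list, the `synth_order` name, the field names, and the log-normal distribution (picked only because it is always positive and right-skewed, the way purchase amounts often are).

```python
import random

# Hypothetical domain vocabulary; swap in terms from your own field.
PRODUCT_WORDS = ["widget", "gasket", "bracket", "valve"]


def synth_order(rng: random.Random) -> dict:
    """Generate one fake order record from the given generator."""
    return {
        # Random word choice, restricted to the domain list above.
        "product": rng.choice(PRODUCT_WORDS),
        # Assumed distribution: log-normal with mu=3.0, sigma=0.5.
        "amount": round(rng.lognormvariate(3.0, 0.5), 2),
        # Uniform integer from 1 to 10, inclusive on both ends.
        "quantity": rng.randint(1, 10),
    }


rng = random.Random(2024)
orders = [synth_order(rng) for _ in range(100)]
print(orders[0])
```

Whether a log-normal amount or a four-word vocabulary is "realistic enough" goes right back to how the data will be used downstream.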
(You'll notice that writing the code is a small part of the effort, and the last step. That's typical for a data or software professional. Most of the job involves working through: "what are we trying to achieve here? And why?" Beware anyone who tries to skip past that.)
Thanks for reading. I've been meaning to write a blog post on synthetic data for ... several years now. Thanks to Frank's post, I guess I just kinda did?
In any case: if you've made it this far and you don't already follow Frank Wiles, you probably should. He has deep knowledge of all things software.
(Also, some of you may recognize this text from a thread I posted to Bluesky earlier today. If you'd like more timely updates from me – and through a system that doesn't default to a curated, "algorithmic" view – Bluesky is the place: @qethanm.bsky.social)