Can synthetic data bridge the AI training data gap?

By Neil Raden, July 27, 2021
The problem of ineffective or biased AI training data persists. Can synthetically generated data alleviate this, or complement real data? It's not a simple question, but it's one we need to assess.


We know that AI requires a great deal of data, at a minimum, to return valuable results. However, gathering and preparing that data is complex. We have to check if there is bias in the data (and that is not a solved problem). We have to be sure the data doesn't contain sensitive information.

There are overlapping statutes and regulations about the data, particularly personally identifiable information. The data may be incomplete, and it may not be representative of the population we are (presumably) trying to understand. And given its size, potentially a billion records or more, how would we ever meet all those requirements?

That old saying, "necessity is the mother of invention," has asserted itself with a different approach: synthetic data. In other words, skip the data wrangling and just create your data. Here is what I thought when I first heard it:

  • Say what?
  • The premise of ML is that the answer is in the data.
  • If you create your data, aren't you already telling the model the answer?
  • If you create your data, by whatever means, aren't you just as likely to insert your biases?
  • What do you mean it's "computer-generated?"
  • It must have some direction.

On that last point, it does.

Synthetic data is premised on the idea that the generated data has the same mathematical and statistical properties as the real-world dataset it is replacing. Vendors claim that their solutions perform better than real-world data by correcting the bias typically found in datasets, particularly those containing historical records. How does it work? There are a few approaches.

The different approaches to synthetic data

1. No real data

If there is no actual data, but the AI engineer possesses a broad understanding of data set distribution, the engineer can create a random sample of any distribution such as Normal, Exponential, Chi-square, etc. The quality of the synthetic data is dependent on the engineer's grasp of a specific data environment.
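As a minimal sketch of this first approach, the snippet below draws random samples from the three distributions named above using only Python's standard library. The function name, the parameters (mean 50 and standard deviation 10, mean 3, and 4 degrees of freedom) and the fixed seed are illustrative assumptions, not values from any real data set.

```python
import random


def sample_assumed_distributions(n, seed=0):
    """Draw n synthetic values from each assumed distribution.

    All parameters below are hypothetical stand-ins for what an
    engineer might believe about the data environment.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    return {
        # Normal with mean 50 and standard deviation 10
        "normal": [rng.gauss(50.0, 10.0) for _ in range(n)],
        # Exponential with mean 3 (expovariate takes the rate, 1 / mean)
        "exponential": [rng.expovariate(1.0 / 3.0) for _ in range(n)],
        # Chi-square with k = 4 degrees of freedom, written as
        # Gamma(shape = k / 2, scale = 2); its mean is k
        "chi_square": [rng.gammavariate(4 / 2, 2.0) for _ in range(n)],
    }
```

Calling `sample_assumed_distributions(1000)` returns a dictionary of three lists. Nothing here checks that the assumed shapes are right, which is exactly why the quality of the output depends on the engineer.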

2. There is real data

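If there is real data, then synthetic data is generated from a best-fit distribution: candidate distributions are fitted to the real data, and new records are sampled from the one that fits best. If the distribution is already known, the engineer can use a Monte Carlo method.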
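One way to picture the best-fit approach is the sketch below. It assumes just two candidate distributions (Normal and Exponential), fits each by maximum likelihood, picks the winner by AIC, and then draws Monte Carlo samples from it. The candidate list, the use of AIC and all function names are illustrative choices, not something the vendors mentioned here are known to use.

```python
import math
import random
import statistics


def fit_normal(data):
    """Maximum-likelihood Normal fit: (name, params, log-likelihood).

    Assumes the data is not constant, so sigma is greater than zero.
    """
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)
    loglik = sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )
    return "normal", (mu, sigma), loglik


def fit_exponential(data):
    """Maximum-likelihood Exponential fit; impossible for negative data."""
    if min(data) < 0:
        return "exponential", (None,), -math.inf
    rate = 1.0 / statistics.fmean(data)
    loglik = len(data) * math.log(rate) - rate * sum(data)
    return "exponential", (rate,), loglik


def best_fit(data):
    """Pick the candidate with the lowest AIC = 2k - 2 * log-likelihood."""
    candidates = [fit_normal(data), fit_exponential(data)]
    return min(candidates, key=lambda c: 2 * len(c[1]) - 2 * c[2])


def monte_carlo_sample(fit, n, seed=0):
    """Draw n synthetic values from the fitted distribution."""
    rng = random.Random(seed)
    name, params, _ = fit
    if name == "normal":
        mu, sigma = params
        return [rng.gauss(mu, sigma) for _ in range(n)]
    (rate,) = params
    return [rng.expovariate(rate) for _ in range(n)]
```

A typical call would be `monte_carlo_sample(best_fit(real_values), 10000)`, where `real_values` is a hypothetical list of numbers taken from the real data set. Note that the synthetic values can only be as good as the candidate list: if neither shape suits the data, the "best" fit is still a poor one.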

3. Only some real data

When only a portion of actual data exists, a hybrid synthetic data generation approach is used. The engineer generates part of the data set from assumed distributions and generates other parts from actual data.
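A hedged sketch of the hybrid idea: suppose, purely for illustration, that real ages are available but incomes are not. The age field is resampled from the actual values, while the income field is drawn from an assumed log-normal distribution. The field names, the log-normal parameters and the function name are all hypothetical.

```python
import random


def hybrid_synthetic(real_ages, n, seed=0):
    """Build n synthetic records from partly real, partly assumed sources."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Part generated from actual data: resample an observed age
        age = rng.choice(real_ages)
        # Part generated from an assumed distribution: log-normal income
        # (mu = 10.5 and sigma = 0.5 are made-up parameters)
        income = rng.lognormvariate(10.5, 0.5)
        rows.append({"age": age, "income": income})
    return rows
```

This simple version samples the two fields independently, so any real relationship between age and income is lost. A production generator would have to model that dependence as well.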

Test-driving synthetic AI data

Just like the models used to create deepfakes, synthetic data generation uses generative adversarial networks (GANs). What's a GAN? A GAN uses a pair of neural networks. One, the generator, produces synthetic data, and the other, the discriminator, tries to detect whether it's real. The networks keep training against each other until the discriminator cannot distinguish between real and synthetic data.
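To make the adversarial loop concrete, here is a deliberately tiny sketch in plain Python. Each "network" is shrunk to a single unit: the generator is G(z) = z + b and the discriminator is D(x) = sigmoid(w*x + c). The real data is assumed to be Normal with mean 4 and standard deviation 1, and a small weight decay is added to the discriminator to damp the oscillation such toy setups are prone to. All of these are simplifying assumptions for illustration. Real GANs use deep networks and a framework with automatic differentiation.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real_mean=4.0, steps=2000, batch=64,
                  lr=0.05, decay=0.5, seed=0):
    """Alternate discriminator and generator updates; return (b, w, c)."""
    rng = random.Random(seed)
    b = 0.0          # generator parameter: G(z) = z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake))
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw -= (1.0 - d) * x
            gc -= (1.0 - d)
        for x in fake:
            d = sigmoid(w * x + c)
            gw += d * x
            gc += d
        w -= lr * (gw / batch + decay * w)  # decay is an added stabilizer
        c -= lr * (gc / batch + decay * c)

        # Generator step: minimize -log D(fake), i.e. try to fool D
        gb = 0.0
        for x in fake:
            gb -= (1.0 - sigmoid(w * x + c)) * w
        b -= lr * gb / batch
    return b, w, c
```

After training, the generator's shift `b` should sit close to the real mean, and the discriminator's weight `w` should shrink toward zero, meaning it can no longer tell the two sources apart. Because this discriminator only sees a linear function of x, the toy can learn where the data is centered but little else about its shape.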

Another technique uses a variational autoencoder (VAE). To keep it simple, the encoder half of a variational autoencoder compresses the original data into a compact representation and passes it to the decoder. The decoder then generates an output that mirrors the original data set. The system is trained to make the output match the input as closely as possible.

Clear as mud?

I've had the chance to get my hands on a small sample of synthetic data software:

  • IBM, AWS and Microsoft already have synthetic data generators.
  • Hazy - used to boost fraud and money laundering detection.
  • AiFi - simulates retail stores and shopper behavior.
  • One View - creates datasets for astronomy and imagery.

My take

Would data generated by algorithmic processes from a data set actually match the distribution, whether presumed by the engineer or detected by a model? How does synthetic data deal with outliers? Black swans? Creating data based on the statistics of the actual data set is a little troubling. But then, what does the machine learning model do? It finds patterns and does a million differential equations to minimize the cost function as fast as possible, a process that can be derailed by shortcut learning and adversarial perturbations, to which GANs are particularly susceptible.

I'm going to withhold judgment until I see some qualified results on how well this operates. It seems like everything that happens with AI carries seeds of disaster. As Paul Virilio said, "The invention of the ship was also the invention of the shipwreck."