The PR hype over synthetic data is mounting - but I'm not convinced. The articles that delve into synthetic data raise more questions than they answer. First, let's consider "The Advantages and Limitations of Synthetic Data," by Marcello Benedetti.
At the beginning of the article, Benedetti makes a claim that I don't understand: "Synthetic data is system-generated data that mimics real data, in terms of essential parameters set by the user. Synthetic data is any production data not obtained by direct measurement, and is considered anonymized."
Benedetti goes on further to say, "Synthetic data can be created by stripping any personal information (names, license plates, etc.) from a real dataset, so it is completely anonymized." But what if that personal information is necessary for the model to draw inferences? Are some of those variables "essential parameters?"
I understand the concept of anonymizing to protect a person's privacy, or as a technique to avoid biased stereotypes. Enterprises crave personalized data, but protecting privacy is non-negotiable. Anonymizing the data brings limitations, though. For instance, deletion or masking of PII (Personally Identifiable Information) is limited in effect. Clever analysts can re-identify the data by joining datasets with common attributes.
Anonymizing data still enjoys a good reputation despite abundant evidence that it is too easy to defeat. In 2006, Netflix offered a $1 million prize to the first algorithm that could outperform their collaborative filtering algorithm. The dataset they supplied was anonymized, but one group de-anonymized it by joining it with information from the IMDb database. An anonymized database can happily expose PII when it is combined with a data source that contains PII and matched on other shared criteria (so-called latent values).
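The Netflix de-anonymization worked by exactly this kind of join. Here is a minimal sketch of a linkage attack in Python; the column names, records, and the `link` helper are entirely made up for illustration:

```python
# Hypothetical linkage attack: an "anonymized" table keeps quasi-identifiers
# (zip code, birth year), and a public auxiliary dataset shares those columns
# plus real names. Joining the two re-identifies the "anonymous" rows.

# "Anonymized" dataset: direct identifiers stripped, quasi-identifiers kept.
anonymized = [
    {"zip": "10001", "birth_year": 1975, "movie": "Vertigo", "rating": 5},
    {"zip": "94103", "birth_year": 1988, "movie": "Heat", "rating": 3},
]

# Public auxiliary source (e.g., a profile site) with the same attributes.
auxiliary = [
    {"name": "A. Smith", "zip": "10001", "birth_year": 1975},
    {"name": "B. Jones", "zip": "94103", "birth_year": 1988},
]

def link(anon_rows, aux_rows, keys=("zip", "birth_year")):
    """Join the two tables on their shared quasi-identifiers."""
    index = {tuple(r[k] for k in keys): r["name"] for r in aux_rows}
    matches = []
    for row in anon_rows:
        name = index.get(tuple(row[k] for k in keys))
        if name is not None:
            matches.append({**row, "name": name})
    return matches

reidentified = link(anonymized, auxiliary)
for r in reidentified:
    print(r["name"], "rated", r["movie"], r["rating"])
```

With only two shared attributes and a clean match, every "anonymous" row gets a name back. Real attacks are noisier, but the mechanism is the same.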
But more importantly, when trying to solve human problems, it helps to have human data, not anonymized clumps of it. Consider medical research. Masking or deleting any Personally Identifiable Information weakens the dataset by removing features relevant to the investigation. Every document I've read presents this anonymizing effect of synthetic data as an important feature; all I see are drawbacks.
The use of GANs for synthetic data - still troubling
The author goes on to say (and this is repeated in every document I've read on this topic), "Another method is to create a generative model from the original dataset that produces synthetic data that closely resembles the authentic data. A generative model is a model that can learn from large, real datasets to ensure the data it produces accurately resembles real-world data."
Using generative models makes some logical sense, depending on which type is used: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), or autoregressive models. But the quoted passage doesn't explain how they operate, and the term "closely resembles" still troubles me.
Just to give you an idea, Generative Adversarial Networks (GANs) use two neural networks competing with each other to generate, among other things, synthetic data that resembles real data. VAEs and autoregressive models operate differently, but they are all just math. They have no intelligence, and they have no context for what they are doing.
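To make the "just math" point concrete, here is a toy adversarial loop in numpy: the generator is a single linear map and the discriminator a single logistic unit, trained on one-dimensional Gaussian samples. This is a sketch of the GAN idea only, nothing like a production model, and every name and number in it is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# "Real" data: samples from N(4, 1). The generator must learn to mimic it.
def real_batch(n=64):
    return rng.normal(4.0, 1.0, n)

# Generator g(z) = a*z + b; discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.01

for step in range(2000):
    z = rng.normal(0.0, 1.0, 64)
    fake = a * z + b
    real = real_batch()

    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    s_real, s_fake = w * real + c, w * fake + c
    g_real = sigmoid(s_real) - 1.0   # cross-entropy gradient, label 1
    g_fake = sigmoid(s_fake)         # cross-entropy gradient, label 0
    w -= lr * (np.mean(g_real * real) + np.mean(g_fake * fake))
    c -= lr * (np.mean(g_real) + np.mean(g_fake))

    # Generator update: push D(fake) toward 1 (non-saturating loss).
    s_fake = w * fake + c
    g_gen = sigmoid(s_fake) - 1.0
    a -= lr * np.mean(g_gen * w * z)
    b -= lr * np.mean(g_gen * w)

# Draw "synthetic" samples from the trained generator.
samples = a * rng.normal(0.0, 1.0, 1000) + b
print("generated mean ~", round(float(np.mean(samples)), 2))
```

The two players are just gradient updates chasing each other; nothing in the loop "knows" what the data means, which is exactly my point.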
At the dawn of my career as an actuary, we didn't have AI, or data science or, for that matter, computers (the data center had an IBM mainframe, but we didn't have access to it). We did something called mathematical modeling. Equations. And equation solving. When you're building models for rate-making or loss reserving, you apply practices and your judgment. That was the best we could do because we had no data, or very little of it. The models were accurate to a point. Oddly, twenty years later, I was building mathematical models again about things for which we had no experiential data - burying radioactive waste or simulating the performance of a new nuclear warhead. But creating synthetic data similar to what you already have just dilutes (or overamplifies) your collected data, and I don't see the point of it.
Why train a new system with data that is already consistent with the statistical profile of the actual data? Would it not be sufficient to merely run your model on what you have, since it doesn't appear likely the synthetic data will add any new insight?
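A small numpy experiment illustrates the circularity (the distribution and all the numbers are arbitrary): fit a simple model to a sample, draw synthetic data from the fit, then refit on the synthetic data. The second fit just echoes the original estimates, plus extra sampling noise - no new insight appears:

```python
import numpy as np

rng = np.random.default_rng(42)

# A "real" sample from some unknown process.
real = rng.normal(10.0, 2.0, 500)

# Fit a simple generative model: a Gaussian with the sample's own moments.
mu_hat, sd_hat = real.mean(), real.std()

# Synthetic data drawn from that fitted model.
synthetic = rng.normal(mu_hat, sd_hat, 500)

# Refitting on the synthetic data recovers a noisy copy of the same
# estimates - nothing about the process that wasn't in `real` already.
mu_syn, sd_syn = synthetic.mean(), synthetic.std()
print(f"real fit:      mu={mu_hat:.2f}, sd={sd_hat:.2f}")
print(f"synthetic fit: mu={mu_syn:.2f}, sd={sd_syn:.2f}")
```

The synthetic sample is, by construction, a statistical shadow of the real one; analyzing the shadow tells you nothing the original didn't.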
In another article by Cem Dilmegani, the author claims that synthetic data has several benefits over actual data. Let's review them along with my responses.
1. Overcoming data usage restrictions: Real data may have usage constraints due to privacy rules or other regulations. Synthetic data can replicate all essential statistical properties of real data without exposing real data, thereby eliminating the issue.
My response: This sounds like a lot of nonsense to me. If you create data that is similar to existing data that has usage constraints, it seems to me you are violating those constraints.
2. Creating data to simulate not yet encountered conditions: Where real data does not exist, synthetic data is the only solution.
My response: There is no support for this claim in any material I've reviewed. This is what I described above as equation solving: creating imagined data to prove your model.
3. Immunity to some common statistical problems: These can include item nonresponse, skip patterns, and other logical constraints.
My response: In other words, creating synthetic data that is better than what you have. What inference can you draw from that?
4. Focuses on relationships: Synthetic data aims to preserve the multivariate relationships between variables instead of specific statistics alone.
My response: That's interesting, but preserving relationships between variables is also what makes de-anonymization work.
The limitations of synthetic data
The author Dilmegani acknowledges that though synthetic data has various benefits that can ease data science projects for organizations, it also has limitations.
1. Outliers may be missing: Synthetic data can only mimic real-world data; it is not a replica. Therefore, synthetic data may not cover some outliers that the original data has. However, outliers in the data can be more critical than regular data points, as Nassim Nicholas Taleb explains in depth in his book, The Black Swan.
My response: Not just outliers, but reasonable values that are not represented in the actual data because it hasn't been sampled, or it doesn't represent the existing population it is meant to model.
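A quick hypothetical illustrates the point: draw heavy-tailed "real" data (strictly positive, like claim sizes), fit a naive normal model to its moments, and sample synthetic data from the fit. The synthetic sample misses the extreme tail entirely and even produces impossible negative values. All distributions and numbers here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# "Real" data: heavy-tailed and strictly positive (e.g., claim sizes).
real = rng.lognormal(mean=0.0, sigma=1.5, size=5000)

# A naive generative model: a normal fitted to the sample moments.
synthetic = rng.normal(real.mean(), real.std(), 5000)

# The real data is strictly positive; the fitted normal is not, and its
# tails are far lighter, so the extreme outliers never get reproduced.
print("real max:     ", round(float(real.max()), 1))
print("synthetic max:", round(float(synthetic.max()), 1))
print("synthetic min:", round(float(synthetic.min()), 1))
```

A richer generator would do better than this deliberately crude one, but the lesson stands: the tail events that matter most are exactly what a fitted model reproduces worst.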
2. Quality of the model depends on the data source: The quality of synthetic data is highly correlated with the quality of the input data and the data generation model. Synthetic data may reflect the biases in source data.
My response: This raises the question: why bother?
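A toy example of that bias propagation (the groups and proportions are invented): if the collected sample over-represents one group, a generator fitted to it faithfully reproduces the over-representation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose the collected "real" data over-represents group A: 90% A, 10% B,
# while the true population is actually 50/50 (hypothetical numbers).
real_groups = rng.choice(["A", "B"], size=10000, p=[0.9, 0.1])

# A generator fitted to the source simply learns the source's proportions...
p_a = float(np.mean(real_groups == "A"))
synthetic_groups = rng.choice(["A", "B"], size=10000, p=[p_a, 1 - p_a])

# ...so the synthetic data reproduces the sampling bias exactly.
print("share of A in source:   ", round(p_a, 3))
print("share of A in synthetic:", round(float(np.mean(synthetic_groups == "A")), 3))
```

Garbage in, garbage out: the synthetic dataset is no more representative of the true population than the biased sample it was fitted to.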
3. User acceptance is more challenging: Synthetic data is an emerging concept, and it may not be accepted as valid by users who have not witnessed its benefits before.
My response: Present company included.
4. Synthetic data generation requires time and effort: Though easier to create than actual data, synthetic data is also not free.
My response: Sounds like something a synthetic data vendor would say.
5. Output control is necessary: Especially in complex datasets, the best way to ensure the output is accurate is by comparing synthetic data with authentic data or human-annotated data. This is because there could be inconsistencies in synthetic data when trying to replicate complexities within original datasets.
My response: It's an immature technology.
I'm still not convinced by the argument for synthetic data that closely resembles the real data; anonymizing has issues and drawbacks, and generative models do not make it any more credible.