1  Synthetic Data

This section gives a brief introduction to synthetic data and links to further resources.

1.1 What is synthetic data? Why use it?

Synthetic Data

Synthetic data consist of pseudo-records that are statistically representative of the confidential data. They are typically created via processes that introduce randomness, like sampling from parametric or non-parametric models.

  • The goal of most syntheses is to closely mimic the underlying distribution and statistical properties of the real data (also known as the confidential data) to preserve data utility while minimizing disclosure risks.
  • Synthesized values also limit an intruder’s confidence, because the intruder cannot confirm that a synthetic value exists in the confidential dataset.
  • Synthetic data may be used as a “training dataset” to develop programs to run on confidential data via a validation server.
  • Synthetic data may help users confirm that the data will satisfy their needs before they request access via a secure enclave, or allow them to develop programs to run on the confidential data while awaiting access, which is often time-consuming.
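As a minimal sketch of "sampling from a parametric model", the snippet below fits a normal distribution to one confidential variable and draws pseudo-records from it. The income values, the function name `synthesize_normal`, and the choice of a normal model are all assumptions made for illustration; real syntheses typically use richer models and several variables.

```python
import random
import statistics

def synthesize_normal(confidential, n, seed=0):
    """Draw n pseudo-records from a normal model fit to the confidential values."""
    rng = random.Random(seed)
    mu = statistics.mean(confidential)
    sigma = statistics.stdev(confidential)
    # The randomness comes from sampling the fitted model,
    # not from copying any confidential value.
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical confidential incomes (made-up values for illustration only).
confidential_income = [31_000, 42_500, 38_200, 55_100, 47_300, 29_800, 61_000, 44_400]
synthetic_income = synthesize_normal(confidential_income, n=1000)

# The synthetic mean should be close to, but not identical to, the confidential mean.
print(round(statistics.mean(confidential_income)), round(statistics.mean(synthetic_income)))
```

The synthetic values are statistically similar to the confidential ones only to the extent that the assumed model is a reasonable description of the data.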
Partially synthetic

Partially synthetic data contain a mixture of confidential and synthesized observations or variables (Little 1993). In partially synthetic data, there generally remains a one-to-one mapping between confidential records and synthetic records.

Below, we see an example of what a partially synthesized version of confidential data could look like.

Figure 1.1: Partially synthetic data
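A rough sketch of partial synthesis, under simplifying assumptions: only the `income` variable is replaced with draws from a fitted normal model, while `id` and `age` are left untouched. The records, the helper name `partially_synthesize`, and the normal model are hypothetical and chosen only to show that each synthetic record still lines up with one confidential record.

```python
import random
import statistics

# Hypothetical confidential records (made-up values for illustration only).
confidential = [
    {"id": 1, "age": 34, "income": 41_000},
    {"id": 2, "age": 51, "income": 58_500},
    {"id": 3, "age": 29, "income": 36_200},
    {"id": 4, "age": 45, "income": 62_000},
    {"id": 5, "age": 38, "income": 47_300},
    {"id": 6, "age": 60, "income": 71_900},
]

def partially_synthesize(records, variable, seed=0):
    """Replace one variable with model draws; keep every other variable as-is."""
    rng = random.Random(seed)
    values = [r[variable] for r in records]
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    # One output record per input record: the one-to-one mapping is preserved.
    return [{**r, variable: round(rng.gauss(mu, sigma))} for r in records]

partial = partially_synthesize(confidential, "income")
for original, synthetic in zip(confidential, partial):
    print(original["id"], original["age"], original["income"], "->", synthetic["income"])
```

Because `age` is copied unchanged, an intruder who knows someone's age can still narrow down which record is theirs, which is the disclosure risk associated with partial synthesis.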
Fully synthetic

Fully synthetic data contain only synthetic observations and variables (Rubin 1993). Fully synthetic data no longer directly map onto the confidential records but can remain statistically representative. Because fully synthetic data do not contain any actual observations, they protect against both attribute and identity disclosure (concepts we will describe later in more detail).

Below, we see an example of what a fully synthesized version of confidential data might look like. Note that there is no requirement for fully synthetic data to have the same number of observations as the confidential data.

Fully synthetic data


1.1.1 Partial vs. fully synthetic advantages and disadvantages

  • Changing only some variables (partial synthesis) generally leads to higher utility in analysis, since the relationships among the unsynthesized variables are by definition unchanged (Drechsler, Bender, and Rässler 2008).

  • Disclosure in fully synthetic data is nearly impossible because all values are imputed, while partial synthesis has higher disclosure risk since confidential values remain in the dataset (Drechsler, Bender, and Rässler 2008).

    • Note: while the risk of disclosure for fully synthetic data is very low, it is not zero.
  • Accurate and exhaustive specification of variable relationships and constraints in fully synthetic data is difficult and, if done incorrectly, can lead to bias (Drechsler, Bender, and Rässler 2008).

    • If a variable is synthesized incorrectly early in a sequential synthesis, all variables synthesized on the basis of that variable will be affected.
  • Partially synthetic data may be publicly perceived as more reliable than fully synthetic data.


1.1.2 Why synthetic data?

Synthetic data can provide stronger disclosure protection at a lower cost to utility than other “traditional” statistical disclosure control (SDC) methods. For example, swapping is an SDC method that exchanges sensitive values among sample units with similar characteristics. Top/bottom coding is an SDC method that limits values above or below a threshold value to the threshold. Applying these methods can limit utility for certain types of analyses:

  • Mitra and Reiter (2006) found that a 5 percent swapping of 2 identifying variables in the 1987 Survey of Youth in Custody invalidated statistical hypothesis tests in regression.

  • Top/bottom coding eliminates information at the tails of the distributions, degrading analyses that depend on the entire distribution (Reiter, Wang, and Zhang 2014).
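The two traditional methods described above can be sketched in a few lines. This is an illustrative simplification, not a production SDC routine: the income values are made up, the function names are hypothetical, and `swap_values` swaps randomly chosen pairs rather than matching units on similar characteristics as a real swapping procedure would.

```python
import random

def top_code(values, threshold):
    """Limit values above the threshold to the threshold."""
    return [min(v, threshold) for v in values]

def bottom_code(values, threshold):
    """Limit values below the threshold to the threshold."""
    return [max(v, threshold) for v in values]

def swap_values(values, rate, seed=0):
    """Exchange values between randomly chosen pairs of records.

    Simplification: pairs are chosen at random, not by similar characteristics.
    """
    rng = random.Random(seed)
    n_swap = int(len(values) * rate) // 2 * 2  # need an even number of records
    idx = rng.sample(range(len(values)), n_swap)
    out = list(values)
    for i, j in zip(idx[::2], idx[1::2]):
        out[i], out[j] = out[j], out[i]
    return out

# Hypothetical incomes in thousands of dollars (made-up values).
incomes = [18, 25, 31, 36, 42, 48, 55, 67, 90, 400]

top_coded = top_code(incomes, threshold=100)
print(max(incomes), max(top_coded))  # → 400 100

swapped = swap_values(incomes, rate=0.4)
print(sorted(swapped) == sorted(incomes))  # → True
```

Top coding removes the information in the upper tail (the 400 becomes 100), while swapping keeps the marginal distribution intact but breaks the link between the swapped values and the other variables on those records.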

Synthetic data also allow for the release of data that are more disaggregated than might otherwise be possible with “traditional” SDC (aggregation is a very common SDC technique).


1.2 Data Synthesis Process Overview

Note that this overview is opinionated and simplified in order to provide a reasonable summary.

The synthesis process is very iterative, particularly in the privacy step.


1.2.1 Privacy stakeholders and the synthesis process

Figure 1.2: All of the privacy stakeholders discussed previously have a role in aspects of the synthesis process.

For more on involving data users and data participants in the synthesis process, we recommend Do No Harm Guide: Applying Equity Awareness in Data Privacy Methods, a report by Claire Bowen and Joshua Snoke.


1.3 Key terms for synthesis process

In a perfect world, we would synthesize data by directly modeling the joint distribution of the variables of interest. Unfortunately, this is often computationally infeasible.

Instead, we often decompose a joint distribution into a sequence of conditional distributions.
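Written out, this decomposition is the chain rule of probability. The notation is an assumption introduced here for illustration: $x_1, \dots, x_p$ denote the variables in the chosen synthesis order and $f$ denotes a density or probability mass function.

```latex
f(x_1, x_2, \dots, x_p)
  = f(x_1)\, f(x_2 \mid x_1)\, f(x_3 \mid x_1, x_2) \cdots f(x_p \mid x_1, \dots, x_{p-1})
```

For example, with three variables synthesized in the order sex, age, Social Security benefits, the joint distribution factors as $f(\text{sex})\, f(\text{age} \mid \text{sex})\, f(\text{benefits} \mid \text{sex}, \text{age})$, and each factor can be modeled separately.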

Sequential synthesis

Sequential synthesis is an implementation of synthetic data generation that iteratively estimates a model for each variable to be synthesized, using previously synthesized variables as predictors.

The process of sequential synthesis may be easier to understand with the following table:

Step | Outcome                  | Modelled with | Predicted with
1    | Sex                      | -             | Random sampling with replacement
2    | Age                      | Sex           | Sampled Sex
3    | Social Security Benefits | Sex, Age      | Sampled Sex, Sampled Age
-    | -                        | -             | -
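The three steps in the table can be sketched as follows. This is a simplified illustration and not how tidysynthesis is implemented: the records are made up, the function names are hypothetical, and the model choices (a normal model for age within each sex, and a least-squares line for benefits on age within each sex) are assumptions picked to keep the example short.

```python
import math
import random
import statistics
from collections import defaultdict

# Hypothetical confidential records (made-up values for illustration only).
confidential = [
    {"sex": "F", "age": 62, "benefits": 14_500},
    {"sex": "F", "age": 66, "benefits": 17_200},
    {"sex": "F", "age": 71, "benefits": 19_800},
    {"sex": "F", "age": 78, "benefits": 21_400},
    {"sex": "M", "age": 63, "benefits": 16_100},
    {"sex": "M", "age": 68, "benefits": 19_500},
    {"sex": "M", "age": 72, "benefits": 22_300},
    {"sex": "M", "age": 80, "benefits": 24_900},
]

def fit_line(x, y):
    """Least-squares fit of y on x; returns (intercept, slope, residual sd)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, math.sqrt(sse / (len(x) - 2))

def sequential_synthesis(records, n, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[r["sex"]].append(r)

    # Models are estimated on the confidential data ("Modelled with").
    age_model, benefit_model = {}, {}
    for sex, rows in groups.items():
        ages = [r["age"] for r in rows]
        age_model[sex] = (statistics.mean(ages), statistics.stdev(ages))
        benefit_model[sex] = fit_line(ages, [r["benefits"] for r in rows])

    synthetic = []
    # Step 1: sex, by random sampling with replacement.
    for sex in rng.choices([r["sex"] for r in records], k=n):
        # Step 2: age, predicted with the sampled sex.
        age = rng.gauss(*age_model[sex])
        # Step 3: benefits, predicted with the sampled sex and sampled age.
        b0, b1, sd = benefit_model[sex]
        benefits = rng.gauss(b0 + b1 * age, sd)
        synthetic.append({"sex": sex, "age": round(age), "benefits": round(benefits)})
    return synthetic

synthetic = sequential_synthesis(confidential, n=5)
for row in synthetic:
    print(row)
```

Note that each model is fit to confidential values but receives synthetic values as inputs at prediction time, so a poor draw in an early step carries forward into every later variable.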

Sequential synthesis lets us model multivariate relationships without excessive computational cost, and it is the methodology used by tidysynthesis.

  • We can select the synthesis order based on the priority of the variables or the relationships between them.
  • Usually, the earlier in the order a variable is synthesized, the better its original information is preserved in the synthetic data.
  • Bowen, Liu, and Su (2021) proposed a method that ranks variable importance by either practical or statistical utility and sequentially synthesizes the data accordingly.


Bowen, Claire McKay, Fang Liu, and Bingyue Su. 2021. “Differentially Private Data Release via Statistical Election to Partition Sequentially.” Metron 79 (1): 1–31.
Drechsler, Jörg, Stefan Bender, and Susanne Rässler. 2008. “Comparing Fully and Partially Synthetic Datasets for Statistical Disclosure Control in the German IAB Establishment Panel.” Transactions on Data Privacy 1 (December): 105–30.
Little, Roderick JA. 1993. “Statistical Analysis of Masked Data.” Journal of Official Statistics 9: 407–7.
Mitra, Robin, and Jerome P Reiter. 2006. “Adjusting Survey Weights When Altering Identifying Design Variables via Synthetic Data.” In Privacy in Statistical Databases: CENEX-SDC Project International Conference, PSD 2006, Rome, Italy, December 13-15, 2006. Proceedings, 177–88. Springer.
Reiter, Jerome P, Quanli Wang, and Biyuan Zhang. 2014. “Bayesian Estimation of Disclosure Risks for Multiply Imputed, Synthetic Data.” Journal of Privacy and Confidentiality 6 (1).
Rubin, Donald B. 1993. “Statistical Disclosure Limitation.” Journal of Official Statistics 9 (2): 461–68.