4  tidysynthesis

Published

April 23, 2026

Abstract
This chapter introduces the tidysynthesis R package for generating synthetic data.
Figure 4.1: Abstract 3D landscape of vertical rectangular bars in shades of pink, purple, and green, resembling a digital terrain.
Overview

In this chapter, you will learn about:

  • tidysynthesis, the Urban Institute’s software for synthetic data development.
  • How to use this software to generate synthetic data using American Community Survey data.

Basic knowledge of R and tidymodels is necessary for working with tidysynthesis.

4.1 tidysynthesis

Earlier, we discussed the different design decisions that go into producing synthetic data. We will demonstrate how to generate synthetic data accounting for these design decisions using the tidysynthesis package in R.

  • tidysynthesis is on CRAN and can be installed with install.packages("tidysynthesis").
  • Development versions of tidysynthesis are available on GitHub.
  • The tidysynthesis documentation website includes detailed documentation for the package.

Throughout this section, we’ll use example data from the 2019 American Community Survey for Nebraska respondents (which comes pre-packaged with tidysynthesis).

library(tidyverse)
library(tidymodels)
library(tidysynthesis)

glimpse(acs_conf_nw)
Rows: 1,500
Columns: 11
$ county       <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq           <fct> Household, Household, Household, Household, Household, Ho…
$ sex          <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst        <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany      <fct> With health insurance coverage, With health insurance cov…
$ empstat      <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr     <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age          <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize      <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot       <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …

4.1.1 Overview of tidysynthesis

The goal of tidysynthesis is to provide a metapackage for managing synthesis design decisions. There are many possible design decisions, but we will start with the four required decisions:

  • Starting variable(s): which variables will be part of the first generative model?
  • Variable order: in what order will we synthesize variables?
  • Model definitions: how will we compute conditional distribution models?
  • Sampler definitions: how will we draw new samples from these models?
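
To make the four decisions concrete, here is a minimal hand-rolled sketch of sequential synthesis in base R. It does not use tidysynthesis and is not how the package is implemented. The toy data frame, the variable names, and the choice of a frequency table and a linear model are all assumptions made for illustration.

```r
set.seed(1)

# toy confidential data (hypothetical; not the ACS extract)
n <- 200
conf <- data.frame(region = factor(sample(c("A", "B", "C"), n, replace = TRUE)))
p_emp <- ifelse(conf$region == "A", 0.8, 0.5)
conf$employed <- factor(ifelse(runif(n) < p_emp, "yes", "no"),
                        levels = c("no", "yes"))
conf$hours <- ifelse(conf$employed == "yes", round(rnorm(n, 38, 6)), 0)

# decision 1, starting variable: keep region exactly as observed
synth <- data.frame(region = conf$region)

# decision 2, variable order: employed first, then hours
visit_sequence <- c("employed", "hours")

# decisions 3 and 4 for employed:
# model   = conditional frequency table P(employed | region)
# sampler = one draw per record from the matching row of that table
emp_tab <- prop.table(table(conf$region, conf$employed), margin = 1)
synth$employed <- factor(
  vapply(
    as.character(synth$region),
    function(r) sample(colnames(emp_tab), 1, prob = emp_tab[r, ]),
    character(1),
    USE.NAMES = FALSE
  ),
  levels = levels(conf$employed)
)

# decisions 3 and 4 for hours:
# model   = linear regression on the previously synthesized variables
# sampler = normal draw around the prediction using the residual sd
hours_fit <- lm(hours ~ region + employed, data = conf)
synth$hours <- rnorm(
  nrow(synth),
  mean = predict(hours_fit, newdata = synth),
  sd = summary(hours_fit)$sigma
)

head(synth)
```

Each later variable is modeled on the confidential data but predicted from the already-synthesized columns, which is what makes the process sequential.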

A tidysynthesis run has four high-level steps:

  1. Create a roadmap S3 object.
  2. Create a synth_spec S3 object.
  3. Create a presynth S3 object using this roadmap and synth_spec.
  4. Create synthetic data using synthesize(presynth = my_presynth).
flowchart TD
  B[roadmap]:::required --> P[presynth]:::required 
  J[synth_spec]:::required --> P
  P --> Q[postsynth]:::required
  classDef required fill:#1696d2,stroke:#1696d2,color:#ffffff;
  classDef optional fill:#ec008b,stroke:#ec008b,color:#ffffff;
Figure 4.2: A diagram of the main tidysynthesis components.

Let’s dive into how each of these components works.

4.1.2 roadmap

flowchart TD
  A[conf_data]:::required --> B[roadmap]:::required
  C[start_data]:::required --> B
  D[start_method]:::optional --> B
  E[schema]:::optional --> B 
  F[visit_sequence]:::optional --> B
  G[replicates]:::optional --> B
  H[constraints]:::optional --> B
  classDef required fill:#1696d2,stroke:#1696d2,color:#ffffff;
  classDef optional fill:#ec008b,stroke:#ec008b,color:#ffffff;
Figure 4.3: A diagram of roadmap components. Objects in blue are required while objects in magenta are optional.

The roadmap object describes the input data sources and the high-level properties of the synthesis, such as its order of operations. You can create a roadmap S3 object with the roadmap() function, which requires two arguments:

  • conf_data: A data frame with the confidential data used to generate the synthetic data. The resulting synthetic data will contain every variable in conf_data; as the output in Section 4.1.5 shows, indicator variables for missing values (such as inctot_NA) may be added.
  • start_data: A data frame containing a strict subset of the variables in conf_data. It starts the synthesis process (it plays the role of the initial generative model), and the resulting synthetic data will have the same number of rows as start_data.

All remaining arguments are optional and are included in Figure 4.3.

The first decision to make when using tidysynthesis is which variables to include in start_data. Any variables not included in start_data will be sequentially synthesized.
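
As a quick illustration of this split, the base R sketch below lists which variables would be left for sequential synthesis given a choice of starting data. The three-row data frame is a hypothetical stand-in, not the packaged ACS extract.

```r
# toy confidential data (hypothetical stand-in for conf_data)
conf_data <- data.frame(
  county = c("Douglas", "Other", "Lancaster"),
  sex    = c("Female", "Male", "Male"),
  age    = c(41, 10, 36)
)

# starting data: a strict subset of the confidential variables
start_data <- conf_data["county"]

# sanity checks mirroring the requirement described above
stopifnot(
  all(names(start_data) %in% names(conf_data)),
  ncol(start_data) < ncol(conf_data)
)

# everything not in start_data must be synthesized sequentially
to_synthesize <- setdiff(names(conf_data), names(start_data))
to_synthesize
```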

For this minimal example, we will use a single variable, county, as-is for our start_data.

# create start_data by selecting one variable from the confidential data
acs_start_county <- select(acs_conf_nw, county)

# create a minimal roadmap
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw, 
  start_data = acs_start_county
)

# display the roadmap
acs_roadmap
Roadmap: 
conf_data: 1500 observations, 11 variables 
start_data: 1500 observations, 1 variables

4.1.3 synth_spec

flowchart TD
  I[models]:::required --> J[synth_spec]:::required
  K[samplers]:::required --> J
  L[steps]:::optional --> J
  M[noise]:::optional --> J
  N[tuners]:::optional --> J
  O[extractors]:::optional --> J
  classDef required fill:#1696d2,stroke:#1696d2,color:#ffffff;
  classDef optional fill:#ec008b,stroke:#ec008b,color:#ffffff;
Figure 4.4: A diagram of synth_spec components. Objects in blue are required while objects in magenta are optional.

A synth_spec S3 object specifies the modeling and sampling components used for sequential synthetic data generation. Each synth_spec requires that every synthesized variable be associated with a model object and a sampler function.

synth_spec provides flexibility to arbitrarily specify the details of different models, samplers, and more. We distinguish between modeling and sampling steps because in either stage, we may deviate from traditional generative modeling techniques (e.g., using additional privacy-preserving randomization at the modeling or sampling stage).

In this minimal example, we only specify default models and samplers that differentiate between categorical and numeric outputs.

  • default_regression_model: The default predictive model used to generate numeric data.
  • default_classification_model: The default predictive model used to generate categorical data.
  • default_regression_sampler: The default sampling method used to sample from regression models.
  • default_classification_sampler: The default sampling method used to sample from classification models.

synth_spec instances require models (specified using the parsnip::model_spec convention) and samplers (functions with a shared signature defined in the package). tidysynthesis provides default samplers for many common modeling engines used in parsnip, but you can also specify any custom model_spec and sampler function. Below, we implement basic classification and regression tree (CART) models and samplers from tidysynthesis using the rpart engine.

All remaining arguments are optional and are included in Figure 4.4.

# create basic parsnip CART models for regression and classification
rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- parsnip::decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

# create a basic synth_spec 
acs_synth_spec <- synth_spec(
  # use previously defined parsnip models
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  # use tidysynthesis-provided sampler functions
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)

4.1.4 presynth

The presynth object combines the roadmap and synth_spec into the last object needed before we create synthetic data.

acs_presynth <- presynth(
  roadmap = acs_roadmap,
  synth_spec = acs_synth_spec
)

presynth() performs many checks to catch errors before the heavy computation of generating synthetic data in synthesize().

4.1.5 synthesize()

With a presynth object, we can finally synthesize the data using the synthesize() function.

acs_result <- synthesize(presynth = acs_presynth)
acs_result
Postsynth 
Synthetic Data: 1500 synthetic observations, 12 variables 
Total Synthesis Time: 0.607065916061401 seconds

We have synthetic data! The resulting object is a postsynth S3 object that contains information about the synthetic data and its generation process, such as the starting variables and model. Most importantly, the postsynth contains the resulting synthetic_data:

glimpse(acs_result[["synthetic_data"]])
Rows: 1,500
Columns: 12
$ county       <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq           <fct> Household, Household, Household, Household, Household, Ho…
$ sex          <fct> Male, Female, Male, Male, Male, Female, Female, Female, M…
$ marst        <fct> Married, Single, Married, Married, Single, Married, Marri…
$ hcovany      <fct> No health insurance coverage, With health insurance cover…
$ empstat      <fct> Employed, NA, Employed, Employed, NA, Employed, Employed,…
$ classwkr     <fct> Self-employed, N/A, Works for wages, Works for wages, N/A…
$ age          <dbl> 47, 13, 53, 62, 0, 53, 55, 73, 47, 30, 80, 17, 84, 38, 22…
$ famsize      <dbl> 6, 5, 2, 2, 1, 1, 1, 1, 3, 3, 2, 4, 2, 5, 1, 1, 3, 2, 3, …
$ transit_time <dbl> 1, 0, 20, 20, 0, 25, 7, 5, 5, 30, 0, 0, 0, 25, 30, 0, 20,…
$ inctot_NA    <fct> nonmissing value, missing value, nonmissing value, nonmis…
$ inctot       <dbl> 52600, NA, 65000, 82000, NA, 69000, 43000, 38030, 51000, …

Let’s look at all the steps together in one code chunk:

# create start_data by selecting one variable from the confidential data
acs_start_county <- select(acs_conf_nw, county)

# create a minimal roadmap
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw, 
  start_data = acs_start_county
)

# create basic parsnip CART models for regression and classification
rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- parsnip::decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

# create a basic synth_spec 
acs_synth_spec <- synth_spec(
  # use previously defined parsnip models
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  # use tidysynthesis-provided sampler functions
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)

# create the presynth object
acs_presynth <- presynth(
  roadmap = acs_roadmap,
  synth_spec = acs_synth_spec
)

acs_result <- synthesize(presynth = acs_presynth)

WARNING: We’re Not Done

We’ve generated synthetic data but we can’t determine if these synthetic data meet our utility or disclosure risk needs without a contextual analysis of the synthetic data and their properties.
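
A first contextual check might compare simple summaries of the confidential and synthetic data and look for exact record matches. The sketch below uses base R and toy data. The `synth` data frame is only a bootstrap resample standing in for real synthesizer output, and the helper `compare_props()` is hypothetical, not a tidysynthesis function.

```r
set.seed(42)

# toy confidential data (hypothetical; not the ACS extract)
conf <- data.frame(
  county = factor(sample(c("Douglas", "Lancaster", "Other"), 500,
                         replace = TRUE, prob = c(0.3, 0.2, 0.5))),
  age = round(runif(500, 0, 90))
)

# stand-in "synthetic" data: a bootstrap resample of the confidential rows
synth <- conf[sample(nrow(conf), replace = TRUE), ]

# utility check 1: compare category proportions level by level
compare_props <- function(conf_x, synth_x) {
  lv <- levels(conf_x)
  data.frame(
    level = lv,
    conf  = as.numeric(prop.table(table(factor(conf_x, levels = lv)))),
    synth = as.numeric(prop.table(table(factor(synth_x, levels = lv))))
  )
}
props <- compare_props(conf$county, synth$county)
props$abs_diff <- abs(props$conf - props$synth)

# utility check 2: compare numeric summary statistics
num_summary <- rbind(
  conf  = c(mean = mean(conf$age),  sd = sd(conf$age)),
  synth = c(mean = mean(synth$age), sd = sd(synth$age))
)

# disclosure check: share of synthetic rows that exactly copy a confidential row
conf_keys  <- do.call(paste, c(conf,  sep = "|"))
synth_keys <- do.call(paste, c(synth, sep = "|"))
exact_match_rate <- mean(synth_keys %in% conf_keys)

props
num_summary
exact_match_rate
```

Because the stand-in is a resample of real records, every synthetic row matches a confidential row. The summaries look excellent while the match rate signals an obvious disclosure problem, which is why utility and risk need to be evaluated together.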

4.2 Synthesis planning and starting data

The first set of design decisions concerns the macroscopic structure of a synthesis:

  • What data should we first generate?
  • How should we order the remaining variables?

Common techniques for starting data:

  • Selecting particular variables verbatim (also known as partially synthetic data).
  • Resampling particular variables with replacement.
  • Modeling the joint distribution of starting variable values and sampling values from this model.

Randomization considerations:

  • Using unmodified starting data records, with or without resampling, can…
    • Pro: Enable linkages between datasets within a statewide longitudinal data system (SLDS) (e.g., connecting information about high schools and university enrollment by leaving the linking variables unsynthesized).
    • Con: May increase disclosure risks by leaking information associated with the confidential data (e.g., allowing users to infer exact subpopulation sizes).
  • Using randomly modified starting records can…
    • Pro: Improve disclosure risk protections associated with releasing starting data.
    • Con: Reduce possible interoperability between SLDS datasets (e.g., high school records can no longer be linked to university records).

Example 1: select 2 variables as example starting data

start_data_1 <- acs_conf_nw |> 
  select(county, gq) # select starting variables

glimpse(start_data_1)
Rows: 1,500
Columns: 2
$ county <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sarpy, O…
$ gq     <fct> Household, Household, Household, Household, Household, Househol…


Example 2: resample records with replacement

start_data_2 <- acs_conf_nw |> 
  select(county, gq) |> # select starting variables
  slice_sample(         # resample using dplyr
    n = 1000,           # number of records to resample
    replace = TRUE      # resampling with replacement
  )

glimpse(start_data_2)
Rows: 1,000
Columns: 2
$ county <fct> Sarpy, Sarpy, Other, Douglas, Other, Other, Lancaster, Other, D…
$ gq     <fct> Household, Household, Household, Household, Household, Househol…


Example 3: resample records from a noisy frequency table

start_data_3 <- acs_conf_nw |> 
  select(county, gq) |>
  start_resample(           # resample using tidysynthesis
    n = 1000,               # number of records to resample
    support = "all",        # whether to use observed values or all possible values
    inv_noise_scale = 1.0   # noise to add to empirical frequency of values
  )

glimpse(start_data_3)
Rows: 1,000
Columns: 2
$ county <fct> Other, Other, Other, Other, Other, Other, Other, Other, Other, …
$ gq     <fct> Other GQ, Other GQ, Other GQ, Other GQ, Other GQ, Other GQ, Oth…
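
To show the idea behind Example 3 without relying on package internals, here is a base R sketch of resampling from a noisy frequency table. It is an assumption-laden illustration, not the implementation of start_resample(): the Laplace noise, the `noise_scale` value, and the clipping at zero are choices made here for demonstration, and the data are simulated.

```r
set.seed(7)

# toy starting variables (hypothetical; not the ACS extract)
conf <- data.frame(
  county = factor(sample(c("Douglas", "Lancaster", "Other"), 400, replace = TRUE)),
  gq     = factor(sample(c("Household", "Other GQ"), 400,
                         replace = TRUE, prob = c(0.95, 0.05)))
)

# frequency table over every combination of levels,
# including combinations that may not appear in the data
cells <- as.data.frame(table(conf$county, conf$gq), responseName = "n")
names(cells)[1:2] <- c("county", "gq")

# Laplace(0, b) noise, generated as the difference of two exponentials
noise_scale <- 1
laplace <- rexp(nrow(cells), rate = 1 / noise_scale) -
  rexp(nrow(cells), rate = 1 / noise_scale)

# add noise, clip negative counts to zero, and normalize to probabilities
noisy_n <- pmax(cells$n + laplace, 0)
probs <- noisy_n / sum(noisy_n)

# draw new starting records from the noisy cell probabilities
idx <- sample.int(nrow(cells), size = 1000, replace = TRUE, prob = probs)
start_noisy <- cells[idx, c("county", "gq")]
rownames(start_noisy) <- NULL

head(start_noisy)
```

Larger noise makes the resampled starting data depend less on the confidential counts, at the cost of distorting the joint distribution of the starting variables.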


4.3 Strategies for sequential ordering

Definition: Visit Sequence

A visit sequence is the order in which the variables not included in the starting data are synthesized.

  • Visit sequence choices can be data-dependent (e.g., evaluating the ease with which certain variables could or could not be conditionally modeled)…
  • … or data-independent (e.g., evaluating variable ordering based on subject matter expertise, structural relationships, or other domain knowledge).

IMPORTANT: General strategies for determining visit sequence order
  1. Categorical variables are generally easier to synthesize than numeric variables because their probability distributions are easier to represent.
  2. Categorical variables with fewer levels are typically easier to synthesize than variables with many levels.
  3. If one variable structurally depends on another, it should come after the prerequisite variable in the visit sequence.

When using fully conditional synthesis, not all variables can be effectively modeled using the same predictor variables. For example, if a model for graduation rate relies on predictors that fail to capture statistically meaningful differences in graduation rates, the resulting synthetic data may be lower quality or have less utility. This is why it is important to synthesize variables in an order that captures increasingly complex relationships: doing so improves the quality of variables synthesized later in the process.
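
One data-dependent way to apply the first two strategies is to rank categorical variables by their Shannon entropy, so that variables with fewer or more concentrated levels come first. The base R sketch below uses a small made-up data frame. The helper `shannon_entropy()` is hypothetical and written for this example.

```r
# Shannon entropy (in bits) of a categorical variable, ignoring missing values
shannon_entropy <- function(x) {
  p <- prop.table(table(x, useNA = "no"))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# toy categorical data (hypothetical)
toy <- data.frame(
  insured = factor(c(rep("yes", 9), "no")),
  degree  = factor(rep(c("Associates", "Bachelors", "Masters",
                         "Doctorate", "Associates"), times = 2)),
  major   = factor(letters[1:10])
)

# entropy per variable, then sort from lowest to highest
ent <- sapply(toy, shannon_entropy)
visit_order <- names(sort(ent))

round(ent, 3)
visit_order
```

In this toy example the nearly constant binary variable comes first and the variable with ten distinct levels comes last.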


Class Activity 1

This is a conceptual question about sequential ordering.

Which variable would be generally easier to synthesize: degree type (e.g., Bachelors, Associates, etc.), or major field (e.g., accounting, chemistry, etc.)?

Degree type, due to fewer categorical levels.


Class Activity 2

This is a conceptual question about sequential ordering.

Suppose you wanted to synthesize two different variables, school system (e.g., California State University) and school name (e.g., Cal State Fullerton). Each school name is associated with only one school system. Which variable would be better to synthesize first?

School system, due to the hierarchical relationship between the two variables.

4.3.1 tidysynthesis snippet

tidysynthesis lets users implement these strategies using different metric choices to find data-driven synthesis orders:

acs_roadmap <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
) |>
  # add indicator variables for missing data
  enforce_schema() |>
  # order categorical variables by their information entropy
  add_sequence_factor(where(is.factor)) |>
  # order numeric variables by their correlation with age
  add_sequence_numeric(
    where(is.numeric),
    method = "correlation", 
    cor_var = "age",
    na.rm = TRUE
  )

acs_roadmap[["visit_sequence"]]
Visit Sequence
Method:Variable
entropy:hcovany entropy:inctot_NA entropy:empstat entropy:classwkr correlation:age correlation:famsize correlation:transit_time correlation:inctot 
# visualize the ACS data in sequential order
acs_roadmap[["conf_data"]] |>
  select(all_of(acs_roadmap[["visit_sequence"]][["visit_sequence"]])) |>
  glimpse()
Rows: 1,500
Columns: 8
$ hcovany      <fct> With health insurance coverage, With health insurance cov…
$ inctot_NA    <fct> missing value, nonmissing value, missing value, missing v…
$ empstat      <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr     <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age          <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize      <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot       <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …
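
For intuition about correlation-based ordering of numeric variables, the base R sketch below ranks variables by the absolute value of their correlation with an anchor variable. The simulated data, the descending sort, and the use of complete observations to handle missing values are assumptions made here; they are not a description of how add_sequence_numeric() computes its ordering.

```r
set.seed(3)

# toy numeric data (hypothetical; not the ACS extract)
n <- 200
toy <- data.frame(age = runif(n, 18, 80))
toy$experience <- toy$age - 18 + rnorm(n, sd = 2)        # strongly related to age
toy$income     <- 1000 * toy$age + rnorm(n, sd = 30000)  # moderately related
toy$commute    <- rnorm(n, mean = 20, sd = 10)           # unrelated
toy$income[sample(n, 20)] <- NA                          # some missing values

# correlation of every other numeric variable with the anchor variable
anchor <- "age"
others <- setdiff(names(toy), anchor)
cors <- sapply(
  toy[others],
  function(x) cor(x, toy[[anchor]], use = "complete.obs")
)

# anchor first, then the rest by decreasing absolute correlation
visit_order <- c(anchor, names(sort(abs(cors), decreasing = TRUE)))

round(cors, 2)
visit_order
```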

4.4 Model selection: complexity trade-offs

WARNING: Model quality

The quality of model fitting establishes an upper bound on the downstream usefulness of synthetic data.

  • Model quality creates a utility bottleneck for synthetic data.
  • If we can’t successfully capture data relationships in a model, we can’t successfully use the synthetic data.
  • For example, if a model struggles to successfully find a relationship between demographic variables and graduation rate, the resulting synthetic data won’t reflect that relationship.

Crucially, synthetic data only allow data users to perform statistical analyses that are supported by the synthesis process. If a synthetic data user wants to investigate a particular aspect of the confidential data distribution, the synthesis models must account for that aspect. In this way, model quality is a “utility bottleneck”: a downstream application can reproduce a result from the confidential data only if the models captured the relevant relationships. The effectiveness of synthetic data hinges on successful model building.

WARNING: Another note on model quality

When modeling relationships for synthetic data, we don’t apply the same principles for model quality that we do for traditional analyses!

  • In most supervised learning settings, we’re interested in evaluating how well models predict new samples, i.e., how well they generalize.
    • Test error is the counterbalance against model complexity.
    • For example, how well does a model that predicts 2024 college graduate GPA work on 2025 college graduates?
  • In synthetic data modeling, we’re interested in evaluating how well models replicate features of the data.
    • Disclosure risk is the counterbalance against model complexity.
    • For example, how closely do a model’s predictions of 2024 college graduate GPAs match the actual GPAs of those same students?

There are countless methods for modeling the relationships between variables. Generic model fitting methods (sometimes known as nonparametric, empirical, or “black box” methods) allow for more complex, flexible models than simpler methods (such as parametric or generative models like regressions). Although more complex models can produce higher-utility synthetic data, they can also overfit to the confidential data, leaking information about data subjects in the process.

A graphic illustrating the trade-offs in model development, composed of two parts: “Generalization” and “Privacy-Utility Trade-off.” The left chart, titled “Generalization,” plots “Error” on the y-axis against “Model Complexity” on the x-axis. It features two lines: A blue “Training Data” line shows error consistently decreasing as model complexity increases. A yellow “Test Data” line forms a U-shape, showing error decreasing to a minimum before increasing again as the model becomes more complex, which illustrates overfitting. The right side, titled “Privacy-Utility Trade-off,” contains two charts stacked vertically over the same “Model Complexity” x-axis: The bottom chart shows “Error” on its y-axis, with a blue line decreasing as complexity increases. The top chart shows “Disclosure Risk” on its y-axis, with a yellow line that curves upward, indicating risk increases with model complexity. A vertical, dashed pink line labeled “Policy decision on the trade-off” cuts across both right-hand charts, signifying a chosen balance point between model utility (lower error) and privacy (lower disclosure risk).
Figure 4.5: Comparison of generalization and privacy-utility trade-off as model complexity increases.
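
The left panel of the figure can be reproduced in miniature with base R. The sketch below fits polynomial regressions of increasing degree to simulated data and compares training and test error. Polynomial regression is swapped in here only because it makes complexity easy to dial up; the data, the degrees, and the `rmse()` helper are assumptions for illustration.

```r
set.seed(11)

# simulated data (hypothetical): a nonlinear signal plus noise
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)

# split into training and test halves
train <- dat[1:100, ]
test  <- dat[101:200, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# fit models of increasing complexity and record both errors
degrees <- c(1, 3, 6, 12)
results <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  data.frame(
    degree     = d,
    train_rmse = rmse(train$y, predict(fit, newdata = train)),
    test_rmse  = rmse(test$y,  predict(fit, newdata = test))
  )
}))

results
```

Training error can only fall as the degree grows because the models are nested. In the synthetic data setting, a very low training error means the model reproduces the confidential records closely, which is the disclosure risk side of the trade-off.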

tidysynthesis lets you use any supervised model of your choice, whether it is already available through library(tidymodels) or one you implement yourself. You can view the full list of supported models here.

Example 1: create basic linear regression model

lm_mod <- linear_reg() |> 
  set_engine(engine = "lm") |>
  set_mode(mode = "regression")


Example 2: create basic parsnip CART models for regression and classification

rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- parsnip::decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")


4.5 Sampling approaches

WARNING: Check model assumptions!

Sampling from conditional models requires data generating assumptions, but not all predictive models have such assumptions!

  • Some models directly imply specific samplers (e.g., sampling from linear or generalized linear regression models).
  • Some models need methods for specifying sampling procedures (e.g., sampling from a decision tree).
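
The two cases above can be contrasted in a short sketch using base R and the rpart package. For the linear model, the normal error assumption directly supplies a sampler. For the regression tree, this sketch draws an observed value from the confidential records that fall in the same leaf, which is one common choice and is not claimed to be how sample_lm() or sample_rpart() are implemented. The toy data are simulated.

```r
library(rpart)

set.seed(20260423)

# toy confidential data (hypothetical; not the ACS extract)
n <- 300
conf <- data.frame(age = sample(18:80, n, replace = TRUE))
conf$income <- 20000 + 600 * conf$age + rnorm(n, sd = 8000)

# case 1: the model implies a sampler
# linear regression assumes normal errors, so draw from
# Normal(predicted mean, residual standard deviation)
lm_fit <- lm(income ~ age, data = conf)
synth_lm <- rnorm(
  n,
  mean = predict(lm_fit, newdata = conf),
  sd = summary(lm_fit)$sigma
)

# case 2: the model needs a sampler to be specified
# a regression tree only predicts leaf means, so we choose to draw
# a donor value from the training records in the same leaf
tree_fit <- rpart(income ~ age, data = conf, method = "anova")

# leaf mean assigned to each training record
train_leaf_mean <- tree_fit$frame$yval[tree_fit$where]

# leaf mean for each record to synthesize (matches leaves by predicted value)
new_leaf_mean <- predict(tree_fit, newdata = conf)

synth_tree <- vapply(
  new_leaf_mean,
  function(p) {
    donors <- conf$income[train_leaf_mean == p]
    donors[sample.int(length(donors), 1)]
  },
  numeric(1)
)

summary(synth_lm)
summary(synth_tree)
```

Matching leaves by predicted value would pool two leaves if they happened to share exactly the same mean, which is acceptable for a sketch. Note that the donor approach can only return values that appear in the confidential data, a property worth weighing in a disclosure risk review.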

tidysynthesis includes samplers for a subset of library(parsnip) models:

  • sample_lm()
  • sample_glm()
  • sample_rpart()
  • sample_ranger()

Note: Randomness is a part of disclosure risk protections

Randomness is the key ingredient in synthetic data for disclosure risk protections.

Sources of randomness are not exclusive to samplers!

  • Model fitting: how do we determine model parameters?
  • Model quality: how much variability exists about our best predictions for new values?
  • Sampling design: how could we design additional variance into new values?

For example, the following object can be used to add discrete Gaussian noise to regression samplers in our synthesis:

# create noise objects for regression models
noise_reg <- noise(
  add_noise = TRUE,
  noise_func = add_noise_disc_gaussian,
  variance = 1
)

# create a basic synth_spec 
acs_synth_spec <- synth_spec(
  # use previously defined parsnip models
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  # use tidysynthesis-provided sampler functions
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart,
  default_regression_noise = noise_reg
)
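
To build intuition for what discrete Gaussian noise looks like, the base R sketch below draws integer-valued noise with probability proportional to exp(-k^2 / (2 * variance)) on a truncated range and adds it to a few values. The function `r_disc_gaussian()` and its truncation bound are hypothetical and approximate; this is not the implementation behind add_noise_disc_gaussian.

```r
# approximate discrete Gaussian sampler on the integers -bound, ..., bound
r_disc_gaussian <- function(n, variance = 1, bound = NULL) {
  stopifnot(variance > 0)
  if (is.null(bound)) bound <- ceiling(10 * sqrt(variance))
  support <- seq.int(-bound, bound)
  # unnormalized weights; sample() rescales them to probabilities
  weights <- exp(-support^2 / (2 * variance))
  sample(support, size = n, replace = TRUE, prob = weights)
}

set.seed(5)

# a few sampled integer values, e.g. synthetic transit times (hypothetical)
synth_values <- c(0, 15, 30, 7)

# add integer noise so the outputs stay on the integer scale
noisy_values <- synth_values + r_disc_gaussian(length(synth_values), variance = 1)

noisy_values
```

Because the noise is integer-valued, count-like variables remain whole numbers after noise is added, although they may need post-processing if negative values are not meaningful.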

Let’s explore a visual example with a generative model for a categorical variable, county, in our ACS data. We’ll consider a basic noise addition technique, replacing a portion of our records with a uniform random sample from the values in county.

The mixture proportion determines what percent of sampled records come from the uniform random sample (versus the confidential data empirical distribution).

Class Activity 3

This is a conceptual activity about sampling.

What happens to data utility when the mixture proportion increases?

As the proportion increases, the synthetic data become less useful because their distribution differs more from that of the original confidential data.

Class Activity 4

This is a conceptual activity about sampling.

What happens to our disclosure risk protections when the mixture proportion increases?

As the proportion increases, the synthetic data have more disclosure risk protection because the distribution of the synthetic data depends less on the confidential data.

This is also an example of a formally private mechanism.
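
The mixture mechanism discussed in this section can be sketched in base R. The code below is a toy illustration under stated assumptions: the county values are simulated rather than taken from the ACS extract, and `mix_sample()` and `tvd()` are hypothetical helpers written for this example.

```r
set.seed(9)

# toy confidential county values (hypothetical proportions)
conf_county <- factor(sample(
  c("Douglas", "Lancaster", "Sarpy", "Other"),
  size = 2000, replace = TRUE, prob = c(0.30, 0.15, 0.10, 0.45)
))

# draw each record from the uniform distribution with probability mix_prop,
# otherwise from the empirical distribution of the confidential data
mix_sample <- function(x, mix_prop, n = length(x)) {
  lv <- levels(x)
  from_uniform <- runif(n) < mix_prop
  out <- as.character(sample(x, n, replace = TRUE))
  out[from_uniform] <- sample(lv, sum(from_uniform), replace = TRUE)
  factor(out, levels = lv)
}

# total variation distance between two categorical samples with shared levels
tvd <- function(a, b) {
  0.5 * sum(abs(prop.table(table(a)) - prop.table(table(b))))
}

# distance from the confidential distribution at several mixture proportions
mix_props <- c(0, 0.25, 0.5, 1)
distances <- sapply(
  mix_props,
  function(p) tvd(conf_county, mix_sample(conf_county, p))
)

data.frame(mix_prop = mix_props, tvd = round(distances, 3))
```

The distance from the confidential distribution grows as the mixture proportion rises, which is the utility cost paid for output that depends less on the confidential data.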