3  Basic Synthesis

3.1 Introduction

This section walks through a minimal example of generating synthetic American Community Survey (ACS) data using library(tidyverse), library(tidymodels), and library(tidysynthesis).

library(tidyverse)
library(tidymodels)
library(tidysynthesis)

This example focuses on a subset of 2019 ACS variables for respondents from the state of Nebraska. A subsample of these data ships with library(tidysynthesis) as the object acs_conf_nw, shown below:

glimpse(acs_conf_nw)
Rows: 1,500
Columns: 11
$ county       <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq           <fct> Household, Household, Household, Household, Household, Ho…
$ sex          <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst        <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany      <fct> With health insurance coverage, With health insurance cov…
$ empstat      <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr     <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age          <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize      <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot       <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …

Recall that a tidysynthesis run has four high-level steps:

  1. Create a roadmap S3 object.
  2. Create a synth_spec S3 object.
  3. Create a presynth S3 object using this roadmap and synth_spec.
  4. Create synthetic data using synthesize(presynth = my_presynth).

Note

Synthetic data generation with tidysynthesis typically requires fitting and sampling from multiple models, making it computationally intensive. To ease this burden, tidysynthesis uses lazy evaluation: a synthesis can be fully configured before any substantive computation, which begins only when synthesize() is called.

We cover each step in detail for the remainder of this document.

3.2 roadmap

A roadmap object records the order of operations for a specific synthesis and is required for every synthesis. You can create a roadmap S3 object with the function roadmap(), which requires two arguments:

  • conf_data: A data frame with the confidential data used to generate the synthetic data. The resulting synthetic data will contain the same variables as conf_data (the example output in Section 3.5 also includes an added inctot_NA column).
  • start_data: A data frame with a strict subset of variables from conf_data, which is used to start the synthesis process. The resulting synthetic data will have the same number of rows as start_data.

The first decision to make when using tidysynthesis is which variables to include in start_data. Any variables not included in start_data will be conditionally synthesized, using the variables in the start_data as predictors. Including more variables in start_data provides more predictors for the sequential synthesis.
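
To make "conditionally synthesized" concrete, the following is a toy base R sketch of the idea, not tidysynthesis's implementation. The data frame, its values, and the within-group resampling rule are all assumptions made for illustration; tidysynthesis itself fits a model (such as the CART models used later in this section) and draws from it with a sampler.

```r
set.seed(1)

# Toy confidential data (hypothetical values, not ACS records)
conf <- data.frame(
  county = factor(c("A", "A", "A", "B", "B", "B")),
  sex    = factor(c("F", "F", "M", "M", "M", "F"))
)

# start_data: county is carried over as-is
synth <- data.frame(county = conf$county)

# Synthesize sex conditional on county by resampling from the confidential
# values of sex observed within the same county (a stand-in for a fitted
# model plus sampler)
synth$sex <- factor(rep(NA, nrow(synth)), levels = levels(conf$sex))
for (cty in levels(conf$county)) {
  idx  <- synth$county == cty
  pool <- conf$sex[conf$county == cty]
  synth$sex[idx] <- sample(pool, size = sum(idx), replace = TRUE)
}

# Cross-tabulate the synthetic records
table(synth$county, synth$sex)
```

Each further variable would be generated the same way, conditioning on everything already present in the synthetic data, which is why a larger start_data gives later steps more predictors.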

For this minimal example, our start_data contains a single variable, county, taken as-is from the confidential data.

# create start_data by selecting one variable from the confidential data
acs_start_county <- select(acs_conf_nw, county)

# create a minimal roadmap
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw, 
  start_data = acs_start_county
)

# display the roadmap
acs_roadmap
Roadmap: 
conf_data: 1500 observations, 11 variables 
start_data: 1500 observations, 1 variables
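
As a variation, the sketch below builds a start_data with two variables and checks the strict-subset requirement described above before creating the roadmap. It uses only select() and roadmap() as shown in this section; the object names acs_start_two and acs_roadmap_two are hypothetical and are not used elsewhere in this document.

```r
library(dplyr)
library(tidysynthesis)

# start_data with two variables instead of one (county and gq both appear
# in acs_conf_nw, per the glimpse() output above)
acs_start_two <- select(acs_conf_nw, county, gq)

# start_data must hold a strict subset of the conf_data variables
all(names(acs_start_two) %in% names(acs_conf_nw))
ncol(acs_start_two) < ncol(acs_conf_nw)

# same roadmap() call as above, with the larger start_data
acs_roadmap_two <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_two
)
```

Under this roadmap, gq would be carried over from the start data rather than synthesized, and both county and gq would be available as predictors for the remaining variables.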

3.3 synth_spec

A synth_spec object specifies the modeling and sampling components used for sequential synthetic data generation. The synth_spec() function creates a synth_spec S3 object and accepts many arguments for customizing it. At a minimum, every synthesized variable must be associated with a model object and a sampler function.

The synth_spec gives you the flexibility to specify the details of different models, samplers, and more (see the synth_spec documentation for more details). In this minimal example, we specify only the default models and samplers, which differ by whether the synthesized variable is numeric or categorical:

  • default_regression_model: The default predictive model used to generate numeric data.
  • default_classification_model: The default predictive model used to generate categorical data.
  • default_regression_sampler: The default sampling method used to sample from regression models.
  • default_classification_sampler: The default sampling method used to sample from classification models.

All models must be model_spec objects from library(parsnip), and all samplers must be functions with a specific signature (see samplers for more details). tidysynthesis provides default samplers for many common modeling engines used in parsnip, but you can also specify any custom model_spec and sampler function. Below, we implement basic classification and regression tree (CART) models and samplers from tidysynthesis using the rpart engine.

# create basic parsnip CART models for regression and classification
rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

# create a basic synth_spec 
acs_synth_spec <- synth_spec(
  # use previously defined parsnip models
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  # use tidysynthesis-provided sampler functions
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)
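
Because any parsnip model_spec is accepted, the defaults can be swapped for a spec with explicit hyperparameters. The sketch below is one such variant; the name rpart_mod_small and the values for tree_depth and min_n are illustrative assumptions, not recommendations from the tidysynthesis authors.

```r
library(parsnip)

# Hypothetical variant of rpart_mod with explicit tree hyperparameters
rpart_mod_small <- decision_tree(tree_depth = 5, min_n = 10) |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

# A model_spec is only a specification; nothing is fit at this point
inherits(rpart_mod_small, "model_spec")
rpart_mod_small$mode
rpart_mod_small$engine
```

Such a spec could then be passed as default_regression_model in place of rpart_mod, keeping sample_rpart as the sampler since the engine is still rpart.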

3.4 presynth

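The presynth object combines the roadmap and synth_spec into the last object needed before we create synthetic data. Because the synth_spec above sets only models and samplers, presynth() warns that it is falling back to default noise, tuner, and extractor objects; these warnings are expected here.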

acs_presynth <- presynth(
  roadmap = acs_roadmap,
  synth_spec = acs_synth_spec
)
Warning in construct_noise(roadmap = roadmap, default_regression_noise =
synth_spec[["default_regression_noise"]], : No noise specified, using default
noise() object.
Warning in construct_tuners(roadmap = roadmap, default_regression_tuner =
synth_spec[["default_regression_tuner"]], : No tuners specified, using default
tuner
Warning in construct_extractors(roadmap = roadmap, default_extractor =
synth_spec[["default_extractor"]], : No extractors specified, using default
extractor.
Some variable(s) have no non-default visit sequence method specified: TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE

3.5 synthesize()

With the presynth object in hand, we can finally synthesize the data using the synthesize() function.

acs_result <- synthesize(presynth = acs_presynth)
Synthesizing gq ...
Synthesizing sex ...
Synthesizing marst ...
Synthesizing hcovany ...
Synthesizing empstat ...
Synthesizing classwkr ...
Synthesizing age ...
Synthesizing famsize ...
Synthesizing transit_time ...
Synthesizing inctot_NA ...
Synthesizing inctot ...
acs_result
Postsynth 
Synthetic Data: 1500 synthetic observations, 12 variables 
Total Synthesis Time: 0.456373929977417 seconds

We have synthetic data! The result is a postsynth S3 object that contains the synthetic data along with information about how they were generated, such as the starting variables and models. Most importantly, it contains the resulting synthetic_data, which has 12 columns rather than 11: the added inctot_NA column, synthesized just before inctot, flags whether inctot is missing.

glimpse(acs_result$synthetic_data)
Rows: 1,500
Columns: 12
$ county       <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq           <fct> Household, Household, Household, Household, Household, Ho…
$ sex          <fct> Female, Female, Female, Female, Male, Male, Male, Female,…
$ marst        <fct> Single, Single, Married, Married, Married, Married, Singl…
$ hcovany      <fct> With health insurance coverage, With health insurance cov…
$ empstat      <fct> Employed, Employed, Employed, Employed, Employed, Employe…
$ classwkr     <fct> Works for wages, Works for wages, Works for wages, Self-e…
$ age          <dbl> 25, 36, 68, 70, 55, 36, 2, 22, 80, 73, 13, 75, 41, 82, 80…
$ famsize      <dbl> 1, 4, 4, 3, 2, 3, 3, 3, 2, 2, 6, 2, 1, 2, 2, 3, 2, 5, 4, …
$ transit_time <dbl> 35, 10, 15, 5, 20, 18, 0, 0, 0, 0, 0, 30, 13, 0, 0, 0, 0,…
$ inctot_NA    <fct> nonmissing value, nonmissing value, nonmissing value, non…
$ inctot       <dbl> 1500, 63000, 20300, 123000, 60000, 80000, NA, 0, 9600, 72…
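
A natural next step is to compare the synthetic data with the confidential data. The sketch below defines a small base R helper for comparing the share of each level of a categorical variable. compare_proportions() is a hypothetical helper written for this illustration, not a tidysynthesis function, and the toy data frames stand in for the real objects so the block runs on its own.

```r
# Hypothetical helper (not part of tidysynthesis): compare the share of each
# level of a categorical variable in two data frames
compare_proportions <- function(conf, synth, var) {
  lv <- union(levels(factor(conf[[var]])), levels(factor(synth[[var]])))
  conf_share  <- prop.table(table(factor(conf[[var]], levels = lv)))
  synth_share <- prop.table(table(factor(synth[[var]], levels = lv)))
  data.frame(
    level = lv,
    conf  = as.numeric(conf_share),
    synth = as.numeric(synth_share),
    diff  = as.numeric(synth_share - conf_share)
  )
}

# Toy stand-ins for the confidential and synthetic data
toy_conf  <- data.frame(sex = factor(c("Female", "Male", "Male", "Female")))
toy_synth <- data.frame(sex = factor(c("Female", "Female", "Male", "Female")))

res <- compare_proportions(toy_conf, toy_synth, "sex")
res
```

With the objects from this section, the same helper could be called as compare_proportions(acs_conf_nw, acs_result$synthetic_data, "marst"). Similarly, setdiff(names(acs_result$synthetic_data), names(acs_conf_nw)) lists any columns added during synthesis.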