3 Basic Synthesis
3.1 Introduction
This section walks through a minimal example of generating synthetic American Community Survey (ACS) data using library(tidyverse) and library(tidysynthesis).
This example focuses on a subset of 2019 ACS variables for respondents from the state of Nebraska. A subsample of these data is available in library(tidysynthesis) as the object acs_conf_nw, shown here:
Rows: 1,500
Columns: 11
$ county <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq <fct> Household, Household, Household, Household, Household, Ho…
$ sex <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany <fct> With health insurance coverage, With health insurance cov…
$ empstat <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …
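The listing above has the layout produced by glimpse(). A minimal sketch of how it could be reproduced, assuming both packages are installed and that acs_conf_nw becomes available once tidysynthesis is attached:

```r
library(tidyverse)      # attaches dplyr, which provides glimpse()
library(tidysynthesis)  # assumed to make acs_conf_nw available

# print one line per column: name, type, and the first few values
glimpse(acs_conf_nw)
```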
Recall that a tidysynthesis run has four high-level steps:
- Create a roadmap S3 object.
- Create a synth_spec S3 object.
- Create a presynth S3 object using this roadmap and synth_spec.
- Create synthetic data using synthesize(presynth = my_presynth).
Synthetic data generation with tidysynthesis typically requires fitting and sampling from multiple models, which makes it computationally intensive. To ease this burden, tidysynthesis uses lazy evaluation so that a synthesis can be fully configured before any substantive computation takes place. That computation begins only when synthesize() is called.
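To make "fitting and sampling from multiple models" concrete, here is a rough sketch of sequential synthesis written directly with library(rpart). It is an illustration only and is not how tidysynthesis is implemented. The toy data, the variable names, and the leaf-sampling rule for the numeric variable are all assumptions made for this example.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(20240101)

# hypothetical toy "confidential" data (not ACS records)
n <- 300
toy_conf <- data.frame(
  county = factor(sample(c("Douglas", "Lancaster", "Other"), n, replace = TRUE))
)
p_employed <- ifelse(toy_conf$county == "Other", 0.4, 0.7)
toy_conf$employed <- factor(ifelse(runif(n) < p_employed, "yes", "no"))
toy_conf$income <- round(
  ifelse(toy_conf$employed == "yes", 40000, 10000) + rnorm(n, sd = 5000)
)

# start from the county column as-is; the rest is synthesized in order
toy_synth <- toy_conf["county"]

# model 1: categorical variable, conditional on county
fit_emp <- rpart(employed ~ county, data = toy_conf, method = "class")
probs <- predict(fit_emp, newdata = toy_synth, type = "prob")
toy_synth$employed <- factor(
  apply(probs, 1, function(p) sample(colnames(probs), size = 1, prob = p)),
  levels = levels(toy_conf$employed)
)

# model 2: numeric variable, conditional on county and the synthetic employed
fit_inc <- rpart(income ~ county + employed, data = toy_conf, method = "anova")
conf_leaf <- predict(fit_inc, newdata = toy_conf)    # leaf mean per confidential row
synth_leaf <- predict(fit_inc, newdata = toy_synth)  # leaf mean per synthetic row

# draw each synthetic income from the confidential incomes in the matching leaf
toy_synth$income <- vapply(
  synth_leaf,
  function(m) {
    pool <- toy_conf$income[conf_leaf == m]
    pool[sample.int(length(pool), 1)]
  },
  numeric(1)
)

head(toy_synth)
```

Each later model uses the variables synthesized before it as predictors, so the number of model fits grows with the number of synthesized variables.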
We cover each step in detail in the remainder of this document.
3.2 roadmap
A roadmap object contains information about the order of operations for a specific synthesis and is required for all syntheses. You can create a roadmap S3 object using the roadmap() function, which requires two arguments:
- conf_data: A data frame with the confidential data used to generate the synthetic data. The resulting synthetic data will have the same number of columns as conf_data.
- start_data: A data frame with a strict subset of variables from conf_data, which is used to start the synthesis process. The resulting synthetic data will have the same number of rows as start_data.
The first decision to make when using tidysynthesis is which variables to include in start_data. Any variables not included in start_data will be conditionally synthesized, using the variables in start_data as predictors. Including more variables in start_data provides more predictors for the sequential synthesis.
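The trade-off can be seen on a small scale with base R alone. The sketch below uses a hypothetical toy data frame (toy_conf is not part of tidysynthesis) and lists which variables would be left to synthesize under two choices of start_data:

```r
# hypothetical toy stand-in for confidential data (not ACS records)
toy_conf <- data.frame(
  county = factor(c("Douglas", "Lancaster", "Other", "Other")),
  sex    = factor(c("Female", "Male", "Male", "Female")),
  age    = c(41, 36, 12, 55)
)

# option 1: start from a single variable
toy_start_small <- toy_conf[, "county", drop = FALSE]

# option 2: start from two variables
toy_start_large <- toy_conf[, c("county", "sex"), drop = FALSE]

# variables absent from start_data are the ones left to synthesize
to_synthesize_small <- setdiff(names(toy_conf), names(toy_start_small))
to_synthesize_large <- setdiff(names(toy_conf), names(toy_start_large))

to_synthesize_small
to_synthesize_large
```

With the larger start_data, fewer variables are modeled, and the remaining one can draw on two predictors instead of one.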
For this minimal example, we include only one variable, county, in our start_data and carry it over as-is.
# load the packages used throughout this example
library(tidyverse)
library(tidysynthesis)

# create start_data by selecting one variable from the confidential data
acs_start_county <- select(acs_conf_nw, county)

# create a minimal roadmap
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_county
)

# display the roadmap
acs_roadmap
Roadmap:
conf_data: 1500 observations, 11 variables
start_data: 1500 observations, 1 variables
3.3 synth_spec
A synth_spec object specifies the modeling and sampling components used for sequential synthetic data generation. The synth_spec() function creates a synth_spec S3 object and has many arguments for customizing it. However, every synth_spec requires that each synthesized variable be associated with a model object and a sampler function.
The synth_spec gives you the flexibility to specify the details of different models, samplers, and more (see the synth_spec documentation for more details). In this minimal example, we only specify default models and samplers, which differ between categorical and numeric outputs:
- default_regression_model: The default predictive model used to generate numeric data.
- default_classification_model: The default predictive model used to generate categorical data.
- default_regression_sampler: The default sampling method used to sample from regression models.
- default_classification_sampler: The default sampling method used to sample from classification models.
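The split between the two kinds of defaults follows the type of each synthesized variable. The base R sketch below illustrates that rule on a hypothetical toy data frame. It mirrors the description above and is not tidysynthesis's internal code.

```r
# hypothetical toy data with one categorical and two numeric columns
toy <- data.frame(
  marst  = factor(c("Single", "Married", "Divorced")),
  age    = c(22, 41, 36),
  inctot = c(0, 68000, 26200)
)

# numeric columns would use the regression defaults,
# all other columns the classification defaults
default_kind <- vapply(
  toy,
  function(col) if (is.numeric(col)) "regression" else "classification",
  character(1)
)

default_kind
```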
All models must be model_spec objects from library(parsnip), and all samplers must be functions with a specific signature (see samplers for more details). tidysynthesis provides default samplers for many common modeling engines used in parsnip, but you can also specify any custom model_spec and sampler function. Below, we implement basic classification and regression tree (CART) models and samplers from tidysynthesis using the rpart engine.
# create basic parsnip CART models for regression and classification
# (functions are prefixed with parsnip:: so the code does not depend on
# parsnip being attached)
rpart_mod <- parsnip::decision_tree() |>
  parsnip::set_engine(engine = "rpart") |>
  parsnip::set_mode(mode = "regression")

rpart_class <- parsnip::decision_tree() |>
  parsnip::set_engine(engine = "rpart") |>
  parsnip::set_mode(mode = "classification")

# create a basic synth_spec
acs_synth_spec <- synth_spec(
  # use previously defined parsnip models
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  # use tidysynthesis-provided sampler functions
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)
3.4 presynth
The presynth object combines the roadmap and synth_spec into the last object needed before we create synthetic data.
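The call that builds this object does not appear in the text. A minimal sketch follows, assuming presynth() accepts arguments named roadmap and synth_spec. The objects from Sections 3.2 and 3.3 are rebuilt here so the block runs on its own, and the name acs_presynth is chosen for this example.

```r
library(tidyverse)
library(tidysynthesis)

# objects from Sections 3.2 and 3.3, repeated so this block is self-contained
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw,
  start_data = select(acs_conf_nw, county)
)

acs_synth_spec <- synth_spec(
  default_regression_model = parsnip::decision_tree() |>
    parsnip::set_engine(engine = "rpart") |>
    parsnip::set_mode(mode = "regression"),
  default_classification_model = parsnip::decision_tree() |>
    parsnip::set_engine(engine = "rpart") |>
    parsnip::set_mode(mode = "classification"),
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)

# assumed call: combine the roadmap and the synth_spec into a presynth
acs_presynth <- presynth(
  roadmap = acs_roadmap,
  synth_spec = acs_synth_spec
)
```

Because only defaults were supplied, a call of this kind reports which components (noise, tuners, extractors) fall back to their default settings.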
Warning in construct_noise(roadmap = roadmap, default_regression_noise =
synth_spec[["default_regression_noise"]], : No noise specified, using default
noise() object.
Warning in construct_tuners(roadmap = roadmap, default_regression_tuner =
synth_spec[["default_regression_tuner"]], : No tuners specified, using default
tuner
Warning in construct_extractors(roadmap = roadmap, default_extractor =
synth_spec[["default_extractor"]], : No extractors specified, using default
extractor.
Some variable(s) have no non-default visit sequence method specified: TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
3.5 synthesize()
With presynth, we can finally synthesize the data using the synthesize() function.
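A minimal sketch of this final step is shown below. The first three steps are repeated in compact form so the block runs on its own. The helper cart(), the object names, and the seed are assumptions for this example, as is the synthetic_data element name used on the last line. Because sampling is random, the values will differ from those printed in this section.

```r
library(tidyverse)
library(tidysynthesis)

# hypothetical helper: a parsnip CART model for the given mode
cart <- function(mode) {
  parsnip::decision_tree() |>
    parsnip::set_engine(engine = "rpart") |>
    parsnip::set_mode(mode = mode)
}

# steps 1 to 3, repeated compactly from the earlier sections
acs_presynth <- presynth(
  roadmap = roadmap(
    conf_data = acs_conf_nw,
    start_data = select(acs_conf_nw, county)
  ),
  synth_spec = synth_spec(
    default_regression_model = cart("regression"),
    default_classification_model = cart("classification"),
    default_regression_sampler = sample_rpart,
    default_classification_sampler = sample_rpart
  )
)

# step 4: fit the models and sample the synthetic data
set.seed(20240101)  # arbitrary seed, for repeatable draws
acs_postsynth <- synthesize(presynth = acs_presynth)

# print a summary of the run
acs_postsynth

# the synthetic data frame is assumed to be stored in synthetic_data
glimpse(acs_postsynth$synthetic_data)
```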
Synthesizing gq ...
Synthesizing sex ...
Synthesizing marst ...
Synthesizing hcovany ...
Synthesizing empstat ...
Synthesizing classwkr ...
Synthesizing age ...
Synthesizing famsize ...
Synthesizing transit_time ...
Synthesizing inctot_NA ...
Synthesizing inctot ...
Postsynth
Synthetic Data: 1500 synthetic observations, 12 variables
Total Synthesis Time: 0.456373929977417 seconds
We have synthetic data! The resulting object is a postsynth S3 object that contains information about the synthetic data and its generation process, such as the starting variables and model. Most importantly, the postsynth contains the resulting synthetic_data:
Rows: 1,500
Columns: 12
$ county <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq <fct> Household, Household, Household, Household, Household, Ho…
$ sex <fct> Female, Female, Female, Female, Male, Male, Male, Female,…
$ marst <fct> Single, Single, Married, Married, Married, Married, Singl…
$ hcovany <fct> With health insurance coverage, With health insurance cov…
$ empstat <fct> Employed, Employed, Employed, Employed, Employed, Employe…
$ classwkr <fct> Works for wages, Works for wages, Works for wages, Self-e…
$ age <dbl> 25, 36, 68, 70, 55, 36, 2, 22, 80, 73, 13, 75, 41, 82, 80…
$ famsize <dbl> 1, 4, 4, 3, 2, 3, 3, 3, 2, 2, 6, 2, 1, 2, 2, 3, 2, 5, 4, …
$ transit_time <dbl> 35, 10, 15, 5, 20, 18, 0, 0, 0, 0, 0, 30, 13, 0, 0, 0, 0,…
$ inctot_NA <fct> nonmissing value, nonmissing value, nonmissing value, non…
$ inctot <dbl> 1500, 63000, 20300, 123000, 60000, 80000, NA, 0, 9600, 72…