5  Synthesize!

tidysynthesis is lazy. Functions like roadmap() and synth_spec() create objects but don’t synthesize data or perform any significant computation. The actual synthesis happens when synthesize() is run on a presynth object.

5.1 presynth()

The presynth object is a container for every object that goes into a synthesis: the roadmap, synth_spec, noise, constraints, and replicates. A presynth object is created with the presynth() function.

The noise, constraints, and replicates objects are optional features of a synthesis, but presynth() still requires all three as arguments. When a synthesis does not use them, create each object with settings that turn the feature off.

# don't add extra noise to predictions
noise <- noise(
  roadmap = roadmap,
  add_noise = FALSE,
  exclusions = 0
)

# don't impose constraints
constraints <- constraints(
  roadmap = roadmap,
  constraints = NULL,
  max_z = 0
)

# only generate one synthetic data set
replicates <- replicates(
  replicates = 1,
  workers = 1,
  summary_function = NULL
)

presynth() requires each object as an argument and will throw an error if objects are incorrectly specified.

presynth1 <- presynth(
  roadmap = roadmap,
  synth_spec = synth_spec,
  noise = noise,
  constraints = constraints,
  replicates = replicates
)

5.2 constraints()

The example above does not impose any constraints, but this section explains how the constraints() function works. Example 03: Penguins (Again) shows constraints in a full synthesis.

The constraints() function takes three arguments: roadmap, constraints, and max_z. roadmap is the roadmap object created by roadmap(). constraints takes a data frame with four columns:

  • var is a string naming the variable being synthesized.
  • min and max are the minimum and maximum numeric values the variable may take when the condition holds.
  • conditions is a string containing a conditional statement that evaluates to TRUE or FALSE. The constraint is enforced when the statement evaluates to TRUE and is not enforced otherwise.

In the constraints data frame shown in Example 03: Penguins (Again), the first constraint has a conditions string of TRUE, so it is always enforced. The second constraint is enforced only when bill_length_mm < 40.
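As a rough sketch of the structure described above, a constraints data frame could be built as follows. The object names constraints_df and record, and the specific min and max values, are illustrative assumptions rather than the values used in Example 03. The last step uses base R's eval() and parse() to show how a conditions string can be read as a logical test against a record. It is only an illustration and not necessarily how tidysynthesis evaluates conditions internally.

```r
# Hypothetical constraints: variable bounds are made up for illustration.
constraints_df <- data.frame(
  var = c("bill_length_mm", "flipper_length_mm"),
  min = c(0, 0),
  max = c(Inf, 200),
  conditions = c("TRUE", "bill_length_mm < 40"),
  stringsAsFactors = FALSE
)

# A made-up, partially synthesized record to test the conditions against.
record <- data.frame(bill_length_mm = 35.2)

# Evaluate each conditions string as an R expression using the record's columns.
enforced <- vapply(
  constraints_df$conditions,
  function(cond) isTRUE(eval(parse(text = cond), envir = record)),
  logical(1),
  USE.NAMES = FALSE
)

enforced
```

Under these assumptions, a data frame like constraints_df would be passed as the constraints argument of constraints() in place of NULL.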

max_z takes either 0 or a positive integer. If max_z is greater than 0, tidysynthesis resamples up to max_z times in an attempt to draw a value that satisfies the constraint. If max_z is 0, or if no draw satisfies the constraint after max_z attempts, tidysynthesis assigns the value by “hard bounding,” forcing the synthetic value to whichever of min or max is closer.
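The resample-then-bound logic can be sketched in base R. The helper bounded_draw() and its arguments draw_fn, lower, and upper are hypothetical names introduced here for illustration. This is not the package's internal code, and whether the first draw counts toward max_z is an assumption of the sketch.

```r
# Illustrative helper (hypothetical name); not part of tidysynthesis.
bounded_draw <- function(draw_fn, lower, upper, max_z) {
  value <- draw_fn()
  redraws <- 0
  # Redraw while the value is out of bounds and redraws remain.
  while ((value < lower || value > upper) && redraws < max_z) {
    value <- draw_fn()
    redraws <- redraws + 1
  }
  # Hard bounding: move any remaining out-of-bounds value to the nearest bound.
  min(max(value, lower), upper)
}

# max_z = 0: no redraws, so an out-of-range draw is bounded immediately.
bounded_draw(function() 250, lower = 0, upper = 200, max_z = 0)

# max_z = 5: up to five redraws from a normal distribution before bounding.
set.seed(1)
bounded_draw(function() rnorm(1, mean = 190, sd = 20),
             lower = 0, upper = 200, max_z = 5)
```

A larger max_z gives the model more chances to produce an in-range value on its own, while max_z = 0 goes straight to hard bounding.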

5.3 synthesize()

The synthesize() function performs all of the specified computations to generate synthetic data. This includes feature and target engineering, model fitting, prediction, and adding noise.

set.seed(1)
synth1 <- synthesize(presynth1)

5.4 postsynth

synthesize() returns a postsynth object.

The postsynth object includes several important objects:

  • synthetic_data is a data frame with synthetic data.
  • jth_preprocessing is the trained preprocessing model for each synthetic variable. This includes parameters estimated during training that are required for transforming variables back to their original units.
  • total_synthesis_time is the total synthesis time for each data set.
  • jth_synthesis_time is the synthesis time in seconds for each synthetic variable.
  • ldiversity is a privacy metric unique to tree-based methods.
  • strata_keys are ids for identifying strata. This is only reported when a synthesis is stratified.

synth1
$synthetic_data
# A tibble: 333 × 7
   species   island sex    bill_length_mm flipper_length_mm body_mass_g
   <fct>     <fct>  <fct>           <dbl>             <dbl>       <dbl>
 1 Chinstrap Dream  male             48.4               203        4300
 2 Gentoo    Biscoe male             51.4               224        5300
 3 Adelie    Dream  female           35.2               186        3525
 4 Chinstrap Dream  male             49.5               207        4150
 5 Chinstrap Dream  male             53.5               197        4150
 6 Gentoo    Biscoe male             50.8               229        5200
 7 Chinstrap Dream  female           45.2               195        3675
 8 Adelie    Dream  female           38.1               198        3350
 9 Chinstrap Dream  male             50.5               201        4050
10 Chinstrap Dream  female           46.9               190        3700
# ℹ 323 more rows
# ℹ 1 more variable: bill_depth_mm <dbl>

$jth_preprocessing
$jth_preprocessing$flipper_length_mm
NULL

$jth_preprocessing$bill_depth_mm
NULL


$total_synthesis_time
[1] 0.208009

$jth_synthesis_time
# A tibble: 4 × 4
      j strata   variable          synthesis_time
  <int> <chr>    <ord>                      <dbl>
1     1 Strata 1 bill_length_mm            0.0496
2     2 Strata 1 flipper_length_mm         0.0239
3     3 Strata 1 body_mass_g               0.0243
4     4 Strata 1 bill_depth_mm             0.0277

$extractions
$extractions$bill_length_mm
[1] NA

$extractions$flipper_length_mm
[1] NA

$extractions$body_mass_g
[1] NA

$extractions$bill_depth_mm
[1] NA


$ldiversity
# A tibble: 333 × 4
   bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
            <int>             <int>       <int>         <int>
 1             61                20          42            34
 2             61                19          24            25
 3             44                25          38            36
 4             61                20          42            34
 5             61                20          42            34
 6             61                19          24            25
 7             54                25          38            36
 8             44                25          38            36
 9             61                20          42            34
10             54                25          38            36
# ℹ 323 more rows

$strata_keys
NULL

attr(,"class")
[1] "postsynth"
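The printed output above suggests that a postsynth object behaves like a named list, so its elements can be pulled out with `$`. The sketch below assumes that is the case. So that it runs without tidysynthesis installed, it builds a small stand-in list (synth1_standin, a hypothetical name) holding only the two timing elements, with values copied from the rounded output above. With a real synthesis, synth1 would be used in its place.

```r
# Stand-in for synth1; timing values are copied from the printed output above.
synth1_standin <- list(
  total_synthesis_time = 0.208009,
  jth_synthesis_time = data.frame(
    j = 1:4,
    variable = c("bill_length_mm", "flipper_length_mm",
                 "body_mass_g", "bill_depth_mm"),
    synthesis_time = c(0.0496, 0.0239, 0.0243, 0.0277),
    stringsAsFactors = FALSE
  )
)

# Elements are accessed by name with `$`, as with any list.
times <- synth1_standin$jth_synthesis_time

# Variable that took the longest to synthesize.
slowest <- times$variable[which.max(times$synthesis_time)]

# Time not attributed to any single variable.
overhead <- synth1_standin$total_synthesis_time - sum(times$synthesis_time)

slowest
overhead
```

In this run the per-variable times do not add up to the total, so some of the total synthesis time is spent outside the per-variable steps.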