# don't add extra noise to predictions
noise <- noise(
  roadmap = roadmap,
  add_noise = FALSE,
  exclusions = 0
)

# don't impose constraints
constraints <- constraints(
  roadmap = roadmap,
  constraints = NULL,
  max_z = 0
)

# only generate one synthetic data set
replicates <- replicates(
  replicates = 1,
  workers = 1,
  summary_function = NULL
)
5 Synthesize!
tidysynthesis is lazy. Functions like roadmap() and synthspec() create objects but don't synthesize data or perform any significant computations. The actual synthesis happens when synthesize() is run on a presynth object.
5.1 presynth()
The presynth object is a container for all objects that go into a synthesis: the roadmap, synthspec, noise, constraints, and replicates. A presynth object is created with the presynth() function.
Adding noise, imposing constraints, and generating multiple replicates are optional features of a synthesis, but the noise, constraints, and replicates objects are still required for the code to run. They must be supplied even when the features are not used, as in the code above.
presynth() requires each object as an argument and will throw an error if objects are incorrectly specified.
5.2 constraints()
Although the example above does not use any constraints, this section explains how the constraints() function works. Example 3 shows a synthesis that implements constraints.
The constraints() function takes three arguments: roadmap, constraints, and max_z. roadmap is simply the roadmap object created by roadmap(). constraints takes a data frame with four columns: var, min, max, and conditions. The var column takes a string naming the variable being synthesized. The min and max columns give the minimum and maximum numeric values the variable may take when the constraint applies. Lastly, conditions takes a string containing a conditional statement that returns a boolean. If the statement evaluates to TRUE, the constraint is enforced; otherwise it is not. In the constraints data frame shown in Example 03: Penguins (Again), the first constraint has a conditions string of TRUE, meaning it is always enforced. The second constraint is enforced only when bill_length_mm < 40.
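As a rough illustration of the layout described above, the sketch below builds a constraints data frame in base R and checks which rows would apply to a single observation. The variable name, the bounds, and the applies_to() helper are hypothetical and chosen only for illustration. Evaluating the conditions string with eval(parse()) is an assumption about how such a string could be checked, not a description of tidysynthesis internals.

```r
# Hypothetical constraints for one synthesized variable.
# Column names follow the text; the variable and bounds are made up.
constraints_df <- data.frame(
  var = c("flipper_length_mm", "flipper_length_mm"),
  min = c(0, 0),
  max = c(Inf, 200),
  conditions = c("TRUE", "bill_length_mm < 40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate each conditions string against one
# observation (a one-row data frame) and return TRUE where the
# constraint would apply to that observation.
applies_to <- function(constraints_df, observation) {
  vapply(
    constraints_df$conditions,
    function(cond) isTRUE(eval(parse(text = cond), envir = observation)),
    logical(1),
    USE.NAMES = FALSE
  )
}

short_bill <- data.frame(bill_length_mm = 35.2)
long_bill <- data.frame(bill_length_mm = 48.4)

applies_to(constraints_df, short_bill)  # both constraints apply
applies_to(constraints_df, long_bill)   # only the always-on constraint applies
```

A row whose conditions string is "TRUE" applies to every observation, while the second row applies only to observations with a bill shorter than 40 mm.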
max_z takes either 0 or a positive integer. If max_z is greater than 0, tidysynthesis resamples up to max_z times in an attempt to draw a value that satisfies the constraint. If max_z is 0, or if the constraint is still not met after max_z draws, tidysynthesis assigns the observation a value by "hard bounding," that is, forcing the synthetic value to whichever of min or max is closer.
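The resample-then-hard-bound behavior can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the tidysynthesis implementation: bounded_draw(), its arguments, and the stand-in draw_fn are all hypothetical, and lower and upper play the role of the min and max columns.

```r
# Hypothetical sketch of resample-then-hard-bound; not tidysynthesis code.
# draw_fn is any zero-argument function that returns one numeric draw.
bounded_draw <- function(initial, draw_fn, lower, upper, max_z) {
  value <- initial
  z <- 0
  # Resample up to max_z times while the value falls outside [lower, upper].
  while ((value < lower || value > upper) && z < max_z) {
    value <- draw_fn()
    z <- z + 1
  }
  # Hard bounding: if the value is still out of range, force it to the
  # nearer bound. In-range values pass through unchanged.
  min(max(value, lower), upper)
}

# With max_z = 0 nothing is resampled, so the out-of-range value is clamped.
clamped <- bounded_draw(230, function() 190, lower = 0, upper = 200, max_z = 0)

# With max_z = 3 the first redraw is in range and is kept.
redrawn <- bounded_draw(230, function() 190, lower = 0, upper = 200, max_z = 3)
```

Here clamped is 200 (the closer bound) and redrawn is 190. A larger max_z gives the model more chances to produce an in-range value before the bound is forced.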
5.3 synthesize()
The synthesize() function performs all of the specified computations to generate a synthetic data set. This includes feature and target engineering, model fitting, prediction, and adding noise.
5.4 postsynth
synthesize() returns a postsynth object.
The postsynth object includes several important objects:
synthetic_data is a data frame with synthetic data.
jth_preprocessing is the trained preprocessing model for each synthetic variable. This includes parameters estimated during training that are required for transforming variables back to their original units.
total_synthesis_time is the total synthesis time for each data set.
jth_synthesis_time is the synthesis time in seconds for each synthetic variable.
ldiversity is a privacy metric unique to tree-based methods.
strata_keys are ids for identifying strata. This is only reported when a synthesis is stratified.
$synthetic_data
# A tibble: 333 × 7
   species   island sex    bill_length_mm flipper_length_mm body_mass_g
   <fct>     <fct>  <fct>           <dbl>             <dbl>       <dbl>
 1 Chinstrap Dream  male             48.4               203        4300
 2 Gentoo    Biscoe male             51.4               224        5300
 3 Adelie    Dream  female           35.2               186        3525
 4 Chinstrap Dream  male             49.5               207        4150
 5 Chinstrap Dream  male             53.5               197        4150
 6 Gentoo    Biscoe male             50.8               229        5200
 7 Chinstrap Dream  female           45.2               195        3675
 8 Adelie    Dream  female           38.1               198        3350
 9 Chinstrap Dream  male             50.5               201        4050
10 Chinstrap Dream  female           46.9               190        3700
# ℹ 323 more rows
# ℹ 1 more variable: bill_depth_mm <dbl>
$jth_preprocessing
$jth_preprocessing$flipper_length_mm
NULL
$jth_preprocessing$bill_depth_mm
NULL
$total_synthesis_time
[1] 0.208009
$jth_synthesis_time
# A tibble: 4 × 4
      j strata   variable          synthesis_time
  <int> <chr>    <ord>                      <dbl>
1     1 Strata 1 bill_length_mm            0.0496
2     2 Strata 1 flipper_length_mm         0.0239
3     3 Strata 1 body_mass_g               0.0243
4     4 Strata 1 bill_depth_mm             0.0277
$extractions
$extractions$bill_length_mm
[1] NA
$extractions$flipper_length_mm
[1] NA
$extractions$body_mass_g
[1] NA
$extractions$bill_depth_mm
[1] NA
$ldiversity
# A tibble: 333 × 4
   bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
            <int>             <int>       <int>         <int>
 1             61                20          42            34
 2             61                19          24            25
 3             44                25          38            36
 4             61                20          42            34
 5             61                20          42            34
 6             61                19          24            25
 7             54                25          38            36
 8             44                25          38            36
 9             61                20          42            34
10             54                25          38            36
# ℹ 323 more rows
$strata_keys
NULL
attr(,"class")
[1] "postsynth"
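The printed output above suggests that a postsynth object is a list with a class attribute, so its components can be pulled out with $. The sketch below uses a small stand-in list rather than a real synthesize() result. mock_postsynth is hypothetical and carries only a few of the components; its timings and the two data rows are copied from the output above.

```r
# Stand-in for a postsynth object: a plain list that reuses some of the
# element names printed above. It is NOT produced by synthesize().
mock_postsynth <- structure(
  list(
    synthetic_data = data.frame(
      species = c("Chinstrap", "Gentoo"),
      bill_length_mm = c(48.4, 51.4)
    ),
    total_synthesis_time = 0.208009,
    jth_synthesis_time = data.frame(
      j = 1:4,
      variable = c("bill_length_mm", "flipper_length_mm",
                   "body_mass_g", "bill_depth_mm"),
      synthesis_time = c(0.0496, 0.0239, 0.0243, 0.0277)
    ),
    strata_keys = NULL
  ),
  class = "postsynth"
)

# Pull out the synthetic data for downstream analysis.
synthetic_data <- mock_postsynth$synthetic_data

# Find the variable that took longest to synthesize.
times <- mock_postsynth$jth_synthesis_time
slowest <- times$variable[which.max(times$synthesis_time)]
```

With the timings shown above, slowest is "bill_length_mm". The same $ access pattern should apply to a real postsynth object, assuming it keeps the list structure shown in the printed output.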