9 replicates
The replicates object handles the creation of multiple synthetic data replicates from a single presynth object, i.e., a single roadmap and synth_spec combination. The replicates() function creates a replicates S3 object with parameters describing the replication patterns.
9.1 replicates via Tidy API
The replicates() function has three positive integer-valued parameters, all of which default to one:
- start_data_replicates: Number of starting data replicates to produce (i.e., how many times is start_method executed on the input start_data?)
- model_sample_replicates: Number of modeling-sampling conditional syntheses to produce (i.e., how many times is synthesize() executed and applied to each starting data record?)
- end_to_end_replicates: Number of replicates for the entire process described above.
There are two ways to associate a replicates object with a roadmap. The first is to pass replicates directly into the roadmap constructor.
Alternatively, you can use the API calls add_replicates() or update_replicates() to associate replicates with the roadmap.
9.2 replicates Behavior and Output Types
Depending on which of these values are greater than one, synthesize() can produce different outputs.
- If start_data_replicates > 1, each resulting synthetic data set will be a concatenation of each start_data created. For example, if start_data has 30 rows and start_method produces a dataset with 20 rows, each synthetic data set will have 20 * start_data_replicates rows.
- If only one of model_sample_replicates > 1 OR end_to_end_replicates > 1 holds, the result will be a list of postsynth objects, whose length is the greater of the two specifications.
- If model_sample_replicates > 1 AND end_to_end_replicates > 1, the result will be a nested list of lists of postsynth objects. The outer list will have length end_to_end_replicates and the inner list will have length model_sample_replicates.
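The three rules above can be summarized in a small base R sketch. The helper describe_output_shape() is hypothetical and is not part of the package; it only restates the rules in this section, assuming that a specification with every value equal to one yields a single postsynth object.

```r
# Hypothetical helper (not a package function): restates the output rules
# above for a given replicates specification.
describe_output_shape <- function(start_method_rows,
                                  start_data_replicates = 1L,
                                  model_sample_replicates = 1L,
                                  end_to_end_replicates = 1L) {
  # Each synthetic data set stacks the start_method output once per
  # starting data replicate.
  rows <- start_method_rows * start_data_replicates

  if (model_sample_replicates > 1 && end_to_end_replicates > 1) {
    # Both greater than one: nested list, outer = end-to-end, inner = model-sample.
    container <- "nested list"
    dims <- c(outer = end_to_end_replicates, inner = model_sample_replicates)
  } else if (model_sample_replicates > 1 || end_to_end_replicates > 1) {
    # Exactly one greater than one: flat list whose length is the larger value.
    container <- "list"
    dims <- c(outer = max(model_sample_replicates, end_to_end_replicates))
  } else {
    # Assumption: all defaults give one postsynth object.
    container <- "single postsynth"
    dims <- c(outer = 1)
  }

  list(container = container, dims = dims, rows_per_dataset = rows)
}

# The 20-row start_method example from the list above, with three replicates.
describe_output_shape(20, start_data_replicates = 3)

# Model-sample replicates only: a flat list of length 3.
describe_output_shape(1500, model_sample_replicates = 3)

# All three parameters set: a 4-by-3 nested list.
describe_output_shape(1500, 2, 3, 4)
```

Working out the expected shape before running a long synthesis can make it easier to write downstream code that indexes the result correctly.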
Now, let us look at a few different examples. To start, let’s recreate basic roadmap and synth_spec objects.
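# Decision tree model specifications for numeric and categorical variables
# (decision_tree(), set_engine(), and set_mode() come from parsnip)
rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

# Use the rpart models and samplers as defaults for every variable
acs_synth_spec <- synth_spec(
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)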
First, we create a roadmap with a non-trivial start_method and start_data_replicates:
# Resample 5 starting records, and repeat that starting step twice
acs_roadmap_rep1 <- acs_roadmap %>%
  update_start_method(
    start_func = start_resample,
    n = 5
  ) %>%
  update_replicates(
    start_data_replicates = 2
  )

presynth1 <- presynth(
  roadmap = acs_roadmap_rep1,
  synth_spec = acs_synth_spec
)

result1 <- synthesize(presynth = presynth1)
result1
Postsynth
Synthetic Data: 10 synthetic observations, 12 variables
Total Synthesis Time: 0.201096057891846 seconds
As expected, this set of code produces a single postsynth with 5 x 2 = 10 rows.
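The 5 x 2 arithmetic can be reproduced with a toy base R sketch of the concatenation step. Everything here is a stand-in: toy_start is made-up data, and toy_resample() is a hypothetical function that assumes a resampling start method draws n rows with replacement. It is not the package's implementation.

```r
set.seed(1)

# Made-up starting data with 30 rows.
toy_start <- data.frame(
  id = 1:30,
  group = rep(c("a", "b", "c"), each = 10)
)

# Hypothetical stand-in for a resampling start method:
# draw n rows with replacement and reset the row names.
toy_resample <- function(data, n) {
  out <- data[sample.int(nrow(data), size = n, replace = TRUE), , drop = FALSE]
  rownames(out) <- NULL
  out
}

start_data_replicates <- 2

# Run the start method once per replicate, then stack the pieces.
pieces <- lapply(
  seq_len(start_data_replicates),
  function(i) toy_resample(toy_start, n = 5)
)
stacked <- do.call(rbind, pieces)

nrow(stacked)  # 5 rows per replicate * 2 replicates = 10
```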
Next, let’s update model_sample_replicates:
# Keep the default starting data, but run the model-sample step three times
acs_roadmap_rep2 <- acs_roadmap %>%
  update_replicates(
    model_sample_replicates = 3
  )

presynth2 <- presynth(
  roadmap = acs_roadmap_rep2,
  synth_spec = acs_synth_spec
)

result2 <- synthesize(presynth = presynth2)
result2
[[1]]
Postsynth
Synthetic Data: 1500 synthetic observations, 12 variables
Total Synthesis Time: 1.13812112808228 seconds
[[2]]
Postsynth
Synthetic Data: 1500 synthetic observations, 12 variables
Total Synthesis Time: 1.13845705986023 seconds
[[3]]
Postsynth
Synthetic Data: 1500 synthetic observations, 12 variables
Total Synthesis Time: 1.13871693611145 seconds
As expected, this updated code produces a list of three postsynth objects, each with nrow(start_data) = 1500 rows in its synthetic data.
Finally, let’s combine all three parameters:
# Set all three replicate parameters at once
acs_roadmap_rep3 <- acs_roadmap %>%
  update_replicates(
    start_data_replicates = 2,
    model_sample_replicates = 3,
    end_to_end_replicates = 4
  )

presynth3 <- presynth(
  roadmap = acs_roadmap_rep3,
  synth_spec = acs_synth_spec
)

result3 <- synthesize(presynth = presynth3)
result3
[[1]]
[[1]][[1]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.71591281890869 seconds
[[1]][[2]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.71624398231506 seconds
[[1]][[3]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.71650385856628 seconds
[[2]]
[[2]][[1]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.71451902389526 seconds
[[2]][[2]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.71486115455627 seconds
[[2]][[3]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.71512603759766 seconds
[[3]]
[[3]][[1]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.79333710670471 seconds
[[3]][[2]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.79366898536682 seconds
[[3]][[3]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.79392504692078 seconds
[[4]]
[[4]][[1]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.67831110954285 seconds
[[4]][[2]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.67863988876343 seconds
[[4]][[3]]
Postsynth
Synthetic Data: 3000 synthetic observations, 12 variables
Total Synthesis Time: 1.67889189720154 seconds
We now see that this set of code produces a nested list of postsynth objects, each of which has nrow(start_data) * 2 = 3000 synthetic records. The outer list has four entries (one per end-to-end replicate, set by end_to_end_replicates) and the inner lists each have three entries (one per model-sample replicate, set by model_sample_replicates).
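A nested result like this is indexed first by end-to-end replicate and then by model-sample replicate. The sketch below uses a mock object, mock_result, built from plain character labels instead of real postsynth objects, so it only illustrates ordinary base R list handling and assumes nothing about the postsynth class itself.

```r
# Mock 4-by-3 nested list with the same layout as result3.
# Each element is a label, not a real postsynth object.
mock_result <- lapply(1:4, function(e) {
  lapply(1:3, function(m) sprintf("end-to-end %d, model-sample %d", e, m))
})

length(mock_result)   # outer length: 4
lengths(mock_result)  # inner lengths: 3 3 3 3

# Third model-sample replicate from the second end-to-end replicate.
mock_result[[2]][[3]]

# Flatten one level to get a single list of all 12 replicates,
# ordered by end-to-end replicate and then model-sample replicate.
flat <- unlist(mock_result, recursive = FALSE)
length(flat)  # 12
```

Flattening with recursive = FALSE removes only the outer layer of nesting, which is convenient when the same summary needs to be applied to every replicate with lapply().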