9 replicates
The replicates object handles the creation of multiple synthetic data replicates from a single presynth object, i.e., a single roadmap and synth_spec combination. The replicates() function creates a replicates S3 object whose parameters describe the replication pattern.
9.1 replicates via Tidy API
The replicates() function has three positive integer-valued parameters, all of which default to one:
- start_data_replicates: Number of starting data replicates to produce (i.e., how many times start_method is executed on the input start_data).
- model_sample_replicates: Number of modeling-sampling conditional syntheses to produce (i.e., how many times synthesize() is executed and applied to each starting data record).
- end_to_end_replicates: Number of replicates for the entire process described above.
There are two ways to associate a replicates object with a roadmap. The first is to pass replicates directly to the roadmap constructor.
Alternatively, you can use the API calls add_replicates() or update_replicates() to associate replicates with the roadmap.
9.2 replicates Behavior and Output Types
Depending on which of these values are greater than one, synthesize() can produce different outputs.
- If start_data_replicates > 1, each resulting synthetic data set will be a concatenation of each start_data created. For example, if start_data has 30 rows and start_method produces a dataset with 20 rows, each synthetic data set will have 20 * start_data_replicates rows.
- If only one of model_sample_replicates > 1 OR end_to_end_replicates > 1 holds, the result will be a list of postsynth objects whose length is the greater of the two specifications.
- If model_sample_replicates > 1 AND end_to_end_replicates > 1, the result will be a nested list of lists of postsynth objects. The outer list will have length end_to_end_replicates and the inner list will have length model_sample_replicates.
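The rules above can be sketched in base R. This is only an illustration of the output shape, not the package's implementation: output_shape() is a hypothetical helper, and the string "postsynth" is a placeholder standing in for a real postsynth object.

```r
# Hypothetical helper (not part of the package): mimics only the *shape*
# of the result described above, using a placeholder string.
output_shape <- function(model_sample_replicates = 1,
                         end_to_end_replicates = 1) {
  placeholder <- "postsynth"

  # Both greater than one: nested list, outer = end-to-end, inner = model-sample
  if (model_sample_replicates > 1 && end_to_end_replicates > 1) {
    return(replicate(
      end_to_end_replicates,
      replicate(model_sample_replicates, placeholder, simplify = FALSE),
      simplify = FALSE
    ))
  }

  # Only one greater than one: flat list whose length is the larger value
  n <- max(model_sample_replicates, end_to_end_replicates)
  if (n > 1) {
    return(replicate(n, placeholder, simplify = FALSE))
  }

  # Neither greater than one: a single object
  placeholder
}

flat_shape <- output_shape(model_sample_replicates = 3)
nested_shape <- output_shape(model_sample_replicates = 3,
                             end_to_end_replicates = 4)

length(flat_shape)     # length of the flat list
length(nested_shape)   # length of the outer list
lengths(nested_shape)  # length of each inner list
```

Note that start_data_replicates does not appear in the sketch: per the first rule, it changes the number of rows in each synthetic data set rather than the type of object returned.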
Now, let us look at a few different examples. To start, let’s recreate basic roadmap and synth_spec objects.
rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

acs_synth_spec <- synth_spec(
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)

First, we create a roadmap with a non-trivial start_method and start_data_replicates:
Postsynth
Synthetic Data: 10 synthetic observations, 12 variables
Total Synthesis Time: 0.263174772262573 seconds
As expected, this set of code produces a single postsynth with 5 x 2 = 10 rows.
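The 5 x 2 arithmetic can be reproduced with a small base R sketch. Everything here is an assumption made for illustration: toy_start_data is made-up data, and toy_start_method() is a hypothetical stand-in for a start_method that keeps five random rows. The sketch is not the package's code path.

```r
set.seed(1)

# Made-up starting data with 30 rows (illustrative only)
toy_start_data <- data.frame(id = 1:30, x = rnorm(30))

# Hypothetical stand-in for a start_method that keeps 5 random rows
toy_start_method <- function(data) {
  data[sample(nrow(data), 5), , drop = FALSE]
}

start_data_replicates <- 2

# Run the start method once per replicate, then stack the results by row
stacked <- do.call(
  rbind,
  lapply(seq_len(start_data_replicates),
         function(i) toy_start_method(toy_start_data))
)

nrow(stacked)  # 5 rows per replicate x 2 replicates = 10
```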
Next, let’s update model_sample_replicates:
[[1]]
Postsynth
Synthetic Data: 500 synthetic observations, 12 variables
Total Synthesis Time: 0.898403167724609 seconds
[[2]]
Postsynth
Synthetic Data: 500 synthetic observations, 12 variables
Total Synthesis Time: 0.898866176605225 seconds
[[3]]
Postsynth
Synthetic Data: 500 synthetic observations, 12 variables
Total Synthesis Time: 0.899195194244385 seconds
As expected, this updated code produces a list of three postsynth objects, each with nrow(start_data) = 500 rows in its synthetic data.
Finally, let’s combine all three parameters:
[[1]]
[[1]][[1]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.06586599349976 seconds
[[1]][[2]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.06631302833557 seconds
[[1]][[3]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.06663393974304 seconds
[[2]]
[[2]][[1]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.04484796524048 seconds
[[2]][[2]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.04530692100525 seconds
[[2]][[3]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.04562091827393 seconds
[[3]]
[[3]][[1]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.0179169178009 seconds
[[3]][[2]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.01836585998535 seconds
[[3]][[3]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.01868891716003 seconds
[[4]]
[[4]][[1]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.01908802986145 seconds
[[4]][[2]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.01951289176941 seconds
[[4]][[3]]
Postsynth
Synthetic Data: 1000 synthetic observations, 12 variables
Total Synthesis Time: 1.01982998847961 seconds
We now see that this set of code produces a nested list of postsynth objects, each of which has nrow(start_data) * 2 = 1000 synthetic records. The outer list has four entries (one per end_to_end_replicates) and the inner list has three entries (one per model_sample_replicates).
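A nested result like this can be indexed or flattened with ordinary list tools. The sketch below uses placeholder lists in place of real postsynth objects, so the element names end_to_end and model_sample are assumptions made purely for illustration.

```r
# Placeholder nested list with the same shape as the output above:
# 4 end-to-end replicates, each holding 3 model-sample replicates
nested <- lapply(1:4, function(e) {
  lapply(1:3, function(m) list(end_to_end = e, model_sample = m))
})

# Select one replicate: second end-to-end run, third model-sample run
one <- nested[[2]][[3]]

# Flatten one level to get a single list of all 4 x 3 replicates
flat <- unlist(nested, recursive = FALSE)

length(flat)  # 12
```

Flattening keeps the outer-then-inner order, so the first three elements of flat come from the first end-to-end replicate.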