9  replicates

The replicates object handles the creation of multiple synthetic data replicates from a single presynth object, i.e., a single roadmap and synth_spec combination. The replicates() function creates a replicates S3 object whose parameters describe how many replicates to produce at each stage of the synthesis.

9.1 replicates via Tidy API

The replicates() function has three positive integer-valued parameters, all of which default to one:

  • start_data_replicates: Number of starting data replicates to produce (i.e., how many times start_method is executed on the input start_data).
  • model_sample_replicates: Number of modeling-sampling conditional syntheses to produce (i.e., how many times synthesize() is executed and applied to each starting data record).
  • end_to_end_replicates: Number of replicates for the entire process described above.

library(tidyverse)
library(tidymodels)
library(tidysynthesis)
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw, 
  start_data = acs_start_nw
)

my_reps <- replicates(
  start_data_replicates = 2,
  model_sample_replicates = 3,
  end_to_end_replicates = 4
)

You can associate a replicates object with a roadmap in three ways. First, you can pass the replicates object directly to the roadmap() constructor:

# Option 1: use the `roadmap` constructor
acs_roadmap_w_reps <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw,
  replicates = my_reps
)

Alternatively, you can use the API calls add_replicates() or update_replicates() to associate replicates with the roadmap.

# Option 2: use `add_replicates()`
acs_roadmap_w_reps <- acs_roadmap %>%
  add_replicates(my_reps)

# Option 3: pass arguments directly to `update_replicates()`
acs_roadmap_w_reps  <- acs_roadmap %>%
  update_replicates(
    start_data_replicates = 2,
    model_sample_replicates = 3,
    end_to_end_replicates = 4
  )

9.2 replicates Behavior and Output Types

Depending on which of the three replicates parameters are greater than one, synthesize() can produce different outputs.

  • If start_data_replicates > 1, each resulting synthetic data set will be built from the concatenation of all the start_data replicates created. For example, if start_data has 30 rows and start_method produces a dataset with 20 rows, each synthetic data set will have 20 * start_data_replicates rows.
  • If only one of model_sample_replicates > 1 OR end_to_end_replicates > 1, the result will be a list of postsynth objects, whose length is the greater of the two specifications.
  • If model_sample_replicates > 1 AND end_to_end_replicates > 1, the result will be a nested list of lists of postsynth objects. The outer list will have length end_to_end_replicates and the inner list will have length model_sample_replicates.
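The three rules above can be restated as a small base R sketch. The helper expected_replicate_shape() is hypothetical and not part of tidysynthesis; it only encodes the rules as written here, with n_start_rows standing for the number of rows that start_method produces.

```r
# Hypothetical helper (not part of tidysynthesis): predicts the shape of the
# synthesize() output from the three replicate counts, per the rules above.
expected_replicate_shape <- function(n_start_rows,
                                     start_data_replicates = 1,
                                     model_sample_replicates = 1,
                                     end_to_end_replicates = 1) {
  # Start data replicates are concatenated, so rows scale linearly.
  rows_per_synthesis <- n_start_rows * start_data_replicates

  # The other two parameters determine the container type.
  output_type <- if (model_sample_replicates > 1 && end_to_end_replicates > 1) {
    "nested list of postsynth objects"
  } else if (model_sample_replicates > 1 || end_to_end_replicates > 1) {
    "list of postsynth objects"
  } else {
    "single postsynth object"
  }

  list(
    output_type = output_type,
    n_postsynth = model_sample_replicates * end_to_end_replicates,
    rows_per_synthesis = rows_per_synthesis
  )
}

shape <- expected_replicate_shape(
  n_start_rows = 20,
  start_data_replicates = 2,
  model_sample_replicates = 3,
  end_to_end_replicates = 4
)
str(shape)
```

Working out the expected shape before running synthesize() can help catch a misconfigured replicates object early, since the total number of syntheses grows multiplicatively.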

Now, let us look at a few different examples. To start, let’s recreate basic roadmap and synth_spec objects.

rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

acs_synth_spec <- synth_spec(
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)

First, we create a roadmap with a non-trivial start_method and start_data_replicates:

acs_roadmap_rep1 <- acs_roadmap %>%
  update_start_method(
    start_func = start_resample,
    n = 5
  ) %>%
  update_replicates(
    start_data_replicates = 2
  )

presynth1 <- presynth(
  roadmap = acs_roadmap_rep1, 
  synth_spec = acs_synth_spec
) 
result1 <- synthesize(presynth = presynth1)

result1
Postsynth 
Synthetic Data: 10 synthetic observations, 12 variables 
Total Synthesis Time: 0.263174772262573 seconds

As expected, this set of code produces a single postsynth with 5 x 2 = 10 rows.

Next, let’s update model_sample_replicates:

acs_roadmap_rep2 <- acs_roadmap %>%
  update_replicates(
    model_sample_replicates = 3
  )

presynth2 <- presynth(
  roadmap = acs_roadmap_rep2, 
  synth_spec = acs_synth_spec
) 
result2 <- synthesize(presynth = presynth2)

result2
[[1]]
Postsynth 
Synthetic Data: 500 synthetic observations, 12 variables 
Total Synthesis Time: 0.898403167724609 seconds
[[2]]
Postsynth 
Synthetic Data: 500 synthetic observations, 12 variables 
Total Synthesis Time: 0.898866176605225 seconds
[[3]]
Postsynth 
Synthetic Data: 500 synthetic observations, 12 variables 
Total Synthesis Time: 0.899195194244385 seconds

As expected, this updated code produces a list of three postsynth objects, each with nrow(start_data) = 500 rows in its synthetic data.
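A flat list of replicates is often easier to analyze once stacked into one data frame with a replicate identifier. The sketch below uses mock objects rather than real postsynth objects so that it runs without tidysynthesis; it assumes each postsynth stores its data in a synthetic_data element, and the age column is made up for illustration.

```r
# Mock stand-ins for postsynth objects (assumption: the synthetic data lives
# in a `synthetic_data` element). The `age` column is purely illustrative.
set.seed(1)
mock_result <- lapply(1:3, function(i) {
  list(synthetic_data = data.frame(age = sample(18:90, 5, replace = TRUE)))
})

# Stack the replicates, recording which replicate each row came from.
stacked <- do.call(
  rbind,
  lapply(seq_along(mock_result), function(i) {
    cbind(replicate = i, mock_result[[i]]$synthetic_data)
  })
)

# One summary per replicate, e.g., to inspect between-replicate variability.
means_by_replicate <- tapply(stacked$age, stacked$replicate, mean)
```

With a real result, mock_result would be replaced by the list returned from synthesize().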

Finally, let’s combine all three parameters:

acs_roadmap_rep3 <- acs_roadmap %>%
  update_replicates(
    start_data_replicates = 2,
    model_sample_replicates = 3,
    end_to_end_replicates = 4
  )


presynth3 <- presynth(
  roadmap = acs_roadmap_rep3, 
  synth_spec = acs_synth_spec
) 
result3 <- synthesize(presynth = presynth3)

result3
[[1]]
[[1]][[1]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.06586599349976 seconds
[[1]][[2]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.06631302833557 seconds
[[1]][[3]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.06663393974304 seconds

[[2]]
[[2]][[1]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.04484796524048 seconds
[[2]][[2]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.04530692100525 seconds
[[2]][[3]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.04562091827393 seconds

[[3]]
[[3]][[1]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.0179169178009 seconds
[[3]][[2]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.01836585998535 seconds
[[3]][[3]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.01868891716003 seconds

[[4]]
[[4]][[1]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.01908802986145 seconds
[[4]][[2]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.01951289176941 seconds
[[4]][[3]]
Postsynth 
Synthetic Data: 1000 synthetic observations, 12 variables 
Total Synthesis Time: 1.01982998847961 seconds

We now see that this set of code produces a nested list of postsynth objects, each of which has nrow(start_data) * 2 = 1000 synthetic records. The outer list has four entries (one per end_to_end_replicates) and the inner list has three entries (one per model_sample_replicates).
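To make the nesting concrete, here is a minimal base R sketch of indexing into and flattening such a result. It uses a mock nested list in place of real postsynth objects; the synthetic_data element and the e2e and ms columns are assumptions for illustration only.

```r
# Mock nested result: 4 end-to-end replicates, each holding
# 3 model-sample replicates (stand-ins for postsynth objects).
mock_nested <- lapply(1:4, function(e) {
  lapply(1:3, function(m) {
    list(synthetic_data = data.frame(id = seq_len(2), e2e = e, ms = m))
  })
})

# Outer index = end-to-end replicate, inner index = model-sample replicate.
one_replicate <- mock_nested[[2]][[3]]

# Flatten one level to obtain a plain list of 4 * 3 = 12 objects,
# ordered by end-to-end replicate, then by model-sample replicate.
flat <- unlist(mock_nested, recursive = FALSE)
length(flat)
```

Flattening with recursive = FALSE removes only the outer layer, so each element of flat remains a complete object rather than being broken into its components.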