9  replicates

The replicates object handles the creation of multiple synthetic data replicates from a single presynth object, i.e., a single roadmap and synth_spec combination. The replicates() function creates a replicates S3 object whose parameters describe how many replicates to produce at each stage of the synthesis.

9.1 replicates via Tidy API

The replicates() function has three positive integer-valued parameters, all of which default to one:

  • start_data_replicates: Number of starting data replicates to produce (i.e., how many times start_method is executed on the input start_data).
  • model_sample_replicates: Number of modeling-sampling conditional syntheses to produce (i.e., how many times synthesize() is executed and applied to each starting data record).
  • end_to_end_replicates: Number of replicates for the entire process described above.

library(tidyverse)
library(tidymodels)
library(tidysynthesis)
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw, 
  start_data = acs_start_nw
)

my_reps <- replicates(
  start_data_replicates = 2,
  model_sample_replicates = 3,
  end_to_end_replicates = 4
)

There are two ways to associate a replicates object with a roadmap. The first is to pass replicates directly to the roadmap constructor:

# Option 1: use the `roadmap` constructor
acs_roadmap_w_reps <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw,
  replicates = my_reps
)

Alternatively, you can use the API calls add_replicates() or update_replicates() to associate replicates with an existing roadmap.

# Option 2: use `add_replicates()`
acs_roadmap_w_reps <- acs_roadmap %>%
  add_replicates(my_reps)

# Option 3: pass arguments directly to `update_replicates()`
acs_roadmap_w_reps  <- acs_roadmap %>%
  update_replicates(
    start_data_replicates = 2,
    model_sample_replicates = 3,
    end_to_end_replicates = 4
  )

9.2 replicates Behavior and Output Types

Depending on which of these values are greater than one, synthesize() can produce different outputs.

  • If start_data_replicates > 1, each resulting synthetic data set will be the concatenation of all starting data replicates created. For example, if start_data has 30 rows and start_method produces a dataset with 20 rows, each synthetic data set will have 20 * start_data_replicates rows.
  • If exactly one of model_sample_replicates or end_to_end_replicates is greater than one, the result will be a list of postsynth objects whose length is the greater of the two values.
  • If model_sample_replicates > 1 AND end_to_end_replicates > 1, the result will be a nested list of lists of postsynth objects. The outer list will have length end_to_end_replicates and the inner list will have length model_sample_replicates.
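
The three rules above can be summarized in a few lines of base R. The following is only an illustrative sketch: describe_output_shape() is a hypothetical helper, not part of tidysynthesis, and it simply encodes the rules as stated in this section. Its start_rows argument stands for the number of rows produced by start_method.

```r
# Hypothetical helper (not part of tidysynthesis): encodes the rules above.
describe_output_shape <- function(start_rows,
                                  start_data_replicates = 1L,
                                  model_sample_replicates = 1L,
                                  end_to_end_replicates = 1L) {
  # Rows per synthetic data set: the starting data are concatenated
  # once per starting data replicate.
  rows <- start_rows * start_data_replicates

  if (model_sample_replicates > 1 && end_to_end_replicates > 1) {
    # Both greater than one: nested list (outer = end-to-end, inner = model-sample)
    structure <- "nested list"
    dims <- c(outer = end_to_end_replicates, inner = model_sample_replicates)
  } else if (model_sample_replicates > 1 || end_to_end_replicates > 1) {
    # Exactly one greater than one: flat list of that length
    structure <- "list"
    dims <- c(outer = max(model_sample_replicates, end_to_end_replicates))
  } else {
    # Neither greater than one: a single postsynth object
    structure <- "single postsynth"
    dims <- c(outer = 1)
  }

  list(structure = structure, dims = dims, rows_per_dataset = rows)
}

# Settings matching the final example in this section
shape <- describe_output_shape(
  start_rows = 1500,
  start_data_replicates = 2,
  model_sample_replicates = 3,
  end_to_end_replicates = 4
)
```

Here shape describes a nested list with four outer and three inner entries, and 1500 * 2 = 3000 rows per synthetic data set.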

Now, let us look at a few different examples. To start, let’s recreate basic roadmap and synth_spec objects.

rpart_mod <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "regression")

rpart_class <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

acs_synth_spec <- synth_spec(
  default_regression_model = rpart_mod,
  default_classification_model = rpart_class,
  default_regression_sampler = sample_rpart,
  default_classification_sampler = sample_rpart
)

First, we create a roadmap with a non-trivial start_method and start_data_replicates:

acs_roadmap_rep1 <- acs_roadmap %>%
  update_start_method(
    start_func = start_resample,
    n = 5
  ) %>%
  update_replicates(
    start_data_replicates = 2
  )

presynth1 <- presynth(
  roadmap = acs_roadmap_rep1, 
  synth_spec = acs_synth_spec
) 
result1 <- synthesize(presynth = presynth1)

result1
Postsynth 
Synthetic Data: 10 synthetic observations, 12 variables 
Total Synthesis Time: 0.201096057891846 seconds

As expected, this set of code produces a single postsynth with 5 x 2 = 10 rows.

Next, let’s update model_sample_replicates:

acs_roadmap_rep2 <- acs_roadmap %>%
  update_replicates(
    model_sample_replicates = 3
  )

presynth2 <- presynth(
  roadmap = acs_roadmap_rep2, 
  synth_spec = acs_synth_spec
) 
result2 <- synthesize(presynth = presynth2)

result2
[[1]]
Postsynth 
Synthetic Data: 1500 synthetic observations, 12 variables 
Total Synthesis Time: 1.13812112808228 seconds
[[2]]
Postsynth 
Synthetic Data: 1500 synthetic observations, 12 variables 
Total Synthesis Time: 1.13845705986023 seconds
[[3]]
Postsynth 
Synthetic Data: 1500 synthetic observations, 12 variables 
Total Synthesis Time: 1.13871693611145 seconds

As expected, this updated code produces a list of three postsynth objects, each containing nrow(start_data) = 1500 synthetic records.

Finally, let’s combine all three parameters:

acs_roadmap_rep3 <- acs_roadmap %>%
  update_replicates(
    start_data_replicates = 2,
    model_sample_replicates = 3,
    end_to_end_replicates = 4
  )


presynth3 <- presynth(
  roadmap = acs_roadmap_rep3, 
  synth_spec = acs_synth_spec
) 
result3 <- synthesize(presynth = presynth3)

result3
[[1]]
[[1]][[1]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.71591281890869 seconds
[[1]][[2]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.71624398231506 seconds
[[1]][[3]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.71650385856628 seconds

[[2]]
[[2]][[1]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.71451902389526 seconds
[[2]][[2]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.71486115455627 seconds
[[2]][[3]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.71512603759766 seconds

[[3]]
[[3]][[1]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.79333710670471 seconds
[[3]][[2]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.79366898536682 seconds
[[3]][[3]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.79392504692078 seconds

[[4]]
[[4]][[1]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.67831110954285 seconds
[[4]][[2]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.67863988876343 seconds
[[4]][[3]]
Postsynth 
Synthetic Data: 3000 synthetic observations, 12 variables 
Total Synthesis Time: 1.67889189720154 seconds

We now see that this set of code produces a nested list of postsynth objects, each of which has nrow(start_data) * 2 = 3000 synthetic records. The outer list has four entries (one per end_to_end_replicates) and the inner list has three entries (one per model_sample_replicates).
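
When working with a nested result like this, it is often convenient to index a specific replicate or to flatten the structure into a single list. The sketch below uses plain lists as stand-ins for postsynth objects so that it runs without the package. The make_mock_postsynth() helper and the synthetic_data field name are assumptions made for this mock-up and should be checked against a real postsynth object.

```r
# Stand-in for a postsynth object: a plain list holding a data frame.
# The `synthetic_data` field name is an assumption made for this mock-up.
make_mock_postsynth <- function(n_rows) {
  list(synthetic_data = data.frame(x = seq_len(n_rows)))
}

# Mimic the nested shape of `result3`:
# outer list = end_to_end_replicates (4), inner list = model_sample_replicates (3)
mock_result <- lapply(seq_len(4), function(e2e) {
  lapply(seq_len(3), function(ms) make_mock_postsynth(n_rows = 6))
})

# Index one replicate: second end-to-end run, third model-sample replicate
one_rep <- mock_result[[2]][[3]]

# Flatten one level so all 4 * 3 = 12 replicates sit in a single list
flat <- unlist(mock_result, recursive = FALSE)

# Row count of every replicate's synthetic data
row_counts <- vapply(flat, function(ps) nrow(ps$synthetic_data), integer(1))
```

Using unlist() with recursive = FALSE removes only the outer level of nesting, so each element of flat is still a complete stand-in object rather than its individual columns.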