3  Roadmap

3.1 Start Data

The first decision we need to make is which variables tidysynthesis will synthesize. To do this, we create a data frame of starting data. The variables included in the start data can come from several approaches:

  1. Unaltered variables from the confidential data. This creates partially synthetic data.
  2. Variables altered using a different tool or method.
  3. A bootstrap sample of a starting variable or variables.

The start data determines the number of observations in the synthetic data. Synthetic data can have more or fewer observations than the confidential data.

tidysynthesis sequentially synthesizes variables from the confidential data that are not included in the start data. The start data also determines the predictors available during the synthesis process. The more variables in the start data, the more predictors will be available when the sequential synthesis begins.

starting_data <- penguins_complete %>% 
  select(species, island, sex) %>%
  slice_sample(n = nrow(penguins_complete), replace = TRUE)
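
The approaches above can be sketched on a small example. This is only an illustration: conf_toy is a hypothetical stand-in for confidential data (it is not the penguins data), and the sketch uses only dplyr. It builds one start data frame from unaltered confidential variables and one from a bootstrap sample that has more rows than the confidential data.

```r
library(dplyr)

# Hypothetical stand-in for the confidential data (not the penguins data).
conf_toy <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie")),
  sex = factor(c("female", "male", "male", "female")),
  body_mass_g = c(3750, 5000, 3700, 3450)
)

# Approach 1: unaltered confidential variables (partially synthetic data).
start_unaltered <- conf_toy %>%
  select(species, sex)

# Approach 3: a bootstrap sample that is larger than the confidential data.
set.seed(1)
start_bootstrap <- conf_toy %>%
  select(species, sex) %>%
  slice_sample(n = 10, replace = TRUE)

# The number of rows in the start data is the number of rows to synthesize.
nrow(start_unaltered)
nrow(start_bootstrap)
```

In both cases body_mass_g is left out of the start data, so it is the variable that would be synthesized.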

3.2 Schema

Next, we need to set up a schema object, which handles data type information about each column in the confidential data. When this information isn’t explicitly supplied to schema, data types are inferred from the confidential data.

schema <- schema(
  conf_data = penguins_complete,
  start_data = starting_data
)

The data type information is stored in the col_schema element of the schema object. Here is what was inferred from penguins_complete:

schema$col_schema
$species
$species$dtype
[1] "fct"

$species$levels
[1] "Adelie"    "Chinstrap" "Gentoo"   

$species$na_prop
[1] 0


$island
$island$dtype
[1] "fct"

$island$levels
[1] "Biscoe"    "Dream"     "Torgersen"

$island$na_prop
[1] 0


$bill_length_mm
$bill_length_mm$dtype
[1] "dbl"

$bill_length_mm$na_prop
[1] 0


$bill_depth_mm
$bill_depth_mm$dtype
[1] "dbl"

$bill_depth_mm$na_prop
[1] 0


$flipper_length_mm
$flipper_length_mm$dtype
[1] "dbl"

$flipper_length_mm$na_prop
[1] 0


$body_mass_g
$body_mass_g$dtype
[1] "dbl"

$body_mass_g$na_prop
[1] 0


$sex
$sex$dtype
[1] "fct"

$sex$levels
[1] "female" "male"  

$sex$na_prop
[1] 0
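
To make the structure of col_schema concrete, here is a rough base R sketch that produces entries of the same shape (dtype, levels for factors, and na_prop). It is not the package's implementation: infer_col_schema_sketch() and the toy data frame are hypothetical, and the real inference may handle more data types and edge cases.

```r
# Hypothetical helper, for illustration only: builds a named list with one
# entry per column, mirroring the dtype / levels / na_prop layout shown above.
infer_col_schema_sketch <- function(df) {
  lapply(df, function(col) {
    dtype <- if (is.factor(col)) {
      "fct"
    } else if (is.double(col)) {
      "dbl"
    } else {
      class(col)[1]
    }
    entry <- list(dtype = dtype)
    if (is.factor(col)) {
      entry$levels <- levels(col)
    }
    # Proportion of missing values in the column.
    entry$na_prop <- mean(is.na(col))
    entry
  })
}

toy <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Adelie")),
  bill_length_mm = c(39.1, NA, 46.5)
)

sketch <- infer_col_schema_sketch(toy)
str(sketch)
```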

If we need to, we can modify the col_schema by providing overrides to entries for specific variables. Here’s an example where we add an extra factor level for species:

schema2 <- schema(
  conf_data = penguins_complete,
  start_data = starting_data,
  col_schema = list(
    # example manual specification for col_schema
    "species" = list(
      "dtype" = "fct",
      "levels" = c("Adelie", 
                   "Chinstrap", 
                   "Gentoo", 
                   "Emperor")
    )
  )
)

schema has additional arguments for enforcing particular data types, like coercing numeric values to dbl or categorical values to fct.

3.3 Visit Sequence

The next decision we need to make is the order in which variables are synthesized. This order is called the visit sequence, and visit_sequence() creates it.

There are many different approaches to picking a visit sequence. Let’s consider a few examples.

First, let’s use subject matter expertise to manually set a visit sequence.

visit_seq_manual <- visit_sequence(
  schema = schema,
  type = "manual",
  manual_vars = c("bill_depth_mm", 
                  "bill_length_mm", 
                  "body_mass_g",
                  "flipper_length_mm")
)

visit_seq_manual
Method: manual 
Visit Sequence 
bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
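
The order matters because it controls which predictors are available at each step. The base R sketch below lists them for the manual sequence, assuming that every start data variable and every previously synthesized variable can be used as a predictor. That assumption and the predictors_by_step object are illustrative and are not part of tidysynthesis.

```r
# Variables in the start data and the manual visit sequence from above.
start_vars <- c("species", "island", "sex")
visit_order <- c("bill_depth_mm", "bill_length_mm",
                 "body_mass_g", "flipper_length_mm")

# Assumption: at step i, the predictors are the start data variables plus
# the variables synthesized in steps 1 to i - 1.
predictors_by_step <- lapply(seq_along(visit_order), function(i) {
  c(start_vars, visit_order[seq_len(i - 1)])
})
names(predictors_by_step) <- visit_order

# Print one formula-like line per synthesis step.
for (v in visit_order) {
  cat(v, "~", paste(predictors_by_step[[v]], collapse = " + "), "\n")
}
```

Variables early in the sequence are modeled with fewer predictors, which is one reason to place important variables deliberately.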

The manual approach rarely scales to data sets with many variables. Instead, let’s pick an important variable and then synthesize the remaining variables from most to least correlated with it.

visit_seq_corr <- visit_sequence(
  schema = schema,
  type = "correlation",
  cor_var = "bill_length_mm"
)

visit_seq_corr
Method: correlation 
Visit Sequence 
bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
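
The idea behind a correlation-based ordering can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the toy data frame is made up, and the sketch ranks by absolute Pearson correlation, whereas the text above does not say whether tidysynthesis uses signed or absolute correlation.

```r
# Hypothetical numeric data; "key" plays the role of cor_var.
toy <- data.frame(
  key      = c(1, 2, 3, 4, 5, 6),
  strong   = c(2, 4, 6, 8, 10, 13),
  negative = c(5, 6, 3, 4, 1, 2),
  moderate = c(3, 1, 4, 1, 5, 9)
)

cor_var <- "key"
others <- setdiff(names(toy), cor_var)

# Assumption: rank the other variables by absolute Pearson correlation
# with cor_var, from largest to smallest.
abs_cor <- sapply(others, function(v) abs(cor(toy[[cor_var]], toy[[v]])))

# Place cor_var first, followed by the ranked variables.
order_sketch <- c(cor_var, names(sort(abs_cor, decreasing = TRUE)))
order_sketch
```

Under this assumption, a strongly negatively correlated variable is visited before a moderately positively correlated one.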

Other approaches include ordering by the proportion of non-zero values, the weighted total of variables, or the weighted absolute totals of variables.

3.4 Roadmap

The confidential data, start data, schema, and visit sequence are important inputs to other tidysynthesis functions. The roadmap simplifies later code by keeping all of them in one object.

roadmap <- roadmap(
  visit_sequence = visit_seq_corr
)

roadmap
$conf_data
# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>

$start_data
# A tibble: 333 × 3
   species   island    sex   
   <fct>     <fct>     <fct> 
 1 Chinstrap Dream     female
 2 Adelie    Biscoe    female
 3 Adelie    Dream     female
 4 Gentoo    Biscoe    male  
 5 Gentoo    Biscoe    female
 6 Adelie    Torgersen female
 7 Adelie    Biscoe    female
 8 Chinstrap Dream     male  
 9 Adelie    Torgersen male  
10 Adelie    Dream     male  
# ℹ 323 more rows

$visit_sequence
Method: correlation 
Visit Sequence 
bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
$var_type
   bill_length_mm flipper_length_mm       body_mass_g     bill_depth_mm 
        "numeric"         "numeric"         "numeric"         "numeric" 

$var_characteristics
$var_characteristics$no_variation
   bill_length_mm flipper_length_mm       body_mass_g     bill_depth_mm 
            FALSE             FALSE             FALSE             FALSE 


$strata
NULL

$col_schema
$col_schema$species
$col_schema$species$dtype
[1] "fct"

$col_schema$species$levels
[1] "Adelie"    "Chinstrap" "Gentoo"   

$col_schema$species$na_prop
[1] 0


$col_schema$island
$col_schema$island$dtype
[1] "fct"

$col_schema$island$levels
[1] "Biscoe"    "Dream"     "Torgersen"

$col_schema$island$na_prop
[1] 0


$col_schema$bill_length_mm
$col_schema$bill_length_mm$dtype
[1] "dbl"

$col_schema$bill_length_mm$na_prop
[1] 0


$col_schema$bill_depth_mm
$col_schema$bill_depth_mm$dtype
[1] "dbl"

$col_schema$bill_depth_mm$na_prop
[1] 0


$col_schema$flipper_length_mm
$col_schema$flipper_length_mm$dtype
[1] "dbl"

$col_schema$flipper_length_mm$na_prop
[1] 0


$col_schema$body_mass_g
$col_schema$body_mass_g$dtype
[1] "dbl"

$col_schema$body_mass_g$na_prop
[1] 0


$col_schema$sex
$col_schema$sex$dtype
[1] "fct"

$col_schema$sex$levels
[1] "female" "male"  

$col_schema$sex$na_prop
[1] 0



attr(,"class")
[1] "roadmap"