3 Roadmap
3.1 Start Data
The first decision we need to make for tidysynthesis is which variables to synthesize. To do this, we create a data frame of starting data. The variables in the start data can come from a few different approaches:
- Unaltered variables from the confidential data. This creates partially synthetic data.
- Variables altered using a different tool or method.
- A bootstrap sample of a starting variable or variables.
The start data determines the number of observations in the synthetic data. Synthetic data can have more or fewer observations than the confidential data.
tidysynthesis sequentially synthesizes variables from the confidential data that are not included in the start data. The start data also determines the predictors available during the synthesis process. The more variables in the start data, the more predictors will be available when the sequential synthesis begins.
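As a rough illustration of the bootstrap option, the sketch below builds start data in base R from a small made-up data frame. The names conf_data, start_vars, and n_synth are placeholders chosen for this example and are not part of tidysynthesis.

```r
# Toy stand-in for the confidential data (values are made up for illustration).
conf_data <- data.frame(
  species = factor(c("Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Gentoo")),
  sex = factor(c("female", "male", "female", "male", "female", "male")),
  body_mass_g = c(3750, 3800, 3500, 5000, 4700, 5400)
)

set.seed(20240101)

# Hypothetical choices: which variables to start from and how many synthetic rows we want.
start_vars <- c("species", "sex")
n_synth <- 10

# Bootstrap: sample row indices with replacement, keeping only the start variables.
rows <- sample(nrow(conf_data), size = n_synth, replace = TRUE)
start_data <- conf_data[rows, start_vars, drop = FALSE]
rownames(start_data) <- NULL

# Everything not in the start data is left for the sequential synthesis.
vars_to_synthesize <- setdiff(names(conf_data), names(start_data))
```

Here n_synth is larger than the six confidential rows, which shows how the start data, rather than the confidential data, sets the number of synthetic observations.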
3.2 Schema
Next, we need to set up a schema object, which handles data type information about each column in the confidential data. When this information isn’t explicitly supplied to schema, data types are inferred from the confidential data.
The data type information is contained in the col_schema attribute. Here is what was inferred from penguins_complete:
$species
$species$dtype
[1] "fct"
$species$levels
[1] "Adelie" "Chinstrap" "Gentoo"
$species$na_prop
[1] 0
$island
$island$dtype
[1] "fct"
$island$levels
[1] "Biscoe" "Dream" "Torgersen"
$island$na_prop
[1] 0
$bill_length_mm
$bill_length_mm$dtype
[1] "dbl"
$bill_length_mm$na_prop
[1] 0
$bill_depth_mm
$bill_depth_mm$dtype
[1] "dbl"
$bill_depth_mm$na_prop
[1] 0
$flipper_length_mm
$flipper_length_mm$dtype
[1] "dbl"
$flipper_length_mm$na_prop
[1] 0
$body_mass_g
$body_mass_g$dtype
[1] "dbl"
$body_mass_g$na_prop
[1] 0
$sex
$sex$dtype
[1] "fct"
$sex$levels
[1] "female" "male"
$sex$na_prop
[1] 0
If we need to, we can modify the col_schema by providing overrides to entries for specific variables. Here’s an example where we add an extra factor level for species:
schema has additional arguments for enforcing particular data types, like coercing numeric values to dbl or categorical values to fct.
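To illustrate what an override amounts to, the sketch below works on a plain nested list shaped like the col_schema output shown earlier and merges an override into it with base R's modifyList(). It does not call tidysynthesis, and the extra level "Macaroni" is a hypothetical value chosen for illustration.

```r
# A plain list mirroring the structure of the printed col_schema entry for species.
col_schema <- list(
  species = list(
    dtype = "fct",
    levels = c("Adelie", "Chinstrap", "Gentoo"),
    na_prop = 0
  )
)

# Hypothetical override: only the field we want to change is supplied.
overrides <- list(
  species = list(
    levels = c("Adelie", "Chinstrap", "Gentoo", "Macaroni")
  )
)

# modifyList() merges nested lists recursively, so dtype and na_prop are kept.
col_schema_new <- modifyList(col_schema, overrides)
```

The point of the sketch is that an override only needs to name the entries that differ from the inferred values. The rest of the inferred schema is left untouched.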
3.3 Visit Sequence
The next decision we need to make for tidysynthesis is the order of the sequential synthesis. The order of variables for the sequential synthesis is called the visit sequence. visit_sequence() creates the visit sequence.
There are many different approaches to picking a visit sequence. Let’s consider a few examples.
First, let’s use subject matter expertise to manually set a visit sequence.
visit_seq_manual <- visit_sequence(
  schema = schema,
  type = "manual",
  manual_vars = c("bill_depth_mm",
                  "bill_length_mm",
                  "body_mass_g",
                  "flipper_length_mm")
)
visit_seq_manual
Method: manual
Visit Sequence
bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
The manual approach will rarely work for data sets with many variables. Instead, let’s pick an important variable and then synthesize from most to least correlated with that important variable.
visit_seq_corr <- visit_sequence(
  schema = schema,
  type = "correlation",
  cor_var = "bill_length_mm"
)
visit_seq_corr
Method: correlation
Visit Sequence
bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
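The idea behind a correlation-based ordering can be sketched in base R. This is not the package's implementation. It assumes the ordering uses the absolute Pearson correlation with the chosen variable, and it uses only the first six rows of the confidential data printed later in this section, so the resulting order need not match the output above.

```r
# First six rows of the numeric columns of the confidential data.
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9),
  bill_depth_mm = c(18.7, 17.4, 18.0, 19.3, 20.6, 17.8),
  flipper_length_mm = c(181, 186, 195, 193, 190, 181),
  body_mass_g = c(3750, 3800, 3250, 3450, 3650, 3625)
)

cor_var <- "bill_length_mm"

# Absolute correlation of every variable with the chosen variable (an assumption).
abs_cors <- abs(cor(toy)[, cor_var])

# Visit the chosen variable first, then the rest from most to least correlated.
others <- setdiff(names(abs_cors), cor_var)
visit_order <- c(cor_var, others[order(abs_cors[others], decreasing = TRUE)])
visit_order
```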
Other approaches include ordering by the proportion of non-zero values, the weighted total of variables, or the weighted absolute totals of variables.
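As one example of these alternatives, ordering by the proportion of non-zero values can be sketched in base R. The data and column names are made up, and visiting variables from the highest to the lowest proportion is an assumption made for this sketch, not a statement about how tidysynthesis orders them.

```r
# Made-up data with many zeros, as is common for income-type variables.
toy <- data.frame(
  wages = c(100, 0, 250, 300, 0, 120),
  dividends = c(0, 0, 0, 15, 0, 0),
  interest = c(5, 0, 0, 2, 1, 0)
)

# Share of non-zero values in each column.
prop_nonzero <- colMeans(toy != 0)

# Assumed direction: variables with the most non-zero values come first.
visit_order <- names(sort(prop_nonzero, decreasing = TRUE))
visit_order
# wages (4/6), then interest (3/6), then dividends (1/6)
```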
3.4 Roadmap
The confidential data, start data, and visit sequence are important inputs to other tidysynthesis functions. The roadmap simplifies future code by keeping the schema and visit_sequence in one object.
$conf_data
# A tibble: 333 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
$start_data
# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Chinstrap Dream female
2 Adelie Biscoe female
3 Adelie Dream female
4 Gentoo Biscoe male
5 Gentoo Biscoe female
6 Adelie Torgersen female
7 Adelie Biscoe female
8 Chinstrap Dream male
9 Adelie Torgersen male
10 Adelie Dream male
# ℹ 323 more rows
$visit_sequence
Method: correlation
Visit Sequence
bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
$var_type
bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
"numeric" "numeric" "numeric" "numeric"
$var_characteristics
$var_characteristics$no_variation
bill_length_mm flipper_length_mm body_mass_g bill_depth_mm
FALSE FALSE FALSE FALSE
$strata
NULL
$col_schema
$col_schema$species
$col_schema$species$dtype
[1] "fct"
$col_schema$species$levels
[1] "Adelie" "Chinstrap" "Gentoo"
$col_schema$species$na_prop
[1] 0
$col_schema$island
$col_schema$island$dtype
[1] "fct"
$col_schema$island$levels
[1] "Biscoe" "Dream" "Torgersen"
$col_schema$island$na_prop
[1] 0
$col_schema$bill_length_mm
$col_schema$bill_length_mm$dtype
[1] "dbl"
$col_schema$bill_length_mm$na_prop
[1] 0
$col_schema$bill_depth_mm
$col_schema$bill_depth_mm$dtype
[1] "dbl"
$col_schema$bill_depth_mm$na_prop
[1] 0
$col_schema$flipper_length_mm
$col_schema$flipper_length_mm$dtype
[1] "dbl"
$col_schema$flipper_length_mm$na_prop
[1] 0
$col_schema$body_mass_g
$col_schema$body_mass_g$dtype
[1] "dbl"
$col_schema$body_mass_g$na_prop
[1] 0
$col_schema$sex
$col_schema$sex$dtype
[1] "fct"
$col_schema$sex$levels
[1] "female" "male"
$col_schema$sex$na_prop
[1] 0
attr(,"class")
[1] "roadmap"