4 Recipes, Algorithms, and Samplers

Sequential synthesis requires a series of predictive models for imputation. tidysynthesis aims to bring the full predictive modeling toolkit to data synthesis by leveraging the power of tidymodels¹.

4.1 Recipes

Feature and target engineering² is the addition, deletion, and transformation of variables before applying algorithms to the data for predictive modeling. It is an important tool for improving predictive models.

The recipes package, loaded with library(recipes), specifies the relationships in a predictive model, along with any feature and target engineering, for tidymodels.

tidysynthesis requires recipes for predictive models. These recipes can include just the formula for each model. They can also include target engineering like a Yeo-Johnson transformation and feature engineering like z-score transformations.
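As a minimal sketch of what a recipe holds, the following uses the recipes package directly on the built-in mtcars data rather than the penguins data used elsewhere in this chapter. The variable choices (mpg, hp, wt) are illustrative assumptions and are not part of tidysynthesis.

```r
library(recipes)

# Formula only: mpg is the outcome, hp and wt are predictors.
base_rec <- recipe(mpg ~ hp + wt, data = mtcars)

# Inspect the role assigned to each variable.
roles <- summary(base_rec)

# Feature engineering: z-score every numeric predictor.
z_rec <- base_rec %>%
  step_normalize(all_numeric_predictors())

# Estimate the step on the training data and return the processed data.
z_data <- z_rec %>%
  prep() %>%
  bake(new_data = NULL)
```

After baking, hp and wt should have mean 0 and standard deviation 1, while mpg is left untouched.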

Note: Many methods of target engineering are currently not supported. This is because predict() returns values in the transformed units, so the synthetic data set would not be on the original scale. tidymodels will eventually natively support reverse target engineering.

For now, we can use construct_recipes() to create a sequence of formulae without feature or target engineering.

penguins_rec <- construct_recipes(roadmap = roadmap)

penguins_rec

$bill_length_mm

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 3


$flipper_length_mm

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4


$body_mass_g

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 5


$bill_depth_mm

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 6

4.2 Algorithms

tidymodels supports scores of different predictive modeling algorithms including linear regression, regression trees, and random forests.

tidysynthesis takes tidymodels models, without alteration, as inputs. For example, here is a linear regression model.

lm_mod <- parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")
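Other algorithms are specified the same way. As a sketch, here is a regression tree specification with the rpart engine, which is the kind of model that would pair with a tree-based sampler. The name rpart_mod and the mtcars fit are illustrative assumptions; the fit is only there to show that the specification is usable on its own.

```r
library(parsnip)

# A regression tree specification using the rpart engine.
rpart_mod <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Illustration only: fit the specification to a built-in data set.
tree_fit <- fit(rpart_mod, mpg ~ hp + wt, data = mtcars)
```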

4.3 Samplers

Most predictive models predict values based on a conditional mean or conditional median. Synthesizing data directly from conditional means generates synthetic data with too little sample variance and too few observations in the tails of distributions (Little and Rubin 2019).
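The loss of variance is easy to see in a small simulation. The sketch below uses simulated data and base R only; the coefficients and noise level are arbitrary assumptions chosen for illustration.

```r
set.seed(20240101)

# Simulated confidential data: y depends linearly on x plus noise.
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 4)

fit <- lm(y ~ x)

# Conditional mean synthesis: every synthetic value sits on the regression line.
y_mean_synth <- unname(predict(fit))

# Share of the original variance retained by the synthetic values.
# For least squares with an intercept this equals the model's R-squared.
var_ratio <- var(y_mean_synth) / var(y)

# Compare the tails of the original and synthetic values.
range(y)
range(y_mean_synth)
```

Because the ratio equals R-squared, the synthetic variable always has less variance than the original unless the model fits perfectly.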

tidysynthesis has special methods that synthesize values from a predictive distribution. Consider a regression tree. To make a prediction with a regression tree, it is common to plug in predictors to find a unique final node and then predict the mean of the final node. To generate adequate sample variance, sample_rpart() predicts a randomly drawn observation from the corresponding final node.
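The idea behind drawing from a final node can be sketched with the rpart package directly. This is an illustration of the technique under simplifying assumptions (it synthesizes values for the training rows of mtcars) and is not the implementation of sample_rpart(); the helper draw_one() is hypothetical.

```r
library(rpart)

set.seed(1)

# Fit a regression tree to a built-in data set.
tree <- rpart(mpg ~ hp + wt + disp, data = mtcars, method = "anova")

# tree$where gives, for each training row, the row of tree$frame
# that corresponds to the final node the observation falls into.
leaf_id <- tree$where

# Pool of observed outcomes in each final node.
leaf_pools <- split(mtcars$mpg, leaf_id)

# Hypothetical helper: draw one observed value at random from a pool.
# sample.int() avoids the surprising behavior of sample() on length-1 vectors.
draw_one <- function(pool) pool[sample.int(length(pool), 1)]

# For each row, draw one observed outcome from its own final node.
synth_mpg <- vapply(
  as.character(leaf_id),
  function(id) draw_one(leaf_pools[[id]]),
  numeric(1),
  USE.NAMES = FALSE
)

# For comparison: conditional-mean predictions take one value per final node.
mean_mpg <- unname(predict(tree))
```

The synthetic values vary within each final node, whereas the conditional-mean predictions collapse every node to a single number.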

It is important to align the sample_*() method with the predictive algorithm used for each variable. Current options are sample_lm(), sample_rpart(), and sample_rf().

4.4 Synthesis Specification

The roadmap, algorithms, recipes, and predict methods all go into a synth_spec object. In this example, there is no feature or target engineering, the predictive model is linear regression, and the prediction method is a random draw from a normal distribution centered at the conditional mean of the regression line, with variance equal to the variance of the residuals of the estimated linear regression model.

synth_spec <- synth_spec(
  roadmap = roadmap,
  recipes = penguins_rec,  
  synth_algorithms = lm_mod,
  predict_methods = sample_lm
)
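The prediction method described above for linear regression can be sketched in base R. The helper draw_from_lm() is hypothetical and is not the implementation of sample_lm(); it uses the residual standard error from summary() as the standard deviation and ignores uncertainty in the estimated coefficients.

```r
set.seed(42)

fit <- lm(mpg ~ hp + wt, data = mtcars)

# Hypothetical helper: draw from a normal distribution centered at the
# conditional mean with standard deviation equal to the residual standard error.
draw_from_lm <- function(model, new_data) {
  mu <- predict(model, newdata = new_data)
  sigma <- summary(model)$sigma
  rnorm(length(mu), mean = mu, sd = sigma)
}

synth_mpg <- draw_from_lm(fit, mtcars)
```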

4.5 Custom Recipes

It is possible to use custom recipes. Pass a custom recipe to synth_spec() to apply one recipe to all variables, or use construct_recipes() to apply different recipes to different variables. To use a custom recipe, create a custom function like the following:

custom_recipe1 <- function(recipe) {
 
  recipe %>%
    recipes::step_YeoJohnson(bill_depth_mm)
   
}

The above step performs a Yeo-Johnson transformation on bill_depth_mm. library(recipes) has a robust system for selecting variables based on their roles in the recipe in addition to explicitly listing variables.
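As a sketch of role-based selection, the custom function below applies a Yeo-Johnson transformation to every numeric predictor instead of naming a variable. The function name custom_recipe_roles and the mtcars formula are illustrative assumptions. Only predictors are transformed, so this is feature engineering rather than target engineering.

```r
library(recipes)

# Hypothetical custom recipe function that selects variables by role and type.
custom_recipe_roles <- function(recipe) {
  recipe %>%
    step_YeoJohnson(all_numeric_predictors())
}

base_rec <- recipe(mpg ~ hp + wt + disp, data = mtcars)

# Add the step, then estimate the transformation on the training data.
yj_fit <- custom_recipe_roles(base_rec) %>%
  prep()

# Estimated lambda for each transformed predictor.
lambdas <- tidy(yj_fit, number = 1)

# Processed training data; the outcome is not transformed.
yj_data <- bake(yj_fit, new_data = NULL)
```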

4.6 Bespoke Combinations

The above examples use the same recipe, algorithm, and sampler for every variable in the visit sequence. There are three ways to specify recipes, algorithms, and samplers. Let’s use samplers as an example:

1. Use the same sampler for every variable.
2. Use a default sampler and manually override it for specific variables.
3. Manually specify the sampler for each variable.

4.6.1 Approach 1

To use the same recipe, algorithm, and/or sampler for each variable, simply pass the object into the synth_spec() function.

synth_spec <- synth_spec(
  roadmap = roadmap,
  recipes = penguins_rec,  
  synth_algorithms = lm_mod,
  predict_methods = sample_lm
)

4.6.2 Approach 2

Using a default sampler and manually overriding it for specific variables requires the construct_samplers() function. This example uses sample_rpart() for bill_length_mm and flipper_length_mm, and then uses sample_rpart_custom() for body_mass_g and bill_depth_mm.

samplers2 <- construct_samplers(
  roadmap = roadmap, 
  default_sampler = sample_rpart, 
  custom_samplers = list(
    list(vars = c("body_mass_g", "bill_depth_mm"),
         sampler = sample_rpart_custom)
  )
)

synth_spec2 <- synth_spec(
  roadmap = roadmap,
  recipes = penguins_rec,  
  synth_algorithms = lm_mod,
  predict_methods = samplers2
)

construct_recipes() and construct_algos() use the same basic syntax.
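The default-plus-override pattern can be sketched in base R. The helper resolve_by_variable() is hypothetical and is not part of tidysynthesis; it only illustrates how a default and a list of overrides could resolve to one choice per variable. Strings stand in for the sampler functions so that the example is self-contained.

```r
# Hypothetical helper, not part of tidysynthesis.
resolve_by_variable <- function(vars, default, custom = list()) {
  # Start with the default for every variable.
  out <- stats::setNames(rep(list(default), length(vars)), vars)

  # Apply each override to the variables it names.
  for (entry in custom) {
    unknown <- setdiff(entry$vars, vars)
    if (length(unknown) > 0) {
      stop("Unknown variables: ", paste(unknown, collapse = ", "))
    }
    out[entry$vars] <- rep(list(entry$value), length(entry$vars))
  }

  out
}

visit_sequence <- c(
  "bill_length_mm", "flipper_length_mm", "body_mass_g", "bill_depth_mm"
)

resolved <- resolve_by_variable(
  vars = visit_sequence,
  default = "sample_rpart",
  custom = list(
    list(vars = c("body_mass_g", "bill_depth_mm"),
         value = "sample_rpart_custom")
  )
)
```

Here the first two variables keep the default and the last two receive the override.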

4.6.3 Approach 3

It is also possible to manually specify the sampler for each variable. In the following example, bill_length_mm, flipper_length_mm, and body_mass_g use sample_rpart() and bill_depth_mm uses sample_rpart_custom().

samplers3 <- construct_samplers(
  roadmap = roadmap, 
  default_sampler = NULL, 
  custom_samplers = list(
    "bill_length_mm" = sample_rpart,
    "flipper_length_mm" = sample_rpart,
    "body_mass_g" = sample_rpart,
    "bill_depth_mm" = sample_rpart_custom)
)

synth_spec3 <- synth_spec(
  roadmap = roadmap,
  recipes = penguins_rec,  
  synth_algorithms = lm_mod,
  predict_methods = samplers3
)

construct_recipes() and construct_algos() use the same basic syntax.

Little, Roderick JA, and Donald B Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.

¹ Tidy Modeling with R by Max Kuhn and Julia Silge and the tidymodels website are essential background.
² Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson is an informative book.