11 Models and samplers

tidysynthesis follows the tidymodels conventions for modeling. Any supervised predictive model (i.e., any model that predicts a response variable given a set of predictors) can be incorporated into tidymodels and then used by tidysynthesis.

To learn more about the models tidymodels currently supports, see the searchable parsnip model list on the tidymodels website.

To learn how to specify a custom tidymodels implementation, see the tidymodels documentation on building a parsnip model.

11.1 Sampler functions and specifying custom samplers

Each sample_*() function must accept the following arguments:

  • model: a model_fit object created by the parsnip package.
  • new_data: a data.frame of working synthetic data, not including the new variable to synthesize. new_data will have the same number of rows as start_data in roadmap.
  • conf_data: a confidential data.frame from roadmap.

Each sample_*() function can return either…

  1. A vector of sampled responses with the same length as new_data (i.e., start_data in roadmap).
  2. A named list containing, at a minimum, one element y_hat with the same vector described in #1.
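As a minimal sketch of the signature and return value described above, the following defines a hypothetical custom sampler. The name sample_gaussian_noise, the extra list element resid_sd, and the use of an "lm" engine are assumptions made for illustration; this is not one of the samplers shipped with tidysynthesis, and how a sampler is registered with a synthesis specification is not shown.

```r
library(parsnip)

# Hypothetical custom sampler (illustrative name and body).
# It takes the three arguments listed above and returns a named list
# whose y_hat element has one sampled value per row of new_data.
sample_gaussian_noise <- function(model, new_data, conf_data) {
  # Point predictions from the parsnip model_fit object
  pred <- predict(model, new_data = new_data)$.pred
  # Residual standard deviation of the underlying fit
  # (assumption: the engine is "lm", so model$fit is an lm object)
  resid_sd <- stats::sigma(model$fit)
  # Add normally distributed noise around each prediction
  y_hat <- stats::rnorm(n = length(pred), mean = pred, sd = resid_sd)
  list(y_hat = y_hat, resid_sd = resid_sd)
}

# Stand-in data: mtcars plays the role of the confidential data
set.seed(42)
conf_data <- mtcars
spec <- set_engine(linear_reg(), "lm")
model <- fit(spec, mpg ~ wt + hp, data = conf_data)

# Working synthetic data without the variable being synthesized
new_data <- conf_data[, c("wt", "hp")]
out <- sample_gaussian_noise(model, new_data, conf_data)
```

Returning a list rather than a bare vector leaves room for extra diagnostics alongside y_hat, which is the second return form described above.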

11.2 Pre-specified sampling function methodology

Recall that there are two broad classes of modeling:

  • Generative models capture the joint probability distribution of multiple random variables simultaneously.
  • Discriminative models capture the conditional distribution of a random variable given other (typically observable) variables.

Sequentially generated synthetic data uses a sequence of discriminative models to produce a single generative model by chaining together conditional probabilities into a single joint probability.
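The chaining relies on the chain rule of probability. Using notation assumed here for illustration (variables x_1, ..., x_p synthesized in that order), the joint distribution factors as:

$$
p(x_1, \dots, x_p) = p(x_1) \prod_{j=2}^{p} p(x_j \mid x_1, \dots, x_{j-1})
$$

Each conditional factor on the right corresponds to one discriminative model, so sampling from the factors in order yields a draw from the joint distribution they define.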

To successfully sample new records from discriminative models, either…

  1. The sampler function should implement the model’s data generating assumptions to generate random samples for a plausible new outcome given observed predictors, OR…
  2. The sampler function should impose new data generating assumptions on a model (typically an optimization without explicit data generating assumptions) to generate new samples.

In scenario #1, where we’re matching an existing data generating model, tidysynthesis provides two commonly used sampler functions:

  • sample_lm:
    • Model: classical linear regression (LM).
    • Sampler: samples new values from the predictive distribution of a classical linear regression, i.e. a normal distribution.
  • sample_glm:
    • Model: generalized linear model (GLM).
    • Sampler: samples new values from the predictive distribution of a GLM. Note that this function currently supports logistic and Poisson regression but can be trivially extended to any GLM.
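The idea behind scenario #1 can be sketched in base R. This is a simplified illustration, not the implementation of sample_lm or sample_glm: it plugs in the fitted parameters and ignores their estimation uncertainty, and whether the package samplers do the same is not stated here. The data (mtcars) and the chosen formulas are stand-ins.

```r
set.seed(123)
conf <- mtcars                      # stand-in for confidential data
new_data <- conf[, c("wt", "hp")]   # predictors available for sampling

# Linear regression: draw from a normal distribution centered on the
# fitted mean, with the residual standard deviation as its spread
lm_fit <- lm(mpg ~ wt + hp, data = conf)
mu <- predict(lm_fit, newdata = new_data)
y_lm <- rnorm(length(mu), mean = mu, sd = sigma(lm_fit))

# Logistic regression: draw Bernoulli outcomes from fitted probabilities
logit_fit <- glm(am ~ wt, data = conf, family = binomial())
p <- predict(logit_fit, newdata = new_data, type = "response")
y_logit <- rbinom(length(p), size = 1, prob = p)

# Poisson regression: draw counts from fitted rates
pois_fit <- glm(carb ~ hp, data = conf, family = poisson())
lambda <- predict(pois_fit, newdata = new_data, type = "response")
y_pois <- rpois(length(lambda), lambda = lambda)
```

In each case the sampler draws from the distribution the model already assumes for the response, rather than returning the fitted mean itself.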

In scenario #2, where we’re imposing data generating assumptions onto optimization-based models, tidysynthesis provides two commonly used sampler functions:

  • sample_rpart:
    • Model: decision trees (also known as classification and regression trees, or CART)
    • Sampler: depends on the response type.
      • Classification: sample new values from the empirical distribution of categorical levels at the terminal nodes of the decision tree.
      • Regression: sample new values from the empirical distribution of values predicted at the terminal nodes of the decision tree.
  • sample_ranger:
    • Model: random forests.
    • Sampler: follows the same procedure as sample_rpart, using one tree selected uniformly at random from the random forest.
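The regression case for decision trees can be sketched with the rpart package directly. This is an illustration of terminal-node sampling under stated assumptions, not the code of sample_rpart: the helper leaf_of is hypothetical, mtcars stands in for confidential data, and the donor pool for each leaf is taken to be the confidential response values that fall into that leaf.

```r
library(rpart)

set.seed(1)
conf <- mtcars                      # stand-in for confidential data
new_data <- conf[, c("wt", "hp")]   # predictors available for sampling

# Regression tree fit on the confidential data
tree <- rpart(mpg ~ wt + hp, data = conf, method = "anova")

# Hypothetical helper: find the terminal node of each new record.
# Overwriting yval with the frame row index makes predict() return the
# same leaf identifiers that tree$where stores for the training rows.
leaf_of <- function(fit, newdata) {
  fit$frame$yval <- seq_len(nrow(fit$frame))
  as.integer(predict(fit, newdata = newdata, type = "vector"))
}

# Confidential response values grouped by terminal node
donors <- split(conf$mpg, tree$where)

# For each new record, draw one value from its leaf's donor pool
new_leaves <- leaf_of(tree, new_data)
y_hat <- vapply(new_leaves, function(leaf) {
  pool <- donors[[as.character(leaf)]]
  pool[sample.int(length(pool), 1)]  # safe even when the pool has one value
}, numeric(1))
```

For classification the same pattern applies with categorical levels in place of numeric values. A random forest variant would first pick one tree at random and then proceed in the same way.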