11 Models and samplers
tidysynthesis follows the tidymodels conventions for modeling. Any supervised predictive model (i.e., any model that predicts a response variable given a set of predictors) can be incorporated into tidymodels and then used by tidysynthesis.
To learn more about the models tidymodels currently supports, see the searchable model list on the tidymodels website.
To learn how to specify a custom tidymodels implementation, see the tidymodels documentation on building a custom model.
11.1 Sampler functions and specifying custom samplers
Each `sample_*()` function must accept the following arguments:
- `model`: a `model_fit` object created by `library(parsnip)`.
- `new_data`: a `data.frame` of working synthetic data, not including the new variable to synthesize. `new_data` will have the same number of rows as `start_data` in `roadmap`.
- `conf_data`: a confidential `data.frame` from `roadmap`.
Each `sample_*()` function can return either…
1. A vector of sampled responses with the same length as the number of rows in `new_data` (i.e., `start_data` in `roadmap`).
2. A named list containing, at a minimum, one element `y_hat` with the same vector described in #1.
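As a minimal sketch of this contract, the hypothetical sampler below (not part of tidysynthesis) adds a resampled training residual to each predicted mean and returns the named-list form. It assumes the engine-level fit is stored in `model$fit`, as it is for parsnip `model_fit` objects, and it uses a plain list as a stand-in for a real `model_fit` so the example runs with base R only.

```r
# Hypothetical custom sampler following the (model, new_data, conf_data) signature.
sample_resid_bootstrap <- function(model, new_data, conf_data) {
  # parsnip model_fit objects keep the underlying engine fit in `$fit`
  engine_fit <- model$fit
  # Predicted mean for each working synthetic record
  mu <- as.numeric(stats::predict(engine_fit, newdata = new_data))
  # Resample training residuals with replacement (an assumption imposed here)
  res <- unname(stats::residuals(engine_fit))
  noise <- res[sample.int(length(res), size = nrow(new_data), replace = TRUE)]
  # `conf_data` is unused in this sketch but is part of the expected signature
  list(y_hat = mu + noise)
}

set.seed(1)
conf_data <- data.frame(x = 1:20)
conf_data$y <- 2 * conf_data$x + rnorm(20)

# Stand-in for a parsnip model_fit: a list holding the engine fit in `$fit`
model <- list(fit = lm(y ~ x, data = conf_data))
new_data <- data.frame(x = c(3, 7, 11))

out <- sample_resid_bootstrap(model, new_data, conf_data)
```

Returning `out$y_hat` directly instead of the list would satisfy the first return form.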
11.2 Pre-specified sampling function methodology
Recall that there are two broad classes of modeling:
- Generative models capture the joint probability distribution of multiple random variables simultaneously.
- Discriminative models capture the conditional distribution of a random variable given other (typically observable) variables.
Sequentially generated synthetic data uses a sequence of discriminative models to produce a single generative model by chaining together conditional probabilities into a single joint probability.
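This chaining is the chain rule of probability. As a sketch, with $x_1, \dots, x_p$ as illustrative notation (not taken from the tidysynthesis documentation) for the variables in synthesis order:

$$
p(x_1, \dots, x_p) = p(x_1) \prod_{j=2}^{p} p(x_j \mid x_1, \dots, x_{j-1})
$$

Each conditional factor on the right corresponds to one discriminative model, so sampling from the factors in order produces a draw from the joint distribution on the left.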
To successfully sample new records from discriminative models, either…
1. The `sampler` function should implement the model’s data generating assumptions to generate random samples for a plausible new outcome given observed predictors, OR…
2. The `sampler` function should impose new data generating assumptions on a model (typically an optimization without explicit data generating assumptions) to generate new samples.
In scenario #1, where we’re matching an existing data generating model, tidysynthesis provides two commonly used sampler functions:
- `sample_lm`:
  - Model: classical linear regression (LM).
  - Sampler: samples new values from the predictive distribution of a classical linear regression, i.e., a normal distribution.
- `sample_glm`:
  - Model: generalized linear model (GLM).
  - Sampler: samples new values from the predictive distribution of a generalized linear model (GLM). Note this function currently supports logistic and Poisson regression but can be trivially extended to any GLM.
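The idea behind these two samplers can be sketched with base R alone. This is not the tidysynthesis implementation: it is a simplified plug-in version that draws from the fitted distribution and ignores parameter uncertainty, and all data and variable names are made up for illustration.

```r
set.seed(20)

# Made-up confidential data with numeric, binary, and count responses
conf_data <- data.frame(x = runif(200))
conf_data$y_num <- 1 + 2 * conf_data$x + rnorm(200, sd = 0.5)
conf_data$y_bin <- rbinom(200, size = 1, prob = plogis(-1 + 2 * conf_data$x))
conf_data$y_cnt <- rpois(200, lambda = exp(0.5 + conf_data$x))

# Working synthetic data holding only the predictor
new_data <- data.frame(x = runif(50))
n <- nrow(new_data)

# Linear regression: normal draws around the fitted mean with residual sd
lm_fit <- lm(y_num ~ x, data = conf_data)
y_lm <- rnorm(n, mean = predict(lm_fit, newdata = new_data), sd = sigma(lm_fit))

# Logistic regression: Bernoulli draws with the fitted probabilities
logit_fit <- glm(y_bin ~ x, family = binomial(), data = conf_data)
p_hat <- predict(logit_fit, newdata = new_data, type = "response")
y_logit <- rbinom(n, size = 1, prob = p_hat)

# Poisson regression: Poisson draws with the fitted rates
pois_fit <- glm(y_cnt ~ x, family = poisson(), data = conf_data)
lambda_hat <- predict(pois_fit, newdata = new_data, type = "response")
y_pois <- rpois(n, lambda = lambda_hat)
```

In each case the sampler draws a random outcome from the model's assumed distribution rather than returning the point prediction, which is what keeps the synthetic variable from being a deterministic function of its predictors.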
In scenario #2, where we’re imposing data generating assumptions onto optimization-based models, tidysynthesis provides two commonly used sampler functions:
- `sample_rpart`:
  - Model: decision trees (also known as classification and regression trees, or CART).
  - Sampler: depends on the response type.
    - Classification: sample new values from the empirical distribution of categorical levels at the terminal nodes of the decision tree.
    - Regression: sample new values from the empirical distribution of values predicted at the terminal nodes of the decision tree.
- `sample_ranger`:
  - Model: random forests.
  - Sampler: follow the same procedure as in `sample_rpart` using one tree from the random forest model selected uniformly at random.
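The terminal-node procedure can be sketched directly with the rpart package. This is an illustration under stated assumptions, not the tidysynthesis code: the regression case assumes the "empirical distribution" means the confidential response values that fell into the same terminal node, and it locates each new record's node by temporarily overwriting the tree's node values with node row indices. All data and variable names are made up.

```r
library(rpart)
set.seed(33)

# Made-up confidential data with a numeric and a categorical response
conf_data <- data.frame(x = runif(300))
conf_data$y <- ifelse(conf_data$x > 0.5, 10, 0) + rnorm(300)
conf_data$g <- factor(ifelse(conf_data$x + rnorm(300, sd = 0.2) > 0.5, "a", "b"))
new_data <- data.frame(x = runif(40))

# --- Regression: draw an observed response from the matching terminal node ---
reg_tree <- rpart(y ~ x, data = conf_data, method = "anova")

# Copy the tree and store each node's row index as its "prediction", so that
# predict() reports which terminal node every new record lands in
node_tree <- reg_tree
node_tree$frame$yval <- seq_len(nrow(node_tree$frame))
new_leaf <- predict(node_tree, newdata = new_data, type = "vector")

# Confidential responses grouped by the terminal node they were assigned to
donors <- split(conf_data$y, reg_tree$where)
y_hat <- vapply(as.character(new_leaf), function(leaf) {
  pool <- donors[[leaf]]
  pool[sample.int(length(pool), 1)]
}, numeric(1), USE.NAMES = FALSE)

# --- Classification: draw a level using the terminal node's class shares ---
class_tree <- rpart(g ~ x, data = conf_data, method = "class")
probs <- predict(class_tree, newdata = new_data, type = "prob")
g_hat <- apply(probs, 1, function(p) sample(colnames(probs), size = 1, prob = p))
```

For a random forest, the same steps would be applied to a single tree drawn uniformly at random from the forest.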