11 models and samplers
`tidysynthesis` follows the `tidymodels` conventions for modeling. Any supervised predictive model (i.e., any model that predicts a response variable given a set of predictors) can be incorporated into `tidymodels` and then used by `tidysynthesis`.
To learn more about the models that `tidymodels` currently supports, see the searchable model list on the tidymodels website.
To learn how to specify a custom `tidymodels` implementation, see the tidymodels documentation.
11.1 `sampler` functions and specifying custom samplers
Each `sample_*()` function expects a signature with the following arguments:
- `model`: a `model_fit` object created by `library(parsnip)`.
- `new_data`: a `data.frame` of working synthetic data, not including the new variable to synthesize. `new_data` will have the same number of rows as `start_data` in `roadmap`.
- `conf_data`: a confidential `data.frame` from `roadmap`.
Each `sample_*()` function can return either:
1. A vector of sampled responses with the same length as `new_data` (i.e., `start_data` in `roadmap`).
2. A named list containing, at a minimum, one element `y_hat` with the same vector described in #1.
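To make the signature and return contract concrete, here is a minimal sketch of a custom sampler. The function name `sample_gaussian_noise`, the extra list element `sigma_hat`, and the use of `mtcars` as stand-in data are all hypothetical choices for illustration. The sketch assumes a regression model fit with the `lm` engine, and it is not how any built-in `tidysynthesis` sampler is implemented.

```r
library(parsnip)

# Hypothetical custom sampler: adds Gaussian noise to the model's point
# predictions. Assumes `model` wraps an `lm` fit, so stats::sigma() applies.
sample_gaussian_noise <- function(model, new_data, conf_data) {
  # Point predictions from the parsnip model_fit object
  mu <- predict(model, new_data = new_data)$.pred
  # Residual standard error taken from the underlying engine fit
  sigma_hat <- stats::sigma(model$fit)
  # One random draw per row of new_data
  y_hat <- stats::rnorm(n = length(mu), mean = mu, sd = sigma_hat)
  # Named list with the required `y_hat` element plus optional extras
  list(y_hat = y_hat, sigma_hat = sigma_hat)
}

# Stand-in data: mtcars plays the role of the confidential data
lm_model <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

set.seed(1)
out <- sample_gaussian_noise(
  model = lm_model,
  new_data = mtcars[, c("wt", "hp")],
  conf_data = mtcars
)
```

Here `conf_data` is accepted but unused; it is kept only because the expected signature includes it.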
11.2 Pre-specified sampling function methodology
Recall that there are two broad classes of modeling:
- Generative models capture the joint probability distribution of multiple random variables simultaneously.
- Discriminative models capture the conditional distribution of a random variable given other (typically observable) variables.
Sequentially generated synthetic data uses a sequence of discriminative models to produce a single generative model by chaining together conditional probabilities into a single joint probability.
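The chaining described here is the chain rule of probability. Writing the variables in synthesis order as $x_1, \dots, x_p$ (notation introduced for illustration only), the joint distribution factors as:

$$
p(x_1, \dots, x_p) = p(x_1) \prod_{j=2}^{p} p(x_j \mid x_1, \dots, x_{j-1})
$$

Each conditional factor on the right corresponds to one discriminative model, and sampling from the factors in order yields a draw from the joint distribution on the left.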
To successfully sample new records from discriminative models, either:
1. The `sampler` function should implement the model’s data generating assumptions to generate random samples for a plausible new outcome given observed predictors, or
2. The `sampler` function should impose new data generating assumptions on a model (typically an optimization without explicit data generating assumptions) to generate new samples.
In scenario #1, where we’re matching an existing data generating model, `tidysynthesis` provides two commonly used sampler functions:
- `sample_lm`:
  - Model: classical linear regression (LM).
  - Sampler: samples new values from the predictive distribution of a classical linear regression, i.e., a normal distribution.
- `sample_glm`:
  - Model: generalized linear model (GLM).
  - Sampler: samples new values from the predictive distribution of a generalized linear model (GLM). Note that this function currently supports logistic and Poisson regression, but the same approach extends to other GLM families.
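The idea behind sampling from a model's own data generating assumptions can be sketched in base R. This is a simplified plug-in version that treats the fitted parameters as fixed and ignores their estimation uncertainty; it is not the `tidysynthesis` implementation, and the `mtcars` variables are stand-ins chosen only for illustration.

```r
set.seed(42)

# Stand-in for the working synthetic predictors
new_data <- mtcars[, c("wt", "hp")]

# Linear regression: draw from a normal distribution centered on the
# fitted mean, using the residual standard error as the spread.
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
mu <- predict(lm_fit, newdata = new_data)
y_lm <- rnorm(length(mu), mean = mu, sd = sigma(lm_fit))

# Logistic regression: draw Bernoulli outcomes with the fitted probabilities.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial())
p_hat <- predict(logit_fit, newdata = new_data, type = "response")
y_logit <- rbinom(length(p_hat), size = 1, prob = p_hat)

# Poisson regression: draw counts with the fitted rates.
pois_fit <- glm(carb ~ hp, data = mtcars, family = poisson())
lambda_hat <- predict(pois_fit, newdata = new_data, type = "response")
y_pois <- rpois(length(lambda_hat), lambda = lambda_hat)
```

In each case the sampler draws a random outcome rather than returning the fitted mean, which is what keeps the outcome's variability in the synthetic variable.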
In scenario #2, where we’re imposing data generating assumptions onto optimization-based models, `tidysynthesis` provides two commonly used sampler functions:
- `sample_rpart`:
  - Model: decision trees (also known as classification and regression trees, or CART).
  - Sampler: depends on the response type.
    - Classification: sample new values from the empirical distribution of categorical levels at the terminal nodes of the decision tree.
    - Regression: sample new values from the empirical distribution of values predicted at the terminal nodes of the decision tree.
- `sample_ranger`:
  - Model: random forests.
  - Sampler: follow the same procedure as in `sample_rpart` using one tree from the random forest model selected uniformly at random.
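The terminal-node sampling idea for a regression tree can be sketched directly with `rpart`. This is one possible reading of the procedure, in which each new record receives a value drawn from the confidential outcomes that fell in the same terminal node. It is not the `tidysynthesis` implementation. The variable names and the `mtcars` data are stand-ins, and overwriting `frame$yval` is a workaround used here to recover node membership for new data.

```r
library(rpart)

set.seed(7)

# Stand-in confidential data and working synthetic predictors
conf_data <- mtcars
new_data <- mtcars[sample(nrow(mtcars), 10), c("wt", "hp")]

# Fit a regression tree on the confidential data
tree <- rpart(mpg ~ wt + hp, data = conf_data, method = "anova")

# Workaround: replace each node's predicted value with its row index in
# tree$frame, so predict() returns the terminal node of every new record.
# These indices use the same coding as tree$where.
node_tree <- tree
node_tree$frame$yval <- seq_len(nrow(tree$frame))
new_nodes <- predict(node_tree, newdata = new_data, type = "vector")

# Pool of confidential outcomes in each terminal node
donors <- split(conf_data$mpg, tree$where)

# Draw one donor value per new record from its terminal node's pool
y_hat <- vapply(new_nodes, function(node) {
  pool <- donors[[as.character(node)]]
  pool[sample.int(length(pool), 1)]
}, numeric(1))
```

Every sampled value is an observed confidential outcome, so the tree supplies the grouping while the terminal-node empirical distribution supplies the randomness.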