Rows: 1,500
Columns: 11
$ county <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq <fct> Household, Household, Household, Household, Household, Ho…
$ sex <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany <fct> With health insurance coverage, With health insurance cov…
$ empstat <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …
10 constraints
Many datasets have structural dependencies between variables. For example, some numeric variables have non-negativity constraints, and some combinations of categorical values are not possible. tidysynthesis has the constraints S3 object to support two kinds of these constraints in the synthesis process:
- If a user-specified condition is met based on current synthetic data variables, a newly synthesized numeric variable must fall within a particular numeric range (i.e., a variable with lower and/or upper bounds).
- If a user-specified condition is met based on current synthetic data variables, a newly synthesized categorical variable must either be restricted to certain allowed levels or prohibited from being certain levels (i.e., enforcing allowed or forbidden factor levels).
10.1 Specifying constraint data frames
All constraints in tidysynthesis start with specifying constraint data frames used as inputs to the constraints S3 object. In this document, we create some examples and describe the constraints below. We continue using the example ACS data in tidysynthesis.
Numeric constraint data frames have four required columns:
- var: The name of the synthesized variable to constrain. This variable must be conditionally synthesized (i.e., it cannot be part of the start_data).
- min: The numeric lower bound (non-inclusive) that the synthesized variable can take.
- max: The numeric upper bound (non-inclusive) that the synthesized variable can take.
- conditions: A string that tidy evaluates to a Boolean vector determining when the constraints apply.
For each row, every column must be completed. We can specify one-sided intervals using min = -Inf or max = Inf. Below are some examples using the ACS data:
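As a minimal sketch of the four-column layout described above, the data frame below uses hypothetical bounds and conditions chosen only for illustration; the name sketch_constraints_numeric and the toy data are assumptions, and this is not the package's own ex_acs_constraints_numeric. The last lines use base R eval() as a stand-in to show what "evaluates to a Boolean vector" means; the package's own evaluation mechanism may differ.

```r
# Illustrative only: the bounds and conditions are assumptions, not values
# taken from the tidysynthesis documentation.
sketch_constraints_numeric <- data.frame(
  var = c("famsize", "transit_time"),
  min = c(0, -Inf),     # non-inclusive lower bounds; -Inf means "no lower bound"
  max = c(Inf, 180),    # non-inclusive upper bounds; Inf means "no upper bound"
  conditions = c("TRUE", "empstat == 'Employed'"),
  stringsAsFactors = FALSE
)

# Toy data (hypothetical) to show how a condition string becomes a
# logical vector when evaluated against already-synthesized columns.
toy <- data.frame(
  empstat = c("Employed", "Unemployed", "Employed"),
  stringsAsFactors = FALSE
)
applies <- eval(parse(text = sketch_constraints_numeric$conditions[2]), envir = toy)
applies
```

The condition "TRUE" makes a constraint apply to every row, while the second condition restricts the bound to rows where empstat equals "Employed".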
Categorical constraint data frames have four required columns:
- var: The name of the synthesized variable to constrain. This variable must be conditionally synthesized (i.e., it cannot be part of the start_data).
- allowed: Either NA or the string name of a level to include.
- forbidden: Either NA or the string name of a level to exclude.
- conditions: A string that tidy evaluates to a Boolean vector determining when the constraints apply.
For each row, var and conditions are required, and exactly one of allowed or forbidden must be specified. Below are some examples using ACS data:
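A categorical constraint data frame might be laid out as in the following sketch. The levels and conditions are assumptions made for illustration (they presume that age is synthesized before marst and empstat), and sketch_constraints_categorical is a hypothetical name, not the package's ex_acs_constraints_categorical. The final check mirrors the rule that each row sets exactly one of allowed or forbidden.

```r
# Illustrative only: levels and conditions are assumptions.
sketch_constraints_categorical <- data.frame(
  var = c("marst", "empstat"),
  allowed = c("Single", NA),       # row 1 restricts marst to one allowed level
  forbidden = c(NA, "Employed"),   # row 2 prohibits one empstat level
  conditions = c("age < 15", "age < 15"),
  stringsAsFactors = FALSE
)

# Each row should specify exactly one of `allowed` or `forbidden`.
one_side_only <- xor(
  is.na(sketch_constraints_categorical$allowed),
  is.na(sketch_constraints_categorical$forbidden)
)
all(one_side_only)
```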
10.2 constraints objects
The constraints objects contain information about how constraints are implemented within a synthesis. Like visit_sequence, standalone constraints objects require a schema object, and the following optional arguments dictate how the synthesis process defines and uses these constraints:
- constraints_df_num: A data frame of numeric constraints, or NULL.
- constraints_df_cat: A data frame of categorical constraints, or NULL.
- max_z_num: Either an integer number of resampling attempts before enforcing constraints, or a named list mapping variable names to integers.
- max_z_cat: Either an integer number of resampling attempts before enforcing constraints, or a named list mapping variable names to integers.
When synthesize() is called, each new synthetic data value that does not satisfy the constraints is resampled a maximum of max_z_num or max_z_cat times, depending on the new value's type. If the constraints are still not satisfied after max_z_num or max_z_cat new samples, then the constraints are hard-enforced.
- Numeric constraints are hard-enforced by truncating synthesized values to the minimum and/or maximum bounds.
- Categorical constraints are hard-enforced by selecting (uniformly, at random) one of the allowed levels.
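The resample-then-enforce behavior for a single numeric value can be sketched in base R as follows. This is a hedged illustration of the logic described above, not tidysynthesis's internal code: resample_then_truncate() and its arguments are hypothetical, and draw stands in for whatever model produces a new synthetic value.

```r
# Hypothetical helper, not part of tidysynthesis: resample a value up to
# `max_z` times, then truncate to the bounds if it still violates them.
resample_then_truncate <- function(draw, lower, upper, max_z) {
  # Bounds are treated as non-inclusive, following the text above.
  violates <- function(x) x <= lower || x >= upper

  value <- draw()   # initial synthetic value
  attempts <- 0
  while (violates(value) && attempts < max_z) {
    value <- draw() # one resampling attempt
    attempts <- attempts + 1
  }

  # Hard enforcement: truncate to the minimum and/or maximum bound.
  # A value that already satisfies the constraint passes through unchanged.
  min(max(value, lower), upper)
}

set.seed(1)
resample_then_truncate(
  draw = function() rnorm(1, mean = 1),
  lower = 0, upper = Inf, max_z = 2
)
```

Setting max_z = 0 skips resampling entirely, so a violating value is truncated immediately, which matches the "use no resamples" reading of a zero value.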
We can also specify maximum z values for different variables. We provide a complete example:
ex_constraints <- constraints(
  # default schema from example ACS data
  schema = schema(
    conf_data = acs_conf_nw,
    start_data = acs_start_nw
  ),
  # constraint data frames (defined above)
  constraints_df_num = ex_acs_constraints_numeric,
  constraints_df_cat = ex_acs_constraints_categorical,
  max_z_num = 2, # use at most two resamples for all variables
  max_z_cat = list(
    "marst" = 0,   # use no resamples for constraints on 'marst'
    "empstat" = 1  # use at most one resample for constraints on 'empstat'
  )
)
ex_constraints
Constraints specified per variable:
age: 1
famsize: 1
transit_time: 1
inctot: 1
constraints
We recommend using the Tidy API to create constraints from a roadmap instance instead of directly creating a constraints instance.
There are a few different ways to associate constraints with a roadmap. We recommend the API call update_constraints(), which tends to be the easiest way to avoid duplicate code:
# Option 1 (recommended): pass arguments directly to `update_constraints()`
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
acs_roadmap_w_constraints <- acs_roadmap %>%
  update_constraints(
    constraints_df_num = ex_acs_constraints_numeric,
    constraints_df_cat = ex_acs_constraints_categorical,
    max_z_num = 2,
    max_z_cat = list("marst" = 0, "empstat" = 1)
  )
Alternatively, you can manually construct a constraints object and associate it with a roadmap:
# Option 2: use the `roadmap` constructor
acs_roadmap_w_constraints <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw,
  constraints = ex_constraints
)
# Option 3: use `add_constraints()`
acs_roadmap_w_constraints <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
) %>%
  add_constraints(ex_constraints)