10  constraints

Many datasets have structural dependencies between variables. For example, some numeric variables must be non-negative, and some combinations of categorical values are not possible. tidysynthesis provides the constraints S3 object to support two kinds of constraints in the synthesis process: numeric constraints and categorical constraints.

10.1 Specifying constraint data frames

All constraints in tidysynthesis start with constraint data frames, which are used as inputs to the constraints S3 object. Below, we create some example constraint data frames and describe what each constraint does. We continue using the example ACS data in tidysynthesis.

dplyr::glimpse(acs_conf_nw)
Rows: 1,500
Columns: 11
$ county       <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq           <fct> Household, Household, Household, Household, Household, Ho…
$ sex          <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst        <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany      <fct> With health insurance coverage, With health insurance cov…
$ empstat      <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr     <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age          <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize      <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot       <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …

Numeric constraint data frames have four required columns:

  • var: The name of the synthesized variable to constrain. This variable must be conditionally synthesized (i.e., it cannot be part of the start_data).
  • min: The numeric lower bound (non-inclusive) for the synthesized variable; values must be strictly greater than min.
  • max: The numeric upper bound (non-inclusive) for the synthesized variable; values must be strictly less than max.
  • conditions: A string that tidy evaluates to a Boolean vector determining when the constraints apply.

For each row, every column must be completed. We can specify one-sided intervals using min = -Inf or max = Inf. Below are some examples using the ACS data:

ex_acs_constraints_numeric <- tibble::tribble(
  # required column names
  ~var, ~min, ~max, ~conditions, 
  # 'age' is always positive
  "age", 0, Inf, "TRUE",
  # 'inctot' is always < 12000 whenever age <= 18
  "inctot", 0, 12000, "age <= 18" 
)
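
To make the role of the conditions column concrete, the following minimal sketch evaluates a condition string against a small toy data frame and flags rows that fall outside the open interval. It uses base R only, and both the eval_condition() helper and the toy values are hypothetical illustrations, not part of tidysynthesis and not a description of how the package evaluates conditions internally.

```r
# Toy data standing in for partially synthesized records (hypothetical values).
toy <- data.frame(
  age = c(5, 17, 18, 19, 40),
  inctot = c(0, 3000, 15000, 20000, 50000)
)

# Hypothetical helper (not a tidysynthesis function): evaluate a condition
# string against the columns of a data frame and return one logical per row.
eval_condition <- function(condition, data) {
  result <- eval(parse(text = condition), envir = data, enclos = baseenv())
  # A scalar condition such as "TRUE" is recycled to one value per row.
  rep_len(as.logical(result), nrow(data))
}

applies_always <- eval_condition("TRUE", toy)
applies_minor <- eval_condition("age <= 18", toy)

# Rows where the 'inctot' constraint applies and the value lies outside
# the open interval (0, 12000).
violates <- applies_minor & !(toy$inctot > 0 & toy$inctot < 12000)
```

In this sketch, the "TRUE" condition applies to every row, while "age <= 18" applies only to the first three rows of the toy data.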

Categorical constraint data frames have four required columns:

  • var: The name of the synthesized variable to constrain. This variable must be conditionally synthesized (i.e., it cannot be part of the start_data).
  • allowed: Either NA or the string name of a level to include.
  • forbidden: Either NA or the string name of a level to exclude.
  • conditions: A string that tidy evaluates to a Boolean vector determining when the constraints apply.

For each row, var and conditions are required, and exactly one of allowed or forbidden must be specified (the other is NA). Below are some examples using the ACS data:

ex_acs_constraints_categorical <- tibble::tribble(
  # required column names
  ~var, ~allowed, ~forbidden, ~conditions, 
  # 'marst' is always 'Single' when age <= 18
  "marst", "Single", NA, "age <= 18",
  # 'empstat' is never 'Employed' when age <= 18
  "empstat", NA, "Employed", "age <= 18" 
)
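
As a rough illustration of how allowed and forbidden entries narrow the set of levels a record may take, the sketch below computes the permitted levels for a single record. The permitted_levels() helper is hypothetical, the level vectors are illustrative rather than the full ACS factor levels, and the rule for combining entries (keep the allowed levels if any are given, then drop forbidden ones) is an assumption made for this example rather than documented package behavior.

```r
# Hypothetical helper (not a tidysynthesis function): given every level of a
# variable plus the 'allowed' and 'forbidden' entries whose conditions hold
# for a record, return the levels that remain permitted.
permitted_levels <- function(all_levels, allowed = NA, forbidden = NA) {
  allowed <- allowed[!is.na(allowed)]
  forbidden <- forbidden[!is.na(forbidden)]
  # Assumption: if any 'allowed' levels are given, only those are kept.
  keep <- if (length(allowed) > 0) intersect(all_levels, allowed) else all_levels
  setdiff(keep, forbidden)
}

# Illustrative level sets; the real factors may contain more levels.
marst_levels <- c("Single", "Married", "Divorced")
empstat_levels <- c("Employed", "Unemployed")

# For a record with age <= 18, both example constraints apply.
marst_ok <- permitted_levels(marst_levels, allowed = "Single")
empstat_ok <- permitted_levels(empstat_levels, forbidden = "Employed")
```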

10.2 constraint objects

A constraints object contains information about how constraints are implemented within a synthesis. Like visit_sequence, a standalone constraints object requires a schema object. The following optional arguments dictate how the synthesis process defines and uses these constraints:

  • constraints_df_num: A data frame of numeric constraints, or NULL.
  • constraints_df_cat: A data frame of categorical constraints, or NULL.
  • max_z_num: Either an integer number of resampling attempts before enforcing numeric constraints, or a named list mapping variable names to integers.
  • max_z_cat: Either an integer number of resampling attempts before enforcing categorical constraints, or a named list mapping variable names to integers.

When synthesize() is called, each new synthetic value that does not satisfy its constraints is resampled at most max_z_num or max_z_cat times, depending on the value's type. If the constraints are still not satisfied after that many resamples, they are hard-enforced:

  • Numeric constraints are hard-enforced by truncating synthesized values to the minimum and/or maximum bounds.
  • Categorical constraints are hard-enforced by selecting (uniformly, at random) one of the allowed levels.
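
The resample-then-enforce behavior described above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the package's internal implementation: resample_then_truncate(), enforce_categorical(), and make_draws() are hypothetical helpers, and the draw function stands in for whatever model produces synthetic values.

```r
# Hypothetical sketch: draw a numeric value, resample up to max_z times while
# it lies outside the open interval (lower, upper), then truncate if needed.
resample_then_truncate <- function(draw_fn, lower, upper, max_z) {
  in_bounds <- function(x) x > lower && x < upper
  value <- draw_fn()
  resamples <- 0
  while (!in_bounds(value) && resamples < max_z) {
    value <- draw_fn()
    resamples <- resamples + 1
  }
  if (!in_bounds(value)) {
    # Hard enforcement: truncate to the nearest bound.
    value <- min(max(value, lower), upper)
  }
  list(value = value, resamples = resamples)
}

# Hypothetical sketch: keep a valid level, otherwise pick one allowed level
# uniformly at random.
enforce_categorical <- function(value, allowed_levels) {
  if (value %in% allowed_levels) {
    return(value)
  }
  allowed_levels[sample.int(length(allowed_levels), 1)]
}

# Deterministic stand-in for a model: returns a fixed sequence of draws.
make_draws <- function(values) {
  i <- 0
  function() {
    i <<- i + 1
    values[[i]]
  }
}

# The second resample satisfies (0, 12000), so no truncation is needed.
ok <- resample_then_truncate(make_draws(c(-50, 20000, 8000)), 0, 12000, max_z = 2)
# Every draw fails, so the final value is truncated to the upper bound.
forced <- resample_then_truncate(make_draws(c(15000, 30000, 20000)), 0, 12000, max_z = 2)
# With a single allowed level, enforcement always returns that level.
marst_value <- enforce_categorical("Married", allowed_levels = "Single")
```

Setting max_z = 0 in this sketch skips resampling entirely, so any invalid first draw is enforced immediately.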

We can also specify different maximum z values for different variables by passing a named list. We provide a complete example:

ex_constraints <- constraints(
  # default schema from example ACS data
  schema = schema(
    conf_data = acs_conf_nw,
    start_data = acs_start_nw
  ),
  # constraint dataframes (defined above)
  constraints_df_num = ex_acs_constraints_numeric,
  constraints_df_cat = ex_acs_constraints_categorical,
  max_z_num = 2, # use at most two resamples for all numeric variables
  max_z_cat = list(
    "marst" = 0,  # use no resamples for constraints on 'marst'
    "empstat" = 1 # use at most one resample for constraints on 'empstat'
  )
)

ex_constraints
Constraints specified per variable: 
age: 1
famsize: 1
transit_time: 1
inctot: 1

Recommended Method for Creating constraints

We recommend using the Tidy API to create constraints from a roadmap instance instead of directly creating a constraints instance.

To associate constraints with a roadmap, there are a few different options. We recommend using the API call update_constraints(), which tends to be the easiest for avoiding duplicate code:

# Option 1 (recommended): pass arguments directly to `update_constraints()`
acs_roadmap <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)

acs_roadmap_w_constraints <- acs_roadmap %>%
  update_constraints(
    constraints_df_num = ex_acs_constraints_numeric,
    constraints_df_cat = ex_acs_constraints_categorical,
    max_z_num = 2, 
    max_z_cat = list("marst" = 0, "empstat" = 1)
  )

Alternatively, you can manually construct a constraints object and associate it with a roadmap:

# Option 2: use the `roadmap` constructor
acs_roadmap_w_constraints <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw,
  constraints = ex_constraints
)

# Option 3: use `add_constraints()`
acs_roadmap_w_constraints <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
) %>%
  add_constraints(ex_constraints)