7 schema

The schema object handles data type information about each column in the confidential data. By default, the roadmap constructor infers this schema information directly from the confidential data. However, users can supply additional schema details and specify strategies for modifying data to conform to the schema.

7.1 schema Parameters and Working with col_schema

schema, like roadmap, requires conf_data and start_data. The following are optional arguments that you can input:

  • col_schema: A named list of column attributes (see more information on the structure below).
  • enforce: A Boolean that, if TRUE, will preprocess both conf_data and start_data to enforce the intended schema behavior on these datasets. Note that this enforcement happens when presynth is created.
  • coerce_to_factors: A Boolean that, if TRUE, coerces categorical data types (e.g., chr, fct, ord) to base R factors when enforce_schema is called.
  • coerce_to_doubles: A Boolean that, if TRUE, coerces numeric data types to base R doubles when enforce_schema is called.
  • na_factor_to_level: A Boolean that, if TRUE, converts NA factor values to the "NA" factor level.
  • na_numeric_to_ind: A Boolean that, if TRUE, creates indicator variables for missing values.

To start, data type information is contained in the col_schema attribute, a named list with one entry per variable. Each col_schema entry has four values:

  • dtype: the abbreviated data type (following the vctrs::vec_ptype_abbr naming convention).
  • levels: a string vector of factor levels (NULL for numeric variables).
  • na_value: the value corresponding to missing data (defaults to NA).
  • na_prop: the proportion of missing values in the confidential data.

Here is what was inferred from our ACS data about inctot and classwkr, numeric and factor variables, respectively:

library(tidyverse)
library(tidysynthesis)

# create a basic roadmap
acs_roadmap <- roadmap(
  conf_data = acs_conf,
  start_data = acs_start
)

# access the schema and the col_schema for `inctot`
acs_roadmap[["schema"]][["col_schema"]][["inctot"]]
$dtype
[1] "dbl"

$levels
NULL

$na_value
[1] NA

$na_prop
[1] 0.1706667

# access the schema and the col_schema for `classwkr`
acs_roadmap[["schema"]][["col_schema"]][["classwkr"]]
$dtype
[1] "fct"

$levels
[1] "N/A"             "Self-employed"   "Works for wages"

$na_value
[1] NA

$na_prop
[1] 0
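
To make the four fields concrete, the following sketch computes a col_schema-style entry by hand for a small toy data frame. It is an illustration under stated assumptions, not the package's internal code: toy_data and infer_col_entry() are hypothetical names introduced here, and the only external call is vctrs::vec_ptype_abbr(), whose naming convention the dtype field follows.

```r
library(vctrs)

# hypothetical toy data standing in for the confidential data
toy_data <- data.frame(
  income = c(100, NA, 250, 80),
  status = factor(c("a", "b", "a", "b"))
)

# hypothetical helper: build one col_schema-style entry for a single column
infer_col_entry <- function(x) {
  list(
    dtype = vec_ptype_abbr(x),                       # abbreviated type, e.g. "dbl" or "fct"
    levels = if (is.factor(x)) levels(x) else NULL,  # NULL for non-factor columns
    na_value = NA,                                   # assumed default missing-value marker
    na_prop = mean(is.na(x))                         # share of missing values
  )
}

# one entry per column, named by column
toy_schema <- lapply(toy_data, infer_col_entry)
toy_schema[["income"]]
```

In this toy example one of the four income values is missing, so the na_prop field for income is 0.25, just as na_prop for inctot above reports the share of missing incomes in the ACS data.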

7.2 schema enforcement

The enforce_schema() function applies schema changes to data within a roadmap whenever enforce = TRUE. Typically, you do not need to manually call this function, since it is called whenever a presynth is created. For this example, we will demonstrate its behavior on our acs_roadmap:

# enforce schema 
acs_roadmap_enforced <- enforce_schema(acs_roadmap)

# show enforced schema data
glimpse(acs_roadmap_enforced[["conf_data"]])
Rows: 1,500
Columns: 13
$ county       <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq           <fct> Household, Household, Household, Household, Household, Ho…
$ sex          <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst        <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany      <fct> With health insurance coverage, With health insurance cov…
$ empstat      <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr     <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age          <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize      <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot_NA    <fct> missing value, nonmissing value, missing value, missing v…
$ inctot       <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …
$ wgt          <dbl> 1.825014, 1.916103, 1.198468, 2.641223, 4.989714, 1.67354…

Two changes occurred after calling enforce_schema(). First, na_factor_to_level modified factor variables with missing values, such as empstat, so that missingness is treated as a new level:

# original `empstat` levels
levels(acs_roadmap[["conf_data"]][["empstat"]])
[1] "Employed"   "Unemployed"

# enforced `empstat` levels
levels(acs_roadmap_enforced[["conf_data"]][["empstat"]])
[1] "Employed"   "Unemployed" "NA"        

Second, na_numeric_to_ind created a new variable, inctot_NA, that indicates whether each value of inctot is missing. This indicator is used to model inctot as a mixture of two distributions: one for missing values and one for observed values.
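
Both transformations can be mimicked in base R on a toy data frame, which may help when reasoning about what the enforced data will look like. This is a minimal sketch and not the enforce_schema() implementation: toy_data, na_to_level(), and na_indicator() are hypothetical names, the indicator labels are copied from the glimpse() output above, and the order of the indicator's levels is an assumption.

```r
# hypothetical toy data with one missing factor value and one missing numeric value
toy_data <- data.frame(
  empstat = factor(c("Employed", NA, "Unemployed")),
  inctot = c(68000, NA, 26200)
)

# treat missingness in a factor as an explicit "NA" level
na_to_level <- function(x) {
  levels(x) <- c(levels(x), "NA")  # append the new level
  x[is.na(x)] <- "NA"              # recode missing entries to that level
  x
}

# build a missingness indicator for a numeric vector
# (level order is an assumption made for this sketch)
na_indicator <- function(x) {
  factor(
    ifelse(is.na(x), "missing value", "nonmissing value"),
    levels = c("missing value", "nonmissing value")
  )
}

toy_enforced <- toy_data
toy_enforced[["empstat"]] <- na_to_level(toy_data[["empstat"]])
toy_enforced[["inctot_NA"]] <- na_indicator(toy_data[["inctot"]])

levels(toy_enforced[["empstat"]])
table(toy_enforced[["inctot_NA"]])
```

The numeric column itself is left unchanged, so inctot still contains NA wherever inctot_NA reads "missing value", matching the enforced data shown earlier.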

7.3 Working with schema in the Tidy API

Individual elements of col_schema can be updated via the Tidy API. For example, we can update the factor levels for classwkr:

acs_roadmap_updated <- acs_roadmap %>%
  update_schema(
    col_schema = list(
      "classwkr" = list(
        "dtype" = "fct",
        "levels" = c("N/A", "Self-employed", "Works for wages", "New level")
      )
    )
  )
acs_roadmap_updated[["schema"]][["col_schema"]][["classwkr"]]
$dtype
[1] "fct"

$levels
[1] "N/A"             "Self-employed"   "Works for wages" "New level"      

$na_value
[1] NA

$na_prop
[1] 0
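
In the output above, na_value and na_prop keep their previous values even though the update supplied only dtype and levels. A similar merge of a partial update into an existing entry can be sketched in base R with utils::modifyList(). This is only an analogy for the observed result and is not claimed to be how update_schema() works internally; classwkr_entry, classwkr_update, and classwkr_merged are hypothetical names.

```r
# col_schema entry for `classwkr`, copied from the inferred schema shown earlier
classwkr_entry <- list(
  dtype = "fct",
  levels = c("N/A", "Self-employed", "Works for wages"),
  na_value = NA,
  na_prop = 0
)

# partial update that supplies only the field to change
classwkr_update <- list(
  levels = c("N/A", "Self-employed", "Works for wages", "New level")
)

# overwrite matching fields and keep every other field as it was
classwkr_merged <- utils::modifyList(classwkr_entry, classwkr_update)
classwkr_merged[["levels"]]
```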