$dtype
[1] "dbl"
$levels
NULL
$na_value
[1] NA
$na_prop
[1] 0.1706667
7 schema
The schema object handles data type information about each column in the confidential data. By default, the roadmap constructor infers this schema information directly from the confidential data. However, users can supply additional schema details and specify strategies for modifying data to conform to these schemas.
7.1 schema Parameters and Working with col_schema
schema, like roadmap, requires conf_data and start_data. The following are optional arguments that you can input:
col_schema: A named list of column attributes (see more information on the structure below).
enforce: A Boolean that, if TRUE, will preprocess both conf_data and start_data to enforce the intended schema behavior on these datasets. Note that this enforcement happens when presynth is created.
coerce_to_factors: A Boolean that, if TRUE, coerces categorical data types (ex: chr, fct, ord) to base R factors when enforce_schema is called.
coerce_to_doubles: A Boolean that, if TRUE, coerces numeric data types to base R doubles when enforce_schema is called.
na_factor_to_level: A Boolean that, if TRUE, converts NA factor values to the "NA" factor level.
na_numeric_to_ind: A Boolean that, if TRUE, creates indicator variables for missing values.
To start, data type information is contained in the col_schema attribute, a named list with one entry per variable. Each col_schema entry has four values:
dtype: the abbreviated data type (following the vctrs::vec_ptype_abbr naming convention).
levels: a string vector of factor levels (NULL for numeric variables).
na_value: the value corresponding to missing data (defaults to NA).
na_prop: the proportion of missing values in the confidential data.
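To make the four-value structure concrete, here is a minimal base R sketch that builds a col_schema-style entry for each column of a small toy data frame. The helper describe_column() and the toy data are hypothetical and are not part of the package; the dtype mapping is a simplified stand-in for the vctrs::vec_ptype_abbr naming convention, and this is not how the roadmap constructor actually infers the schema.

```r
# Hypothetical helper (not a package function): build one
# col_schema-style entry from a single column.
describe_column <- function(x) {
  # Simplified stand-in for the vctrs::vec_ptype_abbr naming convention
  dtype <- if (is.ordered(x)) {
    "ord"
  } else if (is.factor(x)) {
    "fct"
  } else if (is.character(x)) {
    "chr"
  } else if (is.integer(x)) {
    "int"
  } else if (is.double(x)) {
    "dbl"
  } else if (is.logical(x)) {
    "lgl"
  } else {
    class(x)[1]
  }

  list(
    dtype = dtype,
    levels = if (is.factor(x)) levels(x) else NULL, # NULL for non-factors
    na_value = NA,                                  # default missing value
    na_prop = mean(is.na(x))                        # share of missing values
  )
}

# Toy data for illustration only (not the ACS data used in this chapter)
toy <- data.frame(
  income = c(NA, 68000, NA, 91000),
  worker = factor(c("N/A", "Works for wages", "N/A", "Self-employed"))
)

# One entry per variable, stored in a named list
toy_schema <- lapply(toy, describe_column)
str(toy_schema)
```

Each element of toy_schema has the same four fields described above, so a numeric column carries levels = NULL while a factor column carries its level vector.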
Here is what was inferred from our ACS data about inctot and classwkr, numeric and factor variables, respectively:
7.2 schema enforcement
The enforce_schema() function applies schema changes to data within a roadmap whenever enforce = TRUE. Typically, you do not need to manually call this function, since it is called whenever a presynth is created. For this example, we will demonstrate its behavior on our acs_roadmap:
Rows: 1,500
Columns: 13
$ county <fct> Other, Other, Other, Other, Douglas, Lancaster, Other, Sa…
$ gq <fct> Household, Household, Household, Household, Household, Ho…
$ sex <fct> Female, Male, Male, Female, Male, Female, Male, Male, Mal…
$ marst <fct> Single, Married, Single, Single, Married, Divorced, Marri…
$ hcovany <fct> With health insurance coverage, With health insurance cov…
$ empstat <fct> NA, Employed, NA, NA, Employed, Employed, NA, NA, NA, Emp…
$ classwkr <fct> N/A, Works for wages, N/A, N/A, Self-employed, Works for …
$ age <dbl> 0, 41, 10, 12, 46, 36, 49, 5, 22, 31, 5, 55, 74, 50, 37, …
$ famsize <dbl> 5, 4, 3, 6, 5, 3, 5, 5, 4, 1, 4, 2, 2, 2, 4, 1, 1, 4, 5, …
$ transit_time <dbl> 0, 30, 0, 0, 15, 15, 0, 0, 0, 5, 0, 7, 0, 15, 10, 0, 0, 0…
$ inctot_NA <fct> missing value, nonmissing value, missing value, missing v…
$ inctot <dbl> NA, 68000, NA, NA, 91000, 26200, 6000, NA, 0, 37000, NA, …
$ wgt <dbl> 1.825014, 1.916103, 1.198468, 2.641223, 4.989714, 1.67354…
Two changes occurred after calling enforce_schema. First, na_factor_to_level modified factor variables with missing values, such as empstat, so that missingness is treated as a new level:
[1] "Employed" "Unemployed"
[1] "Employed" "Unemployed" "NA"
Second, the new variable inctot_NA was created to indicate whether each value of inctot is missing. This indicator will be used to model inctot as a mixture of two distributions: one for missing values and one for observed values.
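The two transformations described above can be imitated in base R on toy data. This is only a sketch of the behavior visible in the output, not the package's implementation: the helpers na_to_level() and na_indicator() are hypothetical names, and the level order of the indicator is an assumption.

```r
# Toy data for illustration only (not the ACS data)
toy <- data.frame(
  empstat = factor(c(NA, "Employed", "Unemployed", NA)),
  inctot = c(NA, 68000, 0, NA)
)

# Hypothetical helper mirroring na_factor_to_level:
# turn missing factor values into an explicit "NA" level.
na_to_level <- function(f) {
  out <- as.character(f)
  out[is.na(out)] <- "NA"
  factor(out, levels = c(levels(f), "NA"))
}

# Hypothetical helper mirroring na_numeric_to_ind:
# build a factor that flags missing numeric values.
# The level order here is an assumption.
na_indicator <- function(x) {
  factor(
    ifelse(is.na(x), "missing value", "nonmissing value"),
    levels = c("missing value", "nonmissing value")
  )
}

toy$empstat <- na_to_level(toy$empstat)
toy$inctot_NA <- na_indicator(toy$inctot)

levels(toy$empstat)
table(toy$inctot_NA)
```

After these steps the factor column has no missing values left, while the numeric column keeps its NA entries and gains a companion indicator column.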
7.3 Working with schema in the Tidy API
Individual elements of col_schema can be updated via the Tidy API. For example, we can update the factor levels for classwkr:
$dtype
[1] "fct"
$levels
[1] "N/A" "Self-employed" "Works for wages" "New level"
$na_value
[1] NA
$na_prop
[1] 0
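The effect of such a partial update can be imitated in base R with utils::modifyList(), which replaces only the named elements and leaves the rest of the entry untouched. This sketch operates on a hand-built list that copies the values shown above; it is not the Tidy API call itself, and the object names are hypothetical.

```r
# Hand-built col_schema-style list for illustration only
col_schema <- list(
  classwkr = list(
    dtype = "fct",
    levels = c("N/A", "Self-employed", "Works for wages"),
    na_value = NA,
    na_prop = 0
  )
)

# Only the element to change is specified
updates <- list(
  classwkr = list(
    levels = c("N/A", "Self-employed", "Works for wages", "New level")
  )
)

# modifyList() recurses into nested lists, so dtype, na_value,
# and na_prop are kept while levels is replaced.
col_schema <- utils::modifyList(col_schema, updates)
str(col_schema$classwkr)
```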