Custom Derived Variables
custom-derived-variables.Rmdcompile_acs_data() supports a large set of pre-built
derived variables (percentages, complements, etc.) from the package’s
registered tables. But you can also define your own derived variables
using the define_*() helpers and pass them directly to
compile_acs_data() via the tables parameter.
Your custom variables automatically get codebook entries and margins of
error, just like the built-in ones.
This vignette walks through the process of identifying variables, choosing denominators, and writing definitions, using household type data (ACS table B11001) as a running example.
The define_*() Helpers
| Helper | Use case | Key arguments |
|---|---|---|
define_percent() |
Single output: numerator / denominator |
output, numerator,
denominator (simple) or numerator_variables,
denominator_variables, regex variants (complex) |
define_across_percent() |
Percentages across a regex-matched set of columns |
input_regex, output_suffix,
denominator or denominator_function, optional
exclude_regex
|
define_across_sum() |
Summing paired columns (e.g., male + female) |
input_regex, addend_function,
output_naming_function
|
define_one_minus() |
Complement of an existing percentage (1 - x) |
output, source_variable
|
define_metadata() |
Non-computed variables (e.g., medians) |
output, definition_text
|
Definitions are passed to compile_acs_data() alongside
table names in a list(). They operate on the columns
produced by the tables you request, so make sure to include the table(s)
whose variables your definitions reference.
Step 1: Identify the Raw Variables
We’re going to create a variable describing the share of white men between the ages of 18 and 44. The needed raw variables are from ACS table B01001A, as described in the codebook
codebook = get_acs_codebook()
codebook %>%
filter(table == "B01001A") %>%
pull(variable_clean) %>%
head()
#> [1] "sex_by_age_white_alone_universe"
#> [2] "sex_by_age_white_alone_male"
#> [3] "sex_by_age_white_alone_male_under_5_years"
#> [4] "sex_by_age_white_alone_male_5_9_years"
#> [5] "sex_by_age_white_alone_male_10_14_years"
#> [6] "sex_by_age_white_alone_male_15_17_years"Step 2: Choose Denominators
Selecting an appropriate denominator for percentage variables is
critical, and at times, complex. The basic approach is to simply divide
every variable by the table universe, which should contain
universe in the variable name. But in other cases,
alternate variables may make for more insightful denominators. For
simplicity, we just use the table universe here.
Step 3: Write and Use Definitions
We express the new variable(s) using define_*() helpers
and pass the definitions directly to
compile_acs_data().
Our target variable–the share of white men aged 18-44–requires
summing several age-group columns (18-19, 20, 21, 22-24, 25-29, 30-34,
35-39, 40-44) and dividing by the table universe.
define_percent() with numerator_regex handles
this: it matches all columns fitting a pattern and sums them into the
numerator.
The auto-processed column names from B01001A follow the pattern
sex_by_age_white_alone_male_{age_range}_years, so we can
target the 18-44 age groups with a regex that matches the leading digits
of each range:
df = compile_acs_data(
tables = list(
"B01001A",
define_percent(
"white_male_18_44_percent",
numerator_regex = "sex_by_age_white_alone_male_(1[89]|2[0-9]|3[0-9]|4[0-4])",
numerator_exclude_regex = "percent",
denominator_variables = c("sex_by_age_white_alone_universe"))),
years = 2024,
geography = "county",
states = "NJ")
df %>%
select(GEOID, white_male_18_44_percent) %>%
head()
#> # A tibble: 6 × 2
#> GEOID white_male_18_44_percent
#> <chr> <dbl>
#> 1 34001 0.149
#> 2 34003 0.150
#> 3 34005 0.162
#> 4 34007 0.164
#> 5 34009 0.124
#> 6 34011 0.160The definition is executed after the raw ACS data are fetched and renamed, and the results appear in the codebook with automatically calculated margins of error:
attr(df, "codebook") %>%
filter(calculated_variable == "white_male_18_44_percent") %>%
pull(definition)
#> [1] "Numerator = sex_by_age_white_alone_male_18_19_years (B01001A_007), sex_by_age_white_alone_male_20_24_years (B01001A_008), sex_by_age_white_alone_male_25_29_years (B01001A_009), sex_by_age_white_alone_male_30_34_years (B01001A_010), sex_by_age_white_alone_male_35_44_years (B01001A_011). Denominator = sex_by_age_white_alone_universe (B01001A_001)."More Examples
Simple percentage
One numerator, one denominator:
df_snap = compile_acs_data(
tables = list(
"snap",
define_percent(
"snap_received_percent_manual",
numerator = "snap_received",
denominator = "snap_universe")),
years = 2024,
geography = "county",
states = "NJ")
## verify it matches the built-in version
df_snap %>%
transmute(
builtin = snap_received_percent,
manual = snap_received_percent_manual,
match = abs(builtin - manual) < 1e-10) %>%
filter(!match)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: builtin <dbl>, manual <dbl>, match <lgl>Across-percent with a complement
Percentages for all race categories, plus a person-of-color
complement. This is how the built-in race table works
internally.
compile_acs_data(
tables = list(
"race",
define_across_percent(
input_regex = "^race_nonhispanic|^race_hispanic",
output_suffix = "_percent",
denominator = "race_universe"),
## race_personofcolor_percent is the share of all individuals who are not
## non-Hispanic, White alone, i.e., the complement
define_one_minus(
"race_personofcolor_percent",
source_variable = "race_nonhispanic_white_alone_percent")),
years = 2024,
geography = "county",
states = "DC")Across-sum followed by across-percent
Sum male + female counts into combined age variables, then calculate
percentages. This is how the built-in sex_by_age table
works internally.
This one is a bit tricky–the source table includes variables for each
age group, split by sex. To get age groups, we have to add the two
sex-specific estimates for the given age group. This requires us to
specify an input_regex, which selects, in this case, all
female-specific age variables. The addend_function then
programmatically identifies the same-named, male-specific variables. The
output_naming_function simplifies the resulting combined
variable, removing the sex category and other extraneous words.
compile_acs_data(
tables = list(
"sex_by_age",
define_across_sum(
input_regex = "sex_by_age_female_.*years($|_over$)",
addend_function = function(column) {
column %>% stringr::str_replace("female", "male")},
output_naming_function = function(column) {
column %>% stringr::str_replace("sex_by_age_female_", "age_")
}),
define_across_percent(
input_regex = "^age.*years($|_over$)",
output_suffix = "_percent",
denominator = "sex_by_age_universe")),
years = 2024,
geography = "county",
states = "DC")Complex percentage with subtraction
For cases where the numerator or denominator requires summing or
subtracting multiple variables, use the
numerator_variables, denominator_variables,
and optional *_subtract_* arguments:
df_complex = compile_acs_data(
tables = list(
"snap",
define_percent(
"snap_not_received_pct_complex",
numerator_variables = c("snap_universe"),
numerator_subtract_variables = c("snap_received"),
denominator_variables = c("snap_universe"))),
years = 2024,
geography = "county",
states = "DC")
df_complex %>%
mutate(sums_to_one = snap_received_percent + snap_not_received_pct_complex) %>%
glimpse()
#> Rows: 1
#> Columns: 14
#> $ data_source_year <dbl> 2024
#> $ GEOID <chr> "11001"
#> $ NAME <chr> "District of Columbia, District of Col…
#> $ total_population_universe <dbl> 681294
#> $ snap_universe <dbl> 324491
#> $ snap_received <dbl> 46408
#> $ snap_received_percent <dbl> 0.143
#> $ snap_not_received_pct_complex <dbl> 0.857
#> $ total_population_universe_M <dbl> 0
#> $ snap_universe_M <dbl> 1816
#> $ snap_received_M <dbl> 2088
#> $ snap_received_percent_M <dbl> 0.0064
#> $ snap_not_received_pct_complex_M <dbl> 0.0071
#> $ sums_to_one <dbl> 1Verify Results
A few strategies for quality-checking custom derived variables:
Compare to a published benchmark. Percentages reported by the Census Bureau in the Subject Tables (tables prefixed with
S) can serve as reference values for derived variables computed from the detailed tables (prefixed withBorC).Manually compute a benchmark. Identify the relevant numerator and denominator variables and manually calculate the derived variable, then compare. This is especially useful for complex definitions where the numerator is itself a sum.
Plot histograms. Use
pivot_longer()andfacet_wrap()to check for unexpected spikes or outlier values across a series of related variables.Check for missingness. Derived variables should generally have low or no missingness; substantial missingness may indicate a calculation error.
Inspect the codebook. Verify that
attr(df, "codebook")accurately documents your custom variable’s definition. The definition string drives the margin of error calculation, so errors there will propagate to MOEs.