urbnindicators
library(urbnindicators)
library(tidycensus)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(urbnthemes) # install from https://github.com/UrbanInstitute/urbnthemes
This vignette is organized into three parts:
- We illustrate a typical workflow for obtaining American Community Survey (ACS) data using library(tidycensus).
- We show how library(urbnindicators) can produce similar outputs, but with much less code and effort.
- We highlight how library(urbnindicators) provides metadata and improves the quality and reliability of the analysis process.
A Typical Workflow
tidycensus provides a suite of functions for working with ACS data in R. While it’s versatile and comprehensive, it can require significant knowledge and time to support a robust analysis, and many users fall into common pitfalls without realizing they’ve made an error.
Identify variables to query
We load the built-in codebook and search for our construct of interest (disability). This leaves us 500 variables to choose from.
acs_codebook = load_variables(dataset = "acs5", year = 2022)
acs_codebook %>%
  filter(str_detect(concept, "Disability")) %>%
  select(name, label, concept) %>%
  ## only printing three, for brevity
  head(3) %>%
  reactable::reactable()
Let’s imagine we’re interested in calculating the share of individuals with a disability; this requires only two variables (conceptually): the number of people with a disability and the number of all people.
So which variable(s) do we select? There’s no clear answer. Every variable relating to disability reflects disability and at least one other characteristic (e.g., “sex by age by disability status”). To calculate the percent of all individuals with a disability, we want the most robust available variables, i.e., those that reflect all individuals who reported their disability status. Some disability variables have smaller counts because the characteristic crossed with disability status (e.g., health insurance coverage status) is available only for a subset of the individuals for whom disability status is available.
Let’s imagine we select the table of variables prefixed “B18101”, for
“Sex by Age by Disability”. We think that most respondents who respond
about their disability status will also have responded about their sex
and age. We then pass this to library(tidycensus) as:
df_disability = get_acs(
  geography = "county",
  state = "NJ",
  year = 2022,
  output = "wide",
  survey = "acs5",
  table = "B18101")
This returns 21 observations–one for each county in NJ–along with
an intimidating 80 columns with unintelligible names along the lines of
B18101_039E.
Calculating our measure of interest
Now we would need to figure out how to aggregate the needed variables for both the denominator and numerator in order to calculate a valid “% Disabled” measure, a task that is feasible but time-intensive and error-prone.
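As a rough sketch of what that aggregation involves, the function below sums the twelve "with a disability" cells of table B18101 and divides by the table universe. The variable codes are taken from the disability_percent definition shown later in this vignette; the E suffix is how tidycensus names estimate columns in wide output. The function name add_disability_share() is hypothetical, and the sketch ignores margins of error entirely.

```r
library(dplyr)

# Codes for the "with a disability" cells of B18101 (male and female, six
# age groups each), as listed in the disability_percent definition.
disability_codes = c(
  "004", "007", "010", "013", "016", "019",
  "023", "026", "029", "032", "035", "038")
numerator_columns = paste0("B18101_", disability_codes, "E")

# Hypothetical helper: adds a count and a share of people with a disability
# to a wide-format dataframe such as the one returned by get_acs() above.
add_disability_share = function(df) {
  df %>%
    mutate(
      # sum the twelve numerator cells within each row (geography)
      disability_count = rowSums(across(all_of(numerator_columns))),
      # divide by the table universe (B18101_001)
      disability_percent = disability_count / B18101_001E)
}

# Usage (assumes df_disability from the get_acs() call above):
# df_disability %>% add_disability_share() %>% select(NAME, disability_percent)
```

Even this short version depends on picking the right twelve of the 39 variables in the table, which is exactly where silent errors tend to creep in.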
For an analysis that leverages more than a single measure, and especially when measures are required from distinct tables, this workflow is burdensome and creates significant surface area for undetected errors.
At the same time, many analysts are unsure how to combine the margins
of error returned by tidycensus::get_acs() into a pooled error for the
new percent-disabled variable, and simply drop this critical
information from their analysis. (See Quantifying Survey Error to
learn more about how library(urbnindicators) helps simplify
this task.)
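To give a sense of what pooling margins of error involves, here is a minimal base-R sketch of the standard Census Bureau approximation formulas for a sum and for a proportion. The function names and the example numbers are hypothetical, and this is not necessarily how urbnindicators computes its margins of error; tidycensus also provides moe_sum() and moe_prop() helpers for the same approximations.

```r
# Margin of error for a sum of estimates: root sum of squares of the
# component margins of error.
moe_of_sum = function(moes) {
  sqrt(sum(moes^2))
}

# Margin of error for a proportion p = numerator / denominator, where the
# numerator is a subset of the denominator.
moe_of_proportion = function(numerator, denominator, moe_numerator, moe_denominator) {
  p = numerator / denominator
  radicand = moe_numerator^2 - p^2 * moe_denominator^2
  # When the value under the square root is negative, fall back to the
  # formula for a ratio, which adds the two terms instead of subtracting.
  radicand = ifelse(
    radicand < 0,
    moe_numerator^2 + p^2 * moe_denominator^2,
    radicand)
  sqrt(radicand) / denominator
}

# Usage with made-up numbers: pool two component margins of error, then
# carry the pooled value into the margin of error for a share.
pooled_numerator_moe = moe_of_sum(c(30, 40))  # 50
moe_of_proportion(
  numerator = 500,
  denominator = 4000,
  moe_numerator = pooled_numerator_moe,
  moe_denominator = 120)
```

For the disability share, the numerator alone would require pooling twelve margins of error before the proportion formula could be applied.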
Using urbnindicators
library(urbnindicators) abstracts the workflow above
behind the scenes. Instead of a call to
tidycensus::get_acs(), a call to
urbnindicators::compile_acs_data() returns a dataset of
both raw ACS measures and derived estimates (such as the share of all
individuals who are disabled).
Acquire data
It’s as simple as a single call to compile_acs_data(), shown below. You can provide a vector of years and/or states if you want data over different time periods or geographies.
Note that requesting more tables or more geographic units–whether by
choosing a geography option comprising more units, more states, or
more years–can significantly increase the query time.
Use list_tables() to see some of the most commonly-used
tables:
list_tables() |> head(10)
#> [1] "age" "computing_devices" "cost_burden"
#> [4] "disability" "educational_attainment" "employment"
#> [7] "gini" "health_insurance" "household_size"
#> [10] "income_quintiles"
Or use get_acs_codebook() to see every table supported
by the Census Bureau API:
get_acs_codebook() |>
filter(str_detect(variable_clean, "snap")) |>
head(10)
#> # A tibble: 10 × 3
#> table variable_raw variable_clean
#> <chr> <chr> <chr>
#> 1 B09010 B09010_001 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 2 B09010 B09010_002 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 3 B09010 B09010_003 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 4 B09010 B09010_004 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 5 B09010 B09010_005 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 6 B09010 B09010_006 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 7 B09010 B09010_007 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 8 B09010 B09010_008 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 9 B09010 B09010_009 receipt_supplemental_security_income_ssi_cash_public_ass…
#> 10 B09010 B09010_010 receipt_supplemental_security_income_ssi_cash_public_ass…
Here we request just two tables–disability and
transportation_to_work.
df_urbnindicators = compile_acs_data(
  years = 2024,
  tables = c("disability", "transportation_to_work"),
  geography = "county",
  states = "NJ",
  spatial = TRUE)
You can also pass vectors of years and/or states to pull data across multiple time periods or geographies in a single call:
df_multi = compile_acs_data(
  years = c(2019, 2024),
  tables = "disability",
  geography = "county",
  states = c("NJ", "NY"))
df_multi %>%
  count(data_source_year)
#> # A tibble: 2 × 2
#> data_source_year n
#> <dbl> <int>
#> 1 2019 83
#> 2 2024 83
Alternatively, you can pass the name of a variable or table from
get_acs_codebook() to compile_acs_data(). The
equivalent of the first call above would be:
df_urbnindicators = compile_acs_data(
  years = 2024,
  tables = c("sex_by_age_by_disability_status_universe", "B08301"),
  geography = "county",
  states = "NJ",
  spatial = TRUE)
Analyze or visualize data
And now we’re ready to analyze or plot our data. A simple example:
df_urbnindicators %>%
  ggplot() +
  geom_sf(aes(fill = disability_percent)) +
  theme_urbn_map() +
  scale_fill_continuous(labels = scales::percent, transform = "reverse") +
  labs(
    title = "Disability Rates Appear Higher in Southern NJ",
    subtitle = "Disability rates by county, NJ, 2020-2024 ACS",
    fill = "Population with an ACS-defined disability (%)" %>% str_wrap(20))
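Beyond mapping, a multi-year dataframe such as df_multi lends itself to comparisons over time. The sketch below computes the change in disability_percent between 2019 and 2024 for each geography. It runs on a toy stand-in so that it works without an API call: the data_source_year and disability_percent column names come from the output above, while the GEOID column and all of the values are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in for df_multi; GEOID and the values are made up for illustration.
toy_multi = tibble(
  GEOID = c("34001", "34001", "34003", "34003"),
  data_source_year = c(2019, 2024, 2019, 2024),
  disability_percent = c(0.14, 0.15, 0.09, 0.08))

# One row per geography, with the change in the disability share between
# the two years (expressed as a difference in proportions).
disability_change = toy_multi %>%
  group_by(GEOID) %>%
  summarize(
    change_2019_2024 =
      disability_percent[data_source_year == 2024] -
      disability_percent[data_source_year == 2019],
    .groups = "drop")

disability_change
```

Any real comparison should also account for the margins of error on both estimates before reading a difference as a genuine change.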
Document data
There’s a lot happening behind the scenes, so it’s important to
understand what each variable represents and how it was calculated.
library(urbnindicators) includes a codebook as an attribute
of the dataframe returned from compile_acs_data(). View and
navigate through the full codebook here.
Access the codebook with attr():
codebook = attr(df_urbnindicators, "codebook")
codebook %>%
head(5)
#> # A tibble: 5 × 9
#> calculated_variable variable_type definition numerator_vars
#> <chr> <chr> <chr> <list>
#> 1 total_population_universe Count This is a… <chr [0]>
#> 2 means_transportation_work_universe Count This is a… <chr [0]>
#> 3 means_transportation_work_car_truck_v… Count This is a… <chr [0]>
#> 4 means_transportation_work_car_truck_v… Count This is a… <chr [0]>
#> 5 means_transportation_work_car_truck_v… Count This is a… <chr [0]>
#> # ℹ 5 more variables: numerator_subtract_vars <list>, denominator_vars <list>,
#> # denominator_subtract_vars <list>, se_calculation_type <chr>,
#> # aggregation_strategy <chr>
The codebook’s first three columns are:
- calculated_variable – the variable name as it appears in the dataframe.
- variable_type – whether the variable is a count (raw ACS estimate), a percent (derived ratio), or metadata (e.g., a median or geographic identifier).
- definition – a formula showing how the variable was calculated. For raw ACS variables, this is the original Census Bureau variable code (e.g., B22003_002). For derived variables, this is an expression like snap_received / snap_universe.
These definition strings are also used internally to calculate margins of error for derived variables (see Quantifying Survey Error), so their accuracy is critical.
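The printed codebook also shows list-columns such as numerator_vars and denominator_vars. As a sketch of how such list-columns can be unpacked, the example below uses a toy stand-in codebook: the column names mirror the printed output, but the rows, the snap_received_percent variable name, and the assumption that each list element holds a character vector of input variable names are all illustrative. The helper get_input_variables() is hypothetical.

```r
library(dplyr)

# Toy stand-in for the codebook; contents are illustrative assumptions.
toy_codebook = tibble(
  calculated_variable = c("snap_universe", "snap_received_percent"),
  variable_type = c("Count", "Percent"),
  numerator_vars = list(character(0), "snap_received"),
  denominator_vars = list(character(0), "snap_universe"))

# Hypothetical helper: return the input variables recorded for one variable.
get_input_variables = function(codebook, variable_name) {
  codebook_row = codebook %>%
    filter(calculated_variable == variable_name)
  list(
    # unlist() flattens the one-element list-column into a character vector
    numerator = unlist(codebook_row$numerator_vars),
    denominator = unlist(codebook_row$denominator_vars))
}

get_input_variables(toy_codebook, "snap_received_percent")
```

With a real codebook, the same pattern would apply to attr(df_urbnindicators, "codebook"), provided the list-columns are structured as assumed here.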
Some definitions are quite complex. For example,
disability_percent is the sum of all of the sex-by-age
groupings for people with disabilities (numerator) divided by the table
universe:
codebook %>%
filter(calculated_variable == "disability_percent") %>%
pull(definition)
#> [1] "Numerator = sex_by_age_by_disability_status_male_under_5_years_with_a_disability (B18101_004), sex_by_age_by_disability_status_male_5_17_years_with_a_disability (B18101_007), sex_by_age_by_disability_status_male_18_34_years_with_a_disability (B18101_010), sex_by_age_by_disability_status_male_35_64_years_with_a_disability (B18101_013), sex_by_age_by_disability_status_male_65_74_years_with_a_disability (B18101_016), sex_by_age_by_disability_status_male_75_years_over_with_a_disability (B18101_019), sex_by_age_by_disability_status_female_under_5_years_with_a_disability (B18101_023), sex_by_age_by_disability_status_female_5_17_years_with_a_disability (B18101_026), sex_by_age_by_disability_status_female_18_34_years_with_a_disability (B18101_029), sex_by_age_by_disability_status_female_35_64_years_with_a_disability (B18101_032), sex_by_age_by_disability_status_female_65_74_years_with_a_disability (B18101_035), sex_by_age_by_disability_status_female_75_years_over_with_a_disability (B18101_038). Denominator = sex_by_age_by_disability_status_universe (B18101_001)."
Create your own derived variables
For tables from list_tables(), raw ACS variables and
derived variables are returned automatically. For other tables,
urbnindicators provides no pre-computed derived variables, and even
for tables listed in list_tables() you may want alternate or
additional derived variables.
urbnindicators provides a suite of helper functions
(define_*()) that let you specify how to create these derived
variables; they abstract away the actual calculations and ensure that
you get an updated codebook and correctly-pooled margins of error for
each newly-derived variable. See Custom Derived
Variables for more.
Interpolate data to custom geographies
ACS data are available for many statistical and political
geographies, but many analyses focus on other geographies like
neighborhoods or planning districts. interpolate_acs()
translates data from ACS-supported geographies to any user-defined
geography, properly re-deriving percentages and propagating margins of
error. See Translating ACS Data to
Custom Geographies for a worked example.