urbnindicators
This vignette is organized into three parts:

- We illustrate a typical workflow for obtaining American Community Survey (ACS) data using library(tidycensus).
- We show how library(urbnindicators) can produce similar outputs, but with much less code and effort.
- We highlight how library(urbnindicators) provides metadata and improves the quality and reliability of the analysis process.
A Typical Workflow
library(tidycensus) provides a suite of functions for working with ACS data in R. In fact, library(urbnindicators) is built on top of library(tidycensus), and we highly encourage those who are unfamiliar with library(tidycensus) to explore it before using library(urbnindicators).
While library(tidycensus) is versatile and allows users to access many more datasets (and variables within those datasets) than does library(urbnindicators), it can require a significant amount of knowledge and time to support a robust analysis, leading many users to fall into common pitfalls without realizing they've made an error.
Identify variables to query
We load the built-in codebook and search for our construct of interest (disability). This leaves us with 500 variables to choose from.
# Load the ACS 5-year codebook for 2022
acs_codebook = tidycensus::load_variables(dataset = "acs5", year = 2022)

# Preview variables whose concept mentions disability
acs_codebook %>%
  dplyr::filter(stringr::str_detect(concept, "Disability")) %>%
  head() %>%
  reactable::reactable()
Let’s imagine we’re interested in calculating the share of individuals with a disability; this requires only two variables (conceptually): the number of people with a disability and the number of all people.
So which variable(s) do we select? There's no single clear answer. Every disability-related variable reflects disability crossed with at least one other characteristic (e.g., "sex by age by disability status"). If we want to calculate the percent of all individuals with a disability, we want to use the most robust variables available, i.e., those that reflect all individuals who reported their disability status. Some disability variables have smaller counts because the other characteristics combined with disability status (e.g., health insurance coverage status) may be available only for a subset of the individuals for whom disability status is available.
Putting these challenges aside, let's imagine we select the table of variables prefixed "B18101", for "Sex by Age by Disability". We think that most respondents who respond about their disability status will also have responded about their sex and age. We then pass this table to library(tidycensus) as:
# Query the full B18101 table for all NJ counties, in wide format
df_disability = tidycensus::get_acs(
  geography = "county",
  state = "NJ",
  year = 2022,
  output = "wide",
  survey = "acs5",
  table = "B18101")
df_disability %>% dim()
#> [1] 21 80
df_disability %>% colnames() %>% head(5)
#> [1] "GEOID" "NAME" "B18101_001E" "B18101_001M" "B18101_002E"
This returns 21 observations (one for each county in NJ) along with an intimidating 80 columns with meaningless names along the lines of B18101_039E.
Calculate our measure of interest
Now we would need to figure out how to aggregate the needed variables for both the numerator and the denominator in order to calculate a valid "% Disabled" measure, a task that is feasible but time-intensive and error-prone, as the sketch below suggests.
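For illustration, the manual aggregation looks roughly like this. The numerator codes below are placeholders rather than verified selections from the B18101 table; you would need to confirm against the codebook which variables represent persons with a disability.
# Illustrative only: verify the numerator codes against the codebook
numerator_codes = c("B18101_004E", "B18101_007E") # ...and so on

df_disability %>%
  dplyr::mutate(
    disability_count = rowSums(dplyr::across(dplyr::all_of(numerator_codes))),
    disability_percent_manual = disability_count / B18101_001E) %>%
  dplyr::select(GEOID, NAME, disability_percent_manual)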
For an analysis that leverages more than a single measure, and especially when measures are required from distinct tables, this workflow is burdensome and creates significant surface area for undetected errors. Acquiring data for multiple years, or for multiple types of geographies, also requires repeated calls to tidycensus::get_acs().
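For example, pulling the same table for several years "by hand" typically ends up looking something like the following sketch (the years chosen are arbitrary):
# Illustrative pattern: one get_acs() call per year, then stack the results
years = c(2017, 2022)

df_disability_multi = purrr::map_dfr(
  years,
  function(yr) {
    tidycensus::get_acs(
      geography = "county",
      state = "NJ",
      year = yr,
      output = "wide",
      survey = "acs5",
      table = "B18101") %>%
      dplyr::mutate(year = yr)
  })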
At the same time, many analysts will be overwhelmed by, and unsure how to combine, the margins of error returned by tidycensus::get_acs(), opting simply to drop this critical information from their analysis as they go about calculating "% Disabled".
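tidycensus does provide helpers for this arithmetic (tidycensus::moe_sum() and tidycensus::moe_prop()), but applying them correctly across many component variables is exactly the kind of bookkeeping that often gets skipped. A minimal sketch, reusing the placeholder numerator codes from above:
# Sketch only: propagate MOEs through the derived share (placeholder codes)
df_disability %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    disability_count = B18101_004E + B18101_007E,
    disability_count_moe = tidycensus::moe_sum(
      moe = c(B18101_004M, B18101_007M),
      estimate = c(B18101_004E, B18101_007E)),
    disability_percent = disability_count / B18101_001E,
    disability_percent_moe = tidycensus::moe_prop(
      num = disability_count,
      denom = B18101_001E,
      moe_num = disability_count_moe,
      moe_denom = B18101_001M)) %>%
  dplyr::ungroup()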
Using urbnindicators
library(urbnindicators) abstracts the workflow above behind the scenes. Instead of a call to tidycensus::get_acs(), a call to urbnindicators::compile_acs_data() returns a dataset of both raw ACS measures and derived estimates (such as the share of all individuals who are disabled). And that dataset includes a range of measures (spanning health insurance, employment, housing costs, race and ethnicity, and more), not just one variable or table from the ACS.
Acquire data
It’s as simple as the call below. Note that you can provide a vector of years and/or states if you want data over different time periods or geographies.
Note that selecting more geographic units (by choosing a geography option comprising more units, by selecting more states, or by selecting more years) can significantly increase the query time. A tract-level query of the entire US can take 30+ minutes.
# Compile county-level ACS measures for New Jersey, with geometries attached
df_urbnindicators = urbnindicators::compile_acs_data(
  years = 2022,
  geography = "county",
  states = "NJ",
  spatial = TRUE)
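The same call extends to multiple years and states at once; for instance (illustrative years and states):
# Multiple years and states in a single call
df_multi = urbnindicators::compile_acs_data(
  years = c(2017, 2022),
  geography = "county",
  states = c("NJ", "NY"),
  spatial = FALSE)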
Analyze or visualize data
And now we’re ready to analyze or plot our data. Simplistically:
# Map the disability share by county
df_urbnindicators %>%
  select(matches("disability.*percent$"), geometry) %>%
  ggplot() +
  geom_sf(aes(fill = disability_percent)) +
  theme_urbn_map() +
  scale_fill_continuous(labels = scales::percent, trans = "reverse") +
  labs(
    title = "Disability rates appear higher in southern NJ",
    subtitle = "Disability rates by county, NJ, 2018-2022 ACS",
    fill = "Population with an ACS-defined disability (%)" %>% str_wrap(20))
Document data
There's a lot happening behind the scenes, so it's important to understand what each variable represents and how it was calculated. library(urbnindicators) includes a codebook as an attribute of the dataframe returned from urbnindicators::compile_acs_data(). View and navigate through the full codebook here.
The codebook specifies the variable type and provides a definition of how the variable was calculated. Most (though not all) variables that are directly available from the ACS are count variables, such as the number of people receiving public assistance. Many of the variables calculated by library(urbnindicators) are percent variables, created by dividing one count variable by another.
# Inspect the codebook entries related to public assistance
df_urbnindicators %>%
  attr("codebook") %>%
  filter(str_detect(calculated_variable, "public_assistance")) %>%
  reactable::reactable()
The most common convention is that a percent variable is calculated by dividing a count variable (e.g., public_assistance_received) by the universe variable for the corresponding table (e.g., public_assistance_universe).
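In arithmetic terms, and assuming the raw count columns carry the names used in the example above (the _check column name below is hypothetical, not a column produced by the package):
# Sketch of the convention only; see the codebook for the actual derived name
df_urbnindicators %>%
  dplyr::mutate(
    public_assistance_percent_check =
      public_assistance_received / public_assistance_universe)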
But other derived variables are more complex, such as those for commute
mode. Here, we’ve aggregated three counts representing different types
of individual motor vehicle transportation to create the numerator,
while the denominator is the table universe minus individuals who work
from home.
# Inspect the codebook entries for the motor vehicle commute-mode measure
df_urbnindicators %>%
  attr("codebook") %>%
  filter(str_detect(calculated_variable, "means.*motor_vehicle")) %>%
  reactable::reactable()
This allows us to say something along the lines of: "Of individuals who commute to work, XX% use a motor vehicle as their primary commute mode."
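The arithmetic behind that statement looks roughly like the following. All values and names here are illustrative placeholders, not the package's exact inputs or variable names; consult the codebook for those.
# Illustrative placeholders for the commute-mode calculation described above
drove_alone      = 50000  # numerator: individual motor vehicle modes
carpooled        = 8000
motorcycle       = 500
commute_universe = 90000  # all workers in the table universe
worked_from_home = 10000  # excluded from the denominator

motor_vehicle_commute_share =
  (drove_alone + carpooled + motorcycle) / (commute_universe - worked_from_home)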
So What’s the Value Add?
Hopefully the process above has illustrated some of the advantages, which fall into two buckets: speed and accuracy.
- Speed:
  - Sensible decisions about variable and table selection eliminate the need to hunt for the right variables.
  - Raw ACS variables are included alongside pre-calculated variables, such as percentages, that are not directly available from the ACS.
  - Query multiple years and/or multiple states in a single function call; there is no need to loop over your desired years or states.
- Accuracy:
  - Alphanumeric variable codes (e.g., B18101_001) are replaced with meaningful variable names (e.g., disability_percent), eliminating a common source of confusion and error.
  - Data are returned with a codebook that documents how variables were created and what they represent.
  - Behind-the-scenes quality checks help to ensure (though do not guarantee) that calculations are correct; users should still verify results themselves.
  - Both straight-from-the-ACS and derived variables include margins of error and coefficients of variation. These enable users to conduct statistical significance testing and can inform other data quality decisions, such as whether to suppress certain observations or whether to aggregate data (across geographies, across variables, or both) to improve the accuracy of estimates. A brief sketch of such a test follows below.
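For instance, a common rule of thumb for testing whether two ACS estimates differ treats each published MOE as corresponding to a 90 percent confidence level (SE = MOE / 1.645) and compares the difference against the combined standard errors. The values below are illustrative:
# Sketch of a two-estimate significance test using published MOEs
estimate_a = 0.12; moe_a = 0.020
estimate_b = 0.09; moe_b = 0.015

se_a = moe_a / 1.645
se_b = moe_b / 1.645

z = abs(estimate_a - estimate_b) / sqrt(se_a^2 + se_b^2)
significant_at_90 = z > 1.645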