Skip to contents

This vignette is organized into three parts:

  1. We illustrate a typical workflow for obtaining American Community Survey (ACS) data using library(tidycensus).

  2. We show how library(urbnindicators) can produce similar outputs, but with much less code and effort.

  3. We highlight how library(urbnindicators) provides metadata and improves the quality and reliability of the analysis process.

A Typical Workflow

tidycensus provides a suite of functions for working with ACS data in R. In fact, library(urbnindicators) is built on top of library(tidycensus), and we highly encourage those who are unfamiliar to explore library(tidycensus) before using library(urbnindicators).

While tidycensus is versatile and allows users to access many more datasets (and variables within those datasets) than does library(urbnindicators), it can require a significant amount of knowledge and time to support a robust analysis, leading many users to fall into common pitfalls without realizing they’ve made an error(s).

Identify variables to query

We load the built-in codebook and search for our construct of interest (disability). This leaves us 500 variables to choose from.

acs_codebook = tidycensus::load_variables(dataset = "acs5", year = 2022)

acs_codebook %>%
  dplyr::filter(stringr::str_detect(concept, "Disability")) %>% 
  head() %>%
  reactable::reactable()

Let’s imagine we’re interested in calculating the share of individuals with a disability; this requires only two variables (conceptually): the number of people with a disability and the number of all people.

So which variable(s) do we select? There’s not a clear answer. All variables relating to disability reflect disability and at least one other characteristic (e.g., “sex by age by disability status”). If we want to calculate the percent of all individuals with a disability, we want to do so using the most robust available variables (i.e., those that reflect all individuals who reported their disability status), whereas some variables that reflect disability may have smaller counts because the other characteristics combined with disability status (e.g., health insurance coverage status) may be available only for a subset of the individuals for whom disability status is available.

Putting these challenges aside, let’s imagine we select the table of variables prefixed “B18101”, for “Sex by Age by Disability”. We think that most respondents who respond about their disability status will also have responded about their sex and age. We then pass this to library(tidycensus) as:

df_disability = get_acs(
  geography = "county",
  state = "NJ", 
  year = 2022,
  output = "wide",
  survey = "acs5",
  table = "B18101")

df_disability %>% dim()
#> [1] 21 80
df_disability %>% colnames() %>% head(5)
#> [1] "GEOID"       "NAME"        "B18101_001E" "B18101_001M" "B18101_002E"

This returns us 21 observations–one for each county in NJ–along with an intimidating 80 columns with meaningless names along the lines of B18101_039E.

Calculating our measure of interest

Now we would need to figure out how to aggregate the needed variables for both the denominator and numerator in order to calculate a valid “% Disabled” measure, a task that is feasible but time-intensive and error-prone.

For an analysis that leverages more than a single measure, and especially when measures are required from distinct tables, this workflow is burdensome and creates significant surface area for undetected errors. To acquire data for multiple years, or for multiple different types of geographies, also requires repeated calls to tidycensus::get_acs().

At the same time, many analysts will be overwhelmed by and unsure how to combine the margins of error that are returned by tidycensus::get_acs(), opting simply to drop this critical information from their analysis as they go about calculating “% Disabled”.

Using urbnindicators

library(urbnindicators) abstracts the workflow above behind the scenes. Instead of a call to tidycensus::get_acs(), a call to urbnindicators::compile_acs_data() returns a dataset of both raw ACS measures and derived estimates (such as the share of all individuals who are disabled). And that dataset includes a range of measures–-spanning things such as health insurance, employment, housing costs, and race and ethnicity–-not just one variable or table from the ACS.

Acquire data

It’s as simple as the call below. Note that you can provide a vector of years and/or states if you want data over different time periods or geographies.

Note that selecting more geographic units–either by selecting a geography option comprising more units, by selecting more states, or selecting more years–can significantly increase the query time. A tract-level query of the entire US can take 30+ minutes.

df_urbnindicators = urbnindicators::compile_acs_data(
  years = 2022,
  geography = "county",
  states = "NJ",
  spatial = TRUE)
#> Downloading: 6.8 kB     Downloading: 6.8 kB     Downloading: 15 kB     Downloading: 15 kB     Downloading: 15 kB     Downloading: 15 kB     Downloading: 30 kB     Downloading: 30 kB     Downloading: 30 kB     Downloading: 30 kB     Downloading: 46 kB     Downloading: 46 kB     Downloading: 46 kB     Downloading: 46 kB     Downloading: 53 kB     Downloading: 53 kB     Downloading: 53 kB     Downloading: 53 kB     Downloading: 74 kB     Downloading: 74 kB     Downloading: 74 kB     Downloading: 74 kB     Downloading: 100 kB     Downloading: 100 kB     Downloading: 100 kB     Downloading: 100 kB     Downloading: 110 kB     Downloading: 110 kB     Downloading: 110 kB     Downloading: 110 kB     Downloading: 120 kB     Downloading: 120 kB     Downloading: 120 kB     Downloading: 120 kB     Downloading: 120 kB     Downloading: 120 kB     Downloading: 140 kB     Downloading: 140 kB     Downloading: 150 kB     Downloading: 150 kB     Downloading: 150 kB     Downloading: 150 kB     Downloading: 180 kB     Downloading: 180 kB     Downloading: 180 kB     Downloading: 180 kB     Downloading: 190 kB     Downloading: 190 kB     Downloading: 190 kB     Downloading: 190 kB     Downloading: 220 kB     Downloading: 220 kB     Downloading: 230 kB     Downloading: 230 kB     Downloading: 260 kB     Downloading: 260 kB     Downloading: 260 kB     Downloading: 260 kB     Downloading: 260 kB     Downloading: 260 kB     Downloading: 290 kB     Downloading: 290 kB     Downloading: 290 kB     Downloading: 290 kB     Downloading: 290 kB     Downloading: 290 kB     Downloading: 330 kB     Downloading: 330 kB     Downloading: 330 kB     Downloading: 330 kB     Downloading: 360 kB     Downloading: 360 kB     Downloading: 360 kB     Downloading: 360 kB     Downloading: 390 kB     Downloading: 390 kB     Downloading: 390 kB     Downloading: 390 kB     Downloading: 440 kB     Downloading: 440 kB     Downloading: 510 kB     Downloading: 510 kB     Downloading: 510 kB     Downloading: 510 kB     Downloading: 510 kB     Downloading: 510 kB     Downloading: 520 kB     Downloading: 520 kB     Downloading: 570 kB     Downloading: 570 kB     Downloading: 570 kB     Downloading: 570 kB     Downloading: 570 kB     Downloading: 570 kB     Downloading: 590 kB     Downloading: 590 kB     Downloading: 640 kB     Downloading: 640 kB     Downloading: 640 kB     Downloading: 640 kB     Downloading: 720 kB     Downloading: 720 kB     Downloading: 720 kB     Downloading: 720 kB     Downloading: 770 kB     Downloading: 770 kB     Downloading: 790 kB     Downloading: 790 kB     Downloading: 790 kB     Downloading: 790 kB     Downloading: 850 kB     Downloading: 850 kB     Downloading: 970 kB     Downloading: 970 kB     Downloading: 1.1 MB     Downloading: 1.1 MB     Downloading: 1.1 MB     Downloading: 1.1 MB     Downloading: 1.3 MB     Downloading: 1.3 MB     Downloading: 1.5 MB     Downloading: 1.5 MB     Downloading: 3.1 MB     Downloading: 3.1 MB     Downloading: 4.8 MB     Downloading: 4.8 MB     Downloading: 6.4 MB     Downloading: 6.4 MB     Downloading: 8.1 MB     Downloading: 8.1 MB     Downloading: 9.7 MB     Downloading: 9.7 MB     Downloading: 11 MB     Downloading: 11 MB     Downloading: 12 MB     Downloading: 12 MB     Downloading: 12 MB     Downloading: 12 MB     Downloading: 12 MB     Downloading: 12 MB

Analyze or visualize data

And now we’re ready to analyze or plot our data. Simplistically:

df_urbnindicators %>%
  select(matches("disability.*percent$"), geometry) %>%
  ggplot() +
    geom_sf(aes(fill = disability_percent)) +
    theme_urbn_map() +
    scale_fill_continuous(labels = scales::percent, trans = "reverse") +
    labs(
      title = "Disability rates appear higher in southern NJ",
      subtitle = "Disability rates by county, NJ, 2018-2022 ACS",
      fill = "Population with an ACS-defined disability (%)" %>% str_wrap(20))

Document data

There’s a lot happening behind the scenes, so it’s important to understand what each variable represents and how it was calculated. library(urbnindicators) includes a codebook as an attribute of the dataframe returned from urbnindicators::compile_acs_data(). View and navigate through the full codebook here.

df_urbnindicators %>%
  attr("codebook") %>%
  head(20) %>%
  reactable::reactable()

The codebook specifies the variable type and provides a definition of how the variable was calculated. Most (though not all) variables that are directly available from the ACS are count variables, such as the number of people receiving public assistance. Many of the variables that are calculated by library(urbnindicators) are percent variables, where we divide two count variables.

df_urbnindicators %>%
  attr("codebook") %>%
  filter(str_detect(calculated_variable, "public_assistance")) %>%
  reactable::reactable()

The most common convention is that a percent variable is calculated by dividing a count variable (e.g., public_assistance_received) by the universe variable for the corresponding table (e.g., public_assistance_universe). But other derived variables are more complex, such as those for commute mode. Here, we’ve aggregated three counts representing different types of individual motor vehicle transportation to create the numerator, while the denominator is the table universe minus individuals who work from home.

df_urbnindicators %>%
  attr("codebook") %>%
  filter(str_detect(calculated_variable, "means.*motor_vehicle")) %>%
  reactable::reactable()

This allows us say something along the lines of: “Of individuals who commute to work, XX% use a motor vehicle as their primary commute mode.”

So What’s the Value Add?

Hopefully the process above has illustrated some of the advantages, which fall into two buckets: efficiency and reliability.

  1. Speed:

    1. Sensible decisions about variable and table selection eliminate the need to hunt for the right variables.

    2. Raw ACS variables are included alongside pre-calculated variables, such as percentages, that are not directly available from the ACS.

    3. Query multiple years and/or multiple states in a single function call– no need to loop over your desired years or states.

  2. Accuracy:

    1. Alphanumeric variable codes (e.g., B18101_001) are replaced with meaningful variable names (e.g., disability_percent)–eliminating a common source of confusion and error.

    2. Data are returned with a codebook that documents how variables were created and what they represent.

    3. Behind-the-scenes quality checks help to ensure (though do not guarantee) that calculations are correct, though users still should verify results themselves.

    4. Both straight-from-the-ACS and derived variables include margins of error and coefficients of variation that enable users to conduct statistical significance testing and can inform other data quality decision-making, such as whether to suppress certain observations or whether data should be aggregated–across geographies or across variables, or both–to improve the accuracy of estimates.