urbnindicators
Welcome to urbnindicators. The goal of this package is to streamline the process of obtaining an analysis-ready dataset, with a focus on use cases in the social sciences.
This vignette is organized into three parts:

1. We first illustrate a common workflow for obtaining American Community Survey (ACS) data from the Census Bureau (arguably the premier source of social science information about people and places in the U.S.).
2. Second, we walk through how urbnindicators can produce similar outputs (but much more quickly) as compared to the workflow in (1).
3. Lastly, we touch on why and how urbnindicators provides a more robust and accurate set of data products than might be obtained through the workflow in (1).
The Existing tidycensus Workflow

tidycensus provides a suite of functions for working with select datasets available via the Census Bureau’s API (application programming interface) and is the backbone for all of the data produced by urbnindicators. While tidycensus is versatile and allows users to access many more datasets (and variables within those datasets) than does urbnindicators, it can require a significant amount of knowledge and effort to use tidycensus to support a robust analysis process, and many users may fall into common pitfalls without realizing they’ve made an error.

A tidycensus workflow might follow the steps below.
First, we need to identify the names of the variables we’re interested in. We want to look at the share of the population with a disability, at the county level, in New Jersey. So we load the variable index for the corresponding data year and look for variables with “Disability” in the concept field:
library(urbnindicators)
library(tidycensus)
library(dplyr)
library(stringr)
library(ggplot2)

acs_codebook = load_variables(dataset = "acs5", year = 2022)
# acs_codebook %>% View() # not run
acs_codebook %>%
dplyr::filter(stringr::str_detect(concept, "Disability")) %>%
nrow()
#> [1] 506
If you’re working in RStudio, you can filter the codebook via the point-and-click interface; if not, you can do so programmatically, subsetting the ~28,000 available variables down to the ~500 that match the term “Disability”. However, 500 variables is a few orders of magnitude greater than the number of variables we actually want (read: two): the number of people with a disability and the total number of people.
So which variable(s) do we select? There’s not a clear answer. All variables relating to disability reflect disability and at least one other characteristic (e.g., “sex by age by disability status”). If we want to calculate the percent of all individuals with a disability, we want to do so using the most robust available variables (i.e., those that reflect all individuals who reported their disability status). Some disability variables, however, may have smaller counts because the other characteristics they combine with disability status (e.g., health insurance coverage status, as in “Age by Disability Status by Health Insurance Coverage Status”) may be available for only a subset of the individuals for whom disability status is available.
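One way to narrow the field is to compare candidate tables at the concept level rather than scanning all ~500 variable rows. This sketch reuses the acs_codebook object loaded above:

```r
# List each distinct disability-related table concept once, making it
# easier to compare candidate tables (and their universes) than scanning
# ~500 individual variable rows.
acs_codebook %>%
  dplyr::filter(stringr::str_detect(concept, "Disability")) %>%
  dplyr::distinct(concept) %>%
  dplyr::arrange(concept)
```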
Putting these challenges aside, let’s imagine we select the table of variables prefixed “B18101”, for “Sex by Age by Disability”. We think that most respondents who are asked about their disability status will also have been asked about their sex and age. We then pass this table to tidycensus as:
df_disability = get_acs(
  geography = "county",
  state = "NJ",
  year = 2022,
  output = "wide",
  survey = "acs5",
  table = "B18101")
#> Getting data from the 2018-2022 5-year ACS
#> Warning: • You have not set a Census API key. Users without a key are limited to 500
#> queries per day and may experience performance limitations.
#> ℹ For best results, get a Census API key at
#> http://api.census.gov/data/key_signup.html and then supply the key to the
#> `census_api_key()` function to use it throughout your tidycensus session.
#> This warning is displayed once per session.
#> Loading ACS5 variables for 2022 from table B18101. To cache this dataset for faster access to ACS tables in the future, run this function with `cache_table = TRUE`. You only need to do this once per ACS dataset.
df_disability %>% dim()
#> [1] 21 80
# df_disability %>% head()
This returns 21 observations (one for each county in NJ) along with an intimidating 80 columns. Now we would need to figure out how to aggregate the needed variables for both the denominator and numerator in order to calculate a valid “% Disabled” measure, a task that is feasible but time-intensive and error-prone (in no small part because each variable is named with an alphanumeric code rather than a meaningful, descriptive name). For an analysis that leverages more than a single measure, and especially when measures are required from distinct tables, this workflow is burdensome and exposes significant surface area for undetected errors.
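For a sense of what that aggregation involves, here is a hedged sketch of the manual calculation. It assumes that the leaf-level 2022 B18101 labels for disabled respondents end in “With a disability” and that B18101_001 is the table’s universe total; both assumptions should be verified against the codebook before use.

```r
# A sketch of the manual "% Disabled" calculation (assumes B18101_001 is
# the universe total and that leaf labels for disabled respondents end in
# "With a disability" -- verify against the codebook before relying on this).
b18101_disability_codes = acs_codebook %>%
  dplyr::filter(
    stringr::str_detect(name, "^B18101_"),
    stringr::str_detect(label, "With a disability$")) %>%
  dplyr::pull(name)

df_disability %>%
  dplyr::mutate(
    # sum the "With a disability" estimate columns (wide-format "E" suffix)
    disability_count = rowSums(
      dplyr::across(dplyr::all_of(paste0(b18101_disability_codes, "E")))),
    disability_percent = disability_count / B18101_001E) %>%
  dplyr::select(GEOID, NAME, disability_percent)
```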
At the same time, many analysts will be overwhelmed by and unsure how to incorporate the margins of error that are returned by tidycensus, opting simply to drop this critical information from their analysis. (See vignette("coefficients-of-variation") for more on how urbnindicators provides quick and actionable characterizations of margins of error.)
Enter urbnindicators

urbnindicators abstracts the workflow above behind the scenes. In lieu of a call to tidycensus::get_acs(), a call to urbnindicators::compile_acs_data() returns a dataframe of both raw ACS measures and derived estimates (such as the share of all individuals who are disabled).
df_urbnindicators = urbnindicators::compile_acs_data(
  variables = NULL,
  years = 2022,
  geography = "county",
  states = "NJ",
  spatial = FALSE)
#>
#>
#> Variable names and geographies for ACS data products can change between years.
#> Changes to geographies are particularly significant across decades
#> (e.g., from 2019 to 2020), but these changes can occur in any year.
#>
#> Users should ensure that the logic embedded in this function--
#> which was developed around five-year ACS estimates for 2017-2021--
#> remains accurate for their use cases. Evaluation of measures and
#> geographies over time should be thoroughly quality checked.
#> Warning in urbnindicators::compile_acs_data(variables = NULL, years = 2022, : County-level queries can be slow for more than a few counties. Omit the county parameter
#> if you are interested in more than five counties; filter to your desired counties after
#> this function returns.
#> |======================================================================| 100%
#> Warning: There were 3 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `numerator = dplyr::case_when(...)`.
#> Caused by warning in `stri_replace_all_regex()`:
#> ! argument is not an atomic vector; coercing
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
#> Warning: There were 18 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `dplyr::across(...)`.
#> Caused by warning in `sqrt()`:
#> ! NaNs produced
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 17 remaining warnings.
df_urbnindicators %>% dim()
#> [1] 21 2321
While this call returns the same 21 observations (one per county in NJ), it returns some 2,300 columns. Even when we subset to those matching “disability”, we still have 119 columns (including the estimate columns available from our tidycensus::get_acs() call above).
df_urbnindicators %>%
  dplyr::select(dplyr::matches("disability")) %>%
  colnames() %>%
  length() # 119
#> [1] 119
This is because urbnindicators makes the same tidycensus::get_acs() query as illustrated above, along with many others. This reflects a design choice underlying urbnindicators: the package returns very large datasets, but it structures them such that analysts can use simple and familiar approaches to navigating the data while benefiting from a comprehensive array of measures compiled into a single dataset.
The primary differences between urbnindicators and tidycensus outputs are that returned columns have descriptive names (e.g., sex_by_age_by_disability_status_female_75_years_over_with_a_disability), and, importantly, that derived variables are included:
df_urbnindicators %>%
  dplyr::select(GEOID, dplyr::matches("disability.*percent"))
#> # A tibble: 21 × 3
#> GEOID disability_percent disability_percent_cv
#> <chr> <dbl> <dbl>
#> 1 34001 0.142 9545.
#> 2 34003 0.0831 21133.
#> 3 34005 0.116 4108.
#> 4 34007 0.145 11165.
#> 5 34009 0.150 4736.
#> 6 34011 0.148 4668.
#> 7 34013 0.117 21042.
#> 8 34015 0.127 16335.
#> 9 34017 0.0865 11503.
#> 10 34019 0.0876 9255.
#> # ℹ 11 more rows
Indeed, the string-matching approach used above, with the pattern select(matches("variable_of_interest.*percent$")), is key to navigating the thousands of variables returned by urbnindicators::compile_acs_data(). Because variables are named semantically (i.e., their names have meaning and are not simply the default alphanumeric variable codes), and because derived percent variables always end in percent, this flexible pattern can identify standardized measures that are ready for analysis. (As a reminder: ".*" matches any number of characters, while "$" matches the end of a string. select(matches("variable_of_interest.*percent$")) says: match columns whose names contain “variable_of_interest”, followed by any number of characters, and ending in “percent”.)
For a look at a subset of derived percent variables:
df_urbnindicators %>%
  dplyr::select(dplyr::matches("percent$")) %>%
  colnames() %>% # 190+
  sort() %>%
  head(10) # but we'll just take a look at a few for now
#> [1] "ability_speak_english_less_than_very_well_percent"
#> [2] "ability_speak_english_very_well_better_percent"
#> [3] "age_10_14_years_percent"
#> [4] "age_15_17_years_percent"
#> [5] "age_18_19_years_percent"
#> [6] "age_20_years_percent"
#> [7] "age_21_years_percent"
#> [8] "age_22_24_years_percent"
#> [9] "age_25_29_years_percent"
#> [10] "age_30_34_years_percent"
urbnindicators::compile_acs_data() also returns a codebook as an attribute of the returned dataframe. Want to know more about how cost_burdened_30percentormore_incomeslessthan35000_percent was calculated? No problem:
df_urbnindicators %>%
  attr("codebook") %>%
  dplyr::filter(calculated_variable == "cost_burdened_30percentormore_incomeslessthan35000_percent") %>%
  dplyr::pull(definition)
#> [1] "Numerator = household_income_by_gross_rent_as_a_percentage_household_income_in_past_12_months_less_than_$10000_50_0_percent_more (B25074_009), household_income_by_gross_rent_as_a_percentage_household_income_in_past_12_months_$10000_$19999_50_0_percent_more (B25074_018), household_income_by_gross_rent_as_a_percentage_household_income_in_past_12_months_$20000_$34999_50_0_percent_more (B25074_027). Denominator = household_income_by_gross_rent_as_a_percentage_household_income_in_past_12_months_less_than_$10000 (B25074_002), household_income_by_gross_rent_as_a_percentage_household_income_in_past_12_months_$10000_$19999 (B25074_011), household_income_by_gross_rent_as_a_percentage_household_income_in_past_12_months_$20000_$34999 (B25074_020)."
So Why urbnindicators?
Hopefully the process above has illustrated some of the advantages, which fall into two buckets: efficiency and reliability.
- urbnindicators saves time by:
  - Making sensible decisions about variable and table selection;
  - Calculating common measures (typically percentages) behind the scenes; and
  - Allowing multi-year and multi-geography queries by default (not illustrated here). tidycensus does not support these approaches (it supports multi-geography queries at some geographic levels; urbnindicators extends this to all geographies), so users must instead loop over desired years and geographies.
- urbnindicators improves the reliability of the data query and measure creation process by:
  - Replacing alphanumeric variable codes (e.g., B18101_001) with meaningful variable names (e.g., disability_percent);
  - Returning a codebook attached to the primary dataframe that documents how variables were created and what they represent;
  - Running default data quality checks on generated measures; and
  - (Forthcoming) Producing out-of-the-box summaries of measure reliability via urbnindicators::calculate_coefficient_of_variation(), which leverages the margins of error associated with each measure to assess the quality of estimates across all queried geographies. (See vignette("coefficients-of-variation") for more.)
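To make the multi-year point concrete, a single call can cover several years at once. This is a sketch, assuming (as the plural years parameter suggests) that a vector of years is accepted; verify against the package documentation:

```r
# Sketch: one call spanning multiple years (assumes the plural `years`
# parameter accepts a vector, as its name suggests -- verify against the
# package documentation before relying on this).
library(urbnindicators)

df_multi_year = compile_acs_data(
  variables = NULL,
  years = c(2021, 2022),
  geography = "county",
  states = "NJ",
  spatial = FALSE)
```

The equivalent tidycensus approach would require looping tidycensus::get_acs() over each year and binding the results together.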