Interpolate data using a crosswalk(s)
crosswalk_data.RdApplies geographic crosswalk weights to transform data from a source geography
to a target geography. Can either accept a pre-fetched crosswalk from
get_crosswalk() or fetch the crosswalk automatically using the provided
geography and year parameters.
Usage
crosswalk_data(
data,
crosswalk = NULL,
source_geography = NULL,
target_geography = NULL,
source_year = NULL,
target_year = NULL,
weight = "population",
cache = NULL,
geoid_column = "source_geoid",
count_columns = NULL,
non_count_columns = NULL,
return_intermediate = FALSE,
show_join_quality = TRUE
)Arguments
- data
A data frame or tibble containing the data to crosswalk.
- crosswalk
The output from
get_crosswalk()- a list containing:- crosswalks
A named list of crosswalk tibbles (step_1, step_2, etc.)
- plan
The crosswalk plan
- message
Description of the crosswalk chain
Alternatively, a single crosswalk tibble can be provided for backwards compatibility. If NULL, the crosswalk will be fetched using
source_geographyandtarget_geographyparameters.- source_geography
Character or NULL. Source geography name. Required if
crosswalkis NULL. One of c("block", "block group", "tract", "place", "county", "urban_area", "zcta", "puma", "cd118", "cd119", "core_based_statistical_area").- target_geography
Character or NULL. Target geography name. Required if
crosswalkis NULL. Same options assource_geography.- source_year
Numeric or NULL. Year of the source geography. If NULL and crosswalk is being fetched, uses same-year crosswalk via Geocorr.
- target_year
Numeric or NULL. Year of the target geography. If NULL and crosswalk is being fetched, uses same-year crosswalk via Geocorr.
- weight
Character. Weighting variable for Geocorr crosswalks when fetching. One of c("population", "housing", "land"). Default is "population".
- cache
Directory path or NULL. Where to cache fetched crosswalks. If NULL (default), crosswalk is fetched but not saved to disk.
- geoid_column
Character. The name of the column in
datacontaining the source geography identifiers (GEOIDs). Default is "source_geoid".- count_columns
Character vector or NULL. Column names in
datathat represent count variables. These will be summed after multiplying by the allocation factor. If NULL (default), automatically detects columns with the prefix "count_".- non_count_columns
Character vector or NULL. Column names in
datathat represent mean, median, percentage, and ratio variables. These will be calculated as weighted means using the allocation factor as weights. If NULL (default), automatically detects columns with prefixes "mean_", "median_", "percent_", or "ratio_".- return_intermediate
Logical. If TRUE and crosswalk has multiple steps, returns a list containing both the final result and intermediate results from each step. Default is FALSE, which returns only the final result.
- show_join_quality
Logical. If TRUE (default), prints diagnostic messages about join quality, including the number of data rows not matching the crosswalk and vice versa. For state-nested geographies (tract, county, block group, etc.), also reports state-level concentration of unmatched rows. Set to FALSE to suppress these messages.
Value
If return_intermediate = FALSE (default), a tibble with data summarized
to the final target geography.
If return_intermediate = TRUE and there are multiple crosswalk steps, a list with:
- final
The final crosswalked data
- intermediate
A named list of intermediate results (step_1, step_2, etc.)
The returned tibble(s) include an attribute crosswalk_metadata from the
underlying crosswalk (access via attr(result, "crosswalk_metadata")).
Details
Two usage patterns:
Pre-fetched crosswalk: Pass the output of
get_crosswalk()to thecrosswalkparameter. Useful when you want to inspect or reuse the crosswalk.Direct crosswalking: Pass
source_geographyandtarget_geography(and optionallysource_year,target_year,weight,cache) and the crosswalk will be fetched automatically. Useful for one-off transformations.
Count variables (specified in count_columns) are interpolated by summing
the product of the value and the allocation factor across all source geographies
that overlap with each target geography.
Non-count variables (specified in non_count_columns) are interpolated using
a weighted mean, with the allocation factor serving as the weight.
Automatic column detection: If count_columns and non_count_columns are
both NULL, the function will automatically detect columns based on naming prefixes:
Columns starting with "count_" are treated as count variables
Columns starting with "mean_", "median_", "percent_", or "ratio_" are treated as non-count variables
Other columns: Columns that are not the geoid column, count columns, or
non-count columns (e.g., metadata like data_year) are preserved by taking
the first non-missing value within each target geography group. If all values
are missing, NA is returned.
Multi-step crosswalks: When get_crosswalk() returns multiple crosswalks
(for transformations that change both geography and year), this function
automatically applies them in sequence.
Examples
if (FALSE) { # \dontrun{
# Option 1: Pre-fetched crosswalk
crosswalk <- get_crosswalk(
source_geography = "tract",
target_geography = "zcta",
weight = "population")
result <- crosswalk_data(
data = my_tract_data,
crosswalk = crosswalk,
geoid_column = "tract_geoid",
count_columns = c("count_population", "count_housing_units"))
# Option 2: Direct crosswalking (crosswalk fetched automatically)
result <- crosswalk_data(
data = my_tract_data,
source_geography = "tract",
target_geography = "zcta",
weight = "population",
geoid_column = "tract_geoid",
count_columns = c("count_population", "count_housing_units"))
# Direct crosswalking with year change
result <- crosswalk_data(
data = my_data,
source_geography = "tract",
target_geography = "zcta",
source_year = 2010,
target_year = 2020,
weight = "population",
geoid_column = "tract_geoid",
count_columns = "count_population")
# Pre-fetched crosswalk with intermediate results
crosswalk <- get_crosswalk(
source_geography = "tract",
target_geography = "zcta",
source_year = 2010,
target_year = 2020,
weight = "population")
result <- crosswalk_data(
data = my_data,
crosswalk = crosswalk,
geoid_column = "tract_geoid",
count_columns = "count_population",
return_intermediate = TRUE)
# Access intermediate and final
result$intermediate$step_1 # After first crosswalk
result$final # Final result
} # }