12  Spatial Equity Data Tool Algorithm

This section outlines the steps that the Spatial Equity Data Tool algorithm performs after a user uploads data from the web tool or API. This includes how null values are filtered, how weights are applied, and how geography is treated in scoring.

0. User Submits Analysis Request

The execution of the Spatial Equity Data Tool (SEDT) algorithm is prompted by a user submitting an analysis request to the tool via the web tool (Chapter 8) or API (Chapter 9). This request contains the following information:

  • Resource dataset for analysis (see Chapter 3)
  • Geographic scope of analysis (see Chapter 5)
  • Optional advanced options such as filtering (web tool only) or weighting (see Chapter 6)
  • Optional custom geographic and demographic data sets (API only) (see Chapter 10)
  • Whether to use a travel shed, and, if so, which one (API only, in beta mode) (see Chapter 19)

Following the analysis request, the SEDT performs the following analysis steps:

1. Filter Out Null Values

The tool filters out rows in which the longitude, latitude, weight, or filter columns have null values. All the following are treated as null values (i.e., blank values):

  • #N/A
  • #N/A N/A
  • #NA
  • -1.#IND
  • -1.#QNAN
  • -NaN
  • -nan
  • 1.#IND
  • 1.#QNAN
  • <NA>
  • N/A
  • NA
  • NULL
  • NaN
  • n/a
  • nan
  • null
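
The null filter can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's actual code: the function names and the sample rows are hypothetical, and treating `None` and empty strings as blank is our assumption.

```python
# Strings treated as null, copied from the list above.
NULL_TOKENS = {
    "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a",
    "nan", "null",
}


def is_null(value):
    """Return True for a listed null token (assumption: also None or "")."""
    return value is None or value == "" or value in NULL_TOKENS


def drop_null_rows(rows, required_columns):
    """Keep only rows with non-null values in every required column."""
    return [
        row for row in rows
        if not any(is_null(row.get(col)) for col in required_columns)
    ]


# Hypothetical rows as they might arrive from an uploaded CSV file.
rows = [
    {"lon": "-77.01", "lat": "38.88", "weight": "2"},
    {"lon": "NaN", "lat": "38.90", "weight": "1"},
    {"lon": "-77.03", "lat": "38.91", "weight": "n/a"},
]
kept = drop_null_rows(rows, ["lon", "lat", "weight"])
```

Only the first row survives, because the other two have a null token in a required column.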

2. Apply Filters

If the user selects filters for their data in the web tool, the tool applies those filters. The tool accepts three types of filters (text, numeric, and date), which are applied in that order. If multiple filter conditions are given, the conditions are chained so that a row must satisfy all of them. For example, if a user chooses the filters

  1. zip_code = 20024 and
  2. date_opened = 01/01/2020,

then the tool would filter to rows in which the zip_code column is 20024 and the date_opened column is 01/01/2020.
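
The chaining in this example can be sketched as follows. The sketch is illustrative only: the predicate-based design and the `apply_filters` name are hypothetical, and treating zip_code as a text filter is an assumption.

```python
from datetime import date

# Hypothetical uploaded rows.
rows = [
    {"zip_code": "20024", "date_opened": date(2020, 1, 1)},
    {"zip_code": "20024", "date_opened": date(2021, 6, 15)},
    {"zip_code": "20001", "date_opened": date(2020, 1, 1)},
]

# Each filter is a predicate on a row, grouped by filter type.
text_filters = [lambda r: r["zip_code"] == "20024"]
numeric_filters = []
date_filters = [lambda r: r["date_opened"] == date(2020, 1, 1)]


def apply_filters(rows, *filter_groups):
    """Apply each group of filters in order; a row must pass every filter."""
    for group in filter_groups:
        for keep in group:
            rows = [r for r in rows if keep(r)]
    return rows


# Order follows the text: text filters, then numeric, then date.
filtered = apply_filters(rows, text_filters, numeric_filters, date_filters)
```

Because every condition must hold, only the first row remains.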

For the API, the user should apply filters in data preprocessing.

3. Apply Weights

If the user selects a weight column, then the tool uses that column to weight each row. If not, the tool creates a dummy weight column with a value of 1 for every row, thus weighting every row in the dataset equally.
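
A minimal sketch of this default, assuming rows are held as dictionaries; the `add_weights` function and the `_weight` column name are hypothetical.

```python
def add_weights(rows, weight_column=None):
    """Return (rows, weight column name), creating a dummy column if needed."""
    if weight_column is not None:
        # The user chose a column, so it is used as is.
        return rows, weight_column
    dummy = "_weight"  # hypothetical name for the dummy column
    return [{**row, dummy: 1} for row in rows], dummy


rows = [{"lon": -77.0, "lat": 38.9}, {"lon": -77.1, "lat": 38.8}]
weighted_rows, weight_col = add_weights(rows)
```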

4. Determine the Dataset’s Source Geography

To minimize the burden on users, the web tool determines the geography automatically from the data provided. Before uploading the data, the user selects a geographic level of analysis: city, county, state, or national. The tool uses this selection to identify the relevant set of potential geographic boundaries (e.g., all cities, all counties, all states, or the nation).

For the API, the user specifies the geographic level in the function call.

For county, state, and national analyses, we use the boundaries provided by the US Census Bureau (noting that we do not include US territories as part of the US boundary, given the lack of available demographic data). We define city boundaries as all census tracts contained within the census place boundary. In some cases, our definition may be larger than the official city boundary; see Chapter 13 for more details. For city-level analysis, the tool uses a precomputed list of all 835 US cities with populations above 50,000 in 2022.

The tool then spatially joins the points in the data to a precompiled dataset of the boundaries corresponding to the selected geographic level of analysis. If the dataset has more than 50 points, then the tool takes a 5 percent sample for the spatial join to limit computation time. If the points in the data fall within the boundaries of multiple geographies, the tool only keeps the points within the most frequently occurring geography in the data. For example, if a user has selected county-level analysis and the user’s data fall within multiple counties, only data from the most frequently occurring county in the sample will be kept. As we have mentioned in other sections, if a user wants to conduct analyses across multiple geographies, the best course forward might be to replicate the analysis for each individual geography separately (for example, for the multiple counties of interest).
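
The sampling and most-frequent-geography logic can be sketched as below. This is an illustration under stated assumptions: `locate` is a hypothetical stand-in for the spatial join (it maps a point to a geography name or `None`), and the fixed random seed and the rounding of the sample size are our choices, not documented behavior.

```python
import random
from collections import Counter


def detect_geography(points, locate, threshold=50, fraction=0.05, seed=0):
    """Return the most frequent geography among (a sample of) the points."""
    if len(points) > threshold:
        # More than 50 points: join only a 5 percent sample.
        sample_size = max(1, round(len(points) * fraction))
        sample = random.Random(seed).sample(points, sample_size)
    else:
        sample = points
    # Count geographies, ignoring points outside every boundary.
    counts = Counter(g for g in map(locate, sample) if g is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical lookup: western points are in County A, eastern in County B.
def locate(point):
    return "County A" if point[0] < 0 else "County B"


points = [(-1.0, 0.0)] * 30 + [(1.0, 0.0)] * 10
source_geography = detect_geography(points, locate)
```

With 40 points no sampling occurs, and the 30 points in County A make it the source geography; the 10 points in County B would then be dropped.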

5. Read in All Geographic and Demographic Data for That Geography

Once the tool identifies a geography, it reads in all the precompiled geographic and demographic data for that geography. These data include the following:

  1. the boundaries for all census tracts in the geography
  2. the American Community Survey (ACS) demographic variables for every tract in the geography
  3. precomputed ACS demographic statistics for the entire geography

For national analyses, the tool also reads in:

  1. the boundaries for all states in the US (for geographic disparity score)
  2. precomputed ACS demographic statistics for all states in the US (for demographic disparity score)

For state analyses, the tool also reads in:

  1. the boundaries for all counties in the state (for geographic disparity score)
  2. precomputed ACS demographic statistics for all counties in the state (for demographic disparity score)

For more information on the specific ACS demographic variables used in the tool, see Chapter 4.

To read about how custom baseline and demographic datasets can be selected in the API, please see Chapter 10.

6. Compute Which Census Tract Each Data Point Falls Into

The tool spatially joins each point (i.e., row) in the user-uploaded data to the set of all census tracts in the geography. Any points that do not fall within any census tract (that is, points located outside the main geography boundary) are discarded and not included in the calculations. The tool then determines the proportion of the total number of data points that fall within each subgeography. In determining this proportion, the tool takes into account the weight variable if it is provided.
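
A rough sketch of the weighted proportion for the standard (non-travel shed) case. The `tract_of` callable is a hypothetical stand-in for the spatial join, returning a tract ID or `None` for points outside the geography.

```python
from collections import defaultdict


def weighted_shares(points, tract_of):
    """Return each tract's share of the total weight of retained points."""
    totals = defaultdict(float)
    for point in points:
        tract = tract_of(point)
        if tract is None:
            continue  # outside every tract: discarded
        totals[tract] += point["weight"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tract: w / grand_total for tract, w in totals.items()}


# Hypothetical points with the joined tract already attached.
points = [
    {"tract": "A", "weight": 3},
    {"tract": "B", "weight": 1},
    {"tract": None, "weight": 5},  # falls outside the geography
]
shares = weighted_shares(points, lambda p: p["tract"])
```

The discarded point does not count toward the denominator, so tract A holds 3/4 of the weight and tract B holds 1/4.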

The tool handles this step differently for the beta travel shed functionality. In this case, the tool reads in the relevant travel shed polygons for the geography. The tool then determines the proportion of the total data points in each tract’s travel shed. In determining this proportion, the tool takes into account the weight variable if it is provided. The tool then joins that weighted count back to the census tract data read in during step 5. Unlike the traditional tool approach, when using travel sheds, the tool does not discard any points.

These differences in approach have two major consequences. First, since the tool does not discard any points, the tool counts points that are not joined to any shed polygon in the denominator when calculating the proportion of total points accessible to each subgeography. Second, a point can be in multiple isochrones. If this is the case, that point is deemed to be accessible to multiple census tracts. For more details on the beta travel shed functionality and how it changes the interpretation of the results, see Chapter 19.
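
Both consequences show up in a small sketch. Here `sheds_of` is a hypothetical function returning the set of tracts whose travel shed contains a point, and using the weighted total as the denominator is our reading of the text rather than confirmed behavior.

```python
def shed_shares(points, sheds_of):
    """Return the share of all (weighted) points accessible to each tract."""
    # Denominator covers every point, including those in no shed.
    total = sum(p["weight"] for p in points)
    if total == 0:
        return {}
    accessible = {}
    for p in points:
        # A point in several sheds counts toward each of those tracts.
        for tract in sheds_of(p):
            accessible[tract] = accessible.get(tract, 0.0) + p["weight"]
    return {tract: w / total for tract, w in accessible.items()}


# Hypothetical points with their containing sheds already attached.
points = [
    {"sheds": {"A", "B"}, "weight": 1},  # inside two sheds
    {"sheds": {"A"}, "weight": 1},
    {"sheds": set(), "weight": 2},       # inside no shed, still counted
]
shares = shed_shares(points, lambda p: p["sheds"])
```

Tract A can reach 2 of the 4 weighted points and tract B can reach 1 of 4, so the shares need not sum to 1.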

This is the only portion of the SEDT algorithm that changes for travel shed analyses.

7. Ensure ACS Data Are Not Suppressed

On occasion, the Census Bureau suppresses (i.e., does not share) data for some geographies for some release vintages. For example, due to a data-collection error in 2020, the Census Bureau reports that the 2019–23 and 2020–24 five-year ACS data products for the Wyandanch CDP and Brentwood CDP in New York will be suppressed. The Census Bureau lists these suppressed geographies in its errata notes.

The SEDT requires ACS demographic data to make its calculations. Consequently, the tool employs a two-step process to ensure that it has all the necessary ACS data to return valid results.

Step 7a: Determine if suppression impacts analysis

The tool determines if data suppression at any geographic scale impacts the calculations. For analyses at all geographic levels, the tool checks to see if any point in the user-uploaded resource dataset falls within a census tract in which data were suppressed. Regardless of geographic level, we calculate the demographic disparity scores using ACS data at the census tract scale, which necessitates that all tract-level ACS data for the source geography be available.

The tool also checks if the source geography is suppressed (see step 4 details). As noted in Chapter 13, for county, state, and national-level analyses, the tool uses data at those scales to calculate the geography-wide percentages in the demographic disparity score calculations. For city-level data, the tool uses data generated from census tracts to serve as city-wide estimates, so it checks that all tracts in the city have data available.

Finally, for state analyses, the tool checks that no county-level data for the given state are suppressed; for national analyses, it checks that no state-level data are suppressed. These ACS data are used to calculate the geographic disparity scores and the subgeography demographic disparity scores on the web tool.

Step 7b: Identify a more suitable year for analysis

If the tool lacks the necessary ACS data to complete a calculation due to data suppression, it attempts to identify a year for which all relevant ACS data are available (i.e., not suppressed). More specifically, among possible years, it checks the previous and subsequent years until it finds a year for which data are available or until it determines that all years have suppressed geographies that prevent a calculation from occurring. The tool prioritizes keeping the year as close to the user-specified year as possible, and, as a tie-breaker, chooses the more recent year. For API-based analyses, the tool returns a warning message and the updated year in addition to calculation results.

To make this process more concrete, consider the following two illustrative examples:

  1. A user submits an analysis via the API requesting 2022 data (i.e., five-year ACS data from 2018–22), but some of the data points fall into tracts that have suppressed data only in the 2018–22 vintage. Once the user uploads that data, the tool will determine that the data fall into tracts with suppressed data for 2018–22. It then assesses whether the data fall into tracts with suppressed data for 2019–23 and 2017–21. They do not, and the tool prioritizes the more recent year, so the analysis is run with 2019–23 five-year ACS data. The API response includes the updated year and a message that the year changed.

  2. A user submits an analysis via the GUI with data from a county that has suppressed data for the 2018–22 and 2019–23 five-year ACS estimates (see note 1). Because the analysis was submitted via the GUI, the most recent year of ACS data was used, which, at the time of the writing of this example, is 2019–23. The tool determines that the data fall into a suppressed geography, so it assesses whether the 2018–22 data are suppressed. Note that, in this example, 2019–23 are the most recent data available, so the tool does not assess a more recent vintage’s suppression. In this example, the 2018–22 data are also suppressed, so the tool assesses whether 2017–21 data are suppressed. They are not, so the tool conducts the analysis with 2017–21 ACS data.
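
The year search in step 7b can be sketched as a nearest-year selection with a tie-breaker. This is an illustration, not the tool's code: `choose_year`, the list of available years, and the `is_suppressed` callable are all hypothetical.

```python
def choose_year(requested, available_years, is_suppressed):
    """Return the usable year closest to `requested`, or None if none exists.

    Ties in distance are broken in favor of the more recent year.
    """
    candidates = [y for y in available_years if not is_suppressed(y)]
    if not candidates:
        return None
    return min(candidates, key=lambda y: (abs(y - requested), -y))


available = [2021, 2022, 2023]  # hypothetical end years of five-year vintages

# Example 1: only the requested 2022 vintage is suppressed.
year_1 = choose_year(2022, available, lambda y: y in {2022})

# Example 2: the 2022 and 2023 vintages are both suppressed.
year_2 = choose_year(2023, available, lambda y: y in {2022, 2023})
```

In the first case 2021 and 2023 are equally close, so the more recent 2023 is chosen; in the second case 2021 is the only usable year.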

8. Compute Geographic Disparity Scores

The tool calculates geographic disparity scores to assess which areas are over- and underrepresented relative to a baseline population. The geographic disparity scores are calculated and visualized at different subgeographies, depending on the selected geography:

  • National: the tool calculates the geographic disparity score for each state in the US
  • State: the tool calculates the geographic disparity score for each county in the state
  • County: the tool calculates the geographic disparity score for each tract in the county
  • City: the tool calculates the geographic disparity score for each tract in the city

We use different subgeographies for each level of analysis because user feedback indicated that the different geographic levels of analysis required different “subgeographies” to make the geographic disparity score map meaningful (e.g., users suggested that a nationwide tract-level map was not meaningful).

The geographic disparity score is the difference between the proportion of the user-uploaded data within (or, for travel shed analyses, accessible to) that subgeography and the proportion of the main geography’s baseline population within that subgeography. For example, if a state-level analysis is selected, the geographic disparity score is the difference between the proportion of the user-uploaded data within that county and the proportion of the state’s baseline population within that county. We compute this score for the following baseline populations:

  1. total population
  2. population with low income
  3. population with extremely low income
  4. population without internet access
  5. population of cost-burdened renter households
  6. senior (≥ 65) population
  7. child (< 18) population

For the state-level analysis example, if a county contains 3.3 percent of the user-uploaded dataset but 1.4 percent of the state’s senior population, then the geographic disparity score for the senior population in that county would be 3.3 − 1.4, or 1.9 percentage points. The tool repeats this calculation for every subgeography and for every baseline population. These geographic disparity scores are displayed by the interactive maps on the web tool.
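
The calculation for one baseline population can be sketched as follows. The function name and the second county are hypothetical; shares are expressed in percent, and a subgeography with no uploaded points is assumed to have a data share of 0.

```python
def geographic_disparity_scores(data_pct, baseline_pct):
    """Return data share minus baseline share for every subgeography."""
    return {
        geo: data_pct.get(geo, 0.0) - baseline_pct[geo]
        for geo in baseline_pct
    }


# Hypothetical two-county state; County X matches the example in the text.
data_pct = {"County X": 3.3, "County Y": 96.7}
senior_pct = {"County X": 1.4, "County Y": 98.6}
scores = geographic_disparity_scores(data_pct, senior_pct)
```

County X comes out at roughly +1.9 percentage points (overrepresented) and County Y at roughly −1.9 (underrepresented), up to floating-point rounding.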

See Chapter 7 for more information on how to interpret the geographic disparity scores.

9. Compute Demographic Disparity Scores

The demographic disparity score is the percentage point difference between the representation of a demographic group in the data (the data-implied percentage) and the representation of that group in the geography (the geography-wide percentage). Regardless of the selected geography, for non-travel shed analyses, the demographic disparity score is calculated using tract-level data. For travel shed analyses, the data-implied percentage is calculated using travel shed, not tract, boundaries. The geography-wide percentage is identical for both approaches and is calculated at the tract scale. We consider the tract or travel shed levels the appropriate units of analysis to assess the demographics of the individuals with access to a given data point. We illustrate how we calculate this metric through the example of calculating the demographic disparity score for Black residents in a simple geography with two census tracts.

The first step is to calculate the geography-wide percentage, which answers the question, “What is the share of Black residents in the entire geography?” For city-level analysis, we calculate the geography-wide percentages from the tract-level census data for the tracts in the city. To illustrate, assume that our example geography is a city and that each of the two tracts is home to 50 percent of the city’s population. If tract 1 is 20 percent Black and tract 2 is 40 percent Black, then the geography-wide (in this case, citywide) percentage of Black residents is (0.5)(0.2) + (0.5)(0.4) = 0.3. For county, state, and national analysis, the geography-wide percentages are directly reported by the census, so we can imagine the census reported that 30 percent of the population of our geography is Black.

Now imagine 80 percent of the data uploaded by the user are located in tract 1 and 20 percent are located in tract 2. Then the data-implied percentage of Black residents would be (0.8)(0.2) + (0.2)(0.4) = 0.24. This is calculated identically across national-, state-, county-, and city-level analysis. The data-implied percentage of Black residents answers the question, “What is the share of Black residents in the tracts that the data come from?”

Finally, the demographic disparity score is the difference between the data-implied percentage and the geography-wide percentage, or 0.24 − 0.3 = −0.06. In this example, Black residents are underrepresented by 6 percentage points in the data relative to the geography population. These demographic disparity scores are calculated for all our demographic variables of interest (see Chapter 4 for a full list) and visualized by the interactive chart on the tool.
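
The two-tract example can be reproduced with a short sketch. The variable and function names are illustrative; the numbers are those used in the example.

```python
# Share of the city's population living in each tract.
tract_pop_share = {"tract_1": 0.5, "tract_2": 0.5}
# Share of the user-uploaded data located in each tract.
tract_data_share = {"tract_1": 0.8, "tract_2": 0.2}
# Share of each tract's residents who are Black.
tract_pct_black = {"tract_1": 0.2, "tract_2": 0.4}


def weighted_pct(shares, pct):
    """Average the tract-level percentages using the given tract shares."""
    return sum(shares[t] * pct[t] for t in shares)


geography_wide = weighted_pct(tract_pop_share, tract_pct_black)  # about 0.30
data_implied = weighted_pct(tract_data_share, tract_pct_black)   # about 0.24
disparity = data_implied - geography_wide                        # about -0.06
```

The same weighted average is used twice: weighting by population share gives the geography-wide percentage, and weighting by data share gives the data-implied percentage.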

For some demographic variables, we use the appropriate universe (or denominator) instead of the total population for the geography-wide calculations. For example, to generate the demographic disparity score for veterans, we use the civilian population 18 years and older as the denominator. For more information on the universes used for each demographic variable, see Chapter 4.

For the state- and national-level analyses in the web tool only, we choose to show the demographic disparity score for the relevant geography (state or US) and the demographic disparity score for the smaller subgeographies that make up the main geography (counties and states, respectively). We describe how we calculate these demographic disparity scores for each level of analysis:

  • National: We calculate the national demographic disparity scores as well as the demographic disparity scores for each state. To calculate the scores for each state, the geography-wide percentage is the census-published share of the given demographic group in the state (e.g., the share of Black residents in Illinois). The data-implied percentage is calculated as described above for the subset of points that fall in each state (e.g., the share of Black residents in the tracts that the data in Illinois come from). The state disparity score is the difference between the data-implied percentage and the geography-wide percentage for each state.

  • State: We calculate the state demographic disparity scores as well as the demographic disparity scores for each county in the state. To calculate the scores for each county, the geography-wide percentage is the census-published share of the given demographic group in the county (e.g., the share of Black residents in Cook County). The data-implied percentage is calculated as described above for the subset of points that fall in each county (e.g., the share of Black residents in the tracts that the data in Cook County come from). The county disparity score is the difference between the data-implied percentage and the geography-wide percentage for each county.

Note that in both cases, we do not display disparity scores for subgeographies that contain no points in the user-uploaded dataset. The subgeography disparity scores are displayed in gray in the web tool’s chart.

See Chapter 7 for more information on how to interpret the demographic disparity scores.

10. Compute Statistical Significance

Both the demographic and the geographic disparity scores above rely on ACS demographic variables, which have associated margins of error. We use the margins of error for the relevant ACS variables to construct 95 percent confidence intervals for both the geographic and the demographic disparity scores and determine if the scores differ significantly from 0. To calculate the margins of error, we use the formulas for user-derived proportions and percentages from a recent ACS handbook (US Census Bureau 2020). We perform this significance calculation for all our demographic and geographic disparity scores. Statistically insignificant scores are reported as dark gray in the web tool to differentiate them from significant scores. For more detailed information, see Chapter 14.
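
The sketch below illustrates two building blocks of such a test: the ACS handbook formula for the margin of error of a derived proportion, and a check of whether a 95 percent confidence interval excludes 0. It assumes a 90 percent margin of error for the score itself is already available; how the tool combines component margins into that value is covered in Chapter 14 and is not reproduced here. The function names and input numbers are hypothetical.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90 percent level
Z_95 = 1.96   # critical value for a 95 percent confidence interval


def proportion_moe(numerator, denominator, moe_num, moe_den):
    """Margin of error of numerator/denominator (ACS derived-proportion formula)."""
    p = numerator / denominator
    radicand = moe_num**2 - (p**2) * (moe_den**2)
    if radicand < 0:
        # Handbook fallback: use the ratio formula when the radicand is negative.
        radicand = moe_num**2 + (p**2) * (moe_den**2)
    return math.sqrt(radicand) / denominator


def is_significant(score, moe_90):
    """True if the 95 percent confidence interval around score excludes 0."""
    half_width = moe_90 / Z_90 * Z_95  # rescale the margin from 90 to 95 percent
    return abs(score) > half_width


# Hypothetical estimate: 200 of 1,000 residents, with margins of 30 and 50.
moe = proportion_moe(200, 1000, 30, 50)
significant = is_significant(-0.06, 0.02)
not_significant = is_significant(-0.06, 0.06)
```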


  1. We know of no geography that meets this definition. We use this as an illustrative case.