3  Spatial Equity Data Tool Algorithm

This section outlines the steps that the Spatial Equity Data Tool algorithm performs after a user uploads data from the web tool or API. This includes how null values are filtered, how weights are applied, and how geography is treated in scoring.

0. User Submits Analysis Request

The execution of the Spatial Equity Data Tool (SEDT) algorithm is prompted by a user submitting an analysis request to the tool via the web tool (Chapter 9) or API (Chapter 10). This request contains the following information:

  • Resource dataset for analysis (see Chapter 6)
  • Geographic scope of analysis (see Chapter 5)
  • Optional advanced options such as filtering (web tool only) or weighting (see Chapter 7)
  • Optional custom geographic and demographic data sets (API only) (see Chapter 11)

Following the analysis request, the SEDT performs the following analysis steps:

1. Filter Out Null Values

The tool filters out rows in which the longitude, latitude, weight, or filter columns have null values. All the following are treated as null values (i.e., blank values):

  • #N/A
  • #N/A N/A
  • #NA
  • -1.#IND
  • -1.#QNAN
  • -NaN
  • -nan
  • 1.#IND
  • 1.#QNAN
  • <NA>
  • N/A
  • NA
  • NULL
  • NaN
  • n/a
  • nan
  • null
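
The null check above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's actual code: the function names, the row-as-dictionary layout, and the choice to treat empty strings as null are assumptions made for the example.

```python
# Strings that this section lists as null markers.
NULL_MARKERS = {
    "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a",
    "nan", "null",
}


def is_null(value):
    """Return True for None, blank strings (assumption), and listed markers."""
    if value is None:
        return True
    text = str(value).strip()
    return text == "" or text in NULL_MARKERS


def drop_null_rows(rows, required_columns):
    """Keep only rows with no null in any of the required columns."""
    return [
        row for row in rows
        if not any(is_null(row.get(col)) for col in required_columns)
    ]


# Hypothetical rows: only the first one survives the filter.
rows = [
    {"lon": "-77.01", "lat": "38.88", "weight": "2"},
    {"lon": "NA", "lat": "38.90", "weight": "1"},
    {"lon": "-77.03", "lat": "", "weight": "1"},
    {"lon": "-77.05", "lat": "38.91", "weight": "#N/A"},
]
clean_rows = drop_null_rows(rows, ["lon", "lat", "weight"])
```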

2. Apply Filters

In the web tool, if the user selects filters for their data, the tool applies those filters. The tool can accept three types of filters (text, numeric, and date) that are applied in the following order: text filters, numeric filters, and date filters. If multiple filter conditions are given, they are chained so that a row must satisfy all of them to be kept. For example, if a user chooses the filters

  1. zip_code = 20024 and
  2. date_opened = 01/01/2020,

then the tool would filter to rows in which the zip_code column is 20024 and the date_opened column is 01/01/2020.

For the API, the user should apply filters in data preprocessing.
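
API users who want to reproduce this behavior in preprocessing could chain filters along the following lines. This is a minimal sketch under stated assumptions: the filter dictionaries, the `apply_filters` name, and the equality-only comparison are hypothetical and simply mirror the zip_code and date_opened example above.

```python
from datetime import date

# Order in which the section says filter types are applied.
FILTER_ORDER = {"text": 0, "numeric": 1, "date": 2}


def apply_filters(rows, filters):
    """Apply equality filters one after another; a row must pass all of them."""
    ordered = sorted(filters, key=lambda f: FILTER_ORDER[f["type"]])
    for f in ordered:
        rows = [row for row in rows if row[f["column"]] == f["value"]]
    return rows


# Hypothetical data mirroring the example in the text.
rows = [
    {"zip_code": 20024, "date_opened": date(2020, 1, 1)},
    {"zip_code": 20024, "date_opened": date(2021, 6, 1)},
    {"zip_code": 20001, "date_opened": date(2020, 1, 1)},
]
filters = [
    {"type": "date", "column": "date_opened", "value": date(2020, 1, 1)},
    {"type": "numeric", "column": "zip_code", "value": 20024},
]
filtered = apply_filters(rows, filters)
```

Because every condition must hold, the order of application does not change which rows survive in this sketch.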

3. Apply Weights

If the user selects a weight column, the tool records it for use in the calculations that follow. If not, the tool creates a dummy weight column with a value of 1 for every row, thus weighting every row in the dataset equally.
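
A minimal sketch of this default, assuming rows are stored as dictionaries (the `resolve_weights` name and the `capacity` column are hypothetical):

```python
def resolve_weights(rows, weight_column=None):
    """Return one weight per row; default to 1 when no column is chosen."""
    if weight_column is None:
        return [1.0 for _ in rows]
    return [float(row[weight_column]) for row in rows]


rows = [{"capacity": "10"}, {"capacity": "30"}]
equal_weights = resolve_weights(rows)
custom_weights = resolve_weights(rows, "capacity")
```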

4. Determine the Dataset’s Source Geography

To minimize the burden on users, the web tool determines the geography automatically from the data provided. Before uploading the data, the user selects a geographic level of analysis: city, county, state, or national. The tool uses this selection to identify the relevant set of potential geographic boundaries (e.g., all cities, all counties, all states, or the nation).

For the API, the user specifies the geographic level in the function call.

For county, state, and national analyses, we use the boundaries provided by the US Census Bureau (noting that we do not include US territories as part of the US boundary, given the lack of available demographic data). We define city boundaries as all census tracts contained within the census place boundary. In some cases, our definition may be larger than the official city boundary; see Chapter 13 for more details. For city-level analysis, the tool uses a precomputed list of all 835 US cities with populations above 50,000 in 2022.

The tool then spatially joins the points in the data to a precompiled dataset of the boundaries corresponding to the selected geographic level of analysis. If the dataset has more than 50 points, then the tool takes a 5 percent sample for the spatial join to limit computation time. If the points in the data fall within the boundaries of multiple geographies, the tool only keeps the points within the most frequently occurring geography in the data. For example, if a user has selected county-level analysis and the user’s data fall within multiple counties, only data from the most frequently occurring county in the dataset will be kept. As we have mentioned in other sections, the best course forward for users interested in multiple geographies might be to repeat the analysis for each one (for example, for each of the multiple counties of interest).
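
The selection logic can be sketched as follows. This is an illustration rather than the tool's implementation: the spatial join is abstracted into a hypothetical `locate` function that returns a geography identifier (or `None`) for a point, and the fixed random seed and rounding of the sample size are assumptions.

```python
import random
from collections import Counter


def most_frequent_geography(points, locate, threshold=50, fraction=0.05, seed=0):
    """Return the geography that contains the most (sampled) points."""
    points = list(points)
    if len(points) > threshold:
        # Sample 5 percent of the points to limit the cost of the join.
        k = max(1, round(len(points) * fraction))
        sample = random.Random(seed).sample(points, k)
    else:
        sample = points
    located = (locate(p) for p in sample)
    counts = Counter(g for g in located if g is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def toy_locate(point):
    """Hypothetical stand-in for a spatial join: split the plane at x = 0."""
    return "county_A" if point[0] < 0 else "county_B"


small = [(-1, 0), (-2, 0), (-3, 0), (4, 0)]
source = most_frequent_geography(small, toy_locate)
```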

5. Read in All Geographic and Demographic Data for That Geography

Once the tool identifies a geography, it reads in all the precompiled geographic and demographic data for that geography. These data include the following:

  1. the boundaries for all census tracts in the geography
  2. the American Community Survey (ACS) demographic variables for every tract in the geography
  3. precomputed ACS demographic statistics for the entire geography

For national-level analysis, the tool also reads in the following:

  1. the boundaries for all states in the US (for geographic disparity score)
  2. precomputed ACS demographic statistics for all states in the US (for demographic disparity score)

For state-level analysis, the tool also reads in the following:

  1. the boundaries for all counties in the state (for geographic disparity score)
  2. precomputed ACS demographic statistics for all counties in the state (for demographic disparity score)

For more information on the specific ACS demographic variables used in the tool, see Chapter 4.

To read about how custom baseline and demographic datasets can be selected in the API, please see Chapter 11.

6. Compute Which Census Tract Each Data Point Falls Into

The tool spatially joins each point (i.e., row) in the user-uploaded data to the set of all census tracts in the geography. Any points that do not fall within any census tract—that is, points located outside the main geography boundary—are discarded and not included in the calculations.
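
To make the join concrete, the sketch below assigns points to tracts with a simple ray-casting point-in-polygon test. It assumes tracts are plain lists of (x, y) vertices without holes; the function names are hypothetical, and a production spatial join would normally rely on a geospatial library and a spatial index instead.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tracts(points, tracts):
    """Pair each point with its tract; points in no tract are dropped."""
    assigned = []
    for x, y in points:
        for tract_id, polygon in tracts.items():
            if point_in_polygon(x, y, polygon):
                assigned.append(((x, y), tract_id))
                break
    return assigned


# Two hypothetical square tracts side by side.
tracts = {
    "tract_1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "tract_2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.5), (3.0, 3.0)]
assigned = assign_tracts(points, tracts)
```

The third point lies outside both tracts and is discarded, mirroring the behavior described above.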

7. Compute Geographic Disparity Scores

The tool calculates geographic disparity scores to assess which areas are over- and underrepresented relative to a baseline population. The geographic disparity scores are calculated and visualized at different subgeographies, depending on the selected geography:

  • National: the tool calculates the geographic disparity score for each state in the US
  • State: the tool calculates the geographic disparity score for each county in the state
  • County: the tool calculates the geographic disparity score for each tract in the county
  • City: the tool calculates the geographic disparity score for each tract in the city

We use different subgeographies for each level of analysis because user feedback indicated that the different geographic levels of analysis required different “subgeographies” to make the geographic disparity score map meaningful (e.g., users suggested that a nationwide tract-level map was not meaningful).

The geographic disparity score is the difference between the proportion of the user-uploaded data within that subgeography and the proportion of the main geography’s baseline population within that subgeography. For example, if a state-level analysis is selected, the geographic disparity score is the difference between the proportion of the user-uploaded data within that county and the proportion of the state’s baseline population within that county. We compute this score for the following baseline populations:

  1. total population
  2. population with low income
  3. population with extremely low income
  4. population without internet access
  5. population of cost-burdened renter households
  6. senior (≥ 65) population
  7. child (< 18) population

For the state-level analysis example, if a county contains 3.3 percent of the user-uploaded dataset but 1.4 percent of the state’s senior population, then the geographic disparity score for the senior population in that county would be 3.3 − 1.4, or 1.9 percentage points. The tool repeats this calculation for every subgeography and for every baseline population. These geographic disparity scores are displayed by the interactive maps on the web tool.
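
The calculation can be sketched as below for a single baseline population. The function name and dictionary inputs are hypothetical, and the use of summed row weights to form the data share is an assumption; with the default weight of 1 it reduces to a simple count of points.

```python
def geographic_disparity_scores(data_weight, baseline_population):
    """Data share minus baseline share, in percentage points, per subgeography."""
    total_data = sum(data_weight.values())
    total_population = sum(baseline_population.values())
    scores = {}
    for sub_id, population in baseline_population.items():
        data_share = data_weight.get(sub_id, 0.0) / total_data
        population_share = population / total_population
        scores[sub_id] = 100 * (data_share - population_share)
    return scores


# Hypothetical state with two counties, chosen so that county_A holds
# 3.3 percent of the data and 1.4 percent of the senior population.
data_weight = {"county_A": 33, "county_B": 967}
senior_population = {"county_A": 14, "county_B": 986}
scores = geographic_disparity_scores(data_weight, senior_population)
```

Here `scores["county_A"]` is approximately 1.9, matching the example in the text, and the scores across all subgeographies sum to approximately zero.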

See Chapter 8 for more information on how to interpret the geographic disparity scores.

8. Compute Demographic Disparity Scores

The demographic disparity score is the percentage point difference between the representation of a demographic group in the data (the data-implied percentage) and the representation of that group in the geography (the geography-wide percentage). The demographic disparity score is always calculated using tract-level data, regardless of the selected geography. This is because we consider the tract level the appropriate unit of analysis to assess the demographics of the individuals with access to a given data point. We illustrate how we calculate this metric through the example of calculating the demographic disparity score for Black residents in a simple geography with two census tracts.

The first step is to calculate the geography-wide percentage, which answers the question, “What is the share of Black residents in the entire geography?” For city-level analysis, we calculate the geography-wide percentages from the tract-level census data for the tracts in the city. To illustrate, assume that our example geography is a city and that each of the two tracts is home to 50 percent of the city’s population. If tract 1 is 20 percent Black and tract 2 is 40 percent Black, then the geography-wide (in this case, citywide) percentage of Black residents is (0.5)(0.2) + (0.5)(0.4) = 0.3. For county, state, and national analysis, the geography-wide percentages are directly reported by the census, so we can imagine the census reported that 30 percent of the population of our geography is Black.

Now imagine 80 percent of the data uploaded by the user are located in tract 1 and 20 percent are located in tract 2. Then the data-implied percentage of Black residents would be (0.8)(0.2) + (0.2)(0.4) = 0.24. This is calculated identically across national-, state-, county-, and city-level analysis. The data-implied percentage of Black residents answers the question, “What is the share of Black residents in the tracts that the data come from?”

Finally, the demographic disparity score is the difference between the data-implied percentage and the geography-wide percentage, or 0.24 − 0.3 = −0.06. In this example, Black residents are underrepresented by 6 percentage points in the data relative to the geography population. These demographic disparity scores are calculated for all our demographic variables of interest (see Chapter 4 for a full list) and visualized by the interactive chart on the tool.
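
The two-tract example can be reproduced with a short sketch. The helper name and dictionary layout are hypothetical; the numbers are those used in the example above.

```python
def weighted_share(weights, tract_share):
    """Average the tract-level shares, weighting each tract by `weights`."""
    total = sum(weights.values())
    return sum(weights[t] / total * tract_share[t] for t in weights)


# Share of Black residents in each tract.
black_share = {"tract_1": 0.2, "tract_2": 0.4}
# Share of the city's population living in each tract.
population_weight = {"tract_1": 0.5, "tract_2": 0.5}
# Share of the user-uploaded data located in each tract.
data_weight = {"tract_1": 0.8, "tract_2": 0.2}

geography_wide = weighted_share(population_weight, black_share)  # about 0.30
data_implied = weighted_share(data_weight, black_share)          # about 0.24
disparity = data_implied - geography_wide                        # about -0.06
```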

For some demographic variables, we use the appropriate universe (or denominator) instead of the total population for the geography-wide calculations. For example, to generate the demographic disparity score for veterans, we use the civilian population 18 years and older as the denominator. For more information on the universes used for each demographic variable, see Chapter 4.

For the state- and national-level analyses in the web tool only, we choose to show the demographic disparity score for the relevant geography (state or US) and the demographic disparity score for the smaller subgeographies that make up the main geography (counties and states, respectively). We describe how we calculate these demographic disparity scores for each level of analysis:

  • National: We calculate the national demographic disparity scores as well as the demographic disparity scores for each state. To calculate the scores for each state, the geography-wide percentage is the census-published share of the given demographic group in the state (e.g., the share of Black residents in Illinois). The data-implied percentage is calculated as described above for the subset of points that fall in each state (e.g., the share of Black residents in the tracts that the data in Illinois come from). The state disparity score is the difference between the data-implied percentage and the geography-wide percentage for each state.

  • State: We calculate the state demographic disparity scores as well as the demographic disparity scores for each county in the state. To calculate the scores for each county, the geography-wide percentage is the census-published share of the given demographic group in the county (e.g., the share of Black residents in Cook County). The data-implied percentage is calculated as described above for the subset of points that fall in each county (e.g., the share of Black residents in the tracts that the data in Cook County come from). The county disparity score is the difference between the data-implied percentage and the geography-wide percentage for each county.

Note that in both cases, we do not display disparity scores for subgeographies that contain no points in the user-uploaded dataset. The subgeography disparity scores are displayed in gray in the tool chart.

See Chapter 8 for more information on how to interpret the demographic disparity scores.

9. Compute Statistical Significance

Both the demographic and the geographic disparity scores above rely on ACS demographic variables, which have associated margins of error. We use the margins of error for the relevant ACS variables to construct 95 percent confidence intervals for both the geographic and the demographic disparity scores and determine if the scores differ significantly from 0. To calculate the margins of error, we use the formulas for user-derived proportions and percentages from a recent ACS handbook (US Census Bureau 2020). We perform this significance calculation for all our demographic and geographic disparity scores. Scores that are not statistically significant are reported as dark gray in our tool to differentiate them from significant scores. For more detailed information, see Chapter 14.
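
As a rough illustration of the building blocks involved, the sketch below implements the ACS handbook approximation for the margin of error of a derived proportion, the standard rescaling of a published 90 percent ACS margin of error to 95 percent, and a check of whether a score's interval excludes 0. The function names are hypothetical, and how the tool combines margins of error into a margin for each disparity score is not described in this section, so that margin is treated as a given input here.

```python
import math


def proportion_moe(x, y, moe_x, moe_y):
    """Approximate MOE of the proportion p = x / y, where x is a subset of y."""
    p = x / y
    radicand = moe_x ** 2 - (p ** 2) * (moe_y ** 2)
    if radicand < 0:
        # Fall back to the ratio formula when the radicand is negative.
        radicand = moe_x ** 2 + (p ** 2) * (moe_y ** 2)
    return math.sqrt(radicand) / y


def to_95_percent(moe_90):
    """Rescale a 90 percent margin of error to the 95 percent level."""
    return moe_90 * 1.960 / 1.645


def is_significant(score, moe_95):
    """True when the interval score +/- moe_95 does not contain 0."""
    return abs(score) > moe_95


# Hypothetical estimates: 100 of 1,000 residents, with MOEs of 20 and 50.
moe_p = proportion_moe(100, 1000, 20, 50)
```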