12  Spatial Equity Data Tool Algorithm

This section outlines the steps that the Spatial Equity Data Tool algorithm performs after a user uploads data from the web tool or API. This includes how null values are filtered, how weights are applied, and how geography is treated in scoring.

0. User Submits Analysis Request

The Spatial Equity Data Tool (SEDT) algorithm runs when a user submits an analysis request through the web tool (Chapter 8) or the API (Chapter 9). The request contains the following information:

  • Resource dataset for analysis (see Chapter 3)
  • Geographic scope of analysis (see Chapter 5)
  • Optional advanced options such as filtering (web tool only) or weighting (see Chapter 6)
  • Optional custom geographic and demographic data sets (API only) (see Chapter 10)
  • Whether to use a travel shed, and, if so, which one (API only, in beta mode) (see Chapter 19)

Following the analysis request, the SEDT performs the following analysis steps:

1. Filter Out Null Values

The tool filters out rows in which the longitude, latitude, weight, or filter columns have null values. All the following are treated as null values (i.e., blank values):

  • #N/A
  • #N/A N/A
  • #NA
  • -1.#IND
  • -1.#QNAN
  • -NaN
  • -nan
  • 1.#IND
  • 1.#QNAN
  • <NA>
  • N/A
  • NA
  • NULL
  • NaN
  • n/a
  • nan
  • null
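The null check above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function names, the column names, and the choice to treat empty strings as null (following the "blank values" wording) are assumptions.

```python
# Strings treated as null, copied from the list above.
NULL_TOKENS = {
    "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a",
    "nan", "null",
}


def is_null(value):
    """Return True for None, blank strings, and the listed null tokens."""
    if value is None:
        return True
    text = str(value).strip()
    # Assumption: an empty cell counts as null ("blank values").
    return text == "" or text in NULL_TOKENS


def drop_null_rows(rows, required_columns):
    """Keep only rows whose required columns all hold non-null values."""
    return [
        row for row in rows
        if not any(is_null(row.get(col)) for col in required_columns)
    ]


# Hypothetical rows; only the first has usable values in every column.
rows = [
    {"lon": "-77.02", "lat": "38.88", "weight": "2"},
    {"lon": "N/A", "lat": "38.90", "weight": "1"},
    {"lon": "-77.01", "lat": "38.89", "weight": "nan"},
]
kept = drop_null_rows(rows, ["lon", "lat", "weight"])
```

Here only the first row survives, because the other two carry a null token in the longitude or weight column.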

2. Apply Filters

If the user selects filters for their data in the web tool, the tool applies those filters. The tool accepts three types of filters (text, numeric, and date), which are applied in that order. If multiple filter conditions are given, they are chained, and a row must satisfy all of them to be kept. For example, if a user chooses the filters

  1. zip_code = 20024 and
  2. date_opened = 01/01/2020,

then the tool would filter to rows in which the zip_code column is 20024 and the date_opened column is 01/01/2020.
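The chained filter in this example can be sketched as a list of conditions that a row must all satisfy. The sketch below is illustrative only; treating zip_code as a text filter and representing each filter as a Python function are assumptions, not the tool's internal design.

```python
from datetime import date

# Hypothetical uploaded rows.
rows = [
    {"zip_code": "20024", "date_opened": date(2020, 1, 1)},
    {"zip_code": "20024", "date_opened": date(2021, 6, 15)},
    {"zip_code": "20001", "date_opened": date(2020, 1, 1)},
]

# Filters listed in the documented order: text first, then numeric
# (none here), then date.
filters = [
    lambda row: row["zip_code"] == "20024",              # text filter
    lambda row: row["date_opened"] == date(2020, 1, 1),  # date filter
]


def apply_filters(rows, filters):
    """Keep rows that satisfy every filter condition."""
    return [row for row in rows if all(check(row) for check in filters)]


filtered = apply_filters(rows, filters)
```

Only the first row passes both conditions, matching the outcome described above.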

For the API, the user should apply filters in data preprocessing.

3. Apply Weights

If the user selects a weight column, the tool uses the values in that column as row weights in the later steps. If not, the tool creates a dummy weight column with a value of 1 for every row, thus weighting every row in the dataset equally.
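A minimal sketch of this default, assuming rows are held as dictionaries and using a hypothetical function name:

```python
def resolve_weights(rows, weight_column=None):
    """Return one weight per row; default to 1 when no column is chosen."""
    if weight_column is None:
        # Dummy weights: every row counts equally.
        return [1.0 for _ in rows]
    return [float(row[weight_column]) for row in rows]


rows = [{"capacity": "10"}, {"capacity": "30"}]
equal_weights = resolve_weights(rows)
custom_weights = resolve_weights(rows, "capacity")
```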

4. Determine the Dataset’s Source Geography

To minimize the burden on users, the web tool determines the geography automatically from the data provided. Before uploading the data, the user selects a geographic level of analysis: city, county, state, or national. The tool uses this selection to identify the relevant set of potential geographic boundaries (e.g., all cities, all counties, all states, or the nation).

For the API, the user specifies the geographic level in the function call.

For county, state, and national analyses, we use the boundaries provided by the US Census Bureau (noting that we do not include US territories as part of the US boundary, given the lack of available demographic data). We define city boundaries as all census tracts contained within the census place boundary. In some cases, our definition may be larger than the official city boundary; see Chapter 13 for more details. For city-level analysis, the tool uses a precomputed list of all 835 US cities with populations above 50,000 in 2022.

The tool then spatially joins the points in the data to a precompiled dataset of the boundaries corresponding to the selected geographic level of analysis. If the dataset has more than 50 points, then the tool takes a 5 percent sample for the spatial join to limit computation time. If the points in the data fall within the boundaries of multiple geographies, the tool only keeps the points within the most frequently occurring geography in the data. For example, if a user has selected county-level analysis and the user’s data fall within multiple counties, only data from the most frequently occurring county in the sample will be kept. As we have mentioned in other sections, if a user wants to conduct analyses across multiple geographies, the best course forward might be to replicate the analysis for each individual geography separately (for example, for the multiple counties of interest).
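The selection logic in this step can be sketched as follows. The `locate` argument is a hypothetical stand-in for the spatial join (it maps a point to a geography ID, or to None), and the function name, the fixed random seed, and the rounding of the sample size are assumptions made for illustration.

```python
import random
from collections import Counter


def most_frequent_geography(points, locate, sample_threshold=50,
                            sample_frac=0.05, seed=0):
    """Return the geography ID that contains the most (sampled) points."""
    if len(points) > sample_threshold:
        # Larger datasets: join only a 5 percent sample to save time.
        sample_size = max(1, round(len(points) * sample_frac))
        sample = random.Random(seed).sample(points, sample_size)
    else:
        sample = list(points)
    counts = Counter(geo for geo in map(locate, sample) if geo is not None)
    if not counts:
        return None  # no point matched any boundary
    return counts.most_common(1)[0][0]


# Toy lookup: points west of longitude -77.0 belong to "county_a".
def toy_locate(point):
    lon, _lat = point
    return "county_a" if lon < -77.0 else "county_b"


points = [(-77.2, 38.9), (-77.1, 38.8), (-77.3, 38.7), (-76.9, 38.9)]
geography = most_frequent_geography(points, toy_locate)
```

With three of the four toy points in "county_a", that county is selected and the remaining point would be dropped from the analysis.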

5. Read in All Geographic and Demographic Data for That Geography

Once the tool identifies a geography, it reads in all the precompiled geographic and demographic data for that geography. These data include the following:

  1. the boundaries for all census tracts in the geography
  2. the American Community Survey (ACS) demographic variables for every tract in the geography
  3. precomputed ACS demographic statistics for the entire geography

For national-level analyses, the tool also reads in:

  1. the boundaries for all states in the US (for geographic disparity score)
  2. precomputed ACS demographic statistics for all states in the US (for demographic disparity score)

For state-level analyses, the tool also reads in:

  1. the boundaries for all counties in the state (for geographic disparity score)
  2. precomputed ACS demographic statistics for all counties in the state (for demographic disparity score)

For more information on the specific ACS demographic variables used in the tool, see Chapter 4.

To read about how custom baseline and demographic datasets can be selected in the API, please see Chapter 10.

6. Compute Which Census Tract Each Data Point Falls Into

The tool spatially joins each point (i.e., row) in the user-uploaded data to the set of all census tracts in the geography. Any points that do not fall within a census tract (that is, points located outside the main geography boundary) are discarded and excluded from the calculations. The tool then determines the proportion of all data points that fall within each subgeography. If a weight variable is provided, this proportion is weighted accordingly.
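As a rough sketch of the proportion calculation, assume the spatial join has already produced one tract ID per point, with None marking points outside the geography. The function name and the data layout are hypothetical.

```python
from collections import defaultdict


def weighted_shares(tract_ids, weights):
    """Return each tract's share of the total weight of retained points."""
    totals = defaultdict(float)
    for tract, weight in zip(tract_ids, weights):
        if tract is None:
            continue  # outside the geography: discarded
        totals[tract] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tract: total / grand_total for tract, total in totals.items()}


# Four points; the last falls outside every tract and is ignored,
# so its weight of 5 does not enter the denominator.
shares = weighted_shares(["t1", "t1", "t2", None], [1, 1, 2, 5])
```

In this toy case each tract holds half of the retained weight, so both shares equal 0.5.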

The tool handles this step differently for the beta travel shed functionality. In this case, the tool reads in the relevant travel shed polygons for the geography. The tool then determines the proportion of the total data points in each tract’s travel shed. In determining this proportion, the tool takes into account the weight variable if it is provided. The tool then joins that weighted count back to the census tract data read in during step 5. Unlike the traditional tool approach, when using travel sheds, the tool does not discard any points.

These differences in approach have two major consequences. First, since the tool does not discard any points, the tool counts points that are not joined to any shed polygon in the denominator when calculating the proportion of total points accessible to each subgeography. Second, a point can fall within multiple travel sheds (isochrones). In that case, the point is deemed accessible to multiple census tracts. For more details on the beta travel shed functionality and how it changes the interpretation of the results, see Chapter 19.
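Both consequences show up in a small sketch. It assumes each point comes with the set of tract IDs whose travel sheds contain it, and that the denominator is the total weight of all points; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def travel_shed_shares(point_sheds, weights):
    """Return, per tract, the share of total weight accessible to it."""
    denominator = sum(weights)  # every point counts, even if in no shed
    if denominator == 0:
        return {}
    totals = defaultdict(float)
    for sheds, weight in zip(point_sheds, weights):
        for tract in sheds:  # a point may be accessible to several tracts
            totals[tract] += weight
    return {tract: total / denominator for tract, total in totals.items()}


# Point 1 lies in both sheds, point 2 only in t1's, points 3 and 4 in none.
point_sheds = [{"t1", "t2"}, {"t1"}, set(), set()]
shed_shares = travel_shed_shares(point_sheds, [1, 1, 1, 1])
```

Here t1 has access to half of the points and t2 to a quarter. Unlike the tract-based shares, these values do not have to sum to 1.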

This is the only portion of the SEDT algorithm that changes for travel shed analyses.

7. Compute Geographic Disparity Scores

The tool calculates geographic disparity scores to assess which areas are over- and underrepresented relative to a baseline population. The geographic disparity scores are calculated and visualized at different subgeographies, depending on the selected geography:

  • National: the tool calculates the geographic disparity score for each state in the US
  • State: the tool calculates the geographic disparity score for each county in the state
  • County: the tool calculates the geographic disparity score for each tract in the county
  • City: the tool calculates the geographic disparity score for each tract in the city

We use different subgeographies for each level of analysis because user feedback indicated that the different geographic levels of analysis required different “subgeographies” to make the geographic disparity score map meaningful (e.g., users suggested that a nationwide tract-level map was not meaningful).

The geographic disparity score is the difference between the proportion of the user-uploaded data within (or, for travel shed analyses, accessible to) that subgeography and the proportion of the main geography’s baseline population within that subgeography. For example, if a state-level analysis is selected, the geographic disparity score is the difference between the proportion of the user-uploaded data within that county and the proportion of the state’s baseline population within that county. We compute this score for the following baseline populations:

  1. total population
  2. population with low income
  3. population with extremely low income
  4. population without internet access
  5. population of cost-burdened renter households
  6. senior (≥ 65) population
  7. child (< 18) population

For the state-level analysis example, if a county contains 3.3 percent of the user-uploaded dataset but 1.4 percent of the state’s senior population, then the geographic disparity score for the senior population in that county would be 3.3 − 1.4, or 1.9 percentage points. The tool repeats this calculation for every subgeography and for every baseline population. These geographic disparity scores are displayed by the interactive maps on the web tool.
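The arithmetic in this example can be reproduced with a short sketch. The county names and the second county's figures are made up for illustration, and shares are expressed in percent.

```python
def geographic_disparity_scores(data_shares, baseline_shares):
    """Return data share minus baseline share for each subgeography.

    A subgeography that holds no uploaded points has a data share of 0.
    """
    return {
        geo: data_shares.get(geo, 0.0) - baseline
        for geo, baseline in baseline_shares.items()
    }


# Share of uploaded points and of the state's senior population, in percent.
data_shares = {"county_a": 3.3, "county_b": 96.7}
senior_shares = {"county_a": 1.4, "county_b": 98.6}

scores = geographic_disparity_scores(data_shares, senior_shares)
# scores["county_a"] is about 1.9 percentage points (overrepresented);
# scores["county_b"] is about -1.9 percentage points (underrepresented).
```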

See Chapter 7 for more information on how to interpret the geographic disparity scores.

8. Compute Demographic Disparity Scores

The demographic disparity score is the percentage point difference between the representation of a demographic group in the data (the data-implied percentage) and the representation of that group in the geography (the geography-wide percentage). Regardless of the selected geography, for non-travel shed analyses, the demographic disparity score is calculated using tract-level data. For travel shed analyses, the data-implied percentage is calculated using travel shed, not tract, boundaries. The geography-wide percentage is identical for both approaches and is calculated at the tract scale. We consider the tract or travel shed levels the appropriate units of analysis to assess the demographics of the individuals with access to a given data point. We illustrate how we calculate this metric through the example of calculating the demographic disparity score for Black residents in a simple geography with two census tracts.

The first step is to calculate the geography-wide percentage, which answers the question, “What is the share of Black residents in the entire geography?” For city-level analysis, we calculate the geography-wide percentages from the tract-level census data for the tracts in the city. To illustrate, assume that our example geography is a city and that each of the two tracts is home to 50 percent of the city’s population. If tract 1 is 20 percent Black and tract 2 is 40 percent Black, then the geography-wide (in this case, citywide) percentage of Black residents is (0.5)(0.2) + (0.5)(0.4) = 0.3. For county, state, and national analysis, the geography-wide percentages are directly reported by the census, so we can imagine the census reported that 30 percent of the population of our geography is Black.

Now imagine 80 percent of the data uploaded by the user are located in tract 1 and 20 percent are located in tract 2. Then the data-implied percentage of Black residents would be (0.8)(0.2) + (0.2)(0.4) = 0.24. This is calculated identically across national-, state-, county-, and city-level analysis. The data-implied percentage of Black residents answers the question, “What is the share of Black residents in the tracts that the data come from?”

Finally, the demographic disparity score is the difference between the data-implied percentage and the geography-wide percentage, or 0.24 − 0.3 = −0.06. In this example, Black residents are underrepresented by 6 percentage points in the data relative to the geography population. These demographic disparity scores are calculated for all our demographic variables of interest (see Chapter 4 for a full list) and visualized by the interactive chart on the tool.
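The two-tract example above can be checked with a short sketch. The variable and function names are illustrative; the numbers are the ones used in the text.

```python
def weighted_percentage(shares, group_pcts):
    """Average the tract-level group percentages, weighted by `shares`."""
    return sum(shares[tract] * group_pcts[tract] for tract in shares)


pct_black = {"tract_1": 0.2, "tract_2": 0.4}          # share Black per tract
population_shares = {"tract_1": 0.5, "tract_2": 0.5}  # share of city population
data_shares = {"tract_1": 0.8, "tract_2": 0.2}        # share of uploaded points

geography_wide = weighted_percentage(population_shares, pct_black)  # about 0.30
data_implied = weighted_percentage(data_shares, pct_black)          # about 0.24
disparity = data_implied - geography_wide                           # about -0.06
```

The negative result of roughly 6 percentage points matches the underrepresentation described above. For county, state, and national analyses, `geography_wide` would instead come directly from census-reported figures.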

For some demographic variables, we use the appropriate universe (or denominator) instead of the total population for the geography-wide calculations. For example, to generate the demographic disparity score for veterans, we use the civilian population 18 years and older as the denominator. For more information on the universes used for each demographic variable, see Chapter 4.

For the state- and national-level analyses in the web tool only, we choose to show the demographic disparity score for the relevant geography (state or US) and the demographic disparity score for the smaller subgeographies that make up the main geography (counties and states, respectively). We describe how we calculate these demographic disparity scores for each level of analysis:

  • National: We calculate the national demographic disparity scores as well as the demographic disparity scores for each state. To calculate the scores for each state, the geography-wide percentage is the census-published share of the given demographic group in the state (e.g., the share of Black residents in Illinois). The data-implied percentage is calculated as described above for the subset of points that fall in each state (e.g., the share of Black residents in the tracts that the data in Illinois come from). The state disparity score is the difference between the data-implied percentage and the geography-wide percentage for each state.

  • State: We calculate the state demographic disparity scores as well as the demographic disparity scores for each county in the state. To calculate the scores for each county, the geography-wide percentage is the census-published share of the given demographic group in the county (e.g., the share of Black residents in Cook County). The data-implied percentage is calculated as described above for the subset of points that fall in each county (e.g., the share of Black residents in the tracts that the data in Cook County come from). The county disparity score is the difference between the data-implied percentage and the geography-wide percentage for each county.

Note that in both cases, we do not display disparity scores for subgeographies that contain no points in the user-uploaded dataset. The subgeography disparity scores are displayed in gray in the web tool’s chart.

See Chapter 7 for more information on how to interpret the demographic disparity scores.

9. Compute Statistical Significance

Both the demographic and the geographic disparity scores above rely on ACS demographic variables, which have associated margins of error. We use the margins of error for the relevant ACS variables to construct 95 percent confidence intervals for both the geographic and the demographic disparity scores and determine whether the scores differ significantly from 0. To calculate the margins of error, we use the formulas for user-derived proportions and percentages from a recent ACS handbook (US Census Bureau 2020). We perform this significance calculation for all our demographic and geographic disparity scores. Scores that are not statistically significant are shown in dark gray in the web tool to differentiate them from significant scores. For more detailed information, see Chapter 14.
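As a rough illustration of the building blocks, the sketch below implements the Census Bureau's approximation for the margin of error of a derived proportion and a simple check of whether a score's 95 percent interval excludes 0. How the tool combines these pieces into a margin of error for each disparity score is described in Chapter 14 and is not reproduced here; the function names, the input figures, and the assumption that the score's margin of error is supplied at the 90 percent level used for published ACS margins of error are all illustrative.

```python
import math

Z_90 = 1.645  # published ACS margins of error use a 90 percent level
Z_95 = 1.96


def proportion_moe(x, moe_x, y, moe_y):
    """Approximate MOE of the proportion x / y, where x is a subset of y."""
    p = x / y
    radicand = moe_x ** 2 - (p ** 2) * moe_y ** 2
    if radicand < 0:
        # Fall back to the ratio formula when the radicand is negative.
        radicand = moe_x ** 2 + (p ** 2) * moe_y ** 2
    return math.sqrt(radicand) / y


def is_significant(score, moe_90):
    """Return True if the 95 percent interval around score excludes 0."""
    moe_95 = moe_90 * Z_95 / Z_90
    return abs(score) > moe_95


# Hypothetical tract: 200 (+/- 50) group members out of 1,000 (+/- 100).
tract_moe = proportion_moe(200, 50, 1000, 100)

# Hypothetical score MOE of 0.04, treated as a given input.
significant = is_significant(-0.06, 0.04)
```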