6 Resource Datasets
Uploading Your Own Data
What data can I use with the Spatial Equity Tool?
You can use any CSV file with geographic point location data as long as it satisfies the following requirements:
- The file must have column headers in the first row.
- Two columns must correspond to longitude and latitude (in the EPSG:4326 or WGS84 coordinate reference system).
- The data file must be smaller than 200 MB.
- The geographic point locations must be from the US (50 states plus the District of Columbia).
- The file should use UTF-8, UTF-16, or ISO-8859-1 (i.e., Latin-1) encoding. For help saving your CSV with UTF-8 encoding, please see this web page.
- If you have point data in a shapefile (.shp), you can convert that file to a CSV using QGIS or this simple R script.
Some preprocessing may be necessary to manipulate your data to ensure these requirements are met.
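As a rough illustration of that preprocessing, the sketch below checks a CSV against some of the requirements listed above before upload. It is not the tool's own validation code. The function name `check_point_csv`, the byte definition of 200 MB, and the sample column names are assumptions made for this example. It does not check that points fall inside the US.

```python
import csv
import io

# Assumed byte definition of the 200 MB limit (hypothetical; the tool may count differently).
MAX_BYTES = 200 * 1024 * 1024


def check_point_csv(text, lon_col, lat_col):
    """Return a list of problems found in CSV text; an empty list means none found."""
    problems = []
    if len(text.encode("utf-8")) >= MAX_BYTES:
        problems.append("file is not smaller than 200 MB")

    # The first row is treated as the header row.
    reader = csv.DictReader(io.StringIO(text))
    headers = reader.fieldnames or []
    for col in (lon_col, lat_col):
        if col not in headers:
            problems.append(f"missing column: {col}")
    if problems:
        return problems

    for row_no, row in enumerate(reader, start=1):
        try:
            lon, lat = float(row[lon_col]), float(row[lat_col])
        except (TypeError, ValueError):
            problems.append(f"data row {row_no}: non-numeric coordinate")
            continue
        # Valid EPSG:4326 ranges; this does not test for US locations.
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"data row {row_no}: coordinate outside valid range")
    return problems


sample = "name,lon,lat\nLibrary A,-77.04,38.90\nLibrary B,-77.01,138.90\n"
print(check_point_csv(sample, "lon", "lat"))
```

A range check like this catches some swapped or mistyped coordinates, but not all of them, so it is still worth mapping a few points by eye.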
I have a dataset of polygons (e.g., census blocks). How can I use it with this tool?
To use a polygon dataset with this tool, you need to assign a geographic (longitude/latitude) point to each polygon. We recommend doing this only with polygons that map cleanly to census tracts: namely tracts, block groups, and blocks. We encourage the following workflow:
- Choose a polygon dataset with data at either the census tract scale or a smaller level of geography.
- The dataset should include one variable to analyze in the tool, and it should also contain geographic data.
- Convert each polygon in the dataset to an internal point.
- We recommend using internal points as opposed to centroids because centroids are not guaranteed to be inside a polygon.
- We recommend the following methods to generate internal points:
- R: st_point_on_surface() from the sf package
- Python: representative_point() from the GeoPandas package
- ArcGIS: the label point functionality
- QGIS: point in polygon functionality
- Perform any other necessary preprocessing.
- The data must be saved as a CSV with columns for latitude, longitude, and the variable to analyze.
- Upload the data to the tool either using the GUI or the API:
- Provide the latitude and longitude columns as normal.
- Use the variable to analyze as the weight variable.
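The conversion step in the workflow above can be sketched in Python with GeoPandas. This is a minimal example under stated assumptions, not a prescribed implementation. The helper name `polygons_to_internal_points`, the `households` column, and the U-shaped test polygon are all made up for illustration. The polygon is chosen so that its centroid falls outside its boundary, which is the case internal points are meant to avoid.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon


def polygons_to_internal_points(gdf, value_col):
    """Return one longitude/latitude row per polygon, keeping value_col."""
    # Reproject to EPSG:4326 so x/y are longitude/latitude (gdf must have a CRS set).
    gdf = gdf.to_crs(epsg=4326)
    # representative_point() is guaranteed to lie within each polygon.
    points = gdf.geometry.representative_point()
    return pd.DataFrame({
        "longitude": points.x,
        "latitude": points.y,
        value_col: gdf[value_col],
    })


# Hypothetical U-shaped polygon: its centroid sits in the notch, outside the shape.
u_shape = Polygon([
    (-77.03, 38.90), (-77.00, 38.90), (-77.00, 38.93), (-77.01, 38.93),
    (-77.01, 38.91), (-77.02, 38.91), (-77.02, 38.93), (-77.03, 38.93),
])
tracts = gpd.GeoDataFrame({"households": [120]}, geometry=[u_shape], crs="EPSG:4326")

points_df = polygons_to_internal_points(tracts, "households")
internal = Point(points_df.loc[0, "longitude"], points_df.loc[0, "latitude"])
print(u_shape.contains(u_shape.centroid), u_shape.contains(internal))

# To save for upload (file name is illustrative):
# points_df.to_csv("tract_points.csv", index=False)
```

When uploading the resulting CSV, `households` would be supplied as the weight variable.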
How does the tool treat null values?
Null values in the longitude/latitude columns, the weight column, or any of the selected filter columns will cause the tool to discard that row. Our tool uses the Pandas default CSV reader, which treats the following values as NA:
- (empty string, i.e., blank values)
- #N/A
- #N/A N/A
- #NA
- -1.#IND
- -1.#QNAN
- -NaN
- -nan
- 1.#IND
- 1.#QNAN
- <NA>
- N/A
- NA
- NULL
- NaN
- n/a
- nan
- null
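To preview which rows would be discarded, you can approximate this behavior locally with Pandas. The sketch below is an assumption about how the described behavior could be reproduced, not the tool's actual code. The column names and the helper `drop_incomplete_rows` are hypothetical.

```python
import io

import pandas as pd


def drop_incomplete_rows(csv_text, required_columns):
    """Read CSV text with Pandas defaults and drop rows with NA in required columns."""
    # Default read_csv settings treat values such as "", "NA", "n/a", and "NULL" as NA.
    df = pd.read_csv(io.StringIO(csv_text))
    return df.dropna(subset=required_columns)


sample_csv = (
    "lon,lat,weight,note\n"
    "-77.04,38.90,3,a\n"
    "NA,38.91,2,b\n"       # NA longitude: dropped
    "-77.02,n/a,1,c\n"     # NA latitude: dropped
    "-77.01,38.93,,d\n"    # blank weight: dropped
    "-77.00,38.94,5,NULL\n"  # NA only in a column that is not required: kept
)

kept = drop_incomplete_rows(sample_csv, ["lon", "lat", "weight"])
print(len(kept), "of 5 rows kept")
```

Only the columns you select for longitude, latitude, weights, and filters belong in `required_columns`; NA values in other columns do not cause a row to be dropped in this sketch.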
My dataset only has addresses, not longitude and latitude. What do I do?
To use this tool, you need to geocode the addresses, that is, assign each one a longitude-latitude point. You can find more information about available geocoders and factors to consider when selecting a geocoder here.
My file is larger than 200 MB. What do I do?
First, try getting rid of unnecessary columns. The only columns the tool needs are your longitude and latitude columns and any columns you are using for filters and weights. If your file size is still over 200 MB, we recommend taking a random sample of your data and uploading that to the tool.
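Both steps can be done in one pass over the file without loading it all into memory. The following is a minimal sketch using only the Python standard library. The function name `shrink_csv`, its parameters, and the file names in the comments are assumptions for illustration. Each row is kept independently with probability `sample_fraction`, so the output size is approximate rather than exact.

```python
import csv
import io
import random


def shrink_csv(src, dst, keep_columns, sample_fraction=1.0, seed=0):
    """Copy only keep_columns from src to dst, keeping a random share of rows.

    src and dst are open text file objects. Returns the number of rows written.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops columns not listed in keep_columns.
    writer = csv.DictWriter(dst, fieldnames=keep_columns, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for row in reader:
        if rng.random() < sample_fraction:
            writer.writerow(row)
            kept += 1
    return kept


# Small in-memory demonstration.
source = io.StringIO("id,lon,lat,notes\n1,-77.0,38.9,x\n2,-77.1,38.8,y\n")
target = io.StringIO()
print(shrink_csv(source, target, ["lon", "lat"]))

# With real files (names are illustrative):
# with open("big.csv", newline="", encoding="utf-8") as src, \
#         open("small.csv", "w", newline="", encoding="utf-8") as dst:
#     shrink_csv(src, dst, ["lon", "lat", "weight"], sample_fraction=0.25)
```

Remember to include any filter and weight columns in `keep_columns`, since dropped columns cannot be selected in the tool afterward.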
Where can I find data to use with the tool?
For city- and county-level datasets, a great place to start is municipal open data portals. All of our sample city-level and county-level datasets come from such portals. The US City Open Data Census, created by the Open Knowledge Foundation and the Sunlight Foundation, gives an overview of the numerous city- and county-level datasets available. Data.gov, a central repository for the US government’s open data, also maintains a list of state, city, and county open data sites. These state open data portals are a great place to look for state-level data. If you are interested in a specific city, county, or state not listed in these centralized locations, plugging “[geography name] open data portal” into a search engine is a good next step.
For national-level data, a great place to start is Data.gov. If you are interested in data from a specific federal agency, we recommend searching their website. Several sample datasets come from such websites, including electric vehicle charging stations (US Department of Energy), public libraries (Institute of Museum and Library Services), Low-Income Housing Tax Credit projects (US Department of Housing and Urban Development), and substance-use and mental health facilities (the Substance Abuse and Mental Health Services Administration within the US Department of Health and Human Services). Note that we feature the Low-Income Housing Tax Credit projects and substance-use and mental health facilities data at the state level, illustrating how nationwide data can be filtered to examine results for a specific state.
Can I upload confidential or private data to this tool?
Per our terms of use, you should not upload confidential, private, or sensitive data to this tool. All user-uploaded data and results are stored in publicly accessible cloud storage. Although it is unlikely, another user or a bad actor could access and download your uploaded data. If you have confidential data you would like to run through our tool, please reach out to sedt@urban.org.
I received a warning that my data are only located in a few states (for national-level analysis) or counties (for state-level analysis). What does that mean?
This warning indicates that your data points fall within fewer than 50 percent of the states in the US (for national-level analysis) or fewer than 50 percent of the counties in the identified state (for state-level analysis). For both the geographic and the demographic disparity analyses, we compare the distribution of your data against the baseline population distribution in the full geography. For example, if your data pertain only to the northeast region of the United States but you selected national-level analysis, the geographic disparity map would likely show all of the northeast states as significantly overrepresented because the share of data points in each state would be compared against that state’s share of the entire US population (instead of its share of the northeast region population, which would be more appropriate).
If you get this warning, ask yourself whether your dataset is intended to represent the entire geography in question. If the answer is no, we recommend selecting a smaller geographic level of analysis, which will by default run the analysis for the most frequently occurring geography in your data.
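The effect described above can be seen with a small worked example. The sketch below uses made-up populations and point counts for three hypothetical areas, and it uses a simple difference of shares as a stand-in for the comparison. The tool's actual disparity statistic and its count of total states may differ, so `total_units` is left as a parameter.

```python
def coverage_warning(observed_units, total_units, threshold=0.5):
    """True when the data touch less than `threshold` of the units in the geography."""
    return len(set(observed_units)) / total_units < threshold


def share_gaps(point_counts, populations):
    """Share of data points minus share of population, for each area in populations."""
    total_points = sum(point_counts.values())
    total_pop = sum(populations.values())
    return {
        area: point_counts.get(area, 0) / total_points - populations[area] / total_pop
        for area in populations
    }


# Hypothetical data: all points fall in areas A and B; C has none.
points = {"A": 50, "B": 50}
full_geography = {"A": 10, "B": 30, "C": 60}  # populations in the full geography
region_only = {"A": 10, "B": 30}              # populations in the region the data cover

print(share_gaps(points, full_geography))
print(share_gaps(points, region_only))
print(coverage_warning(["NY", "NJ", "CT"], total_units=51))
```

Against the full geography, both A and B look overrepresented. Against the region the data actually cover, B turns out to be underrepresented, which is why a smaller level of analysis is more appropriate for such data.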
Using Sample Data
How do I use the sample data?
We recommend that new users start by using one of our sample datasets to understand the functionality of the tool before uploading custom data. On the website, we selected sample datasets for each geographic level that we feel represent the kinds of data users are likely to upload and that demonstrate the tool’s functionality.
Taking the city-level sample datasets as examples, the New York City Wi-Fi hotspot sample dataset demonstrates how changing the baseline dataset (from the default of total population to the population without internet access) enables users to evaluate the distribution of a resource relative to a specific target user.
The New Orleans 311 dataset shows how using the filter functionality can help users focus on a specific subset of data, in this case requests logged between January 1, 2014, and January 1, 2019. Finally, the Minneapolis bike share station data illustrate how using weights affects the results. Weighting the data by the number of bikes available lets the analysis account for each station serving a different number of people.
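To see why weights matter, consider how each area's share of the resource changes once stations are no longer counted equally. The sketch below uses invented stations, tract labels, and capacities. The helper `shares_by_area` is hypothetical and only illustrates the idea of weighting, not the tool's internal calculation.

```python
from collections import defaultdict


def shares_by_area(records, area_key, weight_key=None):
    """Share of the resource in each area; each record counts as 1 unless weighted."""
    totals = defaultdict(float)
    for record in records:
        totals[record[area_key]] += float(record[weight_key]) if weight_key else 1.0
    grand_total = sum(totals.values())
    return {area: total / grand_total for area, total in totals.items()}


# Hypothetical stations: two small ones in tract T1, one large one in tract T2.
stations = [
    {"tract": "T1", "capacity": 10},
    {"tract": "T1", "capacity": 10},
    {"tract": "T2", "capacity": 30},
]

print(shares_by_area(stations, "tract"))
print(shares_by_area(stations, "tract", weight_key="capacity"))
```

Unweighted, T1 holds most of the stations. Weighted by capacity, T2 holds most of the bikes, so conclusions about which area is better served can reverse.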
For the national, state, county, and city-level analyses, we have selected the following sample datasets and preset some advanced options:
National
- Electric-vehicle charging stations
- Public libraries (preset weight by the public service hours per year, or the HOURS variable)
State
- Substance-use and mental health facilities in Washington (preset filter State variable to “WA”)
- Low-Income Housing Tax Credit projects in Alabama (preset weight by number of low-income units, or the LI_UNITS variable)
County
- Playgrounds in Miami-Dade County, Florida (preset filter TOTLOT variable to “Yes” to subset to playgrounds)
- Polling places in Bucks County, Pennsylvania
City
- Public Wi-Fi hotspots in New York City, New York
- 311 requests in New Orleans, Louisiana (preset filter city variable to “New Orleans” and preset filter ticket_created_date_time to dates between 01/01/2014 and 01/01/2019)
- Bike-share stations in Minneapolis, Minnesota (preset weight by station maximum capacity)
For more information on these particular sample datasets and how we compiled them, please see our Urban Institute Data Catalog entry.