Data Processing & Analysis Guidance
This work was created by Ashlin Oglesby-Neal, Lily Robin, Emily Tiry, and Aaron R. Williams with support from Urban’s 2021 Seed Fund.
Use this guidance manual to learn about each step in processing and analyzing data. A corresponding worksheet for documenting the process of cleaning, preparing, and analyzing data for your project is available here. Code templates for R and Stata are available on GitHub.
Many projects at Urban use quantitative data, which can come from a variety of sources. In many cases, the data needs some degree of cleaning and processing to convert it into an appropriate format for analysis. The purpose of this manual and its corresponding worksheet is to provide guidance for processing and analyzing data that can be used across Urban by analysts at any level of experience to promote more consistent and reproducible data analysis. The worksheet can then be shared with others on the project (e.g., the PI, the code reviewer) to ensure everyone is on the same page. It can also be used to help onboard new staff onto an ongoing project.
How to Use the Worksheet
This section describes how to fill out each section of the worksheet and provides context about why each step is important.
Purpose
One of the most important steps is to make sure you understand the research questions or objectives of the project and how they translate into the specific analyses you’ll be doing. Understanding these things is crucial to developing an appropriate analysis plan and ensuring you have all the data required to complete the analyses. From there, it’s important to translate the analysis plan into the planned deliverables (i.e., which results are planned to be presented in each deliverable). Ideally, this step would be completed before beginning any data processing. At this stage, you may also want to consult the Urban Institute Guide for Racial Equity in the Research Process.
Data
Acquiring Data
An analyst can acquire data in many different ways. Data can be microdata or pre-tabulated data. For example, Census microdata can be accessed through IPUMS, and pre-tabulated Census data can be accessed through the Census Bureau. Some data sets are confidential and require data use agreements that outline appropriate uses of the data and data storage requirements. Ensure that your data processing meets the requirements of all data use agreements and the project’s IRB. When possible, programmatically access data instead of pointing and clicking; this saves work and is more reproducible. For example, Census data can be accessed through the Census API and Education Data Portal data can be accessed through an R package. When possible, avoid storing data in .xlsx or .xls formats, which are unstable and can create errors in analyses. Once you’ve acquired your data, document its key characteristics: the filename, where and when you acquired it, where it’s stored, the sample or population included in the file, and the unit of observation.
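If you work in R, programmatic access might look like the sketch below. It assumes the tidycensus and educationdata packages are installed and that you have a Census API key; the variable ID, filters, and key placeholder are illustrative and should be adapted to your project.

    # Hedged sketch of programmatic data access in R; package choices, the API
    # key placeholder, and the variable ID are illustrative assumptions.
    library(tidycensus)
    library(educationdata)

    # Pull median household income (ACS table B19013) by state
    census_api_key("YOUR-API-KEY")  # placeholder; use your own key
    acs_income <- get_acs(
      geography = "state",
      variables = "B19013_001",
      year = 2019
    )

    # Pull school directory data from the Education Data Portal
    schools <- get_education_data(
      level = "schools",
      source = "ccd",
      topic = "directory",
      filters = list(year = 2019)
    )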
Sample
Examine each data file to make sure you understand the parameters of the data sample. If you requested the data from an organization (such as a government agency or a service provider), go back to the data request to make sure what you received is consistent with what you requested. For example (a brief code sketch of these checks follows):
Should the data be from a certain time period? Check the relevant date variables to make sure the data are within that period.
Should the data be from specific locations? Check the relevant location variables to make sure the data are within those locations.
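A minimal sketch of these checks in R, assuming a hypothetical service-level data frame named services with illustrative service_date and state columns:

    # Hedged sketch of sample checks; the data frame and column names are
    # illustrative assumptions.
    library(dplyr)

    # Confirm the dates fall within the requested period
    services %>%
      summarize(
        earliest = min(service_date, na.rm = TRUE),
        latest   = max(service_date, na.rm = TRUE)
      )

    # Confirm only the requested locations appear
    services %>%
      count(state)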
Unit of Observation
Examine each data file to make sure you understand the unit of observation of each file. This is particularly important when receiving multiple related data files that you need to combine, as often happens when working with administrative data. For example, a set of administrative data may include a file describing the people enrolled in a program, a file describing each service those people received, and a file with pre- and post-assessments for each person.
Each row of data represents one observation unit. One way to think about this is that the unit of observation is the level at which the observations are unique. Most administrative data files will have identifier variables to help determine this. Say you’re working on a project where you receive the three data files described above. If the first file’s unit of observation is a person, then the person ID variable should be unique within the file (i.e., no duplicates). In contrast, each person may have received multiple services, so in the service-level file, the person ID variable may repeat. However, the service ID variable should be unique within the person ID. Similarly, in the assessment-level file, each person should have two assessments, so their person ID will repeat but the assessment ID should be unique within the observations for a specific person. See the tables below for an example of what this could look like.
When assessing the unit of observation within a dataset, it can be helpful to visually inspect the data and to check for duplicates, both overall and by various combinations of ID and date variables, as in the sketch below.
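A minimal sketch of these duplicate checks in R, assuming hypothetical person-level and service-level data frames named people and services:

    # Hedged sketch of duplicate checks; the data frame and ID variable names
    # are illustrative assumptions.
    library(dplyr)

    # person_id should be unique in the person-level file (expect zero rows)
    people %>%
      count(person_id) %>%
      filter(n > 1)

    # person_id can repeat in the service-level file, but the combination of
    # person_id and service_id should be unique (expect zero rows)
    services %>%
      count(person_id, service_id) %>%
      filter(n > 1)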
Data Processing
The data processing step is where you take stock of the data you have and plan out how to convert it into the format needed for analysis. Below are several examples of things to consider and steps that may be required during processing.
Variables
A variable is a characteristic of observations in the data. Variables should be in columns. Take stock of what variables are in the data and whether you have all the variables you need, either to use for analysis as they are, to recode, or to use to create new variables. Make sure you understand what each variable represents and that the values of the variables match your expectations. If you received any documentation along with the data, check to make sure that what is in the data matches what is described in the documentation. Check to ensure that the variable formats make sense. For example, it is best to store dates in the ISO 8601 format (YYYY-MM-DD). It can be helpful to maintain a codebook of variables, their format (e.g., numeric, string), and their definitions. Maintaining a codebook can help you keep track of the variables in the data and encourage consistency across datasets and/or staff members working with the data.
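A minimal sketch of these variable checks in R, assuming a hypothetical data frame named enrollments with a date stored as text in MM/DD/YYYY format:

    # Hedged sketch of variable checks; the data frame, column names, and date
    # format are illustrative assumptions.
    library(dplyr)
    library(lubridate)

    # Review variable names, types, and example values
    glimpse(enrollments)

    # Convert a date stored as text (e.g., "03/15/2021") to a Date in ISO 8601
    enrollments <- enrollments %>%
      mutate(enroll_date = mdy(enroll_date))

    # Start a simple codebook of variable names and types
    codebook <- data.frame(
      variable = names(enrollments),
      type = vapply(enrollments, function(x) class(x)[1], character(1))
    )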
Aggregation
In some cases, you may need to aggregate data, depending on the unit of observation that you want for analysis and the unit of observation of the file. For example, if you want your analysis to be at the person level but you have service-level data, you will need to aggregate that data up to the person level (e.g., number of services received).
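A minimal sketch of this kind of aggregation in R, assuming a hypothetical service-level data frame named services:

    # Hedged sketch of aggregating service-level records to one row per person;
    # the data frame and column names are illustrative assumptions.
    library(dplyr)

    person_services <- services %>%
      group_by(person_id) %>%
      summarize(
        n_services    = n(),
        first_service = min(service_date, na.rm = TRUE)
      )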
Missingness
Data are often missing in administrative and survey data files. Structural missingness, when observations are missing for explainable reasons, such as skip patterns in a questionnaire (e.g., a question is only asked of respondents ages 16+), can usually be explained by consulting the data documentation.
Assessing the level and cause of missingness in the data is important for appropriately analyzing it. If a large proportion of observations are missing on a variable, it may not be usable for your analysis. Likewise, if the data are missing systematically rather than randomly (e.g., all observations from a given year are missing), it could bias your results to use the data as-is. There are commands that can be used to check for missingness, but it is also important to check variable values to see if a placeholder (e.g., “N/A,” “NULL,” “9999,” “01/01/1900”) is used to denote missingness for a variable.
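A minimal sketch of missingness checks in R, assuming a hypothetical data frame named assessments and an illustrative placeholder code of 9999 in a numeric score variable:

    # Hedged sketch of missingness checks; the data frame, column names, and
    # placeholder value are illustrative assumptions.
    library(dplyr)

    # Count missing values in every column
    assessments %>%
      summarize(across(everything(), ~ sum(is.na(.x))))

    # Recode a numeric placeholder to NA before analysis
    assessments <- assessments %>%
      mutate(score = if_else(score == 9999, NA_real_, score))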
Out of Range Values
Often in administrative data files, there are some variable values that are outside the likely, or even possible, range for a variable. Prior to analyzing data, it is important to assess the values of the variables used in analysis. Some tools that can help with this assessment include tabulations, summary statistics, and histograms.
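A minimal sketch of these range checks in R, assuming a hypothetical data frame named assessments with numeric age and score variables:

    # Hedged sketch of range checks; the data frame and column names are
    # illustrative assumptions.
    library(dplyr)
    library(ggplot2)

    # Summary statistics flag implausible values (e.g., negative ages)
    assessments %>%
      summarize(
        min_age = min(age, na.rm = TRUE),
        max_age = max(age, na.rm = TRUE)
      )

    # A histogram shows the distribution and any extreme values
    ggplot(assessments, aes(x = score)) +
      geom_histogram()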
Merging/Joining and Appending/Binding
If you acquired more than one related data file, you will likely need to combine them for analysis. Merging (Stata) / joining (R) adds variables (columns) from one file to another file. Usually there is a unique identifier variable that is used to link the unique observations in each file together. Determine which variables are in common across the data files (usually ID variables) and whether the files are at the appropriate unit of observation. Appending (Stata) / binding (R) is the process of adding observations (rows) from one file to another file. Check that variable names, variable types, and units of observation are consistent between the two files before appending / binding. Always check the data after merging / joining or appending / binding to ensure that it occurred as you intended.
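A minimal sketch of joining and binding in R, assuming hypothetical data frames named people, services, services_2020, and services_2021 that share the ID variables described above:

    # Hedged sketch of joining and binding; the data frame and ID variable
    # names are illustrative assumptions.
    library(dplyr)

    # Join person-level variables onto the service-level file by person_id
    services_with_people <- services %>%
      left_join(people, by = "person_id")

    # Check for service records with no matching person
    anti_join(services, people, by = "person_id")

    # Stack two files with the same variables into one
    services_all <- bind_rows(services_2020, services_2021)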
Other Manipulation
Check whether any other manipulation of the data is needed. One example might be developing weights to use in a survey analysis.
Resources
For more detailed information on data processing, see the DIME Analytics Data Handbook chapter on Cleaning and Processing Research Data and the Urban Institute Primer on Data Quality: Getting it Right Workshop participant guide. When processing the data and moving into analysis and reporting, you may want to review Urban’s Diversity, Equity, and Inclusion Toolkits for guidance on how to inclusively and respectfully conduct research and write about specific communities.
Analysis
Once you’ve processed your data, it’s time to analyze it. There are many different analyses you could do, but which ones you should do will depend on the research questions or objectives you determined at the beginning of this process. Several common analyses and statistical tests are listed below, along with links to additional resources where you can learn more about them.
Descriptive
These analyses are meant to describe the data. You should always do these types of analyses on new data to get a sense of the distribution of variables of interest and identify outliers. Understanding these characteristics is often important for choosing the appropriate additional analyses. Examples include:
Frequencies
Mean or median
Standard deviation
Minimum and maximum
When possible, it is best practice to calculate standard errors for statistics or confidence intervals for parameters.
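A minimal sketch of these descriptive statistics in R, assuming a hypothetical data frame named assessments with a numeric score variable and a categorical assessment_type variable:

    # Hedged sketch of descriptive statistics; the data frame and column names
    # are illustrative assumptions.
    library(dplyr)

    # Summary statistics for a numeric variable
    assessments %>%
      summarize(
        mean_score   = mean(score, na.rm = TRUE),
        median_score = median(score, na.rm = TRUE),
        sd_score     = sd(score, na.rm = TRUE),
        min_score    = min(score, na.rm = TRUE),
        max_score    = max(score, na.rm = TRUE)
      )

    # Frequencies of a categorical variable
    assessments %>%
      count(assessment_type)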
Graphs
Graphs are helpful for showing results visually and are great for communicating results in reports or presentations. Urban’s Data Visualization User Group is a good resource for learning more about how to create effective graphs and charts, and the R User Group has a great guide for creating graphics in R with many examples of different types of charts. Common chart types include:
Bar charts
Line charts
Scatter plots
Histograms
Urban has templates to format graphs in Urban branding in Excel and R. You may also want to consult the Do No Harm Guide: Applying Equity Awareness in Data Visualization guide when thinking through how to respectfully visualize data.
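A minimal sketch of two of these chart types in R with ggplot2, assuming a hypothetical data frame named services with illustrative service_type and service_hours variables; Urban’s R template can then be layered on for branding:

    # Hedged sketch of basic charts with ggplot2; the data frame and column
    # names are illustrative assumptions.
    library(ggplot2)

    # Bar chart: number of services by type
    ggplot(services, aes(x = service_type)) +
      geom_bar()

    # Histogram: distribution of service length
    ggplot(services, aes(x = service_hours)) +
      geom_histogram()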
Maps
Another way of showing results visually (specifically, in relation to geography) is mapping. You can learn more about mapping through Urban’s Mapping User Group and the R User Group guide to mapping in R. Urban has templates to format maps in Urban branding in R. Common map types include the following (see the sketch after this list):
Choropleth maps
Tile grid maps
Bubble maps
Dot-density maps
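A minimal sketch of a choropleth map in R, assuming the sf package and a hypothetical county shapefile with a rate variable to map:

    # Hedged sketch of a choropleth map; the file path and rate variable are
    # illustrative assumptions.
    library(sf)
    library(ggplot2)

    counties <- st_read("data/counties.shp")  # hypothetical shapefile

    ggplot(counties) +
      geom_sf(aes(fill = rate)) +
      scale_fill_viridis_c()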
Statistical Tests
Projects often aim to examine the statistical relationship between variables. The most appropriate statistical test depends on your research question and the characteristics of your data. The UCLA Institute for Digital Research and Education has a helpful table to guide your choice of statistical test. Bivariate tests examine the relationship between two variables, whereas multivariate tests can examine the relationship between multiple variables at the same time. Examples of each include:
Bivariate
t-test (independent two-sample or paired)
Chi-square test
Multivariate
Ordinary least squares (OLS) regression
Logistic regression
Each statistical test has important assumptions, and the inferences drawn from the tests will be invalid if those assumptions are not met.
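A minimal sketch of these tests in base R, assuming a hypothetical person-level data frame named people with illustrative score, age, treatment, and completed variables:

    # Hedged sketch of common statistical tests; the data frame and variable
    # names are illustrative assumptions.

    # Two-sample t-test: does mean score differ by treatment group?
    t.test(score ~ treatment, data = people)

    # Chi-square test of independence between two categorical variables
    chisq.test(table(people$treatment, people$completed))

    # Ordinary least squares (OLS) regression
    ols_model <- lm(score ~ treatment + age, data = people)
    summary(ols_model)

    # Logistic regression for a binary outcome
    logit_model <- glm(completed ~ treatment + age, data = people,
                       family = binomial())
    summary(logit_model)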
Weights
Data sets from complex surveys often contain observation weights to calculate estimates that represent the populations of interest. Estimates, like means, and their standard errors will be incorrect if they do not account for these weights. Consult the data documentation to understand the nature of these weights and work with experienced researchers to ensure that the software accounts for the correct estimator and type of weight (e.g., replicate weights).
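A minimal sketch of weighted estimation in R with the survey package, assuming a hypothetical data frame named svy_data with illustrative weight, stratum, and PSU variables; consult the data documentation for the actual design:

    # Hedged sketch of design-based estimation; the design variables below are
    # illustrative assumptions and must match the survey's documentation.
    library(survey)

    design <- svydesign(
      ids     = ~psu,
      strata  = ~stratum,
      weights = ~person_weight,
      data    = svy_data,
      nest    = TRUE
    )

    # Weighted mean and its standard error
    svymean(~score, design, na.rm = TRUE)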
Examine Results
After conducting the analyses, check the plausibility of the results. This can include comparing the results against other research and statistics on the population, reviewing the results with your project team, or discussing them with other experts, including the people who provided the data.
Code Review
Code review is critical to ensuring the quality and accuracy of Urban research. However, not all projects build code review into their plans. You may need to discuss with your PI whether a code review is possible (especially in terms of budget and timeline) and the desired scope of the review. During the review process, the reviewer will check the logic of your code and confirm that they can replicate your results. This can include catching errors (e.g., misspecifying the subset, not handling missing values) that change the results of the analysis. The reviewer can also check the code’s readability and style and compare the code against the results reported in written documents. Code can be reviewed by others in your center or by the Technology and Data Science team, which has a formal intake process. Thorough and clear documentation of the data processing and analysis makes this step easier for both the project team and the code reviewer. In preparation for code review, it helps both you and the reviewer to first confirm that your code runs from start to finish and that file paths throughout the code are not specific to a particular user or computer.
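One common source of non-portable code is user-specific file paths. A minimal sketch of a more portable approach in R, assuming the here package and a hypothetical project folder structure:

    # Hedged sketch of portable file paths; the folder and file names are
    # illustrative assumptions.
    library(here)

    # Avoid user-specific paths like "C:/Users/jsmith/project/data/raw.csv";
    # here() builds paths relative to the project root
    raw_data <- read.csv(here("data", "raw.csv"))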
Archiving
Planning ahead and keeping archiving requirements in mind during the processing and analysis phases can save time at the end of a project. Many federally funded projects require project data to be archived at the Inter-university Consortium for Political and Social Research (ICPSR). ICPSR’s Guide to Social Science Data Preparation and Archiving provides guidance on its requirements for formatting and de-identification, but if you are archiving your data somewhere else, make sure you know that repository’s requirements. To the extent you can, align your data processing with these requirements to make data archiving easier.
Reporting
Once the analyses are complete, the results will be included in reports. Ensure that you document the process of creating all output (e.g., graphs, tables, maps, statistics) that appears in reports. Review all tables, figures, and corresponding text against the analysis output for accuracy and consistency. You may also need to annotate the data sources in graphs and tables. Reports will likely need to include the methods you used to process and analyze quantitative data and some discussion of limitations. Keeping documentation of methods and limitations throughout the project will help streamline report writing.
Other Resources
Development Research in Practice: The DIME Analytics Data Handbook
Primer on Data Quality: Getting it Right Workshop – sign up in Workday Learning