id (arr_d), sort: egen arrest_n=seq() by
Data Manipulation
Moving past simple data cleaning, you’ll often have to change the shape or format of your data to best accomplish the analysis you’re looking to do. For this section, we’ll be working with a made-up data set of arrests that is starting in “long” form - each row is a unique observation (i.e., one arrest on one date for one person)
Reshaping a data set
Say you have a data set that’s in “long” format, meaning that there are repeated observations/rows/records of the same entity/unit of analysis, usually across time. Sometimes it’s easier or better to work with data in “wide” format, where there is one observation/row/record for each entity or unit of analysis, as you don’t have to worry about double counting. The code below will help with converting from LONG to WIDE format; the code is similar for WIDE to LONG.
First, we need to generate a number that indicates how to order the wide format data.
arrest_n
creates a variable of the number of arrests by person id. The variable not in parenthesis (id
) is the variable that it’s created by, the variable in parenthesis(arr_d
) is sorted within that.
id (arr_d), sort: gen arrest_n2=_n by
You can also use Stata’s “_n” notation with just regular generate. The _n
is essentially the index or (usually) observation number in Stata and quite powerful. When specifying groups with by :
, the _n
index actually resets for each group, which can be used to your advantage if you want to know how many observations you have in a group or to mark the first or last observation within your group. “_N” is the total # of observations overall or within your group when combined with by :
.
Once we have the rows indexed, we can reformat the data wide using this code
*, i(id) j(arrest_n) reshape wide arr_
In the code, i
is the id, j
is the order of the dates, and arr_n
becomes a suffix.
Once the data is in wide format, we can do useful analysis, like checking the time between arrests:
= arr_d2-arr_d1 gen t_arr1
Merging
Merging is a way to join data sets based on matching one or more variables - i.e. a way to add new columns from a second data set. There are three types of merges:
One-to-one - one observation in the master data set corresponds to one observation in the using data set
merge 1:1 country using geography.dta
One-to-many - when the linking variable is unique in one file, but not unique in the other
merge 1:m fam_ID using individuals.dta
Many-to-one - same as above, but with a different order based on which is the master and using file
merge m:1 varlist using filename
Merge variable
During the merge, by default a new variable _merge
is created with a numeric code for the source data set. This is useful if you want to specify keep if _merge == 3
to just keep the rows that are matched.
Numeric Code | Meaning | Decription |
---|---|---|
1 | master | observation appeared in master only |
2 | using | observation appeared in using only |
3 | match | observation appeared in both |
Joinby
joinby
- use this when there are multiple observations of the linking variable in each data set. joinby
creates all pairwise combinations of the data based on the linking variable (it does what a many-to-many merge seems like it should do but doesn’t)
joinbyvarlist using filename
Append
append using filename
essentially stacks one data set on top of the other, as opposed to merging which places them side-by-side.