Data Manipulation

Moving past simple data cleaning, you’ll often have to change the shape or format of your data to best accomplish the analysis you’re looking to do. For this section, we’ll be working with a made-up data set of arrests that is starting in “long” form - each row is a unique observation (i.e., one arrest on one date for one person)

Reshaping a data set

Say you have a data set that’s in “long” format, meaning that there are repeated observations/rows/records of the same entity/unit of analysis, usually across time. Sometimes it’s easier or better to work with data in “wide” format, where there is one observation/row/record for each entity or unit of analysis, as you don’t have to worry about double counting. The code below will help with converting from LONG to WIDE format; the code is similar for WIDE to LONG.

First, we need to generate a number that indicates how to order the wide format data.

by id (arr_d), sort: egen arrest_n=seq()

arrest_n creates a variable of the number of arrests by person id. The variable not in parenthesis (id) is the variable that it’s created by, the variable in parenthesis(arr_d) is sorted within that.

by id (arr_d), sort: gen arrest_n2=_n 

You can also use Stata’s “_n” notation with just regular generate. The _n is essentially the index or (usually) observation number in Stata and quite powerful. When specifying groups with by :, the _n index actually resets for each group, which can be used to your advantage if you want to know how many observations you have in a group or to mark the first or last observation within your group. “_N” is the total # of observations overall or within your group when combined with by :.

Once we have the rows indexed, we can reformat the data wide using this code

reshape wide arr_*, i(id) j(arrest_n) 

In the code, i is the id, j is the order of the dates, and arr_n becomes a suffix.

Once the data is in wide format, we can do useful analysis, like checking the time between arrests:

gen t_arr1= arr_d2-arr_d1 

Merging

Merging is a way to join data sets based on matching one or more variables - i.e. a way to add new columns from a second data set. There are three types of merges:

One-to-one - one observation in the master data set corresponds to one observation in the using data set

merge 1:1 country using geography.dta

Source: http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html

One-to-many - when the linking variable is unique in one file, but not unique in the other

merge 1:m fam_ID using individuals.dta

Source: http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html

Many-to-one - same as above, but with a different order based on which is the master and using file

merge m:1 varlist using filename

Merge variable

During the merge, by default a new variable _merge is created with a numeric code for the source data set. This is useful if you want to specify keep if _merge == 3 to just keep the rows that are matched.

Numeric Code Meaning Decription
1 master observation appeared in master only
2 using observation appeared in using only
3 match observation appeared in both

Joinby

joinby- use this when there are multiple observations of the linking variable in each data set. joinby creates all pairwise combinations of the data based on the linking variable (it does what a many-to-many merge seems like it should do but doesn’t)

joinbyvarlist using filename

Append

append using filename essentially stacks one data set on top of the other, as opposed to merging which places them side-by-side.

Source: http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html