Data Cleaning

Setting up a file

Some commands have to be installed, but only once, capture prevents the error from popping up

clear

capture log close

capture ssc install savesome

Importing an excel file

Import the file

import excel "Stata Class File_long.xlsx", 
        sheet("Sheet1")
        firstrow
        case(lower)
        allstring
        clear

sheet Tells Stata which sheet to import

firstrow Tells stata first row is variable names

case(lower) Makes variable names lowercase (Stata is case sensitive)

Stata is case sensitive so case(lower)

allstring forces all of the observations to be strings since some numerics won’t import well

Duplicates

Often, you’ll want to check your data for duplicates. You may need to check for either entire duplicated rows, or just want to know if there are duplicated values in one column. Throughout the tutorial I’ll be referring to a dummy data set of arrest data, and we want just one observation per person (id) per arrest date (arrdate)

  • duplicates drop - drops any true duplicates

  • duplicates report id arrdate - reports number of duplicates of id and arrdate

Another way to check for duplicates is to create a new dummy variable to identify duplicates

  • duplicates tag id arrdate, gen(t_dup) tags duplicated id and arrdate and creates a new variable (t_dup)

Changing variable types

Depending on how a raw dataset is set up, you may want to change some columns from strings (words/characters) to numbers or vise versa. To start, we can use describe to get the characteristics of the data and see what data type (number, character, date) the columns are.

destring turns numeric values to strings. For it to work, all of the values in the column need to be numbers so if there’s a - or / in the values, Stata will throw an error.

  • destring id, gen(t_id) in this example we create a new temporary variable t_id which can help keep track of which variables to drop at the end.

tostring does the reverse of destring and will turn numeric variables to strings.

Cleaning dates

Dates are a specific data type in Stata so we need to specify that when we read them in. The date function can be used to read in dates - you just need to specify the format (e.g., Month Day Year). This code generates a new variable arr_d that converst arrdate into a date.

gen arr_d=date(arrdate, "MDY")  
Note

Dates can be tricky to work with in Stata. Stata stores dates as a numeric variable that captures either the number of days, months, quarters, or years since January 1, 1960. Dates can also have a time component (datetimes), and are then stored as the number of milliseconds since January 1, 1960. These formats are numeric, but once they are stored as a date in Stata. Several special Stata functions can be applied to them to help with calculations that relate to dates and time.

Once date variables are read in, %td tells Stata to turn the value into a human readable date:

format date1 %td

When it’s in the date format, it’s easy to extract specific components using the year , month , and day functions

gen year1=year(date1)               #generates month of variable
gen month1=month(date1)             #generates year of variable
gen day1=day(date1)             #generates day of variable

Since dates are just numbers in Stata, to get values before or after a certain time, you can just use less than (<), greater than(>), or equal to (==).

gen before2010=(date1<td(1,1,2010))
gen after2010=(date1>date("January 1 2010","MDY"))

Months and years

Months are particularly tricky because they can range between 28 and 31 days. Years are similarly problematic because they can have either 365 or 366 days depending on leap years.

If you’re working with mostly just months in your data, you should store them as dates in monthly format (the underlying data will be the number of months since January, 1960)to avoid having to worry about the number of days. You can do this using the mofd() function and then format using %tm to make them readable.

gen monthlyDate=mofd(date)
gen oneMonthLater1=monthlyDate+1
gen oneYearLater1=monthlyDate+12
format monthlyDate oneMonthLater1 oneYearLater1 %tm

String cleaning

strpos tells you the position of a string in a variable, with 0 meaning that it isn’t in the variable. This is useful if we want to create new variables conditionally based on others. For example, in an arrest dataset, we can create a variable for arrest reason.

#Say we wanted to create a variable for arrest reason
gen arrestreason=0
replace arrestreason=1 if strpos(violdesc, "theft")!=0 
replace arrestreason=2 if (strpos(violdesc, "liquor")!=0 | strpos(violdesc, "alcohol")!=0) 

Once we’ve created the new variable, label is a function that attaches labels (duh) to datasets or variables and is shown when you describe it.

label define arrestreason 0 "Other" 1 "Theft" 2 "Alcohol"
label values arrestreason arrestreason
label var arrestreason "Reason For Arrest"
tab violdesc arrestreason, m

substr(s, pos, len) gets the substring of s starting at position pos of length len. If pos is negative it means that the position is starting counting backwards from the end. If len is a missing numeric (.) it indicates everything until the end.

  • substr("abcdef", 2, 3) = "bcd"

  • substr("abcdef", -3, 2) = "de"

  • substr("abcdef", 2, .) = "bcdef"

subinstr(s1,s2, s3, n) changes the first n occurances in s1 of s2 to s3. For example:

  • subinstr("this is this", "is", "X", 1) = "thX is this"

  • subinstr("this is this", "is", "X", 2) = "thX X this"

  • subinstr("this is this", "is", "X", .) = "thX X thX"

length(s) returns the length of s. s can be either a string or observation:

  • length("word") = 4

  • length(myvar) - varies between observations

split is used to split strings into two or more parts. When you give split separators, it goes through the word and finds those separators and creates one or more new words. split is particularly useful when Stata misreads input as one string variable. For example, you can use split to divide emails between @

  • split address, p(@) = address1, containing the part of the email address before the “@”, such as “arogin”, and address2, containing the part after the “@”, such as “urban.org”. The separator itself, “@”, is discarded.

Renaming variables

You can rename variables (but keep the contents the same) using rename old_var_name new_var_name

  • rename old_var new_var

  • rename old_* new_* - using * allows you to rename multiple variables at once

  • rename *_var var_* - you can also use * to add or remove a prefix