"Stata Class File_long.xlsx",
import excel sheet("Sheet1")
firstrowcase(lower)
allstring clear
Data Cleaning
Setting up a file
Some commands have to be installed, but only once, capture prevents the error from popping up
clear
capture log close
capture ssc install savesome
Importing an excel file
Import the file
sheet
Tells Stata which sheet to import
firstrow
Tells stata first row is variable names
case(lower)
Makes variable names lowercase (Stata is case sensitive)
Stata is case sensitive so case(lower)
allstring
forces all of the observations to be strings since some numerics won’t import well
Duplicates
Often, you’ll want to check your data for duplicates. You may need to check for either entire duplicated rows, or just want to know if there are duplicated values in one column. Throughout the tutorial I’ll be referring to a dummy data set of arrest data, and we want just one observation per person (id
) per arrest date (arrdate
)
duplicates drop
- drops any true duplicatesduplicates report id arrdate
- reports number of duplicates ofid
andarrdate
Another way to check for duplicates is to create a new dummy variable to identify duplicates
duplicates tag id arrdate, gen(t_dup)
tags duplicatedid
andarrdate
and creates a new variable (t_dup)
Changing variable types
Depending on how a raw dataset is set up, you may want to change some columns from strings (words/characters) to numbers or vise versa. To start, we can use describe
to get the characteristics of the data and see what data type (number, character, date) the columns are.
destring
turns numeric values to strings. For it to work, all of the values in the column need to be numbers so if there’s a - or / in the values, Stata will throw an error.
destring id, gen(t_id)
in this example we create a new temporary variablet_id
which can help keep track of which variables to drop at the end.
tostring
does the reverse of destring
and will turn numeric variables to strings.
Cleaning dates
Dates are a specific data type in Stata so we need to specify that when we read them in. The date
function can be used to read in dates - you just need to specify the format (e.g., Month Day Year). This code generates a new variable arr_d
that converst arrdate
into a date.
=date(arrdate, "MDY") gen arr_d
Dates can be tricky to work with in Stata. Stata stores dates as a numeric variable that captures either the number of days, months, quarters, or years since January 1, 1960. Dates can also have a time component (datetimes), and are then stored as the number of milliseconds since January 1, 1960. These formats are numeric, but once they are stored as a date in Stata. Several special Stata functions can be applied to them to help with calculations that relate to dates and time.
Once date variables are read in, %td
tells Stata to turn the value into a human readable date:
format date1 %td
When it’s in the date format, it’s easy to extract specific components using the year
, month
, and day
functions
=year(date1) #generates month of variable
gen year1=month(date1) #generates year of variable
gen month1=day(date1) #generates day of variable gen day1
Since dates are just numbers in Stata, to get values before or after a certain time, you can just use less than (<), greater than(>), or equal to (==).
=(date1<td(1,1,2010))
gen before2010=(date1>date("January 1 2010","MDY")) gen after2010
Months and years
Months are particularly tricky because they can range between 28 and 31 days. Years are similarly problematic because they can have either 365 or 366 days depending on leap years.
If you’re working with mostly just months in your data, you should store them as dates in monthly format (the underlying data will be the number of months since January, 1960)to avoid having to worry about the number of days. You can do this using the mofd()
function and then format using %tm
to make them readable.
=mofd(date)
gen monthlyDate=monthlyDate+1
gen oneMonthLater1=monthlyDate+12
gen oneYearLater1 format monthlyDate oneMonthLater1 oneYearLater1 %tm
String cleaning
strpos
tells you the position of a string in a variable, with 0 meaning that it isn’t in the variable. This is useful if we want to create new variables conditionally based on others. For example, in an arrest dataset, we can create a variable for arrest reason.
#Say we wanted to create a variable for arrest reason
=0
gen arrestreason=1 if strpos(violdesc, "theft")!=0
replace arrestreason=2 if (strpos(violdesc, "liquor")!=0 | strpos(violdesc, "alcohol")!=0) replace arrestreason
Once we’ve created the new variable, label
is a function that attaches labels (duh) to datasets or variables and is shown when you describe
it.
0 "Other" 1 "Theft" 2 "Alcohol"
label define arrestreason
label values arrestreason arrestreason"Reason For Arrest"
label var arrestreason tab violdesc arrestreason, m
substr(s, pos, len)
gets the substring of s
starting at position pos
of length len
. If pos
is negative it means that the position is starting counting backwards from the end. If len
is a missing numeric (.) it indicates everything until the end.
substr("abcdef", 2, 3) = "bcd"
substr("abcdef", -3, 2) = "de"
substr("abcdef", 2, .) = "bcdef"
subinstr(s1,s2, s3, n)
changes the first n
occurances in s1
of s2
to s3
. For example:
subinstr("this is this", "is", "X", 1) = "thX is this"
subinstr("this is this", "is", "X", 2) = "thX X this"
subinstr("this is this", "is", "X", .) = "thX X thX"
length(s)
returns the length of s
. s
can be either a string or observation:
length("word") = 4
length(myvar) - varies between observations
split
is used to split strings into two or more parts. When you give split
separators, it goes through the word and finds those separators and creates one or more new words. split
is particularly useful when Stata misreads input as one string variable. For example, you can use split
to divide emails between @
split address, p(@)
= address1, containing the part of the email address before the “@”, such as “arogin”, and address2, containing the part after the “@”, such as “urban.org”. The separator itself, “@”, is discarded.
Renaming variables
You can rename variables (but keep the contents the same) using rename old_var_name new_var_name
rename old_var new_var
rename old_* new_*
- using * allows you to rename multiple variables at oncerename *_var var_*
- you can also use * to add or remove a prefix