Automating Tasks

Macros: Local vs. Global

Macros are a kind of temporary holding space akin to “variables” in R, Python, and other general program languages (remember: think of your data set as a spreadsheet, or in R/Python, etc. terms, a “data frame” and its variables as columns). These are also different from “macros” in Excel, Macs, etc. which are a set of step automated via short code scripts. There are 2 flavors of Macros in Stat: LOCALS and GLOBALS.

  • LOCALS last only as long as your commands are continuously running within a session and have limited (local/private) SCOPE, meaning their values are independent and cannot be accessed at different levels (i.e. within loops) in your code or other DO files you might run in the same session.

  • GLOBALS persist for your entire Stata session and have global (public) SCOPE, meaning their values are accessible and changeable at any point in your session.

Both types of macros can be used to storre strings or numeric values/expressions:

You define LOCALS with the “local” command, followed by what you want to name it.

local my1stlocalmacro "Hello, world"

global my1stglobalmarcro "Hello, world"
Important

The Stata syntax to refer to LOCALS with an open single quote “`” (shift+ ~), the name, and then a close single quote/apostrophe ” ’ ” (`my1stlocalmacro')

NOTE: Since quarto doesn’t support Stata code chunks, the closing single apostrophe has been changed to an open single quote for legibility in some cases.

Refer to GLOBALS with the dollars sign “$” followed by the global name (${my1stglobalmacro})

You can use macros to store lists of variables that you want to use over and over again:

local myvarlist "age gender"
sum `myvarlist`

Or for filepaths to use/save data and results from/to:

pwd
local workingdir `c(pwd)`
global saved_data "`workingdir'\Stata 301\"
#create the directory, use capture in case it already exists...
capture noisily mkdir "${saved_data}" 
#save the file there
save "${saved_data}data_wide_copy.dta", replace 

In practical terms, I find GLOBALS easier to work with as they persist, and I tend not to do anything too too fancy things like running many PROGRAMS (see below) or DO/ADO (those downloaded user written commands) files within other DO files where scoping would be a bigger issue. However, it is technically bad form to use GLOBALS generally due to potential conflicts. Strictly speaking, good programming form is to use LOCALS rather than GLOBALS unless you have a program that must remember something from one run to the next. Even with DO files, you can actually pass values from one DO file to another run within it with LOCAL arguments (see below).

Looping

We can use forvalues for looping over numeric values with consistent intervals:

forvalues i=1(1)10 {        #syntax is i=min(interval)max                   
    display "`i'"                n#call the values with `i'                 
    }

Loop over odd values:

forvalues i=1(2)10 {                        
    display "`i'"               
    }   

Loop over even values:

forvalues i=2(2)10 {                            
    display "`i'"               
    }   

foreach is used to loop over non-numeric values or numeric values with inconsistent intervals. The syntax is: foreach lname in list elements of the list are separated by a blank space.

foreach k in col_1 col_2 { 
    display "`k'"
    }

You can combine the “of varlist” syntax with the flexible ways Stata allows you to define a VARLIST in order loop through a group of variables that start, end, or have some other common section in their names. Let’s try only looking at whether age and gender are related to if individuals had specifically a 4th arrest of anytype:

foreach var of varlist arr_*4 {
    di "`var'"
    reg `var` age gender 
}

The forvalues looping syntax assumes a regular pattern numeric pattern for values of a variable you might want to step through in your loop. However, this is not always the case, and trying to refer to a value that does not exist can lead to errors. We can use the levelsof syntax to loop through just the valid values of either a numeric or even a string variable:

levelsof arrest_y1, local(levels)

display "`levels'"

foreach l of local levels {
foreach var of varlist arr_*4 {
    di "`var' `l'" 
    reg `var' age gender if arrest_y1 == `l`
    outreg2 using "regressions.doc" ,  label ctitle("`var'_`l'") append 
 }
}

If else

If and else statements let you control the flow of your program to execute some commands and/or skip others depending on some condition:

forvalues i=1(1)39 {
    if `i`<10 {
        display "Less than 10"
        }
    if `i`>=10 & `i`<20 {
        display "10-20"
        }
    else if `i`>=20     {   
        display "Greater than 20"
        }
}

Programming (functions)

Programs are analogous to defining functions in R, Python etc. and allows you to take one or more pieces of varying input and execute the same set of commands for the varying input. A general rule is that if you’re copying the same code more than twice, you should turn it into a program to automate it for you and avoid errors.

program define test         #Defines the name of the program
    args x                        #Defines the input(s) in the program
        display "`x'"           #Tells stata what to do with that input
    end                   #Tells stata this is the end of your program

To run the program type the name and the input(s)

test "Stata is great"

Suppose you want to make a bunch of histograms:

program define pretty_histogram
                args var title color num
                    histogram `var`, percent                                                            
                                title("`title'")                                       
                                color("`color'") lcolor(black)          
                                width(1)                                
                                ylabel(0(25)100,nogrid)                 
                                xlabel(0(5)35,nogrid)                     
                                xtitle("Number of Arrests")              
                                graphregion(fcolor(white))              
                                plotregion(style(none))                 
                graph export "AppendixA_Figure`num'.pdf", replace
end

You can then call this program with pretty_histogram arr_theft "Arrests for Theft" "207 232 243" 1

Outputting Descriptive Statistics

There are three (3) frequently used ways of outputting results to external files, each with pros and cons

Function Pros Cons
POSTFILE
  • Native to Stata

  • Fairly flexible on what data you can put and in what order

  • Less easy to format and include context like variables names and labels

  • Less easy to use and output lots of information as it is based on looping and writes 1 line at a time

TABOUT
  • Easy to use and output lots of information

  • Easy to format and include context like variables names and labels

  • Not built into Stata-- have to download and install from SSC

  • Less flexible on what statistics you can run and in what order

PUTEXCEL
  • Similar to POSTFILE in that it is built into Stata

  • Easier to edit and manipulate specific cells/data points and add labels/text unlike

  • Less automated and more tedious as you must specify which cells and ranges to write to in Excel with each command

POSTFILE

Postfile creates a second temporary data set in memory, separate from the one you have open. You will run your summary statistics on the active data set, store the results in local macro, write the results in the locals to the temporary data set, and, finally, open the temporary data set and export it to an external Excel file.

First create a temporary data set in memory with a temporary name assigned to the local macro “memhold”

#First create a temporary data set in memory with a temporary name assigned to the local macro "memhold"
tempname memhold
#In this line we actually create the temporary dataset in memory (but which is not active) in the namespace "memhold" The different variables/columns we want in our temporary data set are listed between "memhold" and using . Using specifies where the data set actually gets saved/stored eventually
postfile `memhold` str20(varname) meanmale meanfemale pvaloftest using "${saved_data}FINRA_02_postfile.dta", replace
#Next we will specify the different variables that we want statistics on in a loop we can iterate through. In this case we want to see if the mean/average value of these various variables differ significanlty by gender (A3)
foreach var in A3A    r_white r_black r_hisp r_asian r_other reg_ne reg_mw reg_south reg_west {
        #First find the mean value of the varible of interest for males (A3==1)
        su `var` if A3 == 1
        return list     #Return list displays the statistics stata saved and can be stored in your locals. Store this average termporarily in a local macro called "meanmale"; we will write this vlaue to our temporary data file below in the "post" command line
        loc meanmale = r(mean)
        
        #Next, find the mean value of the varible of interest for females (A3==2)
        su `var` if A3==2 
        #store the result to a local macro called "meanfemale"
        loc meanfemale = r(mean)
        
        #Then, we want to see if the average for males and females differs statistically, 
        #so we will use the mean estimation and the postestimatation test command to see.
        #We could also use "ttest" to do this in one line, but "mean" is more flexible and allows      us to use weights if we want
        mean `var`, over (A3)
        test [`var`]Male = [`var`]Female # Stata 15 syntax. If using Stata 16, use this  alternative syntax: test `var'@1.A3 = `var'@2.A3
    
        #Save the 2-sided p-value of this difference of means test to a third local macro
        local pvaloftest = r(p)
      #Finally, for the variable we are actively looping through, write the mean value for males      , mean value for females, and 2-sided p-value of the difference as a new observation (row) in our temporary data file (not the one we have open actively and are analyzing)
    post `memhold' ("`var`") (`meanmale``) (`meanfemale`) (`pvaloftest`)
    }
#Postclose tells Stata we are done writing to the temporary data file we created and to finalize/save it where we specified in the "postfile" command
postclose `memhold`

#Then we just have to open up and export that new data file and export the results. First we may want to preserve our active data file, though, so we can restore it quickly to conduct additional analyses. This part is optional, though, since you could just reopen your original data file,or save a working version here and reopen it after the export instead of using preserve/restore
preserve #optional
#Open up your results file data set:
use "${saved_data}FINRA_02_postfile.dta", clear
    #Export to Excel:
    export excel using "${saved_data}FINRA_02_postfile.xls", replace
restore #optional

#Take a look at the resulting Excel file for the output--its gives you the information you need, 
#but the numbers are not formatted and there are no column headings to tell you what metric is in each column
#Sometimes, the better option is... TABOUT 
#(though you would not be able to pull out and write the p-value for the significance test in TABOUT easily)

TABOUT

TABOUT is second way of outputting descriptive statistics that is pretty automated/easy to use

# Download command:
capture ssc install tabout 
#Can use to show tabulations
tabout A3 using "${saved_data}FINRA_02.xls", cells(col) stats(chi2) replace 

#Or crosstabs (by gender in this case):
foreach characteristic of varlist A4A A6 A5 {
    tabout `characteristic` A3 using "${saved_data}FINRA_02.xls", cells(col) stats(chi2) append 
}
    
#Can also use to show summary statistics with the "sum" option
#You need to specify which statistic to show for each variable whose mean, median, etc. you want to show
#Tabout then runs this by the differing variables listed before "using"
tabout A3 A4A using "${saved_data}FINRA_02.xls", sum c(mean A3A  mean emp_full) f(4) append 
#f(4) indicates four digits after the decimal

#Again, you can use loops and loops within loops to automate and quicken your work...
foreach characteristic of varlist A4A A6 A5 {
    foreach outcome of varlist A3A   emp_full {
        tabout `characteristic' using "${saved_data}FINRA_02.xls", sum c(N `outcome` mean`outcome`) f(4) append
    }
}
#Can also download and use the OUTSUM command from SSC to export summary statistics, but usually I find tabout along with the sum option to be adequate

PUTEXCEL

PUTEXCEL is another way of outputting results from Stata to external file (the most tedious), but also the most flexible in terms of what to place in each individual cell and how to format it.

#First specify the file that you want to write your output to:
putexcel set "${saved_data}test.xls", replace
#Can write stored results/contents of local macros similar to postfile:
sum A3A
local avg = r(mean)
putexcel B2= `avg`
local stdDev  = r(sd)
putexcel B3  = `stdDev`
#Can also add labeling more easily than in postfile:
putexcel B1="Age" A2="Average" A3="Standard Deviation"