Regression Analysis

For this section, we’ll be using the Robert Wood Johnson Foundation 2015 County Health Rankings Analytic Data.

Ordinary Least Squares (OLS) regression

OLS or linear models can be used when we have a continuous outcome variable and one or more predictors. Say we want to see whether higher air pollution predicts more low-birthweight babies:

regress lowbirthweightvalue airpollutionparticulatematterval

Let’s break down this table a bit.

ANOVA Table

Value Meaning
Source The source of variance. The Total variance is broken down into the variance explained by the independent variables (Model) and the variance not explained by the independent variables (Residual)
SS Sum of Squares associated with the three sources of variance.
df Degrees of freedom associated with the three sources of variance. The total df is N-1. The model df is the number of coefficients estimated minus 1 (K-1, counting the intercept). The residual df is the total df minus the model df.
MS Mean squares (the Sum of Squares divided by their respective DF)
Number of obs The number of observations used in the regression analysis
F and Prob > F These values are used to answer the question “Do the independent variables reliably predict the dependent variable?”. A p-value < 0.05 means that the model is statistically significant. Note that this is an overall significance test assessing whether the group of independent variables when used together reliably predict the dependent variable, and does not address the ability of any of the particular independent variables to predict the dependent variable.  The ability of each individual independent variable to predict the dependent variable is addressed in the table below where each of the individual variables are listed.
R-squared The proportion of variance in the dependent variable (low birth weight) which can be predicted from the independent variables (air pollution). Note that this is an overall measure of the strength of association, and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
Adj R-squared The adjusted R-square attempts to yield a more honest value to estimate the R-squared for the population.
Root MSE The standard deviation of the error term, and is the square root of the Mean Square Residual (or Error)
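The quantities in the ANOVA table fit together arithmetically. Here is a minimal Python sketch of those relationships for a one-predictor model, using made-up numbers rather than the actual County Health Rankings values:

```python
# Toy illustration (not the CHR data) of how the values in Stata's
# ANOVA table relate to each other for a one-predictor OLS regression.
x = [5.0, 7.0, 8.0, 9.0, 11.0, 12.0, 14.0]   # hypothetical air pollution
y = [6.5, 7.0, 7.8, 8.1, 8.9, 9.4, 10.2]     # hypothetical % low birthweight

n = len(x)
k = 2                                         # coefficients: intercept + slope
mx, my = sum(x) / n, sum(y) / n

# OLS slope and intercept in closed form
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx
yhat = [intercept + slope * xi for xi in x]

ss_total = sum((yi - my) ** 2 for yi in y)                  # Total SS
ss_model = sum((yh - my) ** 2 for yh in yhat)               # Model SS
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # Residual SS

df_model, df_resid = k - 1, n - k
ms_model, ms_resid = ss_model / df_model, ss_resid / df_resid

f_stat = ms_model / ms_resid                  # F
r2 = ss_model / ss_total                      # R-squared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)     # Adj R-squared
root_mse = ms_resid ** 0.5                    # Root MSE
```

Note how Total SS splits exactly into Model SS plus Residual SS, which is what makes R-squared interpretable as a proportion.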

Parameter Estimates

Value Meaning
lowbirthweightvalue This is the dependent variable in the model, and all of the rows below it are the independent variables.
Coef These coefficient values tell you about the relationship between the independent and dependent variables. In this example, the coefficient tells us the increase in low birth weight that would be predicted by a 1-unit increase in air pollution.
Std. Err. The standard error associated with the coefficient. This is used to determine if the coefficient is statistically significant (t- and p-values) and create the confidence interval.
t and P>|t| These columns are the t and two-tailed p-values used to test the null hypothesis that the coefficient is 0. Generally speaking, if the two-tailed p-value is less than 0.05 it is considered statistically significant (which air pollution is in this model).
95% confidence interval The 95% confidence interval for the coefficient is useful to put the coefficient estimate into perspective and see how much it can vary. The p-value will not be statistically significant if the confidence interval contains 0.
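The coefficient-table quantities also follow from a few formulas. A minimal Python sketch for the slope in a one-predictor model (made-up numbers; 1.96 is the large-sample normal approximation to the t critical value Stata actually uses):

```python
# Toy one-predictor example: how Coef., Std. Err., t, and the
# confidence interval relate. Made-up data, not the CHR data.
x = [5.0, 7.0, 8.0, 9.0, 11.0, 12.0, 14.0]
y = [6.5, 7.0, 7.8, 8.1, 8.9, 9.4, 10.2]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
intercept = my - slope * mx
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ms_resid = sum(e ** 2 for e in resid) / (n - 2)

se_slope = (ms_resid / sxx) ** 0.5   # Std. Err. of the coefficient
t_stat = slope / se_slope            # t = Coef. / Std. Err.
# Approximate 95% interval; Stata uses the t distribution with n-2 df.
ci = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)
```

If this interval excludes 0, the two-tailed p-value is below 0.05, matching the rule of thumb in the table above.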

We can also add weights to our regression. Say we want to weight this by the population of each county:

regress lowbirthweightvalue airpollutionparticulatematterval ///
    [w = populationestimatevalue]

There are at least three big concerns with regressions which we will mention at different points in this activity.

Caution

CONCERN 1: Unequally influential observations

OLS assumes extreme values are very rare, and it gets squirrely when it sees them. Let’s try a standard robust variance estimator, which helps account for heteroskedastic or dependent data. Generally, large residuals (the difference between predicted and observed values) and high leverage (how far an independent variable deviates from its mean) flag unequally influential observations and suggest using a robust variance estimator.

regress lowbirthweightvalue airpollutionparticulatematterval ///
    [w = populationestimatevalue], vce(robust)
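Under the hood, vce(robust) replaces the classical variance formula with an HC1-style heteroskedasticity-robust "sandwich" estimator that weights each observation by its own squared residual. A minimal Python sketch for the one-predictor, unweighted case, with made-up numbers:

```python
# Sketch of the robust (sandwich, HC1-style) variance for the slope
# in a one-predictor model, compared with the classical variance.
# Toy data, unweighted for clarity; not the CHR data.
x = [5.0, 7.0, 8.0, 9.0, 11.0, 12.0, 14.0]
y = [6.5, 7.2, 7.6, 8.4, 8.7, 9.8, 10.0]
n, k = len(x), 2
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
intercept = my - slope * mx
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Classical variance: one pooled error variance for all observations
var_classical = (sum(e ** 2 for e in resid) / (n - k)) / sxx
# Robust variance: each observation contributes its own squared residual
var_robust = (n / (n - k)) * sum(
    (e * (xi - mx)) ** 2 for e, xi in zip(resid, x)
) / sxx ** 2

se_classical, se_robust = var_classical ** 0.5, var_robust ** 0.5
```

The point estimate of the slope is unchanged; only the standard errors (and hence t, p, and the confidence interval) differ.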

We can store this as our simplest regression finding:

est store simple

Maybe water pollution is what we should worry about instead of air pollution?

regress lowbirthweightvalue airpollutionparticulatematterval ///
    drinkingwaterviolationsvalue ///
    [w = populationestimatevalue], vce(robust)

No need to keep this finding - it was a dead end. But maybe the counties with the most air pollution have poor access to health care:

regress lowbirthweightvalue airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    [w = populationestimatevalue], vce(robust)
    
est store plushealthcare

Or maybe the counties with the most air pollution also have systematic differences in behavior:

regress lowbirthweightvalue airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    percentofpopulationthatisnonhisp ///
    [w = populationestimatevalue], vce(robust)
    
* store regression results
est store plusbehav

Now let’s make a table of the analyses we’ve built up:

outreg2 [simple plushealthcare plusbehav] using myfile, replace see

Now we can examine the predicted levels of low birth weight at the extremes of county air pollution (net of county income, health services, and demographics) using the margins command, which estimates margins of responses (predicted values or probabilities) for specified values of covariates and presents the results as a table.

margins, at(airpollutionparticulatematterval=(7 14)) atmeans vsquish
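In a one-predictor model, the idea reduces to plugging the chosen pollution values into the fitted line; with more covariates, atmeans holds them at their means. A minimal Python sketch with made-up numbers:

```python
# Simplified illustration of margins, at(x=(7 14)) atmeans: once a
# model is fitted, plug the chosen pollution values into it (other
# covariates, if any, held at their means) and report the prediction.
# Toy one-predictor fit; not the CHR data.
x = [5.0, 7.0, 8.0, 9.0, 11.0, 12.0, 14.0]
y = [6.5, 7.0, 7.8, 8.1, 8.9, 9.4, 10.2]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx

# Predicted outcome at each requested pollution level
margins = {at: intercept + slope * at for at in (7, 14)}
```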

Now let’s graph the predicted relationship:

predict plowbirthweightvalue
twoway (scatter plowbirthweightvalue airpollutionparticulatematterval)

This brings up the second concern:

Caution

CONCERN 2: Non-linear relationships

One approach: look for non-linearities in the residuals of the main model without pollution.

regress lowbirthweightvalue ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue uninsuredvalue ///
    percentofpopulationthatisnonhisp percentofpopulationthatishispani ///
    teenbirthsvalue somecollegevalue ///
    adultsmokingvalue excessivedrinkingvalue physicalinactivityvalue ///
    if airpollutionparticulatematterval ~=. [w = populationestimatevalue], vce(robust)

predict calculates predictions, residuals, influence statistics, and the like after estimation. The following code creates plbw1 with the default prediction for the previous regression command.

predict plbw1

We can also create a new variable rlbw1 with the residuals from the previous regression:

predict rlbw1, residuals

Plotting the residuals against air pollution with a lowess smoother, we get:

lowess rlbw1 airpollutionparticulatematterval

Another approach is to break the key independent variable into discrete categories. We do this by first turning airpollutionparticulatematterval into an integer, and then using fvset base to set the reference (base) level for the factor variable.

gen airpollutionparticulatematterint = int(airpollutionparticulatematterval)

fvset base 11 airpollutionparticulatematterint

regress lowbirthweightvalue i.airpollutionparticulatematterint ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    [w = populationestimatevalue], vce(robust)
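What the i. factor-variable notation does can be sketched in Python: expand the integer variable into indicator (dummy) columns, omitting the base level as the reference category (hypothetical values, not the CHR data):

```python
# Sketch of i.airpollutionparticulatematterint: each non-base integer
# level becomes its own 0/1 indicator column; the base level (set to
# 11 with fvset above) is the omitted reference category.
levels = [9, 10, 11, 11, 12, 13]   # hypothetical integer pollution values
base = 11
cats = [c for c in sorted(set(levels)) if c != base]
dummies = [[1 if v == c else 0 for c in cats] for v in levels]
```

Each level's coefficient is then interpreted relative to the base category, which is what lets the regression pick up non-linear jumps between pollution levels.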

Here we run into our last concern:

Caution

CONCERN 3: The possibility of “hot-spots”

Logistic Regressions

Logistic or logit models are used to model dichotomous outcome variables. Following the example above, we can turn low birth weight into a binary indicator (low birth weight / no low birth weight) to test what factors influence this outcome. According to the summary stats, low birth weight has a sample mean of 8.2%, so we’ll use that as the cutoff.

generate lowbirthweightcounty_yn = .

replace lowbirthweightcounty_yn = 0 if lowbirthweightvalue > 0 & lowbirthweightvalue < 0.082

replace lowbirthweightcounty_yn = 1 if lowbirthweightvalue >= 0.082 & lowbirthweightvalue < 0.24
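The recode above can be sketched in Python (hypothetical proportions, not the actual CHR values):

```python
# Counties at or above the 8.2% sample mean are flagged 1
# (low-birthweight county); counties below it are flagged 0.
values = [0.061, 0.079, 0.082, 0.095, 0.120]   # hypothetical proportions
flags = [1 if v >= 0.082 else 0 for v in values]
```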

Now we’ll run the logit model on the dichotomous outcome to see if the relationship still shows:

logit lowbirthweightcounty_yn airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    percentofpopulationthatisnonhisp, vce(robust)
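Under the hood, logit estimates coefficients by maximizing a binomial likelihood; a common fitting method is Newton-Raphson. A minimal pure-Python sketch for one predictor, with made-up data (not the CHR variables):

```python
import math

# Newton-Raphson fit of a one-predictor logistic regression:
# P(y=1) = 1 / (1 + exp(-(b0 + b1*x))). Toy data: hypothetical
# pollution levels and a low-birthweight-county yes/no flag.
x = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(50):
    p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
    # Gradient of the log-likelihood
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    # Negative Hessian (2x2) built from the weights p*(1-p)
    w = [pi * (1 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: add (negative Hessian)^-1 times the gradient
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det
```

A positive b1 here plays the same role as a positive air pollution coefficient in the Stata output: higher pollution means higher predicted odds of being a low-birthweight county.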

Commands for RCTs

T-tests

ttest is useful when you want to compare treatment and control group values on a continuous variable. In simple terms, it tests whether the difference between group means (or between a group mean and a hypothesized value) is 0.

Now let’s unpack this table:

Variable Meaning
Group The categories of the independent variable. In this case 0 and 1.
Obs The number of valid and non-missing observations in each group.
Mean The mean of the dependent variable for each level of the independent variable.  The last line is the difference between the means.
Std. Err. Standard error of the mean for each level of the independent variable.
Std. Dev. The standard deviation of the dependent variable for each of the levels of the independent variable.  The last line is the standard deviation for the difference.
95% Conf. Interval The lower and upper confidence limits of the means.
diff The value that we’re testing (the difference in means between group 0 and group 1).
t The test statistic we use to evaluate our hypothesis: the ratio of the difference in means to the standard error of the difference.
Satterthwaite’s degrees of freedom When variances are assumed to be unequal, Satterthwaite’s is an alternative way to calculate the degrees of freedom that takes this into account. It is a more conservative approach than using the traditional degrees of freedom.
Pr(|T| > |t|) The two-tailed p-value computed using the t distribution. If the p-value is less than 0.05, we can say that the difference in means is statistically significant.
Pr(T < t), Pr(T > t) These are the one-tailed p-values for the alternative hypotheses (difference < 0) and (difference > 0), respectively.

You can also use ttesti, the immediate form of the command, if you are just handed the summary statistics:

* ttesti Ntreat meantreat sdtreat Ncont meancont sdcont
ttesti 4252 18.1 12.9 6764 32.6 18.2, unequal
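For reference, here is what those summary statistics produce under the unequal-variance (Welch/Satterthwaite) formulas, sketched in Python:

```python
import math

# Welch t statistic and Satterthwaite degrees of freedom from the
# summary statistics passed to ttesti above.
n1, m1, s1 = 4252, 18.1, 12.9   # group 1: N, mean, sd
n2, m2, s2 = 6764, 32.6, 18.2   # group 2: N, mean, sd

v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
t = (m1 - m2) / math.sqrt(v1 + v2)
# Satterthwaite's approximation for the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
```

With samples this large and means this far apart, the t statistic is enormous and the difference is unambiguously significant.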

Proportion Test

prtest allows you to compare treatment and control group proportions on a binary (categorical) variable.

prtest nonstandard if (RACECEN1==1 | RACECEN1==2), by(RACECEN1)

Again, you can do this with the immediate command (prtesti) if you are just given summary statistics.

* prtesti Ntreat ptreat Ncont pcont
prtesti 345 .3536 1900 .1411
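prtesti runs a two-sample z test for the difference in proportions, using the pooled proportion under the null hypothesis of no difference. Sketched in Python with the numbers above:

```python
import math

# Two-sample proportion test from the summary statistics passed to
# prtesti above: z statistic under the pooled-proportion null.
n1, p1 = 345, 0.3536    # group 1: N and proportion
n2, p2 = 1900, 0.1411   # group 2: N and proportion

p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
```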

Some other regression-style models you might be asked to run, with commands and outputs similar to regress:

Model Description
probit probit model
mlogit multinomial logit model
ologit ordinal logit model
tobit censored regression model
xi: glm generalized linear model (e.g., loglinear models; the xi: prefix expands categorical variables into indicator variables)
streg hazard model (aka rate/survival model)