Regression Analysis
For this section, we’ll be using the Robert Wood Johnson 2015 County Health Rankings Analytic Data.
Ordinary Least Squares (OLS) regression
OLS or linear models can be used when we have one predictor variable. Say we want to see whether higher air pollution predicts more low-birthweight babies:

regress lowbirthweightvalue airpollutionparticulatematterval

Let's break down this table a bit.
Anova Table
Value | Meaning |
---|---|
Source | The source of variance. The Total variance is broken down into the variance explained by the independent variables (Model) and the variance not explained by the independent variables (Residual) |
SS | Sum of Squares associated with the three sources of variance. |
df | Degrees of freedom associated with the three sources of variance. The total df is N-1. The model df corresponds to the number of predictors minus 1 (K-1). The residual df is the total df minus the model df. |
MS | Mean squares (the Sum of Squares divided by their respective DF) |
Number of obs | The number of observations used in the regression analysis |
F and Prob > F | These values are used to answer the question “Do the independent variables reliably predict the dependent variable?”. A p-value < 0.05 means that the model is statistically significant. Note that this is an overall significance test assessing whether the group of independent variables when used together reliably predict the dependent variable, and does not address the ability of any of the particular independent variables to predict the dependent variable. The ability of each individual independent variable to predict the dependent variable is addressed in the table below where each of the individual variables are listed. |
R-squared | The proportion of variance in the dependent variable (low birthweight) which can be predicted from the independent variables (air pollution). Note that this is an overall measure of the strength of association, and does not reflect the extent to which any particular independent variable is associated with the dependent variable. |
Adj R-squared | The adjusted R-square attempts to yield a more honest value to estimate the R-squared for the population. |
Root MSE | The standard deviation of the error term, and is the square root of the Mean Square Residual (or Error) |
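These quantities are linked: each MS is its SS divided by its df, F is the Model MS over the Residual MS, and R-squared is the Model SS over the Total SS. As a sketch (reusing the running example's variables), several of them can also be pulled from Stata's stored results after the regression:

```stata
regress lowbirthweightvalue airpollutionparticulatematterval
// stored results: e(r2) is the R-squared, e(rmse) the Root MSE,
// e(F) the overall F statistic, e(df_m) and e(df_r) the model/residual df
display e(r2)
display e(rmse)
display e(F)
```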
Overall Model Fit
Value | Meaning |
---|---|
lowbirthweightvalue | This is the dependent variable in the model, and all of the rows below it are the independent variables. |
Coef | These coefficient values tell you about the relationship between the independent and dependent variables. For this example, the coefficient gives the predicted increase in low birthweight for a 1-point increase in air pollution. |
Std. Err. | The standard error associated with the coefficient. This is used to determine if the coefficient is statistically significant (t- and p-values) and create the confidence interval. |
t and P>|t| | These columns are the t and two-tailed p-values used to test the null hypothesis that the coefficient is 0. Generally speaking, if the two-tailed p-value is less than 0.05 it is considered statistically significant (which air pollution is in this model). |
95% confidence interval | The 95% confidence interval for the coefficient is useful to put the coefficient estimate into perspective and see how much it can vary. The p-value will not be statistically significant if the confidence interval contains 0. |
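The link between these columns can be checked by hand: the t statistic is just the coefficient divided by its standard error, and the confidence bounds sit roughly 1.96 standard errors either side of the coefficient (Stata uses the exact t critical value, so this is approximate). A sketch using Stata's stored coefficient and standard-error results:

```stata
regress lowbirthweightvalue airpollutionparticulatematterval
// t statistic = coefficient / standard error
display _b[airpollutionparticulatematterval] / _se[airpollutionparticulatematterval]
// approximate 95% CI bounds: coefficient +/- 1.96 * standard error
display _b[airpollutionparticulatematterval] - 1.96 * _se[airpollutionparticulatematterval]
display _b[airpollutionparticulatematterval] + 1.96 * _se[airpollutionparticulatematterval]
```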
We can also add weights to our regression. Say we want to weight this by the population of each county:

regress lowbirthweightvalue airpollutionparticulatematterval [w = populationestimatevalue]
There are at least three big concerns with regressions which we will mention at different points in this activity.
CONCERN 1: Unequally influential observations
OLS assumes extreme values are very rare, and it gets squirrely when it sees them. So let's try a standard robust variance estimator, which helps to account for dependent data and can also be used to detect influential observations. Generally, large residuals (the difference between predicted and observed values) and high leverage (how far an independent variable deviates from its mean) are flags for using a robust variance estimator.
regress lowbirthweightvalue airpollutionparticulatematterval [w = populationestimatevalue], vce(robust)
We can keep this as our simplest regression finding:
est store simple
Maybe water pollution is what we should worry about instead of air pollution?
regress lowbirthweightvalue airpollutionparticulatematterval ///
    drinkingwaterviolationsvalue [w = populationestimatevalue], vce(robust)
No need to keep this finding - it was a dead end. But maybe the counties with the most air pollution have poor access to health care
regress lowbirthweightvalue airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    [w = populationestimatevalue], vce(robust)
est store plushealthcare
Or maybe the counties with the most air pollution also have systematic differences in behavior
regress lowbirthweightvalue airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    percentofpopulationthatisnonhisp [w = populationestimatevalue], vce(robust)
// store regression results
est store plusbehav
Now let's make a table of the analyses we've built up:
outreg2 [simple plushealthcare plusbehav] using myfile, replace see
Now we can examine the predicted levels of low birthweight at county pollution extremes (net of county income, health services, and demographics) using the margins command, which estimates margins of responses for specified values of covariates and presents the results as a table, or predicted probabilities.

margins, at(airpollutionparticulatematterval=(7 14)) atmeans vsquish
Now let's graph the predicted relationship:

predict plowbirthweightvalue
twoway (scatter plowbirthweightvalue airpollutionparticulatematterval)
This brings up the second concern:
CONCERN 2: Non-linear relationships
One approach: look for non-linearities in the residuals of the main model without pollution.
regress lowbirthweightvalue ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue uninsuredvalue ///
    percentofpopulationthatisnonhisp percentofpopulationthatishispani ///
    teenbirthsvalue somecollegevalue ///
    adultsmokingvalue excessivedrinkingvalue physicalinactivityvalue ///
    if airpollutionparticulatematterval ~= . [w = populationestimatevalue], vce(robust)
predict calculates predictions, residuals, influence statistics, and the like after estimation. The following code creates plbw1 with the default prediction for the previous regress command.
predict plbw1
We can also create a new variable rlbw1 with the residuals from the previous regression:
predict rlbw1, residuals
Plotting this we get:
lowess rlbw1 airpollutionparticulatematterval
Another approach is to break the key independent variable into discrete categories. We do this by first turning airpollutionparticulatematterval into an integer, and then using fvset to identify the base level and specify how to accumulate statistics over levels.
gen airpollutionparticulatematterint = int(airpollutionparticulatematterval)
fvset base 11 airpollutionparticulatematterint
regress lowbirthweightvalue i.airpollutionparticulatematterint ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    [w = populationestimatevalue], vce(robust)
Here we run into our last concern:
CONCERN 3: The possibility of “hot-spots”
Logistic Regressions
Logistic or logit models are used to model dichotomous outcome variables. Following the example above, we can turn low birthweight into a binary (low birthweight/no low birthweight) to test what factors influence this outcome. According to summary stats, low birthweight has a sample mean of 8.2%, so we'll use that as the cutoff.
generate lowbirthweightcounty_yn = .
replace lowbirthweightcounty_yn = 0 if lowbirthweightvalue > 0 & lowbirthweightvalue < .082
replace lowbirthweightcounty_yn = 1 if lowbirthweightvalue >= .082 & lowbirthweightvalue < .24
Now we'll run the logit model on the dichotomous outcome to see if the relationship still shows:

logit lowbirthweightcounty_yn airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue ///
    primarycarephysiciansvalue couldnotseedoctorduetocostvalue ///
    percentofpopulationthatisnonhisp, vce(robust)
Commands for RCTs
T-tests
ttest is useful if you want to compare treatment and control group values on a continuous variable. In simple terms, it tests whether the difference between the means of a group or groups and a value is 0.
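The output unpacked below comes from a two-sample invocation along these lines (a sketch: outcome and treatment are hypothetical variable names, with treatment coded 0/1):

```stata
// two-sample t test of outcome by treatment group, allowing unequal variances
ttest outcome, by(treatment) unequal
```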
Now let's unpack this table:
Variable | Meaning |
---|---|
Group | The categories of the independent variable. In this case 0 and 1. |
Obs | The number of valid and non-missing observations in each group. |
Mean | The mean of the dependent variable for each level of the independent variable. The last line is the difference between the means. |
Std. Err. | Standard error of the mean for each level of the independent variable. |
Std. Dev. | The standard deviation of the dependent variable for each of the levels of the independent variable. The last line is the standard deviation for the difference. |
95% Conf. Interval | The lower and upper confidence limits of the means. |
diff | The value that we're testing (the difference in means between group 0 and group 1). |
t | The test statistic we use to evaluate our hypothesis. It's the ratio of the difference in means between the two groups to the standard error of that difference. |
Satterthwaite’s degrees of freedom | When variances are assumed to be unequal, Satterthwaite’s is an alternative way to calculate the degrees of freedom that takes this into account. It is a more conservative approach than using the traditional degrees of freedom. |
Pr(|T| > |t|) | The two-tailed p-value computed using the t distribution. If the p-value is less than 0.05, we can say that the difference in means is statistically significant. |
Pr(T < t), Pr(T > t) | These are the one-tailed p-values for the alternative hypotheses (difference < 0) and (difference > 0), respectively. |
You can also use ttesti with immediate commands if you are just handed the summary statistics.

// ttesti Ntreat meantreat sdtreat Ncont meancont sdcont
ttesti 4252 18.1 12.9 6764 32.6 18.2, unequal
Proportion Test
prtest allows you to compare treatment and control group values on a categorical variable.

prtest nonstandard if (RACECEN1==1 | RACECEN1==2), by(RACECEN1)
Again, you can do this with immediate commands (prtesti) if you are just given summary statistics.

// prtesti Ntreat ptreat Ncont pcont
prtesti 345 .3536 1900 .1411
Some other regression-style models you might be asked to run, with commands and outputs similar to regress:
Model | Description |
---|---|
probit | probit model |
mlogit | multinomial logit model |
ologit | ordinal logit model |
tobit | censored regression model |
xi: glm | loglinear model (plus fixed effects and random effects models for categorical variables) |
streg | hazard model (aka rate/survival model) |
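These commands share regress-style syntax, including weights and vce(robust). As a sketch, the dichotomous outcome built above could be refit as a probit, reusing variables from the running example:

```stata
// probit analogue of the earlier logit model
probit lowbirthweightcounty_yn airpollutionparticulatematterval ///
    childreninpovertyvalue medianhouseholdincomevalue, vce(robust)
```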