Stata Tidbits

Tuesday, February 16, 2010

Robust regression vs. robust standard errors, Part 2

Last week we talked about robust regression, showing how this technique can be useful for reducing the influence of outlying observations. This week we will consider regression with robust standard errors. Although the names sound similar (both contain the word robust), the two techniques are used in very different situations.

Let's first take a step back and consider one of the assumptions of linear regression models. The residuals are assumed to show homoscedasticity; in the ANOVA world this is called homogeneity of variance. In short, the variance of the residuals is assumed to be the same across all values of the predictor. Violating this assumption can yield incorrect standard errors (underestimated or overestimated, depending on the pattern of the violation), which in turn can lead to p-values that are too high or too low. However, by using robust standard errors via the robust option in Stata, you can obtain standard errors that remain appropriate even when the homoscedasticity assumption is violated. Let's consider an example.

The fictional agetalk dataset contains information about the age of a person and the amount of time that they talk on the telephone. Below we use the file and provide summary statistics for the variables age and talk.

. use agetalk

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |      5000      31.274    13.27999         13         65
        talk |      5000    336.5742     43.6513         80        521

The variable age ranges from 13 to 65, and the amount of time talking on the telephone (in minutes per day) ranges from 80 to 521. Let's perform a regression predicting talk from age.

. regress talk age

      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  1,  4998) = 3083.30
       Model |   3634231.2     1   3634231.2           Prob > F      =  0.0000
    Residual |  5891045.27  4998  1178.68053           R-squared     =  0.3815
-------------+------------------------------           Adj R-squared =  0.3814
       Total |  9525276.47  4999  1905.43638           Root MSE      =  34.332

------------------------------------------------------------------------------
        talk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -2.030331   .0365644   -55.53   0.000    -2.102013   -1.958649
       _cons |   400.0708   1.242322   322.03   0.000     397.6353    402.5063
------------------------------------------------------------------------------

The results show a negative association between age and the amount of time talking on the telephone. Each additional year of age is associated with about 2.03 fewer minutes of talking on the phone per day. While there are many assumptions that we should investigate before accepting this result, let's focus on the homoscedasticity assumption. One way we can evaluate this is by using the rvfplot command.

. rvfplot

This command plots the residuals on the y axis against the predicted (fitted) values on the x axis. The spread of the residuals is clearly wider for the smaller predicted values and narrower for the larger predicted values. This is often described as a fan-spread pattern because it looks like a hand-held fan. This is a classic violation of the homogeneity of variance assumption.
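If you would like a formal test to go along with the visual check, the sketch below shows two postestimation commands that Stata provides after regress. It assumes the agetalk data are still in memory. The variable names yhat and ehat are my own hypothetical choices for building the same kind of plot by hand; they are not part of the dataset.

```stata
* Sketch: formal checks to complement the residual-versus-fitted plot.
* Assumes the agetalk data are in memory.
regress talk age

* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant variance
estat hettest

* White's general test for heteroscedasticity
estat imtest, white

* Hand-built version of the rvfplot graph
* (yhat and ehat are hypothetical names for the new variables)
predict double yhat, xb
predict double ehat, residuals
scatter ehat yhat, yline(0)
```

A small p-value from either test would point to the same violation of the homogeneity of variance assumption that the plot suggests.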

To address this, let's include the robust option to compute robust standard errors, which are tolerant of violations of this assumption.

. regress talk age, robust

Linear regression                                      Number of obs =    5000
                                                       F(  1,  4998) = 1902.90
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3815
                                                       Root MSE      =  34.332

------------------------------------------------------------------------------
             |               Robust
        talk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -2.030331   .0465434   -43.62   0.000    -2.121577   -1.939086
       _cons |   400.0708   1.209263   330.84   0.000     397.7001    402.4415
------------------------------------------------------------------------------

Notice how the estimate of the coefficient is exactly the same, but the robust standard error is different. In this case, it is larger (0.047 as compared to 0.037, or roughly 27 percent larger), although the coefficient is significant in both cases. Given the fan-shaped distribution of the residuals, the results with robust standard errors appear to be the most appropriate.
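One way to see the two sets of standard errors side by side is sketched below. It assumes the agetalk data are in memory. The names ols, rob, and hc3 are hypothetical labels I chose for the stored results. The vce(robust) option is another way of asking for the same thing as the robust option, while vce(hc3) is an alternative correction that the example above did not use; it is included only for comparison.

```stata
* Sketch: conventional and robust standard errors in one table.
* Assumes the agetalk data are in memory.
* ols, rob, and hc3 are hypothetical names for the stored results.

* Conventional (homoscedasticity-based) standard errors
quietly regress talk age
estimates store ols

* Robust standard errors; vce(robust) is equivalent to the robust option
quietly regress talk age, vce(robust)
estimates store rob

* HC3 standard errors, an alternative correction shown for comparison
quietly regress talk age, vce(hc3)
estimates store hc3

* One column per set of results: same coefficients, different standard errors
estimates table ols rob hc3, b(%9.6f) se(%9.6f)
```

The coefficient rows should match across the columns, since the choice of standard errors does not change the least squares estimates; only the standard errors beneath them should differ.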

You can download the example data files from this tidbit (as well as all of the other tidbits) as shown below. These commands will download all of the example data files into the current folder on your computer. (If you have done this before, you may need to specify net get stowdata, replace to overwrite the existing files.)

net from http://www.MichaelNormanMitchell.com/storage/stowdata
net get stowdata

If you have thoughts on this Stata Tidbit of the Week, you can post a comment. You can also send me an email at MichaelNormanMitchell and then the at sign and gmail dot com. If you are receiving this tidbit via email, you can find the web version at http://www.michaelnormanmitchell.com/ .

 
