
## Stata Tidbits

These tidbits contain bits and pieces of information that I hope will help you use Stata more effectively. You can receive email notifications of new tidbits as they are added by clicking on the subscribe box at the left. (Every email has an unsubscribe link, making it a snap to unsubscribe.)
Tuesday, Feb 16, 2010

## Robust regression vs. robust standard errors, Part 2

Last week we talked about robust regression, showing how this technique can be useful for reducing the influence of outlying observations. This week we will consider regression with robust standard errors. Although the names sound similar (both contain the word robust), the two techniques are used in very different situations.

Let's first take a step back and consider one of the assumptions of linear regression models. The residuals are assumed to show homoscedasticity, which in the ANOVA world is called homogeneity of variance. In short, the variance of the residuals is assumed to be the same across different values of the predictor. Violation of this assumption can yield incorrect standard errors (either underestimated or overestimated, depending on the pattern of the violation). This, in turn, can lead to p-values that are either too high or too low. However, by using robust standard errors via the robust option in Stata, you can obtain standard errors that are appropriate even when the homoscedasticity assumption is violated. Let's consider an example.
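For readers who like to see the assumption written out, here is a brief sketch in matrix notation. The symbols are my own shorthand rather than anything from the Stata output: $X$ is the $n \times k$ matrix of predictors (including the constant), $x_i$ is the row for observation $i$ written as a column vector, and $e_i$ is that observation's residual.

$$
\begin{aligned}
\text{Homoscedasticity:}\quad & \operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2 \quad \text{for every } i \\[4pt]
\text{Conventional estimate:}\quad & \widehat{V}_{\text{conv}} = s^2 (X'X)^{-1}, \qquad s^2 = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2 \\[4pt]
\text{Robust estimate:}\quad & \widehat{V}_{\text{robust}} = \frac{n}{n-k}\,(X'X)^{-1}\left(\sum_{i=1}^{n} e_i^2\, x_i x_i'\right)(X'X)^{-1}
\end{aligned}
$$

The conventional formula pools all of the squared residuals into a single $s^2$, which only makes sense if every observation shares the same residual variance. The robust formula lets each observation contribute its own $e_i^2$, so it does not lean on that assumption. The standard errors are the square roots of the diagonal elements of these matrices. Note that the coefficients themselves do not appear to change between the two formulas, only the variance estimate does.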

The fictional agetalk dataset contains information about the age of a person and the amount of time that they talk on the telephone. Below we use the file and provide summary statistics for the variables age and talk.

```
. use agetalk

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |      5000      31.274    13.27999         13         65
        talk |      5000    336.5742     43.6513         80        521
```

The variable age ranges from 13 to 65 and the amount of time talking on the telephone (daily) ranges from 80 to 521. Let's perform a regression predicting talk from age.

```
. regress talk age

      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  1,  4998) = 3083.30
       Model |   3634231.2     1   3634231.2           Prob > F      =  0.0000
    Residual |  5891045.27  4998  1178.68053           R-squared     =  0.3815
-------------+------------------------------           Adj R-squared =  0.3814
       Total |  9525276.47  4999  1905.43638           Root MSE      =  34.332

------------------------------------------------------------------------------
        talk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -2.030331   .0365644   -55.53   0.000    -2.102013   -1.958649
       _cons |   400.0708   1.242322   322.03   0.000     397.6353    402.5063
------------------------------------------------------------------------------
```

The results show a negative association between age and the amount of time talking on the telephone. In fact, each additional year of age is associated with about 2.03 fewer minutes on the phone per day. While there are many assumptions that we should investigate before going with this result, let's focus on the homoscedasticity assumption. One way we can evaluate this is by using the rvfplot command.

`. rvfplot`

This command shows the residuals on the y axis and the predicted (fitted) values on the x axis. The spread of the residuals is clearly wider for the smaller fitted values and narrower for the larger fitted values. This is often described as a fan spread pattern because it looks like a handheld fan. This is a classic violation of the homogeneity of variance assumption.
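If you would like a formal test to go along with the visual check, Stata offers postestimation commands for this. The sketch below is an addition to the original example, not part of it. It assumes that agetalk.dta is in your current folder and is meant to be run as a do-file. I make no claim here about what the tests report for these data.

```stata
* Assumes agetalk.dta is in the current folder
use agetalk, clear

* Fit the model with conventional standard errors
regress talk age

* Residual-versus-fitted plot, with a reference line at zero
rvfplot, yline(0)

* White's general test for heteroskedasticity
estat imtest, white

* Breusch-Pagan / Cook-Weisberg test, based on the fitted values
estat hettest
```

Both tests take constant residual variance as the null hypothesis, so a small p-value points in the same direction as a fan-shaped plot. With 5000 observations, even a mild departure can be statistically significant, so I would still look at the plot to judge how serious the pattern is.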

To address this, let's include the robust option to compute robust standard errors which are tolerant of the violation of this assumption.

```
. regress talk age, robust

Linear regression                                      Number of obs =    5000
                                                       F(  1,  4998) = 1902.90
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3815
                                                       Root MSE      =  34.332

------------------------------------------------------------------------------
             |               Robust
        talk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -2.030331   .0465434   -43.62   0.000    -2.121577   -1.939086
       _cons |   400.0708   1.209263   330.84   0.000     397.7001    402.4415
------------------------------------------------------------------------------
```

Notice how the estimate of the coefficient is exactly the same, but the robust standard error is different. In this case, it is larger (0.047 as compared to 0.037), which makes the t-value smaller in magnitude (-43.62 as compared to -55.53), although the coefficient is significant in both cases. Given the nature of the distribution of the residuals, the model with robust standard errors appears to be the more appropriate one.
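To see the two sets of standard errors side by side, you could store each set of results and display them in one table. This is a sketch that I have added, assuming agetalk.dta is in your current folder and that you run it as a do-file. The names conv and rob are arbitrary labels that I chose for the stored results.

```stata
* Assumes agetalk.dta is in the current folder
use agetalk, clear

* Conventional standard errors
regress talk age
estimates store conv

* Robust standard errors; vce(robust) requests the same thing as the robust option
regress talk age, vce(robust)
estimates store rob

* Coefficients and standard errors from both models in one table
estimates table conv rob, b(%9.6f) se(%9.6f)

* Each t-value is the coefficient divided by its standard error,
* using the numbers reported in the output above
display -2.030331 / .0365644   // conventional
display -2.030331 / .0465434   // robust
```

The two display commands should reproduce, up to rounding, the t-values of -55.53 and -43.62 shown in the regression tables. This illustrates that the robust option changes only the standard error in the denominator, not the coefficient in the numerator.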

You can download the example data files from this tidbit (as well as all of the other tidbits) as shown below. These commands will download all of the example data files into the current folder on your computer. (If you have done this before, then you may need to specify `net get stowdata, replace` to overwrite the existing files.)

```
net from http://www.MichaelNormanMitchell.com/storage/stowdata
net get stowdata
```

If you have thoughts on this Stata Tidbit of the Week, you can post a comment. You can also send me an email at MichaelNormanMitchell and then the at sign and gmail dot com. If you are receiving this tidbit via email, you can find the web version at http://www.michaelnormanmitchell.com/ .