Robust regression vs. robust standard errors, Part 2

Tuesday, February 16, 2010 at 12:00AM

Michael Mitchell in Statistics

Last week we talked about **robust regression**, showing how this technique can be useful for reducing the influence of outlying observations. This week we will consider **regression with robust standard errors**. Although the names sound similar (both contain the word **robust**), the two techniques are used in very different situations.

Let's first take a step back and consider one of the assumptions of linear regression models. The residuals are assumed to show **homoscedasticity**, which in the ANOVA world is called **homogeneity of variance**. In short, the variance of the residuals is assumed to be the same across different values of the predictor. Violation of this assumption can yield incorrect standard errors (either underestimated or overestimated, depending on the pattern of the violation). This can lead to p-values that are either too high or too low. However, by using **robust standard errors** via the **robust** option in Stata, you can obtain standard errors (and therefore tests and confidence intervals) that are appropriate even when the **homoscedasticity** assumption is violated. Let's consider an example.
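Before turning to that example, here is a minimal sketch of the problem using simulated data. Everything in it is hypothetical (the variables **x** and **y**, the seed, the sample size, and the coefficients are made up for illustration and are not part of the **agetalk** data). The error standard deviation is made to grow with **x**, so the homoscedasticity assumption is violated by construction.

```stata
* Hypothetical simulation, not part of the agetalk example.
* Note: -clear- drops any data currently in memory.
clear
set seed 12345
set obs 1000

generate x = runiform()*10
* The error standard deviation (1 + x) grows with x: heteroscedasticity
generate y = 2 + 3*x + rnormal(0, 1 + x)

regress y x                 // conventional standard errors
regress y x, vce(robust)    // heteroscedasticity-robust standard errors
```

The two **regress** commands should report identical coefficients and differ only in the standard errors and the statistics derived from them. The **vce(robust)** option used here is the newer spelling of the **robust** option.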

The fictional **agetalk** dataset contains information about the age of a person and the amount of time that they talk on the telephone. Below we use the file and provide summary statistics for the variables **age** and **talk**.

. use agetalk

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |      5000      31.274    13.27999         13         65
        talk |      5000    336.5742     43.6513         80        521

The variable **age** ranges from 13 to 65, and the amount of time spent talking on the telephone each day (**talk**) ranges from 80 to 521 minutes. Let's perform a regression predicting **talk** from **age**.

. regress talk age

      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  1,  4998) = 3083.30
       Model |   3634231.2     1   3634231.2           Prob > F      =  0.0000
    Residual |  5891045.27  4998  1178.68053           R-squared     =  0.3815
-------------+------------------------------           Adj R-squared =  0.3814
       Total |  9525276.47  4999  1905.43638           Root MSE      =  34.332

------------------------------------------------------------------------------
        talk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -2.030331   .0365644   -55.53   0.000    -2.102013   -1.958649
       _cons |   400.0708   1.242322   322.03   0.000     397.6353    402.5063
------------------------------------------------------------------------------

The results show a negative association between **age** and the amount of time talking on the telephone. Specifically, each additional year of age is associated with about 2.03 fewer minutes of talking on the phone per day. While there are many assumptions that we should investigate before going with this result, let's focus on the **homoscedasticity** assumption. One way we can evaluate this is by using the **rvfplot** command.

. rvfplot

This command shows the residuals on the y axis and the predicted (fitted) values on the x axis. The spread of the residuals is clearly wider for the smaller fitted values and narrower for the larger fitted values. This is often described as a **fan spread** pattern because it looks like a handheld fan. This is a classic violation of the homogeneity of variance assumption.
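The visual check can be complemented with a formal test. The sketch below is an addition to the original example: it redraws the plot with a reference line at zero and then runs **estat hettest**, the Breusch-Pagan / Cook-Weisberg test, whose null hypothesis is constant residual variance. It assumes the **agetalk** file is in the current folder, and no output is shown because the test was not part of the original analysis.

```stata
use agetalk, clear
regress talk age

* Residual-versus-fitted plot with a horizontal reference line at zero
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest
```

A small p-value from **estat hettest** would be consistent with the fan spread seen in the plot. Run the test after an ordinary **regress**, not after a model fit with robust standard errors.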

To address this, let's include the **robust** option to compute robust standard errors, which are tolerant of violations of this assumption.

. regress talk age, robust

Linear regression                                      Number of obs =    5000
                                                       F(  1,  4998) = 1902.90
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3815
                                                       Root MSE      =  34.332

------------------------------------------------------------------------------
             |               Robust
        talk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -2.030331   .0465434   -43.62   0.000    -2.121577   -1.939086
       _cons |   400.0708   1.209263   330.84   0.000     397.7001    402.4415
------------------------------------------------------------------------------

Notice how the estimate of the coefficient is exactly the same, but the **robust standard error** is different. In this case, the robust standard error for **age** is larger (**0.047** as compared to **0.037**), although the coefficient is significant in both cases. Given the fan spread pattern in the residuals, the results with robust standard errors appear most appropriate.
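To see the two sets of results side by side, one option is to store each fit and tabulate them together. This is a sketch rather than part of the original example: the stored names **ols** and **rob** are arbitrary labels chosen here, and the display formats are just one reasonable choice.

```stata
use agetalk, clear

* Conventional standard errors
regress talk age
estimates store ols

* Robust standard errors (vce(robust) is equivalent to the robust option)
regress talk age, vce(robust)
estimates store rob

* Coefficients with standard errors beneath them, one column per fit
estimates table ols rob, b(%9.4f) se(%9.4f)
```

The coefficient rows should match across the two columns, and only the standard errors should differ.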

You can download the example data files for this tidbit (as well as all of the other tidbits) as shown below. These commands will download all of the example data files into the current folder on your computer. (If you have done this before, then you may need to specify **net get stowdata, replace** to overwrite the existing files.)

net from http://www.MichaelNormanMitchell.com/storage/stowdata

net get stowdata

If you have thoughts on this Stata Tidbit of the Week, you can post a comment. You can also send me an email at MichaelNormanMitchell and then the at sign and gmail dot com. If you are receiving this tidbit via email, you can find the web version at http://www.michaelnormanmitchell.com/ .

Article originally appeared on Michael Norman Mitchell (http://www.michaelnormanmitchell.com/).

See website for complete article licensing information.