## Robust regression vs. robust standard errors, Part 1

The name **robust regression** sounds similar to **regression with robust standard errors**, but they are actually very different techniques, used in different situations. This tidbit briefly describes robust regression and when that technique could be useful. Next week's tidbit will address regression with robust standard errors and contrast the two techniques.

First, let's consider robust regression. This technique is useful when there are outlying observations that could influence the regression coefficients. Outlying observations are downweighted (their influence is diminished), and extremely outlying observations can be given a weight of 0 (removing their influence entirely).
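To make the downweighting idea concrete, here is a minimal Python sketch (not Stata's exact implementation) of the two weight functions behind **rreg**: Huber weights, then Tukey biweights. The tuning constants shown (1.345 and 4.685) are the conventional defaults from the robust-regression literature; Stata's internal constants may differ slightly.

```python
def huber_weight(u, c=1.345):
    """Full weight for small standardized residuals; shrinks as |u| grows."""
    au = abs(u)
    return 1.0 if au <= c else c / au

def biweight(u, c=4.685):
    """Tukey biweight: smooth downweighting, exactly zero beyond the cutoff."""
    if abs(u) >= c:
        return 0.0
    t = 1.0 - (u / c) ** 2
    return t * t

print(huber_weight(0.5))  # well-behaved observation: full weight -> 1.0
print(huber_weight(5.0))  # outlier: 1.345 / 5 -> 0.269
print(biweight(10.0))     # extreme outlier: weight -> 0.0
```

Note that the Huber weight never quite reaches zero, while the biweight gives extreme observations a weight of exactly 0, which is what "removing their influence entirely" means above.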

The Stata **auto** dataset contains a good example that we can use, looking at the relationship between a car's weight and its miles per gallon. Let's first load this dataset.

. sysuse auto

(1978 Automobile Data)

Now let's have a quick look at a scatterplot of miles per gallon (**mpg**) by weight of the car (**weight**). But first, let's make a variable called **wt1k** that is **weight** divided by 1000.

. generate wt1k = weight / 1000

Now let's look at the scatterplot of **mpg** by **wt1k**. The low-weight, high-mpg cars could be influential (e.g., the VW Diesel, the Datsun 210, the Subaru, and the Plym. Champ).

. scatter mpg wt1k, mlabel(make) mlabsize(large)

Let's run an OLS regression predicting **mpg** from **wt1k**.

. regress mpg wt1k

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =  134.62
       Model |  1591.99024     1  1591.99024           Prob > F      =  0.0000
    Residual |  851.469221    72  11.8259614           R-squared     =  0.6515
-------------+------------------------------           Adj R-squared =  0.6467
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4389

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wt1k |  -6.008687   .5178782   -11.60   0.000    -7.041058   -4.976316
       _cons |   39.44028   1.614003    24.44   0.000     36.22283    42.65774
------------------------------------------------------------------------------

This regression shows that for every 1,000-pound increase in weight, **mpg** is expected to decrease by about 6 miles per gallon. Looking at the leverage-versus-residual-squared plot (below), the VW Diesel has the highest squared residual and above-average leverage. This could be an influential observation.

. lvr2plot , mlabel(make) mlabsize(large)
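To see how a single influential point can pull an OLS slope, here is a small pure-Python sketch on hypothetical data (not the **auto** data): five cars lying on a line where heavier means fewer miles per gallon, plus one light, very-high-mpg outlier standing in for the VW Diesel.

```python
def ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical data: five cars on the line mpg = 40 - 6 * weight(1000s),
# plus one light, very-high-mpg outlier (a stand-in for the VW Diesel).
wt = [2.0, 2.5, 3.0, 3.5, 4.0, 2.1]
mpg = [28, 25, 22, 19, 16, 41]

_, slope_without = ols(wt[:5], mpg[:5])
_, slope_with = ols(wt, mpg)
print(round(slope_without, 2))  # -6.0: the trend of the bulk of the data
print(round(slope_with, 2))     # -9.21: one outlier steepens the slope
```

One outlying observation with both an unusual x value and a large residual is enough to move the slope by more than 50 percent here, which is the kind of influence the lvr2plot is designed to flag.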

Let's try running this as a robust regression and compare the results to the OLS results. But first, let's save the predicted values from the OLS regression, calling them **yhatols**.

. predict yhatols

(option xb assumed; fitted values)

Now, let's run this as a robust regression using the **rreg** command.

. rreg mpg wt1k

   Huber iteration 1:  maximum difference in weights = .79065461
   Huber iteration 2:  maximum difference in weights = .16435059
   Huber iteration 3:  maximum difference in weights = .07997524
   Huber iteration 4:  maximum difference in weights = .0208614
Biweight iteration 5:  maximum difference in weights = .27513221
Biweight iteration 6:  maximum difference in weights = .12290071
Biweight iteration 7:  maximum difference in weights = .0699518
Biweight iteration 8:  maximum difference in weights = .01619963
Biweight iteration 9:  maximum difference in weights = .00890816

Robust regression                                      Number of obs =      74
                                                       F(  1,    72) =  249.65
                                                       Prob > F      =  0.0000

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wt1k |  -5.341891   .3380843   -15.80   0.000     -6.01585   -4.667933
       _cons |   36.66249   1.053663    34.80   0.000     34.56205    38.76293
------------------------------------------------------------------------------

The coefficient for **wt1k** was -6.01 in the OLS regression and is -5.34 in the robust regression, so the slope from the robust regression is slightly flatter. Let's create a variable **yhatrreg** that contains the predicted values from the robust regression.

. predict yhatrreg

(option xb assumed; fitted values)

Now let's visually compare the results of the OLS and robust regressions, as shown below.

. graph twoway (scatter mpg wt1k) (line yhatols yhatrreg wt1k, sort) , ///

> legend(label (1 "Observed MPG") label(2 "OLS Regression") label(3 "Robust Regression"))

The graph above shows the OLS regression line in red and the robust regression line in green. The robust regression line is not as steep because it was influenced less by outlying observations like the VW Diesel. In this case, robust regression may appropriately discount the influence of outlying observations.
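Mechanically, this flatter slope comes from iteratively reweighted least squares: fit, compute residuals, downweight the large ones, refit, and stop once the weights settle down, which is what the "maximum difference in weights" lines in the iteration log are tracking. Here is a minimal pure-Python sketch using Huber weights only, on hypothetical data shaped like this example (Stata's **rreg** adds biweight iterations and other refinements, so this is an illustration, not a replication).

```python
def wls(x, y, w):
    """Weighted least squares for one predictor: returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b

def huber_irls(x, y, c=1.345, tol=0.01, max_iter=50):
    """Iteratively reweighted least squares with Huber weights.

    Stops when the largest change in any weight falls below tol,
    mirroring the 'maximum difference in weights' in rreg's log.
    """
    w = [1.0] * len(x)
    for _ in range(max_iter):
        a, b = wls(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Scale residuals by a robust spread estimate (MAD-based);
        # assumes the data are noisy enough that this is nonzero.
        s = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745
        new_w = [1.0 if abs(r / s) <= c else c * s / abs(r) for r in resid]
        if max(abs(nw - ow) for nw, ow in zip(new_w, w)) < tol:
            break
        w = new_w
    return wls(x, y, new_w)

# Hypothetical data: five "well-behaved" cars near the line
# mpg = 40 - 6 * weight(1000s), plus one light, very-high-mpg outlier.
wt = [2.0, 2.5, 3.0, 3.5, 4.0, 2.1]
mpg = [28.2, 24.9, 22.1, 18.8, 16.1, 41.0]

a, b = huber_irls(wt, mpg)
print(round(b, 1))  # near -6: much flatter than the OLS slope of about -9.3
```

The outlier's weight shrinks across the iterations, so the final slope tracks the bulk of the data, just as the **rreg** slope above is flatter than the OLS slope.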

Next week we will look at regression with robust standard errors.

You can download the example data files for this tidbit (as well as all of the other tidbits) as shown below. These commands will download all of the example data files into the current folder on your computer. (If you have done this before, you may need to specify **net get stowdata, replace** to overwrite the existing files.)

net from http://www.MichaelNormanMitchell.com/storage/stowdata

net get stowdata

If you have thoughts on this Stata Tidbit of the Week, you can post a comment. You can also send me an email at MichaelNormanMitchell and then the at sign and gmail dot com. If you are receiving this tidbit via email, you can find the web version at http://www.michaelnormanmitchell.com/ .

## Reader Comments (4)

When or why would one want to use this robust regression technique as opposed to quantile regression, which also places less emphasis on outliers? Are there any rules of thumb when to use one versus the other?

This is a great question, Brian. As you note, both "robust regression" and "quantile regression" are techniques that reduce the influence of outliers. I am not aware of any "rules of thumb" on when to choose one over the other. To me, there are two factors I would weigh. First, I would ask whether we want to estimate the "mean" of the outcome conditional on the predictors, or the "median" (or another percentile) conditional on the predictors. If we want to estimate the mean, go with robust regression; if the median, go with quantile regression. The other factor that comes to mind is the quantity and influence of the outliers. If there is a very large number of highly influential outliers (say 5% or 10% of the cases) and the results of OLS vs. robust regression would be substantially different, then it seems to me that is pushing the limits of "robust regression", since such a high percentage of observations are being downweighted. In such a case, quantile regression seems preferable.

Thanks for the question. I hope others might weigh in with other thoughts.

Firstly, thanks for putting together a great website; may there be more tidbits to come.

Now I don't want to 'nitpick' a 'tidbit' .. but in the graph where you plot the predicted curves comparing robust to OLS regression, you label the scatter points as 'predicted MPG', though aren't they really 'observed MPG'?

Dear Charles

Thanks both for the kind words and the correction. You are right on the mark, and I am grateful that you let me know so I can fix it. The graph and the code have been fixed. Feel free to 'nitpick' any time!

Best regards,

Michael