Stata Tidbits

These tidbits contain bits and pieces of information I hope you find helpful for using Stata more effectively. You can receive notifications of new tidbits as they are added (via email) by clicking on the subscribe box at the left. (Every email has an unsubscribe link, making it a snap to unsubscribe.)

Robust regression vs. robust standard errors, Part 1

The name robust regression sounds similar to regression with robust standard errors, but these are actually very different techniques used for different kinds of situations. This tidbit briefly describes robust regression and when that technique could be useful. Next week's tidbit will address regression with robust standard errors and contrast these two techniques.

First, let's consider robust regression. This is a technique that is useful when there are outlying observations that could influence the regression coefficients. Outlying observations are downweighted (their influence is diminished) and extremely outlying observations can be weighted by a factor of 0 (removing their influence entirely).
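The downweighting idea can be sketched outside of Stata. Below is a minimal Python illustration of the Huber and Tukey biweight weight functions of the kind rreg uses; the tuning constants here are the conventional textbook defaults, not necessarily Stata's exact values, and the inputs are residuals already scaled by a robust estimate of spread.

```python
def huber_weight(u, c=1.345):
    """Huber weight: 1 for small scaled residuals, declining for large ones."""
    au = abs(u)
    return 1.0 if au <= c else c / au

def biweight(u, c=4.685):
    """Tukey biweight: smoothly downweights, reaching exactly 0 beyond c."""
    if abs(u) >= c:
        return 0.0  # extreme outliers get a weight of 0, removing their influence
    return (1 - (u / c) ** 2) ** 2

# A residual near zero keeps full weight; a huge residual is dropped entirely.
print(huber_weight(0.5))       # 1.0
print(round(biweight(10), 4))  # 0.0
```

Note the difference between the two: Huber weights never reach zero, while the biweight assigns exactly zero weight beyond its cutoff, which is how extreme outliers can be removed entirely.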

The Stata auto dataset contains a good example that we can use, looking at the relationship between a car's weight and its miles per gallon. Let's first load this dataset.

. sysuse auto
(1978 Automobile Data)

Now let's have a quick look at a scatterplot of miles per gallon (mpg) by weight of the car (weight). But first, let's make a variable called wt1k that is weight divided by 1000.

. generate wt1k = weight / 1000

Now let's look at the scatterplot of mpg by wt1k. The low-weight, high-mpg cars could be influential (e.g., the VW Diesel, the Datsun 210, Subaru, and Plym. Champ).

. scatter mpg wt1k, mlabel(make) mlabsize(large)

Let's run an OLS regression predicting mpg from wt1k.

. regress mpg wt1k

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =  134.62
       Model |  1591.99024     1  1591.99024           Prob > F      =  0.0000
    Residual |  851.469221    72  11.8259614           R-squared     =  0.6515
-------------+------------------------------           Adj R-squared =  0.6467
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4389

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wt1k |  -6.008687   .5178782   -11.60   0.000    -7.041058   -4.976316
       _cons |   39.44028   1.614003    24.44   0.000     36.22283    42.65774
------------------------------------------------------------------------------

This regression shows that for every increase of 1,000 pounds in weight, mpg is expected to decrease by about 6 miles per gallon. Looking at the leverage versus squared residual plot (below), the VW Diesel has the highest squared residual and above-average leverage. This could be an influential observation.

. lvr2plot , mlabel(make) mlabsize(large)
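The slope interpretation can be checked with quick arithmetic. Here is a small Python sketch (not Stata code) that plugs the intercept and slope from the OLS output above into the fitted equation; predictions at weights 1,000 pounds apart differ by exactly the slope.

```python
b0, b1 = 39.44028, -6.008687  # intercept and slope from the OLS output above

def yhat(wt1k):
    """Predicted mpg for a car weighing wt1k thousand pounds."""
    return b0 + b1 * wt1k

# Predicted mpg at 2,000 vs. 3,000 pounds: the drop equals the slope.
print(round(yhat(2.0) - yhat(3.0), 3))  # 6.009
```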

Let's try running this as a robust regression and compare the results to the OLS results. But first, let's save the predicted values from the OLS regression, calling them yhatols.

. predict yhatols
(option xb assumed; fitted values)

Now, let's run this as a robust regression using the rreg command.

. rreg mpg wt1k

Huber iteration 1: maximum difference in weights = .79065461
Huber iteration 2: maximum difference in weights = .16435059
Huber iteration 3: maximum difference in weights = .07997524
Huber iteration 4: maximum difference in weights = .0208614
Biweight iteration 5: maximum difference in weights = .27513221
Biweight iteration 6: maximum difference in weights = .12290071
Biweight iteration 7: maximum difference in weights = .0699518
Biweight iteration 8: maximum difference in weights = .01619963
Biweight iteration 9: maximum difference in weights = .00890816

Robust regression                                      Number of obs =      74
                                                       F(  1,    72) =  249.65
                                                       Prob > F      =  0.0000

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wt1k |  -5.341891   .3380843   -15.80   0.000     -6.01585  -4.667933
       _cons |   36.66249   1.053663    34.80   0.000     34.56205   38.76293
------------------------------------------------------------------------------
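The Huber and biweight iterations logged above come from iteratively reweighted least squares (IRLS): fit, compute residuals, downweight large residuals, and refit until the weights settle. Below is a minimal pure-Python sketch of IRLS with Huber weights on made-up data with one planted outlier. This illustrates the idea only; it is not Stata's exact algorithm, which uses different tuning details and follows the Huber steps with biweight steps.

```python
def wls(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    return ybar - b * xbar, b

def irls_huber(x, y, c=1.345, iters=20):
    """Iteratively reweighted least squares with Huber weights."""
    w = [1.0] * len(x)
    for _ in range(iters):
        a, b = wls(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate: median absolute deviation, rescaled.
        s = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745
        # Observations with large scaled residuals get weights below 1.
        w = [1.0 if abs(r) <= c * s else c * s / abs(r) for r in resid]
    return a, b

# Data on the line y = 2x, except one gross outlier at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 40]  # last point is the outlier

a_ols, b_ols = wls(x, y, [1.0] * len(x))  # OLS slope is pulled up to 4.0
a_rob, b_rob = irls_huber(x, y)           # robust slope ends up near 2
```

The outlier drags the OLS slope well away from the true slope of 2, while the reweighting loop progressively discounts it and recovers a slope close to 2, the same flattening effect seen in the rreg results above.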

The coefficient for wt1k was -6.008 in the OLS regression and is -5.341 in the robust regression; the robust regression slope is slightly flatter than the OLS slope. Let's create a variable yhatrreg that contains the predicted values from the robust regression.

. predict yhatrreg
(option xb assumed; fitted values)

Now let's visually compare the results of the OLS and robust regressions, as shown below.

. graph twoway (scatter mpg wt1k) (line yhatols yhatrreg wt1k, sort) , ///
> legend(label (1 "Observed MPG") label(2 "OLS Regression") label(3 "Robust Regression"))

The graph above shows the OLS regression line in red and the robust regression line in green. The robust regression line is not as steep because it was influenced less by outlying observations like the VW Diesel. In this case, robust regression may appropriately discount the influence of outlying observations.

Next week we will look at regression with robust standard errors.

You can download the example data files from this tidbit (as well as all of the other tidbits) as shown below. These will download all of the example data files into the current folder on your computer. (If you have done this before, then you may need to specify net get stowdata, replace to overwrite the existing files.)

net from
net get stowdata

If you have thoughts on this Stata Tidbit of the Week, you can post a comment. You can also send me an email at MichaelNormanMitchell and then the at sign and gmail dot com. If you are receiving this tidbit via email, you can find the web version at .



Reader Comments (4)

When or why would one want to use this robust regression technique as opposed to quantile regression, which also places less emphasis on outliers? Are there any rules of thumb when to use one versus the other?

February 9, 2010 | Unregistered CommenterBrian

This is a great question Brian. As you note, both "robust regression" and "quantile regression" are techniques that reduce the influence of outliers. I am not aware of any "rules of thumb" on when to choose one over the other. To me, there are two factors I would weigh. First, I would ask whether we want to estimate the "mean" of the outcome, conditional on the predictors, or the "median" (or other percentile) conditional on the predictors. If we want to estimate the "mean", then go with robust regression; if the median, then go with quantile regression. The other factor that comes to my mind is the quantity and influence of the outliers. If there is a very large number of highly influential outliers (say 5% or 10% of the cases) and the results of OLS vs. robust regression would be very substantially different, then it seems to me this is pushing the limits of "robust regression", since such a high percentage of observations is being downweighted. In such a case, quantile regression seems preferable.

Thanks for the question. I hope others might weigh in with other thoughts.

February 9, 2010 | Unregistered CommenterMichael Mitchell

Firstly, thanks for putting together a great website; may there be more tidbits to come.

Now I don't want to 'nitpick' a 'tidbit' .. but in the graph where you plot the predicted curves comparing robust to OLS regression, you label the scatter points as 'predicted MPG', though aren't they really 'observed MPG'?

March 18, 2010 | Unregistered Commentercharles

Dear Charles

Thanks both for the kind words and the correction. You are right on the mark and I am grateful for letting me know so I can fix it. The graph and the code have been fixed. Feel free to 'nitpick' any time!

Best regards,


March 18, 2010 | Unregistered CommenterMichael