Friday, December 21, 2012

The good and bad of fixed effects

If you ever want to scare an economist, the two words "omitted variable" will usually do the trick. I was not trained in an economics department, but I imagine they drill it into you from the first day. It’s an interesting contrast to statistics, where I received much of my training, and where the focus is much more on out-of-sample prediction skill. In economics, showing causality is often the name of the game, and it’s very important to make sure a relationship is not driven by a “latent” variable. Omitted variables can still matter for out-of-sample skill, but only if their relationships with the model variables change over space or time.

A common way to deal with omitted variable bias is to introduce dummy variables for space or time units. These “fixed effects” greatly reduce (but do not completely eliminate) the chance that a relationship is driven by an omitted variable. Fixed effects are very popular, and some economists seem to like to introduce them to the maximum extent possible. But as any economist can tell you (another lesson on day one?), there are no free lunches. In this case, the cost of reducing omitted variable problems is that you throw away a lot of the signal in the data.

Consider a bad analogy (bad analogies happen to be my specialty). Let’s say you wanted to know whether being taller caused you to get paid more. You could simply look at everyone’s height and income, and see if there was a significant correlation. But someone could plausibly argue that omitted variables related to height are actually causing the income variation. Maybe very young and old people tend to get paid less, and happen to be shorter. And women get paid less and tend to be shorter. And certain ethnicities might tend to be discriminated against, and also be shorter. And maybe living in a certain state that has good water makes you both taller and smarter, and being smarter is the real reason you earn more. And on and on and on we could go. A reasonable response would be to introduce dummy variables for all of these factors (gender, age, ethnicity, location). Then you’d be looking at whether people who are taller than average given their age, sex, ethnicity, and location get paid more than an average person of that age, sex, ethnicity, and location.

In other words, you end up comparing much smaller changes than if you were to look at the entire range of data. This helps calm the person grumbling about omitted variables (at least until they think of another one), and would probably be ok in this example, since all of these things can be measured very precisely. But think about what would happen if we could only measure height and income with 10% error. Taking out the fixed effects removes a lot of the signal but none of the noise, which in statistical terms means that the power of the analysis goes down.
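The power loss is easy to see in a small simulation. The sketch below is a toy version of the height-and-income story, with all numbers invented for illustration: a true effect of height on income, group (say, state) differences that account for most of the height variation, and measurement error in observed height. Demeaning by group is numerically equivalent to including group dummies, so comparing the two regressions shows how fixed effects shrink the signal while leaving the noise alone.

```python
# Toy simulation of the fixed-effects power trade-off described above.
# All variance components are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per = 50, 20
group = np.repeat(np.arange(n_groups), n_per)

# Height varies mostly BETWEEN groups (sd 8), only a little within (sd 2)
height = rng.normal(170, 8, n_groups)[group] + rng.normal(0, 2, n_groups * n_per)
income = 100 * height + rng.normal(0, 2000, n_groups * n_per)  # true slope = 100

# We only observe height with ~10% measurement error
height_obs = height + rng.normal(0, 0.10 * height.std(), n_groups * n_per)

def ols_slope(x, y):
    """Slope and its standard error from a bivariate regression."""
    x_c, y_c = x - x.mean(), y - y.mean()
    b = (x_c @ y_c) / (x_c @ x_c)
    resid = y_c - b * x_c
    se = np.sqrt(resid @ resid / (len(x) - 2) / (x_c @ x_c))
    return b, se

def demean_by(v, g):
    """Subtract group means: equivalent to including group fixed effects."""
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

b_raw, se_raw = ols_slope(height_obs, income)
b_fe, se_fe = ols_slope(demean_by(height_obs, group), demean_by(income, group))

print(f"no FE:    slope={b_raw:6.1f}  t={b_raw / se_raw:5.1f}")
print(f"group FE: slope={b_fe:6.1f}  t={b_fe / se_fe:5.1f}")
```

With fixed effects, the within-group height signal is only a little bigger than the measurement error, so the estimated slope is attenuated toward zero and its t-statistic drops sharply, even though the true effect is identical in both regressions.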

Now to a more relevant example. (Sorry, this is where things may get a little wonkish, as Krugman would say). I was recently looking at some district-level data on weather and nutritional outcomes in India that colleagues at Stanford and I are analyzing. As in most developing countries, the weather data in India are far from perfect. And as in most regression studies, we are worried about omitted variables. So what is the right level of fixed effects to include? Inspired by a table in a recent paper by some eminent economists (including a couple who have been rumored to blog on G-FEED once in a while), I calculated the standard deviation of residuals from regressions on different levels of fixed effects. The 2nd and 3rd columns in the table below show the results for summer (June-September) average temperatures (T) and rainfall (P). Units are not important for the point, so I’ve left them out:

                     sd(T)   sd(P)   Cor(T1,T2)   Cor(P1,P2)
 No FE                3.89    8.50      0.92         0.28
 Year FE              3.89    4.66      0.93         0.45
 Year + State FE      2.20    2.18      0.84         0.26
 Year + District FE   0.30    1.63      0.33         0.22

The different rows here correspond to the raw data (no fixed effects), and to the data after removing year fixed effects (FE), year + state FE, and year + district FE. Note how including year FE reduces P variation but not T, which indicates that most of the T variation comes from spatial differences, whereas a lot of the P variation comes from year-to-year swings that are common to all areas. Both get further reduced when introducing state FE, but there’s still a good amount of variation left. But when going to district FE, the variation in T gets cut by a factor of more than seven, from 2.20 to 0.30! That means the typical temperature deviation a regression model would be working with is less than a third of a degree Celsius.
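For readers who want to reproduce the mechanics of that table, here is a sketch on simulated data (the real district data are not shown here, so the districts, states, and variance components below are invented for illustration). Residuals from a regression on group dummies are obtained by demeaning, and for a balanced panel, demeaning by year and then by state (or district) is exactly the two-way fixed-effects residual.

```python
# Residual standard deviations after different levels of fixed effects,
# on a simulated balanced panel. All variance components are invented.
import numpy as np

rng = np.random.default_rng(1)
n_years, n_states, n_dist = 15, 10, 8          # 8 districts per state
n_units = n_states * n_dist

# Temperature = year shock + state level + district offset + local noise
year_fx  = rng.normal(0, 0.5, n_years)
state_fx = rng.normal(25, 2.0, n_states)
dist_fx  = rng.normal(0, 1.0, n_units)

y_idx = np.repeat(np.arange(n_years), n_units)
s_idx = np.tile(np.repeat(np.arange(n_states), n_dist), n_years)
d_idx = np.tile(np.arange(n_units), n_years)

T = year_fx[y_idx] + state_fx[s_idx] + dist_fx[d_idx] \
    + rng.normal(0, 0.3, len(y_idx))

def residual_sd(v, *groups):
    """sd of residuals after regressing v on dummies for each grouping
    (iterated demeaning, which is exact for a balanced panel)."""
    r = v - v.mean()
    for g in groups:
        means = np.bincount(g, weights=r) / np.bincount(g)
        r = r - means[g]
    return r.std()

print(f"no FE:              {residual_sd(T):.2f}")
print(f"year FE:            {residual_sd(T, y_idx):.2f}")
print(f"year + state FE:    {residual_sd(T, y_idx, s_idx):.2f}")
print(f"year + district FE: {residual_sd(T, y_idx, d_idx):.2f}")
```

Because most of the simulated T variation sits between states and districts, the residual sd collapses toward the local-noise level (0.3 here) once district dummies soak up the spatial differences, mirroring the pattern in the table.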

None of this is too interesting, but the 4th and 5th columns are where things get more related to the point about signal to noise. There I’m computing the correlation between two different datasets of T or P (details of which ones are not important). When there is a low correlation between two datasets that are supposed to be measuring the same thing, that’s a good indication that measurement error is a problem. So I’m using this correlation here as an indication of where fixed effects may really cause a problem with signal to noise.
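The logic here can also be sketched in a few lines. If two datasets both equal the same true series plus independent measurement error, their correlation is the share of signal in the total variance, and removing fixed effects strips out signal while leaving the error untouched. The numbers below are invented for illustration, not taken from the India data.

```python
# Why cross-dataset correlation falls as fixed effects soak up signal:
# two noisy measurements of the same true series, with made-up variances.
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per = 100, 15
g = np.repeat(np.arange(n_groups), n_per)

# True series: large between-group signal (sd 2), small within (sd 0.3)
true = rng.normal(0, 2.0, n_groups)[g] + rng.normal(0, 0.3, n_groups * n_per)
T1 = true + rng.normal(0, 0.3, len(true))   # dataset 1, its own error
T2 = true + rng.normal(0, 0.3, len(true))   # dataset 2, independent error

def demean(v, g):
    """Remove group fixed effects by subtracting group means."""
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

corr_raw = np.corrcoef(T1, T2)[0, 1]
corr_fe  = np.corrcoef(demean(T1, g), demean(T2, g))[0, 1]
print(f"raw: {corr_raw:.2f}   after group FE: {corr_fe:.2f}")
```

In the raw data the shared signal dominates, so the two datasets agree almost perfectly; after demeaning, the surviving within-group signal is comparable to the measurement error, and the correlation drops sharply, just as in the district FE row of the table.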

Two things to note. First, the precipitation data seem to have a lot of measurement issues even before any fixed effects are taken out. Second, temperature seems ok, at least until state fixed effects are introduced (a correlation of 0.84 indicates some measurement error, but still more signal than noise). But when district effects are introduced, the correlation plummets by more than half, to 0.33.

The take-home here is that fixed effects may be valuable, even indispensable, for empirical research. But like turkey at Thanksgiving, or presents at Christmas, more of a good thing is not always better.

UPDATE: If you made it to the end of this post, you are probably nerdy enough to enjoy this related cartoon in this week's Economist.