Friday, December 21, 2012

The good and bad of fixed effects

If you ever want to scare an economist, the two words "omitted variable" will usually do the trick. I was not trained in an economics department, but I can imagine they drill it into you from the first day. It’s an interesting contrast to statistics, where I received much of my training and where the focus is much more on out-of-sample prediction skill. In economics, showing causality is often the name of the game, and it’s very important to make sure a relationship is not driven by a “latent” variable. Omitted variables can still be important for out-of-sample skill, but only if their relationships with the model variables change over space or time.

A common way to deal with omitted variable bias is to introduce dummy variables for space or time units. These “fixed effects” greatly reduce (but do not completely eliminate) the chance that a relationship is driven by an omitted variable. Fixed effects are very popular, and some economists seem to like to introduce them to the maximum extent possible. But as any economist can tell you (another lesson on day one?), there are no free lunches. In this case, the cost of reducing omitted variable problems is that you throw away a lot of the signal in the data.

Consider a bad analogy (bad analogies happen to be my specialty). Let’s say you wanted to know whether being taller caused you to get paid more. You could simply look at everyone’s height and income, and see if there was a significant correlation. But someone could plausibly argue that omitted variables related to height are actually causing the income variation. Maybe very young and old people tend to get paid less, and happen to be shorter. And women get paid less and tend to be shorter. And certain ethnicities might tend to be discriminated against, and also be shorter. And maybe living in a certain state that has good water makes you both taller and smarter, and being smarter is the real reason you earn more. And on and on and on we could go. A reasonable response would be to introduce dummy variables for all of these factors (gender, age, ethnicity, location). Then you’d be looking at whether people who are taller than average given their age, sex, ethnicity, and location get paid more than an average person of that age, sex, ethnicity, and location.

In other words, you end up comparing much smaller changes than if you were to look at the entire range of data. This helps calm the person grumbling about omitted variables (at least until they think of another one), and would probably be ok in the example, since all of these things can be measured very precisely. But think about what would happen if we only could measure age and income with 10% error. Taking out the fixed effects means removing a lot of the signal but not any of the noise, which means in statistical terms that the power of the analysis goes down.
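The signal-to-noise point can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the group structure, the standard deviations, and the names (`simulate`, `sd_between`, `sd_within`, `sd_noise`) are hypothetical, and there is deliberately no omitted variable, so the only difference between the two slopes comes from measurement error in the regressor.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's mean, which is what group dummies do."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def simulate(n_groups=200, n_per_group=20, beta=1.0,
             sd_between=3.0, sd_within=0.3, sd_noise=0.3, seed=0):
    """Return (pooled slope, within-group slope) for a noisy regressor."""
    rng = random.Random(seed)
    x_obs, y, groups = [], [], []
    for g in range(n_groups):
        group_level = rng.gauss(0, sd_between)       # variation the FE absorb
        for _ in range(n_per_group):
            x_true = group_level + rng.gauss(0, sd_within)
            y.append(beta * x_true + rng.gauss(0, 1.0))
            x_obs.append(x_true + rng.gauss(0, sd_noise))  # measured with error
            groups.append(g)
    pooled = ols_slope(x_obs, y)
    within = ols_slope(demean_by_group(x_obs, groups),
                       demean_by_group(y, groups))
    return pooled, within


pooled_slope, within_slope = simulate()
print(f"pooled slope: {pooled_slope:.2f}, within-group slope: {within_slope:.2f}")
```

With these assumed numbers, the textbook attenuation factor, signal variance over signal plus noise variance, is close to 1 for the pooled fit but only about 0.09 / (0.09 + 0.09) = 0.5 once the group means are removed, so the within-group slope should land well below the true value of 1.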

Now to a more relevant example. (Sorry, this is where things may get a little wonkish, as Krugman would say). I was recently looking at some data that colleagues at Stanford and I are analyzing on weather and nutritional outcomes for district level data in India. As in most developing countries, the weather data in India are far from perfect. And as in most regression studies, we are worried about omitted variables. So what is the right level of fixed effects to include? Inspired by a table in a recent paper by some eminent economists (including a couple who have been rumored to blog on G-FEED once in a while), I calculated the standard deviation of residuals from regressions on different levels of fixed effects. The 2nd and 3rd columns in the table below show the results for summer (June-September) average temperatures (T) and rainfall (P). Units are not important for the point, so I’ve left them out:

Fixed effects         sd(T)   sd(P)   Cor(T1,T2)   Cor(P1,P2)
No FE                 3.89    8.50    0.92         0.28
Year FE               3.89    4.66    0.93         0.45
Year + State FE       2.20    2.18    0.84         0.26
Year + District FE    0.30    1.63    0.33         0.22

The different rows here correspond to the raw data (no fixed effect), after removing year fixed effects (FE), year + state FE, and year + district FE. Note how including year FE reduces P variation but not T, which indicates that most of the T variation comes from spatial differences, whereas a lot of the P variation comes from year-to-year swings that are common to all areas. Both get further reduced when introducing state FE, but there’s still a good amount of variation left. But when going to district FE, the variation in T gets cut by roughly a factor of 7, from 2.20 to 0.30! That means the typical temperature deviation a regression model would be working with is less than a third of a degree Celsius.
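For readers who want to build this kind of table for their own data, here is a rough sketch of the calculation on a synthetic panel. Everything in it is an assumption for illustration: the panel is invented (5 states, 4 districts each, 10 years), the helper names `group_means` and `residual_sd` are hypothetical, and the two-way demeaning shortcut is only exact for a balanced panel where every district is observed in every year.

```python
import random
import statistics


def group_means(values, keys):
    """Mean of values within each key."""
    sums, counts = {}, {}
    for v, k in zip(values, keys):
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def residual_sd(values, years, units=None):
    """SD of residuals after year FE, plus optional unit FE (balanced panel)."""
    grand = statistics.fmean(values)
    year_mean = group_means(values, years)
    if units is None:
        resid = [v - year_mean[y] for v, y in zip(values, years)]
    else:
        unit_mean = group_means(values, units)
        resid = [v - year_mean[y] - unit_mean[u] + grand
                 for v, y, u in zip(values, years, units)]
    return statistics.pstdev(resid)


# Synthetic temperatures: mostly spatial variation, small year-to-year swings.
rng = random.Random(1)
temps, years, states, districts = [], [], [], []
year_shock = {y: rng.gauss(0, 0.1) for y in range(10)}
for s in range(5):
    state_level = rng.gauss(25, 3.0)
    for d in range(4):
        district_level = state_level + rng.gauss(0, 2.0)
        for y in range(10):
            temps.append(district_level + year_shock[y] + rng.gauss(0, 0.3))
            years.append(y)
            states.append(s)
            districts.append((s, d))

sd_raw = statistics.pstdev(temps)
sd_year = residual_sd(temps, years)
sd_state = residual_sd(temps, years, states)
sd_district = residual_sd(temps, years, districts)

for label, sd in [("No FE", sd_raw), ("Year FE", sd_year),
                  ("Year + State FE", sd_state),
                  ("Year + District FE", sd_district)]:
    print(f"{label:<20}{sd:6.2f}")
```

Because each set of dummies nests the previous one, the residual standard deviation can only shrink from row to row. How much it shrinks depends on how the variation in the data splits between years, states, and districts.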

None of this is too interesting, but the 4th and 5th columns are where things get more related to the point about signal to noise. There I’m computing the correlation between two different datasets of T or P (details of which ones are not important). When there is a low correlation between two datasets that are supposed to be measuring the same thing, that’s a good indication that measurement error is a problem. So I’m using this correlation here as an indication of where fixed effects may really cause a problem with signal to noise.

Two things to note. First is that precipitation data seems to have a lot of measurement issues even before taking any fixed effects. Second is that temperature seems ok up through the introduction of state fixed effects (a correlation of 0.84 indicates some measurement error, but still more signal than noise). But when district effects are introduced, the correlation plummets by more than half, to 0.33.
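The same diagnostic is easy to mimic with simulated data. The sketch below assumes two hypothetical temperature products, `t1` and `t2`, that measure the same underlying values with independent errors. The district count, the standard deviations, and the helper names are all invented for illustration and are not the datasets used in the table.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def demean_by_group(values, groups):
    """Subtract each group's mean (the effect of group dummies)."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


rng = random.Random(42)
t1, t2, district = [], [], []
for d in range(100):                      # 100 hypothetical districts
    climate = rng.gauss(25, 3.0)          # large differences across districts
    for _ in range(30):                   # 30 years per district
        truth = climate + rng.gauss(0, 0.3)     # small true year-to-year signal
        t1.append(truth + rng.gauss(0, 0.3))    # dataset 1, with its own error
        t2.append(truth + rng.gauss(0, 0.3))    # dataset 2, independent error
        district.append(d)

cor_raw = pearson(t1, t2)
cor_within = pearson(demean_by_group(t1, district),
                     demean_by_group(t2, district))
print(f"raw correlation: {cor_raw:.2f}, after district FE: {cor_within:.2f}")
```

In this setup the two products agree almost perfectly in the raw data, because the big cross-district differences swamp the errors. Once district means are removed, the remaining signal and the measurement error have about the same variance, so the correlation should fall to somewhere near 0.5.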

The take-home here is that fixed effects may be valuable, even indispensable, for empirical research. But like turkey at Thanksgiving, or presents at Christmas, more of a good thing is not always better.


UPDATE: If you made it to the end of this post, you are probably nerdy enough to enjoy this related cartoon in this week's Economist.



Wednesday, December 12, 2012

What poop tells us about the social impacts of climate change

A growing literature in paleoclimate and archeology explores the extent to which past fluctuations in climate have shaped the evolution of human societies.  These papers get to tackle pretty sexy topics:  did climate help cause the collapse of the Maya?  Is climate implicated in dynastic transitions in China? How about in the fall of Angkor Wat?

That a lot of these papers are answering "yes" to the question of whether climate is implicated in large historical social upheavals could tell us something important about the impact of future climatic changes on social outcomes. But there are a few things you might worry about in this literature. One is whether the studies are actually measuring what they say they're measuring -- i.e., that they're picking up meaningful variation in human activity, and that changes in societies and in climate happened when they say they did. Most of the published papers on these topics spend most of their time convincing you that this is the case, and given my mere hobbyist's understanding of paleolimnology I have to take them at their word.

The second concern is one that is more familiar to folks that are used to running regressions:  can we say with certainty that the variations in climate are causally linked to the socioeconomic variation of interest?  The hard part with these papers is that they're often dealing with one-off events -- e.g. the collapse of the Maya -- that don't give you the repeated observations you need to carry out the typical statistical tests.  Basically, you'd be worried that even though the collapse event you measured was coincident with a large climate shock, by chance something unobserved might also have happened at the same time that in fact caused the collapse.  Given this, you might be worried that these studies are looking under the proverbial lamppost for the proverbial keys: we'd like to observe the universe of all climate events and all collapse events over time, but instead we focus on a few iconic ones.  

A new paper in PNAS helps overcome some of these concerns. D'Anjou and coauthors use coprostanol concentrations (Wikipedia: chemical compounds found in fecal matter of higher order mammals - i.e. poop) that they dug up in a Norwegian lake to estimate the variation in local human activity in the nearby area over the last 2000 or so years. They then compare this to existing reconstructions of local summertime temperature, which is the time of year when agriculture would have been possible. The nice thing about their paper is that they have a lot of observations of the same place over time, and so can run some of the basic statistical tests you often want to see in these papers (and can't). The other nice thing is that poop appears to be a much better indicator of human activity than many of the proxies used in the past, which could have been directly affected by climate (e.g. charcoal from fires, which could have been manmade or could have risen naturally as temperatures changed).

Here is the money plot comparing poop and temperature (their Figure 5):



While there are a couple things you can still complain about -- e.g. you probably want to see Panel C as a plot of the time-detrended data -- this to me is one of the more convincing relationships that has shown up in these Paleo papers.  As in other studies looking at cold regions, they show that human activity responded strongly and positively to warmer temperatures: drops of ~4C caused total abandonment of (poop-related) human activity in the region.

While both the broader welfare effects and the modern implications of this and related studies are not immediately obvious (did people die or just migrate south? what do Iron Age societies' sensitivities to climate imply for modern societies?), the methodological differences between this and most of the past studies are to me a nice contribution. And hopefully the grad students who had to dig up the poop got a PNAS paper out of the deal...

Tuesday, December 11, 2012

The summer of 2013


Last week at AGU I gave a talk about the lessons of the US corn harvest in 2011 and 2012, both of which were below trend line (see figure). That got me thinking a little more about what to look for in 2013. The obvious point is that it is likely to be better than 2012, because it can’t get much worse. But that’s not too insightful; it’s like saying that Cal’s football team will be better next year, since they were so bad this year. (By the way, welcome to Max Aufhammer, our newest blogger! With Wolfram’s move to Berkeley that brings our Cal contingent up to 3. I sure hope I don’t say anything to offend them.)


As we’ve talked about in other posts, the summer of 2012 might be considered the normal in a few decades, but not now. And some recent work from Justin Sheffield and colleagues in Nature argues that drought trends globally, and in North America, are not significantly positive if calculated properly (which contradicts some earlier work). We can leave aside for now the question of whether soil moisture trends are the best measure of drought exposure if one cares about corn yields (though a good topic for a future post), and simply say that conditions in 2012 were well below trend.

This means we’d expect next year to be closer to the trend, and that seems to be the overriding sentiment of markets. As Darrell Good over at farmdoc daily explains “In the past five decades, extreme drought conditions in the U.S., like those experienced in 2012, have been followed by generally favorable growing conditions and yields near trend values.”

But two things work against this tendency to revert to the mean. First, the drought still persists throughout much of the country, as seen at UNL’s drought monitor site.  As Good goes on to say, “current dry soil moisture conditions in much of the U.S. and some recent forecasts that drought conditions could persist well into next year have raised concerns that such a rebound in yields may not occur in 2013.” In other words, if the Corn Belt does not get a wet winter and/or spring, expect prices to start climbing again.

Second, though, is that good initial moisture does not eliminate the chance of drought during the season. There’s an interesting piece by folks at the National Climatic Data Center (NCDC) in the AGU newsletter I got today (it was actually published Nov. 20, but it takes about 3 weeks for me to get it!). They note that the 2012 drought was not like previous droughts in the 1930s and 1950s, or even 1988, in that it was very much driven by high temperatures rather than low starting moisture. As they say:
“For example, at the end of February in both 2011 and 2012 the national PDSI (calculated using the observed monthly mean temperature and precipitation averaged across the contiguous United States) was 1.2 (mildly wet) and –2.5 (moderate drought), respectively, compared to 1934 and 1954 of –5.7 and –4.6, respectively.”
This is also shown pretty effectively in an animation by Climate Central. So the high temperatures in recent years have made drought come on much more quickly than usual. As the NCDC piece says, “By the end of September, every month since June 2011 had above normal average temperatures, a record that is unprecedented.” That's 16 straight months of above normal temperatures!

So my seat-of-the-pants guess is that next year’s yields are likely to still be below trend line (which would be at around 160 bushels/acre). Obviously lots of things could push it above trend line (including changing the definition of the trend!), and it’s way too early to have much confidence about how 2013 will end up. But following Sol’s lead on the Sandy damage prediction, I’ll go out on a limb (his mean was too low by a factor of two, but the true damage was within the confidence interval!). And to pair a risky bet with a safe one, I’ll also predict Stanford wins big at the Rose Bowl.

Sunday, December 9, 2012

Climate, food prices, social conflict and....Google Hangout?

My coauthor Kyle Meng was asked to participate in this HuffPost Live discussion about climate, food prices and civil conflict. It's an interesting discussion, which gets pretty rowdy at times, with an eclectic group. I am also very impressed by HP's leveraging of Google Hangout to produce a low-cost public, intellectual forum.

David has written about the food-price and conflict linkage before, and we've discussed the association between climate and conflict a few times here.  In general, I don't think a linkage has been demonstrated conclusively with data, but that doesn't seem to get in the way of people referencing it.

The debate is interesting and entertaining, highlighting a few of the differences in how some policy-folk, economists and ecologists view these various ideas.


Kyle was asked to participate because he was an author of our 2011 Nature paper on ENSO and conflict.  He also happens to be on the job market right now.

Thursday, December 6, 2012

Climate data and projections at your fingertips

Do you ever get jealous of Wolfram's pretty graphs on this blog or just want to know what March rainfall will look like in New Zealand at midcentury -- but you just don't have the time or energy to sort through all the various climate data sets or learn how to use GIS software?

Lucky for you, the Nature Conservancy has teamed up with scientists at the University of Washington and the University of Southern Mississippi to develop Climate Wizard, a graphical user interface available through your browser window that lets you surf real climate model projections and historical data for both the USA and the world. According to the website:
With ClimateWizard you can:
  • view historic temperature and rainfall maps for anywhere in the world
  • view state-of-the-art future predictions of temperature and rainfall around the world
  • view and download climate change maps in a few easy steps 
ClimateWizard enables technical and non-technical audiences alike to access leading climate change information and visualize the impacts anywhere on Earth.  The first generation of this web-based program allows the user to choose a state or country and both assess how climate has changed over time and to project what future changes are predicted to occur in a given area. ClimateWizard represents the first time ever the full range of climate history and impacts for a landscape have been brought together in a user-friendly format. 
The data sets behind the pictures are well documented on the "about us" page, and the data in each map is easily exportable.

If this had come out four years ago, I probably could have shaved six months off of my phd...

h/t Bob Kopp