Monday, October 29, 2012

Probabilistic forecast of direct damage from Hurricane Sandy

These models are pretty preliminary, but Marshall and David convinced me to post this. I've been working with landfall statistics for only a couple of weeks, but I had enough data to put together a simple probabilistic forecast this morning for Sandy's direct damage (the number that will eventually appear on Wikipedia) based on landfall parameters (as they were forecast at around noon).  The distribution of outcomes is pretty wide, but the most likely outcome and expected loss are both around $20B.  Below are the cumulative distribution function (left) and probability density function (right).


It will probably take several weeks for official estimates to converge. If I'm anywhere near right, I'll be sure to remind you.  Rather than explaining and caveating, I'm posting now since the power-outage frontier is two blocks away (it's dark south of 24th Street).

Tuesday, October 23, 2012

Bad control

You want to know how X affects Y.  You're worried that some other factor Z might be correlated with both X and Y - i.e. that Z is a potential "confounder" or "omitted variable" - and so you are hesitant to explore the effect of X on Y without accounting for Z.  Imagine that you are also lucky enough to have some data on Z.  So when calculating the effect of X on Y, you "control" for Z - i.e. calculate the effect of X on Y holding Z constant.  

Often this approach makes a lot of sense, and it is intuitively appealing to throw a lot of control variables into your analysis to see if the effect of your main variable of interest (X) is "robust".  People do this routinely, and paper referees almost always ask for it in some form.

But there is a particular case where throwing in a bunch of "control" variables might actually be a really bad idea:  when these variables are themselves outcomes of the X variable of interest.   That is, if X affects Y, and X also affects Z, then "controlling" for Z when you estimate the effect of X on Y is probably a mistake.  This type of mistake is generically termed "bad control", and it can lead to dramatic misinterpretations of coefficient estimates.  Unfortunately it's a mistake that gets made a lot. 

Sol, Ted Miguel, and I have been working on a review of the rapidly growing literature on climate and conflict, and it is impressive how often bad controls show up.  Consider the following stylized example:

You want to understand the effect of temperature on conflict.  You figure that temperature is not the only thing that affects conflict, and you're worried that temperature is also correlated with a lot of other stuff that might affect conflict - for instance, per capita GDP levels. So you regress conflict on temperature and GDP, and find that the effect of temperature is insignificant and the effect of GDP is large and significant.   What do you conclude?

A standard conclusion would be that the effect of temperature is "not robust", but in this case that conclusion is likely wrong.  The reason is that temperature also affects economic productivity (see here and here), and so GDP is really an outcome variable.  This means it doesn't make sense to "hold economic productivity constant" when exploring the relationship between temperature and conflict -- part (or potentially all) of temperature's effect on conflict is through income.  At the extreme, if temperature affects conflict only through income, then controlling for income in a regression of conflict on temperature would lead you to draw exactly the wrong conclusion about the relationship between temperature and conflict: that there is no effect.  (For those scoring at home with access to Stata who need to convince themselves, run the couple of lines of code below.)

The difficulty in this setting is that a growing body of research shows that climatic factors (and particularly temperature) also affect many of the other socioeconomic factors that often get thrown in as control variables - things like crop production, infant mortality, population (via migration or mortality), and even political regime type.  To the extent that these show up as controls, studies might be drawing mistaken conclusions about the relationship between climate and conflict.

Studies can do two things to make sure their inferences are not being biased by bad controls.  First, show us the reduced form relationship between X and Y without any controls.  When X is "as good as randomly assigned" - as it typically is when X is a climate variable and the study is using variation in climate over time - then the reduced form relationship between X and Y tells us most of what we want to know.  Second, if you just have to use control variables - or referees make you, as in our 2009 PNAS paper on conflict in Africa - then be clear about the relationship between X and the controls you want to include.  Convince the reader that these controls are not themselves outcome variables and that controlling for them is not going to make your inference problem worse rather than better.

Finally, it's worth noting that not all is bad with bad controls:  including them can sometimes be useful for illuminating the mechanism through which X affects Y.  If X affects both Y and Z, but you're interested in whether X has an effect on Y through some channel other than Z, then "controlling" for Z in a regression of Y on X provides some insight into whether this is true.  (Maccini and Yang have a nice example of this in their paper on rainfall and later life outcomes.) Continuing the example above, regressing conflict on temperature and income and finding that temperature still has a significant effect on conflict suggests that temperature's effects on conflict are not only through income.  But, to reiterate, finding no effect of temperature in this regression does not tell you much at all, unless you can be sure that temperature does not also affect income.

[For a little more on bad controls, see Angrist and Pischke's nice discussion in Mostly Harmless Econometrics, pp. 64-68.]


* Code to demonstrate "bad control".
* Data generating process:
* income = -temperature + noise
* conflict = -income + noise

clear
set seed 123
set obs 1000

gen temp = rnormal()  // temperature variable
gen e_1 = rnormal()  // noise variable 1
gen e_2 = rnormal()  // noise variable 2, uncorrelated with e_1

gen income = -temp + e_1  // temp and income are negatively related - e.g. Dell et al 2012
gen conflict = -income + e_2  // income and conflict are negatively related - e.g. Miguel et al 2004

reg conflict temp  // reduced form: temp has a strong positive effect on conflict
reg conflict income
reg conflict income temp  // coefficient on income is highly significant; coeff on temp is not, and its point estimate is close to zero
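For readers without a Stata license, here is a minimal Python/numpy version of the same data-generating process. The `ols` helper below is just a least-squares fit with an intercept, written for this sketch rather than taken from any package; the signs match the Stata code above.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 100_000  # large n so the coefficients are estimated precisely

temp = rng.standard_normal(n)
e_1 = rng.standard_normal(n)
e_2 = rng.standard_normal(n)

income = -temp + e_1      # temp and income negatively related
conflict = -income + e_2  # income and conflict negatively related

def ols(y, *xs):
    """OLS with an intercept; returns the slope coefficients only."""
    X = np.column_stack([np.ones_like(y), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

b_reduced = ols(conflict, temp)      # reduced form: coef on temp ~ +1
b_bad = ols(conflict, income, temp)  # "bad control": income ~ -1, temp ~ 0

print("reduced form temp coef:", b_reduced[0])
print("with bad control, income coef:", b_bad[0], "temp coef:", b_bad[1])
```

The reduced form correctly shows a strong positive effect of temperature on conflict, while the regression that "controls" for income drives the temperature coefficient to zero - exactly the wrong conclusion, since all of temperature's effect here runs through income.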

Saturday, October 13, 2012

Global hunger: down but not out

A recent revision of the FAO's calculations on how many hungry people there are in the world has garnered some attention, not least because the FAO seems to have backed off their earlier headline- and funding-generating claim that high food prices and the global economic downturn had resulted in there being over 1 billion people hungry in the world.  The roundness and bigness of that number was certainly shocking and galvanizing, but what was perhaps more worrying at the time was the implication that earlier gains in reducing the number of hungry were being rapidly reversed - that hunger was "spiking" and that there was a serious crisis underway.

FAO's revised numbers, out in their annual State of Food Insecurity, tell a somewhat different story.  See the plot below, which is pieced together from the last three SOFI reports. The total number of hungry is now about 850 million - below a billion but still a debacle by any normal standard - but the updated numbers (shown in blue) now completely wipe out the highly-publicized food crisis spike of 2008-2010. Instead, it looks like there were more hungry people in the world in the 1990s, but that this has been more or less steadily improving ever since - with some leveling off in the last half-decade. The take home from these numbers:  we had a worse starting point, but much more progress since then and no big spike.

So what happened? Why the progress, and where'd the spike go? Calculating the number of hungry in the world is not an easy task.  The way the FAO does it is to combine population estimates for a given country (which we know pretty well) with estimates of dietary requirements for people in that country (based on anthropometrics, which we know decently well), and with estimates of calorie availability.  This last part is where things get tough.  What the FAO does is to (try to) use household survey data to get an estimate of the distribution of consumption within a country, and then, because these data are not available every year, use broader indicators of food availability (e.g. data on country-level production and trade) to shift this distribution around.  In this technical note, they suggest that the revision had a little to do with better estimates of calorie distribution across households (which reduced estimates of the number of hungry in 2008-2010 by about 60 million), and a lot to do with better accounting for food losses and wastage (which increased the number of hungry in each period by about 125 million).
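To see how the headcount step at the core of this kind of calculation works, here is a sketch. Everything in it is illustrative, not FAO's actual procedure: I simply assume consumption is lognormally distributed with a given mean and coefficient of variation, and the 2500 kcal mean, 0.25 CV, 1800 kcal cutoff, and 50 million population are made-up numbers. It does show why the loss-and-wastage adjustment matters: lowering the mean availability mechanically raises the estimated share of hungry.

```python
import math

def undernourished_share(mean_kcal, cv, cutoff_kcal):
    """Share of people below the calorie cutoff, assuming per-capita
    consumption is lognormal with the given mean and coefficient of
    variation (a stand-in for the fitted survey distribution)."""
    sigma2 = math.log(1.0 + cv**2)              # lognormal variance param
    mu = math.log(mean_kcal) - sigma2 / 2.0     # matches the target mean
    z = (math.log(cutoff_kcal) - mu) / math.sqrt(sigma2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF at z

# Illustrative country: mean availability 2500 kcal/day, CV 0.25,
# requirement 1800 kcal/day, population 50 million.
share = undernourished_share(2500, 0.25, 1800)
hungry = share * 50e6  # share times population = number of hungry
```

Shaving calorie availability for losses (a lower `mean_kcal`) raises `share`, while a tighter distribution (a lower `cv`) lowers it, which is qualitatively the direction of FAO's two revisions.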

This to me explains why the levels went up, but does not really explain where the spike went.  In the 2012 SOFI, the authors explain:  

"The methodology estimates chronic undernourishment based on habitual consumption of dietary energy and does not fully capture the effects of price spikes, which are typically short-term. As a result, the prevalence of undernourishment (PoU) indicator should not be used to draw definitive conclusions about the effects of price spikes or other short-term shocks. Second, and most importantly, the transmission of economic shocks to many developing countries was less pronounced than initially thought."

This seems a little weird, since basically the same methodology was used to show a huge hunger spike on account of the 2008 price rise. 

In any case, it is likely that there was (and is) still a hunger spike.  What of course you can't show on that plot is the counterfactual - what hunger numbers would have looked like had there been no economic downturn and food price increase.  There is plenty of evidence from other sources, including good micro work by folks at the World Bank, that price spikes in 2008 and again since mid-2010 have pushed 50-100 million people below the $1.25 poverty line.  Hunger is of course different from poverty, but they are closely related - and this makes the FAO revision again confusing, since it suggests that things were getting worse for a lot of people.

Good household survey data are a critical component of any adding up of the number of hungry, and if you had these surveys every year and in a bunch of countries, you would know a whole lot more about how much people are eating and how much they are hurt by higher food prices.  And there are many other (potentially much more clever) ways to use household expenditure data to get at hunger without adding up every single calorie consumed by the household.

The FAO seems to realize this.  In the technical note on the updated numbers, they write:

"If nationally representative surveys collecting reliable data on habitual food consumption were conducted every year and could be processed in a timely and consistent manner throughout the world, then a simple head-count method, based on the classification of individuals, could be used. Until then, a model based estimation procedure, such as FAO’s, is still needed."

What I don't understand is why the FAO is not already doing these surveys.  Calculating the number of hungry people in the world (and its different regions) would seem like one of the most important tasks - if not the most important one - on the FAO's annual to-do list, and something that might be worth throwing some money at.

FAO's annual budget is $1 billion USD (which as noted by this website equals the "cost of six days of cat and dog food in nine industrialized countries").  Let's say you wanted to do annual household surveys in 100 poor countries.  A good rule of thumb for doing surveys in poor places is that it costs about $25 to survey one person, inclusive of all costs.  So for $100k you could survey 4,000 people, which is a decent-sized national survey.  Doing these surveys annually in 100 poor countries would therefore cost $10 million, or 1% of the FAO budget.  (Initial survey costs might be higher, but once you've paid the fixed costs of getting together a survey team, costs over time would go down.)  And with modern electronic data collection methods, you could collect, aggregate, and analyze these data pretty quickly - which is not to say that doing surveys is easy, but that this seems like a fixable problem. Furthermore, the much, much richer World Bank is already doing a bunch of these surveys - the LSMS - so presumably would be willing to go halvsies.
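The back-of-the-envelope above is easy to check in a few lines (all inputs are the post's own numbers):

```python
cost_per_person = 25        # USD per respondent, rule of thumb
people_per_survey = 4_000   # a decent-sized national survey
countries = 100             # poor countries to cover annually
fao_budget = 1_000_000_000  # FAO annual budget, USD

survey_cost = cost_per_person * people_per_survey  # per-country cost
total = survey_cost * countries                    # all 100 countries
share_of_budget = total / fao_budget               # fraction of budget

print(survey_cost, total, share_of_budget)  # 100000 10000000 0.01
```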

Until they do so - and given the large differences in what the poverty numbers and the hunger numbers seem to say about the food crisis - it's not obvious that we're better off trusting the new estimates of the global number of hungry a whole lot more than the old ones.  Either way, there are a whole lot of hungry people in the world, and high food prices do not appear to be doing them any favors.

Friday, October 12, 2012

A random thought on adaptation

There have been some interesting stories lately about how the new drought-tolerant seeds are performing. Anecdotes of farmers marveling over how their crops fare are not exactly evidence, since there may be many more farmers who were disappointed. But it does at least suggest that new seeds are better adapted. Similarly, one often hears stories about how agriculture is expanding into new areas, which is another way that agriculture could adapt to global warming.

I have little doubt that new varieties and migration of crop areas will help with climate change, but the key question is how much. Will it be a 1% or 50% type of effect? A lot of our research is about trying to find and analyze datasets that can answer this question. Typically any one dataset can only tell us so much, so it’s really about trying to piece together a picture from multiple different analyses. Not all of these analyses have to be very sophisticated. For example, a simple plot of average country yields vs. average growing season temperature is shown below (I made this based on the methods and data in this paper from last year. The figure is part of a review that is coming out in Plant Physiology later this year.)

The green blobs each represent a country, with the size proportional to total production. The vertical gray line shows an independent estimate of the optimal season temperature for yields, based on a recent review by Hatfield et al. that was based on experimental studies. It’s not exactly a pretty figure – lots of factors differ between countries other than temperature, which results in a lot of scatter. But I think it illustrates three important points that are sometimes missed:

1. The highest yielding countries are fairly strongly clustered around the gray line, except for barley where they are significantly cooler. (This is a surprisingly good match given that the gray lines were completely independent of this dataset.) Although many countries grow each crop above its optimum, the maximum yields are clearly lower at high temperatures. This casts some doubt on the notion that yields can be maintained as temperatures rise, since the warmer countries should already have incentive to better adapt to their conditions.

2. Large producers span a pretty wide range of temperatures. This is sometimes cited as evidence that agriculture is well adapted to a wide range of climates, but I think it’s more accurate to say that farming is profitable across a wide range of climates. For example, people sometimes ask: if we can grow corn in Alabama, how can global warming be a concern? The answer is that we do grow corn in Alabama, but not nearly as well as if it had the climate of Illinois. The lack of a tight relationship between yields and crop areas indicates that agriculture is not greatly optimized to current climate. To me, this casts doubt on any argument that migration of agriculture will be a major source of adaptation. There are clearly a lot of factors other than climate that enter into a decision about which crop to grow.

3. For maize and wheat, a lot of the bigger producers tend to fall to the right of the optimum. This helps to illustrate why the global production of these crops is typically predicted to be hurt by warming, even if some countries gain.

None of these points is proven by the figure. It’s pretty rare that a simple cross-section like this can prove anything. But sometimes simple plots can really challenge a strong prior belief. If it were common to match crops to their temperature optimum – either by relocating where the crops grow or by changing the crop’s optimum temperature – then I would expect to see a much flatter cross-sectional relationship between yields and temperature, or a much tighter concentration of area around the optimum.
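To make the cross-sectional logic concrete, here is a sketch of how one can recover an implied temperature optimum from country-level data by fitting a quadratic. The data below are synthetic stand-ins, not the real yield dataset: I invent 200 "countries" whose yields peak at a made-up optimum of 22 C and check that the fit recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "countries": growing-season temperatures and yields that
# peak at an assumed optimum of 22 C (illustrative numbers only).
true_opt = 22.0
temps = rng.uniform(12.0, 32.0, size=200)
yields = 8.0 - 0.02 * (temps - true_opt) ** 2 + rng.normal(0.0, 0.3, size=200)

# Fit a quadratic cross-section; the vertex is the implied optimum.
c2, c1, c0 = np.polyfit(temps, yields, deg=2)  # highest degree first
est_opt = -c1 / (2.0 * c2)

print("estimated optimum:", est_opt)
```

In the real figure the analogue of `est_opt` comes from where the upper envelope of country yields peaks, and the interesting test is whether that peak lines up with the independent, experiment-based optima of Hatfield et al. - as it roughly does for most crops shown.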