Tuesday, October 23, 2012

Bad control

You want to know how X affects Y.  You're worried that some other factor Z might be correlated with both X and Y - i.e. that Z is a potential "confounder" or "omitted variable" - and so you are hesitant to explore the effect of X on Y without accounting for Z.  Imagine that you are also lucky enough to have some data on Z.  So when calculating the effect of X on Y, you "control" for Z - i.e. calculate the effect of X on Y holding Z constant.  

Often this approach makes a lot of sense, and it is intuitively appealing to throw a lot of control variables into your analysis to see if the effect of your main variable of interest (X) is "robust".  People do this routinely, and paper referees almost always ask for it in some form.

But there is a particular case where throwing in a bunch of "control" variables might actually be a really bad idea:  when these variables are themselves outcomes of the X variable of interest.   That is, if X affects Y, and X also affects Z, then "controlling" for Z when you estimate the effect of X on Y is probably a mistake.  This type of mistake is generically termed "bad control", and it can lead to dramatic misinterpretations of coefficient estimates.  Unfortunately it's a mistake that gets made a lot. 

Sol, Ted Miguel, and I have been working on a review of the rapidly growing literature on climate and conflict, and it is striking how often bad controls are included.  Consider the following stylized example:

You want to understand the effect of temperature on conflict.  You figure that temperature is not the only thing that affects conflict, and you're worried that temperature is also correlated with a lot of other stuff that might affect conflict - for instance, per capita GDP levels. So you regress conflict on temperature and GDP, and find that the effect of temperature is insignificant and the effect of GDP is large and significant.   What do you conclude?

A standard conclusion would be that the effect of temperature is "not robust", but in this case that conclusion is likely wrong.  The reason is that temperature also affects economic productivity (see here and here), and so GDP is really an outcome variable.  This means it doesn't make sense to "hold economic productivity constant" when exploring the relationship between temperature and conflict, because part (or potentially all) of temperature's effect on conflict runs through income.  At the extreme, if temperature affects conflict only through income, then controlling for income in a regression of conflict on temperature would lead you to draw exactly the wrong conclusion: that there is no effect of temperature on conflict.  (For those scoring at home with access to Stata who need to convince themselves, run the few lines of code below.)

The difficulty in this setting is that a growing body of research shows that climatic factors (and particularly temperature) also affect many of the other socioeconomic factors that often get thrown in as control variables - things like crop production, infant mortality, population (via migration or mortality), and even political regime type.  To the extent that these show up as controls, studies might be drawing mistaken conclusions about the relationship between climate and conflict.

Studies can do two things to make sure their inferences are not being biased by bad controls.  First, show us the reduced form relationship between X and Y without any controls.  When X is "as good as randomly assigned" - as it typically is when X is a climate variable and the study is using variation in climate over time - then the reduced form relationship between X and Y tells us most of what we want to know.  Second, if you just have to use control variables - or referees make you, as in our 2009 PNAS paper on conflict in Africa - then be clear about the relationship between X and the controls you want to include.  Convince the reader that these controls are not themselves outcome variables and that controlling for them is not going to make your inference problem worse rather than better.

Finally, it's worth noting that not all is bad with bad controls:  including them can sometimes be useful to illuminate the mechanism through which X affects Y.  If X affects both Y and Z, but you're interested in whether X has an effect on Y through some variable other than Z, then "controlling" for Z in a regression of Y on X provides some insight into whether this is true.  (Maccini and Yang have a nice example of this in their paper on rainfall and later life outcomes.) Continuing the example above, regressing conflict on temperature and income and finding that temperature still has a significant effect on conflict suggests that temperature's effects on conflict are not only through income.  But, to reiterate, finding no effect of temperature in this regression does not tell you much at all, unless you can be sure that temperature does not also affect income.

[For a little more on bad controls, see Angrist and Pischke's nice discussion in Mostly Harmless Econometrics, pp. 64-68].


* Code to demonstrate "bad control".
* Data generating process:
* income = -temperature + noise
* conflict = -income + noise

clear  //start from an empty dataset so set obs and gen do not fail
set seed 123
set obs 1000

gen temp = rnormal()  //temperature variable
gen e_1 = rnormal()  //noise variable 1
gen e_2 = rnormal()  //noise variable 2, uncorrelated with e_1

gen income =  -temp + e_1  //temp and income are negatively related - e.g. Dell et al 2012
gen conflict = -income + e_2  //income and conflict are negatively related - e.g. Miguel et al 2004

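reg conflict temp  //reduced form: coefficient on temp is close to +1 (conflict = temp - e_1 + e_2)
reg conflict income  //coefficient on income is close to -1
reg conflict income temp  //bad control: coefficient on income is highly significant, coeff on temp is not and its point estimate is close to zero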
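
* Extension (a sketch, not part of the original code): the "mechanism" case
* discussed above, where temperature also affects conflict directly and not
* only through income. The direct effect of 0.5 and the variable name
* conflict2 are illustrative assumptions. Run after the lines above.
* conflict2 = -income + 0.5*temperature + noise

gen conflict2 = -income + 0.5*temp + e_2  //adds a direct temp effect of 0.5

reg conflict2 temp  //total effect: coefficient on temp is close to 1.5 (1 via income plus 0.5 direct)
reg conflict2 income temp  //coeff on temp is close to 0.5 and significant, so not all of the effect runs through income

* Caveat: the second regression recovers the direct effect here only because
* e_1 and e_2 are independent. If some unobserved factor moved both income and
* conflict, the coefficient on temp in that regression would be biased.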

1 comment:

  1. I am not sure I understand the following sentence in the last paragraph: "Continuing the example above, regressing conflict on temperature and income and finding that temperature still has a significant effect on conflict suggests that temperature's effects on conflict are not only through income."

    On page 66 in "Mostly Harmless Econometrics", Angrist and Pischke write: "It is also incorrect to say that the conditional comparison captures the part of the effect of college [here: temperature] that is “not explained by occupation” [here: income]. In fact, the conditional comparison does not tell much that is useful without a more elaborate model of the links between college [temperature], occupation [income], and earnings [conflict]."

    I would appreciate any help here as I am not sure whether controlling for income in the above example is now advisable or not (given that my interest is in mechanisms other than income).

    Thank you!