Monday, October 19, 2020

The GDP-temperature relationship - some thoughts on Newell, Prest, and Sexton 2018

Quantifying the potential effects of climate on aggregate economic outcomes like GDP is a key part of understanding the broader economic impacts of a changing climate.  Lots of studies now use modern panel econometrics to study how past fluctuations in temperature have affected GDP around the world, including the landmark 2012 paper by Dell, Jones, and Olken (henceforth DJO), our 2015 paper (henceforth BHM), and a number of more recent papers that either analyze new data on this topic (e.g. subnational GDP or income data, see here or here or here) or revisit earlier results. 

Newell, Prest, and Sexton 2018 (draft here, henceforth NPS) is one recent paper that revisits our earlier work in BHM.  Because we've gotten a lot of questions about this paper, Sol and I wanted to share our views on what we think we can learn from NPS.  In short, our view is that their approach is not suited to the question they are trying to ask, and that the conclusions stated in their abstract (at least in the current draft) are directly contradicted by their own results.  This matters because their conclusions appear to shed light on the aggregate economic impacts of unmitigated climate change; unfortunately, we do not believe that this is the case.

NPS seek to take a data-driven approach to resolving a number of empirical issues that come up in our earlier work.  These include: (1) Does temperature affect the level or growth rate of GDP? (2) What is the "right" set of fixed effects or time trends to include as controls in analyzing the effect of temperature on GDP? (3) What is the "right" functional form to describe the relationship between temperature and GDP changes?  These choices -- particularly (1) and (3) -- are certainly influential in the findings of earlier studies.  If temperature affects growth rates rather than levels, this can imply huge effects of future temperature increases on economic output, as small impacts on growth compound over time.  If temperature has a globally nonlinear effect on output, as we argue in BHM, this suggests that both wealthy and poor countries can be affected by changes in temperature -- and not only poor countries, as suggested in earlier analyses.  Resolving these questions is key to understanding the impacts of a warming climate, and it's great that papers like NPS are taking them on.

However, we have some serious qualms with the approach that NPS take to answer these questions.  NPS wish to use cross-validation to select which set of the above choices performs "best" in describing historical data.  Cross-validation is a technique in which available data are split between disjoint training and testing datasets, and candidate models are trained on the training data and then evaluated on the held-out test data using a statistic of interest (typically RMSE or r-squared in these sorts of applications).  The model with the lowest RMSE or highest r-squared on test data is then chosen as the preferred model. 
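For concreteness, here's a minimal sketch of that generic procedure in Python -- the toy data and variable names are ours, purely for illustration, and are not from NPS:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: one predictor and a noisy outcome.
X = rng.normal(size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(size=500)

# Split into disjoint training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the candidate model on the training data only...
model = LinearRegression().fit(X_train, y_train)

# ...then evaluate it on the held-out test data.
rmse = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))
print(f"test RMSE: {rmse:.3f}")
```

Repeat for each candidate model, and pick the one with the best held-out score.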

Inference, not prediction.  This technique works great when your goal is prediction.  But what if your goal is causal inference -- i.e., in our case, isolating variation in temperature from other correlated factors that might also affect GDP?  It's not at all clear that models that perform best on a prediction task will also yield the right causal results.  For instance, prices for hotel rooms tend to be high when occupancy rates are high, but only a foolish hotel owner would raise prices to increase occupancy (h/t Susan Athey, from whom I stole this example).  A good predictive model can get the causal story wrong.
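To see how this can happen, here's a stylized simulation of the hotel example -- all coefficients are made up, but the structure (unobserved demand driving both prices and occupancy) is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Unobserved demand pushes up both prices and occupancy.
demand = rng.normal(size=n)
price = 100 + 20 * demand + rng.normal(scale=5, size=n)
# The assumed *causal* effect of price on occupancy is negative (-0.5).
occupancy = 50 + 30 * demand - 0.5 * price + rng.normal(scale=5, size=n)

# A naive predictive regression of occupancy on price:
slope = np.polyfit(price, occupancy, 1)[0]
print(f"predictive slope: {slope:.2f}")  # comes out strongly positive

# The regression predicts occupancy well, but its slope has the
# opposite sign of the causal effect we built into the simulation.
```

The predictive model is fine as a predictive model; it just tells you nothing about what happens to occupancy if the owner changes the price.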

This is clearly relevant in the panel studies of temperature on GDP.  In existing work, great care is taken to choose controls that account for a broad range of time-invariant and time-varying factors that might be correlated with both temperature and GDP.  These typically take the form of unit or time fixed effects and/or time trends.  Again, the goal in including these is not to better predict GDP but to isolate variation in temperature that is uncorrelated with other factors that could affect GDP, in order to identify the causal effect of temperature on GDP.  The chosen set of controls constitutes the paper's "identification strategy", and in this fixed-effects setup there is unfortunately no clear data-driven approach -- including cross-validation -- for selecting these controls.  The test for readers of these papers is not: do the authors predict GDP really well?  It is instead: do the authors plausibly isolate the role of temperature from other confounding factors?
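Schematically -- this is our shorthand, suppressing precipitation and other covariates that vary across papers -- the kind of specification at issue looks like:

```latex
% GDP growth in country i and year t as a function of temperature,
% with controls meant to absorb confounding variation:
\[
  g_{it} = f(T_{it}) + \mu_i + \theta_t + \gamma_i t + \varepsilon_{it}
\]
% \mu_i:      country fixed effects (time-invariant confounders)
% \theta_t:   year fixed effects (common global shocks)
% \gamma_i t: country-specific time trends (slow-moving confounders)
```

The \mu_i, \theta_t, and \gamma_i t terms are there for identification, not for predictive fit.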

Growth vs level effects.  The main question NPS are asking is whether the causal effect of a temperature shock has "level effects", where the economy is hurt in one year but catches up in the next year, or "growth effects", where the economy is permanently smaller.  Answering it requires an identification strategy, which is what most of the literature has focused on, following the method artfully outlined by DJO: isolate the role of temperature from other confounding factors using choices about fixed effects that most plausibly achieve this unconfounding, and then distinguish growth from level effects by looking at the sum of contemporaneous and lagged effects.  If the sum is zero, this is evidence of level effects; if it's not zero, evidence of growth effects.  The current manuscript does not have an identification strategy for measuring these lagged effects, and instead uses goodness-of-fit metrics to draw causal inferences about their magnitude.  The tool they are using is again not commensurate with the question they are asking.
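In a distributed-lag version of the specification above (again our shorthand, not NPS's notation), the test looks like:

```latex
% Add J lags of temperature:
\[
  g_{it} = \sum_{j=0}^{J} \beta_j \, T_{i,t-j} + \mu_i + \theta_t + \varepsilon_{it}
\]
% The object of interest is the sum of the lag coefficients:
%   \sum_j \beta_j = 0     -> losses in a hot year are recovered later (level effect)
%   \sum_j \beta_j \neq 0  -> the economy ends up on a permanently different path (growth effect)
```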

Errors in interpretation.  These conceptual issues aside, the authors' conclusions in the abstract and introduction of their paper about the results from their cross validation exercise do not appear consistent with their actual findings as reported in the tables.  The authors conclude that "the best performing models have non-linear temperature effects on GDP levels", but they demonstrate no statistically distinguishable differences in results between levels and growth models in their main tables (Tables 2 and 3, and A1-A4), nor between linear and non-linear models.  This directly contradicts the statement in their abstract.

To be precise, the authors state in the abstract "The best-performing models have non-linear temperature effects on GDP levels."  But then on page 27 they clearly state: "The MCS ["model confidence sets", or the set of best performing models whose performance is statistically indistinguishable from one another], however, does not discern among temperature functional forms or growth and level effects." This is in reference to Table 2, reproduced below; models in the MCS are denoted with asterisks, and a range of both growth and levels models have asterisks, meaning their performance cannot be statistically distinguished from one another. 

So, again, the paper's abstract is not consistent with its own stated results.  They do find that the model with quadratic time trends (as used by BHM) is outperformed by models without country time trends -- but again, the purpose of those time trends is to remove potential confounding variation, not to perfectly predict GDP.  See here for a simple look at the data on whether individual countries' growth rates have differential time trends that might make you worried about time-trending unobservables at a country level [spoiler: yes they do].

Errors in implementation.  Even if cross validation were the right tool here, the authors make some non-standard choices in how the CV is implemented, which we again believe make the results very hard to interpret.  Instead of first splitting the data between disjoint train and test sets, they first transform the data by regressing out the fixed effects, and then split the residualized data into train and test.  But the variation remaining in the residualized data will be very different depending on which set of fixed effects has been regressed out, and this will directly affect the resulting estimates of the RMSE.  It is thus not a surprise that models with region-year FE have lower RMSE than models with year FE (in models with no time trends) -- region-year FEs almost mechanically take out more of the variation.  But this means you can't meaningfully compare RMSEs across different fixed effects in the way that they are doing -- you are literally looking at two different datasets with two different outcome variables.  You can in principle only compare functional forms within a given choice of FE.

Imagine the following: you have outcome Y, predictor X, and covariates W and Z.  Z is a confounder, correlated with both X and Y.  W is correlated with X but not with Y.

In version 1 you partial Z out of both X and Y, and generate residualized values Y_1 and X_1.

In version 2 you partial W out of both X and Y, and generate residualized values Y_2 and X_2. 

This is in effect what NPS do, and then they want us to compare Y_1 = f(X_1) versus Y_2 = f(X_2).  But this clearly doesn't make sense, because Y_2 and X_2 still have the confounding variation of Z in them, and Y_1 and Y_2 are no longer measuring the same thing.  So comparing the predictive power of f(X_1) vs f(X_2) is not meaningful.  It is also not how cross validation is supposed to work -- instead, we should be comparing predictive performance on the same outcome variable.
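A quick simulation makes the problem concrete.  The variable names follow the thought experiment above; the coefficients are made up, and we residualize with simple OLS rather than fixed effects, but the logic is the same:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Z confounds X and Y; W is correlated with X but not with Y.
Z = rng.normal(size=n)
W = rng.normal(size=n)
X = Z + W + rng.normal(size=n)
Y = 1.0 * X + 2.0 * Z + rng.normal(size=n)  # true causal effect of X is 1.0

def residualize(v, control):
    """Partial `control` out of `v` via OLS."""
    b = np.polyfit(control, v, 1)
    return v - np.polyval(b, control)

# Version 1: partial out the confounder Z.  Version 2: partial out W.
Y1, X1 = residualize(Y, Z), residualize(X, Z)
Y2, X2 = residualize(Y, W), residualize(X, W)

for label, (y_r, x_r) in {"Z removed": (Y1, X1), "W removed": (Y2, X2)}.items():
    b = np.polyfit(x_r, y_r, 1)
    rmse = np.sqrt(np.mean((y_r - np.polyval(b, x_r)) ** 2))
    print(f"{label}: slope={b[0]:.2f}, var(Y_resid)={y_r.var():.2f}, RMSE={rmse:.2f}")

# Version 1 recovers the true slope (~1) and shows a much lower RMSE,
# partly because removing Z mechanically strips variance out of Y.
# Version 2's slope is badly confounded (~2), and its RMSE is computed
# on a different, higher-variance outcome -- so the two RMSEs are not
# comparable, which is the problem with comparing across FE choices.
```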

Further hazards in cross validation.  While we don't agree that CV can be used to select the fixed effects in this setting, we do agree with NPS that cross validation could in principle be used to identify the "right" functional form that relates temperature to GDP (conditional on chosen controls).  E.g., is the relationship linear? Quadratic? Something fancier?  This works because the causal part has already been taken care of by the fixed effects (or so one hopes), and so what's left is to best describe the functional form of how x relates to y.  Cross validation for this purpose has been successfully implemented in settings in which temperature explains a lot of the variation in the outcome -- e.g. in studies of agricultural productivity (see Schlenker and Roberts 2009 for an early example in this literature).

But unfortunately when it comes to GDP, while temperature has a strong and statistically significant relationship to GDP in past work, it does not explain a lot of the overall interannual variation in GDP;  GDP growth is noisy and poorly predicted even by the hotshot macro guys.  In this low r-squared environment, selecting functional form by cross validation can be difficult and perhaps hazardous.  It's too easy to overfit to noise in the training set.

To see this, consider the following simulation, in which we specify a known cubic relationship between y and x and then try to use cross validation to recover the "right" order of polynomial, studying our ability to do so as we crank up the noise in y.  We do this a bunch of times, each time training the regression model on 80% of the data and testing on 20%.  We calculate RMSE on the test data and compute the average % reduction in RMSE relative to a model with only an intercept.  As shown in the figure below, we can mostly reject the linear model but have a lot of trouble picking out the cubic model from anything else non-linear, particularly in settings where the overall explained variation in y is small.  Given that temperature explains <5% of the variation in GDP growth rates (analogous to the far right grouping of bars in each plot), cross validation is going to really struggle to pick the "right" functional form.
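Here's a sketch of the kind of simulation we mean; the exact functional form, noise levels, and replication counts below are illustrative placeholders, not the ones behind the figure:

```python
import numpy as np

rng = np.random.default_rng(3)

def cv_rmse_reduction(noise_sd, degree, n=1_000, n_reps=200):
    """Average % reduction in test RMSE, relative to an intercept-only
    model, for a polynomial of `degree` fit to data from a true cubic."""
    reductions = []
    for _ in range(n_reps):
        x = rng.uniform(-2, 2, size=n)
        y = x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(scale=noise_sd, size=n)
        idx = rng.permutation(n)                      # 80/20 train/test split
        train, test = idx[:int(0.8 * n)], idx[int(0.8 * n):]
        coefs = np.polyfit(x[train], y[train], degree)
        rmse = np.sqrt(np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2))
        rmse0 = np.sqrt(np.mean((y[train].mean() - y[test]) ** 2))  # intercept only
        reductions.append(100 * (1 - rmse / rmse0))
    return np.mean(reductions)

# Crank up the noise and see whether CV can still pick out the cubic:
for noise_sd in [1, 5, 20]:
    scores = {d: round(cv_rmse_reduction(noise_sd, d), 1) for d in range(1, 6)}
    print(f"noise sd {noise_sd}: % RMSE reduction by degree = {scores}")
```

As the noise grows, the RMSE reductions for degrees 2 through 5 converge toward one another (and toward zero), and the "winning" degree starts to bounce around from run to run.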

To be clear, the point we're making in this section is just about functional form, not about growth versus levels.  Even for this narrower task, where cross validation is potentially appropriate, it does not end up being a useful tool, because the model overfits to noise in the training set.


This exercise does illustrate an important point, however:  right now the data are consistent with a bunch of potential functional forms that we can't easily distinguish.  We argue in BHM that there is pretty consistent evidence that the temperature/growth relationship is non-linear and roughly quadratic at the global level, but we certainly can't rule out higher order polynomials and we say so in that paper.  

Wrapping up. So where does this leave us?  Techniques like cross validation certainly still have utility for other approaches to causal inference questions (e.g. in selecting suitable controls for treated units in synthetic control settings), and there might be opportunities to apply those approaches in this domain.  Similarly, we fully agree with NPS's broader point that using data-driven approaches to make key analytical decisions in climate impacts (or any other empirical) paper should be our goal.  But in this particular application, we don't think that the specific approach taken by NPS has improved our understanding of climate/economy linkages.  We look forward to working with them and others to continue to make progress on these issues.