Monday, December 21, 2015

From the archives: Friends don't let friends add constants before squaring

I was rooting around in my hard drive for a review article when I tripped over this old comment that Marshall, Ted, and I drafted a while back.

While working on our 2013 climate meta-analysis, we ran across an interesting article by Ole Theisen at PRIO, who coded up all sorts of violence at a highly local level in Kenya to investigate whether local climatic events, like rainfall and temperature anomalies, appeared to be affecting conflict. Theisen was estimating a model analogous to:

$$conflict_{it} = \beta_0 + \beta_1 T_{it} + \beta_2 T_{it}^2 + \beta_3 R_{it} + \beta_4 R_{it}^2 + \mu_i + \varepsilon_{it}$$
and reported finding no effect of either temperature or rainfall. I was looking through the replication code of the paper to check the structure of the fixed effects being used when I noticed something: the squared terms for temperature and rainfall were offset by a constant, so that the minimum of the squared terms did not occur at zero.



(Theisen was using standardized temperature and rainfall measures, so they were both centered at zero.) This offset was not apparent in the linear terms of these variables, which got us thinking about whether it matters. Often, when working with linear models, we get used to shifting variables around by a constant, usually out of convenience, and it doesn't matter much. But in nonlinear models, adding a constant incorrectly can be dangerous.

After some scratching pen on paper, we realized that

$$\tilde T = T + C$$

for the squared term in temperature (C is a constant), which when squared gives:

$$\tilde T^2 = T^2 + 2CT + C^2$$

Because this constant was not added to the linear terms in the model, the actual regression Theisen was running is:

$$conflict_{it} = \tilde\beta_0 + \tilde\beta_1 T_{it} + \tilde\beta_2 \tilde T_{it}^2 + \cdots = \underbrace{\tilde\beta_0 + C^2\tilde\beta_2}_{\beta_0} + \underbrace{(\tilde\beta_1 + 2C\tilde\beta_2)}_{\beta_1} T_{it} + \underbrace{\tilde\beta_2}_{\beta_2}\, T_{it}^2 + \cdots$$
which can be converted to the earlier intended equation by computing linear combinations of the regression coefficients (as indicated by the underbraces), but directly interpreting the beta-tilde coefficients as the linear and squared effects is not right -- except for beta-tilde_2, which is unchanged. Weird, huh? If you add a constant prior to squaring for only the measure that is squared, then the coefficient on that squared term is fine, but it messes up all the other coefficients in the model. This didn't seem intuitive to us, which is part of why we drafted up the note.
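The coefficient relationships are easy to verify in a quick simulation (a sketch with made-up data and arbitrary coefficients, not Theisen's actual data or code):

```python
# Regressing on T and (T + C)^2 instead of T and T^2 leaves the squared
# coefficient intact but distorts the linear one by exactly 2*C*beta2
# (the two design matrices span the same column space, so the mapping
# between coefficient vectors is exact, not just asymptotic).
import numpy as np

rng = np.random.default_rng(0)
n, C = 1000, 0.5                      # C is the offset added before squaring
T = rng.standard_normal(n)            # standardized temperature, centered at zero
y = 1.0 + 2.0 * T - 3.0 * T**2 + rng.standard_normal(n)  # "true" model

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Correct specification: 1, T, T^2
b = ols(np.column_stack([np.ones(n), T, T**2]), y)
# Mis-specified: 1, T, (T + C)^2
bt = ols(np.column_stack([np.ones(n), T, (T + C)**2]), y)

assert np.isclose(bt[2], b[2])                  # squared term: unchanged
assert np.isclose(bt[1] + 2 * C * bt[2], b[1])  # beta1 = beta1~ + 2*C*beta2~
assert np.isclose(bt[0] + C**2 * bt[2], b[0])   # intercept shifts too
```

The linear combinations under the braces recover the intended coefficients exactly, which is why the fix below changes the estimated linear effect but not the squared one.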

To check this theory, we swapped out the T-tilde-squared measures for the correct T-squared measures and re-estimated the model in Theisen's original analysis. As predicted, the squared coefficients don't change, but the linear effects do.


This matters substantively, since the linear effect of temperature had appeared to be insignificant in the original analysis, leading Theisen to conclude that Marshall and Ted might have drawn incorrect conclusions in their 2009 paper finding that temperature affected conflict in Africa. But just removing the offending constant term revealed a large, positive, and significant linear effect of temperature in this new high-resolution data set, agreeing with the earlier work. It turns out that if you compute the correct linear combination of coefficients from Theisen's original regression (the terms above the brace for beta_1 above), you actually recover the correct marginal effect of temperature (and it is significant).

The error was not at all obvious to us originally, and we guess that lots of folks make similar errors without realizing it. In particular, it's easy to show that a similar effect shows up if you estimate interaction effects incorrectly (after all, temperature-squared is just an interaction with itself).
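The interaction version of the mistake can be checked the same way (again a toy sketch with arbitrary numbers): using (T + C)(Z + C) in place of T*Z leaves the interaction coefficient alone but shifts both linear coefficients by C times that coefficient.

```python
# Same phenomenon with an interaction term: (T + C)*(Z + C) expands to
# T*Z + C*T + C*Z + C^2, so the "extra" pieces get absorbed into the
# linear coefficients and the intercept, distorting them.
import numpy as np

rng = np.random.default_rng(1)
n, C = 1000, 0.7
T, Z = rng.standard_normal(n), rng.standard_normal(n)
y = 0.5 + 1.0 * T - 2.0 * Z + 1.5 * T * Z + rng.standard_normal(n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = ols(np.column_stack([np.ones(n), T, Z, T * Z]), y)
bt = ols(np.column_stack([np.ones(n), T, Z, (T + C) * (Z + C)]), y)

assert np.isclose(bt[3], b[3])              # interaction coefficient unchanged
assert np.isclose(bt[1] + C * bt[3], b[1])  # linear T coefficient distorted
assert np.isclose(bt[2] + C * bt[3], b[2])  # linear Z coefficient distorted
```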

Theisen's construction of this new data set is an important contribution, and when we emailed this point to him he was very gracious in acknowledging the mistake. This comment wasn't seen widely because, when we submitted it to the journal that published the original article, we received an email back from the editor stating that the "Journal of Peace Research does not publish research notes or commentaries."

This holiday season, don't let your friends drink and drive or add constants the wrong way in nonlinear models.

Monday, December 14, 2015

The right way to overfit

As the Heisman Trophy voters showed again, it is super easy to overfit a model. Sure, the SEC is good at playing football. But that doesn’t mean that the best player in their league is *always* the best player in the country. This year I don’t think it was even close.

At the same time, there are still plenty of examples of overfitting in the scientific literature. Even as datasets become larger, overfitting is still easy to do, since models often have more parameters than they used to. Most responsible modelers are pretty careful about presenting out-of-sample errors, but even those can be misleading when cross-validation techniques are used to select models, as opposed to just estimating errors.

Recently I saw a talk here by Trevor Hastie, a colleague at Stanford in statistics, which presented a technique that he and Brad Efron have recently started using that seems more immune to overfitting. They call it spraygun, which doesn't seem too intuitive a description to me. But who am I to question two of the giants of statistics?

Anyhow, a summary figure he presented is below. The x-axis shows the degree of model variance or overfitting, which increases toward the left, and the y-axis shows the error on a test dataset. In this case they're trying to predict beer ratings from over 1M samples (statistics students will know beer has always played an important role in statistics, since the origin of the "t-test"). The light red dots show the out-of-sample error for a traditional lasso model fit to the full training data. The dark red dots show models fit to subsets of the data, which unsurprisingly tend to overfit sooner and have worse overall performance. But what's interesting is that the average of the predictions from these overfit models does nearly as well as the model fit to the full data, until the tuning parameter is turned up enough that the full model overfits. At that point, the average of the models fit to subsets of the data continues to perform well, with no notable increase in out-of-sample error. This means one doesn't have to be too careful about optimizing the calibration stage. Instead, just (over)fit a bunch of models and take the average.
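The fit-many-and-average idea is easy to try yourself. Here's a minimal sketch (not Efron and Hastie's actual spraygun procedure, and plain least squares rather than the lasso): overfit high-degree polynomials to random subsets of a training set, then average their test-set predictions.

```python
# Overfit 50 degree-12 polynomials, each to a random subset of 40 of the
# 200 training points, then average their predictions on a test set.
# Because squared error is convex, the averaged prediction can never do
# worse than the average individual model -- and typically does far better.
import numpy as np

rng = np.random.default_rng(42)

def poly_design(x, degree):
    return np.vander(x, degree + 1)

def fit_predict(x_tr, y_tr, x_te, degree):
    coef = np.linalg.lstsq(poly_design(x_tr, degree), y_tr, rcond=None)[0]
    return poly_design(x_te, degree) @ coef

# Smooth truth plus noise; degree 12 on 40 points will overfit badly.
x_train = rng.uniform(-1, 1, 200)
y_train = np.sin(3 * x_train) + 0.3 * rng.standard_normal(200)
x_test = rng.uniform(-1, 1, 500)
y_test = np.sin(3 * x_test)

preds = []
for _ in range(50):
    idx = rng.choice(200, size=40, replace=False)
    preds.append(fit_predict(x_train[idx], y_train[idx], x_test, 12))
preds = np.array(preds)

def mse(p):
    return np.mean((p - y_test) ** 2)

individual = np.array([mse(p) for p in preds])
ensemble = mse(preds.mean(axis=0))

assert ensemble <= individual.mean()  # guaranteed by convexity of squared error
print(f"mean individual MSE: {individual.mean():.3f}, ensemble MSE: {ensemble:.3f}")
```

The guarantee that the ensemble beats the average individual model holds for any convex loss; how much it beats the *full-data* model depends on the problem, which is what makes the figure's result interesting.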

This obviously relates to the superior performance of ensembles of process-based models, such as those I discussed in a previous post about crop models. Even if individual models aren't very good, because they are overfit to their training data or for other reasons, the average model tends to be quite good. But in the world of empirical models, maybe we have also been too guilty of trying to find the 'best' model for a given application. That may make sense if one is really interested in the coefficients of the model, for instance if you are obsessed with the question of causality. But often our interest in models, and even in identifying causality, is just that we want good out-of-sample prediction. And for causality, it is still possible to look at the distribution of parameter estimates across the individual models.

Hopefully for some future posts one of us can test this kind of approach on models we’ve discussed here in the past. For now, I just thought it was worth calling attention to. Chances are that when Trevor or Brad have a new technique, it’s worth paying attention to. Just like it’s worth paying attention to states west of Alabama if you want to see the best college football player in the country.

Monday, December 7, 2015

Warming makes people unhappy: evidence from a billion tweets (guest post by Patrick Baylis)

Everyone likes fresh air, sunshine, and pleasant temperatures. But how much do we like these things? And how much would we be willing to pay to gain more of them, or to prevent a decrease in the current amount that we get?

Clean air, sunny days, and moderate temperatures can all be thought of as environmental goods. If you're not an environmental economist, it may seem strange to think about different environmental conditions as "goods". But, if you believe that someone prefers more sunshine to less and would be willing to pay some cost for it, then a unit of sunshine really isn't conceptually much different from, say, a loaf of bread or a Playstation 4.

The tricky thing about environmental goods is that they're usually difficult to value. Most of them are what economists call nonmarket goods, meaning that we don't have an explicit market for them. So unlike a Playstation 4, I can't just go to the store and buy more sunshine or a nicer outdoor temperature (or maybe I can, but it's very, very expensive). This also makes it more challenging to study how much people value these goods. Still, there is a long tradition in economics of using various nonmarket valuation methods to study this kind of problem.

New data set: a billion tweets

Wednesday, December 2, 2015

Renewable energy is not as costly as some think

The other day Marshall and Sol took on Bjorn Lomborg for ignoring the benefits of curbing greenhouse gas emissions.  Indeed.  But Bjorn, among others, is also notorious for exaggerating costs.  The fact is that most serious estimates of the cost of reducing emissions are fairly low, and there is good reason to believe cost estimates are too high, for the simple reason that analysts cannot measure or imagine all the ways we might curb emissions.  Anything analysts cannot model translates into cost exaggeration.

Hawai`i is a good case in point.  Since moving to Hawai`i I've started digging into energy, in large part because the situation here is so interesting.  We make electricity mainly from oil, which is super expensive.  We are also rich in sun and wind.  Add these facts to federal and state subsidies, and it spells a remarkable energy revolution.  Actually, renewables are now cost-effective even without subsidies.

In the video below Matthias Fripp, who I'm lucky to be working with now, explains how we can achieve 100% renewable energy by 2030 using current technology at a cost that is roughly comparable to our conventional coal and oil system. In all likelihood, with solar and battery costs continuing to fall, this goal could be achieved for a cost that's far less.  And all of this assumes zero subsidies.

One key ingredient:  We need to shift electric loads toward the supply of renewables, and we could probably do this with a combination of smart variable pricing and smart machines that respond to it.  More electric cars could help, too.  I'm sure some could argue with some of the assumptions, but it's hard to see how this could be wildly unreasonable.