Wednesday, November 7, 2012

An American, a Canadian and a physicist walk into a bar with a regression... why not to use log(temperature)

Many of us applied statisticians like to transform our data (prior to analysis) by taking the natural logarithm of variable values. This transformation is clever because it turns regression coefficients into elasticities, which are especially nice because they are unitless. In the regression

log(y) = b* log(x)

b represents the percentage change in y that is associated with a 1% change in x. But this transformation is not always a good idea.  
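The elasticity reading of b follows from differentiating both sides of the regression. A short sketch of that standard step, using only the post's own symbols y, x and b:

```latex
\[
\log(y) = b \log(x)
\;\Longrightarrow\;
\frac{dy}{y} = b\,\frac{dx}{x}
\;\Longrightarrow\;
b = \frac{dy/y}{dx/x}
\]
```

Both the numerator and the denominator are proportional changes, which is why b carries no units when x has a meaningful zero.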

I frequently see papers that examine the effect of temperature (or control for it because they care about some other factor) and use log(temperature) as an independent variable.  This is a bad idea because a 1% change in temperature is an ambiguous value. 

Imagine an author estimates

log(Y) = b*log(temperature)

and obtains the estimate b = 1. The author reports that a 1% change in temperature leads to a 1% change in Y. I have seen this done many times.

Now an American reader wants to apply this estimate to some hypothetical scenario where the temperature changes from 75 Fahrenheit (F) to 80 F. She computes the change in the independent variable  D:

DAmerican = log(80)-log(75) = 0.065

and concludes that because temperature is changing 6.5%, then Y also changes 6.5% (since 0.065*b = 0.065*1 = 0.065).

But now imagine that a Canadian reader wants to do the same thing.  Canadians use the metric system, so they measure temperature in Celsius (C) rather than Fahrenheit. Because 80F = 26.67C and 75F = 23.89C, the Canadian computes

DCanadian = log(26.67)-log(23.89) = 0.110

and concludes that Y increases 11%.

Finally, a physicist tries to compute the same change in Y, but physicists use Kelvin (K) and 80F = 299.82K and 75F = 297.04K, so she uses

Dphysicist = log(299.82) - log(297.04) = 0.009

and concludes that Y increases by a measly 0.9%.

What happened? Usually we like the log transformation because it makes units irrelevant. But here a change of units dramatically changed the prediction of this model, which now ranges from 0.9% to 11%!
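The three readers' calculations can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post; the helper names (`f_to_c`, `f_to_k`, `log_change`) are made up for illustration.

```python
import math


def f_to_c(temp_f):
    """Convert Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def f_to_k(temp_f):
    """Convert Fahrenheit to Kelvin."""
    return f_to_c(temp_f) + 273.15


def log_change(start, end):
    """Change in the natural log of a variable between two values."""
    return math.log(end) - math.log(start)


# The same physical change, 75 F to 80 F, expressed on three scales.
start_f, end_f = 75.0, 80.0
changes = {
    "Fahrenheit": log_change(start_f, end_f),
    "Celsius": log_change(f_to_c(start_f), f_to_c(end_f)),
    "Kelvin": log_change(f_to_k(start_f), f_to_k(end_f)),
}

for scale, d in changes.items():
    # With b = 1, the predicted change in log(Y) equals d.
    print(f"{scale}: D = {d:.3f}")
```

Rounded to three decimals, this gives 0.065, 0.110 and 0.009, matching the American, Canadian and physicist above.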

The answer is that the log transformation is a bad idea when the value x = 0 is not anchored to a unique [physical] interpretation. When we change from Fahrenheit to Celsius to Kelvin, we change the meaning of "zero temperature" since 0 F does not equal 0 C which does not equal 0 K.  This causes a 1% change in F to not have the same meaning as a 1% change in C or K.   The log transformation is robust to a rescaling of units but not to a recentering of units.
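The difference between rescaling and recentering can be written out explicitly. In the sketch below, a is a hypothetical positive scale factor, c is a hypothetical shift, and x_1, x_2 are two positive values of the variable; none of these symbols appear in the post itself.

```latex
\[
\log(a x_2) - \log(a x_1) = \log\frac{a x_2}{a x_1} = \log(x_2) - \log(x_1), \qquad a > 0
\]
\[
\log(x_2 + c) - \log(x_1 + c) = \log\frac{x_2 + c}{x_1 + c} \neq \log(x_2) - \log(x_1)
\quad \text{when } c \neq 0 \text{ and } x_1 \neq x_2
\]
```

Converting Fahrenheit to Celsius involves both a shift (subtract 32) and a rescaling (multiply by 5/9), and converting Celsius to Kelvin is a pure shift of 273.15, so neither conversion leaves the log difference unchanged.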

For comparison, log(rainfall) is an okay measure to use as an independent variable, since zero rainfall is always the same, regardless of whether one uses inches, millimeters or Smoots to measure rainfall.
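A quick check of the rainfall case, assuming a made-up change from 2 to 3 inches of rain (these values are illustrative, not from any data):

```python
import math

MM_PER_INCH = 25.4  # exact conversion factor between inches and millimeters

# Hypothetical rainfall values, chosen only for illustration.
start_in, end_in = 2.0, 3.0
start_mm, end_mm = start_in * MM_PER_INCH, end_in * MM_PER_INCH

d_inches = math.log(end_in) - math.log(start_in)
d_mm = math.log(end_mm) - math.log(start_mm)

# A pure rescaling cancels inside the log difference, so both agree.
print(f"inches: {d_inches:.3f}, millimeters: {d_mm:.3f}")
```

Because both units share the same zero, the conversion factor drops out and every reader computes the same change.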


  1. All the numbers in the examples are wrong, I think you took the base 10 log instead of the natural log.

    And then, I don't really see the problem at all. Sure, taking logs with a unit like Fahrenheit doesn't give you elasticities, but instead unit dependent estimates of the coefficients. So what? That is not wrong, as long as you are aware of it, right? What is the alternative? It seems that any way you could incorporate temperature will get you a unit dependent estimate, so why would this be worse than e.g. putting in temperature in one of the three units without taking logs?

  2. Yes, the numbers in the post are log10, not ln. Although, in a log-log specification the base of the log doesn't matter since both the RHS and LHS can be rescaled to ln by the same number. Here it's 1/ln(10). Although in a log-linear (what I'm advocating for) or a linear-log specification, this does matter.

    WRT what the problem is:

    If temperature is kept as a linear variable, the coefficient will be scaled according to its units. This is true of all coefficients of non-logged-variables. But when we keep our non-logged-variables in this form, we explicitly label the units for our coefficients. Eg. I might report that B = $1/degree C or B = $0.56/degree F. But at least in this case, I'm explicitly keeping track of units and reporting them.

    The preference for the log-log specification is (often) that units can be ignored (if it's done correctly), so usually they are ignored and not reported. Instead, analysts just report B = 2%/1% change in temperature, or something similar. What I am saying here is that's not enough, since a 1% change in temperature is not uniquely defined (because zero temperature moves around on different scales). If you want to report coeffs saying that some change corresponds with a "1% change in temperature in Celsius," that's not wrong, but it just seems kind of silly since the unitless advantage of using elasticities is gone. So while mathematically it's fine, I think it's a bad habit because most folks will not remember to report the original non-logged units every time they reference the coeff. By using a linear scale, we always remember to be honest about units since we have to be.

    In addition, it also seems a bit silly to anyone with physics training, since 0C and 0F have no real meaning (beyond things like freezing water...), so saying you're 1% higher relative to an arbitrary baseline sounds strange. If you want to use percentages and temperature, the only scale that has physical intuition is Kelvin, since it measures the average kinetic energy of molecules in the material that's being observed. But ln(temperature in Kelvin) is basically indistinguishable from a linear rescaling of Kelvin over the range of temperatures that we're usually concerned about (~300K), so again there is no real notational advantage to using logs.

    I think the only case when log(temperature) should be used is if we have very strong theoretical priors that the response function we're estimating should be a power-function of temperature (in some scale), in which case we should use log(temp) because it is the correct specification. But I have never yet seen a theoretical reason why this should be true in an econometric context.

  3. Yes, ln vs log doesn't matter for your estimate in a log-log spec, but all the percentage changes you describe in the text are wrong. A change in temperature from 75 to 80 is not 2.8%...

    I fully agree that if someone sells a log-log in temp as an elasticity, it's wrong. But if someone uses that estimate to then forecast something in a paper using the same functional form, all results can be perfectly correct.

    In fact, they might be better, if the log-log model happens to fit the data better than a log-linear model. After all, this is not just a cosmetic unit choice, this is actually a choice between different models. So surely we should pick the one that performs better. And here I don't understand why we should make that choice based on theoretical priors. Just try both specifications and take the one that provides the better fit, no?

    I guess that is what I don't understand about this blog entry: why suggest that generally one spec is better than another one? Why not keep an open mind and pick the one that fits the data better?

    1. Alright, my mistake about the numbers is fair. My desktop calculator program has different notation than matlab. I've fixed the numbers now, thanks for pointing it out.

      WRT whether it's okay to use log(temp) if it fits the data better, I would still say it's a bad idea. There are lots of functions that one can use to transform a temperature variable (an infinite number, actually) but we don't usually play around trying to find the one that fits the data "the best," in part because there is no single measure of fit and in part because we worry about overfitting models (and besides, I feel I've never seen someone who makes the mistake of using log(temp) actually go through the trouble of trying to show that it's a good idea based on goodness-of-fit). Instead, we usually make some approximations that we think are reasonable based on theory, intuition, the structure of the data or mathematical convenience.

      Unfortunately, I don't think log(temp) wins on any of these dimensions. If you're not using Kelvin then your temperature variable can take on negative values, but the log function is not defined over negative numbers. (This gets back to my original point that how we define "zero" really matters if we use logs). Now, I could imagine a response to this comment might be "my domain is far from negative values, so log is a reasonable approximation in my sample" (a sample which I'm guessing must be relatively tropical and/or be described in Fahrenheit). But if this is someone's response, why not just use a slightly different function that looks like log(temp) over the domain but is actually defined on the reals? Someone might try to replicate the approach on a different (colder) sample, and they might run into negative numbers. Why use a model that only works in a single context?

      If we're willing to make an approximation of the data, why insist on the log-based transformation? There are many other functions that look similar to the log over finite domains, so it's definitely not the only nonlinear option. And if it comes down to picking between a few approximations, then I would prefer the one that's not so easily misinterpreted by my audience (WRT elasticities) and one that's actually defined over the range of values one could reasonably expect to encounter in the real world.
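The domain problem raised in the last reply can be illustrated with a small sketch. The tangent-line function below is just one hypothetical example of a function that tracks log(temp) near a chosen center yet is defined for every real temperature; the center of 25 C and the sample temperatures are made-up values, and nothing here is a recommendation from the thread.

```python
import math


def log_or_none(x):
    """Return log(x), or None where the natural log is undefined (x <= 0)."""
    try:
        return math.log(x)
    except ValueError:
        return None


def tangent_line(x, center):
    """First-order Taylor approximation of log(x) around `center`."""
    return math.log(center) + (x - center) / center


center_c = 25.0  # hypothetical sample mean temperature, in Celsius

for temp_c in (30.0, 20.0, -5.0):
    exact = log_or_none(temp_c)
    approx = tangent_line(temp_c, center_c)
    print(f"{temp_c:6.1f} C  log: {exact}  tangent line: {approx:.3f}")
```

For the warm temperatures the two columns are close, but at -5 C the log is undefined while the linear stand-in still returns a value, which is the replication hazard for a colder sample.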