G-FEED: An American, a Canadian and a physicist walk into a bar with a regression... why not to use log(temperature)

Wednesday, November 7, 2012

An American, a Canadian and a physicist walk into a bar with a regression... why not to use log(temperature)

Many of us applied statisticians like to transform our data (prior to analysis) by taking the natural logarithm of variable values. This transformation is appealing because it turns regression coefficients into elasticities, which are especially nice because they are unitless. In the regression

log(y) = b*log(x)

b represents the percentage change in y that is associated with a 1% change in x. But this transformation is not always a good idea.

I frequently see papers that examine the effect of temperature (or control for it because they care about some other factor) and use log(temperature) as an independent variable. This is a bad idea because a 1% change in temperature is ambiguous: its size depends on the temperature scale.

Imagine an author estimates

log(Y) = b*log(temperature)

and obtains the estimate b = 1. The author reports that a 1% change in temperature leads to a 1% change in Y. I have seen this done many times.

Now an American reader wants to apply this estimate to some hypothetical scenario where the temperature changes from 75 Fahrenheit (F) to 80 F. She computes the change in the independent variable D:

D_American = log(80) - log(75) = 0.065

and concludes that because temperature is changing by 6.5%, Y also changes by 6.5% (since 0.065*b = 0.065*1 = 0.065).

But now imagine that a Canadian reader wants to do the same thing. Canadians use the metric system, so they measure temperature in Celsius (C) rather than Fahrenheit. Because 80F = 26.67C and 75F = 23.89C, the Canadian computes

D_Canadian = log(26.67) - log(23.89) = 0.110

and concludes that Y increases 11%.

Finally, a physicist tries to compute the same change in Y, but physicists use Kelvin (K) and 80F = 299.82K and 75F = 297.04K, so she uses

D_physicist = log(299.82) - log(297.04) = 0.009

and concludes that Y increases by a measly 0.9%.
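The three readers' calculations can be reproduced with a short script. This is a minimal sketch using natural logs; the helper names (`f_to_c`, `c_to_k`, `log_change`) are illustrative and not from the original post.

```python
import math

def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32.0) * 5.0 / 9.0

def c_to_k(c):
    """Convert degrees Celsius to Kelvin."""
    return c + 273.15

def log_change(start, end):
    """Change in the natural log of a variable between two values."""
    return math.log(end) - math.log(start)

# The scenario from the post: temperature rises from 75 F to 80 F.
start_f, end_f = 75.0, 80.0

changes = {
    "Fahrenheit": log_change(start_f, end_f),
    "Celsius": log_change(f_to_c(start_f), f_to_c(end_f)),
    "Kelvin": log_change(c_to_k(f_to_c(start_f)), c_to_k(f_to_c(end_f))),
}

# With b = 1, each D is also the implied change in log(Y).
for scale, d in changes.items():
    print(f"{scale:10s} D = {d:.3f}")
```

The same physical change in temperature gives a different D on each scale, which is the whole problem.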

What happened? Usually we like the log transformation because it makes units irrelevant. But here changes in units dramatically changed the prediction of this model, causing it to range from 0.9% to 11%!

The answer is that the log transformation is a bad idea when the value x = 0 is not anchored to a unique [physical] interpretation. When we change from Fahrenheit to Celsius to Kelvin, we change the meaning of "zero temperature" since 0 F does not equal 0 C which does not equal 0 K. This causes a 1% change in F to not have the same meaning as a 1% change in C or K. The log transformation is robust to a rescaling of units but not to a recentering of units.
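The difference between rescaling and recentering can be written out directly. In this sketch, x_1 and x_2 are two values of the variable, a > 0 is a scale factor and c is a shift; these symbols are introduced here for illustration and are not in the original post.

```latex
% Rescaling (e.g. inches to millimeters): x' = a x
\log(a x_2) - \log(a x_1) = \log\frac{a x_2}{a x_1} = \log\frac{x_2}{x_1}

% Recentering (e.g. Fahrenheit to Celsius): x' = a x + c
\log(a x_2 + c) - \log(a x_1 + c) = \log\frac{a x_2 + c}{a x_1 + c}
  \neq \log\frac{x_2}{x_1} \quad \text{whenever } c \neq 0 \text{ and } x_1 \neq x_2
```

For Fahrenheit to Celsius, a = 5/9 and c = -160/9, so the shift term does not cancel and the log change depends on the scale.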

For comparison, log(rainfall) is an okay measure to use as an independent variable, since zero rainfall is always the same, regardless of whether one uses inches, millimeters or Smoots to measure rainfall.
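A quick check of the rainfall case, as a sketch: the rainfall totals below are hypothetical values chosen only for illustration.

```python
import math

MM_PER_INCH = 25.4  # exact conversion factor

def log_change(start, end):
    """Change in the natural log of a variable between two values."""
    return math.log(end) - math.log(start)

# Hypothetical rainfall totals in inches (illustrative, not from the post).
start_in, end_in = 2.0, 3.0

d_inches = log_change(start_in, end_in)
d_mm = log_change(start_in * MM_PER_INCH, end_in * MM_PER_INCH)

# Converting inches to millimeters is a pure rescaling, so the factor
# cancels inside the log difference and both values agree.
print(f"D in inches: {d_inches:.6f}")
print(f"D in mm:     {d_mm:.6f}")
```

Because zero rainfall is zero in every unit, the conversion has no shift term and the log change is the same.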

[Cross-posted at Fight-Entropy]

4 comments:

Anonymous, November 12, 2012 at 10:30 AM
All the numbers in the examples are wrong, I think you took the base 10 log instead of the natural log.

And then, I don't really see the problem at all. Sure, taking logs with a unit like Fahrenheit doesn't give you elasticities, but instead unit-dependent estimates of the coefficients. So what? That is not wrong, as long as you are aware of it, right? What is the alternative? It seems that any way you could incorporate temperature will get you a unit-dependent estimate, so why would this be worse than, e.g., putting in temperature in one of the three units without taking logs?

sol, November 12, 2012 at 11:28 AM
Yes, the numbers in the post are log10, not ln. In a log-log specification, though, the base of the log doesn't matter, since log10 is just ln multiplied by the same constant, 1/ln(10), on both the RHS and LHS. In a log-linear specification (what I'm advocating for) or a linear-log specification, the base does matter.
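The point about the base of the log can be spelled out as follows. This is a sketch using the same y, x and b as the regression in the post.

```latex
% Log-log: the factor 1/\ln(10) appears on both sides and cancels
\log_{10} y = b \log_{10} x
  \iff \frac{\ln y}{\ln 10} = b \, \frac{\ln x}{\ln 10}
  \iff \ln y = b \ln x

% Log-linear: only the left side is rescaled, so the coefficient changes
\log_{10} y = b x \iff \ln y = (b \ln 10) \, x
```

In the log-linear case, switching between log10 and ln changes the coefficient by a factor of ln(10), roughly 2.303.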

WRT what the problem is:

If temperature is kept as a linear variable, the coefficient will be scaled according to its units. This is true of all coefficients of non-logged variables. But when we keep our non-logged variables in this form, we explicitly label the units for our coefficients. E.g., I might report that B = $1/degree C or B = $0.56/degree F. But at least in this case, I'm explicitly keeping track of units and reporting them.

The preference for the log-log specification is (often) that units can be ignored (if it's done correctly), so usually they are ignored and not reported. Instead, analysts just report B = 2%/1% change in temperature, or something similar. What I am saying here is that's not enough, since a 1% change in temperature is not uniquely defined (because zero temperature moves around on different scales). If you want to report coeffs saying that some change corresponds with a "1% change in temperature in Celsius," that's not wrong, but it just seems kind of silly since the unitless advantage of using elasticities is gone. So while mathematically it's fine, I think it's a bad habit because most folks will not remember to report the original non-logged units every time they reference the coeff. By using a linear scale, we always remember to be honest about units since we have to be.

In addition, it also seems a bit silly to anyone with physics training, since 0C and 0F have no real meaning (beyond things like freezing water...), so saying you're 1% higher relative to an arbitrary baseline sounds strange. If you want to use percentages and temperature, the only scale that has physical intuition is Kelvin, since it measures the average kinetic energy of molecules in the material that's being observed. But ln(temperature in Kelvin) is basically indistinguishable from a linear rescaling of Kelvin over the range of temperatures that we're usually concerned about (~300K), so again there is no real notational advantage to using logs.
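How close ln(temperature in Kelvin) is to a straight line near 300K can be checked numerically. This is a sketch: the 270K to 310K range and the first-order expansion around 300K are assumptions chosen for illustration, not values from the thread.

```python
import math

T0 = 300.0  # expansion point in Kelvin, matching the "~300K" above

def linear_approx(t_kelvin):
    """First-order Taylor expansion of ln(T) around T0."""
    return math.log(T0) + (t_kelvin - T0) / T0

# Assumed range of everyday temperatures: 270 K to 310 K in 1 K steps.
temps = [270.0 + i for i in range(41)]

# Largest gap between ln(T) and its linear approximation over the range.
max_err = max(abs(math.log(t) - linear_approx(t)) for t in temps)

# Total change in ln(T) across the range, for comparison.
log_span = math.log(temps[-1]) - math.log(temps[0])

print(f"max |ln(T) - linear| = {max_err:.4f}")
print(f"ln(T) span over range = {log_span:.4f}")
```

The gap is small relative to the span, so over this range a regression on ln(Kelvin) is close to a regression on a linearly rescaled Kelvin.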

I think the only case when log(temperature) should be used is if we have very strong theoretical priors that the response function we're estimating should be a power-function of temperature (in some scale), in which case we should use log(temp) because it is the correct specification. But I have never yet seen a theoretical reason why this should be true in an econometric context.

Anonymous, November 12, 2012 at 1:17 PM
Yes, ln vs log doesn't matter for your estimate in a log-log spec, but all the percentage changes you describe in the text are wrong. A change in temperature from 75 to 80 is not 2.8%...

I fully agree that if someone sells a log-log in temp as an elasticity, it's wrong. But if someone uses that estimate to then forecast something in a paper using the same functional form, all results can be perfectly correct.

In fact, they might be better, if the log-log model happens to fit the data better than a log-linear model. After all, this is not just a cosmetic unit choice, this is actually a choice between different models. So surely we should pick the one that performs better. And here I don't understand why we should make that choice based on theoretical priors. Just try both specifications and take the one that provides the better fit, no?

I guess that is what I don't understand about this blog entry: why suggest that generally one spec is better than another one? Why not keep an open mind and pick the one that fits the data better?