Nearly all observational data show strong spatial patterns. Location matters, partly due to geophysical attributes, partly because of history, and partly because all the things that follow from these two key factors tend to feed back and exaggerate spatial patterns. If you're a data monkey you probably like to look at cool maps that illustrate spatial patterns, and spend a lot of time trying to make sense of them. I know I do.
Most observational empirical studies in economics and other disciplines need to account for this general spatial connectedness of things. In principle, you can do this in two ways: (1) develop a model of the spatial relationship; (2) account for the spatial connectedness by appropriately adjusting the standard errors of your regression model.
The first option is a truly heroic one, and almost all attempts I've seen seem foolhardy. Spatial geographic patterns are extremely complex and follow from deep geophysical and social histories (read Guns, Germs, and Steel). One is unlikely to uncover the full mechanism that underlies the spatial pattern. When one "models" this spatial pattern, assumptions drive the result, and the assumptions are, almost always, a heroic leap of faith.
That leaves (2), which shouldn't be all that difficult using modern statistical techniques, but does take some care and perhaps a little experimentation. It seems to me many are a little too blithe about it, and perhaps select methods that falsely exaggerate statistical significance.
Essentially, the problem is that there's normally a lot less information in a large data set than you think, because most observations from a region and/or time are correlated with other observations from that region and/or time. In statistical speak, the errors are clustered.
To illustrate how much this matters, I'll share some preliminary regressions from a current project of mine. Here I am predicting the natural log of corn yield using field-level data that span about 15 years on most of the corn fields in three major corn-producing U.S. states. I've got several hundred thousand observations. Yes, you read that right--it's a very rich data set.
But corn yields, as you can probably guess, tend to have a lot of spatial correlation. This happens in large part because weather, soils, and farming practices are spatially correlated. However, there isn't a lot of serial correlation in weather from year to year. So my data are highly correlated within years, and average outcomes have strong geographic correlation, but errors are mostly independent between years at a fixed location.
Whereas the amount of information in the data normally scales with the square root of the sample size, when the data are clustered, spatially or otherwise, a conservative estimate for the amount of information is the square root of the number of clusters you have. In this data set, we don't really have fixed clusters. It's more like smooth overlapping clusters. But we might proxy the "number" of clusters at around 45, the number of years × states I have, because most spatial correlation in weather fades out after about 500 miles. That would make the information scale with something like the square root of 45. These states border each other, though, so the effective number of clusters may be even less than 45. Now, I do have weather matched to each field depending on the field's individual planting date, which can vary a fair amount. That adds some statistical power. So, I hope it's a bit better than the square root of 45. Either way, something in the ballpark of 45 is a whole lot less than several hundred thousand.
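A rough way to put numbers on this is the textbook design-effect approximation for equal-sized clusters, n_eff = n / (1 + (m - 1) * rho), where m is the cluster size and rho is the within-cluster correlation. The sketch below is only an illustration: the sample size of 300,000, the function name, and the rho values are made up, not taken from the data set described here.

```python
def effective_sample_size(n_obs: int, n_clusters: int, rho: float) -> float:
    """Design-effect approximation, assuming equal-sized clusters.

    rho is the (assumed) correlation between two observations in the
    same cluster. rho = 0 gives back n_obs; rho = 1 gives n_clusters.
    """
    cluster_size = n_obs / n_clusters
    design_effect = 1.0 + (cluster_size - 1.0) * rho
    return n_obs / design_effect


if __name__ == "__main__":
    # Hypothetical numbers: 300,000 observations in 45 year-by-state clusters.
    for rho in (0.0, 0.01, 0.1, 0.5, 1.0):
        n_eff = effective_sample_size(300_000, 45, rho)
        print(f"rho = {rho:4.2f}  ->  effective n is about {n_eff:,.0f}")
```

Even a small within-cluster correlation pulls the effective sample size most of the way down toward the number of clusters, which is why the cluster count is the conservative benchmark.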
I regress the natural log of corn yield on:
YEAR: a time trend,
log(Potential): log of potential yield (output of a crop model calibrated from daily weather inputs),
gdd: growing degree days (a temperature measure),
DD29: degree days above 29C (a preferred measure of extreme heat),
Prec & Prec^2: season precipitation and precipitation squared,
PDay: number of days from Jan 1 until planting,
DD29:AvgCO2: an interaction between DD29 and CO2 exposure.
CO2 exposure varies a little bit spatially, and also temporally, due both to a trend from burning fossil fuels and other emissions and to seasonal fluctuations following from tree and leaf growth (earlier planting tends to have higher CO2, and higher CO2 can improve radiation water use efficiency in corn, which can effectively make the plants more drought tolerant).
The standard regression output gives:
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     2.320e+00  3.014e-02   76.98   <2e-16 ***
I(YEAR - 2000)  1.291e-02  4.600e-04   28.06   <2e-16 ***
log(Potential)  5.697e-01  5.470e-03  104.14   <2e-16 ***
gdd             1.931e-04  4.177e-06   46.24   <2e-16 ***
DD29           -2.477e-02  1.149e-03  -21.56   <2e-16 ***
Prec            1.787e-02  9.424e-04   18.96   <2e-16 ***
I(Prec^2)      -4.939e-04  2.038e-05  -24.24   <2e-16 ***
PDay           -6.798e-03  6.269e-05 -108.45   <2e-16 ***
DD29:AvgCO2     6.229e-05  2.953e-06   21.09   <2e-16 ***
Notice the huge t-statistics: all the parameters look precisely identified. But you should be skeptical.
Most people now use White "robust" standard errors, which use a variance-covariance matrix constructed from the residuals to account for arbitrary heteroscedasticity. Here's what that gives you:
                    Estimate   Std. Error   t value       Pr(>|t|)
(Intercept)     2.319894e+00 3.954834e-02  58.65970   0.000000e+00
I(YEAR - 2000)  1.290703e-02 5.362464e-04  24.06922  5.252870e-128
log(Potential)  5.696738e-01 7.161458e-03  79.54718   0.000000e+00
gdd             1.931294e-04 5.058033e-06  38.18271   0.000000e+00
DD29           -2.477002e-02 1.397239e-03 -17.72783   2.557376e-70
Prec            1.786707e-02 1.099087e-03  16.25627   2.016306e-59
I(Prec^2)      -4.938967e-04 2.327153e-05 -21.22321  5.830391e-100
PDay           -6.798270e-03 7.381894e-05 -92.09386   0.000000e+00
DD29:AvgCO2     6.229397e-05 3.616307e-06  17.22585   1.698989e-66
The standard errors are larger and the t-values smaller, but this standard approach still gives us extraordinary confidence in our estimates.
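For readers who want to see what the White correction does mechanically, here is a minimal Python sketch for the slope in a one-regressor model. It is not the code behind the output above. The simulated data and the names `simulate` and `classical_and_white_se` are hypothetical, and it uses the plain HC0 form with no small-sample adjustment.

```python
import math
import random


def simulate(n=2000, seed=0):
    """Toy data (made up): independent errors whose spread grows away from x = 5."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [1.0 + 0.5 * xi + rng.gauss(0, 0.1 + 0.5 * abs(xi - 5)) for xi in x]
    return x, y


def classical_and_white_se(x, y):
    """Return (slope, classical SE, White HC0 SE) for y = a + b*x + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical: one common error variance for every observation.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)
    # White HC0: each observation contributes its own squared residual.
    se_white = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_white


if __name__ == "__main__":
    slope, se_c, se_w = classical_and_white_se(*simulate())
    print(f"slope = {slope:.4f}, classical SE = {se_c:.4f}, White SE = {se_w:.4f}")
```

Note what the White formula still assumes: every observation's error is independent of every other's. It corrects for unequal variances, not for correlation between observations.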
You should remain skeptical. Here's what happens when I use robust standard errors clustered by year:
                Estimate Std. Error t value  Pr(>|t|)
(Intercept)     2.32e+00   5.57e-01    4.17 3.094e-05 ***
YEAR            1.29e-02   8.57e-03    1.52   0.12920
log(Potential)  5.70e-01   9.11e-02    6.25 4.000e-10 ***
gdd             1.93e-04   7.89e-05    2.45   0.01443 *
DD29           -2.48e-02   1.35e-02   -1.83   0.06719 .
Prec            1.79e-02   1.06e-02    1.68   0.09243 .
I(Prec^2)      -4.94e-04   2.15e-04   -2.29   0.02178 *
PDay           -6.80e-03   8.17e-04   -8.32   2.2e-16 ***
DD29:AvgCO2     6.23e-05   3.50e-05    1.78   0.07510 .
Standard errors are an order of magnitude larger and t-values are more humbling. Planting date and potential yield come in very strong, but now everything else is just borderline significant. It seems robust standard errors really aren't so robust.
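The jump from White to clustered standard errors can be reproduced on toy data. The Python sketch below is an illustration under assumptions, not the code used for the tables. The simulated panel (15 years, 200 fields, a common shock per year) and the names `simulate` and `white_and_cluster_se` are hypothetical. No finite-sample correction is applied, so canned routines will give somewhat different numbers.

```python
import math
import random


def simulate(n_years=15, n_fields=200, seed=0):
    """Toy panel (made up): weather and errors each share a common shock within a year."""
    rng = random.Random(seed)
    year, x, y = [], [], []
    for t in range(n_years):
        weather_t = rng.gauss(0, 1)  # regressor shock common to all fields in year t
        error_t = rng.gauss(0, 1)    # unobserved shock common to all fields in year t
        for _ in range(n_fields):
            xi = weather_t + rng.gauss(0, 0.3)
            year.append(t)
            x.append(xi)
            y.append(1.0 + 0.5 * xi + error_t + rng.gauss(0, 0.5))
    return year, x, y


def white_and_cluster_se(cluster, x, y):
    """Return (slope, White HC0 SE, cluster-robust SE) for y = a + b*x + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    intercept = ybar - slope * xbar
    # Score of each observation: centered regressor times residual.
    scores = [d * (yi - intercept - slope * xi) for d, xi, yi in zip(dx, x, y)]
    # White: square each score, then sum (observations treated as independent).
    se_white = math.sqrt(sum(s * s for s in scores)) / sxx
    # Clustered: sum scores within each cluster first, then square and sum.
    totals = {}
    for g, s in zip(cluster, scores):
        totals[g] = totals.get(g, 0.0) + s
    se_cluster = math.sqrt(sum(s * s for s in totals.values())) / sxx
    return slope, se_white, se_cluster


if __name__ == "__main__":
    slope, se_w, se_c = white_and_cluster_se(*simulate())
    print(f"slope = {slope:.3f}, White SE = {se_w:.4f}, clustered-by-year SE = {se_c:.4f}")
```

The only difference between the two formulas is the order of summing and squaring. When errors within a year move together, the within-year sums do not average out, and the clustered standard error comes out much larger.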
But even if we cluster by year, we are probably missing some important dependence, since geographic regions may have similar errors across years, and in clustering by year, I assume all errors in one year are independent of all errors in other years.
If I cluster by state, the standard robust/clustering procedure will account for both geographic and time-series dependence within a state. Since I know from earlier work that one state is about the extent of spatial correlation, this seems reasonable. Here's what I get:
                Estimate Std. Error  t value  Pr(>|t|)
(Intercept)     2.32e+00 1.1888e+00   1.9514 0.0510065 .
YEAR            1.29e-02 4.6411e-03   2.7810 0.0054194 **
log(Potential)  5.70e-01 1.6938e-01   3.3632 0.0007706 ***
gdd             1.93e-04 2.2126e-04   0.8729 0.3827338
DD29           -2.48e-02 2.6696e-02  -0.9279 0.3534781
Prec            1.79e-02 1.2786e-02   1.3974 0.1622882
I(Prec^2)      -4.94e-04 2.7371e-04  -1.8045 0.0711586 .
PDay           -6.80e-03 4.9912e-04 -13.6205 < 2.2e-16 ***
DD29:AvgCO2     6.23e-05 6.8565e-05   0.9085 0.3635962
Oops. Now most of the weather variables have lost their statistical significance too. But since I'm explicitly limiting assumed dependence in the cross section within years, now the time trend (YEAR) is significant, and it wasn't when clustering by year. We probably shouldn't take that significance very seriously, since some kinds of dependence (like technology) probably span well beyond one state.
Note that this strategy of using large clusters combined with robust SE treatment (canned in Stata, for example) is what's recommended in Angrist and Pischke's Mostly Harmless Econometrics.
There are other ways of dealing with these kinds of problems. For example, you can use a "block bootstrap" that resamples residuals whole years at a time, which preserves spatial correlation. This is great in agricultural applications since weather is pretty much IID across years at a fixed location and we should feel reasonably comfortable that there is little serial correlation. One can also adapt the method by Conley for panel data. Solomon Hsiang has graciously provided code here. In earlier agriculture-related work, Wolfram Schlenker and I generally found that clustering by state gives standard errors similar to those from these methods.
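As a rough illustration of the block bootstrap idea, the Python sketch below resamples whole years of residuals for a one-regressor model. It is a sketch under assumptions, not the procedure used in the work cited above. It assumes a balanced panel (the same fields in the same order every year) so that a year's residuals can be attached to any other year, and the names `fit`, `simulate_panel`, and `year_block_bootstrap_se` and the simulated data are hypothetical.

```python
import random
import statistics


def fit(x, y):
    """OLS intercept and slope for y = a + b*x + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b


def simulate_panel(n_years=15, n_fields=100, seed=0):
    """Toy balanced panel (made up): one list of fields per year, same field order."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_years):
        weather_t, error_t = rng.gauss(0, 1), rng.gauss(0, 1)
        x_t = [weather_t + rng.gauss(0, 0.3) for _ in range(n_fields)]
        y_t = [1.0 + 0.5 * xi + error_t + rng.gauss(0, 0.5) for xi in x_t]
        xs.append(x_t)
        ys.append(y_t)
    return xs, ys


def year_block_bootstrap_se(xs, ys, n_boot=200, seed=1):
    """Return (slope, bootstrap SE), resampling residuals one whole year at a time."""
    rng = random.Random(seed)
    flat_x = [xi for x_t in xs for xi in x_t]
    a, b = fit(flat_x, [yi for y_t in ys for yi in y_t])
    fitted = [[a + b * xi for xi in x_t] for x_t in xs]
    resid = [[yi - fi for yi, fi in zip(y_t, f_t)] for y_t, f_t in zip(ys, fitted)]
    slopes = []
    for _ in range(n_boot):
        y_star = []
        for f_t in fitted:
            # Draw a year with replacement and reuse its entire residual
            # vector, so the cross-field correlation within that year is kept.
            e_s = resid[rng.randrange(len(resid))]
            y_star.extend(fi + ei for fi, ei in zip(f_t, e_s))
        slopes.append(fit(flat_x, y_star)[1])
    return b, statistics.stdev(slopes)


if __name__ == "__main__":
    b, se = year_block_bootstrap_se(*simulate_panel())
    print(f"slope = {b:.3f}, year-block bootstrap SE = {se:.3f}")
```

With an unbalanced panel, one simple alternative is to resample whole years of observations (regressors and outcomes together) rather than residuals.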
The overarching lesson is this: try it different ways and err on the side of least significance, because it's very easy to underestimate your standard errors and very hard to overestimate them.
And watch out for data errors: these have a way of screwing up both estimates and standard errors, sometimes quite dramatically.
If you had the patience to follow all of this, you might appreciate the footnotes and appendix in our recent comment on Deschenes and Greenstone.