Any study that focuses on nonlinear temperature effects requires a precise estimate of the temperature distribution. Unfortunately, most gridded weather data sets only give monthly estimates (e.g., CRU, University of Delaware, and until recently PRISM). Monthly averages can hide extremes, both hot and cold: they do not capture how often, and by how much, temperatures cross a given threshold.
At the time Michael Roberts and I wrote our article on nonlinear temperature effects in agriculture, the PRISM climate group only made its monthly aggregates publicly available for download, not the underlying daily data. We therefore reverse-engineered the PRISM interpolation algorithm: we regressed the monthly averages at each PRISM grid cell on the monthly averages at the closest publicly available weather stations (7 or 10, depending on the version). Once we had the regression estimates linking monthly PRISM averages to the stations, we bravely applied them to the daily weather data at those stations to obtain daily data at the PRISM cells (for more detail, see the paper). Cross-validation suggested we weren't that far off, but then again, we could only run cross-validation tests in areas that have weather stations.
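For readers who want the gist of that two-step procedure, here is a minimal sketch. All names (prism_monthly, station_monthly, station_daily) are hypothetical, and the actual implementation in the paper differs in detail (station selection, missing-data handling, separate treatment of minimum and maximum temperatures).

```python
# Stylized sketch of the regression-based downscaling described above.
# Assumption: one PRISM cell, k nearby stations, complete monthly and daily series.
import numpy as np

def downscale_cell(prism_monthly, station_monthly, station_daily):
    """Regress a PRISM cell's monthly means on the closest stations' monthly
    means, then apply the fitted coefficients to the stations' daily series.

    prism_monthly   : (n_months,)     monthly means at the PRISM cell
    station_monthly : (n_months, k)   monthly means at the k closest stations
    station_daily   : (n_days, k)     daily values at the same k stations
    returns         : (n_days,)       reconstructed daily series at the cell
    """
    # Step 1: OLS with an intercept, prism_monthly ~ a + station_monthly @ b
    X = np.column_stack([np.ones(len(prism_monthly)), station_monthly])
    coef, *_ = np.linalg.lstsq(X, prism_monthly, rcond=None)

    # Step 2: apply the monthly relationship to the daily station data
    X_daily = np.column_stack([np.ones(len(station_daily)), station_daily])
    return X_daily @ coef
```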
Recently, the PRISM climate group made its daily data available from the 1980s onwards. I finally got a chance to download it and compare it to the daily data we had previously constructed from monthly averages. This was quite a nerve-wracking exercise: how far were we off, and does it change the results? Or, in the worst case, did I screw up the code and get garbage for our previous paper?
Below is a table that summarizes PRISM's daily data for the growing season (April-September) in all counties east of the 100th meridian, except Florida, that grow either corn or soybeans, basically the set of counties we used in our study (one small change: our study used 1980-2005, but since PRISM's daily data is only available from 1981 onwards, the tables below use 1981-2012). The summary statistics are:
First sigh of relief! The numbers look rather close (strangely enough, the biggest deviations seem to be for precipitation, even though we used PRISM's monthly aggregates to derive season totals and did not rely on any interpolation there, which suggests the new daily PRISM data are simply a bit different from the old monthly PRISM data). Also, recall from a recent post that looked at the NARR data that degree days above 29C can differ a lot between data sets, as small differences in the daily maximum temperature give vastly different results.
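To see why small shifts in the daily maximum matter so much, here is a sketch of a degree-day calculation above a threshold using the common single-sine approximation of the within-day temperature path. This is only an illustration; the exact within-day interpolation used in our paper and in PRISM-based exposure measures may differ.

```python
# Single-sine degree days above a threshold (Celsius), one day at a time.
import numpy as np

def degree_days_above(tmin, tmax, threshold=29.0):
    """Degree days above `threshold` for one day, single-sine approximation."""
    m = (tmax + tmin) / 2.0          # daily mean
    w = (tmax - tmin) / 2.0          # half the daily range
    if tmax <= threshold:            # temperature never crosses the threshold
        return 0.0
    if tmin >= threshold:            # entire day is above the threshold
        return m - threshold
    theta = np.arcsin((threshold - m) / w)
    return ((m - threshold) * (np.pi / 2 - theta) + w * np.cos(theta)) / np.pi

# A 1C difference in tmax nearly triples the exposure above 29C on this day:
print(degree_days_above(18.0, 30.0))   # roughly 0.12 degree days
print(degree_days_above(18.0, 31.0))   # roughly 0.34 degree days
```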
Next, I plugged both data sets into a panel of corn and soybean yields to see which one explains yields better (i) in sample and (ii) out of sample. I estimated models using only the temperature variables (columns a and b) as well as models using the same four weather variables we used before (columns c and d). PRISM's daily data are used in columns a and c, and our re-engineered data in columns b and d:
Second sigh of relief: it is rather close again. In all four comparisons, (1b) to (1a), (1d) to (1c), (2b) to (2a), and (2d) to (2c), our reconstruction for some strange reason has a larger in-sample R-squared. The reduction in RMSE is given in the second row of the footer: it is the reduction in out-of-sample prediction error relative to a model with no weather variables. I draw 1,000 random samples, each time using 80% of the data as the estimation sample and deriving the prediction error for the remaining 20%; the reported number is the average over the 1,000 draws. For the RMSE reductions, the picture is mixed: for the corn models that include only the two degree-day variables, PRISM's daily data does slightly better, while the reverse is true for soybeans. In models that also include precipitation, season-total precipitation constructed by adding up the monthly PRISM totals (columns d) does better than adding up the new daily PRISM precipitation (columns c).
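For concreteness, here is a hedged sketch of that out-of-sample exercise: repeatedly fit on a random 80% of observations, predict the held-out 20%, and report the average reduction in RMSE relative to a model without the weather variables. The inputs y (yields), X_weather, and X_base are assumptions standing in for the actual panel specification (which also includes fixed effects and trends).

```python
# Repeated 80/20 split; reports the average out-of-sample RMSE reduction.
import numpy as np

def rmse_reduction(y, X_weather, X_base, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    cut = int(0.8 * n)
    reductions = []
    for _ in range(n_draws):
        idx = rng.permutation(n)
        train, test = idx[:cut], idx[cut:]

        def oos_rmse(X):
            # Fit by OLS on the estimation sample, predict the holdout
            coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            resid = y[test] - X[test] @ coef
            return np.sqrt(np.mean(resid ** 2))

        reductions.append(1.0 - oos_rmse(X_weather) / oos_rmse(X_base))
    return np.mean(reductions)
```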
Finally, since the data we constructed is a knock-off, how can it do better than the original in some cases? My wild guess (and this is really only speculation) is that we took great care in filling in missing data for weather stations to get a balanced panel. That way we ensured that year-to-year fluctuations are not due to the fact that one averages over a different set of stations. I am not aware of how exactly PRISM deals with missing weather station data.
Can you resolve some confusion on my part with the numbers in your table?
You quote % reduction in out-of-sample RMSE from the various models. Take corn model (1c) for example. If it says that 1 - sqrt(MSE)/sqrt(MSE0) = 0.2956, where MSE is the mean square cross-validation error and MSE0 is the sample variance, then that implies a prediction skill of 1 - MSE/MSE0 = 0.5038. This number is higher than the R-squared (0.4519), which is clearly not plausible.
Thanks, Joe