## Friday, November 8, 2013

### More fun with MARS

(But not as much fun as watching Stanford dominate Oregon last night).

In a recent post I discussed the potential of multivariate adaptive regression splines (MARS) for crop analysis, particularly because they offer a simple way of dealing with asymmetric and nonlinear relationships. Here I continue from where I left off, so see the previous post first if you haven’t already.

Using the APSIM simulations (for a single site) to train MARS resulted in the selection of four variables. One of them was related to radiation, which we don’t have good data on, so here I will just take the other three, which were related to July Tmax, May-August Tmax, and May-August precipitation. Now, the key point is that we are not using those variables as the predictors themselves, but instead using hinge functions based on them. The figure below shows specifically what thresholds I am using (based on the MARS results from the previous post) to define the basis hinge functions.
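For readers new to hinge functions, here is a minimal sketch of how one is computed. The knot value of 30 and the sample temperatures are made-up placeholders for illustration, not the thresholds from the figure.

```python
import numpy as np

def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, direction * (x - knot)).

    direction=+1 is zero below the knot and rises linearly above it;
    direction=-1 is zero above the knot and rises linearly below it.
    """
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, direction * (x - knot))

# Hypothetical July Tmax values (deg C) and a hypothetical knot at 30.
july_tmax = np.array([26.0, 29.0, 31.0, 34.0])
above = hinge(july_tmax, knot=30.0)                # responds only to hot months
below = hinge(july_tmax, knot=30.0, direction=-1)  # responds only to cool months
```

Because each hinge is flat on one side of its knot, a regression on hinge terms can give warming above a threshold a different slope than cooling below it, which is the asymmetry at issue here.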

I then compute these predictor values for each county-year observation in a panel dataset of US corn yields, then subtract county means from all variables (equivalent to introducing county fixed effects), and fit three different regression models:
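The county demeaning step can be sketched as follows. This is an illustrative version with made-up data and a hypothetical helper name (`demean_by_group`), not the code used for the results here.

```python
import numpy as np

def demean_by_group(values, groups):
    """Subtract each group's mean from its observations.

    Applying this to the response and to every predictor, then running OLS,
    gives the same slope estimates as including one dummy per group
    (i.e. county fixed effects).
    """
    values = np.asarray(values, dtype=float)
    # idx maps every observation to the integer index of its group
    _, idx = np.unique(np.asarray(groups), return_inverse=True)
    group_means = np.bincount(idx, weights=values) / np.bincount(idx)
    return values - group_means[idx]

# Toy panel: two counties with different average log yields.
county = ["A", "A", "B", "B", "B"]
log_yield = [1.0, 3.0, 10.0, 11.0, 12.0]
demeaned = demean_by_group(log_yield, county)
```

After demeaning, the large level difference between counties A and B is gone and only within-county variation remains.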

Model 1: Just quadratic year trends (log(Yield) ~ year + year^2). This serves as a reference “no-weather” model.
Model 2: log(Yield) ~ year + year^2 + GDD + EDD + prec + prec^2. This model adds the predictors we normally use based on Wolfram and Mike’s 2009 paper, with GDD and EDD meaning growing degree days between 8 and 29 °C and extreme degree days (above 29 °C). Note these measures rely on daily Tmin and Tmax data to compute the degree days.
Model 3: log(Yield) ~ year + year^2 + the three predictors shown in the figure above. Note these are based only on monthly average Tmax or total precipitation.
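To make the formulas concrete, here is a rough sketch of how Models 1 and 3 translate into design matrices for ordinary least squares. Everything in it is synthetic: the data, the coefficients used to generate it, and the three columns of `h` standing in for the hinge predictors are assumptions for illustration only.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ 1 + X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(X, y, coef):
    """Root mean square error of the fitted values."""
    A = np.column_stack([np.ones(len(y)), X])
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))

# Synthetic stand-in data (not real yields).
rng = np.random.default_rng(0)
n = 500
year = rng.integers(0, 30, n).astype(float)
h = rng.gamma(2.0, 1.0, size=(n, 3))      # stand-ins for the three hinge terms
made_up_b = np.array([-0.07, -0.005, -0.06])
log_yield = (4.0 + 0.02 * year - 0.0002 * year**2
             + h @ made_up_b + rng.normal(0.0, 0.05, n))

X1 = np.column_stack([year, year**2])      # Model 1: trend only
X3 = np.column_stack([year, year**2, h])   # Model 3: trend + hinge terms
coef1, coef3 = fit_ols(X1, log_yield), fit_ols(X3, log_yield)
rmse1, rmse3 = rmse(X1, log_yield, coef1), rmse(X3, log_yield, coef3)
```

Model 2 would be built the same way, with GDD, EDD, prec, and prec^2 columns in place of `h`.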

The table below shows the calibration error as well as the mean out-of-sample error for each model. What’s interesting here is that the model with the three hinge functions performs just as well as (actually even a little better than) the one based on degree day calculations. This is particularly surprising since the hinge functions (1) use only monthly data and (2) were derived from simulations at a single site in Iowa. Apparently they are representative enough to result in a pretty good model for the entire rainfed Corn Belt.
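One common way to get a mean out-of-sample error is repeated random holdout, sketched below. The 80/20 split, the purely random sampling of observations, and the function names are all assumptions; this post does not spell out how the 500 runs were drawn.

```python
import numpy as np

def mean_holdout_rmse(X, y, n_runs=500, test_frac=0.2, seed=0):
    """Average test RMSE over repeated random train/test splits (OLS fit)."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([np.ones(len(y)), X])   # add intercept column
    n_test = max(1, int(round(test_frac * len(y))))
    errors = []
    for _ in range(n_runs):
        perm = rng.permutation(len(y))
        test, train = perm[:n_test], perm[n_test:]
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        resid = y[test] - A[test] @ coef
        errors.append(np.sqrt(np.mean(resid**2)))
    return float(np.mean(errors))

def pct_reduction(baseline_err, model_err):
    """Percent reduction in error relative to a baseline model."""
    return 100.0 * (baseline_err - model_err) / baseline_err

# Synthetic example: the response depends on a hinge of x.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 1.0 - 0.5 * np.maximum(0.0, x) + rng.normal(0.0, 0.1, 300)
err_base = mean_holdout_rmse(np.empty((300, 0)), y, n_runs=50)   # intercept only
err_hinge = mean_holdout_rmse(np.maximum(0.0, x)[:, None], y, n_runs=50)
gain = pct_reduction(err_base, err_hinge)
```

With panel data, one might prefer to hold out whole years or whole counties rather than individual observations, since errors within a year tend to be correlated.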

| Model | Calibration R² | Average root mean square error for calibration | Average root mean square error for out-of-sample data (for 500 runs) | % reduction in out-of-sample error |
|---|---|---|---|---|
| 1 | 0.59 | 0.270 | 0.285 | -- |
| 2 | 0.66 | 0.241 | 0.259 | 8.9 |
| 3* | 0.68 | 0.235 | 0.254 | 10.7 |

*For those interested, the coefficients on the three hinge terms are -0.074, -0.0052, and -0.061, respectively.
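Since the response is log(Yield), each coefficient can be read as an approximate proportional change in yield per one-unit increase in its hinge term. A small sketch of that conversion, assuming natural logs; the labels `hinge_1` to `hinge_3` are placeholders that simply follow the order given above.

```python
import math

# Coefficients as reported above; the key names are placeholders.
coefs = {"hinge_1": -0.074, "hinge_2": -0.0052, "hinge_3": -0.061}

def pct_yield_change(coef, delta=1.0):
    """Percent change in yield when a predictor rises by `delta` units,
    for a model of the form log(Yield) = ... + coef * predictor."""
    return 100.0 * (math.exp(coef * delta) - 1.0)

effects = {name: pct_yield_change(b) for name, b in coefs.items()}
for name, pct in effects.items():
    print(f"{name}: {pct:.2f}% yield change per unit")
```

For small coefficients the result is close to 100 times the coefficient itself, so the first term corresponds to roughly a 7% yield loss per unit of that hinge.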

The take-home here for me is that even a few predictors based on monthly data can tell you a lot about crop yields, BUT it’s important to account for asymmetries. Hinge functions let you do that, and process-based crop models can help identify the right hinge functions (although there are probably other ways to do that too).

So I think this is overall a promising approach – one could use selected crop model simulations from around the world, such as those out of AgMIP, to identify hinge functions for different cropping systems, and then use these to build robust and simple empirical models for actual yields. Alas, I probably won’t have time to develop it much in the foreseeable future, but hopefully this post will inspire something.