Thursday, October 31, 2013

Taking crop analysis to MARS

I couldn’t bear to watch the clock wind down on October without a single post this month on G-FEED. So here goes a shot at the buzzer…

A question I get asked or emailed fairly often by students is whether they should use a linear or quadratic model when relating crop yields to monthly or growing season average temperatures. This question comes from all around the world so I guess it’s a good topic for a post, especially since I rarely get the time to email them back quickly.  If you are mainly interested in posts about broad issues and not technical nerdy topics, you can stop reading now.

The short answer is you can get by with a linear model if you are looking over a small range of temperatures, such as year to year swings at one location. But if you are looking across a bigger range, such as across multiple places, you should almost surely use something that allows for nonlinearity (e.g., an optimum temperature somewhere in the middle of the data).

There are issues that arise if using a quadratic model that includes fixed effects for location, a topic which Wolfram wrote about years ago with Craig McIntosh. Essentially this re-introduces the site mean into the estimation of model coefficients, which creates problem of interpretation related to a standard panel model with fixed effects.

A bigger issue that this question points to, though, is the assumption by many that the choice is simply between linear and quadratic. Both are useful simple approaches to use, especially if data is scarce. But most datasets we work with these days allow much more flexibility in functional form. One clear direction that people have gone is to go to sub-monthly or even sub-daily measures and use flexible spline or degree day models to compute aggregate measures of temperature exposure throughout the season, then use those predictors in the regression.  I won’t say much about that here, except that it makes a good deal of sense and people who like those approaches should really blog more often.

Another approach, though, is to use more flexible functions with the monthly or seasonal data itself. This can be useful in cases where we have lots of monthly data, but not much daily data, or where we simply want something that is faster and easier than using daily data. One of my favorite methods of all time are multivariate adaptive regression splines, also called MARS. This was developed by Jerome Friedman at Stanford about 20 years ago (and I took a class from him about 10 years ago). This approach is like a three-for-one, in that it allows for nonlinearities, can capture asymmetries, and is a fairly good approach to variable selection. The latter is helpful in cases where you have more months than you think are really important for crop yields.

The basic building block of MARS is the hinge function, which is essentially a piecewise linear function that is zero on one side of the hinge, and linearly increasing on the other side. Two examples are shown below, taken from the wikipedia entry on MARS.


MARS works by repeatedly trying different hinge functions, and adds whichever one gives the maximum reduction in the sum of squared errors. As it adds hinge functions, you can have it added by itself or have it multiply an existing hinge in the model, which allows for interactions (I guess that makes it a four-for-one method). Despite searching all possible hinge functions (which covers all variables and hinges at all observed values of each variable), it is a fairly fast algorithm. And like most data mining techniques, there is some back-pruning at the end so it isn’t too prone to overfitting.

For a long time I liked MARS but couldn’t figure out how to apply it to data where you want to include spatial fixed effects to account for omitted variables. Unlike models with pre-determined predictors, such as monthly average temperature or squared temperature, MARS has to search for the right predictors. And before you know what the predictors are, you can’t substract out the site-level means as you would in a fixed-effect model. So you can’t know what the predictors are until you search, but you can’t search if you can’t compute the error of the model correctly (because you haven’t included fixed-effects.)

One semi-obvious solution would be to just ignore fixed-effects, find the hinge-function predictors, and then rerun the model with the selected predictors but including fixed effects. That seems ok but all the problems of omitted variables would still be affecting the selection process.

Recently, I settled on a different idea – first use a crop simulation model to develop a pseudo-dataset for a given crop/region, then run MARS on this simulated data (where omitted variables aren’t an issue) to find the predictors, and then use those predictors on an actual dataset, but including fixed effects to account for potential omitted variables.

I haven’t had much time to explore this, but here’s an initial attempt. First, I used some APSIM simulations for a site in Iowa that we ran for a recent paper on U.S. maize. Running MARS on this, allowing either monthly or seasonal average variables to enter the model, results in just four variables that are able to explain nearly 90% of the yield variation across years. Notice the response functions (below) show the steepest sensitivity for July Tmax, which makes sense. Also, rainfall is important but only up to about 450mm over the May-August period. In both cases, you can see how the results are definitely not linear and not symmetric. And it is a little surprising that only four variables can capture so much of the simulated variation, especially since they all contain no information at the sub-monthly time scale.


Of course this approach relies on assuming the crop model is a reasonable representation of reality. But recall we aren’t using the crop model to actually define the coefficients, just to define the predictors we will use. The next step is to then compute these predictors for actual data across the region, and see how well it works at predicting crop yields. I actually did that a few months ago but can’t find the results right now, and am heading off to teach. I’ll save that for a post in the near future (i.e. before 2015).