I couldn’t bear to watch the clock wind down on October
without a single post this month on G-FEED. So here goes a shot at the buzzer…

A question I get asked or emailed fairly often by students is
whether they should use a linear or quadratic model when relating crop yields
to monthly or growing season average temperatures. This question comes from all
around the world so I guess it’s a good topic for a post, especially since I
rarely get the time to email them back quickly. If you are mainly interested in posts about
broad issues and not technical nerdy topics, you can stop reading now.

The short answer is you can get by with a linear model if you
are looking over a small range of temperatures, such as year to year swings at
one location. But if you are looking across a bigger range, such as across
multiple places, you should almost surely use something that allows for
nonlinearity (e.g., an optimum temperature somewhere in the middle of the data).

There are issues that arise if using a quadratic model that
includes fixed effects for location, a topic which Wolfram wrote about years ago with Craig McIntosh. Essentially this re-introduces the site mean into the
estimation of model coefficients, which creates problem of interpretation
related to a standard panel model with fixed effects.

A bigger issue that this question points to, though, is the
assumption by many that the choice is simply between linear and quadratic. Both
are useful simple approaches to use, especially if data is scarce. But most
datasets we work with these days allow much more flexibility in functional
form. One clear direction that people have gone is to go to sub-monthly or even
sub-daily measures and use flexible spline or degree day models to compute aggregate
measures of temperature exposure throughout the season, then use those predictors
in the regression. I won’t say much
about that here, except that it makes a good deal of sense and people who like
those approaches should really blog more often.

Another approach, though, is to use more flexible functions
with the monthly or seasonal data itself. This can be useful in cases where we
have lots of monthly data, but not much daily data, or where we simply want
something that is faster and easier than using daily data. One of my favorite
methods of all time are multivariate adaptive regression splines, also called MARS. This was developed by Jerome Friedman at Stanford about 20 years ago (and
I took a class from him about 10 years ago). This approach is like a three-for-one,
in that it allows for nonlinearities, can capture asymmetries, and is a fairly
good approach to variable selection. The latter is helpful in cases where you
have more months than you think are really important for crop yields.

The basic building block of MARS is the hinge function,
which is essentially a piecewise linear function that is zero on one side of
the hinge, and linearly increasing on the other side. Two examples are shown
below, taken from the wikipedia entry on MARS.

MARS works by repeatedly trying different hinge functions,
and adds whichever one gives the maximum reduction in the sum of squared errors.
As it adds hinge functions, you can have it added by itself or have it multiply
an existing hinge in the model, which allows for interactions (I guess that
makes it a four-for-one method). Despite searching all possible hinge functions
(which covers all variables and hinges at all observed values of each
variable), it is a fairly fast algorithm. And like most data mining techniques,
there is some back-pruning at the end so it isn’t too prone to overfitting.

For a long time I liked MARS but couldn’t figure out how to
apply it to data where you want to include spatial fixed effects to account for
omitted variables. Unlike models with pre-determined predictors, such as
monthly average temperature or squared temperature, MARS has to search for the
right predictors. And before you know what the predictors are, you can’t substract
out the site-level means as you would in a fixed-effect model. So you can’t
know what the predictors are until you search, but you can’t search if you can’t
compute the error of the model correctly (because you haven’t included fixed-effects.)

One semi-obvious solution would be to just ignore
fixed-effects, find the hinge-function predictors, and then rerun the model
with the selected predictors but including fixed effects. That seems ok but all
the problems of omitted variables would still be affecting the selection
process.

Recently, I settled on a different idea – first use a crop
simulation model to develop a pseudo-dataset for a given crop/region, then run
MARS on this simulated data (where omitted variables aren’t an issue) to find
the predictors, and then use those predictors on an actual dataset, but
including fixed effects to account for potential omitted variables.

I haven’t had much time to explore this, but here’s an
initial attempt. First, I used some APSIM simulations for a site in Iowa that we
ran for a recent paper on U.S. maize. Running MARS on this, allowing either
monthly or seasonal average variables to enter the model, results in just four variables
that are able to explain nearly 90% of the yield variation across years. Notice
the response functions (below) show the steepest sensitivity for July Tmax,
which makes sense. Also, rainfall is important but only up to about 450mm over
the May-August period. In both cases, you can see how the results are
definitely not linear and not symmetric. And it is a little surprising that
only four variables can capture so much of the simulated variation, especially
since they all contain no information at the sub-monthly time scale.

Of course this approach relies on assuming the crop model is
a reasonable representation of reality. But recall we aren’t using the crop
model to actually define the coefficients, just to define the predictors we
will use. The next step is to then compute these predictors for actual data
across the region, and see how well it works at predicting crop yields. I
actually did that a few months ago but can’t find the results right now, and am
heading off to teach. I’ll save that for a post in the near future (i.e. before 2015).