As the Heisman Trophy voters showed again, it is super easy
to overfit a model. Sure, the SEC is good at playing football. But that doesn’t
mean that the best player in their league is *always* the best player in the
country. This year I don’t think it was even close.
At the same time, there are still plenty of examples of
overfitting in the scientific literature. Even as datasets become larger,
overfitting is still easy to do, since models often have more parameters than they used to. Most responsible
modelers are pretty careful about presenting out-of-sample errors, but even
those can be misleading when cross-validation is used to select
models rather than just to estimate errors.
I recently saw a talk here by Trevor
Hastie, a colleague in statistics at Stanford, presenting a technique
he and Brad Efron have recently started using that seems more immune to
overfitting. They call it spraygun, which doesn’t strike me as a very intuitive
name. But who am I to question two of the giants of statistics?
Anyhow, a summary figure he
presented is below. The x-axis shows the degree of model variance or
overfitting, with high values on the left-hand side, and the y-axis shows the
error on a test dataset. In this case they’re trying to predict beer ratings
from over 1M samples. (Statistics students will know that beer has always played an
important role in statistics, going back to the origin of the “t-test”.) The light red
dots show the out-of-sample error for a traditional lasso model fit to the full
training data. The dark red dots show models fit to subsets of the data, which unsurprisingly
tend to overfit sooner and have worse overall performance. But what’s
interesting is that the average of the predictions from these overfit models does
nearly as well as the model fit to the full data, right up until the tuning parameter
is turned up enough that the full model itself overfits. Beyond that point, the average
of the models fit to subsets of the data continues to perform well, with no
notable increase in out-of-sample error. This means one doesn’t have to be too
careful about optimizing the calibration stage. Instead, just (over)fit a bunch of
models and take the average.
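
To make the idea concrete, here is a minimal sketch in Python with scikit-learn, on simulated data rather than the beer ratings, and not Trevor and Brad’s actual spraygun code: fit one lasso to the full training set, fit a handful of lassos to random subsets, and compare the test error of the single fit against the average of the subset fits across a range of tuning parameters. All names and settings here are my own, purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

# Simulated data standing in for the beer-ratings example (purely illustrative).
X, y = make_regression(n_samples=5000, n_features=200, n_informative=20,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_subsets, subset_frac = 20, 0.2
alphas = np.logspace(1, -3, 10)  # strong -> weak regularization (more overfitting)

for alpha in alphas:
    # One lasso fit to the full training data (analogous to the light red dots).
    full_pred = Lasso(alpha=alpha, max_iter=5000).fit(X_train, y_train).predict(X_test)

    # Lasso fits to random subsets of the training data (analogous to the dark
    # red dots), whose predictions we then average.
    subset_preds = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X_train), int(subset_frac * len(X_train)), replace=False)
        fit = Lasso(alpha=alpha, max_iter=5000).fit(X_train[idx], y_train[idx])
        subset_preds.append(fit.predict(X_test))
    avg_pred = np.mean(subset_preds, axis=0)

    print(f"alpha={alpha:8.4f}  full-data RMSE={rmse(full_pred, y_test):7.2f}  "
          f"subset-average RMSE={rmse(avg_pred, y_test):7.2f}")
```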
This obviously relates to the superior
performance of ensembles of process-based models, as I discussed in a previous
post about crop models. Even if individual models aren't very good, because they are overfit to their training data or for other reasons, the average model tends to be quite good. But in the world of empirical models, maybe we have
also been too guilty of trying to find the ‘best’ model for a given
application. That may make sense if one is really interested in the
coefficients of the model, for instance if you are obsessed with the question
of causality. But often our interest in models, even when we care about identifying
causality, comes down to wanting good out-of-sample prediction. And for
causality, it is still possible to look at the distribution of parameter estimates
across the individual models.
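
Continuing the sketch above (it reuses `X_train`, `y_train`, `rng`, and the other settings defined there, and is again just illustrative), one way to do that is to collect the coefficients from the subset fits and look at their spread and how often each one gets selected, rather than trusting a single ‘best’ model:

```python
# Examine the spread of coefficients across the subset fits at a single,
# arbitrarily chosen value of the tuning parameter.
alpha = 0.1
coefs = []
for _ in range(n_subsets):
    idx = rng.choice(len(X_train), int(subset_frac * len(X_train)), replace=False)
    coefs.append(Lasso(alpha=alpha, max_iter=5000).fit(X_train[idx], y_train[idx]).coef_)
coefs = np.array(coefs)  # shape: (n_subsets, n_features)

# Mean, spread, and selection frequency (how often the lasso keeps the
# coefficient nonzero) for the largest effects -- a rough picture of which
# parameter estimates are stable across the ensemble.
mean_coef = coefs.mean(axis=0)
for j in np.argsort(-np.abs(mean_coef))[:5]:
    print(f"feature {j:3d}: mean={mean_coef[j]:8.2f}  sd={coefs[:, j].std():6.2f}  "
          f"selected in {(np.abs(coefs[:, j]) > 1e-8).mean():.0%} of fits")
```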
Hopefully in a future post one
of us can test this kind of approach on models we’ve discussed here in the
past. For now, I just thought it was worth calling attention to. Chances are
that when Trevor or Brad have a new technique, it’s worth paying attention to. Just like it’s worth paying attention to states west of Alabama if you want to
see the best college football player in the country.