As the Heisman Trophy voters showed again, it is super easy to overfit a model. Sure, the SEC is good at playing football. But that doesn’t mean that the best player in their league is *always* the best player in the country. This year I don’t think it was even close.
At the same time, there are still plenty of examples of overfitting in the scientific literature. Even as datasets become larger, overfitting remains easy to do, since models often have more parameters than they used to. Most responsible modelers are pretty careful about presenting out-of-sample errors, but even those can be misleading when cross-validation is used to select models rather than just to estimate errors.
Recently I saw a talk here by Trevor Hastie, a colleague in statistics at Stanford, in which he presented a technique he and Brad Efron have recently started using that seems more immune to overfitting. They call it spraygun, which doesn’t seem a very intuitive name to me. But who am I to question two of the giants of statistics?
Anyhow, a summary figure he presented is below. The x-axis shows the degree of model variance or overfitting, with high values on the left-hand side, and the y-axis shows the error on a test dataset. In this case they’re trying to predict beer ratings from over 1M samples. (Statistics students will know that beer has always played an important role in statistics, going back to the origin of the “t-test”.) The light red dots show the out-of-sample error for a traditional lasso model fit to the full training data. The dark red dots show models fit to subsets of the data, which unsurprisingly tend to overfit sooner and have worse overall performance. But what’s interesting is that the average of the predictions from these overfit models performs nearly as well as the model fit to the full data, until the tuning parameter is turned up enough that the full model overfits. At that point, the average of the models fit to subsets of the data continues to perform well, with no notable increase in out-of-sample error. This means one doesn’t have to be too careful about optimizing the calibration stage. Instead, just (over)fit a bunch of models and take the average.
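The basic idea is easy to try yourself. Here is a minimal sketch (not the actual spraygun code, and on synthetic data standing in for the beer ratings): fit a lasso with deliberately weak regularization on random subsets of the training data, average the predictions, and compare to a single lasso fit on the full training set.

```python
# Hedged sketch of "fit many models on subsets, average the predictions".
# All data and parameter choices here are illustrative, not from the talk.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic regression problem: 5 real signals among 50 predictors
n, p = 2000, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]
y = X @ beta + rng.normal(scale=2.0, size=n)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

alpha = 0.01          # weak regularization, i.e. high-variance fits
n_models, subset = 20, 300

# One lasso fit to all the training data
full = Lasso(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)

# Many lassos fit to random subsets; average their test predictions
preds = []
for _ in range(n_models):
    idx = rng.choice(len(X_tr), size=subset, replace=False)
    m = Lasso(alpha=alpha, max_iter=10000).fit(X_tr[idx], y_tr[idx])
    preds.append(m.predict(X_te))
avg_pred = np.mean(preds, axis=0)

print("full-data lasso MSE: ", mean_squared_error(y_te, full.predict(X_te)))
print("averaged-subsets MSE:", mean_squared_error(y_te, avg_pred))
```

Each subset model is noisier than the full fit, but averaging washes much of that noise out, which is the same reason bagging works.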
This obviously relates to the superior performance of ensembles of process-based models, as I discussed in a previous post about crop models. Even if individual models aren't very good, because they are overfit to their training data or for other reasons, the average model tends to be quite good. But in the world of empirical models, maybe we have also been too guilty of trying to find the ‘best’ model for a given application. That makes sense if one is really interested in the coefficients of the model, for instance if you are focused on questions of causality. But often our interest in models, and even in identifying causality, is simply that we want good out-of-sample prediction. And for causality, it is still possible to look at the distribution of parameter estimates across the individual models.
Hopefully in a future post one of us can test this kind of approach on models we’ve discussed here in the past. For now, I just thought it was worth calling attention to. Chances are that when Trevor or Brad have a new technique, it’s worth paying attention to. Just like it’s worth paying attention to states west of Alabama if you want to see the best college football player in the country.