Monday, May 11, 2015

Introducing SCYM

If we hope to figure out how to raise crop yields or farmer incomes around the world, it would be really nice to have a quick and accurate way of actually measuring yields for individual fields. That need has motivated a lot of work over the years on using satellite data, and we have a paper out this week describing another step in that direction.

As I see it, there are three main ingredients needed for yield remote sensing to be successful on a meaningful scale. The first is the raw data. As Marshall’s recent post explained, there are several new satellite data providers that are really transforming our ability to track individual fields, even in smallholder areas.

Second is the ability to process the data at scale. Five years ago, for example, I would have had to hire a research assistant to download imagery, make sure it was geometrically and radiometrically calibrated (i.e. properly lined up and in meaningful units), and then apply whatever algorithms we had. That just didn’t scale very well, in terms of labor, on-site data storage, or processing. When a collaborator asked “could you produce yield estimates for my study area,” I would have to think about how many weeks or months of work that would entail. But a couple of years ago I was introduced to Google’s Earth Engine, which is “a planetary-scale platform for environmental data & analysis.” In practical terms, that means they have a lot of geospatial data (including all historical Landsat imagery), a lot of built-in algorithms for processing, and an interface to run your own code on the data and visualize or save the output. Part of why it works is that data providers, like the USGS for Landsat, have gotten better at providing well-calibrated data. Earth Engine is very cool, and the more I’ve worked with it, the more I can see how it transforms our ability to extract value from data that has already been collected.

Third, and arguably the rate-limiting step nowadays, is to have algorithms that can translate satellite data into accurate yield estimates. It’s easy enough to do this if you have lots of ground data to calibrate to for a particular site, but that’s generally not scalable (unless people get clever about crowdsourcing ground “truth”). What seemed to be lacking was a very generic, scalable algorithm. So in the last 8 months or so we’ve been working to develop and test one idea about how to do this. I’m calling it a scalable satellite-based crop yield mapper (SCYM, pronounced “skim”), and a description of it has just been published in Remote Sensing of Environment. Conveniently, SCYM also stands for Steph Curry’s Your MVP.

The basic idea is that if you don’t have lots of ground data to calibrate a model, why not generate lots of fake ground data? Then for whatever combination of observations you actually have (say, for instance, satellite images on 2 or 3 specific days, and measures of daily weather), you can look into your fake data to find the best-fit model for predicting the desired variable (“yield”) from the measured predictors. The paper provides more detail, which I won’t bore readers with here. But to give a sense of the type of output, the animation below shows our maize yield estimates over part of Iowa for 2008-2013. Red indicates high yields, blue low.
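For readers who like to see an idea as code, here is a minimal sketch of the "fake ground data" approach in Python. It is not the implementation from the paper: the toy simulator, the bell-shaped vegetation index curve, the rainfall covariate, and every function name and coefficient are assumptions invented purely for illustration. The point is only the workflow: simulate many seasons, fit a regression using just the dates you actually have imagery for, then apply that regression to real observations.

```python
import numpy as np

def simulate_fake_seasons(n_sims, rng):
    """Toy stand-in for crop model runs (hypothetical, not a real crop model).

    Returns days of year, daily vegetation index (VI) curves, a seasonal
    rainfall covariate, and a simulated yield for each fake site-year.
    """
    days = np.arange(120, 280)                    # day of year
    peak_day = rng.uniform(190, 220, n_sims)      # when the canopy peaks
    peak_vi = rng.uniform(0.4, 0.9, n_sims)       # how green it gets
    width = rng.uniform(25, 40, n_sims)           # how long it stays green
    rain = rng.uniform(100, 500, n_sims)          # seasonal rainfall, mm
    z = (days[None, :] - peak_day[:, None]) / width[:, None]
    vi = peak_vi[:, None] * np.exp(-0.5 * z ** 2)
    # Made-up yield response: greener canopy and more rain give higher yield.
    yields = 2.0 + 12.0 * peak_vi + 0.004 * rain + rng.normal(0, 0.5, n_sims)
    return days, vi, rain, yields

def fit_for_dates(days, vi, rain, yields, obs_days):
    """Fit yield ~ VI on the available image dates + rainfall, on fake data."""
    cols = [int(np.searchsorted(days, d)) for d in obs_days]
    X = np.column_stack([np.ones(len(yields)), vi[:, cols], rain])
    coef, *_ = np.linalg.lstsq(X, yields, rcond=None)
    return coef

def predict(coef, vi_obs, rain_obs):
    """Apply the fitted regression to real observations on the same dates."""
    X = np.column_stack([np.ones(len(rain_obs)), vi_obs, rain_obs])
    return X @ coef

rng = np.random.default_rng(0)
days, vi, rain, yields = simulate_fake_seasons(5000, rng)

# Suppose cloud-free images exist only on days 185 and 225 this year.
coef = fit_for_dates(days, vi, rain, yields, obs_days=[185, 225])

# Two hypothetical pixels: VI on those two dates, plus seasonal rainfall.
vi_obs = np.array([[0.60, 0.55], [0.30, 0.35]])
rain_obs = np.array([300.0, 200.0])
pred = predict(coef, vi_obs, rain_obs)
```

Because the regression is refit for whatever image dates happen to be available, the same fake dataset can serve a different set of dates in another year or region without any new ground data.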


The figure below shows a comparison between these and ground “truth” estimates for maize, which we take from the dataset described in a previous post.




The cool thing about this is that it’s quite generic. To illustrate that, we reran the model for soybeans, with results nearly as good as for maize.


Hopefully this type of thing will help make faster progress on understanding yields and farm productivity, and on figuring out what actually works for improving them. One general lesson out of this for me is that sometimes making something really scalable requires scrapping an old approach. We had previously been running crop models for specific sites and years, but that wasn't possible within the Earth Engine system. I think SCYM (which trains a regression using simulations over lots of sites and years) is more robust than what we had, and along with the new satellite data and Earth Engine-type systems, it might just provide a way to do yield mapping at scale.