Tuesday, August 23, 2016

Risk Aversion in Science

Marshall posted last week about our poverty mapping paper in Science. One thing a reporter asked me the other day was about the origins of the project, and it got me thinking about an issue that’s been on my mind for a while – how does innovation in science happen, and how sub-optimal is the current system for funding science?

First the brief back story: Stefano started as an assistant professor in Computer Science a couple of years back, and reached out to me to introduce himself and discuss potential areas of mutual interest. I had recently been talking with Marshall about the GiveDirectly approach to finding poor people, and we had been wondering if there was (a) a better (automated) way to find thatched vs. metal roofs in satellite imagery and (b) some way to verify if roof type is actually a decent predictor of poverty. So we chatted about this once or twice, and then some students working with Stefano tried to tackle the roof problem. That didn’t work too well, but then someone (Stefano, I think, but I can’t really recall) suggested we try to train a neural network to predict poverty directly without focusing on roofs. Then we talked about how poverty data was pretty scarce, how maybe we could use something like night lights instead, and yada yada yada, now we have the paper.

Now the more general issue. I think most people outside of science, and even many people starting out as scientists, perceive the process as something like this: you apply for funding, you get it, you or a student in your group does the work, you publish it. That sounds reasonable, but in my experience that hasn’t been how it works for the papers I’m most excited about. The example of this paper is more like: you meet someone and chat about an idea, you do some initial work that fails, you keep talking and trying, you apply for funding and get rejected (in this case, twice by NSF and a few times by foundations), you keep doing the work anyway because you think it’s worthwhile, and eventually you make a breakthrough and publish a paper.

In this telling, the progress happens despite the funding incentives, not because of them. Luckily at Stanford we have pretty generous start-up packages and internal funding opportunities that enable higher risk research, and we are encouraged by our departments and the general Silicon Valley culture to take risks. And we have unbelievably good students, many of whom are partially or fully funded or very cheap (a key contributor on the Science paper was an undergrad!). But that only slightly lessens the frustration of having proposals to federal agencies rejected because (I’m paraphrasing the last 10+ proposal reviews I’ve gotten) “it would be cool if it worked but I don’t think it’s likely to work.” If I weren’t at Stanford, I probably would have long ago stopped submitting risky ideas, or gotten out of academia altogether.

I know this is a frustration shared by many colleagues, and also that there’s been a fair number of academic studies on incentives and innovation in science. One of the most interesting studies I’ve read is this one, about the effects of receiving a Howard Hughes Medical Institute (HHMI) investigator grant on creativity and productivity. The study isn’t all that new, but definitely a worthwhile read. For those not familiar with the HHMI, it is a fairly substantial amount of funding given to a person, rather than for a specific project, with a longer time horizon than most awards. It’s specifically designed to foster risk taking and transformational work.

The article finds a very large effect of getting an HHMI on productivity, particularly in output of “top hit” publications. Interestingly, it also finds an increase in “flops”, meaning papers that get cited much less than typical for the investigator (controlling for their pre-award performance). This is consistent with the idea that the awardees are taking more risks, with both more home runs and more strike outs. Also consistent is the fact that productivity drops in the few years after getting an award, presumably because people start to pursue new directions. Even more interesting to me was the effect of getting an HHMI on applications to NIH. First, the number of applications goes way down, presumably because recipients spend less time seeking funds and more time actually doing science. Second, the average ratings for their proposals get worse (!), consistent with the idea that federal funds are biased against risky ideas.

Unfortunately, there aren’t any studies I can find on the “people not project” types of awards in other areas of science. Personally, I know my NASA new investigator program award was instrumental in freeing me up to explore ideas as a young faculty member. I never received an NSF Career award (rejected 3 times because – you guessed it – the reviewers weren’t convinced the idea would work), but that would be a similar type of thing. I’d like to see a lot more empirical work in this area. There’s some work on awards, like in this paper, but awards generally bring attention and prestige, not actual research funds, and they apply to a fairly small fraction of scientists.

I’d also like to see some experiments set up, where people are randomly given biggish grants (i.e. enough to support multiple students for multiple years) and then tracked over time. Then we can test a few hypotheses I have, such as:
  1. Scientists spend way too much time writing and reviewing proposals. An optimal system would limit all proposals to five pages, and give money in larger chunks to promote bigger ideas. 
  2. There is little or maybe even zero need to include feasibility as a criterion in evaluating proposals for specific projects. More emphasis should be placed on whether the project will have a big positive impact if it succeeds. Scientists already have enough incentive to make sure they don’t pursue dead ends for too long, since their work will not get published. Trying to eliminate failure, or even trying hard to reduce failure rates, based on a panel of experts is counterproductive. (It may be true that panel ratings are predictive of future project impact, but I think that comes from identifying high potential impact rather than correctly predicting the chance of failure.)
  3. People who receive HHMI-like grants are more likely to ponder and then pursue bigger and riskier ideas. This will result in more failure and more big successes, with an average return that is much higher than a lot of little successes. (For me, getting the MacArthur award was, more than anything, a challenge to think about bigger goals. I try to explain this to Marshall when he constantly reminds me that people’s productivity declines after getting awards. I also don’t think he’s read the paper to know it’s only a temporary decline. Temporary!)
  4. Aversion to risk and failure is especially high for people who do not have experience as researchers, and thus don’t appreciate the need to fail on the way to innovation. One prediction here is that panels or program managers with more successful research histories will tend to pick more high impact projects.

I’m sure some of the above are wrong, but I’m not sure which ones. If anyone has answers, please let me know. It’s an area I’m mostly ignorant of but would like to learn more about. I’d apply for some funding to study it, but it’d probably be rejected. I’d rather waste my time blogging than writing more proposals.



One final thought. On several occasions I have been asked by foundations or other donors what would be a good “niche” investment in topics around sustainability. I think they often want to know what specific topics, or what combination of disciplines, are most ripe for more funding. But my answer is typically that I don’t know enough about every topic possible to pick winners. Better to do something like HHMI for our field, i.e. encourage big thinking and risk taking among people that have good track records or indicators of promise. But that requires a tolerance for failure, and even foundations in the midst of Silicon Valley seem to struggle with that.

Thursday, August 18, 2016

Economics from space

We've got a paper out in Science today that demonstrates a new way to use satellite imagery to predict economic well-being in poor countries (see project website here).  The paper is a collaboration between some of us social scientists (or social "scientists", with emphatic air quotes, as my wife puts it) and some computer scientists across campus -- folks who have apparently figured out how to use computers for more than email and YouTube surfing.

We're hoping that this is the first of many projects with these guys, and so have codified our collaboration here, with one of those currently-popular dark-hued website designs where you scroll around a lot.

So why is it sensible to try to use satellite imagery to predict economic livelihoods?  The main motivation is the lack of good economic data in many parts of the developing world.  As best we can tell, between the years 2000 and 2010, one quarter of African countries did not conduct a survey from which nationally-representative poverty estimates could be constructed, and another 40% conducted only one survey.  So this means that in two-thirds of countries on the world's poorest continent, you've got very little sense of what's going on, poverty-wise.  And even a lot of the surveys that do get conducted are only partially in the public domain, meaning you've got to employ some trickery to even get the shape of the income distribution in these countries (and survey locations are still unavailable!).

This lack of data makes it hard to track specific targets that we've set, such as the #1 Sustainable Development Goal of eliminating poverty by 2030.  It also makes it hard to evaluate whether specific interventions aimed at reducing poverty are actually working.  The result is that we currently have little rigorous evidence about the vast majority of anti-poverty interventions undertaken in the developing world, and no real way to track progress towards SDGs or any other target.

While we don't collect a lot of survey data for many locations in the developing world, we collect other sources of information about these places constantly -- satellite information being one obvious source.  So our goal in this paper was to see whether we could use recent hi-res imagery to predict economic outcomes at a local level, and fill in the gaps between what we know from surveys.

We are certainly not the first people to think of using satellites or other "unconventional" data sources to study economic output in the developing world.  For instance, here is a 2012 paper by Adam Storeygard that uses nightlights to improve GDP estimates at the country level, and here is a paper from about 9 months ago by Josh Blumenstock and company where they use call data records from a cell phone company to predict local-level economic outcomes in Rwanda.  But what our approach brings to the table is that (unlike Storeygard et al.) we can make very local predictions, and that (perhaps unlike Blumenstock et al.) our approach is very easy to scale, given that satellite imagery is available free or at very low cost for every corner of the earth and more rolls in each day.

For a quick explanation of what we do in the paper, check out this short video that we made in collaboration with these guys.  Sort of an experiment on our end, comments or slander welcome in the comments section.



The main innovation in the paper is in figuring out what information in the hi-res daytime imagery might be useful for predicting poverty or well-being.  Standard computer vision approaches to interpreting imagery typically get fed huge training datasets - e.g. millions of "labeled" images (e.g. "dog" vs "cat") that a given model can use to learn to distinguish the two objects in an image.  But the whole problem here is that we have very little training data -- i.e. few places where we can tell a computer with certainty that a specific location is rich or poor.

So we take a two-step approach to solving this problem.  First, we use lower-resolution nightlights images to train a deep learning model to identify features in the higher-resolution daytime imagery that are predictive of economic activity. The idea here -- building on the paper cited above -- is that nightlights are a good but imperfect measure of economic activity, and they are available for everywhere on earth. So the nightlights help the model figure out what features in the daytime imagery are predictive of economic activity.  Without being told what to look for, the model is able to identify a number of features in the daytime imagery that look like things we recognize and tend to think are important in economic activity (e.g. roads, urban areas, farmland, and waterways -- see Fig 2 in our paper).

Then in the last step of the process, we use these features in the daytime imagery to predict village-level wealth, as measured in a few household surveys that were publicly available and geo-referenced.  (As our survey measures we use data from the World Bank LSMS for consumption expenditure and from the DHS for assets.)  We call this two-step approach "transfer learning", in the sense that we've transferred knowledge learned in the nightlights-prediction task to the final task of predicting village poverty.  Nightlights are not used in the final poverty prediction; they are only used in the first step to help us figure out what to use in the daytime imagery.

Josh Blumenstock (or some Science art editor) has a really nice depiction of the procedure, in a commentary that Josh wrote on our piece that also appeared today in Science.

The model does surprisingly well.  Below are cross-validated model predictions and R-squareds for consumption and assets, where we are comparing model predictions against survey measurements at the village level in five African countries (Uganda, Tanzania, Malawi, Nigeria, Rwanda).  The cross-validation part is key here -- basically we split the data in two, train the model on one part of the data, and then predict for the other part of the data that the model hasn't seen.  This guards against overfitting.
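For readers who want to see the cross-validation idea in code, here is a minimal sketch. It is not our actual pipeline: the data are synthetic, the "model" is a one-variable least-squares line rather than the image-feature model in the paper, and every name in it (`fit_line`, `cross_validated_r2`, `feature`, `wealth`) is made up for illustration. The point is only that each village is predicted by a model that never saw it, and the R-squared is computed on those held-out predictions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def cross_validated_r2(xs, ys, n_folds=2, seed=0):
    """Predict every point from a model trained only on the other folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    preds = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = a + b * xs[i]  # out-of-sample prediction
    return r_squared(ys, preds)


# Synthetic stand-in data (an assumption, not survey data):
# one image-derived feature per village, and a noisy wealth measure.
rng = random.Random(42)
feature = [rng.uniform(0, 10) for _ in range(200)]
wealth = [1.0 + 0.5 * x + rng.gauss(0, 1.0) for x in feature]

cv_r2 = cross_validated_r2(feature, wealth)
print(f"cross-validated R-squared: {cv_r2:.2f}")
```

With `n_folds=2` this matches the "split the data in two" description above; more folds just means each model trains on a larger share of the data.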



We can then use these predictions to make poverty maps of these countries.  Here is a prototype (something we're still working on), with estimates aggregated to the district level:

[Edit:  Tim Varga pointed out in an email that, while beautiful, the below plot is basically meaningless to the 10% of men and 1% of women who are red/green colorblind.  Duh - and sorry!  (Only silver lining is that this mistake harmed men differentially, subverting the normal gender bias).  Nevertheless, we will fix it.]


Maybe the most exciting result is that a model trained in one country appears to do pretty well when applied outside that country, at least within our set of 5 countries.  For example, a model trained in Uganda does a pretty good job of predicting outcomes in Tanzania, without ever having seen Tanzanian data.  Granted, this would likely work a lot worse if we were trying to make predictions for a more dissimilar country (say, Iceland).  But it suggests that at least within Africa -- the continent where data gaps remain largest -- our approach could have wide application.

Finally, we don't really view our approach as a substitute for continuing to do household surveys, but rather as a strong complement -- as a way to dramatically amplify the power of what we learn from these surveys.  It's likely that we're going to continue to learn a lot from household surveys that we might never learn from satellite imagery, even with the fanciest machine learning tricks.

We are currently trying to extend this work in multiple directions, including evaluating whether we can make predictions over time using lower-res Landsat data, and in scaling up the existing approach to all of Africa.  More results coming soon, hopefully.  We also want to work with folks who can use these data, so if that happens to be you, please get in touch!  

Friday, August 5, 2016

A midsummer’s cross-section

Summer is going too fast. It seems like just yesterday LeBron James was being a sore loser, body slamming and stepping over people – and getting rewarded for it by the NBA. Apart from that, one interesting experience this summer was getting to visit some very different maize (corn) fields within a few weeks in July. First, I was in Kenya and Uganda at some field sites, and then I was visiting some farms in Iowa.

When talking maize, it’s hard to get much different than East Africa and East Iowa. As a national average, Kenya produces a bit less than 4 million tons of maize on 2 million ha, for a yield of about 1.75 t/ha. Iowa has about seven times higher yield (12.5 t/ha), and produces nearly twenty times more maize grain. The pictures below give a sense of a typical field in each place (Kenya on the left).
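As a quick sanity check on those numbers, the arithmetic can be written out as below. Only the Kenyan area, the two yields, and the "nearly twenty times" figure come from the text; treating "nearly twenty" as an upper bound of exactly 20x is my assumption, so the implied Iowa production and area are rough ceilings, not reported statistics.

```python
# Figures quoted in the post
kenya_area_mha = 2.0       # million hectares of maize
kenya_yield_t_ha = 1.75    # tonnes per hectare
iowa_yield_t_ha = 12.5     # tonnes per hectare

# Production = area x yield; 3.5 million tonnes, i.e. "a bit less than 4"
kenya_production_mt = kenya_area_mha * kenya_yield_t_ha

# Yield gap between the two places: about 7.1
yield_ratio = iowa_yield_t_ha / kenya_yield_t_ha

# Assumption: read "nearly twenty times" as at most 20x Kenya's production
iowa_production_upper_mt = 20 * kenya_production_mt
iowa_area_upper_mha = iowa_production_upper_mt / iowa_yield_t_ha

print(f"Kenya production: {kenya_production_mt:.1f} Mt")
print(f"Iowa/Kenya yield ratio: {yield_ratio:.1f}")
print(f"Implied Iowa maize area (upper bound): {iowa_area_upper_mha:.1f} Mha")
```

So a roughly sevenfold yield gap combined with a nearly twentyfold production gap implies Iowa plants somewhat less than three times Kenya's maize area.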


Lots of things are obviously different between the two areas. There are also some things that people might think are different but really aren’t. For example, looking at annual rainfall or summer temperatures, they are pretty similar for the two areas (figures from www.climatemps.com, note different scales):



But there are also things that are less obviously different. Earlier this year I read this interesting report trying to estimate soil water holding capacity in Africa, and I’ve also been working a bunch with soil datasets in the U.S. from the USDA. Below shows the total capacity of the soil to store water in the root zone (in mm) for the two areas, plotted on the same scale. 

It’s common for people to talk about the “deep” soils of the Corn Belt, but I don’t think people typically realize just how much better they are at storing water than many other places. There’s virtually no overlap between the distribution of root zone storage in the two areas, and on average Kenya soils have about half the capacity of Iowa’s.

How much difference can this one factor make? As a quick thought experiment I ran some APSIM-maize simulations for a typical management and weather setup in Iowa, varying only the root zone storage capacity between 150 and 330mm. Simulated yields by year are shown below, with dashed lines showing the mean for each soil. 


This suggests that having half the storage capacity translates to roughly half the average yields, with much bigger relative drops in hot or dry years like 1988 or 2012. And this assumes that management is identical, when a rational farmer would surely respond by applying far fewer inputs to the worse soils.


Just something to keep in mind when thinking about the potential for productivity growth in Africa. There’s certainly room for growth, and I saw a lot of promising trends. But just like when it comes to the NBA officials, there's a lot going on under the surface, and I wouldn’t expect too much. 

Wednesday, August 3, 2016

Applying econometrics to elephant poaching: our response to Underwood and Burn

Summary


[warning: this is my longest post ever...]

Nitin Sekar and I recently released a paper examining whether a large legal sale of ivory affected poaching rates of elephants around the world. Our analysis indicated that the sale very likely had the opposite effect from its original intent, with poaching rates globally increasing abruptly instead of declining. Understandably, the community of policy-engaged researchers and conservationists has received these findings with healthy skepticism, particularly since prior studies had failed to detect a signal from the one-time sale. While we have mostly received clarifying questions, the critique by Dr. Fiona Underwood and Dr. Robert Burn fundamentally questions our approach and analysis, in part because their own analysis of PIKE data yielded such different results. 

Here, we address their main concerns. We begin by demonstrating that, contrary to our critics’ claims, the discontinuity in poaching rates from 2008 onwards (as measured by PIKE) is fairly visible in the raw data and made clearer using simple, valid statistical techniques—our main results are not derived from some convoluted “model output,” as suggested by Dr. Underwood and Dr. Burn (we developed an Excel spreadsheet replicating our main analysis for folks unfamiliar with statistical programming). We explain how our use of fixed effects accounts for ALL average differences between sites, both those differences for which we have data and those for which we are missing data, as well as for any potential biases from the uneven data reporting by sites—and we explain why this is better than approaches that attempt to guess what covariates were responsible for differences in poaching levels across different sites. We show that our findings are robust to the non-linear approaches recommended by Dr. Underwood and Dr. Burn (as we already had in large part in our original paper) and that similar discontinuities are not present for other poached species or Chinese demand for other precious materials (two falsification tests proposed by Underwood and Burn). We also show that previous analyses that failed to notice the discontinuity may have in part done so because they smoothed the PIKE data. 
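To make the fixed-effects point concrete, here is a toy sketch. It is not our analysis or the PIKE data: the two sites, their poaching levels, and the reporting years are invented, and the helper names (`pooled_difference`, `within_site_estimate`) are hypothetical. The sketch compares a naive pooled before/after difference with a within-site estimate that subtracts each site's own average first, in a setting where a high-poaching site only starts reporting shortly before 2008.

```python
from collections import defaultdict


def pooled_difference(records):
    """Naive estimate: mean after 2008 minus mean before, ignoring sites."""
    post = [y for _, p, y in records if p == 1]
    pre = [y for _, p, y in records if p == 0]
    return sum(post) / len(post) - sum(pre) / len(pre)


def within_site_estimate(records):
    """Site fixed effects: demean outcome and indicator within each site,
    then regress the demeaned outcome on the demeaned indicator."""
    by_site = defaultdict(list)
    for site, post, y in records:
        by_site[site].append((post, y))
    num = 0.0
    den = 0.0
    for obs in by_site.values():
        mean_post = sum(p for p, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for p, y in obs:
            num += (p - mean_post) * (y - mean_y)
            den += (p - mean_post) ** 2
    return num / den


# Invented data: both sites have the same true jump of 0.1 in 2008,
# but very different baseline levels and reporting histories.
TRUE_JUMP = 0.1
records = []
for year in range(2003, 2013):          # site A reports every year
    post = 1 if year >= 2008 else 0
    records.append(("A", post, 0.2 + TRUE_JUMP * post))
for year in range(2007, 2013):          # site B starts reporting in 2007
    post = 1 if year >= 2008 else 0
    records.append(("B", post, 0.6 + TRUE_JUMP * post))

pooled = pooled_difference(records)
within = within_site_estimate(records)
print(f"pooled before/after difference: {pooled:.3f}")
print(f"within-site (fixed effects) estimate: {within:.3f}")
```

In this example the pooled comparison overstates the jump because the late-reporting site has a higher baseline, while the within-site estimate recovers the true value without needing to know why the sites differ.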

We then discuss Dr. Underwood and Dr. Burn’s concerns about our causal inference. While we are more sympathetic to their concerns here, we a) review the notable lengths to which we went to look for reasonable alternative hypotheses for the increase in poaching; b) examine some of Dr. Underwood and Dr. Burn’s specific alternative hypotheses; c) present an argument for inferring causality in this context; and d) document that trend analyses less complete than ours have been used by CITES to infer that the one-time sale had no effect on poaching in the past, suggesting that our paper presents at least as valuable a contribution to the policy process as these prior analyses. We then try to understand why the prior analysis of PIKE by Burn et al. 2011 failed to detect the discontinuity that we uncovered in our study. 

Finally, we conclude by discussing how greater data and analysis transparency in conservation science would make resolving debates such as this one easier in the future. We also invite researchers to participate in a webinar where we will field further questions about this analysis, hopefully clarifying remaining concerns.

Overall, while our analysis is by no means the last word on how legal trade in ivory affects elephant poaching, we assert that our approach and analysis are valid, and that our transparency makes possible a full understanding of the strengths and limitations of our research.