Wednesday, August 3, 2016

Applying econometrics to elephant poaching: our response to Underwood and Burn


[warning: this is my longest post ever...]

Nitin Sekar and I recently released a paper examining whether a large legal sale of ivory affected poaching rates of elephants around the world. Our analysis indicated that the sale very likely had the opposite effect from its original intent, with poaching rates globally increasing abruptly instead of declining. Understandably, the community of policy-engaged researchers and conservationists has received these findings with healthy skepticism, particularly since prior studies had failed to detect a signal from the one-time sale. While we have mostly received clarifying questions, the critique by Dr. Fiona Underwood and Dr. Robert Burn fundamentally questions our approach and analysis, in part because their own analysis of PIKE data yielded such different results. 

Here, we address their main concerns. We begin by demonstrating that, contrary to our critics’ claims, the discontinuity in poaching rates from 2008 onwards (as measured by PIKE) is fairly visible in the raw data and made clearer using simple, valid statistical techniques—our main results are not derived from some convoluted “model output,” as suggested by Dr. Underwood and Dr. Burn (we developed an Excel spreadsheet replicating our main analysis for folks unfamiliar with statistical programming). We explain how our use of fixed effects accounts for ALL average differences between sites, both those differences for which we have data and those for which we are missing data, as well as for any potential biases from the uneven data reporting by sites—and we explain why this is better than approaches that attempt to guess what covariates were responsible for differences in poaching levels across different sites. We show that our findings are robust to the non-linear approaches recommended by Dr. Underwood and Dr. Burn (as we already had in large part in our original paper) and that similar discontinuities are not present for other poached species or Chinese demand for other precious materials (two falsification tests proposed by Underwood and Burn). We also show that previous analyses that failed to notice the discontinuity may have in part done so because they smoothed the PIKE data. 

We then discuss Dr. Underwood and Dr. Burn’s concerns about our causal inference. While we are more sympathetic to their concerns here, we a) review the notable lengths to which we went to look for reasonable alternative hypotheses for the increase in poaching; b) examine some of Dr. Underwood and Dr. Burn’s specific alternative hypotheses; c) present an argument for inferring causality in this context; and d) document that trend analyses less complete than ours have been used by CITES to infer that the one-time sale had no effect on poaching in the past, suggesting that our paper presents at least as valuable a contribution to the policy process as these prior analyses. We then try to understand why the prior analysis of PIKE by Burn et al. 2011 failed to detect the discontinuity that we uncovered in our study. 

Finally, we conclude by discussing how greater data and analysis transparency in conservation science would make resolving debates such as this one easier in the future. We also invite researchers to participate in a webinar where we will field further questions about this analysis, hopefully clarifying remaining concerns.

Overall, while our analysis is by no means the last word on how legal trade in ivory affects elephant poaching, we assert that our approach and analysis are valid, and that our transparency makes possible a full understanding of the strengths and limitations of our research.

What we did

Nitin Sekar (formerly at Princeton, now a science-policy fellow with the American Association for the Advancement of Science) and I wrote a paper examining whether a large legal sale of ivory affected poaching rates of elephants around the world. Our analysis indicated that the sale very likely had the opposite effect from its original intent, with poaching rates globally increasing abruptly instead of declining. We point out that this response, while inconsistent with the standard economic intuition (that the availability of legal ivory should diminish the incentive to poach), could be caused by one of two effects: legalization could increase demand for ivory, perhaps because it appears to be more socially acceptable to consume it, and legal ivory could make it easier to smuggle and trade illegal ivory, because illegal ivory could masquerade as legal ivory.  Numerous prior reports suggest both of these mechanisms are probably happening on the ground, but no prior study had examined whether their combined effect had a systematic global influence that could overpower the market-flooding effect that was the original intent of the policy. We employ an event study/regression discontinuity design to examine the effect of the sale, which was announced in 2008.  Our main measure of elephant poaching is “PIKE”, the “Proportion of Illegally Killed Elephants,” which is the fraction of elephant carcasses discovered by field workers which have been poached (ecologists devised this now-standard measure because it corrects for fluctuations in elephant populations and field-worker effort). This is our main result:

Our results suggest that the sale appeared to have a pervasive effect around the globe, that these results cannot be explained by changes in natural elephant mortality, and that the 66% average increase in poaching rates is consistent with the approximate 70% increase in attempted ivory smuggling out of Africa, as (imperfectly) measured via contraband seizures by governments:

Estimated effects of sale across countries in Africa.

Response of seized raw ivory contraband leaving African countries.

In our analysis, we also examine a large number of covariate variables that might explain the abrupt 2008 change in poaching, including trade flows, changes in income, and counts of Chinese workers who were abroad in elephant range states (and could possibly facilitate smuggling); however, none of these covariates exhibit patterns that would appear to explain the jump in poaching. We have publicly posted replication code for all of our analysis.

Understandably, the community of policy-engaged researchers and conservationists has received these findings with healthy skepticism, particularly since prior studies had not detected a similar signal from the one-time sale. Additionally, the debate about whether a legal trade in ivory would serve wild elephant conservation is a politically consequential one. As a result, we have received a lot of questions about, and a few challenges to, our manuscript. The most notable critique was a widely circulated blog post by Dr. Fiona Underwood and Dr. Robert Burn, which has generated inquiries for us from government analysts and others around the world. Here, as the applied econometrician on the paper, I address the technical concerns outlined by Dr. Fiona Underwood and Dr. Robert Burn, as well as some of the other questions folks have had (e.g., why we published the paper as a working paper).

Before proceeding, though, let me say something about how Nitin and I see this research. First, Nitin and I began this research because we were intellectually curious about international endangered species policy. We had no pre-conceived beliefs about whether the one-time legal sale of ivory was good or bad for elephant conservation—if anything, I was broadly inclined to believe a legal trade helped displace illegal goods (I had even blogged about legalization and the drug war before). The data and what we have read in reports have convinced us that legal trade does not appear to have the unequivocally universal effect of diminishing black markets as suggested by traditional economic theory. Second, while Nitin and I believe that our results provide new insight into the illegal ivory trade, and while we believe our interpretation of the results is the most reasonable one based on the available data, we do not claim that our research is definitive. Like all of science, our analysis is based on limited data, and black markets remain extraordinarily difficult to study. Research in the future may suggest our inference was too strong or even incorrect. Ours is not the last word on this subject.

That said, policy should be designed based on the best available data and analysis, and not solely on intuition or theory. We encourage our colleagues in conservation science and policy to examine the assumptions underpinning our work and compare them to those of prior studies, and then decide what research they should rely upon as they make decisions regarding future policy. Our goal below is to make sure that anyone in the world of conservation is able to understand the strengths and limitations of our research so that they may apply it appropriately going forward.

Re-explaining our approach

Before providing responses to the critiques we have received, I think it’s helpful to make absolutely clear to observers what we did in our analysis. The overall research design for our analysis is an “event study,” which is a special case of a “regression discontinuity” research design. The idea underpinning this approach is that if we want to learn if an event has a causal effect on an outcome and an event is triggered exogenously, then we can measure the effect by looking for a discontinuity in the outcome that aligns with the event. Importantly, however, this approach rests on the assumption that other important factors that influence the outcome do not also occur at the same time as the event and confound the analysis (an assumption we discuss at length and try to test in our paper).

Many folks seem to think that to infer A causes B we must be able to empirically trace all detailed mechanisms linking A to B, but this is simply not true.  For example, a few weeks ago the world identified the effect of Brexit on UK stock markets using an event study approach:

and essentially everyone understands that this is a causal relationship even though they cannot explain to you how stock prices evolve. That is not to say that understanding mechanisms is not important—it absolutely is. Demonstrating a mechanism strengthens the claim of causality, and can help us make better predictions about when similar events might have similar effects in the future—i.e., mechanisms can help us understand the external validity of our findings. Strictly speaking, though, one can still make a reasonable inference about causality without knowing the mechanism.

One of the keys to an event study is that you want to analyze the data in a way that allows you to see a discontinuous change in the outcome if there is one to see. This means you do not want to examine the outcome using the usual functions for modeling trends that will “smooth over” discontinuities. For example, if we drew a single trend line fit through all the data in the Brexit graph above, we wouldn’t be able to see the drop because we would have mechanically constrained the trend to be completely linear; we’d just see a downward sloping line (we revisit this idea below when discussing why other research did not detect our findings).
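A minimal sketch of the smoothing problem, using a made-up series rather than any real data: the values are flat before a hypothetical 2008 event and flat at a higher level afterwards. A single straight line turns the step into a gentle slope, while separate fits on each side of the event recover the gap. All numbers and variable names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical series: flat at 0.40 before the event, flat at 0.53 from 2008 on.
years = np.arange(2003, 2014)
outcome = np.where(years < 2008, 0.40, 0.53)

# Center time on the event year so each intercept is the fitted value at 2008.
t = years - 2008

# One straight line through every year: the step is absorbed into the slope.
slope_all, intercept_all = np.polyfit(t, outcome, 1)

# Separate lines on each side of the event: the step appears as a gap at t = 0.
pre = t < 0
slope_pre, intercept_pre = np.polyfit(t[pre], outcome[pre], 1)
slope_post, intercept_post = np.polyfit(t[~pre], outcome[~pre], 1)
jump_at_event = intercept_post - intercept_pre

print(f"single-trend slope per year: {slope_all:.4f}")
print(f"gap between split fits at 2008: {jump_at_event:.3f}")
```

The single-trend fit reports only a positive slope, which is indistinguishable from a gradual rise; the split fit is what makes a level shift visible.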

So how did we look for a discontinuity in the data without smoothing over it? We look at PIKE levels each year separately and make no assumptions about how smooth or not the trends are. In a sense, we “let the data speak” and try not to force it to map onto some preconceived notions about how continuous patterns over time must be. To look at changes each year, we allow the average PIKE level each year to be independent of prior years, and just simply evaluate the average as it is. In our final model, we have to do this while simultaneously accounting for all patterns across sites and countries, but if you just look at the raw PIKE data without doing anything to it at all, you essentially still see the main result. Computing average PIKE across sites each year is an extremely simple exercise that requires no fancy understanding of statistics or econometrics and we were a bit surprised that we had never seen this anywhere, especially because the discontinuity in 2008 is readily visible.

This isn’t the final answer (some important adjustments are necessary, detailed below), but I think it is crucial that everyone in the community understand how visible the 2008 jump in PIKE is in the data. Below is a figure where we plot the raw PIKE data (after cleaning) as grey circles for each site that reports data during 2003-2013. We then compute the average PIKE value across all reporting sites each year; these are the black diamonds. We then fit a simple line through the points before and after the sale in 2008 by OLS. The discontinuity in 2008 is unmistakable:

The size of the discontinuity doesn’t look as large as in some of our figures in the paper simply because the y-axis has to show the full 0 to 1 range to display all the raw data, but the size of the 2008 jump in PIKE is very close to the +0.13 jump that is the main result in our paper. There are adjustments to this calculation that are important to make because the composition of sites that report data changes somewhat year to year, but this very simple calculation gets the answer qualitatively correct and extremely close quantitatively.

I’m not sure if I can emphasize the simplicity and power of this simple calculation enough. Once the data is in a statistical program, this is 2-3 lines of code, depending on the program. To ensure everyone in the community understands our calculations, we’ve posted an Excel spreadsheet replicating our analysis so that everyone can follow what we are doing. [Note that the posted spreadsheet does not stop at the simplest calculation (shown above) but also corrects for site-level differences, discussed below.]
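For readers who prefer code to a spreadsheet, the simple calculation might look like the sketch below. It is not the posted replication code: the table, the column names (`site`, `year`, `pike`) and every value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up site-year records; the real MIKE data are not reproduced here.
records = pd.DataFrame({
    "site": ["A"] * 6 + ["B"] * 6,
    "year": [2005, 2006, 2007, 2008, 2009, 2010] * 2,
    "pike": [0.20, 0.22, 0.18, 0.35, 0.33, 0.34,
             0.60, 0.58, 0.62, 0.71, 0.75, 0.73],
})

# Step 1: average PIKE across all reporting sites, separately for each year.
yearly_mean = records.groupby("year")["pike"].mean()

# Step 2: fit one OLS line before the event and one from the event year onward.
t = yearly_mean.index.to_numpy() - 2008          # time centered on 2008
values = yearly_mean.to_numpy()
pre_slope, pre_level = np.polyfit(t[t < 0], values[t < 0], 1)
post_slope, post_level = np.polyfit(t[t >= 0], values[t >= 0], 1)

# Step 3: the discontinuity is the gap between the two fitted lines at 2008.
jump = post_level - pre_level
print(yearly_mean)
print(f"estimated jump at 2008: {jump:.4f}")
```

Because each year's average is computed on its own, nothing in the calculation forces the series to be smooth across 2008.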

As an aside, I frequently tell my students to “just look at the data first” to get a basic sense of what’s going on, before jumping into complicated high-dimensional analysis where it’s easy to make mistakes and very hard to visualize relationships. There is a strong tendency in statistical analyses to always jump straight in and use the most high-powered statistical tools in the first pass, but higher powered tools often come with more restrictive or complicated assumptions and may still not get one any closer to the truth. Simply looking at raw or very mildly processed data often reveals that much simpler approaches will work.  When simpler methods get the right answer, they are clearly preferable, both because it is easier to catch mistakes in simple approaches and other people (with whom we wish to communicate) can understand and interpret them more easily.

As a second aside, note that the average of PIKE across sites (black diamonds in the figure above) is different from the “global aggregate PIKE” that is often computed and displayed as a simple time series in many policy reports. As discussed in our paper, global aggregate PIKE is the sum of all illegally killed elephants found at MIKE sites around the world divided by the sum of all dead elephants found at MIKE sites around the world. Computing global aggregate PIKE has essentially no value, as far as I can tell, because the correction for elephant population fluctuations and field-worker effort (the reason PIKE is used as a measure) does not work if one sums illegal and total elephant mortality across sites before dividing them by one another (see footnote 8 in our paper). If one uses global aggregate PIKE for anything, one is essentially assuming that elephant populations in Thailand and data collection effort in Kenya should be used to correct for random year-to-year fluctuations in poached elephant discovery in Ghana, which doesn’t make sense.
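The arithmetic difference between the two measures can be seen in a small sketch with invented carcass counts for two hypothetical sites in a single year. The site labels and counts are assumptions chosen only to make the contrast obvious.

```python
import pandas as pd

# Invented carcass counts for one year at two hypothetical sites.
counts = pd.DataFrame({
    "site": ["large_site", "small_site"],
    "illegal_carcasses": [10, 9],
    "total_carcasses": [100, 10],
})

# Site-level PIKE first, then averaged: each site counts equally.
site_pike = counts["illegal_carcasses"] / counts["total_carcasses"]
mean_site_pike = site_pike.mean()

# Aggregate PIKE: sum carcasses across sites first, then divide.
aggregate_pike = counts["illegal_carcasses"].sum() / counts["total_carcasses"].sum()

print(f"average of site PIKE: {mean_site_pike:.3f}")
print(f"aggregate PIKE:       {aggregate_pike:.3f}")
```

In this toy case the aggregate figure sits close to the large site's value, because the site with the most carcasses found dominates the sums.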

Returning to our analysis, I mentioned above that it is important to correct for the changing composition of sites that report data during the sample. This is a potentially important issue in this context because many sites do not report data in many years, for unreported reasons. To see why this is an issue, consider the hypothetical situation in which sites in countries where the rule of law is weak will eventually stop reporting PIKE data in later years because the original funding for the operation of MIKE sites is eventually squandered away through some corrupt channel. This would lead the sample of countries to change over time in a way that might bias the sample: countries where rule of law is weak (and poaching rates are high) gradually leave the sample and only countries with strong rule of law (with low poaching rates) remain in later years. This would make average PIKE levels appear to decline over time even if overall true poaching rates did not. This could happen within a country at the site level too. For example, regions of a country that are remote and poor might struggle to bring their reporting online at the beginning of the sample and might have high poaching rates because local people have fewer economic opportunities. In this second hypothetical case, poaching rates would appear to rise over time as poor and remote regions where poaching is high begin reporting more regularly over time.

To prevent our estimates from being biased by a changing composition of sites over time, we demean all site-level PIKE data using each site’s average PIKE level over time. This means that the demeaned PIKE level for each site is zero on average, so that site entry or exit from the sample will have an average of zero effect on average demeaned PIKE each year. By demeaning each site, changes in average PIKE each year must be driven by changes in poaching rates within each site over time; in essence, we are just comparing each site to itself over time, and then aggregating the results across sites for each year. This allows us to see how PIKE evolves over time, on average across sites.
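A toy version of the first hypothetical can be sketched as follows. Two invented sites have constant PIKE, and the high-poaching one stops reporting after 2007. The site names, years and values are assumptions for illustration only.

```python
import pandas as pd

# Two invented sites with constant PIKE; the "high" site drops out after 2007.
records = pd.DataFrame({
    "site": ["low"] * 6 + ["high"] * 3,
    "year": [2005, 2006, 2007, 2008, 2009, 2010, 2005, 2006, 2007],
    "pike": [0.2] * 6 + [0.8] * 3,
})

# Raw yearly average: falls in 2008 purely because the sample changed.
raw_mean = records.groupby("year")["pike"].mean()

# Demean each site by its own average over time, then average by year.
site_average = records.groupby("site")["pike"].transform("mean")
records["pike_demeaned"] = records["pike"] - site_average
demeaned_mean = records.groupby("year")["pike_demeaned"].mean()

print(pd.DataFrame({"raw": raw_mean, "demeaned": demeaned_mean}))
```

Nothing changes within either site, yet the raw average shows an apparent decline; the demeaned average stays at zero in every year.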

An alternative, but mathematically identical, way to think about this demeaning is to imagine that we “control” for all factors that make a site different from any other sites by only comparing a site to itself over time. For example, one might think that local economic conditions, local geography, local culture, national law enforcement institutions, national corruption levels, national trading partners, and distance to the nearest international shipping port all affect average site poaching rates. All of these factors affect poaching at a site by affecting whether poaching rates are high or low on average at a site.  Thus, if we control for the average poaching rate at each site, we are implicitly accounting for all these factors that raise poaching rates at each site, since they do it by raising the average. This interpretation is referred to in econometrics as “controlling for site-specific fixed effects,” which is just a fancy way of saying we account for all factors that cause there to be average differences in PIKE across sites. For anyone interested in learning more about how this correction works, see Section D of the Appendix in our paper for additional explanation.
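The equivalence between demeaning and including a separate intercept for every site can be checked numerically. The sketch below uses simulated data and a single post-2008 indicator as the regressor; that indicator is a simplification introduced here for illustration and is not the specification in the paper. Site levels, the effect size and the noise are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites = 4
years = np.arange(2003, 2014)
site = np.repeat(np.arange(n_sites), years.size)   # site index for each row
year = np.tile(years, n_sites)
post = (year >= 2008).astype(float)                # 1 from 2008 onward

# Simulated outcome: invented site averages plus an invented post-2008 shift.
site_level = np.array([0.1, 0.3, 0.5, 0.7])
y = site_level[site] + 0.13 * post + rng.normal(0.0, 0.02, size=site.size)

# (1) OLS with one dummy variable per site (no separate intercept needed).
dummies = (site[:, None] == np.arange(n_sites)).astype(float)
X = np.column_stack([post, dummies])
beta_dummies = np.linalg.lstsq(X, y, rcond=None)[0][0]

# (2) Within transformation: subtract each site's mean from y and from post.
def demean_by_site(v):
    means = np.array([v[site == s].mean() for s in range(n_sites)])
    return v - means[site]

y_w, post_w = demean_by_site(y), demean_by_site(post)
beta_within = (post_w @ y_w) / (post_w @ post_w)

print(f"site-dummy estimate: {beta_dummies:.6f}")
print(f"demeaned estimate:   {beta_within:.6f}")
```

The two estimates agree to numerical precision, which is the sense in which demeaning "controls for" everything that shifts a site's average.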

For our purposes of estimating the effect of the 2008 sale, this nonparametric approach is much better than the alternative approach of trying to list and model the effect of all covariates explicitly. For example, researchers often will collect data on covariate variables, such as country income, forest cover, road network density, political institutions, etc. and then run a regression of PIKE on these covariates and a trend, in an effort to control for these potential confounders. There are three big challenges with this approach of modeling covariates explicitly: (a) the authors must list and obtain data on all factors that affect poaching, as leaving any out by mistake could lead to omitted variables bias (even things that nobody realizes are important could turn out to be critical); (b) the authors must assume a specific form for the mathematical relationship between each factor and poaching (e.g. linear, quadratic, exponential…), as getting any wrong could again lead to biases; (c) the authors must assume that all these factors affect poaching through this specific mathematical form in every location the same way (e.g. a single corruption index has a linear relationship with poaching across all contexts). These are tough criteria to meet, and what makes it tougher is that there is no test to be sure that one has met these criteria, so one can never really be confident that they’ve got the right approach. This means that analysts are usually left appealing to readers’ intuition and willingness to suspend disbelief that the model may be missing key covariates.

In contrast, our approach accounts for ALL average differences between sites, both those that we have data for and those for which we are missing data (known as “unobserved heterogeneity” in econometrics), including all the factors that matter but that we don’t even know about yet. Furthermore, if some covariate affects poaching in some complex nonlinear way in one place, while it affects poaching differently elsewhere, that’s not a problem: our approach still nets out all of those effects.

The cost of this very flexible approach is that, while it makes many fewer assumptions about the background process that generates poaching, it provides limited insight into what site-level characteristics may be affecting poaching levels, making it more difficult to extrapolate the results to locations that have no data. But when we are trying to infer whether the sale had an effect or not in locations where we do have data, this doesn’t matter. What does matter is that we remove any potential biases from the uneven data reporting by sites, which our approach does extremely well.

Our approach has the added benefit that it is simple and easy to understand: all you have to do is demean the time series for each site before making comparisons over time. That’s it. No hocus pocus. To illustrate that this is how we got our main result, we compute demeaned PIKE by demeaning each time series for each site. These data are the grey circles below (they are now centered at zero because of the normalization). Then, just as in the graph above, we compute average PIKE across sites each year and plot that as a black diamond (this is the calculation we show in the Excel replication file). Then we fit a line through those averages before and after the sale by OLS:

The discontinuous jump in 2008 is exactly our main result, and it is actually the same as the first figure in the post (which is the main figure in the paper, Figure 2B) except that here we’ve plotted all the raw data in the background so it’s clear what we are doing. But even this raw data is shown in our paper as the first panel of the main figure (Figure 2A) where the discontinuity is clear by looking at almost-raw data (the only processing was demeaning it by each site’s average PIKE). This is the result that anyone in the world can replicate using this Excel spreadsheet that replicates our analysis.

As you can see, correcting for differences in average poaching rates across sites (using site fixed effects) is important for the slope of the two trend lines (compare this figure to the one above with the raw data): after correction there is essentially no pre-trend and a very slight post-sale trend. But in this case, this correction was not really that essential for measuring the magnitude of the discontinuity, since you get the same result either way (although there was no way to know that would be true before doing the calculation).

The rest of our paper uses a bunch of other different flavors of statistical modeling to understand how robust this 2008 discontinuity is, and the answer is that it is really extremely robust. We estimate different nonlinear probability models, change the data set, make different assumptions about trends, etc., and whatever we do we obtain pretty much the same results. We then conduct a similar analysis on ivory contraband seizure data and other variables that are relevant to alternative theories proposed elsewhere in the literature. We see that the seizures response is similar in magnitude, although less statistically significant (in part because there is less post-sale data available) and that no other related covariates exhibit a similar pattern.  We also do a lot of (boring but informative) checks on the data to make sure the model is well structured, for example checking that the residuals are normally distributed, etc.

That’s it, that’s our main finding. A massive multi-million dollar, global data collection system was set up to understand the poaching impact of legal sales, and a level increase in poaching the same year as the 2008 legal sale is easily visible after the raw data has been demeaned by site and averaged across sites each year. We hope that this makes our analysis completely transparent. (More about our interpretation of our results can be found in the next section)

Our response to Underwood and Burn

While we have received several questions about our work, the critique by Dr. Fiona Underwood and Dr. Robert Burn is the most extensive and unsparing. Dr. Underwood and Dr. Burn are consultants who were hired by CITES to evaluate whether the legal sale designed and implemented by CITES was effective at reducing elephant poaching globally. In their blog post, Dr. Underwood and Dr. Burn state that our analysis is wrong on multiple dimensions, providing concrete criticisms in a few cases. Below we respond to their technical concerns and demonstrate that none of them alter our findings.

Our concern upon receiving the critique by Dr. Underwood and Dr. Burn was that conservation policy makers, seeing their criticisms, would broadly lack the time or the statistical background to adjudicate the debate. We reached out to Dr. Underwood and Dr. Burn in the hopes that we might reach a joint understanding about the merits and limitations of our event study design, and that we might perhaps delineate the different assumptions underpinning our design and those of prior studies, the most prominent of which they co-authored. This, we felt, would make it easiest for policy makers to understand the relative merits of different statistical approaches in analyzing and interpreting the PIKE data. Dr. Underwood and Dr. Burn declined this offer. Unfortunately, this means we must respond to their criticisms below without their input. In places where their criticisms were non-specific or unclear to us (and they did not clarify in response to my email inquiries), I have had to guess what they are saying.

Overall, Dr. Underwood and Dr. Burn indicate that both our “analysis” and our “approach” are “wrong.” Below we respond to each substantive point they raise. First, we respond to their concerns about our statistical analysis, which we summarized above. This is where we differ most strongly with Dr. Underwood and Dr. Burn, who appear to lack familiarity with our methods and seem to have overlooked much of what we put in our paper and appendices. Second, we respond to their concerns about our overall approach to causal inference. We explain why our approach, while imperfect, is a sensible one (if not the most sensible one) given the limitations in the data.

A. Concerns about the statistical analysis

Dr. Underwood and Dr. Burn distributed their criticisms between a blog post (which summarized their overall concerns) and an attached pdf with additional details that mostly repeats the blog post. Here we directly respond to all the criticisms posted on the blog.
UNDERWOOD & BURN: To illustrate their argument the authors plot a number of points showing average PIKE and a clear step change in the value of PIKE between 2007 and 2008. But the points in the graph are not raw data but model outputs.  
We assume Dr. Underwood and Dr. Burn refer to our Figure 2. As we illustrate above, their comment isn’t accurate. Panel 2A of our paper is basically raw data; the only things we did to it were (1) clean it by removing irregularities and (2) demean the time series for each site (i.e., account for fixed effects). This panel shows the abrupt jump in PIKE at all points in the distribution except the top 5% of sites with the highest poaching rates. Panel 2B, which is our main result, differs from 2A only in showing annual average PIKE, just like we computed above (and in our Excel replication). It is true that we ran a single regression to produce these numbers, but this was done mainly so that we could properly account for covariance in the uncertainty between these annual averages.

Dr. Underwood and Dr. Burn’s suggestion that there is some complex (and perhaps untrustworthy) model behind those values is inaccurate. The discussion above explained exactly where those numbers come from. From a statistical standpoint, they are extremely straightforward.
UNDERWOOD & BURN: And the model they have used is wrong. For example, although PIKE is constrained to be between zero and one their model does not constrain these values to be between zero and one. They give many reasons for doing this including that to model the data correctly is complex, they wish to choose simplicity over complexity, and if they were to use more complex methods they would need to throw away 32.1% of the data.
Methods for analysing proportions, Generalised Linear Models (GLMs), are taught at undergraduate level on statistics courses.  GLMs are actually quite intuitive, widely used and understood and not really all that complex. 
Furthermore, it is not OK to use the simplest of methods if they are wrong and it is clearly preferable to use more complex methods if that is what is needed to correctly represent the data. 
The authors have misunderstood the methods because you do not need to throw away 32.1% of the data – all of the data can be used. 
The consequences of not modelling the data correctly are that their results could be wrong and it is difficult to know how wrong it is.
Overall, this comment suggests that we chose our modeling approach primarily because we “misunderstand” the GLMs to which they refer, and that we value simplicity even when complexity provides for more accurate predictions. This is a (perhaps inadvertent but nonetheless dramatic) misinterpretation of our deliberate decision not to focus on GLMs in our analysis. We have two responses to this comment: i) our presentation of the linear model in the paper directly addresses the concerns they outlined in their rebuttal, but they overlook this in their critique; and ii) we in fact do explicitly consider non-linear GLMs in our manuscript, which recover the same result as our linear model. We ultimately elect to use the linear model because it is well-suited for the data, provides the same results, and is far more interpretable than the non-linear alternatives.

First, in our paper we demonstrate with thoroughness that our basic statistical assumptions are valid and faithfully model the data. We note many of the merits of using fixed effects (e.g., that they account for all constant inter-site variation) above. The concern that our model does not account for the fact that PIKE values are constrained to between zero and one is addressed directly and at length in our supplement on pg. 42-6. Dr. Underwood and Dr. Burn note in their extended comments that a primary concern about using a linear modeling approach is that it requires the assumption that the errors are normally distributed. As can be seen in Appendix Figures A7 and A8, once we have accounted for fixed effects, the remaining residuals are essentially normally distributed—i.e., the key assumption underpinning the OLS approach is satisfied. It is true that this modeling approach sometimes produces estimates that fall outside the bounds of 0 and 1, but this does not damn the model automatically. In our case, as detailed in our paper, 94.7% of model predictions lie within the interval [0,1]. Furthermore, of the 5.3% of predictions that fall outside this range, the average distance from the [0,1] interval is 0.0367 with a standard deviation of 0.0379 and maximum of 0.125. In our view, this is a completely reasonable range of prediction error, given that these errors are each very small and are likely far smaller than any errors generated through data collection, reporting, etc. Applying a GLM would mechanically force these prediction errors to zero, but at the cost of introducing stronger mathematical assumptions that might distort the data in other [less visible] ways, reducing clarity, and making it far more difficult to catch errors.  Moreover, these predictions for PIKE at specific sites are not the goal of our analysis, since we are interested in the average treatment effect of the sale. 
Average treatment effects when using local linearizations of changes in a nonlinear probability, centered around average probabilities, will remain very close to average treatment effects in a nonlinear model unless probabilities are varying wildly within individual sites, which we show does not occur here. In summary, we know that linear probability models (our approach) cannot be exact since they are linear approximations, but they are extremely powerful and we know exactly where potential issues may arise, so we can check specifically for these issues. We did exactly this in our analysis and found that the costs of linearization were small and, in our view, acceptable given the costs of less transparent approaches. Furthermore, our results with the linearized model are confirmed by our nonlinear estimates, indicating no substantive loss of information from this approach.
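For readers who want to see the mechanics, here is a minimal sketch of the kind of check described above: a linear probability model with site fixed effects (implemented by demeaning each site), followed by a tally of how many fitted values land outside [0,1] and how far outside they fall. The data are synthetic and every number in it (40 sites, a 0.15 step, noise of 0.1) is an assumption chosen for illustration; the function names `fit_site_fe_step` and `out_of_bounds_summary` are hypothetical and are not taken from our replication code.

```python
import numpy as np

def fit_site_fe_step(pike, site, post):
    """Within (site fixed effects) OLS of PIKE on a post-sale dummy."""
    pike = np.asarray(pike, dtype=float)
    post = np.asarray(post, dtype=float)
    _, idx = np.unique(site, return_inverse=True)
    counts = np.bincount(idx)
    y_bar = np.bincount(idx, weights=pike) / counts    # site means of PIKE
    x_bar = np.bincount(idx, weights=post) / counts    # site means of the dummy
    y_dm, x_dm = pike - y_bar[idx], post - x_bar[idx]  # demeaned series
    beta = (x_dm @ y_dm) / (x_dm @ x_dm)               # estimated step
    fitted = y_bar[idx] + beta * x_dm                  # site mean plus step
    return beta, fitted

def out_of_bounds_summary(fitted):
    """Share of fitted values outside [0, 1] and their mean distance from it."""
    dist = np.maximum(-fitted, 0.0) + np.maximum(fitted - 1.0, 0.0)
    outside = dist > 0
    share = float(outside.mean())
    mean_dist = float(dist[outside].mean()) if outside.any() else 0.0
    return share, mean_dist

# Synthetic panel (hypothetical numbers, NOT the MIKE data)
rng = np.random.default_rng(0)
years = np.arange(2003, 2014)
n_sites = 40
site = np.repeat(np.arange(n_sites), years.size)
year = np.tile(years, n_sites)
post = (year >= 2008).astype(float)
site_level = rng.uniform(0.1, 0.7, n_sites)
noise = rng.normal(0, 0.1, site.size)
pike = np.clip(site_level[site] + 0.15 * post + noise, 0, 1)

beta, fitted = fit_site_fe_step(pike, site, post)
share, mean_dist = out_of_bounds_summary(fitted)
print(f"step estimate: {beta:.3f}")
print(f"share of fitted values outside [0,1]: {share:.3f}")
print(f"mean distance from [0,1] among those: {mean_dist:.4f}")
```

The point of the second function is that the known weakness of a linear probability model is something one can measure directly rather than argue about in the abstract.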

Second, as Dr. Underwood and Dr. Burn note in their pdf comments, we do in fact report results from a GLM. We estimate a nonlinear Poisson fixed effects GLM in Table 4 (pg 22-23):

This approach is an important check on our main results because this GLM cannot generate PIKE predictions outside the [0,1] interval. In this check we recover results that match our linear approach, indicating that it does not matter whether one uses a GLM or our linear probability model: the 2008 discontinuity is statistically significant in either case. It is unclear to us why Dr. Underwood and Dr. Burn completely ignore our entire section in the main text dedicated to explaining and discussing this GLM. Instead they solely focus on a single-sentence discussion in our appendix (pg 47) where we specifically explain that the conditional fixed effects logistic regression cannot utilize the information contained in 32% of observations because their discretized outcomes do not vary. This is not a surprising fact and results from a well-known limitation of this specific model (e.g. here and here). Dr. Underwood and Dr. Burn appear to suggest that we do not understand or utilize GLMs based on this sentence in the appendix, which is clearly not true based on our use and presentation of a GLM in the main manuscript.

Our GLM is a Poisson fixed effects regression that models the count of illegal carcasses at each site; with this model, as mentioned above, we recover the same results as we do in the much simpler and more transparent approach (table 4, pg. 23). This is a sensible way to model the MIKE data because the generation of elephant carcasses is essentially a classical Poisson process: at each moment in time each elephant either dies with low probability or survives with high probability, in an environment where there are a large number of elephants and moments in each year (implying a large number of Poisson “trials”). We believe that this modeling approach implemented in our analysis, which agrees with our main linear findings but was left unacknowledged by Dr. Underwood and Dr. Burn, is a more sensible strategy for understanding patterns in the data than the nonlinear model that they advocate in their previous work (as we explain below).
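The "many trials, each with a low probability" intuition can be checked numerically. The sketch below compares the exact binomial distribution of a count built from many low-probability trials against the Poisson distribution with the same mean. The number of trials (10,000) and the per-trial probability (0.0005) are hypothetical values picked only to illustrate the approximation; they are not estimates of anything in the MIKE data.

```python
import math

def binomial_pmf(k, n, p):
    """Exact probability of k events in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n_trials = 10_000          # hypothetical number of elephant-moments
p_death = 0.0005           # hypothetical per-trial probability
lam = n_trials * p_death   # implied mean count

# Total variation distance over counts 0..40 (mass beyond 40 is negligible here)
tv = 0.5 * sum(
    abs(binomial_pmf(k, n_trials, p_death) - poisson_pmf(k, lam))
    for k in range(41)
)
print(f"mean count: {lam:.1f}, total variation distance: {tv:.6f}")
```

When the distance is tiny, treating carcass counts as Poisson loses essentially nothing relative to tracking every individual trial.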

B. Concerns about our overall approach to causal inference

As we hope is abundantly clear by now, our statistical approach demonstrates that there is in fact a discontinuous increase in PIKE between the periods 2003-2007 and 2008-2013.  Furthermore, this finding is robust to a whole slew of tests and checks in our manuscript, some of which are detailed above. While there are certainly other valid ways to approach the PIKE data, we categorically reject Drs. Underwood and Burn’s suggestions that our statistical approach is overly simplistic. At worst, it is different from previous approaches and provides equally useful insights; at best, it may provide superior insights to previous papers.

Dr. Underwood and Dr. Burn then proceed to argue that our interpretation of the discontinuous increase in PIKE beginning in 2008 is invalid. As with all non-experimental studies, there is a legitimate philosophical debate to be had about causal inferences and whether the necessary assumptions are met. The critical assumptions for causal inference in this research design are well understood and in Section 7 of our paper we made a good-faith effort to rigorously test those assumptions where possible (pg 26-30). We show here that, while Dr. Underwood and Dr. Burn make the important point that alternative explanations should be considered (as we do too), they overstate their case by making inaccurate claims about necessary assumptions, and they understate the validity of our approach.
UNDERWOOD & BURN: Their Logic: Their argument is that in their modelling they tested whether there was evidence of a step change, or discontinuity, in the PIKE data in 2008.  That is estimates of PIKE prior to the sale (up to 2007) are significantly lower than estimates of PIKE after the sale (from 2008 onwards) They say that their model shows that estimates of PIKE from 2003 to 2007 were significantly lower than estimates of PIKE from 2008 onwards.
The authors then look for a similar discontinuity in a number of variables they have selected to measure Chinese influence and presence in elephant range states. They consider these to be other potential drivers of the trade. They don’t find the same discontinuity in these variables between 2007 and 2008. Their conclusion is that if these drivers don’t show the step change then as everything else remained constant then the only explanation for the step change is the legal sale of ivory.
This is a significant misrepresentation of our assumptions. As noted before, event studies/regression discontinuities characteristically assume that if all relevant covariates are smooth across the timing of the event in question, then it is reasonable to infer that the event was responsible for any discontinuous change in the response variable (recall the Brexit example above). This isn’t perfect logic: for instance, one could construct a narrative where continuous changes in one or more covariates result in a threshold effect that happens to coincide exactly with the time of the sale, like when gradually rising temperature rather suddenly causes water to boil. But it is typically difficult to justify why such a perfectly-timed threshold is more likely than the far simpler explanation that the abrupt change in the exogenous variable (in our study, the sale) caused the abrupt and contemporaneous change in the outcome (poaching). The literature legitimizing the general applicability of event studies/regression discontinuities is broad and deep, and we find it the best available approach in this context.
UNDERWOOD & BURN: There are many things wrong with this, even if we were to ignore the fact that their models are not correct.  
In the paper they do not: 
provide an explanation as to their choice of potential drivers that they test
First, on pg. 27 of our study, we do provide some explanation of why we choose our confounders, as they are measures of pertinent economic conditions and Chinese and Japanese influence on elephant range states that might confound the effect of the sale (as an aside, we also note that, in general, a failure to provide motivating details with sufficient granularity to satisfy all readers does not invalidate an analysis). The general belief in the conservation community has been that increases in East Asian affluence and influence in African elephant range states have been primary drivers of the increase in elephant poaching; thus, it was reasonable to suspect that sudden changes in these variables may be potential causes of the discontinuous increase that occurred in 2008. This is why we examined whether there was a discontinuous increase in various measures of Chinese and Japanese presence and influence in elephant range countries (including number of Chinese service workers and engineers present in range countries, as well as foreign direct investment in and aid to range countries), to see whether these factors might have facilitated the sudden expansion of illicit trade networks and/or corrupt political relationships. We also examine macro-economic conditions in buyer and seller states, since these conditions affect the incentive to participate in black markets, as well as trade relations between these sets of countries (imports and exports, as well as relative share of import and export volumes), since higher legal trade volumes might also facilitate higher illicit trade volumes. We believe these are all important patterns to check, and welcome specific criticisms about them. Note that we exerted extra effort to report these results in detail, even though there was essentially “nothing to see” that was particularly interesting.

Second, we explicitly note in our study (pg. 26) that “it is impossible to test [whether covariates are smooth across the event] for the universe of potential confounders.” The situation is complicated in the case of our study since there is no strong, empirically substantiated theory about what (aside from the opportunities signaled by the announcement of a legal sale) may lead to the sort of global discontinuous increase in PIKE we identified above. As we demonstrate below, we are open to analyzing other potential confounders to our analysis.
UNDERWOOD & BURN: In the paper they do not:… discuss the global financial crisis of 2008. Could this also be a reason why the discontinuity is observed?
This is a reasonable concern. The 2008 financial crisis obviously led to drastic changes in the global economy, and it is possible that one or several of these changes had an effect on ivory black markets. We thought of this as well, but we do not know of any standard arguments in the community that the financial crisis caused the recent climb in poaching. Still, we make a special effort to think through ways the crisis may have changed the dynamic between China and the elephant range states. Perhaps African countries began to trade more in absolute terms with China or Japan, opening opportunities to smuggle ivory? We find no evidence of a discontinuous increase in imports or exports between the range states and China or Japan. Maybe the relative share of trade with China or Japan suddenly increased as Western economies declined? Again, no sudden change in the proportion of trade going from range states to China and Japan occurs. Perhaps African nations’ GDP dropped suddenly, making poaching more attractive to low-income communities? Again, we found no such discontinuous change in 2008. (Also, the incomes of remote rural individuals who are “potential poachers” are barely growing in normal years and their village-based incomes are almost entirely decoupled from international financial markets.) Slight slowing of income growth in China, and the crash in Japan, would have the wrong effect, as falling incomes would likely reduce demand for illicit ivory.

We are open to more potential ways to measure how the financial crisis might have affected ivory trade. However, it is neither conceptually desirable nor practically possible to test every conceivable confounder, and we think that is an unreasonable standard to meet to infer causality in the real world. Dr. Underwood and Dr. Burn can of course sympathize with this feasibility constraint, as they too did not test every dimension of the 2008 financial crisis in their own earlier work. If critics can provide specific other mechanisms and covariates that might offer an alternative hypothesis to that which we present in our paper, we are happy to test them as well.
UNDERWOOD & BURN: In the paper they do not:… talk about trends in the trade of other illegal wildlife products such as rhino horn and pangolin. These have also increased over the last few years and there have not been legal sales in these products.
It is an interesting idea to examine trade in other species, one that has been suggested by other colleagues. Dr. Underwood and Dr. Burn’s statement that poaching of some other species has trended upward in recent years is not relevant to our analysis, since only the change in poaching at the moment of the sale (i.e. the discontinuity) can be interpreted as a plausibly causal effect of the sale. The correct placebo test would be to examine whether there is also a discontinuity in poaching of other species in 2008. We are currently collecting what data we can find on other species to implement this check. If any reader is aware of reliable, comprehensive data sources on these poaching patterns, please let us know (either by comment below or email). In reply to Dr. Underwood’s contacting us with this comment, we requested from Dr. Underwood her data sources on this statement so that we could check for a discontinuity in 2008. She did not reply with any data sources or analysis to support her statement.

We are not aware of any global and systematically collected poaching data sets comparable to the MIKE data we analyze, especially ones that record both legal and illegal mortality. However, so far we have obtained some data on poaching of other large species in countries that overlap with our sample (sources listed here and here). Below we plot poaching trends in each country, fitting lines to sections of the trend that appear to have some coherent structure (except for rhinos in Nepal, where we simply examine conditional averages near 2008). Only the Nepali data allows us to compute a poaching measure analogous to PIKE that adjusts for total rhino mortality; other values are simply total counts.

We see that there are strong trends in rhino poaching in South Africa, Kenya, and India, with a possible discontinuity in Kenya; however, all the breaks occur between 2009-2011, not 2008. Leopard poaching in India clearly has no break or kink in 2008, although there might be a decline starting after 2012. Tiger poaching in India has a kink in 2008, but no discontinuity analogous to the elephant result. 

Given that these data are far less complete and collected less systematically, we interpret them with caution. However, they do not suggest that the sharp 2008 discontinuity in elephant poaching that we observe coherently and globally is obviously visible in these data for other species, which is consistent with the theory developed in our paper.
UNDERWOOD & BURN: In the paper they do not:… consider trade in other goods that might play a similar role to ivory within China. How has demand in these changed over the same time period? In which case how does this match with the demand for ivory?
Just to reiterate, it is not possible to test the full universe of potential confounders. It is also not immediately obvious what “other goods” are similar enough to ivory to provide a meaningful comparison. It is possible Dr. Underwood is referring to other precious metals and stones that are legally traded and used in jewelry, artifacts, or as investments. This is an interesting falsification test, so we obtained data on legal gold, diamonds, and jewelry demand in China from the peer reviewed article by Hsu et al (Gems & Gemology, 2014):

There is certainly rising demand through this period, with the rate of increase rising for gold and jewelry around 2009, but there is no discontinuity across the 2008 sale.
UNDERWOOD & BURN: In the paper they do not:… compared their models to a model which allows an increasing nonlinear trend in PIKE rather than a step change
This critique is misguided. First, our year-effect results (red dots in our main figure below) are fully non-parametric and do not make any assumptions about underlying trends, and the discontinuity remains clear in this simplest model (eq. 5).

Second, our trend analysis already allows an increasing trend through the sample by allowing for the trend before and after the sale to differ. Third, we narrow the window of analysis to +/- 3 and +/- 2 years around the sale to eliminate the influence of any curvature of the trend and find our results unchanged. Fourth, yes, we can easily run quadratic trends before and after the sale and recover the same result because there is essentially no curvature. Finally, we are genuinely confused as to why Dr. Underwood would argue that this response should be modeled as a nonlinear function rather than a step change. Viewing the figure above, once fixed effects have been accounted for, a step change is clearly what needs to be tested for.
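The robustness checks listed above all amount to the same regression with different choices of trend order and window. As a rough sketch (not our replication code), the function below regresses a yearly series on a post-sale dummy plus separate polynomial trends on each side of the event, and returns the coefficient on the dummy, which is the jump at the event year. The name `estimate_jump`, the synthetic series, and its parameters (a 0.15 step, a slope of 0.005 per year, noise of 0.005) are all assumptions made for illustration.

```python
import numpy as np

def estimate_jump(years, y, event_year=2008, degree=1, window=None):
    """Jump at event_year, allowing separate degree-`degree` trends on each side."""
    t = np.asarray(years, dtype=float) - event_year
    y = np.asarray(y, dtype=float)
    if window is not None:
        keep = (t >= -window) & (t < window)   # e.g. window=3 keeps 2005..2010
        t, y = t[keep], y[keep]
    post = (t >= 0).astype(float)
    cols = [np.ones_like(t), post]             # intercept and step
    for d in range(1, degree + 1):
        cols.append(t**d)                      # pre-event trend term
        cols.append(post * t**d)               # change in trend after the event
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]                             # coefficient on the step dummy

# Synthetic yearly series (hypothetical, NOT the MIKE data)
rng = np.random.default_rng(2)
years = np.arange(2003, 2014)
t_true = years - 2008
y = 0.30 + 0.005 * t_true + 0.15 * (years >= 2008) + rng.normal(0, 0.005, years.size)

jump_linear = estimate_jump(years, y, degree=1)               # separate linear trends
jump_quadratic = estimate_jump(years, y, degree=2)            # separate quadratic trends
jump_window3 = estimate_jump(years, y, degree=1, window=3)    # +/- 3 year window
jump_window2 = estimate_jump(years, y, degree=0, window=2)    # +/- 2 years, means only
print(jump_linear, jump_quadratic, jump_window3, jump_window2)
```

With only two years on each side there is not enough data to fit a trend, so the narrowest window in this sketch simply compares means, and its estimate therefore absorbs a little of the underlying slope.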
UNDERWOOD & BURN: The argument they use that a similar step change is not observed in their other potential drivers might work for a simple situation. But the illegal ivory trade is complex and dynamic with many different drivers operating on different spatial and temporal scales all along the trade chain. It is more likely that if the sale has had an effect it contributes to the increase in demand rather than being the sole reason for an increase in demand. Any analysis should therefore look at relative contribution of different drivers and how they describe changes in PIKE by modelling it in one comprehensive model.
This is exactly why our quasi-experimental research design is so important. The event-study/regression discontinuity approach is designed to be used in complex settings (e.g. the stock market is pretty complicated), and it recognizes that there are many other factors that influence an outcome besides the event treatment. However, if the event is really abrupt and unanticipated, and trends in all other factors that influence poaching are continuous and “smooth” across the event, then the magnitude of the discontinuity can be fully attributed to the event.

Of course, we recognize that the legalization event occurred against a backdrop of many other factors and trends that are relevant to ivory markets, such as the growing wealth of potential Chinese buyers. These factors would shape how supply and demand functions (that we describe in our analysis) would look the moment before the sale. As we stress in our writing and in Equation 4 of the paper, the shape of these functions directly affects how much the legal sale matters.  Had these factors been different, then the sale would have had a different effect. Our analysis is only able to identify the effect of the sale at the moment when the sale occurred. 

Addressing Dr. Underwood and Dr. Burn’s broader point, however—it is certainly possible that something else occurred in 2008 that contributed to the step increase in poaching that we detect that year. It is also, strictly speaking, possible that there was some sort of complex non-linear threshold effect in poaching generated by the linear increases in drivers of poaching that we did model. However, to say in essence, “well, this analysis does not categorically prove that it couldn’t have been something other than the one-time sale” isn’t really a cogent enough challenge given the overall strength of the case we put forward in our manuscript. Opponents of the one-time sale made predictions about how the sale would affect poaching, and those predictions are supported both by the qualitative observations made on the ground and our global analysis of PIKE data. There is a reasonable theory about how the one-time sale may have affected supply and demand—outlined in our manuscript and bolstered by reports on the ground—that explains these observations. Overall, we provide a coherent explanation for what is observed in the data. While we may very well be wrong about how the black market operates, critics should now provide more specific alternative theories, mechanisms, and data to challenge our findings than what has been provided by Dr. Underwood and Dr. Burn.  
UNDERWOOD & BURN: To be clear, I am not commenting one way or the other about whether the sale of ivory is the, or one, reason for the illegal ivory trade. My concern is that this analysis and the conclusions it draws is flawed and should not be used to guide future policy on elephants.
We have comprehensively explained why Dr. Underwood’s concerns about our statistical analysis are unmerited. Furthermore, we have gone through extra effort to make our work clear and transparent expressly because we believe it should be understood by anyone who might make policy decisions based on it. 

As to whether a trend analysis such as ours should be considered while making policy, we point out that CITES has previously drawn policy insights from trend analysis of the same data, drawing the strong conclusion that the 2008 sale that they approved and oversaw did not contribute to poaching rates. In document SC 62 Doc 46.1 (p 13) CITES reports:
Concerns have been expressed in recent months that the international ‘one-off’ ivory sales conducted under the auspices of CITES in 2008 may have led to the observed increases in levels of illegal killing of elephants. However, the MIKE analysis found no evidence to support this view. The effect of each of the years from 2002 to 2011 on the PIKE trend was investigated through an analysis of deviance. There was no statistically significant effect of the years 2008 or 2009 on the trend….The year 2005 was the turning point in the trend, after which PIKE levels began to steadily increase up to the present. This was three years before the sale was conducted and two years before the Parties approved it. The year 2011 appears to represent another important point in the trend, in which PIKE levels appear to further accelerate. In view of the above, there is no evidence in the MIKE data to suggest that the 2008 sale caused poaching levels to increase or to decrease.
Where the trend they refer to is the quadratic trend below (which is contained only in the Supplementary Information for Document SC 62 Doc 46.1 on page 22):

Now, as we will discuss further in the next section, the analysis that produced this curve was ill suited for detecting the discontinuous jump in poaching that is readily visible in the data since it forced the trend to be smooth. Moreover, if you dig into their analysis you see they did not account for unobserved differences between sites or remove irregular data; thus our analysis should be viewed as an improvement on this analysis, better designed to detect effects of the 2008 sale. However, what is important is that CITES used exactly a trend analysis (similar to but inferior to that which we implement) as logical evidence that the 2008 sale did not affect poaching levels. Thus, at least from CITES’s perspective, it makes perfect sense to consider a trend analysis that now, due to refinements, shows that the 2008 sale may very well have affected poaching. In fact, to exclude the best available trend analysis would directly contradict how such information has been previously used in policy decisions.

We emphasize, however, that no study is perfect. Since there is so much we do not know about the black market, and because of the nature of the PIKE data, it is not possible to build a 100% watertight case that the one-time sale caused the uptick in poaching observed in 2008. This is not a lab study or a carefully designed RCT, but a global data set with all the associated warts. However, as we explicitly say in our paper, “our results are most consistent with the theory that the legal sale of ivory triggered an increase in black market ivory production by increasing consumer demand and/or reducing the cost of supplying black market ivory” [emphasis added]. And our findings are more thorough and meticulously tested than anything in the literature so far. As such, we think it would be very unfortunate for policy makers to overlook our findings as they move to make future decisions on international ivory policy.

Why were we the first to notice the seemingly obvious discontinuity?

Given that several researchers have examined the PIKE data before us, a natural question to ask is why no one else saw the clear discontinuity that we show here. The short answer to this is that while our approach allows the data to “speak for itself”, revealing a discontinuity whose statistical significance we can then test, previous studies have smoothed the data, typically with some sort of polynomial, covering the very discontinuity that they should have been looking for. For instance, a CITES document produced the figure just above in which the data are clearly smoothed over with a quadratic model; such a model could never reflect the discontinuous jump in 2008.

However, just to be thorough, we are going to go beyond the short answer here. In Dr. Underwood’s and Dr. Burn’s detailed critique of our manuscript, they write, “The analyses of the PIKE data that you [Nitin and I] present in this paper does not follow the approach carried out in Burn et al (2011) and the trends do not look the same.” In other words, the basis of Dr. Underwood’s and Dr. Burn’s skepticism for our manuscript was that our results looked so different from theirs.

In sharp contrast to our recent findings, Drs. Underwood and Burn report that after the 2008 sale “The results for 2009 indicate a decline.” Furthermore, they say in a footnote of their critique of our manuscript that “Although it is the case that [our previous] analyses have not primarily looked for these trends we have in the past considered a step change and not really seen any evidence for this.” However, as they do not publicly report these findings, we don’t know what they did or how they arrived at the conclusion that there was no evidence of a step change.

We too have long been perplexed by why Dr. Underwood and Dr. Burn’s findings were so different from ours. As early as July 21, 2014, we reached out to Dr. Underwood and asked for the data and replication code from their 2011 manuscript so that we may understand what they did differently and we received an email reply from Dr. Underwood denying our request. Since Dr. Underwood and Dr. Burn published their critique of our manuscript in mid-June of this year, I have asked them other questions about their analyses, but they have noted that they lack the time to engage with us. Thus, as you see below, I do my best to explain what the differences may be between our analyses and that of Burn et al. 2011. Please note that since they have not published their replication code, there is a certain amount of guesswork involved here. However, even if some of the assumptions I make about their implementation are incorrect, the legitimacy of our own model (as described above) still stands. Furthermore, as we note in our own section, our data and replication code are published here so anyone can replicate our findings for themselves.  The code I use below to re-examine the Burn et al 2011 analysis is here.

Concerns with the Burn et al. 2011 analysis

Dr. Underwood and Dr. Burn’s analysis was initially released as a non-peer reviewed 2010 CITES white paper and then published in PlosOne in 2011. In that analysis, the authors state they are interested in the general question “Have changes in CITES policy, and in particular the one-off ivory sales, had an impact on elephant poaching?” and they report in the abstract:
Important drivers of illegal killing that emerged at country level were poor governance and low levels of human development, and at site level, forest cover and area of the site in regions where human population density is low. After a drop from 2002, PIKE remained fairly constant from 2003 until 2006, after which it increased until 2008. The results for 2009 indicate a decline… The results of the analysis provide a sound information base for scientific evidence-based decision making in the CITES process.
The main figure of that analysis illustrated eight predicted values for PIKE using a fifth-order polynomial to estimate the trend:

We attempted to recover the basic results of their analysis, following the key methods outlined in Burn et al. 2011 (we do not use a Bayesian implementation, but that should not substantively affect the results). We estimate a nonlinear GLM (we used a conditional logit rather than a binomial, discussed below) while accounting for the hierarchical structure of sites within countries using random effects (following Burn et al.) and estimate a fifth-order polynomial trend (again following Burn et al.) using the original data published by Burn et al. (code and data are here). We discard the 2002 data based on the concerning irregularities documented in our paper. Instead of obtaining their main result pictured above, we essentially recover our main result, with an abrupt 2008 increase in PIKE. The Burn et al model with a fifth order polynomial is the grey line, with predicted values for January 1 as diamonds:

This approach shows a large and abrupt jump between 2007 and 2008, consistent with our main result. We do not see a drop in 2009.

Though we do not know exactly how Burn et al. arrived at their results, we can guess at a few possible sources of differences. 

First, Burn et al. do not present results where they drop two extreme outliers from Kenya in 2009 that had very high natural mortality due to drought. Those observations lower PIKE in those sites, which could contribute to some (but probably not all) of the drop in PIKE that they report in 2009.

Second, Burn et al do not exclude data from 2002, even though it exhibits many irregularities and was not collected for the entire year for any sites. We detail concerns with the 2002 data in the appendix of our paper and exclude it in our main analysis, although we demonstrate that our results are robust to its inclusion (and the inclusion of the outliers described in the last paragraph). Examining Burn et al.’s original data, I verified that it also exhibits the strong irregularities that we document, such as the very large fraction of sites reporting zero PIKE and a tiny fraction reporting PIKE=1, which is sharply different from the pattern in later years:

That said, if we keep the 2002 data in the sample and use the model Burn et al advocate, we still see an increase in 2008 that persists into 2009, although it looks less striking because the fluctuation caused by 2002 distorts the polynomial and there is possibly residual bias in the pre-sale trend, as discussed above.

An additional issue is smoothing over the 2008 discontinuity. Because Burn et al. use a polynomial, changes are forced to appear as if they are continuous, even if they are not. Furthermore, PIKE in each year affects the estimate in every other year (a well-known problem with polynomials) so distortions caused by the 2002 data may affect what appears to be going on much later in 2008-2009. To see how this matters, we implement the same analysis as Burn et al but only change the polynomial trend to the approach we use in our analysis, of estimating each year’s mean separately. This approach should allow for discontinuities to reveal themselves, and it does:
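To make the smoothing issue concrete, here is a small sketch on synthetic data (hypothetical numbers, not the Burn et al. data and not their model, which is a hierarchical logit). It builds a panel with a pure step in 2008, then fits the same observations two ways: with a separate mean for each year, and with a fifth-order polynomial in time fit by ordinary least squares. The site count, step size, and noise level are assumptions for illustration.

```python
import numpy as np

# Synthetic, already site-demeaned panel with a pure step in 2008 (hypothetical)
rng = np.random.default_rng(1)
years = np.arange(2003, 2014)
n_sites = 30
year = np.tile(years, n_sites)
pike = 0.30 + 0.15 * (year >= 2008) + rng.normal(0, 0.03, year.size)

# (a) Separate mean for each year: no shape is imposed on the trend
year_means = np.array([pike[year == yr].mean() for yr in years])
fit_means = year_means[year - years[0]]
rss_means = float(np.sum((pike - fit_means) ** 2))
jump_means = year_means[years == 2008][0] - year_means[years == 2007][0]

# (b) Fifth-order polynomial in time, centered at 2008 for numerical stability
t = year - 2008.0
coefs = np.polyfit(t, pike, deg=5)
rss_poly = float(np.sum((pike - np.polyval(coefs, t)) ** 2))
jump_poly = np.polyval(coefs, 0.0) - np.polyval(coefs, -1.0)

print(f"2007 to 2008 change, year means: {jump_means:.3f}")
print(f"2007 to 2008 change, polynomial: {jump_poly:.3f}")
print(f"residual sum of squares, year means vs polynomial: {rss_means:.3f} vs {rss_poly:.3f}")
```

Because a polynomial in the year is a special case of having one free value per year, the polynomial can never fit better than the year means; whatever it misses is the part of the step that it has spread across neighboring years.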

The linear (biased) pre-sale trend during 2003-2007 is the same as before (since site fixed effects are not included, this is why 2003 looks so high) but the jump in 2008 is unmistakable. It just doesn’t look so striking since the irregular 2002 data is way out there. But the 2002 extreme value probably doesn’t fully explain what’s going on with Burn et al, since their Figure 1 (above) shows 2002 to have abnormally high PIKE—this is particularly odd since a majority of sites reported PIKE of zero in 2002 (see histogram above).

The final remaining difference is their implementation, which differs from ours in two ways. First, they estimate their model in a Bayesian framework, but that shouldn’t substantively affect their findings this much unless they adopt very strong priors, and they didn’t do that. Second, they assume that elephant poaching is a binomial process, which seems pretty bizarre to me. I spent several hours with a colleague and a white board trying to derive why this might make sense and we couldn’t figure out any justification. One possibility is that this is an easy-to-use off-the-shelf model that has standard implementations in most statistical software packages, but it’s clearly not appropriate for this setting. The binomial model assumes the total number of trials is fixed, which in this setting means total elephant deaths. A weighted coin is then flipped for each elephant to determine if the elephant dies by poaching (heads) or naturally (tails). The weight of the coin (PIKE) is then modeled using a logistic function that translates trends and covariates into probabilities.

This model is strange because if an elephant is poached, then that means one less elephant dies naturally this year, and vice versa. Thus there is no randomness in natural elephant mortality that is decoupled from elephant poaching, and all the randomness that is linked to poaching is constrained so that the two move in opposite directions: when poaching is high natural mortality must be low, and vice versa. This is particularly strange since PIKE was originally designed as a measure to standardize poaching data because it is believed that randomness in poaching data and natural mortality move in the same direction period-to-period (both increase with higher elephant populations and higher surveillance effort, see footnote 8 in our paper). So in some sense, the constraints of the Burn et al. model are directly at odds with the measures they are using (this is not an issue for the Poisson approach we use, described earlier).

After several hours trying to figure out exactly how this odd assumption distorts their findings, we moved on. But some intuition suggests it’s probably not the right model to use. For example, there are many cases in the data where PIKE=1 for large numbers of poached elephants: 16 cases where 10 or more poached elephants were discovered (median = 15, max = 63). The binomial model assumes that in each of these cases, a coin was tossed at least ten times and came up heads every single time. This forces the implied weight of the coin to be very high, otherwise these cases quickly become vanishingly unlikely. But if you realize that natural elephant mortality is also random, so sometimes no elephants just happen to die naturally or you don’t find the ones that do, then you don’t have to make such extreme inferences.
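The coin-toss intuition can be made concrete with a little arithmetic. The sketch below is illustrative only: the coin weights and the natural-carcass discovery rates are hypothetical values chosen for the example, not estimates from the PIKE data, and the function names are made up here. The carcass counts 10, 15, and 63 are the threshold, median, and maximum quoted above.

```python
import math

def prob_all_poached_binomial(p, n):
    """Binomial view: chance that all n carcasses are poached ("heads")
    when each carcass is independently poached with probability p."""
    return p ** n

def prob_no_natural_poisson(natural_rate):
    """Independent-counts view: chance that zero natural-death carcasses are
    found when that count is Poisson with mean natural_rate. It does not
    depend on how many poached carcasses were found."""
    return math.exp(-natural_rate)

# Hypothetical coin weights (assumed values, for illustration only).
for p in (0.5, 0.8, 0.95):
    for n in (10, 15, 63):
        print(f"binomial: p={p:.2f}, n={n:2d} -> P(PIKE=1) = "
              f"{prob_all_poached_binomial(p, n):.3g}")

# Hypothetical mean numbers of natural carcasses discovered per site-year.
for rate in (0.5, 1.0, 2.0):
    print(f"poisson: natural rate={rate} -> P(no natural carcasses) = "
          f"{prob_no_natural_poisson(rate):.3g}")
```

Under the binomial view, observing PIKE=1 across many carcasses is only plausible if the coin weight is pushed close to one. If natural carcass discoveries are instead an independent random count, a site-year with no natural carcasses is unremarkable regardless of how many poached carcasses turn up.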

Given that Burn et al. made so many assumptions that we didn’t, I thought it would be interesting to see if these assumptions made a substantive difference once they were held up directly against our results. As described above, I ran the nonlinear GLM approach they seem to advocate and compared it against our much simpler approach. I did my best to emulate the approach outlined in their 2011 PlosOne paper: I ran a multi-level hierarchical logit model using nested random effects at the country and site level, just as described in Burn, Underwood, and Blanc 2011 (I omitted the cross-sectional controls since that variation is absorbed by site-level effects, which also makes these results more comparable to our simpler approach). I compare this to our simpler linear probability model using site-level fixed effects (i.e. demeaning site time series) described above. I even implemented both approaches using the original Burn et al. data set, to make sure everything was comparable (note that I dropped 2002 since it exhibits the same obvious irregularities that we document in our paper). This is the comparison between the two models using the original data from Underwood (code and data here).
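Separately from that comparison, the "demeaning site time series" step can be sketched on a toy panel. Everything in this example is hypothetical (two made-up sites, a made-up jump of 0.1, no noise); it is not the MIKE data or the replication code, only an illustration of why site fixed effects matter when sites report in different years.

```python
from collections import defaultdict

TRUE_JUMP = 0.1  # assumed post-2008 increase in PIKE, for illustration only

def post_sale(year):
    """Indicator for years from 2008 onwards."""
    return 1.0 if year >= 2008 else 0.0

# Toy panel of (site, year, PIKE). Site A has a high baseline and reports
# every year; site B has a low baseline and only starts reporting in 2008.
panel = []
for year in range(2003, 2011):
    panel.append(("A", year, 0.8 + TRUE_JUMP * post_sale(year)))
    if year >= 2008:
        panel.append(("B", year, 0.2 + TRUE_JUMP * post_sale(year)))

def ols_slope(xs, ys):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def demean_by_site(rows):
    """Subtract each site's own mean from PIKE and from the indicator."""
    by_site = defaultdict(list)
    for site, year, pike in rows:
        by_site[site].append((post_sale(year), pike))
    xs, ys = [], []
    for obs in by_site.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        xs += [x - mean_x for x, _ in obs]
        ys += [y - mean_y for _, y in obs]
    return xs, ys

# Pooled comparison: ignores which sites are reporting in which years.
pooled = ols_slope([post_sale(y) for _, y, _ in panel],
                   [p for _, _, p in panel])
# Within-site (fixed effects) comparison: uses only changes inside each site.
within = ols_slope(*demean_by_site(panel))

print(f"pooled estimate of the jump: {pooled:+.3f}")
print(f"within-site estimate of the jump: {within:+.3f}")
```

In this toy panel the pooled comparison comes out negative (about -0.2), because a low-poaching site enters the sample in 2008, while the within-site estimate recovers the assumed jump of 0.1. Site B contributes nothing to the within-site estimate, since it is never observed before 2008.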

The fifth order polynomial that Underwood et al. use is a bit wiggly with a few odd peaks and troughs (which is why folks don’t usually use such high order polynomials with only eight time periods). But the overall pattern is the same: a big jump in 2008 after the legal sale. The downward pre-sale trend is a bit steeper using Underwood et al.’s approach, consistent with our graph above when the bias from inconsistent site reporting is not fully accounted for, but the result is essentially the same. Since the modeling approach that Dr. Underwood and Dr. Burn advocate produces the same result as our approach, it is even more unclear why they claim the GLM approach is superior and our approach is definitively “wrong.” Why did they dismiss the linear approach?

Burn et al. explicitly state their logic for using their GLM modeling approach in Supplementary Text S2 of their 2011 PlosOne paper. They give three reasons:
1. “observations at the same site are likely to be more similar than observations from different sites,”  
2. “[in a] model that includes a site-level explanatory variable, it is implicitly assumed that all of the site variation is accounted for by that variable, thus increasing the chance of inferring a significant association when in fact [it is explained by country-level properties]” 
3. “in a hierarchical modeling approach is that predictions can be obtained for all sites in the analysis, even those where only a few carcasses are observed”
None of these three points is a reason not to use the approach in our analysis. Point (1) is important, and it is fully addressed by the site-specific fixed effects in our analysis (to the same level of “rigor” as in Burn et al). Point (2) is irrelevant to our analysis: we do not try to partition how site-level average PIKE is driven by site or country level factors, so all we care about is removing site-level averages so they do not bias our estimate for the size and shape of the discontinuity. Point (3) is also irrelevant, because we do not wish to extrapolate average PIKE estimates to locations where there are no data.

Finally, it is worth noting that the random effects model that Dr. Underwood advocates is actually a less general version of the fixed effects model that we use. This means that if the random effects assumptions are true (i.e. the hierarchical structure and the normality of the site effects), then both Dr. Underwood’s approach and our approach will obtain unbiased results. However, if those assumptions are wrong, Dr. Underwood’s approach will be biased but ours will remain unbiased.
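The nesting claim can be written down in a simplified linear form. The symbols below are illustrative notation, not taken from either paper, and the sketch ignores the logit link and the nested country/site structure in Burn et al.; it is only meant to show where the two approaches differ.

```latex
\begin{align*}
\text{PIKE}_{it} &= \alpha_i + \theta_t + \varepsilon_{it}
  && \text{site effect } \alpha_i,\ \text{year effect } \theta_t \\
\text{Fixed effects:}\quad & \alpha_i \ \text{is an unrestricted constant for each site } i \\
\text{Random effects:}\quad & \alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2),
  \ \text{drawn independently of the regressors}
\end{align*}
```

Read this way, random effects is the special case in which the site constants happen to satisfy the distributional restriction. When the restriction holds, both estimators target the same year effects; when it fails, only the unrestricted version is unaffected.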

In conclusion, Burn et al. 2011 appear to have failed to find what we found because their use of a fifth-degree polynomial for eight data points smoothed over the discontinuity, and the other assumptions they made appear to have further obscured it.
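The smoothing mechanism can be illustrated with a toy series. The sketch below fits a fifth-degree polynomial by ordinary least squares to a made-up eight-point step (0 before 2008, 1 from 2008 onwards). It is not the Burn et al. model, which is a Bayesian hierarchical logit, and it does not use the PIKE data; the years and the step are assumptions for illustration.

```python
import math

def polynomial_fit(xs, ys, degree):
    """Least-squares polynomial fitted values, computed by Gram-Schmidt
    orthogonalization of the columns 1, x, x^2, ..., x^degree."""
    basis = []
    for d in range(degree + 1):
        v = [x ** d for x in xs]
        for b in basis:  # remove the part explained by lower-order terms
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    fitted = [0.0] * len(xs)
    for b in basis:  # project y onto each orthonormal column
        coef = sum(yi * bi for yi, bi in zip(ys, b))
        fitted = [f + coef * bi for f, bi in zip(fitted, b)]
    return fitted

# Toy series: eight periods with an abrupt jump of 1.0 in 2008 (assumed).
years = list(range(2003, 2011))
series = [1.0 if y >= 2008 else 0.0 for y in years]

# Rescale years to [-1, 1] to keep the fit numerically well behaved.
scaled = [(y - 2006.5) / 3.5 for y in years]
fitted = polynomial_fit(scaled, series, degree=5)

i07, i08 = years.index(2007), years.index(2008)
true_jump = series[i08] - series[i07]
fitted_jump = fitted[i08] - fitted[i07]

for y, s, f in zip(years, series, fitted):
    print(f"{y}: data={s:.1f} polynomial={f:+.3f}")
print(f"jump 2007->2008 in data: {true_jump:.2f}, in polynomial: {fitted_jump:.2f}")
```

In this toy case the polynomial assigns only about half of the jump to 2007-2008 and spreads the rest over neighboring years, because a smooth curve cannot reproduce an abrupt step. How much gets spread depends on where the jump falls within the window and on the polynomial degree, so the numbers here should not be read as an estimate of the effect in the actual analyses.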

One other note: Underwood et al. 2013 conduct an analysis of the ETIS seizures data, which we use as corroborating evidence in our manuscript. In that work, they apply a similar data smoothing approach; they use a sixth-order polynomial to describe the trend in seizures, again assuming a functional form that mechanically cannot reveal a clear discontinuity even if one exists. Figure below, compare to figure 6 in our paper (blue figure at the beginning of this post).

Transparency, working papers, and moving forward

In the context of a debate like the present one, it becomes particularly important that the data and analytical methods of all parties are accessible so they can be poked, prodded, and interrogated. I am an affiliate of the Berkeley Institute for Transparency in Social Science where I have taught classes on how to carefully compare findings across a literature (e.g. see here), and I personally have replicated a large number of studies by other authors, sometimes verifying their original findings and sometimes uncovering inconsistencies and errors (e.g. see here, here, here, here, here). While we should obviously do our best to prevent making errors in research, mistakes happen, and I believe innocent mistakes should be destigmatized. Being transparent about our research can accelerate the pace of scientific progress.

In keeping with this belief, Nitin and I have done what we can to be transparent. We have posted our replication code here so that anyone can examine exactly what we did and verify our findings. In our appendix, we spell out the various checks we did so observers can check our work. We show the statistics for all our covariates and explain why, despite the occasional low p-value, we do not think they undermine our main finding. We also detail a large number of checks on the data in our appendix.

The fact that Burn et al. have not posted their replication code publicly—and that Dr. Underwood did not share it with us when we requested it—has made it very difficult for us (or anyone else) to adjudicate our disagreement about how best to analyze the PIKE data to look for a signal from the one-time sale. 

Overall, we believe we have exerted substantial effort to be transparent with our analysis and handling of data to ensure that our results can be replicated by any other member of the community. We hope that the conservation community will make use of our transparency to check our results and verify the validity of what we have done (and, of course, graciously point out any errors we might have made). We also hope that CITES and other members of the community will move towards greater data transparency wherever it is feasible.

A note on working papers and peer review

Multiple colleagues from ecology have inquired why we released the paper as an NBER working paper prior to peer review. In economics, this is standard procedure. I personally have multiple working papers in circulation, such as this one on the effects of tropical cyclones on economic growth and this one on the effect of temperature on US incomes. Circulating and presenting working papers allows authors to obtain feedback and ideas from colleagues for projects that are close enough to completion that they are coherent and can be understood by a broad audience, while allowing researchers to make adjustments and extensions to their work prior to its final form. In a normal departmental economics seminar, researchers only present working papers and would never present a published paper, since the purpose of the seminar is for the author to obtain feedback from the audience (while of course informing the audience of what work the author is doing). Working papers are regularly cited, both in other working papers and in publications, both inside and outside of economics (e.g. interdisciplinary outlets such as Nature and Science recognize this different norm and allow working papers to be cited), as well as by policy-makers. Many of the norms around working papers in economics are similar to how authors in physics use preprints, as I understand it.

Nonetheless, because we understand that many readers currently engaged in the ivory debate are interested in the findings and unsure if they should trust the results of a working paper that has not yet completed the peer review process, we have posted code and data to replicate all of the findings of the paper online here.  Anyone can verify our findings by opening the file and simply running it.

As an aside, we also note that Dr. Underwood and Dr. Burn were critical of us for releasing these findings prior to peer review. We note, however, that their analysis of the PIKE data contracted by CITES was released as an official CITES white paper CoP15 Doc. 44.2 (Rev. 1) for the Fifteenth meeting of the Conference of the Parties in Doha in March of 2010.  It was thus in circulation and officially sanctioned and cited as the basis of policy decisions for more than an entire year before it was submitted for review at PlosOne on April 15, 2011. It was published in September 2011, eighteen months after its endorsement by CITES. We see this as an acknowledgement of the fact that the peer review system cannot always sanction papers at the pace necessary to inform policy. Where the peer review system lags behind, transparent publication of data and replication code as working papers—as well as debates like the one we are currently engaged in—can help make sure that policy makers have the most up-to-date information in making their decisions.

Moving forward

We hope that this document, in combination with the original paper, addresses most questions about our analysis. If you have any more, though, please send them to Nitin (nitin.sekar [at] gmail [dot] com) with an email titled “QUESTION(S) ABOUT IVORY PAPER”. Nitin will schedule a webinar in which he and I will try to address these questions and generally discuss the strengths and limitations of our findings. 


  1. Hello
    This blog post and a report that this links to, describe the main reason why your analysis is so different to other analyses of MIKE data. It is not because of smoothing as you claim here, it is because your analysis does not account for the fact that the number of carcasses found varies hugely between sites and over time. R code to follow.

    1. We have responded to the updated set of arguments presented in Dr. Underwood's blog post here, where we formally derive various errors in Dr. Underwood's proposals. This is the summary conclusion from our post:

      "The three critiques/suggestions offered by Dr. Underwood are not logically consistent with themselves, since they make contradictory assumptions about whether PIKE corrects for elephant populations/surveyor effort and whether or not the total count of carcasses discovered at each site is actually a random variable or not. Furthermore, each of the three points raised is itself independently erroneous, either because the mathematical assumptions Dr. Underwood makes cannot possibly be true (in the case of 2 and 3) or because these assumptions are clearly overturned by the data (in the case of 1). We therefore conclude that none of the critiques offered by Dr. Underwood are valid."