Monday, September 26, 2016

Errors drive conclusions in World Bank post on ivory trade and elephant poaching: Response to Do, Levchenko, & Ma

(Spoiler alert: if you want to check your own skills in sniffing out statistical errors, skip the summary.) 


In their post on the World Bank blog, Dr. Quy-Toan Do, Dr. Andrei Levchenko, and Dr. Lin Ma (DLM for short) make three claims that they suggest undermine our finding that the 2008 one-time sale of ivory corresponded with a discontinuous 66% increase in elephant poaching globally. First, they claim that the discontinuity vanishes for sites reporting a large number of carcasses—i.e., there was only a sudden increase in poaching in 2008 at sites reporting a small number of total carcasses. Second, they claim that price data for ivory (which they do not make public) from TRAFFIC do not show the increase in 2008 that they argue would be expected if the one-time sale had affected ivory markets. Third, they claim that re-assigning a seemingly small proportion of illegal carcasses counted in 2008 to be legal carcasses makes the finding of a discontinuity in 2008 less than statistically significant, and they speculate that a MIKE (CITES) initiative to improve carcass classification may explain the discontinuous increase in poaching in sites with small numbers of carcasses.

In this post, we systematically demonstrate that none of these concerns are valid. First, as it turns out, the discontinuity does hold for sites reporting a large number of carcasses, so long as one does not commit a coding error that causes a systematic omission of data where poaching was lower before the one-time sale, as DLM did. Furthermore, we show that DLM misreported methods and excluded results that contradicted their narrative, and that they made other smaller coding errors in this part of their analysis. Second, we note that, notwithstanding various concerns we have about the ivory price data, our original paper had already derived why an increase in poaching due to the one-time sale would not have a predictable effect (and possibly no effect) on black market ivory prices. Finally, we note that (i) DLM provide no evidence that training on carcass identification could have led to a discontinuous change in PIKE (in fact, the evidence they provide contradicts that hypothesis), and that (ii) in the contrived reclassification exercise modeled by DLM, the likelihood of surveyors making the seven simultaneous errors that DLM describe as very likely is, even under generous assumptions, about 0.35%, i.e. extraordinarily unlikely.

Overall, while DLM motivated an interesting exercise in which we show that our result is robust to classification of sites based on the number of carcasses found, they provided no valid critiques of our analytical approach or results. Their central conclusion, that our results should be dismissed, was the result of a sequence of coding, inferential, and logical errors.

Main Text:

 Dr. Quy-Toan Do (World Bank), Dr. Andrei Levchenko (University of Michigan) and Dr. Lin Ma (National University of Singapore) (DLM for short) recently blogged a critique of my recent paper with Nitin Sekar on one of the World Bank blogs. The authors follow on the heels of a post by Dr. Fiona Underwood which argued that observations should be weighted by total elephant carcasses discovered at each site. Consistent with our reply to Underwood’s critique, DLM identify that weighting by an outcome variable is a very bad idea, citing a recent paper by Solon et al. (2013) which pointed out that weighting does not actually recover population-average treatment effects as many folks seem to think.

DLM follow the advice of Solon et al., who say that weighting should only be used as a diagnostic test. If weighting gives a different answer, that indicates heterogeneity that should be hunted down. DLM try to implement this by separating out sites that have more than 20 total carcasses on average for the entire sample, and then searching for a discontinuity in poaching separately for these “large” sites. DLM verify that a strong discontinuity persists for the “2/3” of the sample that are below this threshold, but they dismiss this as unimportant because the bulk of reported poached elephants in the MIKE data are at the large sites. They display a single graph trying to replicate our results for the “1/3” of the sample that are large sites and argue that there is no discontinuity in these more important locations.

(In case you are wondering, I put “1/3” and “2/3” in quotes because DLM seem to use a very liberal rounding algorithm that suggests their result is more general than it actually is. Their “1/3” of the sample is 128 observations out of 562, which is 22.8%, less than 1/4 of sites.  This is a technicality, but it should make you uncomfortable. It turns out that the generality of the finding they claim for this 1/3 or 1/4 of the sample doesn’t matter, since it was based on an error. Read on.)

DLM look for confirmation that the 2008 discontinuity that they confirm in the 77.2% of “smaller” sites is not important by examining some nonpublic ivory price data commissioned by the Bank (we do not know if this is the black or white market price), and report finding no discontinuity. (I put “smaller” in quotes because, as discussed below, these sites are not actually smaller and they aren’t even a fixed sample of sites as implied by DLM’s references throughout.)

Finally, the authors argue that the poaching results for the 77.2% of remaining sites might in fact simply be an artifact caused by reporting errors. The authors speculate that perhaps the large discontinuity is due to an abrupt change in how elephant carcasses are miscategorized by fieldworkers. To “test” this hypothesis, they simulate re-categorization of the most influential illegal carcasses only (those from the so-called “small” sites in 2008) and ratchet up the number of re-categorizations required to depress our main result to just below marginal statistical significance. The authors find that under just the right circumstances, re-categorizing 7 carcasses could render our results just barely “not statistically significant”. They conclude that because 7 sounds like a small number, this overturns our results.

We agree with DLM’s first point that changing estimates with weights indicates heterogeneous treatment effects, which we explore further below. In fact, this point did motivate us to do checks that further strengthened our confidence in the robustness of our findings. However, after that statement, there are so many key omissions and errors (both regarding our work and basic facts about the MIKE data collection system) in DLM’s analysis that it provides no additional insight into the issue and is extraordinarily misleading. Below, we examine each of DLM’s main claims and demonstrate that they are either unsubstantiated, contradicted by their own results, or the result of coding errors.

DLM essentially make three claims:

1) The discontinuity in poaching vanishes for sites reporting large numbers of total carcasses. This is true if “large” is defined as more than 2 or more than 20 carcasses.

2) There is no similar discontinuity in some private ivory price data set that others do not have access to, and that this casts doubt on whether there was a discontinuous increase in poaching.

3) Arbitrary deletion of key observations in 2008 (i.e. randomly recoding illegal carcasses as legal only in sites with fewer than 2 total carcasses and only in 2008) obscures the statistical significance of the main findings.

Based on these claims, DLM conclude that overall “poaching therefore did not experience a step increase in 2008 as argued by the authors… Rather, we postulate that small changes in the classification of carcasses could account for the results documented by Hsiang and Sekar.”

We address each of these three points below, focusing most attention on (1) since, as we explain, (2) and (3) are pretty much irrelevant distractions.

1) Is it true that the discontinuity vanishes for sites reporting large numbers of total carcasses? Is this true if “large” is defined as more than 2 or more than 20 carcasses?

There are multiple overlapping errors in DLM’s arguments here, so I will do my best to help separate the issues and make things clear.

The authors propose to look at “large” and “small” sites separately, where large and small are defined by the total number of carcasses observed at each site. They state that they use a 2 carcass cutoff and a 20 carcass cutoff as two alternative definitions of “large” and that they recover the same thing either way.

First, DLM’s use of the words “large” and “small” throughout their post suggests something about geographic size, total elephant populations, or some other measure of overall representativeness (they use these terms the same way that one might call a city large because it contains many people). This is misleading. The MIKE sites do have differing elephant populations, but these sizes are not the sizes that DLM are using. For example, the site Lope National Park in Gabon is, according to the most recent estimates from the African Elephant Database (AED), in the top 20% of AED sites based on elephant population (4142 elephants in 2009), but it is actually coded as one of the few “small” sites (using the 2 carcass cutoff) in 2008 in DLM’s analysis.

Second, DLM’s use of “large” and “small” suggests that the large sites somehow capture most of what is going on with poaching in Africa, so events in small sites are ignorable for practical purposes. Specifically, they state:

“The total number of carcasses (the denominator in the PIKE ratio variable) ranges from 0 to 310. Out of the total 77 sites available in the MIKE data, 24 have an average of less than 3 carcasses. At the same time, 7 sites have an average of more than 50 carcasses. These few sites account for 51.5% of all identified illegally-killed elephant carcasses in the dataset. In other words, while there are many sites in the MIKE database, the bulk of the total elephant poaching occurs at the large sites… 
What do these results suggest? In the presence of heterogeneity across sites, a simple unweighted average across sites is misleading. Because large sites are where most of the overall poaching takes place, the figures in Panels (b) or (d) above are thus better description of what happened on the aggregate than Panels (a) or (c) are: basically nothing happened beyond the ongoing upward trend.”

For starters, this logic is flawed because a large number of carcasses may have been detected at a given site for a variety of reasons, like better visibility of carcasses (e.g., in a savanna versus a forest), better surveillance due to more rangers, or high rates of natural mortality—not just because that’s where there’s the most poaching. But even if DLM’s so-called large sites truly had more poached elephants, they capture only a tiny fraction of what’s going on. In our sample there is a total of 6,098 elephants poached, whereas there were an estimated 307,636 elephants poached from 2003-2013. Thus, all of the poaching recorded in the MIKE data only represents roughly 2.0% of all poaching. So to argue that a site where an average of 10 elephants were poached annually (0.04% of annual poaching) represents global patterns much more accurately than a site where 2 were poached (0.01% of annual poaching) seems a little silly.  I think the right interpretation of the MIKE data is that we have a lot of relatively poor observations scattered across space where we don’t know which are more or less representative, but that systematic patterns across many of these sites simultaneously might signal that continental-scale changes are occurring.
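The shares quoted above can be checked with a few lines of arithmetic. This is only a sketch: it uses the two totals given in the text and assumes, for illustration, that 2003-2013 spans 11 years and that the estimated poaching is spread evenly across them. The variable and function names are made up for this example.

```python
# Figures quoted in the paragraph above.
mike_poached = 6_098      # elephants recorded as poached in the MIKE sample
est_poached = 307_636     # estimated elephants poached, 2003-2013
n_years = 11              # assumption: 2003 through 2013 inclusive

share_of_total = mike_poached / est_poached
annual_poaching = est_poached / n_years  # assumption: spread evenly over years


def pct_of_annual(n_elephants):
    """Percent of estimated annual poaching represented by n_elephants."""
    return 100 * n_elephants / annual_poaching


print(f"MIKE share of all poaching: {100 * share_of_total:.1f}%")
print(f"Site with 10 poached per year: {pct_of_annual(10):.2f}%")
print(f"Site with 2 poached per year: {pct_of_annual(2):.2f}%")
```

Under these assumptions the rounded values match the 2.0%, 0.04%, and 0.01% figures in the text.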

Third, when DLM break their sample into large and small, using their preferred cutoff of 20 total carcasses, they mess it up by including sites with 20 total carcasses in both the large and small groups. Their exact code is:
/* TOT <= 20 */
replace sample = 1
replace sample = 0 if tot >= 21
areg pike year post_event if sample == 1, absorb(site) cluster(ccode2) 
/* TOT > 20 */
replace sample = 1
replace sample = 0 if tot < 20
areg pike year post_event if sample == 1, absorb(site) cluster(ccode2)
where “replace sample = 0 if tot < 20” should have been “replace sample = 0 if tot < 21”. This is minor, since only 12 sites (2.1% of the sample) ever report exactly 20 total carcasses, but it is a good indicator/preview of DLM’s overall carefulness when analyzing the data.
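To make the overlap concrete, here is a small Python sketch that mirrors the logic of the three Stata conditions discussed above (it is not DLM's code, and the function names are invented for this illustration). It lists which carcass totals end up in both groups.

```python
def in_small(tot):
    # Mirrors "replace sample = 0 if tot >= 21": keep tot <= 20.
    return not (tot >= 21)


def in_large_as_coded(tot):
    # Mirrors "replace sample = 0 if tot < 20": keeps tot >= 20.
    return not (tot < 20)


def in_large_corrected(tot):
    # Mirrors the corrected "replace sample = 0 if tot < 21": keeps tot >= 21.
    return not (tot < 21)


totals = range(0, 41)
overlap_as_coded = [t for t in totals if in_small(t) and in_large_as_coded(t)]
overlap_corrected = [t for t in totals if in_small(t) and in_large_corrected(t)]

print("In both groups as coded:", overlap_as_coded)
print("In both groups after the fix:", overlap_corrected)
```

Only observations with exactly 20 total carcasses are double-counted, which is why the effect of this particular slip is small.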

Fourth, DLM report that the discontinuity vanishes for large sites when they use a very low cutoff for what constitutes a large site.  They claim that the discontinuity vanishes even when “large” is defined by a cutoff of 2 carcasses, which matters because this means the finding is very representative of the data:
“When we split the sample into small and large sites, the result that PIKE increases is only present in the subsample of small sites, irrespective of whether we classify a site as small versus large using a cutoff of 2 carcasses (1/3 of the data), or 20 carcasses (2/3 of the data), or anything in-between.”
This is just not true.

The authors sent us replication code where they correctly restrict the sample to the lower cutoff (small is less than or equal to 2 total carcasses, large is larger than 2) and report results, so we know that they looked at these numbers and made a decision not to report them:
/* ToT > 2 */
replace sample = 1
replace sample = 0 if tot < 3

areg pike i.year if sample == 1 , absorb(site) cluster(ccode2)
What happens when you run this?

This happens:

Okay, so this is pretty bad. There is a clear discontinuity when you look at “large” sites defined as sites with 3 or more total carcasses, even though DLM explicitly said there was none. They sent me code documenting that they had looked at this, while publicly stating something completely different.

So, what about their results when you make the cutoff at 20 carcasses?

This leads me to the fifth problem I found with DLM’s analysis. I originally emailed DLM for their replication code because I could not replicate their “large” (>20) result based on what they described in their post. Why I had difficulty will become clear in a moment. But first it is worth documenting that their purported “non-discontinuity” reported in the blog post for the higher 20 carcass cutoff is actually (approximately) consistent with their code (recall that their actual code ran this for >19). But close enough, right?

So why did I have trouble generating the result above without seeing DLM’s code? Because when I tried to replicate their code, I thought that when they referred to a “large site” they meant that there was some property of the site that made it different from other sites. This interpretation was supported by their statement (emphasis added)
“Out of the total 77 sites available in the MIKE data, 24 have an average of less than 3 carcasses. At the same time, 7 sites have an average of more than 50 carcasses. These few sites account for 51.5% of all identified illegally-killed elephant carcasses in the dataset. In other words, while there are many sites in the MIKE database, the bulk of the total elephant poaching occurs at the large sites.”
This made me think they were looking at the average number of carcasses in each site to determine which were large and which were small. From their post, this seems like a mostly reasonable strategy, given that they think overall carcass discovery rates should reflect things like “elephant population, … accessibility (forest vs. savanna), resources for patrols, etc.” and average rates of overall carcass reporting might be correlated with these factors. This sounds reasonable and seems to be what is indicated by their post, but it is not what they did. What they actually did is the biggest and most concerning error. (Hint: If you know Stata and are a careful reader, you might already realize the problem…)

DLM are selectively removing observations from their “large” sample based on the outcome variable. Specifically, “large” sites are considered “large” only for the years in which they report a large number of carcasses. In years when too few carcasses are discovered, that exact same site is coded as “small.” Thus “large” and “small” as reported by DLM are not properties of actual sites; they are properties of what happened at a site in a specific year. In years where poaching rates were low enough that total carcass reporting fell below the 20 carcass threshold, a site was dropped from the “large” sample. This means there was systematic removal of observations when the number of carcasses, and thus poaching rates, were low. It should therefore be no surprise that DLM observe no change in poaching in 2008, since they are selectively looking at sites only when they behave in pretty much the same way. If poaching was low at a site before the sale, those observations were dropped from the sample. Thus only high poaching sites before the sale were compared to high poaching sites after the sale, making it appear that there is no discontinuity.

To be clear, selecting a sample based on the outcome variable is pretty much as close as one gets to a cardinal sin in econometrics, or for that matter, all of science.  An example helps. Imagine that a company has a potential cancer drug. They do an experiment where half of patients are given the drug and half are not. When evaluating whether or not the drug saved lives, they throw out the data for anyone taking the drug who died. They then conclude that the survival rate for the drug-takers was 100%, so the drug is fantastic. Would you take this drug? The problem is that these medical scientists selected their sample based on what happened after the experiment was conducted, so they got a result that didn’t tell us anything about what the drug actually did.

A variant on this example is actually much closer to what DLM did. Imagine that instead of throwing out data on mortality for the treatment group only, these doctors threw out data for anyone in the experiment who died (both in the treatment and the control groups) and then asked if the drug improved survival. In this case, the survival rate for the remaining sample for both the treatment and control group would be 100%, so it would appear that the drug did nothing. This is similar to what DLM did because they selectively removed observations that had low carcass reporting (roughly half of which is poaching) so they only observe sites in years when carcass reporting and poaching are high. This means that nothing really changes before or after the sale because if it did, that data would be discarded.
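The second version of the drug example can be worked through with made-up numbers. In this toy sketch (all counts are invented for illustration), the drug raises survival from 60% to 80%, yet dropping everyone who died from both arms makes the estimated effect exactly zero.

```python
def survival_rate(outcomes):
    """Share of patients who survived (1 = survived, 0 = died)."""
    return sum(outcomes) / len(outcomes)


# Hypothetical trial: 100 patients per arm.
treated = [1] * 80 + [0] * 20   # 80% survive with the drug
control = [1] * 60 + [0] * 40   # 60% survive without it

true_effect = survival_rate(treated) - survival_rate(control)

# Selecting on the outcome: discard every patient who died, in both arms.
treated_kept = [o for o in treated if o == 1]
control_kept = [o for o in control if o == 1]

selected_effect = survival_rate(treated_kept) - survival_rate(control_kept)

print(f"Effect using all patients: {true_effect:+.2f}")
print(f"Effect after dropping deaths: {selected_effect:+.2f}")
```

The real effect has not changed; the filter has simply removed every observation that could have revealed it.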

If you are doing an event study, you need to be sure that you are observing the same sites throughout the study, and then ask if something changed at those sites when the sale is announced. That’s the whole point of the quasi-experimental design: the sites before the sale act as the “control” group for the same site right after the sale, which has received the “treatment.” But this is not what DLM did. You can see this in the snippets of code above that say to restrict the sample that is called “large” based only on what happened in each year that a site was observed, not any average reporting behavior for that site. This is why in the graph above, the title says “current total carcasses >20”, because if a site’s total carcasses fell at some point, those observations were excluded.
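The difference between the two ways of defining “large” can be sketched on a tiny synthetic panel. Everything here is invented for illustration: the two sites, their carcass counts, and the use of raw pre/post means in place of the fixed-effects regression with a trend. The point is only to show how filtering by site-year, rather than by site average, changes which pre-sale observations survive.

```python
from statistics import mean

YEARS = [2005, 2006, 2007, 2008, 2009, 2010]
NATURAL = {"A": 12, "C": 3}          # hypothetical natural deaths per year
ILLEGAL = {                          # hypothetical illegal kills per year
    "A": [4, 10, 4, 12, 14, 12],
    "C": [1, 1, 1, 3, 3, 3],
}

rows = []
for site, kills in ILLEGAL.items():
    for year, illegal in zip(YEARS, kills):
        tot = illegal + NATURAL[site]
        rows.append({"site": site, "year": year, "tot": tot, "pike": illegal / tot})


def jump(sample):
    """Post-2008 mean PIKE minus pre-2008 mean PIKE, plus the pre-sale count."""
    pre = [r["pike"] for r in sample if r["year"] < 2008]
    post = [r["pike"] for r in sample if r["year"] >= 2008]
    return mean(post) - mean(pre), len(pre)


# Definition 1: "large" decided separately for every site-year.
by_site_year = [r for r in rows if r["tot"] > 20]

# Definition 2: "large" decided once per site, from its average total.
site_avg = {s: mean(r["tot"] for r in rows if r["site"] == s) for s in ILLEGAL}
by_site_avg = [r for r in rows if site_avg[r["site"]] > 20]

jump_site_year, n_pre_site_year = jump(by_site_year)
jump_site_avg, n_pre_site_avg = jump(by_site_avg)

print(f"Site-year filter: jump {jump_site_year:.3f}, {n_pre_site_year} pre-sale obs")
print(f"Site-average filter: jump {jump_site_avg:.3f}, {n_pre_site_avg} pre-sale obs")
```

In this toy panel the site-year filter keeps only the one pre-sale year in which site A happened to report many carcasses, so the measured jump shrinks; holding the set of sites fixed keeps all three pre-sale years.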

Here are just four sites whose data I grabbed as examples. The red dots indicate observations that were included in DLM’s sample of “large” sites. The blue circles are PIKE values reported at those same sites in earlier years, but which DLM dropped from their “large” sample.  These sites tend to have lower PIKE values because a given location will tend to have fewer overall carcasses at the same time that poaching rates are low.  Notably most of the low PIKE values are occurring before the 2008 sale. This is exactly why DLM do not recover a discontinuity for their coding of “large” sites, because they force the sample before the sale to have high PIKE values by throwing out data that has low PIKE. 

Frankly, this is super-duper bad. If a medical researcher did the same thing for a medical trial and got caught, it would probably be the last medical trial they ran.

So how much does this mistake cost in quality of results? A lot. Here’s a graph showing the same thing as above for all sites that DLM ever include in their “large” sample, but also showing observations for those same sites when they were dropped from DLM:

As you can see, the included observations tend to have systematically higher PIKE both before and after the sale, but with a bias that is strongest in the two years before the sale. This is what pushes average PIKE in DLM up just before the sale, making it look like there is no discontinuity.

What happens when you fix this? Let me show you what I did before DLM sent me their code. 

Based on the description in their post (quoted above), I thought DLM assigned a site to be “large” if their average total reported carcasses was above 20 per year. So I restricted my sample to the sites that were largest (in terms of total carcasses) over the entire reporting period. This way, I know that I’m looking at the same set of sites both before and after the sale (unlike in DLM). There are fewer sites (only 16, compared to 32 in DLM) that ever get coded as “large” this way since having high reporting in just one year is not enough to make the cut into the large group. But it also means I don’t throw out the years with low reporting for the largest sites.  On net, the total sample size increases to 154 observations (compared to 128 in DLM) since there were more low-reporting years for high-reporting sites than the reverse. 

Below is the comparison of DLM’s result (on the left), and what you get if you restrict the sample to the 16 sites that are largest on average, defined by having average total carcasses > 20, and hold the composition of sites fixed (on the right):

The answer: You get a discontinuity, even for the large sites. And it looks similar in magnitude (+0.091) to the main estimate from our analysis that pools all sites together (+0.129). For those who like formality, this result for the largest sites is definitely not statistically different from what we report in the paper. Conclusion: our results hold for the largest sites.

For those that are curious what you get for the smaller sites when you hold the sample fixed:

which is the result for the 79% of sites that have average total carcass counts below 20. The clearest difference between these small sites and the large sites is that the pre-trend is very different. In our paper, we had very little pre-trend when we looked at the whole sample (because it was an average of these two opposing trends), making the modest upward change in the trend appear to be minor. But looking at the 79% of smaller sites, you see a different story. The trend before the sale was going down, and then it completely and abruptly flipped in 2008. This suggests even more strongly that the broad pattern of increasing poaching across Africa and Asia originated in response to an event in 2008, and as we argue in our paper, the most plausible candidate is the 2008 sale.

In terms of the discontinuity, is the total reporting of carcasses important? Suppose we were interested in the question originally posed by DLM. How would I have answered it? There is actually a much better approach than theirs: simply run the same complete model for the data, but allow the size of the discontinuity to be a function of the average number of carcasses reported at a site. (Mathematically, this is accomplished simply by interacting average total carcass reporting with the post-2008 dummy variable.) Notably, I use the natural logarithm of average total carcasses for this interaction because the distribution of total carcasses is roughly log-normal. When I allow the effect of the sale to vary linearly with the “log size” of sites, this is what I get:
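As a rough sketch of the interaction just described, and not the exact specification behind the figure, the simplest version of such a model could be written as follows. The notation is assumed for illustration: \alpha_i is a site fixed effect, \mathrm{Post}_t is the post-2008 dummy, and \bar{C}_i is the average total carcasses reported at site i. Country-specific trends are omitted for brevity.

```latex
\mathrm{PIKE}_{it} = \alpha_i + \delta\, t
  + \left(\beta_0 + \beta_1 \ln \bar{C}_i\right) \mathrm{Post}_t
  + \varepsilon_{it}
```

In this sketch the discontinuity at a site of a given size is \beta_0 + \beta_1 \ln \bar{C}_i, so a \beta_1 close to zero would mean the jump is about the same at sites of every size. The main effect of \ln \bar{C}_i does not appear separately because it is constant within a site and is therefore absorbed by \alpha_i.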

The black line is our original estimate pooling all the data with the 95% confidence interval in grey. The red line is the effect that is allowed to vary with log size. Basically, we see a very slight downward trend, but essentially we see nothing. Our pooled estimate is a very good approximation of the discontinuity exhibited across the sample.

We can even get a little more flexible, allowing for the effect of log size to be nonlinear. Here is what you get if I allow it to be cubic (a third-order polynomial), which allows for asymmetry. This is stretching the data pretty hard, especially since I’m allowing each country to have separate trends before and after the sale. Regardless, we see that except for the sites with the lowest reporting rates, the estimated effect is always within our originally reported confidence interval. At the smallest sites, the effect looks even larger, but these sites cannot be driving our result since there are not that many of them (see the histogram at the bottom of the figure):

Thus in conclusion: DLM made numerous minor errors (that have little effect) and one catastrophic error, selection of a sample based on outcomes, that undermines all conclusions they draw about whether the discontinuity in poaching depended on the “size” of sites.  Correcting this error by holding the sample of large sites fixed reveals that a discontinuity is also present for these larger sites.  Moreover, looking across the entire spectrum of sizes with a continuous model, we see that at no “size” does the effect of the sale vanish and instead that the pooled estimate is a good approximation of the entire sample (at least in terms of the size of the discontinuity) except for a few of the smallest sites that have an even larger jump in PIKE. 

Whew! On to DLM’s other two points.

(2) Should we discard our findings because DLM find no similar discontinuity in a private ivory price data set?

DLM argue that the null PIKE results for large sites (which were the product of an error) seem justified when they consider trends in ivory prices. This argument is unconvincing and irrelevant.

First, the argument is unconvincing because, at this point, I do not think DLM were very careful in any of the analysis for their blogpost. They made simple coding errors, misreported results and methods, and made catastrophic research design errors all within the first few paragraphs of their analysis. Moreover, I cannot verify their claims about the price data since neither the data nor the code is public.

That said, it doesn’t even really matter. On pages 4-8 of our original analysis, we demonstrated that the price of black market ivory (what DLM probably care about) and legal ivory are not and should not, even theoretically, be the same. We don’t know what market this data is looking at (black vs white) and from DLM’s text, it sounds like a mixture. 

However, even more importantly, even if we knew what market price we were looking at, it still wouldn’t tell us anything very useful! In our original analysis (just see Figure 1B if you are feeling lazy) we demonstrated that prices could do anything while poaching was rising. They could rise, they could fall, or they could remain constant. Thus looking at price data is uninformative, since it doesn’t really tell you anything about what is going on with poaching. This is spelled out very explicitly on page 8 of our paper, a fact that DLM appear to have conspicuously omitted when using price data to advance their arguments:

From Hsiang and Sekar:

Conclusion: Examining price data is an uninformative exercise since it is not constrained to reflect changes in poaching, so price results where no change occurs would remain fully consistent with our original PIKE findings. However, we do not know if the DLM price data are actual black market prices and it is very difficult to verify their analysis of that data since neither the data nor their code are public.

(3) Should we discard our findings because arbitrary deletion of key observations in 2008 (i.e. randomly recoding illegal carcasses as legal only in sites with fewer than 2 total carcasses and only in 2008) obscures the statistical significance of the main findings?

DLM argue that based on their PIKE analysis (which we showed was erroneous above) and their price data analysis (which we showed to be irrelevant in the original paper), it seems likely that our main result is a statistical artifact caused by the mislabeling of legal carcasses as illegal in 2008 within the “small” sites.  This hypothesis is completely speculative. DLM’s only reported motivation is:
“We have reasons to believe that the classification of carcasses might have experienced some changes around 2008. The MIKE programme received a large grant through an ACP project from the European Development Fund. The training of and additional resources for rangers that could be funded might have led to increased classification of carcasses as illegal kills…”
The only evidence they have to point to is that “CITES reports a gradual decline over time in the proportion of carcasses for which the cause of death is either recorded as “unknown” or is missing, which, for the construction of PIKE, is labeled as “not illegally killed” (as no evidence of illegal killing was reported).” They support this with a link to the report containing this figure, where the fraction of unknown carcasses declines linearly and steadily from 2005-2010:

There is no discontinuity in “unknown” reporting to motivate them, and they don’t even tell us the date the program started, it is simply “around 2008.” This is extremely flimsy motivation, but it’s worth walking through the errors in their actual simulation as well.

DLM explain how they test their speculative theory about mis-classification:
“Figure 3 below replicates Hsiang and Sekar’s analysis by randomly re-classifying carcasses from illegal to legal in small sites (two total carcasses or less) in 2008. Each point is based on 200 random re-classifications. The figure shows that the reclassification of only 7 carcasses from illegal to legal death is all that is needed for the 2008 step increase to no longer be statistically significant at conventional levels (10 percent as per Hsiang and Sekar’s analysis). Seven carcasses is a small fraction of the nearly 1,000 carcasses found in MIKE data for 2008, among which around 500 are classified as illegally killed.”
Where Figure 3 is:

There are multiple things wrong with what DLM have done here, besides setting out on a flimsy hypothesis to begin with.

First, it is impossible for them to reclassify more than 10 illegal carcasses in 2008, as they report they do, since there are only 10 there to begin with (in their figure, they go all the way up to 20!). In 2008, there are 15 sites that qualify as “small” by their definition of 2 or fewer total carcasses (one has 0 total carcasses, seven have 1, and seven have 2), for a total of 21 carcasses at these sites. Of those 21 carcasses, ten were listed as “illegal” in the MIKE data. So first off, I do not know how it is even possible that DLM think they are randomly switching more than ten of these carcasses from “illegal” to “legal.” The x-axis of their graph runs from 0 to 20 re-assignments, but only 0-10 re-assignments are possible.

What are they doing? I have no idea, but it can’t possibly be what they claim to be doing. If you were to clip their graph at 10 carcasses, which is the limit because there are not more than ten carcasses to switch, then this graph would look like it supported our findings a lot more than it refuted them (that is, assuming they did it right for fewer than 10 re-assignments—something that now feels like a strong assumption):

Perhaps more importantly, the authors propose that changing seven of these ten illegal carcasses is a small change, apparently because seven is a small number compared to “the nearly 1,000 carcasses found in MIKE data for 2008.” But changing seven out of ten carcasses is not a small change. It is a huge change! Some basic probability calculations would have alerted DLM to this. Continuing with our assumption that DLM calculated things right for 0-10 re-assignments in the clipped graph above, let’s put some probabilities to the scenarios that they plot along the x-axis.

If the chance of mis-assignment in 2008 is constant across the 15 small sites, then the probability of getting a certain number of binary mis-assignments from a fixed set of 10 ambiguously illegal carcasses follows a binomial distribution:

$$P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

where k is the number of mis-assignments, n=10 is the number of possibly illegal carcasses, and p is the probability that a carcass previously marked illegal should actually be marked as legal. Holding p fixed at some assumed value, this tells us the probability of landing at some point on the x-axis of DLM’s figure above, where k is the value on the x-axis. Let’s think about what these probabilities could possibly look like.

Suppose the chance of mis-assignment of a single legal elephant as illegal is p=0.25 (it happens one in four times, which seems like a very large error rate to me). Then the chance of 7 or more mis-assignments out of ten illegal carcasses is 0.0035, or 0.35%! (0.0035 = one minus the binomial CDF evaluated at k = 6, with n = 10 and p = 0.25, which is the probability that k is 7 or greater.) This means there is a 0.35% chance of being on DLM’s graph at 7 or anywhere to the right of 7. This suggests that the event that they argue likely explains why we are erroneously seeing a discontinuity is actually extremely rare, happening in roughly 1 out of 286 possible realizations.

But maybe this number is small because we assumed a p that is too low? If the chance of mis-assignment is a whopping p=0.4, then the chance of 7 or more critical mis-assignment errors in 2008 is still only 5.5%. In 94.5% of cases, one will still recover the same level of significance that we originally report.

Finally, if the original surveyor was just totally random, so that the probability of assignment to legal status is p=0.5 (i.e. s/he completely guessed whether an elephant was legal or illegal without ever going to see the carcass), then the chance of getting 7 or more critical errors is still only 17%. So even in the extraordinarily extreme case where the MIKE system is doing nothing more than blindly tossing a fair coin to assign whether any carcass was poached, the DLM hypothesis (that the “true” effect should not be statistically significant) could only be true in at most roughly 1 out of 6 realizations (17% of the time).
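The three tail probabilities quoted above are easy to check. Here is a minimal Python sketch, assuming (as in the text) n = 10 candidate carcasses, a threshold of 7 re-assignments, and an independent, constant error probability p for each carcass. The function name `binom_tail` and the variable names are ours, chosen for illustration only.

```python
from math import comb

def binom_tail(k_min, n, p):
    """Return P(K >= k_min) for K ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

n = 10      # illegal carcasses at "small" sites in 2008
k_min = 7   # re-assignments DLM need to lose significance

# Tail probability for each assumed error rate discussed in the text
tail_probs = {p: binom_tail(k_min, n, p) for p in (0.25, 0.4, 0.5)}

for p, tail in tail_probs.items():
    print(f"p = {p}: P(K >= {k_min}) = {tail:.4f}")
```

Rounded to the precision used in the text, the printed values should correspond to the 0.35%, 5.5%, and 17% figures. Note that the sum runs over k = 7 through 10, which is the same as one minus the binomial CDF evaluated at 6.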

More realistically, if the original surveyors were even slightly better than pure random, making errors less than 44.8% of the time (which really is only barely better than random), then the DLM hypothesis could only be true in less than 10% of cases; i.e., our original results would be “statistically significant” by DLM’s test, in the sense that they would be unlikely to arise by chance from random mis-assignment of carcasses alone. It seems extremely hard to believe that these MIKE surveyors are only slightly better at their job than a random number generator, which would indicate that our findings are almost guaranteed to pass DLM’s completely arbitrary test.
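The 44.8% threshold can be recovered numerically. The sketch below is one way to do it, assuming the same binomial setup (n = 10, at least 7 re-assignments) and using simple bisection, which works because the tail probability rises monotonically with p. The helper names `binom_tail` and `threshold_p` are hypothetical, not taken from any analysis code.

```python
from math import comb

def binom_tail(k_min, n, p):
    """Return P(K >= k_min) for K ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

def threshold_p(k_min, n, target, tol=1e-9):
    """Bisect for the error rate p at which P(K >= k_min) equals target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_tail(k_min, n, mid) < target:
            lo = mid   # tail still below target: p must be larger
        else:
            hi = mid   # tail at or above target: p must be smaller
    return (lo + hi) / 2

# Error rate at which 7+ of 10 mis-assignments happen 10% of the time
p_star = threshold_p(7, 10, 0.10)
print(f"threshold error rate: {p_star:.3f}")
```

Any per-carcass error rate below the value printed here keeps the chance of seven or more re-assignments under 10%.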

This whole calculation follows in the spirit of DLM’s bizarre simulation. But if one were to do this exercise more seriously, one would need to consider the possibility of mis-assignment in the reverse direction (illegal carcasses mistaken as legal), in larger sites (by their definition), and in later years. As currently implemented, the DLM simulation stacks the deck in favor of the result they want in every single way (focusing only on illegal carcasses in small sites in 2008), and it only just barely eliminates the statistical significance that they wish to break in the most extreme scenario.

Careful math aside, it is philosophically ridiculous that DLM are interested in working so hard to just depress the result to barely marginal significance. Showing that a result might have a p-value of 0.089 instead of 0.09 less than 5% of the time is really just plain old absurd. If everyone just deleted the most critical bits of one another’s data until results just barely held, under the speculation that obscure and random deletion is actually more accurate than reported official data, then there really is no point to even bothering to look at data in the first place. This kind of approach of arbitrarily “correcting” data until a desired result is achieved is not something I would expect to be published on an official World Bank website.

Overall summary

No statement or calculation in the World Bank blog post by Do, Levchenko, and Ma provides any reason to discard our original findings, as DLM suggest they do. Their claims are based on a large number of statistical, coding, and inferential errors. When we correct their analysis, we find that our original results hold for sites that report a large number of total carcasses, and that the possibility that our findings are artifacts of the data-generating process that DLM propose is extremely remote under any plausible set of assumptions. We also reiterate the point made explicitly and clearly in our paper that examination of ivory prices is broadly uninformative as to whether poaching did or did not rise in response to the 2008 sale.
