Tuesday, September 20, 2016

Normality, Aggregation & Weighting: Response to Underwood’s second critique of our elephant poaching analysis

Nitin Sekar and I recently released a paper in which we tried to understand whether a 2008 legal ivory sale, coordinated by CITES and designed to undercut black markets for ivory, increased or decreased elephant poaching. Our results suggested that the sale abruptly increased poaching.

Dr. Fiona Underwood, a consultant hired by CITES and serving on the Technical Advisory Group of CITES' MIKE and ETIS programs, had previously analyzed the same data in an earlier paper with coauthors and arrived at different conclusions. Dr. Underwood posted criticisms of our analysis on her blog.  We replied to every point raised by Dr. Underwood and posted our replication code (including an Excel replication) here. We also demonstrated that when we analyzed the data published alongside Dr. Underwood's 2011 study, in an effort to reconcile our findings, we essentially recovered our main result that the sale appeared to increase poaching rates. Dr. Underwood has responded to our post with a second post. This post responds to each point raised in Dr. Underwood's most recent critique.

Dr. Underwood's latest post and associated PDF file make three arguments/criticisms:

1) Dr. Underwood repeats the criticism that our analysis is inappropriate for poaching data because it assumes normality in residuals.  Dr. Underwood's intuition is that the data do not have this structure, so we should instead rely on generalized linear models she proposes that assume alternative error structures (as she did in her 2011 PlosOne article). Dr. Underwood's preferred approach assumes the number of elephants that die at each site in each year is deterministic, and the fraction that are poached at each site-year (after the total number marked to die is determined) is determined by a weighted coin toss where PIKE reflects the weighting (we discuss our concerns with this model in our last post).

2) Dr. Underwood additionally and specifically advocates for the evaluation of Aggregate PIKE to deal with variation in surveillance and elephant populations (as has been done in many previous reports) rather than average PIKE, as we do. Dr. Underwood argues that this measure better accounts for variation in natural elephant mortality.

3) To account for variation in the number of elephants discovered, Dr. Underwood indicates that we should be running a weighted regression where the weight of each site-year is determined by total mortality, equal to the sum of natural and illegal carcasses discovered in each site-year.

We respond to each of these points individually below. We then point out that these three criticisms are themselves not consistent with one another. Finally, we note that this debate demonstrates the importance of having multiple approaches to analyzing a data set as critical as PIKE.

1) Is it okay to assume that PIKE residuals are normal in the Hsiang & Sekar model? Or are we required to use binomial and other GLM models that assume different residual structures?

In her critique, Dr. Underwood correctly describes the most parsimonious statistical model that we use (and that we describe in detail in our previous post). However, she argues that the assumption that the residuals (p_ij in her table) are normally distributed cannot be correct, based on her intuition about the data. This motivates her to use more complex GLM approaches that make more numerous and stronger assumptions that are very difficult to defend (such as the assumption that the number of elephant carcasses at each site is predetermined each year, which we think is indefensible), and that lead to her different conclusions. This logic is spelled out explicitly in her replication code, where she writes (emphasis added):

# Average to get mean value for each year
pred.av <- tapply(pred, exp.gd$year.f, mean)
...
# But I don't like the fact that the data are treated as normal data
# Fit a Binomial model instead
resp <- with(data.0, cbind(illegal, totcarc - illegal))
glm1 <- glm(resp ~ siteid + year.f, family = binomial, data = data.0)

In contrast to Dr. Underwood, in our original analysis we did not simply trust our intuition about what we thought the data should look like. Instead, we looked at what the data looked like. And we included the necessary checks in our original paper. In the original appendix section "Checking assumptions of a linearized approach," we discussed the assumption that residuals of the model were normal and we presented Appendix Figure A8, which compared the CDF of our residuals to the CDF of a normal distribution.  If the distribution is normal, then most of the dots should lie very near the line, which they do:

Figure A8 from Hsiang & Sekar (2016) verified that the original model produces normally distributed errors.

For those who don't think in CDFs but prefer PDFs, this is what the exact same comparison looks like in terms of probabilities: our data are the red line, and the ideal normal distribution is the grey line. Given that we have fewer than 600 data points, this looks very good.

Thus, the necessary properties of the model (which Dr. Underwood confirmed were the necessary assumptions) are satisfied by a direct test that was presented in the original paper but not mentioned in Dr. Underwood's critique. The rhetorical claim that our model is obviously wrong because it does not make stronger assumptions (e.g., that binomial models are necessary) is erroneous. Whatever one's initial intuition about the data, the actual structure of the data can be checked by simply looking at the data.
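For readers who want to run this style of check themselves, here is a minimal sketch using simulated, hypothetical panel data (not the MIKE data): demean a PIKE-like outcome by site and year, then compare the empirical CDF of the residuals against a normal CDF, as in Figure A8.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)

# Hypothetical PIKE-like panel: site effects + year effects + normal noise.
sites, years = range(40), range(12)
data = {(s, y): 0.3 + 0.01 * s + 0.02 * y + random.gauss(0.0, 0.1)
        for s in sites for y in years}

# Two-way demeaning: subtract site means and year means, add back grand mean.
site_mean = {s: mean(data[s, y] for y in years) for s in sites}
year_mean = {y: mean(data[s, y] for s in sites) for y in years}
grand = mean(data.values())
resid = [data[s, y] - site_mean[s] - year_mean[y] + grand
         for s in sites for y in years]

# Compare the empirical CDF of standardized residuals to the normal CDF.
z = sorted((r - mean(resid)) / stdev(resid) for r in resid)
nd, n = NormalDist(), len(z)
max_gap = max(abs((i + 1) / n - nd.cdf(v)) for i, v in enumerate(z))
print(max_gap)  # a small maximum gap suggests approximately normal residuals
```

The same comparison on real residuals would replace the simulated panel with the model's fitted residuals; a large maximum gap would flag a departure from normality.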

Given that these checks were included in our paper as posted on NBER, it is not immediately clear why Dr. Underwood continues to express concern about the appropriateness of our model assumptions. Similarly, in her recent critique, she notes that the PIKE values for each year “vary hugely between sites.” However, as also noted in our paper, once the PIKE values are demeaned by site—i.e., each site is only compared to itself over time, and not to other sites—this problem melts away.

Also, in our last post, we expressed concerns about Dr. Underwood’s alternative and untested assumptions. In her most recent response, instead of responding to our specific concerns, Dr. Underwood simply asserts without explanation or evidence “that this approach … is completely uncontroversial.” She reiterates that as long as “you know the numerator and denominator,” one can fit a generalized linear model “assuming a binomial distribution for the data.” We point out that the ability to fit a model does not make it the right model or even a reasonable model. Dr. Underwood has not explained why it is reasonable and correct to assume that the number of dead elephants at each site should be pre-determined, with the illegal killing of each dead elephant determined by a coin toss at the time of observation. She also provided no citations to this effect.

2) Should Aggregate PIKE ever be used instead of average PIKE to account for average trends in poaching rates across many sites?

Our reply: No. We can find no logical justification for this.

We didn't come up with the PIKE metric, but when applied at the site level it turns out to be a cleverly designed metric that fixes potential confounding introduced by variable elephant populations, variable carcass detectability, and variable surveyor effort, as explained in Footnote 8 of our paper.
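A sketch of the cancellation logic, in our own notation rather than the paper's exact footnote: suppose that at site j in year t surveyors detect a fraction d_jt of carcasses, that k_jt elephants are poached, and that m_jt die naturally. Then

```latex
\mathrm{PIKE}_{jt}
  = \frac{\text{illegal carcasses}}{\text{all carcasses}}
  = \frac{d_{jt}\,k_{jt}}{d_{jt}\,k_{jt} + d_{jt}\,m_{jt}}
  = \frac{k_{jt}}{k_{jt} + m_{jt}},
```

so the site's detection effort d_jt cancels, and because natural mortality m_jt scales with the local elephant population, site-level PIKE tracks poaching relative to local mortality.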

In our analysis, we compute average changes in this measure at each site at the moment of the sale announcement, after accounting for secular trends at the site level and all constant differences between sites (both observed and unobserved).

An alternative approach, used in numerous reports, is to compute Aggregate PIKE, which involves summing illegal carcass counts (in the numerator) and total carcass counts (in the denominator) across sites before dividing the two.

In writing, this measure is sometimes treated as though it also corrects for differences in elephant populations and surveyor effort, similar to site-level PIKE. This intuition seems to be drawn from an informal understanding of how the PIKE measure works. But the useful properties of site-level PIKE do not extend to Aggregate PIKE. Because of the summation in the denominator, elephant populations and surveyor effort no longer cancel out. A simple example with two sites makes this clear.
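Writing detected counts as a detection rate times deaths (d_it for detection effort, k_it for poached deaths, m_it for natural deaths at site i in year t; notation ours), Aggregate PIKE for two sites is

```latex
\mathrm{AggPIKE}_{t}
  = \frac{d_{1t} k_{1t} + d_{2t} k_{2t}}
         {d_{1t}(k_{1t} + m_{1t}) + d_{2t}(k_{2t} + m_{2t})}
  = \frac{d_{1t} k_{1t}}{d_{1t}(k_{1t} + m_{1t}) + d_{2t}(k_{2t} + m_{2t})}
  + \frac{d_{2t} k_{2t}}{d_{1t}(k_{1t} + m_{1t}) + d_{2t}(k_{2t} + m_{2t})}.
```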

Which cannot be simplified further. As far as we can tell, this sum has no conceptual meaning that is useful—the number of poached elephants detected at a site is no longer being normalized by site-specific variables, but instead by measures of (e.g.) surveillance effort and carcass detectability from many other sites. The site-specific correction that occurs by calculating PIKE at the site level no longer operates for Aggregate PIKE. 

Suppose location 1 is in Ghana and location 2 is in Botswana. Then Aggregate PIKE for the pair is the sum of total mortality in Ghana, divided by functions of elephant populations in both Ghana and Botswana as well as surveyor effort in both Ghana and Botswana, plus total mortality in Botswana, divided by functions of elephant populations in both Ghana and Botswana as well as surveyor effort in both Ghana and Botswana. We see no plausible reason that one would want to divide total recorded elephant mortality in Ghana by elephant populations or surveyor effort in Botswana. Rather than removing confounding influences of local elephant populations and local surveyor effort (as PIKE was designed to do), Aggregate PIKE introduces new confounding influences of elephant populations and surveyor effort at distant locations (without removing the local confounding effects).
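A toy numerical version of this problem (all counts hypothetical): if surveillance at one site doubles, so that it detects twice as many carcasses of both kinds while its true poaching rate is unchanged, the average of site-level PIKE values is unaffected, but Aggregate PIKE shifts toward the heavily surveyed site.

```python
def average_pike(sites):
    """Mean of site-level PIKE values; sites = [(illegal, total), ...]."""
    return sum(i / t for i, t in sites) / len(sites)

def aggregate_pike(sites):
    """Summed illegal carcasses over summed total carcasses."""
    return sum(i for i, _ in sites) / sum(t for _, t in sites)

base = [(10, 50), (40, 50)]      # site PIKEs: 0.2 and 0.8
doubled = [(10, 50), (80, 100)]  # site 2 surveyed twice as hard; PIKEs unchanged

print(average_pike(base), average_pike(doubled))      # 0.5 0.5 -- unchanged
print(aggregate_pike(base), aggregate_pike(doubled))  # 0.5 0.6 -- shifts
```

Nothing about poaching changed between the two scenarios; only surveyor effort did. Aggregate PIKE moves anyway, which is exactly the confounding that site-level PIKE was designed to remove.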

An analogy might help with the intuition for why Aggregate PIKE is a poor (some might say misleading) measure. Weight-to-height ratios are commonly used to adjust weight measures for overall body frame size, using a simple measure (height) that proxies for body frame size. This is analogous to correcting illegal elephant mortality with total elephant mortality, since both should scale according to some difficult-to-observe factors (elephant populations and surveyor effort). If Aggregate PIKE were a reasonable measure, then we might also think that aggregate weight-to-height ratios were a good measure. To compute aggregate weight-to-height for you and a collection of your friends, one would first measure your weight and divide it by the sum of your height and all the heights of your friends. Then repeat this for each of your friends and sum up all these values. Sound uninterpretable? It is. This approach seems bizarre because you would be dividing your weight by your height and your friends' heights. If you have tall friends, then this number would be smaller. But why would you want the height of your friends to be used to correct your weight measure? You wouldn't. Similarly, you wouldn't want to use elephant populations in Mozambique to correct for the frequency of elephant carcass discovery in Nigeria.

Takeaway: Aggregate PIKE is a complex mathematical object with no useful interpretation or application in policy analysis. Our best guess is that it came into common use because the useful properties of PIKE computed locally (i.e., removing confounding influences of elephant population and surveyor effort) seemed appealing and seemed like they ought to apply to Aggregate PIKE as well. They do not.

3) Should we use a weighted regression where the weights are proportional to the sum of legal and illegal carcasses at each site in our regression?

While Dr. Underwood specifically advocates the use of Aggregate PIKE, she also makes the more general proposal that we must somehow account for the fact that more carcasses (legal and illegal) are found at some sites than at others. To do so, Dr. Underwood proposes that we alter our ordinary least squares regression by weighting observations according to the total number of carcasses discovered at each site (legal + illegal). She makes an informal case for this approach, appealing to the intuition that if fewer carcasses are discovered overall, then the denominator in PIKE will be a small number, leading to greater variation in PIKE. This argument sounds appealing because it is similar in motivation to standard arguments for a special case of Generalized Least Squares in which weighted least squares is the optimal solution, in the sense that it minimizes the variance of the result. However, the approach that Dr. Underwood proposes is not the correct implementation.

First, it should be clearly noted that even if Dr. Underwood's claim that locations with lower total mortality had larger variance in reported PIKE were true, this would not cause the estimated effect of the sale reported in Hsiang and Sekar to be biased. Changing variances in normally distributed residuals does not cause bias in average effects so long as they are mean zero (residuals that we showed had this property in the figure above). Moreover, the approach used in Hsiang and Sekar does not assume constant variance in residuals, and our estimated uncertainty non-parametrically takes into account the fact that the variances in residuals might change (formally, it is robust to arbitrary forms of heteroscedasticity), just like Dr. Underwood proposes. Thus there is no problem to correct. However, the correction that Dr. Underwood proposes does introduce new problems.
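Both claims can be illustrated with simulated (hypothetical) data. In the no-intercept regression y = xβ + ε, heteroscedasticity leaves the OLS slope unbiased, but the classical standard error (which assumes constant variance) misstates the true sampling variability, while a heteroscedasticity-robust (White/HC0-style) standard error tracks it. A minimal sketch:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(2)

def no_intercept_ols(x, y):
    """Slope plus classical and heteroscedasticity-robust (HC0) standard
    errors for the regression y = x*beta + eps with no intercept."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    e = [yi - beta * xi for xi, yi in zip(x, y)]
    classical = sqrt(sum(ei * ei for ei in e) / (len(x) - 1) / sxx)
    robust = sqrt(sum((xi * ei) ** 2 for xi, ei in zip(x, e)) / sxx ** 2)
    return beta, classical, robust

# Error spread grows with x, so residual variance is far from constant.
betas, classical_ses, robust_ses = [], [], []
for _ in range(2000):
    x = [random.uniform(1.0, 3.0) for _ in range(100)]
    y = [2.0 * xi + random.gauss(0.0, xi * xi) for xi in x]
    b, c, r = no_intercept_ols(x, y)
    betas.append(b)
    classical_ses.append(c)
    robust_ses.append(r)

# The slope stays centered on the true value of 2 (no bias); the robust SE
# matches the empirical spread of the estimates, the classical SE does not.
print(mean(betas), stdev(betas), mean(classical_ses), mean(robust_ses))
```

In other words, non-constant residual variance is a standard-error issue handled by robust inference, not a source of bias that weighting is needed to fix.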

Dr. Underwood's proposal is to use a function of the outcome variable as the weights in a regression (since PIKE = illegal/(legal + illegal), then the proposed weights are just weight=illegal/PIKE). It is a very bad idea to use a function of the outcome variable as the weight in a regression (for brevity, I'll just say the weight is the outcome variable, but this doesn't really matter either way).

I think most people can see intuitively why this is a bad idea by using another example. Imagine that you give some fifth graders math tutoring if they are struggling in math class. At the end of the year, you give all the students (both those with and without tutoring) another math test. You want to know if the tutoring was associated with higher scores, so you regress final math scores on a variable describing how much tutoring students received. If you ran this regression and weighted by the outcome, then you would be weighting student observations based on their final math scores. Students who scored better would get more weight in the regression! And if the students who were performing better also were less likely to be struggling in class, and therefore less likely to get any tutoring help, then applying weights would focus the results of the regression on the students who were the least relevant to the question of interest. Weighting by the outcome will systematically bias any result because it will focus the most "attention" on observations with a specific outcome. Essentially, it is selecting your sample after you have already run a study, know the results, and only retain (or up-weight) data where you obtain a specific outcome. Clearly, you should never do this.

You may be thinking, “Wait, Dr. Underwood isn’t proposing that we weight PIKE by PIKE itself—just by the denominator, the number of carcasses detected at a site.” But it would require extraordinary assumptions to conclude that the PIKE value is statistically independent of the number of carcasses found (its denominator). First and foremost, both PIKE and the total number of carcasses are calculated by using the number of poached carcasses found, which is not a small fraction of total carcasses (around 40%). Secondly, the number of carcasses detected (legal or illegal) is a function of the surveillance effort and several other variables that are likely to also affect the number/proportion of elephants that are poached (i.e., PIKE), which is exactly why the PIKE measure was invented in the first place. Thus, while the weighting scheme Dr. Underwood proposes is more complex than our example above, it still amounts mathematically to having all the problems of weighting the outcome variable by itself.

This can be seen formally (apologies for my janky math below). Suppose that y is the outcome and x is the treatment, and we want to run the regression

y = xβ + ε

via OLS. The standard condition for unbiased estimation of β is that x is exogenous, or at least mean independent of the unobserved errors:

cov(x, ε) = 0

Standard stuff. Weighting by some weight w is mathematically equivalent to running the normal regression where all terms are premultiplied by the square root of the weights, giving us the regression:

√w·y = √w·x·β + √w·ε

where we should just be able to recover β, since this is a linear rescaling of all terms, so long as the new residual terms (√w·ε) remain uncorrelated with the new regressor (√w·x). In the normal case, where the weights are unrelated to the outcome, it is easy to show that this is true because the independence condition above holds once the weights are factored out of the covariance:

cov(√w·x, √w·ε) = w·cov(x, ε) = 0

However, if the weights are equal to the outcome measure (or a function of the outcome), then this is no longer true. Since we know that the outcome is

y = xβ + ε

if we weight by y, then by substitution we have

√(xβ + ε)·y = √(xβ + ε)·x·β + √(xβ + ε)·ε

where we need

cov(√(xβ + ε)·x, √(xβ + ε)·ε) = 0

for the estimate to be unbiased. But this condition clearly cannot hold, since both terms contain x and ε!

Any regression that uses a function of the outcome as weights will have this issue, because any function of the outcome will necessarily be a function of epsilon. So using such weights will always generate correlation between the weighted regressors and epsilon, causing bias every time.

Thus, there is no actual reason to introduce weighting, since there is no problem to correct. Moreover, the correction that Dr. Underwood proposes (of using a function of the outcome as the weights) is guaranteed to produce erroneous results.
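The bias from outcome-dependent weights is easy to see in a small simulation (all numbers hypothetical). Below, plain OLS recovers the true slope of the no-intercept model y = xβ + ε, while weighted least squares with weights equal to the outcome itself is pushed systematically away from it:

```python
import random

random.seed(0)

def simulate(n=200, beta=2.0, reps=2000):
    """Average OLS and outcome-weighted-WLS slope estimates across reps."""
    ols_total = wls_total = 0.0
    for _ in range(reps):
        x = [random.uniform(1.0, 2.0) for _ in range(n)]
        y = [beta * xi + random.gauss(0.0, 1.0) for xi in x]
        # No-intercept OLS: beta_hat = sum(x*y) / sum(x^2)
        ols_total += sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        # WLS with weights w = y: beta_hat = sum(w*x*y) / sum(w*x^2)
        wls_total += (sum(yi * xi * yi for xi, yi in zip(x, y))
                      / sum(yi * xi * xi for xi, yi in zip(x, y)))
    return ols_total / reps, wls_total / reps

ols_mean, wls_mean = simulate()
print(ols_mean, wls_mean)  # OLS centers on the true beta of 2; the
                           # outcome-weighted estimate is biased upward
```

The weights here are the outcome itself for simplicity, but as argued above, any weight that is a function of the outcome (such as the carcass total in PIKE's denominator) carries the same problem.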

Contradictions between Dr. Underwood’s three critiques

None of Dr. Underwood’s three critiques is valid in its own right, as discussed above. We also point out that these three critiques are not consistent with one another. 

The PIKE metrics were designed to account for factors that generate variation in the discovery of poached elephant carcasses, like baseline elephant mortality rates and surveyor effort, all of which are accounted for by using counts of legal elephant carcasses specifically because legal carcasses are stochastic and respond similarly to these factors. Dr. Underwood simultaneously holds multiple perspectives on the design of PIKE which cannot be reconciled. First, by advocating for her binomial model of the data (as in her 2011 PlosOne article), Dr. Underwood indicates that the number of total carcasses discovered is deterministic and not stochastic, suggesting that total carcass counts cannot help correct for stochastic variation in these confounders. In her points (2) and (3), in contrast, Dr. Underwood makes arguments based on the stochasticity of total carcass discovery (in keeping with PIKE’s original design), suggesting that total carcass counts can and should be used as a correction factor. 

Furthermore, in her point (3), Dr. Underwood argues that the standard PIKE normalization that is the motivation of the correction she advocates in her point (2) is insufficient and must be corrected for a second time in the regression stage via weighting.  This suggests the total mortality correction is somehow deficient. 

Conversely, if total mortality really does account for population and surveyor effort perfectly through weighting (as advocated for in her point (3)), it will correct for these factors in site-level PIKE, making the aggregate PIKE measure advocated for in point (2) meaningless (as shown above). In addition, if this were true, then it would directly contradict the need for the second correction (weighting) that Dr. Underwood describes in point (3). 

Thus, logically, it seems that all of Dr. Underwood's arguments cannot be true simultaneously. In our evaluation, these apparent contradictions are reconciled by the fact that none of these arguments appears to be correct.

To summarize: the three critiques/suggestions offered by Dr. Underwood are not logically consistent with one another, since they make contradictory assumptions about whether PIKE corrects for elephant populations and surveyor effort and about whether the total count of carcasses discovered at each site is actually a random variable. Furthermore, each of the three points is independently erroneous, either because the mathematical assumptions Dr. Underwood makes cannot possibly be true (in the case of 2 and 3) or because those assumptions are clearly overturned by the data (in the case of 1). We therefore conclude that none of the critiques offered by Dr. Underwood is valid.

Perhaps an institutional lesson to be learned from this debate

Nitin and I have been gratified by the sincere interest throughout the conservation and economics communities in our findings. We appreciate Dr. Underwood's engagement with our work. However, it seems clear to us that the conservation community needs to approach the disagreement between us here (and similar situations in the future) differently.

In the opening of her recent remarks, Dr. Underwood writes, “As a member of the MIKE-ETIS Technical Advisory Group, I was asked if I could look at [Hsiang and Sekar’s] analysis and understand why [their] results differ so much.”

MIKE-ETIS is fully aware that (i) Dr. Underwood published, with Dr. Burn and Mr. Blanc, a paper whose conclusions are directly contradicted by our own, (ii) these findings were released under the auspices of CITES as an official evaluation of the effects of the 2008 sale, and that (iii) Dr. Underwood had already publicly made claims that our analysis was unequivocally “wrong.” We believe it would be very difficult for any reasonable individual in Dr. Underwood’s situation, despite genuinely having the best intentions, to objectively evaluate the quality of our analysis as requested by MIKE-ETIS.

In general, we recognize that creating incentives to poke holes in faulty arguments is at the very heart of scientific advancement. However, what CITES and the conservation community need right now are statistically well-informed, open-minded experts that can weigh the relative merits of different analytical approaches and provide policy-makers sound, objective guidance on the relative merits of each. Given the tremendous relevance of PIKE to international policy affecting one of the world’s most iconic species, CITES should never have relied upon just one pair of statisticians to conduct the definitive analysis of the one-time sale—and now, they should certainly not rely on those same statisticians to decide whether other evaluations are legitimate. 
