Simmons et al.’s 2011 paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, Psychological Science, 22(11), 1359-1366 (available at http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf) has generated a lot of discussion and citations (about 150 in a Google Scholar search).
Here’s a summary of what’s in the paper:
- A study with a preposterous conclusion.
- Simulations showing how different researcher choices (adding a dependent variable; adding more observations per cell; controlling for gender or interaction with treatment; dropping or including one of three conditions; and combinations of these) can affect the family-wise Type I error rate.
- Discussion of the effect on Type I error rate of adding additional observations.
- Proposed requirements for authors and guidelines for reviewers intended to mitigate the problem of false positives.
- Discussion of how following the proposed requirements would have altered the outcome of the example with a preposterous conclusion.
- Discussion of criticisms of their suggestions (both “not going far enough” and “going too far”).
- Discussion of what they call “nonsolutions”.
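The flavor of those simulations is easy to reproduce. Here is a minimal sketch (my code, not the authors’; the function names are mine) of just one researcher degree of freedom – measuring two dependent variables and reporting whichever comes out significant – assuming independent, normally distributed outcomes and using a normal approximation to the two-sample t-test:

```python
import random
from statistics import NormalDist, mean, stdev

def two_sample_p(a, b):
    """Approximate two-sided p-value for a difference in means.

    Uses a normal approximation to the two-sample t-test, which is
    adequate for the n = 50 per cell used below."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_sims=4000, n=50, alpha=0.05, seed=1):
    """Estimate Type I error rates with and without a flexible choice of DV."""
    rng = random.Random(seed)
    honest = flexible = 0
    for _ in range(n_sims):
        # The null is true: treatment and control come from the same distribution.
        treat1 = [rng.gauss(0, 1) for _ in range(n)]
        ctrl1 = [rng.gauss(0, 1) for _ in range(n)]
        treat2 = [rng.gauss(0, 1) for _ in range(n)]  # a second dependent variable
        ctrl2 = [rng.gauss(0, 1) for _ in range(n)]
        p1 = two_sample_p(treat1, ctrl1)
        p2 = two_sample_p(treat2, ctrl2)
        honest += p1 < alpha           # honest: one pre-specified DV
        flexible += min(p1, p2) < alpha  # flexible: report whichever DV "works"
    return honest / n_sims, flexible / n_sims
```

With the nominal α = 0.05, the honest rate stays near 0.05 while the pick-either-DV rate roughly doubles (near 1 − 0.95² ≈ 0.0975 for independent measures). Correlated measures, combinations of choices, and data peeking change the exact numbers, but the qualitative point – undisclosed flexibility inflates the family-wise Type I error rate – is the same one the paper’s simulations make.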
In my view, the paper is a mixed bag.
Positive aspects I see in the paper:
- It has generated a lot of discussion. (More on this in a later post.)
- Most of the recommendations are right on target for good, ethical science.
- The “preposterous conclusion” example and the simulations are good for making important points.
Negative aspects I see:
- Most importantly, they consider “adjusting alpha levels” a “nonsolution.” Their reasoning seems to be that there is no single, good way to do this. However, there are ways that can at least give a better sense of the “worst case” than ignoring the problem of multiple testing altogether. The authors seem unaware of the more recent literature on the subject, giving only a reference to a 1977 paper by Pocock on sequential testing in clinical trials. (See Multiple Inference for discussion and references on multiple testing.)
- Their second “requirement for authors”: “Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” I agree that too-small sample size is a problem, but the only reason the authors give for the figure of 20 per cell is that “Samples smaller than 20 per cell are simply not powerful enough to detect most effects” (for which they give no reference). I cannot think of a good justification for this seemingly arbitrary figure – indeed, different experimental designs may require different sample sizes per cell to achieve the same power. What is really needed is a thorough discussion of power – including how power calculations need to take experimental design and multiple testing into account, and how “small, moderate, and large effect sizes” are at best a sloppy approach to calculating power.
- Listing “requirements for authors” may mislead researchers into believing that if they follow the requirements, they are doing all they need to do to produce good-quality research. Simmons et al. do say, “This solution substantially mitigates the problem,” and that it requires “only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings.” Still, the non-recommendation about adjusting for multiple testing and the weakness of the recommendation regarding sample size, combined with widespread ignorance about good practices regarding multiple inference, power, and other aspects of research, are likely to give researchers false confidence that their methods and reporting are good when they are not – and thereby lead equally naïve readers to believe what they read.
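On the first point: there are well-established adjustments the authors could have discussed. Here is a sketch (my code, not anything from the paper) of one of them, the Holm step-down procedure, which controls the family-wise error rate at α with no assumptions about dependence among the tests:

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, preserving the input order.

    Each adjusted p-value can be compared directly to the nominal alpha;
    the family-wise Type I error rate is then controlled at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)  # enforce monotonicity of adjusted p's
        adjusted[i] = running_max
    return adjusted
```

For example, `holm_adjust([0.01, 0.04, 0.03, 0.20])` yields approximately [0.04, 0.09, 0.09, 0.20]: at α = 0.05 only the first test survives adjustment, whereas unadjusted testing would have declared three of the four “significant.”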
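On the second point, a back-of-the-envelope power calculation (mine, not the paper’s) makes the arbitrariness of “20 per cell” concrete. Using the standard normal approximation for a two-sample comparison of means:

```python
import math
from statistics import NormalDist

def n_per_cell(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison of means
    with standardized effect size d (Cohen's d), two-sided test at level alpha.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)**2,
    which slightly understates the exact t-test requirement."""
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2
    return math.ceil(n)
```

Here `n_per_cell(0.5)` gives 63, so even a “medium” standardized effect of d = 0.5 needs roughly three times the 20-per-cell floor for 80% power; 20 per cell achieves 80% power only for effects around d ≈ 0.9. And this is for the simplest possible design – other designs, and any adjustment for multiple testing, change the answer, which is exactly why a single magic number per cell cannot substitute for a real power analysis.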