Beyond the Buzz Part IV: Multiple Testing

In his June 10 Guardian article, Chris Chambers gave a link to an article discussing “questionable research practices” that are common in psychology research. One practice omitted from that article, but that I believe should have been included [1], is the practice of performing more than one hypothesis test on the same data without taking into account how this affects the prevalence of Type I errors (falsely rejecting the null hypothesis).

One often-helpful way to look at hypothesis tests is:

If you perform a hypothesis test using a certain significance level (I’ll use 0.05 for illustration), and if you obtain a p-value less than that significance level (here assumed to be 0.05), then there are three possibilities:

  1. The model assumptions for the hypothesis test are not satisfied in the context of your data.
  2. The null hypothesis is false.
  3. Your sample happens to be one of the 5% of samples satisfying the appropriate model conditions for which the hypothesis test gives you a Type I error.

This way of looking at hypothesis tests helps us see the problem of performing multiple hypothesis tests using the same data:

If you are performing two hypothesis tests using the same data, and if all model assumptions are satisfied, and if also both null hypotheses are true, there is in general no reason to believe that the samples giving a Type I error for one test will also give a Type I error for the other test. So the proportion of samples giving a Type I error for at least one of the two tests is, in general, larger than 5%.

Web simulations [2] can help give an idea of the range of possible ways different combinations of tests can give Type I errors.
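The same idea can be sketched offline in a few lines of Python. This is only an illustration under stated assumptions: both null hypotheses are true, all model assumptions hold (so each p-value is uniform on [0, 1]), and the two tests are independent. Tests run on the same data are often correlated, in which case the numbers would differ. The function name `simulate_two_tests` and its parameters are hypothetical, chosen for this sketch.

```python
import random


def simulate_two_tests(n_samples=100_000, alpha=0.05, seed=0):
    """Simulate two tests per sample when both null hypotheses are true.

    Assumption (for illustration only): the tests are independent, so each
    p-value is drawn separately from Uniform(0, 1).
    """
    rng = random.Random(seed)
    only_a = only_b = both = 0
    for _ in range(n_samples):
        err_a = rng.random() < alpha  # Type I error on test A for this sample
        err_b = rng.random() < alpha  # Type I error on test B for this sample
        if err_a and err_b:
            both += 1
        elif err_a:
            only_a += 1
        elif err_b:
            only_b += 1
    return {
        "only_a": only_a / n_samples,
        "only_b": only_b / n_samples,
        "both": both / n_samples,
        "at_least_one": (only_a + only_b + both) / n_samples,
    }


rates = simulate_two_tests()
print(rates)
```

Under these assumptions, each test gives a Type I error on roughly 5% of samples, but mostly on different samples, so the share of samples with at least one error comes out close to 1 − 0.95² ≈ 0.0975 rather than 0.05.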

Because of the multiple testing problem, we need to look at more than the individual Type I error rate (in this case, alpha = .05) that is applied to each hypothesis test individually; we also need to consider the

Family-wise error rate (FWER): The probability that a randomly chosen sample (of the given size, satisfying the appropriate model assumptions) will give a Type I error for at least one of the hypothesis tests performed.

(The FWER is also called the joint Type I error rate, the overall Type I error rate, the joint significance level, the simultaneous Type I error rate, the experiment-wise error rate, etc.)
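To get a feel for how quickly the FWER grows, here is a small Python sketch for the special case where the tests are independent and every null hypothesis is true. Independence is an assumption made here for simplicity; it is not something the definition above requires, and tests on the same data are frequently dependent. The function name `fwer_independent` is hypothetical.

```python
def fwer_independent(alpha, n_tests):
    """FWER for n_tests independent tests, each at level alpha, all nulls true.

    The chance of no Type I error in any test is (1 - alpha) ** n_tests,
    so the chance of at least one is the complement.
    """
    return 1 - (1 - alpha) ** n_tests


for n in (1, 2, 4, 8, 12, 14):
    print(n, round(fwer_independent(0.05, n), 3))
```

In this idealized setting, 14 tests at the 0.05 level already give a FWER of a bit over one half.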

There is no perfect way of dealing with multiple testing, but there are some pretty good ways. The simplest is sometimes called the Bonferroni method: If you want a FWER of alpha, and are performing n tests, then use an individual Type I error rate of alpha/n for each test individually. For example, to ensure a FWER of at most .05 when performing 4 tests, use an individual alpha of .05/4 = .0125 for each test.
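The Bonferroni rule is simple enough to sketch directly. The helper names `bonferroni_alpha` and `bonferroni_significant` are hypothetical, and the p-values in the example are made up for illustration; they are not taken from any of the papers discussed here.

```python
def bonferroni_alpha(fwer, n_tests):
    """Individual significance level that keeps the FWER at or below `fwer`."""
    return fwer / n_tests


def bonferroni_significant(p_values, fwer=0.05):
    """Flag each p-value that falls below the Bonferroni threshold."""
    threshold = bonferroni_alpha(fwer, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from 4 tests on the same sample.
p_values = [0.001, 0.02, 0.03, 0.19]
print(bonferroni_alpha(0.05, len(p_values)))
print(bonferroni_significant(p_values))
```

With these made-up values, only the first test falls below the individual level of .0125, even though three of the four are below .05.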

How do the papers on stereotype susceptibility discussed in Part I handle multiple testing?

Two comments before discussing this:

1. The caveat of Part I still applies: the comments I make in this post and the ones following are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.

2. If more than one study was reported in a paper, I will discuss only the first.

Steele and Aronson (1995): Here is a list of the hypothesis tests mentioned in reporting results of their Study 1 (pp. 800-801):

  1. “Chi-squared analyses performed on participants’ response to the postexperimental question about the purpose of the study revealed only an effect of condition …, p < .001”
  2. “The ANCOVA on the number of items participants got correct, using their self-reported SAT scores as the covariate … revealed a significant condition main effect … p < .02 … and a significant race main effect … p < .03…. The race-by-condition interaction did not reach conventional significance (p < .19)”
  3. “Bonferroni contrasts with SAT as a covariate supported [the reasoning of the hypothesized effect] by showing that Black participants in the diagnostic condition performed significantly worse than Black participants in either the nondiagnostic condition … p < .01, or the challenge condition … p <.01, as well as significantly worse than White participants in the diagnostic condition … p < .01”
  4. They also performed another test of interaction, which “reached marginal significance, … p < .08”

My comments on this:

  • Item 1 suggests that they performed more tests on participants’ responses to the post-experimental questions than just the one test reported.
  • So altogether, at least 8 hypothesis tests were performed, using the same sample. Using the simple Bonferroni method discussed above, each test would need a p-value below .05/8 = .00625 to count as significant.
  • I’m not sure what “Bonferroni contrasts” means, but I think it means that a Bonferroni procedure was used (possibly automatically by the software?) to adjust p-values to take into account the number of contrasts considered. In other words (assuming 3 contrasts), the adjusted p-value would be the ordinary p-value multiplied by 3 (and capped at 1).
  • So the only test listed that is clearly significant using the simple Bonferroni method is the one reported in item 1. The contrasts in item 3 might also qualify, but only if the values reported there are adjusted ones: an adjusted p < .01 corresponds to an unadjusted p < .0033, which is below .00625.

The upshot: Steele and Aronson may have done a little bit of taking multiple inference into account (possibly only because the software did it automatically?), but did not really consider a FWER. I am not surprised – the game of telephone effect and TTWWADI (“that’s the way we’ve always done it”) had probably made disregard of multiple testing fairly standard by 1995.

Shih et al (1999): I found no mention of the problem of multiple testing. However, I counted what appeared to be 12 hypothesis tests performed. P-values were given for only 4; the others were listed accompanied by words that suggested that no significant difference was found. The p-values listed were: p < .05, p < .05, p = .19, p = .01. With 12 hypothesis tests, the simple Bonferroni method to give a FWER of .05 would require using an individual alpha of .05/12, or about .0042. None of the p-values listed reached this level of significance. It appears that disregard for the problem of multiple testing had become TTWWADI by 1999.

Gibson et al (2014): Again, no mention of the problem of multiple testing (TTWWADI, I would guess). I counted 14 hypothesis tests, with lowest p-value reported as p = .02. This occurred for two tests: 1) difference between all three groups on accuracy, when including only the 127 participants who were aware of the race and gender stereotypes; and 2) in particular, between female-primed subjects and Asian-primed subjects, also when restricted to the same subset. The simple Bonferroni method would require individual significance level .05/14, or about .0036. So again, nothing significant after taking multiple testing into account.

Moon et al (2014): P-values listed were .44, .43, .28, .28, .55, .57, .29, .31, .92, .76. I found no mention of multiple testing, but with p-values this high, there would be no need to adjust for it.
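The thresholds quoted for the first three papers can be reproduced with a short Python sketch. The test counts and p-values below are the ones given in this post; the variable names are hypothetical. For Steele and Aronson only bounds such as p < .001 were reported, so no smallest exact p-value is entered for that paper.

```python
FWER = 0.05

# Number of hypothesis tests counted in this post for each paper.
tests_counted = {
    "Steele and Aronson (1995)": 8,
    "Shih et al (1999)": 12,
    "Gibson et al (2014)": 14,
}

# Smallest exactly reported p-value, where the post quotes one.
smallest_exact_p = {
    "Shih et al (1999)": 0.01,
    "Gibson et al (2014)": 0.02,
}

# Simple Bonferroni threshold for each paper.
thresholds = {paper: FWER / n for paper, n in tests_counted.items()}

for paper, threshold in thresholds.items():
    p_min = smallest_exact_p.get(paper)
    if p_min is None:
        verdict = "only bounds reported"
    elif p_min < threshold:
        verdict = "smallest exact p is below the threshold"
    else:
        verdict = "smallest exact p is above the threshold"
    print(f"{paper}: threshold {threshold:.5f}, {verdict}")
```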

So I propose:

RECOMMENDATION #3 for improving research proposals (and thereby improving research quality):

  • Proposers should include in their proposals plans for how to take multiple testing into account in their methods of data collection, data analysis, and interpretation of results.
  • Reviewers of research proposals should check that proposers have included plans for accounting for multiple testing, and that these plans are appropriate for the aims and methods of the study.

Comment: As I will discuss in some of the following posts, multiple inference enters into research plans in more than just the way outlined in this recommendation.


1. Analogous studies of questionable practices in medical research have included the problem of multiple testing. For example, A. M. Strasak et al (The Use of Statistics in Medical Research, The American Statistician, February 1, 2007, 61(1): 47-55) report that, in an examination of 31 papers from the New England Journal of Medicine and 22 from Nature Medicine (all papers from 2004), 10 (32.3%) of those from NEJM and 6 (27.3%) from Nature Medicine were “Missing discussion of the problem of multiple significance testing if occurred.” These two journals are considered the top journals (according to impact figure) in clinical science and in research and experimental medicine, respectively.

2. See Jerry Dallal’s demo. This simulates the results of 100 independent hypothesis tests, each at 0.05 significance level. Click the “test/clear” button to see the results of one set of 100 tests (that is, for one sample of data). Click the button two more times (first to clear and then to do another simulation) to see the results of another set of 100 tests (i.e., for another sample of data). Notice as you continue to do this that i) which tests give type I errors (i.e., are statistically significant at the 0.05 level) varies from sample to sample, and ii) which samples give type I errors for a particular test varies from test to test. (To see the latter point, it may help to focus just on the first column, representing just 10 hypothesis tests.)

For more discussion and further references on multiple testing, see here.
