In his June 10 Guardian article, Chris Chambers gave a link to an article discussing “questionable research practices” that are common in psychology research. One practice omitted from that article, but that I believe should have been included^{1}, is performing more than one hypothesis test on the same data *without* taking into account how this affects the prevalence of Type I errors (falsely rejecting the null hypothesis).

One often-helpful way to look at hypothesis tests is:

If you perform a hypothesis test at a given significance level (I’ll use 0.05 for illustration) and obtain a p-value below that level, then there are three possibilities:

- The model assumptions for the hypothesis test are not satisfied in the context of your data.
- The null hypothesis is false.
- Your sample happens to be one of the 5% of samples satisfying the appropriate model conditions for which the hypothesis test gives you a Type I error.

This way of looking at hypothesis tests helps us see the problem of performing multiple hypothesis tests using the same data:

If you are performing *two* hypothesis tests using the *same data*, and if all model assumptions are satisfied, and if also *both* null hypotheses are true, *there is in general no reason to believe that the samples giving a Type I error for one test will also give a Type I error for the other test*.

Web simulations^{2} can help give an idea of the range of possible ways different combinations of tests can give Type I errors.
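The same point can be checked with a small simulation. The sketch below is an illustration only, not something taken from the papers discussed here. It assumes two measured variables whose true means are 0 (so both null hypotheses are true), a two-sided z-test with known standard deviation 1 for each variable, and a hypothetical parameter `rho` for the correlation between the two variables. The function name `simulate_two_tests` and all default values are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate_two_tests(n_samples=20000, n_obs=20, alpha=0.05, rho=0.0, seed=1):
    """Estimate Type I error rates for two z-tests run on the same samples.

    Both null hypotheses (mean = 0) are true by construction.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    test1 = test2 = both = at_least_one = 0
    for _ in range(n_samples):
        xs, ys = [], []
        for _ in range(n_obs):
            x = rng.gauss(0.0, 1.0)
            noise = rng.gauss(0.0, 1.0)
            xs.append(x)
            # y has mean 0, variance 1, and correlation rho with x
            ys.append(rho * x + math.sqrt(1.0 - rho ** 2) * noise)
        # z statistic for H0: mean = 0, with known standard deviation 1
        reject1 = abs(fmean(xs)) * math.sqrt(n_obs) > z_crit
        reject2 = abs(fmean(ys)) * math.sqrt(n_obs) > z_crit
        test1 += reject1
        test2 += reject2
        both += reject1 and reject2
        at_least_one += reject1 or reject2
    return {
        "test1": test1 / n_samples,
        "test2": test2 / n_samples,
        "both": both / n_samples,
        "at_least_one": at_least_one / n_samples,
    }


rates = simulate_two_tests()
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

With `rho = 0`, each individual rate should come out near .05, while the rate for at least one Type I error should be noticeably higher (close to .0975), because few samples give an error on both tests at once.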

Because of the multiple testing problem, we need to look at more than the *individual* Type I error rate (in this case, alpha = .05) that is applied to each hypothesis test individually; we also need to consider the

*Family-wise error rate* (*FWER*): The probability that a randomly chosen sample (of the given size, satisfying the appropriate model assumptions) will give a Type I error for *at least one* of the hypothesis tests performed.

(The FWER is also called the *joint Type I error rate*, the *overall Type I error rate*, the *joint significance level*, the *simultaneous Type I error rate*, the *experiment-wise error rate*, etc.)
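To get a feel for how quickly the FWER can grow, the sketch below computes it under a simplifying assumption that is not made in the text above: that the tests are independent and all null hypotheses are true. In that special case the FWER is 1 - (1 - alpha)^n. Tests run on the same data are usually not independent, so these figures are illustrative only, and the function name is hypothetical.

```python
def fwer_if_independent(alpha, n_tests):
    """FWER for n_tests independent tests at level alpha, all nulls true."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 2, 4, 8, 12, 14):
    print(n_tests, round(fwer_if_independent(0.05, n_tests), 3))
```

Under this assumption, two tests at the .05 level already give a FWER of .0975, and 14 tests give a FWER just over one half.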

There is no perfect way of dealing with multiple testing, but there are some pretty good ways. The simplest is sometimes called the Bonferroni method: if you want a FWER of alpha and are performing n tests, then use an individual Type I error rate of alpha/n for each test. For example, to ensure a FWER of at most .05 when performing 4 tests, use an individual alpha of .05/4 = .0125 for each test.
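As a minimal sketch of the Bonferroni method just described, the function below divides the desired FWER by the number of tests and flags which p-values fall below the resulting per-test level. The function name `bonferroni` and the four example p-values are hypothetical, chosen only to mirror the 4-test example above.

```python
def bonferroni(p_values, fwer=0.05):
    """Return the per-test alpha and a reject/keep flag for each p-value."""
    per_test_alpha = fwer / len(p_values)
    reject = [p < per_test_alpha for p in p_values]
    return per_test_alpha, reject


# Hypothetical p-values from 4 tests on the same data
example_p_values = [0.003, 0.02, 0.04, 0.30]
per_test_alpha, reject = bonferroni(example_p_values, fwer=0.05)
print(per_test_alpha)  # the individual alpha, .05/4
print(reject)          # which tests remain significant after the correction
```

In this made-up example, three of the four p-values are below .05, but only the first is below the per-test level of .0125.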

How do the papers on stereotype susceptibility discussed in Part I handle multiple testing?

Two comments before discussing this:

1. The caveat of Part I still applies: **the comments I make in this post and the ones following are *not* intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research.** Indeed, my central point is that **registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.**

2. If more than one study was reported in a paper, I will discuss only the first.

**Steele and Aronson** (1995): Here is a list of the hypothesis tests mentioned in reporting results of their Study 1 (pp. 800-801):

- “Chi-squared analyses performed on participants’ response to the postexperimental question about the purpose of the study revealed only an effect of condition …, p < .001”
- “The ANCOVA on the number of items participants got correct, using their self-reported SAT scores as the covariate … revealed a significant condition main effect … p < .02 … and a significant race main effect … p < .03…. The race-by-condition interaction did not reach conventional significance (p < .19)”
- “Bonferroni contrasts with SAT as a covariate supported [the reasoning of the hypothesized effect] by showing that Black participants in the diagnostic condition performed significantly worse than Black participants in either the nondiagnostic condition … p < .01, or the challenge condition … p <.01, as well as significantly worse than White participants in the diagnostic condition … p < .01”
- They also performed another test of interaction, which “reached marginal significance, … p < .08”

My comments on this:

- Item 1 suggests that they performed more tests on participants’ responses to the post-experimental questions than just the one test reported.
- So altogether, at least 8 hypothesis tests were performed, using the same sample. Using the simple Bonferroni method discussed above, each test would need to be judged at an individual significance level of .05/8 = .00625; that is, a result would count as significant only if p < .00625.
- I’m not sure what “Bonferroni contrasts” means, but I think it means that a Bonferroni procedure was used (possibly automatically by the software?) to adjust p-values to take into account the number of contrasts considered. In other words (assuming 3 contrasts), the adjusted p-value would be the ordinary p-value multiplied by 3 (and capped at 1).
- So the only test listed that would be significant using the simple Bonferroni method would be the one reported in item 1.

The upshot: Steele and Aronson may have done a little bit of taking multiple inference into account (possibly only because the software did it automatically?), but did not really consider a FWER. I am not surprised – the game of telephone effect and TTWWADI had probably made disregard of multiple testing fairly standard by 1995.

**Shih et al (1999)**: I found no mention of the problem of multiple testing. However, I counted what appeared to be 12 hypothesis tests performed. P-values were given for only 4; the others were listed accompanied by words that suggested that no significant difference was found. The p-values listed were: p < .05, p < .05, p = .19, p = .01. With 12 hypothesis tests, the simple Bonferroni method to give FWER .05 would require using an individual alpha of .05/12, or about .0042. *None* of the p-values listed reached this level of significance. It appears that disregard for the problem of multiple testing had become TTWWADI by 1999.

**Gibson et al (2014):** Again, no mention of the problem of multiple testing (TTWWADI, I would guess). I counted 14 hypothesis tests, with the lowest p-value reported as p = .02. This occurred for two tests: 1) the difference between all three groups on accuracy, when including only the 127 participants who were aware of the race and gender stereotypes; and 2) in particular, the difference between female-primed subjects and Asian-primed subjects, also when restricted to the same subset. The simple Bonferroni method would require an individual significance level of .05/14, or about .0036. So again, nothing is significant after taking multiple testing into account.

**Moon et al (2014):** P-values listed were .44, .43, .28, .28, .55, .57, .29, .31, .92, .76. I found no mention of multiple testing, but with p-values this high, there would be no need to adjust for it.
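The arithmetic behind these comparisons can be collected in one place. The sketch below uses only the test counts and smallest reported p-values given in this post, and applies the simple Bonferroni method with a FWER of .05. Two caveats are assumptions of the sketch: for Steele and Aronson the smallest value is a reported bound (p < .001) rather than an exact p-value, and for Moon et al the number of tests is taken to be the 10 p-values listed.

```python
FWER = 0.05

# (paper, number of tests counted in this post, smallest p-value reported)
# Steele and Aronson: .001 is a reported upper bound (p < .001), not an exact value.
# Moon et al: the count of 10 is just the number of p-values listed (an assumption).
papers = [
    ("Steele and Aronson (1995)", 8, 0.001),
    ("Shih et al (1999)", 12, 0.01),
    ("Gibson et al (2014)", 14, 0.02),
    ("Moon et al (2014)", 10, 0.28),
]

results = {}
for name, n_tests, smallest_p in papers:
    threshold = FWER / n_tests  # Bonferroni per-test significance level
    passes = smallest_p <= threshold
    results[name] = (threshold, passes)
    print(f"{name}: threshold {threshold:.5f}, smallest p {smallest_p}, passes: {passes}")
```

Only the Steele and Aronson bound falls below its Bonferroni threshold; for the other papers, the smallest reported p-value is above the per-test level.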

So I propose:

RECOMMENDATION #3 for improving research proposals (and thereby improving research quality):

- Proposers should include in their proposals plans for how to take multiple testing into account in their methods of data collection, data analysis, and interpretation of results.
- Reviewers of research proposals should check that proposers have included plans for accounting for multiple testing, and that these plans are appropriate for the aims and methods of the study.

Comment: As I will discuss in some of the following posts, multiple inference enters into research plans in more than just the way outlined in this recommendation.

*Notes*:

1. Analogous studies of questionable practices in medical research have included the problem of multiple testing. For example, A. M. Strasak et al (The Use of Statistics in Medical Research, *The American Statistician*, February 1, 2007, 61(1): 47-55) report that, in an examination of 31 papers from the *New England Journal of Medicine* and 22 from *Nature Medicine* (all papers from 2004), 10 (32.3%) of those from *NEJM* and 6 (27.3%) from *Nature Medicine* were “Missing discussion of the problem of multiple significance testing if occurred.” These two journals are considered the top journals (according to impact figure) in clinical science and in research and experimental medicine, respectively.

2. See Jerry Dallal’s demo. This simulates the results of 100 independent hypothesis tests, each at 0.05 significance level. Click the “test/clear” button to see the results of one set of 100 tests (that is, for *one sample* of data). Click the button two more times (first to clear and then to do another simulation) to see the results of another set of 100 tests (i.e., for *another sample* of data). Notice as you continue to do this that i) *which tests give type I errors (i.e., are statistically significant at the 0.05 level) varies from sample to sample*, and ii) *which samples give type I errors for a particular test varies from test to test*. (To see the latter point, it may help to focus just on the first column, representing just 10 hypothesis tests.)

For more discussion and further references on multiple testing, see here.