This is a continuation of a series of posts on common missteps in statistical practice, prompted by the recent special issue of *Social Psychology* featuring registered replications. As with the previous posts, I will illustrate with studies (two from the special issue, and others that preceded those) on stereotype susceptibility. As mentioned in the first post in the series, I chose this topic only because it is one with which I have previous experience; **the comments I make in this post are *not* intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research.** Indeed, my central point is that

**registered reports are not, by themselves, enough to substantially improve the quality of published research.**

**In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.**

The Gibson et al article in the special issue states (p. 195), “A total of 164 Asian Female college students participated in this study, with approximately 52 in each condition so as to detect a medium effect size (r = .35; Shih et al., 1999, p. 81) with 80% power (Cohen, 1992).”

I checked out the Cohen reference. Indeed, Table 2 (p. 158) of Cohen’s paper^{1} lists 52 participants per group for a one-way ANOVA with three groups, medium effect size, and significance level .05 to have power .80.
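That tabled figure can also be checked against a modern exact power solver. Here is a minimal sketch using Python's statsmodels package (my check, not part of Cohen's paper or the Gibson et al report); the continuous solution comes out near 52–53 per group, agreeing with Cohen's tabled 52 up to rounding:

```python
# Reproducing the Cohen (1992) Table 2 entry cited by Gibson et al:
# one-way ANOVA, 3 groups, medium effect (Cohen's f = 0.25),
# significance level .05, target power .80.
from math import ceil

from statsmodels.stats.power import FTestAnovaPower

# solve_power returns the *total* sample size across all groups
total_n = FTestAnovaPower().solve_power(
    effect_size=0.25,  # Cohen's "medium" f for ANOVA
    k_groups=3,
    alpha=0.05,
    power=0.80,
)
per_group = ceil(total_n / 3)
print(f"total N = {total_n:.1f}, i.e. about {per_group} per group")
```

Small discrepancies between a solver like this and Cohen's tabled values are just rounding; the point is that the figure quoted by Gibson et al does match Cohen's table for the overall F-test.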

But as I thought further and read more of Cohen’s paper, additional concerns arose:

- I remembered that Gibson et al also tested contrasts, so I wondered if Cohen’s sample size included these or just the overall F-test. I’m still not completely sure, but Cohen’s item 7 on p. 157 suggests that the sample sizes in Table 2 just refer to the F-test.
- If Cohen’s sample size value for ANOVA does not cover the contrasts, then my best guess from his paper would be that they would be covered under his values for two-sample t-tests. However, according to his Table 2, these would require 64 in each group to detect a medium effect with power .80.
- The figures checked out above assume a significance level (alpha) of .05. But, as remarked in the preceding post, using the simple Bonferroni method to account for multiple testing, to give an FWER (overall significance level) of .05, would require (for Gibson et al) individual significance levels of .05/14, or about .0036. Cohen’s table doesn’t go down that far; at the lowest alpha in the table (.01), the sample sizes per group for a medium effect would be 76 (for ANOVA) and 95 (for a two-sample t-test).
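To make these comparisons concrete, the same kind of solver can be pointed at the two-sample t-test cases, including the Bonferroni-adjusted level of .05/14, which is off the edge of Cohen's table. A sketch, again using statsmodels (the Bonferroni number is my calculation, not Cohen's):

```python
# Required n per group for a two-sample t-test, medium effect
# (Cohen's d = 0.5), power .80, at three significance levels:
# .05 and .01 (checkable against Cohen's Table 2) and the
# Bonferroni-adjusted .05/14 discussed above.
from math import ceil

from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

n_05 = solver.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
n_01 = solver.solve_power(effect_size=0.5, alpha=0.01, power=0.80)
n_bonf = solver.solve_power(effect_size=0.5, alpha=0.05 / 14, power=0.80)

for label, n in [(".05", n_05), (".01", n_01), (".05/14", n_bonf)]:
    print(f"alpha = {label}: about {ceil(n)} per group")
```

The first two should land close to Cohen's tabled 64 and 95; the Bonferroni case comes out well above either, underlining how quickly multiple-testing corrections inflate sample-size requirements.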

The upshot: The two replications *might* be seriously underpowered *to detect a medium (as defined by Cohen) effect*. However (as will be discussed in subsequent posts), this does *not* necessarily mean they were underpowered for purposes of detecting *a meaningful difference*. Still, they used samples substantially larger than those in Shih et al, and thus had higher power than that study.

I’ll discuss power more in my next post, but would like to end this one with a note of appreciation for Cohen’s efforts. I have in the past (as have others^{2}) seen his use of small/medium/large effect sizes only as an obstacle to good calculation of power and sample size (as I will discuss in coming posts). But I had never read his paper before. Now that I have, I see that what he achieved was tremendous progress over the previous practice of ignoring power in research in the behavioral sciences. I now see that his 1992 paper was a compromise, a compromise that was effective in promoting more widespread attention to power in his field.

But now that attention to power has increased in the field, it is time to go further and pay more attention to doing better power and sample size calculations. In other words, Cohen succeeded in changing TTWWADI (“That’s The Way We’ve Always Done It”) from “ignore power” to “use Cohen’s 1992 paper and small-medium-large effect sizes.” *The challenge now is to move on to better (more accurate) methods of calculating sample size that take into account more than Cohen’s ideas of S, M, or L effect sizes.* Pointers in that direction will be the subject of the next posts.

*Notes:*

1. Cohen, J. (1992), “A Power Primer,” *Psychological Bulletin* (Quantitative Methods in Psychology section), 112(1), 155–159.

2. Examples:

- Lenth, R. V. (2001), “Some Practical Guidelines for Effective Sample Size Determination,” *The American Statistician*, 55(3), 187–193; see also Russ Lenth’s Sample Size Page.
- Muller, K. E., and Benignus, V. A. (1992), “Increasing Scientific Power with Statistical Power,” *Neurotoxicology and Teratology*, 14, 211–219, which says of Cohen’s method (p. 7):

“The great attraction of the method, its lack of dependence on the application, may be considered to be its greatest weakness.”