In the last post, I commented on how Cohen’s 1992 paper [1] moved psychology research from ignoring power to using a simplified method, proposed by Cohen, for determining the sample size needed to obtain specified levels of power. But Cohen’s method has some serious limitations, and most statistical software can now quickly calculate the sample size needed to achieve specified power for standard statistical procedures (such as one-way ANOVA), given suitable inputs; indeed, there are even online calculators for this task (e.g., here). This makes sample size calculations much easier than they were in the early nineties. Moreover, deciding on the values of the inputs (see below) needed to calculate sample size requires researchers to think more carefully about their planned research than Cohen’s simplified method does. This thinking in itself is likely to improve the quality of research.
Inputs typically required to calculate sample size for basic statistical procedures are:
1. A minimum difference you would like to detect.
2. An alpha level.
3. (An estimate of) a suitable standard deviation [2].
4. The power you would like the study to have.
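To make these inputs concrete, here is a minimal Python sketch of how the four inputs combine into a sample size in the simplest case: a two-sided comparison of two group means with equal group sizes. It uses the textbook normal approximation rather than the t-based calculation that most software uses. The function name and the example numbers (a difference of 5 points, a standard deviation of 10) are my own illustrative assumptions, not values from any study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(min_diff, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes,
    common standard deviation). Hypothetical helper for illustration."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the alpha level
    z_power = z.inv_cdf(power)            # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / min_diff) ** 2)


# Hypothetical inputs: detect a 5-point difference when the SD is about 10.
print(n_per_group(min_diff=5, sd=10, alpha=0.05, power=0.80))
```

With these hypothetical inputs the approximation gives 63 per group. Software based on the t distribution will typically report a slightly larger number, so treat this as a rough guide only.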
Before discussing these inputs, it is helpful to recall what statistical power is. The short definition [3] is:
The power of a hypothesis test is the probability that the test will detect a true difference of a specified size.
This “true difference” is sometimes called an effect size. It’s important to distinguish between two kinds of effect size, which are sometimes called raw effect size and standardized effect size.
A raw effect size is typically the difference between the value of some quantity in the null hypothesis, and another value of that quantity that we wish to compare it to. This is typically what is called for in Input 1 listed above.
Example: In research on stereotype susceptibility, with outcome variable Y (e.g., a measure of performance on an exam), the researchers might perform a hypothesis test with null and alternate hypotheses
H0: µ1 = µ2 and Ha: µ1 ≠ µ2,
respectively. Here µ1 is the mean of Y for population 1 (e.g., subjects with minority identity salient) and µ2 is the mean of Y for population 2 (e.g., subjects without minority identity salient). The “raw” effect size here is µ1 – µ2 [4].
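The definition of power can be illustrated by simulation for a two-group setup like this one. The sketch below repeatedly draws two samples with a known true difference and records how often a two-sided test rejects H0; that proportion estimates the power. The population means, standard deviation, and sample size are hypothetical. For simplicity the test is a z-test that treats the standard deviation as known, which is an assumption; a real analysis would usually use a t-test.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulated_power(mu1, mu2, sd, n, alpha=0.05, reps=2000, seed=0):
    """Estimate power as the fraction of simulated studies that reject H0.
    Assumes normal outcomes, a known common SD, and n subjects per group."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    se = sd * sqrt(2 / n)                          # SE of the difference in means
    rejections = 0
    for _ in range(reps):
        group1 = [rng.gauss(mu1, sd) for _ in range(n)]
        group2 = [rng.gauss(mu2, sd) for _ in range(n)]
        z = (fmean(group1) - fmean(group2)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps


# Hypothetical populations: true means 70 and 75, SD 10, 64 subjects per group.
print(simulated_power(mu1=70, mu2=75, sd=10, n=64))
# With no true difference, the rejection rate should be close to alpha.
print(simulated_power(mu1=70, mu2=70, sd=10, n=64))
```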
Similarly, if we were interested in comparing three populations, we would consider an ANOVA test with null and alternate hypotheses
H0: µ1 = µ2 = µ3 and Ha: At least two of µ1, µ2, and µ3 are different.
The corresponding “raw” effect size would be the largest of the pairwise differences between the three values of µ.
Good software or web-based calculations of power and sample size use raw effect sizes.
However, Cohen’s tables are based on standardized effect sizes.
- For a two-sample t-test of independent means, the standardized effect size is “Cohen’s d”, which is the difference in the means for the two populations, divided by the pooled standard deviation.
- For one-way ANOVA, the standardized effect size is “Cohen’s f,” which is the standard deviation of the population means, divided by the within-population standard deviation [6].
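As a small worked illustration of these definitions, the following Python sketch computes a raw effect size, Cohen’s d, and Cohen’s f from population means and a common standard deviation. All numbers and function names are hypothetical choices of mine, and the f calculation assumes equal group sizes, so that the standard deviation of the means is the ordinary population standard deviation of the list of means.

```python
from statistics import pstdev


def largest_difference(means):
    """Raw effect size for several groups: the largest pairwise difference."""
    return max(means) - min(means)


def cohens_d(mean1, mean2, sd_pooled):
    """Difference in means divided by the pooled standard deviation."""
    return (mean1 - mean2) / sd_pooled


def cohens_f(means, sd_within):
    """SD of the population means divided by the within-population SD
    (assumes equal group sizes)."""
    return pstdev(means) / sd_within


means = [70, 75, 80]   # hypothetical population means on some test
print(largest_difference(means))          # raw effect size, in test points
print(cohens_f(means, sd_within=10))      # standardized, unitless

# The same 5-point raw difference gives different d values for different SDs.
print(cohens_d(75, 70, sd_pooled=10))
print(cohens_d(75, 70, sd_pooled=5))
```

Note that the same raw difference of 5 points corresponds to d = 0.5 when the standard deviation is 10 and to d = 1.0 when it is 5, so a standardized effect size cannot be interpreted without knowing the standard deviation behind it.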
One important difference between raw effect sizes and Cohen’s effect sizes is that raw effect sizes are much more interpretable in the context of the research question being studied. For example, if I asked a teacher, “What difference in mean test scores on this test would convince you that this group of people did better, on average, on the test than that group?” they would have some reasonable chance of giving me an estimate, based just on their experience with teaching and testing; coming up with an answer would not require any statistical knowledge other than what the mean is (which they presumably learned around sixth grade). But if I asked them, “What difference of means divided by the pooled standard deviation would convince you that this group did better than the other group?” most would likely either be clueless or think I was crazy.
The lack of contextual interpretability of Cohen’s effect sizes is just one reason why most software or web-based methods of calculating sample size are preferable to Cohen’s method [7]. Other reasons include:
- They typically allow you to specify an alpha level (input 2), rather than just choosing among those listed in Cohen’s tables. This allows you to use an alpha level that takes multiple testing into account.
- They require the researcher to either find and examine a previous study, or to carry out a pilot study to obtain an estimate of the “suitable standard deviation” (input 3). By avoiding this, Cohen’s method gives the researcher a sort of “free lunch.” Sorry, there is no free lunch. You need to do more work than Cohen’s method requires to obtain a reasonably good estimate of sample size.
- They allow you to specify power (input 4), rather than just choosing among a few preset values.
- They often can easily give you an overview of options that show you the tradeoffs between power, significance level, and sample size.
- Similarly, they may prompt you to see if you can use a variance-reducing design to obtain desired power with a smaller sample size.
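The kind of overview mentioned in the last few points can be sketched in a few lines of Python. The example below tabulates the approximate per-group sample size for a two-group comparison across several power levels and two alpha levels: the usual 0.05, and 0.05 divided by 3 as a Bonferroni-style adjustment for three tests. The Bonferroni adjustment is just one common way to account for multiple testing and is my choice for illustration. The calculation uses the normal approximation, and all inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(min_diff, sd, alpha, power):
    """Approximate per-group n for a two-sided, two-sample comparison of
    means (normal approximation, equal group sizes, common SD)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum * sd / min_diff) ** 2)


alphas = (0.05, 0.05 / 3)        # unadjusted, and adjusted for three tests
powers = (0.80, 0.90, 0.95)

# Hypothetical inputs: minimum difference of 5 points, SD of 10.
for alpha in alphas:
    row = ", ".join(
        f"power {p:.2f}: n={n_per_group(5, 10, alpha, p)}" for p in powers
    )
    print(f"alpha={alpha:.4f} -> {row}")

# A design that reduces the SD (say from 10 to 8) needs fewer subjects.
print(n_per_group(5, 8, alpha=0.05, power=0.80))
```

Reading across a row shows the cost of higher power, reading down shows the cost of a stricter alpha level, and the last line shows how a smaller standard deviation lowers the required sample size.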
Caution: Both Cohen’s and software or web-based methods of calculating sample size to get desired power are based on model assumptions. So if the context does not satisfy the model assumptions, then the sample size calculations cannot be trusted.
I will continue the discussion of power in the next post.
1. Cohen, J. (1992). A Power Primer. Psychological Bulletin (Quantitative Methods in Psychology section), Vol. 112, No. 1, 155–159.
2. a. “Suitable” depends on the type of hypothesis test being considered. Examples:
i. For an equal-variance two-sample t-test, an estimate of the common standard deviation would be used. (The pooled sample standard deviation in a previous or pilot study would be a rough estimate of this.)
ii. In a one-way ANOVA, the common standard deviation within groups (a model assumption for ANOVA) would be used. (The square root of the mean squared error from a previous or pilot study would be a rough estimate of this).
b. To avoid calculating a sample size that is too small to give the desired power, the researcher should use a standard deviation estimate that is on the large side of probable values. (See p. 51 of Dean and Voss, 1999, Design and Analysis of Experiments, Springer.) Thus researchers often use the square root of the upper limit of a confidence interval for the variance from a previous or pilot study, rather than using the actual estimates mentioned above.
3. As with most short definitions, this one can be deceiving if terms are not understood correctly. See p. 14 of this for the long definition. (Other parts of that document may also be worth reading to help understand statistical power.)
4. µ2 – µ1 could be chosen instead; the results of sample size calculations would be the same.
5. From Table 1, p. 157 of Cohen (1992) [See Note 1]
6. One of the model assumptions of one-way ANOVA is that all populations in the study have the same standard deviation of the outcome variable.
7. However, standardized effect sizes can play a useful role in meta-analysis, since their “scale-free” nature allows comparisons of studies with different instruments used to measure outcomes. But they still need to be used with caution, for reasons including the following:
- Many standardized effect sizes are based on model assumptions that might fit in some studies but not in others.
- Different experimental designs will produce different standard deviations. (For example, a block design usually produces a smaller standard deviation than an unblocked design; that is indeed the purpose of blocking.) Consequently, different experimental designs are likely to give different standardized effect sizes.