COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them



Common Mistakes involving Power

1. Rejecting a null hypothesis without considering practical significance.

Since power typically increases with sample size, a study with a large enough sample will have high power to detect even minuscule differences that are of no practical significance. Practical significance therefore needs to be considered along with statistical significance. See Type I and II Errors and Sample size calculations to plan an experiment (GraphPad.com) for examples.
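
As a rough numerical illustration (a minimal sketch assuming the statsmodels package, with arbitrary example numbers), the power of a two-sample t-test to detect a standardized difference of only 0.01 standard deviations approaches 1 once the groups are large enough, even though such a difference would rarely matter in practice:

```python
# Sketch: with a huge sample, even a negligible difference becomes "statistically significant".
# Assumes the statsmodels package; the effect size and sample sizes are arbitrary examples.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
tiny_effect = 0.01  # a difference of 0.01 standard deviations -- almost certainly unimportant

for n_per_group in (100, 10_000, 1_000_000):
    power = analysis.power(effect_size=tiny_effect, nobs1=n_per_group, alpha=0.05)
    print(f"n = {n_per_group:>9,} per group: power to detect d = 0.01 is {power:.3f}")
# With a million observations per group the test will almost always reject the null
# hypothesis, even though a 0.01 SD difference may have no practical significance.
```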

2. Accepting a null hypothesis when a result is not statistically significant, without taking power into account.


Since smaller samples yield lower power, a small study may be unable to detect an important difference. If there is strong evidence that the procedure has enough power to detect a difference of practical importance, then accepting the null hypothesis may be appropriate1; otherwise it is not -- all we can legitimately say then is that we fail to reject the null hypothesis.
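
For a rough sense of the problem, here is a minimal sketch (assuming the statsmodels package; the effect size of 0.5 SD and group size of 15 are hypothetical examples) of how little power a small study may have to detect a difference assumed to be practically important:

```python
# Sketch: a small study has little power to detect even a practically important difference,
# so a non-significant result is weak evidence for the null hypothesis.
# Assumes statsmodels; the effect size (0.5 SD) and group size (15) are hypothetical.
from statsmodels.stats.power import TTestIndPower

important_effect = 0.5  # a difference of 0.5 SD, assumed here to be practically important
power = TTestIndPower().power(effect_size=important_effect, nobs1=15, alpha=0.05)
print(f"Power with 15 per group: {power:.2f}")  # roughly 0.26
# With power this low, failing to reject the null hypothesis says little about whether
# an important difference exists; "fail to reject" is all that can be claimed.
```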

3. Being convinced by a research study with low power.

As discussed under Detrimental Effects of Underpowered Studies, underpowered studies are likely to give inconsistent results and are often misleading.

4. Neglecting to do a power analysis/sample size calculation before collecting data.

Without a power analysis, you may end up with a result that does not really answer the question of interest: the result may not be statistically significant, yet the study may have had too little power to detect a difference of practical importance. You might also waste resources by using a sample size that is larger than is needed to detect a relevant difference.
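
A sample size calculation done before collecting data can look like the following minimal sketch (assuming the statsmodels package; the smallest difference of practical importance, 0.5 SD, the desired power of 0.80, and the 0.05 significance level are hypothetical choices):

```python
# Sketch: a sample size calculation carried out before collecting data.
# Assumes statsmodels; the target effect size, power, and alpha are hypothetical choices.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"About {n_per_group:.0f} subjects per group are needed")  # roughly 64
# Planning this way helps avoid both an underpowered study and a study that
# uses far more subjects than the question requires.
```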

5. Neglecting to take multiple inference into account when calculating power.

If more than one inference procedure is used for a data set, then power calculations need to take that into account; doing a power calculation as if only one inference were planned will result in an underpowered study. For discussion of power analysis when using Efron's version of the false discovery rate, see Section 5.4 of B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press, or his Stats 329 notes.
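
One simple way to take multiple inference into account is to adjust the significance level used in the power calculation, as in the following sketch (assuming the statsmodels package and a plain Bonferroni adjustment; the number of tests, effect size, power, and alpha are hypothetical examples):

```python
# Sketch: taking multiple inference into account in a sample-size calculation,
# here via a simple Bonferroni adjustment to the significance level.
# Assumes statsmodels; the number of tests (10) and effect size (0.5 SD) are hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
m_tests = 10
alpha_single = 0.05
alpha_adjusted = alpha_single / m_tests  # Bonferroni: 0.005 per test

n_single = analysis.solve_power(effect_size=0.5, power=0.80, alpha=alpha_single)
n_adjusted = analysis.solve_power(effect_size=0.5, power=0.80, alpha=alpha_adjusted)
print(f"n per group, one test at alpha = 0.05:         {n_single:.0f}")
print(f"n per group, 10 tests at alpha = 0.05/10 each: {n_adjusted:.0f}")
# Sizing the study as if only one test were planned leaves it underpowered once the
# multiple-comparison adjustment is applied at the analysis stage.
```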


6. Using standardized effect sizes rather than considering the particulars of the question being studied.

"Standardized effect sizes" (sometimes called "canned" effect sizes) are expressions involving more than one of the factors that needs to be taken into consideration in considering appropriate levels of Type I and Type II error in deciding on power and sample size. Examples
For specific examples illustrating these points, see:
 Lenth, Russell V. (2001) Some Practical Guidelines for Effective Sample Size Determination, American Statistician, 55(3), 187 - 193 (Early draft available here.)
Lenth, Russell V. (2000) Two Sample-Size Practices that I Don't Recommend, comments from panel discussion at the 2000 Joint Statistical Meetings in Indianapolis.
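
As a rough illustration of why a standardized effect size alone can be misleading, the sketch below (with purely hypothetical outcome numbers) shows that the same raw difference can correspond to very different values of Cohen's d depending on the standard deviation, so Cohen's labels say nothing by themselves about whether the difference matters for the question at hand:

```python
# Sketch: the same raw difference can map to very different standardized effect sizes,
# so a "canned" value of Cohen's d does not by itself capture practical importance.
# The raw differences and standard deviations below are hypothetical examples.
def cohens_d(raw_difference: float, sd: float) -> float:
    """Cohen's d: the raw difference expressed in standard-deviation units."""
    return raw_difference / sd

# A 5-point difference on some outcome scale:
print(cohens_d(raw_difference=5.0, sd=10.0))  # d = 0.5 ("medium" by the usual labels)
print(cohens_d(raw_difference=5.0, sd=50.0))  # d = 0.1 ("small"), same 5-point difference
# Whether a 5-point difference matters depends on the subject-matter question,
# not on which of Cohen's labels the standardized value happens to fall under.
```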

7. Confusing retrospective power and prospective power.

Prospective power is calculated before the data are collected, using a difference of practical importance, in order to plan the study. Retrospective (or "post hoc" or "observed") power is calculated after the analysis from the observed effect size; it is essentially a restatement of the observed p-value and does not answer the planning question.
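
The following minimal sketch (assuming the scipy and statsmodels packages; the simulated data are purely illustrative) shows how "observed" power computed from the sample effect size is tied to the test result itself:

```python
# Sketch: "retrospective power" computed from the observed effect size is just a
# transformation of the test result, so it adds no new information.
# Assumes scipy and statsmodels; the simulated data are purely illustrative.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=20)
b = rng.normal(loc=0.3, scale=1.0, size=20)

t_stat, p_value = stats.ttest_ind(a, b)

# "Observed" (retrospective) power, plugging the sample effect size back in:
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
observed_d = (b.mean() - a.mean()) / pooled_sd
observed_power = TTestIndPower().power(effect_size=abs(observed_d), nobs1=20, alpha=0.05)

print(f"p = {p_value:.3f}, observed power = {observed_power:.2f}")
# Given n and alpha, observed power is a monotone function of the p-value: a
# non-significant result always corresponds to low observed power (roughly 0.5 or
# less), so quoting it adds nothing. Prospective power, in contrast, uses an effect
# size chosen in advance for its practical importance.
```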

Notes
1. In many cases, however, it would be best to use a test for equivalence. For more information, see:
2. See also Figure 1 of Richard H. Browne (2010), "The t-Test p Value and Its Relationship to the Effect Size and P(X>Y)," The American Statistician, 64(1), p. 31. This shows that, for the two-sample t-test, Cohen's classification of d = 0.8 as "large" still gives substantial overlap between the two distributions being compared; d needs to be close to 4 to result in minimal overlap of the distributions.

Last updated August 28, 2012