Beyond The Buzz Part VII: Practical vs Statistical Significance and Recommendations

As mentioned in several of the previous posts in this series, the comments I make in this post are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.

The preceding post explained why most software or web-based methods of calculating sample size to achieve desired power are preferable to Cohen’s method [1], which seems to be the method most often used in psychology research to estimate sample size.

One important reason given was that good software methods, unlike Cohen’s method, require the researcher to think about what would be a reasonable minimum raw effect size for the hypothesis test to be able to detect. Another way of saying this is that the researcher needs to think about practical significance, not just statistical significance.

This is important because if the sample size is large enough, a hypothesis test will reject a null hypothesis even if the difference between the null and alternate values is so small that it is of no practical importance. For example, a large enough clinical trial of a new drug might reject the null hypothesis “average lifespan in both experimental and control groups is the same” in favor of the alternate hypothesis “average lifespan in the experimental group is higher than in the control group,” when the difference in average lifespans between the two groups is just one day. That would not be a practically significant difference in lifespans.
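To make this concrete, here is a minimal sketch in Python of how the p-value for a fixed one-day difference shrinks as the sample grows. It assumes a one-sided two-sample z-test with a known, common standard deviation. The standard deviation of 3,650 days (about ten years), the sample sizes, and the function name are my own illustrative assumptions, not figures from any real trial.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(observed_diff, sd, n_per_group):
    """One-sided p-value for a two-sample z-test with known common sd.

    Null hypothesis: the two group means are equal.
    Alternative: the experimental group mean is higher.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = observed_diff / standard_error
    return 1.0 - NormalDist().cdf(z)


# Hypothetical inputs: a 1-day difference in mean lifespan,
# with an assumed standard deviation of 3650 days in each group.
DIFF_DAYS = 1.0
SD_DAYS = 3650.0

for n in (1_000, 1_000_000, 100_000_000):
    p = one_sided_p_value(DIFF_DAYS, SD_DAYS, n)
    print(f"n per group = {n:>11,}: p = {p:.4f}")
```

Under these assumptions, the same one-day difference is nowhere near significant with a thousand people per group, yet it falls below the conventional 0.05 level once the groups are large enough. Nothing about the practical importance of one day has changed; only the sample size has.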

Let’s consider the concept of practical significance in the context of the research on stereotype susceptibility that has been discussed in posts in this series.

Recall that the researchers used two outcome measures (Part I): “accuracy” and “number correct”. Since I have a lot of experience teaching math, including giving, grading, and trying to interpret the results of math exams, I asked myself the question posed in the last post to a hypothetical teacher:

What difference in mean test scores on this test would convince me that this group of people did better on the test than that group?

If I were to do a sample size calculation to achieve high power for a hypothesis test used for research in stereotype susceptibility, this is the value I would use as the effect size that would need to be detected (See Part VI).

Considering this question for outcome variable “accuracy,” my answer was in the blank stare/you’re crazy category, for the reasons outlined in Part I: I believe that accuracy, as defined in the literature on stereotype susceptibility, is not a good measure of performance on an exam, because (for example) it would rate a person who attempted eight questions and got six answers correct lower than a person who attempted only two questions and got both correct, whereas I believe that a “good” measure of performance should give the opposite ranking.

Considering the question for outcome variable “total number correct,” I had to think a little – but not very much, since I quickly remembered the maxim,

If your measuring instrument can’t distinguish a difference, then it’s not a meaningful difference to expect a hypothesis test to detect.

Since “number correct” (as defined in these studies) is always a whole number, I concluded that trying to detect a difference (even in mean scores) less than 1 would be unreasonable, so my answer to the question posed was, “a difference of at least 1.”
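As a rough illustration of how a minimum raw difference of 1 would feed into a sample size calculation, here is a sketch using the standard normal-approximation formula for comparing two means. It is not the specific software method discussed in the previous post. The standard deviation of 2, the significance level, the power, and the function name are all placeholder assumptions of mine; a real proposal would need to justify each of them.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(min_raw_diff, sd, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Uses the normal approximation
        n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / min_raw_diff) ** 2
    which slightly understates what an exact t-based calculation gives.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) * sd / min_raw_diff) ** 2)


# The minimum raw difference of 1 comes from the argument above.
# The standard deviation of 2 is a made-up placeholder, not a value from the studies.
print(n_per_group(min_raw_diff=1.0, sd=2.0))  # prints 63
```

Note that the calculation forces a choice of raw difference and standard deviation in the units of the test itself, which is exactly the thinking about practical significance that a standardized effect size lets a researcher skip.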

This minimum practically significant difference has a place in evaluating the results of studies as well as in figuring out suitable sample size: If a study is done well and results are not statistically significant, and if the raw effect size is below the threshold of practical significance, then these two pieces of information together provide good evidence to conclude that there is no “significant” (both practically and statistically) difference between the two hypotheses being compared.

This raises the question: In the replications on stereotype susceptibility, what were the raw effect sizes in the sample? Conveniently, Gibson et al. [2] gave (Table 1, p. 196) the estimates for mean number correct obtained in their study, in the other replication study (Moon et al.), and in the original Shih et al. study. Here is the portion of that table pertaining to outcome measure “correct responses” (with standard deviations omitted):

Sample                                Asian          Control        Female
                                      N    Mean      N    Mean      N    Mean
Current N = 158                       52   6.52      52   5.75      54   5.72
Current N = 127 aware only            40   6.93      44   5.73      43   5.60
Moon and Roeder N = 139               53   4.75      48   5.21      38   4.50
Moon and Roeder N = 106 aware only    42   4.83      37   5.19      27   4.30
Shih et al. (1999)                    16   5.37      16   5.31      14   4.71

From these, the raw effect sizes (i.e., the maximum differences in estimated means detected) were:

Current N = 158                       6.52 – 5.72 = 0.80
Current N = 127 aware only            6.93 – 5.60 = 1.33
Moon and Roeder N = 139               5.21 – 4.50 = 0.71
Moon and Roeder N = 106 aware only    5.19 – 4.30 = 0.89
Shih et al. (1999)                    5.37 – 4.71 = 0.66

Thus, in every sample except the Gibson et al. “aware only” sample, the raw effect size did not reach the minimum practically significant difference of 1.
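For readers who want to check the arithmetic, here is a short Python sketch that recomputes each raw effect size as the largest mean minus the smallest mean in a row, and compares it with the threshold of 1. The means are copied from the table above; the variable and function names are my own.

```python
# Mean number correct per condition, copied from the table above.
MEANS = {
    "Current N = 158": {"Asian": 6.52, "Control": 5.75, "Female": 5.72},
    "Current N = 127 aware only": {"Asian": 6.93, "Control": 5.73, "Female": 5.60},
    "Moon and Roeder N = 139": {"Asian": 4.75, "Control": 5.21, "Female": 4.50},
    "Moon and Roeder N = 106 aware only": {"Asian": 4.83, "Control": 5.19, "Female": 4.30},
    "Shih et al. (1999)": {"Asian": 5.37, "Control": 5.31, "Female": 4.71},
}

# Minimum practically significant difference argued for in the text.
MIN_PRACTICAL_DIFF = 1.0


def raw_effect_size(condition_means):
    """Largest difference between any two condition means in one sample."""
    values = condition_means.values()
    return max(values) - min(values)


effect_sizes = {
    sample: round(raw_effect_size(means), 2) for sample, means in MEANS.items()
}
meets_threshold = [
    sample for sample, diff in effect_sizes.items() if diff >= MIN_PRACTICAL_DIFF
]

for sample, diff in effect_sizes.items():
    flag = "meets" if diff >= MIN_PRACTICAL_DIFF else "below"
    print(f"{sample:<36} {diff:.2f}  ({flag} threshold)")
```

Note that the largest gap is not always between the same two conditions: in the Moon and Roeder samples the control group has the highest mean, whereas in the other samples the Asian-identity group does.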

RECOMMENDATION #4 for improving research proposals (and thereby improving research quality):

  • Proposers should not base proposed sample sizes on Cohen’s standardized effect sizes, but instead should use a software or web-based method that requires all four of the inputs listed in the previous post. In addition, proposers should list the type of calculation used as well as values of all of the inputs. They should also give sound justifications for their choices.
  • Reviewers of research proposals should check that calculations of proposed sample sizes have followed these guidelines, and that the justifications are sound on scientific grounds, not just on “what we’d like.”


1. Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.

2. Gibson, C. E., Losee, J., & Vitiello, C. (2014). A replication attempt of stereotype susceptibility: Identity salience and shifts in quantitative performance. Social Psychology, 45, 194–198.
