Another Mixed Bag: Gigerenzer’s Mindless Statistics

I recently saw a link on Andrew Gelman’s blog to a blog post by John Kruschke that had a link to Gerd Gigerenzer’s 2004 paper Mindless Statistics, which I had not seen before.

As the title of this post indicates, I think the Gigerenzer paper is a mix of good points and one big questionable point.

The good points (for which I refer you to the appropriate parts of his paper for elaboration):

  • The “identification” of the “null ritual” (Section 1)
  • The discussion of what Fisher and Neyman-Pearson actually said (Section 2)
  • The discussion of different interpretations of the phrase “level of significance” (Section 3)1
  • The discussion of Oakes’ and Haller and Krause’s studies of misinterpretations of statistical significance (Section 4)
  • The account of G. Loftus’ efforts to change journal policies (Section 5)
  • The discussion of Meehl’s conjecture (Section 7)
  • The discussion of Feynman’s conjecture (Section 8)
  • The repeated references to the “statistical toolbox” (discussed most explicitly in Section 1) and “statistical thinking”

But I have serious reservations about his Section 6, which starts,

 “Why do intelligent people engage in statistical rituals rather than in statistical thinking? Every person of average intelligence can understand that p(D|H) is not the same as p(H|D). That this insight fades away when it comes to hypothesis testing suggests that the cause is not intellectual but social and emotional. Here is a hypothesis …: The conflict between statisticians, both suppressed by and inherent in the textbooks, has become internalized in the minds of researchers. The statistical ritual is a form of conflict resolution, like compulsive hand washing, which makes it resistant to arguments. To illustrate this thesis, I use the Freudian unconscious conflicts as an analogy.”

I do appreciate that he labels his hypothesis as such. However, I see several problems with it:

  1. He gives no evidence to support it.
  2. He “illustrates” his thesis via the Freudian theory of unconscious conflicts, a theory I can’t see as anything but a religious-type belief.
  3. Similarly, his analogy with compulsive hand-washing seems overextended.
  4. He rather cavalierly dismisses (without giving any substantive supporting evidence) what I believe are genuine intellectual problems in understanding frequentist inference and acquiring a good facility with statistical thinking.

I would not argue that there are no emotions, conflicts, or conflicting incentives involved in the widespread acceptance of statistical “rituals.” But my experience as a teacher of both mathematics and statistics indicates that there are also intellectual challenges that contribute to the problem, and that addressing these challenges is important in improving statistical practice.

To elaborate:

First, Gigerenzer gives no evidence to back up his claim that “every person of average intelligence can understand that p(D|H) is not the same as p(H|D)”. I can’t say for sure that he is wrong, since I don’t have much experience teaching people of average intelligence: some of my teaching has been at elite universities, and most of it has been at the University of Texas at Austin, which, although not “elite,” has mostly students who were in the top ten percent of their high school graduating class, and thus presumably are above average in intelligence. Moreover, most of the students I have taught have been majoring in STEM fields. So I can offer only anecdotal evidence about students of above average intelligence, usually with an interest in math, science, or technology. But if even they have intellectual difficulties, then those difficulties are likely more widespread among people of average intelligence.

I have found in teaching advanced math courses for math majors that a substantial number of these students have difficulty (at least at first) distinguishing between a statement and its converse – that is, distinguishing between “A implies B” and “B implies A”. Sure, if the implication is stated that way, then they pick up on the idea pretty quickly. But there are lots of ways of stating an implication. (For example, “A whenever B” says “B implies A”.) Identifying hypothesis and conclusion in an implication (which is necessary to distinguish a statement from its converse) is much harder in these less straightforward formulations; a fair percentage of students struggle a lot with this, particularly when the problem is embedded in a context.

I have found in teaching probability and statistics that the same type of difficulty arises in distinguishing between p(D|H) and p(H|D): Students catch on fairly easily for simple situations, but as situations become more complex, they make more mistakes. And the situation is really very complex for frequentist hypothesis testing. So at the very least, very careful teaching is needed to build true understanding. That’s why I have learned to “test” students on questions more-or-less like those in Section 4 of Gigerenzer’s paper when teaching hypothesis testing. Most students struggle intellectually with such questions – but having their incorrect answers pointed out and explained does seem to help some of them understand.
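To make the distinction concrete, here is a tiny numerical sketch, with made-up numbers (the prior probability and the power below are my own illustrative assumptions, not values from any study), of how p(D|H) and p(H|D) can differ substantially:

```python
# Made-up numbers purely for illustration.
# H: the (null) hypothesis is true; D: the data reach "significance".
p_H = 0.9             # hypothetical prior probability that H is true
p_D_given_H = 0.05    # p(D|H): the significance level
p_D_given_notH = 0.5  # hypothetical power against the alternative

# Bayes' rule: p(H|D) = p(D|H) * p(H) / p(D)
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)
p_H_given_D = p_D_given_H * p_H / p_D

print(p_D_given_H)            # 0.05
print(round(p_H_given_D, 2))  # 0.47
```

So under these made-up assumptions, a “significant” result still leaves the null hypothesis with nearly a 50% chance of being true – the two conditional probabilities are not interchangeable.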

Indeed, I have learned that few students really understand the concepts of sampling distribution, p-value, and confidence interval after just one course. More seem to catch on in a second course, but some still get seriously stuck in misinterpretations. I always reviewed these basics in a graduate course (e.g., regression or ANOVA), because I was aware of this phenomenon. And bear in mind,

  • These students are of above average intelligence
  • They are (with very few exceptions) not psychology students – they typically are in fields such as math, statistics, biology, engineering, or business.
  • Most of them have not yet been immersed in a “publish or perish” environment.

So from this experience, I believe that there are substantive intellectual challenges involved in understanding frequentist statistics, especially to the point of applying it meaningfully in a specific context. Ignoring these challenges in favor of emotional explanations such as Gigerenzer hypothesizes won’t solve the problem.


Footnote to title: The first mixed bag I blogged about was Simmons et al’s now famous paper; see my earlier posts on that paper.

1. However, I think this section would have been better without the introduction of Dr. Publish-Perish’s superego and feelings of guilt.

More on Teaching for Reproducibility

A few months ago, I mentioned efforts to teach for reproducibility in intro stats courses. Today, the comments in Andrew Gelman’s blog post had a link to a course (taught by University of Wisconsin Biostatistician Karl Broman) that is entirely about Tools for Reproducible Research. (This link includes recommended books, a link to a list of further resources, and a Schedule link that includes links to course notes.)

Added January 19, 2015: I received a comment pointing out that Coursera also has a course, Reproducible Research.

Some Much Needed Attention to Multiple Testing in the Psychology Literature

In several earlier posts (A Mixed Bag; More re Simmons et al, Part I; More re Simmons et al, Part II; More re Simmons et al, Part III; Though Many Social Psychologists May Be Unaware, Multiple Testing Often Leads to Multiple Spurious Conclusions; and Beyond the Buzz Part IV: Multiple Testing), I have commented on the lack of accounting for multiple testing in the psychology literature. So I was pleased to receive notice of a paper “Hidden Multiplicity in the Multiway ANOVA: Prevalence, Consequences and Remedies” that has been submitted by Cramer et al to PLoS ONE and is currently available on the arXiv.

In the paper, the authors examined all articles published in 2010 in six widely read and cited psychology journals, identifying those articles using a multiway ANOVA. They found that close to half of the articles did indeed use a multiway ANOVA, but only about 1% of those papers used some correction for multiple testing (with percentages in individual journals ranging from 0% to 2.59%).

They then randomly chose 10 papers from each journal that involved at least one multiway ANOVA, and used the Holm-Bonferroni procedure to adjust for multiple testing. They found that in 75% of the articles, at least one test that had been declared significant could no longer qualify as significant after the Holm-Bonferroni correction. Regrettably, this does not surprise me, based on the articles in psychology journals that I have read.
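For readers unfamiliar with it, the Holm-Bonferroni step-down procedure is simple enough to sketch in a few lines: sort the p-values, compare the i-th smallest to alpha/(m − i), and stop rejecting at the first failure. (The p-values below are made up for illustration.)

```python
def holm_bonferroni(p_values, alpha=0.05):
    # Step-down Holm-Bonferroni: compare the rank-th smallest p-value
    # to alpha / (m - rank); stop rejecting at the first failure.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from four tests in a multiway ANOVA:
ps = [0.001, 0.013, 0.04, 0.19]
print(holm_bonferroni(ps))  # [True, True, False, False]
```

Note that the third test (p = .04) would be declared significant at the usual .05 level, but no longer qualifies after the Holm-Bonferroni correction – exactly the pattern Cramer et al found in 75% of the re-analyzed articles.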

I hope this paper gets wide circulation and helps change practices in the field.

From the New Book Shelf, Summer 2014

Comments on two books I saw on the new bookshelf in the past few months:

I. Bartholomew, David J., Unobserved Variables: Models and Misunderstandings, Springer Briefs, 2013

This is a delightful little book, a pleasure to read. From the abstract:

“Although unobserved variables go under many names there is a common structure underlying the problems in which they occur. The purpose of this Brief is to lay bare that structure and to show that the adoption of a common viewpoint unifies and simplifies the presentation.  Thus, we may acquire an understanding of many disparate problems within a common framework. … The use of [methods where unobserved variables occur] has given rise to many misunderstandings which, we shall argue, often arise because the need for a statistical, or probability, model is unrecognized or disregarded. A statistical model is the bridge between intuition and the analysis of data.”

The explanations seem at a good level – enough detail so as not to be vague, but not so much that one gets bogged down. I particularly enjoyed the section on mixture models.

The only two drawbacks:

1. The proofreading is sloppy in places, but not so much as to detract from understanding.

2. The cost – $40 even for a used copy of a book that is only 86 pages.


 II.  Sabo and Boone, Statistical Research Methods: A Guide for Non-Statisticians, Springer 2013

Based just on a quick glance, my impression was at best mixed — I would not recommend this as a textbook or reference, although it does have a few good points. Examples of Pros and Cons (cons first, since my overall recommendation is negative):


  • No index [Perhaps really just intended for their teaching?]
  • The introduction talks about “representative samples” rather than “random samples.” This may be well-intentioned, but it is likely to lead to misunderstanding. Quote:

“The idea is that if a sample is representative of a population, the numeric or mathematical characteristics of that population will be present in the sample. This attribute will ensure that statistical analysis of the sample would yield similar results to a (hypothetical) statistical analysis of the population.” (p. 4).

[Note: on p. 16, they do mention random sample, but still talk about “representative”.]

  • They use probability in both frequentist and Bayesian ways, but don’t point out the difference (that I could find). Example: they seem to present only frequentist methods, but say (p. 19)

“This is one of the most important ideas in this entire textbook: if the data do not support a given assumption, then that assumption is most likely not true. On the other hand, if our data did support our assumption, then we would conclude that the assumption is likely to be true (or at least more likely than some alternative).”

(Note: They haven’t discussed power at this point.)

  • (p. 167)

“As mentioned earlier, statisticians can make only two mistakes (all others must have been made by someone else): we can falsely reject a true null hypothesis (type I error), or we can falsely fail to reject the null hypothesis when the alternative hypothesis is true (type II error).”

Is this a joke??


  • p. 28: Points out that the level of confidence of a confidence interval

“is often taken as the quantification of our belief that the true population parameter resides within the estimated confidence interval; this is false. … Rather, the confidence level reflects our belief in the process of constructing confidence intervals, so that we believe that 95% of our estimated confidence intervals would contain the true population parameter, if we could repeatedly sample from the same population. This is an important distinction that underlies what classical statistical methods and inference can and cannot state (i.e., we don’t know anything about the population parameter, only our sample data).”

  • p. 168: Mentions all four components (significance level, expected variability in response, desired effect size, and sample size) that “are interrelated and each affects power in different ways,” and gives each a paragraph.
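The book’s point about confidence level describing the process, not any single interval, can be checked with a small simulation. This is a sketch with arbitrary parameters of my own choosing, using a known-sigma z interval for simplicity:

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)
z = NormalDist().inv_cdf(0.975)   # about 1.96, for 95% confidence

mu, sigma, n, reps = 10.0, 3.0, 30, 2000
covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    half = z * sigma / sqrt(n)    # known-sigma z interval, for simplicity
    xbar = mean(sample)
    covered += (xbar - half <= mu <= xbar + half)

# Roughly 95% of the intervals contain the true mean; any single
# interval either does or does not contain it.
print(covered / reps)
```

The long-run coverage is close to .95, but nothing probabilistic can be said about any one computed interval – which is exactly the distinction the quoted passage makes.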

Beyond the Buzz Part VIII: Outliers

Most of my posts so far in the Beyond the Buzz series have been more critical than positive, so I’m happy to post about a couple of things that are strongly positive.

Both stem from Monday’s SPSP* blog post, Without Replications We Will All Die**, by Jelte Wicherts.

I. Wicherts’ concluding paragraph is worth quoting, both for its own sake and as a teaser to encourage you to read his entire blog post:

 “Science is not free from errors, big egos, competition, and personal biases. But in science honest researchers should not have anything to hide or to worry about. Science is the place where we all make errors and try to deal with that somehow. Science is where disagreements sharpen our thoughts but should not make us angry. Science is where we replicate to deal with our necessary doubts. And science is where a PhD student sitting in a small university office can set out to replicate findings to contradict the statements made by a Nobel prize winner, without there ever being any animosity or fear. That is the true DNA of science and that is what makes it thrive.”

II. Reading the post prompted me to follow the link provided at the end to Wicherts’ home page, and so to look at some of his work. In particular, I found the article Bakker, M. & Wicherts, J. M. (2014). Outlier removal, sum scores, and the inflation of the type I error rate in independent samples t tests: The power of alternatives and recommendations. Psychological Methods, in press (available here) particularly interesting. Here are some of its highlights, with some comments in footnotes:

1. The authors surveyed publications from 2001 to 2010 in a few thoughtfully chosen journals, identified the articles using the word “outlier” in the text, and selected a random sample of 25 such articles from each journal involved***. The selected articles were examined in more detail. Of these, 77% said that outliers had been removed before starting analysis.**** Forty-six percent identified outliers on the basis of having a z-score exceeding some threshold.***** The authors also discuss other questionable criteria for removing outliers. Only 18% of the articles reported analyses both with and without outliers.

2. The authors performed and reported on a variety of simulations to investigate the Type I error rate when outliers are removed in performing a t test for two independent samples from the same distribution. One simulation assumed a normal distribution. Others were devised to simulate distributions likely to be encountered in psychological research. Two large real datasets were also used in simulations. Simulations were performed for four different sample sizes and for different thresholds for removing outliers, and also for a “data-snooping” choice of threshold. Nominal Type I error rate .05 was used. Some results:

  • In the case of a normal distribution, the actual Type I error rate could exceed .10 (e.g., for threshold 2 and large sample size), but it approached .05 as the threshold approached 4.
  • In some other cases, actual Type I error rate was as large as .15 – and as large as .45 in a “data snooping” simulation.
  • The conclusion: “The removal of outliers is therefore not recommended.”

3. The authors report more simulations regarding power, and end with a list of recommendations for good practice.
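A minimal sketch of the kind of simulation described in item 2 above: normal data, per-group z-score removal at threshold 2, and a large-sample normal approximation in place of the exact t test. All settings here are my own illustrative choices, not the authors’ exact design.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)

def z_filter(xs, threshold):
    # Remove values whose within-group z-score exceeds the threshold.
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs((x - m) / s) <= threshold]

def two_sample_p(a, b):
    # Large-sample (normal-approximation) two-sided two-sample test.
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

n, reps, alpha = 100, 2000, 0.05
raw = removed = 0
for _ in range(reps):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]  # same distribution: H0 true
    raw += two_sample_p(a, b) < alpha
    removed += two_sample_p(z_filter(a, 2), z_filter(b, 2)) < alpha

print(raw / reps)      # near the nominal .05
print(removed / reps)  # inflated above .05
```

Removing the tails of each sample shrinks the estimated standard deviations more than it moves the means, so the test statistic is computed against an understated yardstick – and the rejection rate under a true null climbs above the nominal level, consistent with the paper’s findings.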


*Society for Personality and Social Psychology

** The title is a kind of pun, referring both to DNA replication and to replication of studies in psychology.

*** Except one journal, which had only 12 articles mentioning “outlier”; all 12 of these were examined, giving a total of 137 examined articles.

****I try to teach my students not to remove outliers unless there is good reason to believe that they are recording or measurement errors, so I groaned here.

***** More groans – my gut reaction here was, “Oh no! This is likely to mess up the Type I error rate!” Happily, the authors proceed to explain intuitive reasons why removal of outliers so identified is not a good idea, before proceeding as noted with simulations providing evidence that indeed, the actual Type I error rate can increase considerably when outliers beyond a threshold are removed.


Beyond The Buzz Part VII: Practical vs Statistical Significance and Recommendations

As mentioned in several of the previous posts in this series, the comments I make in this post are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.

The preceding post explained why most software or web-based methods of calculating sample size to achieve desired power are preferable to Cohen’s method1, which seems to be the method most often used in psychology research to estimate sample size.

One important reason given was that good software methods, unlike Cohen’s method, require the researcher to think about what would be a reasonable minimum raw effect size for the hypothesis test to be able to detect. Another way of saying this is that the researcher needs to think about practical significance, not just statistical significance.

This is important because if the sample size is large enough, a hypothesis test will reject a null hypothesis even if the difference between the null and alternate values is so small that it is of no practical importance. For example, a large enough clinical trial of a new drug might reject the null hypothesis “average lifespan in both experimental and control groups is the same” in favor of the alternate hypothesis “average lifespan in the experimental group is higher than in the control group,” when the difference in average lifespans between the two groups is just one day. That would not be a practically significant difference in lifespans.
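A back-of-the-envelope sketch of this point. The numbers – a 10-year standard deviation of lifespan and a one-day true difference in means – are hypothetical, chosen only to show how sample size alone drives the verdict:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers: lifespans with a 10-year SD, and a true mean
# difference of one day (expressed in years).
sigma = 10.0
delta = 1 / 365.25

def p_value(n_per_group):
    # z statistic for a two-sided two-sample test when the observed
    # mean difference equals the true difference delta.
    se = sigma * sqrt(2 / n_per_group)
    z = delta / se
    return 2 * (1 - NormalDist().cdf(z))

print(p_value(10_000))         # far from significant
print(p_value(1_000_000_000))  # rejects at any conventional level
```

The one-day difference is the same in both cases; only the sample size changes. Statistical significance therefore cannot, by itself, tell us whether a difference matters.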

Let’s consider the concept of practical significance in the context of the research on stereotype susceptibility that has been discussed in posts in this series.

Recall that the researchers used two outcome measures (Part I): “accuracy” and “number correct”. Since I have a lot of experience teaching math, including giving, grading, and trying to interpret the results of math exams, I asked myself the question posed in the last post to a hypothetical teacher:

What difference in mean test scores on this test would convince me that this group of people did better on the test than that group?

If I were to do a sample size calculation to achieve high power for a hypothesis test used for research in stereotype susceptibility, this is the value I would use as the effect size that would need to be detected (See Part VI).

Considering this question for outcome variable “accuracy,” my answer was in the blank stare/you’re crazy category, for the reasons outlined in Part I: I believe that accuracy, as defined in the literature on stereotype susceptibility, is not a good measure of performance on an exam, because (for example) it would rate a person who attempted eight questions and got six answers correct lower than a person who attempted only two questions and got both correct, whereas I believe that a “good” measure of performance should give the opposite ranking.

Considering the question for outcome variable “total number correct,” I had to think a little – but not very much, since I quickly remembered the maxim,

If your measuring instrument can’t distinguish a difference, then it’s not a meaningful difference to expect a hypothesis test to detect.

Since “number correct” (as defined in these studies) is always a whole number, I concluded that trying to detect a difference (even in mean scores) less than 1 would be unreasonable, so my answer to the question posed was, “a difference of at least 1.”

This minimum practically significant difference has a place in evaluating the results of studies as well as in figuring out suitable sample size: If a study is done well and results are not statistically significant, and if the raw effect size is below the threshold of practical significance, then these two pieces of information together provide good evidence to conclude that there is no “significant” (both practically and statistically) difference between the two hypotheses being compared.

This raises the question: In the replications on stereotype susceptibility, what were the raw effect sizes in the sample? Conveniently, Gibson et al2 gave (Table 1, p. 196) the estimates for mean number correct obtained in their study, in the other replication study (Moon et al), and in the original Shih et al study. Here is the portion of that table pertaining to outcome measure “correct responses” (with standard deviations omitted):

                                       Asian       Control     Female
                                       N   Mean    N   Mean    N   Mean
Current N = 158                        52  6.52    52  5.75    54  5.72
Current N = 127 aware only             40  6.93    44  5.73    43  5.60
Moon and Roeder N = 139                53  4.75    48  5.21    38  4.50
Moon and Roeder N = 106 aware only     42  4.83    37  5.19    27  4.30
Shih et al. (1999)                     16  5.37    16  5.31    14  4.71

From these, the raw effect sizes (i.e., the maximum differences in the estimated means) were:

Current N = 158                        6.52 – 5.72 = 0.80
Current N = 127 aware only             6.93 – 5.60 = 1.33
Moon and Roeder N = 139                5.21 – 4.50 = 0.71
Moon and Roeder N = 106 aware only     5.19 – 4.30 = 0.89
Shih et al. (1999)                     5.37 – 4.71 = 0.66

Thus, in all groups except the Gibson et al “aware only” group, the raw effect size did not reach the minimum practically significant level of 1.

RECOMMENDATION #4 for improving research proposals (and thereby improving research quality):

  • Proposers should not base proposed sample sizes on Cohen’s standardized effect sizes, but instead should use a software or web-based method that requires all four of the inputs listed in the previous post. In addition, proposers should list the type of calculation used as well as the values of all of the inputs. They should also give sound justifications for their choices.
  • Reviewers of research proposals should check that calculations of proposed sample sizes have followed these guidelines, and that the justifications are sound on scientific grounds, not just on “what we’d like.”


1. Cohen, J. (1992). A Power Primer. Quantitative Methods in Psychology, Vol. 112, No. 1, 155–159.

2.  Gibson, C. E., Losee, J., & Vitiello, C. (2014). A replication attempt of stereotype susceptibility: Identity salience and shifts in quantitative performance. Social Psychology, 45, 194–198.

Beyond The Buzz Part VI: Better ways of calculating power and sample size

In the last post, I commented on how Cohen’s 1992 paper1 moved psychology research from ignoring power to using a simplified method proposed by Cohen for determining sample size to obtain some levels of power. But Cohen’s method has some serious limitations, and now most computer software can quickly calculate the sample size needed to achieve specified power for standard statistical procedures (such as one-way ANOVA), given suitable inputs – indeed, there are even online calculators for this task (e.g., here). This makes sample size calculations much easier than they were in the early nineties. Moreover, deciding on the values of the inputs (see below) needed to calculate sample size requires researchers to think more carefully about their planned research than Cohen’s simplified method does. This thinking in itself is likely to improve the quality of research.

Inputs typically required to calculate sample size for basic statistical procedures are:

  1. A minimum difference you would like to detect.
  2. An alpha level.
  3. (An estimate of) a suitable standard deviation2.
  4. The power you would like the study to have.

Before discussing these inputs, it is helpful to recall what statistical power is. The short3 definition is:

The power of a hypothesis test is the probability that the test will detect a true difference of a specified size.

 This “true difference” is sometimes called an effect size. It’s important to distinguish between two kinds of effect size, which are sometimes called raw effect size and standardized effect size.

A raw effect size is typically the difference between the value of some quantity in the null hypothesis, and another value of that quantity that we wish to compare it to. This is typically what is called for in Input 1 listed above.

Example: In research on stereotype susceptibility, with outcome variable Y (e.g., a measure of performance on an exam), the researchers might perform a hypothesis test with null and alternate hypotheses

H0: µ1 = µ2     and      Ha:  µ1 ≠ µ2,

respectively. Here µ1 is the mean of Y for population 1 (e.g., subjects with minority identity salient) and µ2 is the mean of Y for population 2 (e.g., subjects without minority identity salient). The “raw” effect size here is µ1 – µ2.4

Similarly, if we were interested in comparing three populations, we would consider an ANOVA test with null and alternate hypotheses

H0: µ1 = µ2 = µ3         and      Ha: At least two of µ1, µ2 , and µ3 are different

The corresponding “raw” effect size would be the largest of the differences between the three values of µ.

Good software or web-based calculations of power and sample size use raw effect sizes.

However, Cohen’s tables are based on standardized effect sizes5.


  • For a two-sample t-test of independent means, the standardized effect size is “Cohen’s d”, which is the difference in the means for the two populations, divided by the pooled standard deviation.
  • For one-way ANOVA, the standardized effect size is “Cohen’s f,” which is the standard deviation of the population means, divided by the within-population standard deviation6.

One important difference between raw effect sizes and Cohen’s effect sizes is that raw effect sizes are much more interpretable in the context of the research question being studied. For example, if I asked a teacher, “What difference in mean test scores on this test would convince you that this group of people did better, on average, on the test than that group?”, they would have some reasonable chance of giving me an estimate, based just on their experience with teaching and testing; coming up with an answer would not require any statistical knowledge other than what the mean is (which they presumably learned around sixth grade). But if I asked them, “What difference of means divided by the pooled standard deviation would convince you that this group did better than the other group?” most would likely either be clueless or think I was crazy.

The lack of interpretability in context of Cohen’s effect sizes is just one reason why most software or web-based methods of calculating sample size are preferable to Cohen’s7. Other reasons include:

  • They typically allow you to specify an alpha level (input 2), rather than just choose between those listed in Cohen’s tables. This allows you to use an alpha level that takes multiple testing into account.
  • They require the researcher to either find and examine a previous study, or to carry out a pilot study, to obtain an estimate of the “suitable standard deviation” (input 3). By avoiding this, Cohen’s method gives the researcher a sort of “free lunch.” Sorry, there is no free lunch. You need to do more work than Cohen’s method requires to obtain a reasonably good estimate of sample size.
  • They allow you to specify power (input 4), rather than just choosing between a few choices.
  • They often can easily give you an overview of options that show you the tradeoffs between power, significance level, and sample size.
  • Similarly, they may prompt you to see if you can use a variance-reducing design to obtain desired power with a smaller sample size.

Caution: Both Cohen’s and software or web-based methods of calculating sample size to get desired power are based on model assumptions. So if the context does not satisfy the model assumptions, then the sample size calculations cannot be trusted.
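For concreteness, here is a minimal normal-approximation sketch (not the method of any particular package, and subject to the same caution about model assumptions) of how the four inputs determine per-group sample size for a two-sided two-sample comparison:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, alpha, sigma, power):
    # Inputs 1-4: minimum raw difference to detect, alpha level,
    # standard deviation estimate, and desired power.
    # Normal approximation for a two-sided two-sample test:
    #   n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2 per group
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

# Hypothetical inputs: detect a raw difference of 1 point, SD estimate 2.
print(n_per_group(delta=1.0, alpha=0.05, sigma=2.0, power=0.80))  # 63
```

Exact t-based software gives a slightly larger answer (64 per group for these inputs), but the point stands: the researcher must supply a raw difference worth detecting and a standard deviation estimate, which is exactly the thinking Cohen’s shortcut lets one skip.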

 I will continue the discussion of power in the next post.


1. Cohen, J. (1992). A Power Primer. Quantitative Methods in Psychology, Vol. 112, No. 1, 155–159.

2. a. “Suitable” depends on the type of hypothesis test being considered. Examples:
i. For an equal-variance two-sample t-test, an estimate of the common standard deviation would be used. (The pooled sample standard deviation from a previous or pilot study would be a rough estimate of this.)
ii. In a one-way ANOVA, the common standard deviation within groups (a model assumption for ANOVA) would be used. (The square root of the mean squared error from a previous or pilot study would be a rough estimate of this).

b. To avoid calculating a sample size that is too small to give the desired power, the researcher should use a standard deviation estimate that is on the large side of probable values. (See p. 51 of Dean and Voss, 1999, Design and Analysis of Experiments, Springer.) Thus researchers often use the square root of the upper limit of a confidence interval for the variance from a previous or pilot study, rather than using the actual estimates mentioned above.

3. As with most short definitions, this one can be deceiving if terms are not understood correctly. See p. 14 of this for the long definition. (Other parts of that document may also be worth reading to help understand statistical power.)

4. µ2 – µ1 could be chosen instead; the results of sample size calculations would be the same.

5. From Table 1, p. 157 of Cohen (1992) [See Note 1]

6. One of the model assumptions of one-way ANOVA is that all populations in the study have the same standard deviation of the outcome variable.

7. However, standardized effect sizes can play a useful role in meta-analysis, since their “scale-free” nature allows comparisons of studies with different instruments used to measure outcomes. But they still need to be used with caution, for reasons including the following:

  • Many standardized effect sizes are based on model assumptions that might fit in some studies but not in others.
  • Different experimental designs will produce different standard deviations. (For example, a block design usually produces a smaller standard deviation than an unblocked design; that is indeed the purpose of blocking.) Consequently, different experimental designs are likely to give different standardized effect sizes.


Beyond The Buzz Part V: Power and Cohen

This is a continuation of a series of posts on common missteps in statistical practice, prompted by the recent special issue of Social Psychology featuring registered replications. As with the previous posts, I will illustrate with studies (two from the special issue, and others that preceded those) on stereotype susceptibility.  As mentioned in the first post in the series, I chose this topic only because it is one with which I have previous experience; the comments I make in this post are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.

The Gibson et al article in the special issue states (p. 195), “A total of 164 Asian Female college students participated in this study, with approximately 52 in each condition so as to detect a medium effect size (r = .35; Shih et al., 1999, p. 81) with 80% power (Cohen, 1992)”

I checked out the Cohen reference. Indeed, Table 2 (p. 158) of Cohen’s paper1 lists 52 participants per group for a one-way ANOVA with three groups, medium effect size, and significance level .05 to have power .80.
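As a check on the table value, the calculation can be reproduced from the noncentral F distribution. Here is a sketch in Python (assuming scipy is available; Cohen’s “medium” effect size for one-way ANOVA corresponds to f = 0.25):

```python
# Sketch: smallest per-group n for a one-way ANOVA, via the noncentral F
# distribution. Cohen's "medium" effect size for ANOVA is f = 0.25.
from scipy.stats import f as f_dist, ncf

def anova_n_per_group(f_effect, k_groups, alpha=0.05, target_power=0.80):
    """Smallest n per group giving at least target_power."""
    for n in range(2, 1000):
        dfn, dfd = k_groups - 1, k_groups * (n - 1)
        crit = f_dist.ppf(1 - alpha, dfn, dfd)        # rejection cutoff
        nc = f_effect ** 2 * k_groups * n             # noncentrality parameter
        power = 1 - ncf.cdf(crit, dfn, dfd, nc)       # P(reject | effect f)
        if power >= target_power:
            return n

# Cohen's Table 2 lists 52 per group; the exact computation lands at
# 52 or 53 depending on rounding conventions.
print(anova_n_per_group(0.25, 3))
```

The exact answer differs from Cohen’s tabled 52 by at most one participant per group, so the Gibson et al figure is consistent with the table for the overall F-test.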

But as I thought about it more and read more of Cohen’s paper, additional concerns arose:

  1. I remembered that Gibson et al also tested contrasts, so I wondered if Cohen’s sample size included these or just the overall F-test. I’m still not completely sure, but Cohen’s item 7 on p. 157 suggests that the sample sizes in Table 2 just refer to the F-test.
  2. If Cohen’s sample size value for ANOVA does not cover the contrasts, then my best guess from his paper would be that they would be covered under his values for two-sample t-tests. However, according to his Table 2, these would require 64 in each group to detect a medium effect with power .80.
  3. The figures checked out above assume the significance level alpha is .05. But, as remarked in the preceding post, using the simple Bonferroni method to account for multiple testing to give a FWER (overall significance level) of .05 would require (for Gibson et al) individual significance levels of .05/14, or about .0035. Cohen’s table doesn’t go down that far; for the lowest alpha in the table (.01), the sample sizes per group for a medium effect would be 76 (for ANOVA) and 95 (for a two-sample t-test).
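For comparison, the two-sample t-test figures can be reproduced the same way from the noncentral t distribution (again a sketch assuming scipy; Cohen’s “medium” standardized difference is d = 0.5):

```python
# Sketch: smallest per-group n for a two-sided two-sample t-test,
# via the noncentral t distribution. Cohen's "medium" d is 0.5.
from math import sqrt
from scipy.stats import t as t_dist, nct

def ttest_n_per_group(d, alpha=0.05, target_power=0.80):
    """Smallest n per group giving at least target_power."""
    for n in range(2, 5000):
        df = 2 * (n - 1)
        crit = t_dist.ppf(1 - alpha / 2, df)   # two-sided rejection cutoff
        nc = d * sqrt(n / 2)                   # noncentrality, equal groups
        power = 1 - nct.cdf(crit, df, nc)      # upper tail (lower is negligible)
        if power >= target_power:
            return n

print(ttest_n_per_group(0.5))                  # Cohen's table: 64 per group
print(ttest_n_per_group(0.5, alpha=0.05 / 14)) # Bonferroni-adjusted alpha
```

Running it with the Bonferroni-adjusted alpha of .05/14 shows how sharply the required per-group sample size grows once multiple testing is taken into account.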

The upshot: The two replications might be seriously underpowered to detect a medium (as defined by Cohen) effect. However (as will be discussed in subsequent posts), this does not necessarily mean they were underpowered for purposes of detecting a meaningful difference. Still, they used samples substantially larger than those in Shih et al, and thus had higher power than that study.

I’ll discuss power more in my next post, but would like to end this one with a note of appreciation for Cohen’s efforts. I have in the past (as have others2) seen his use of small/medium/large effect sizes only as an obstacle to good calculation of power and sample size (as I will discuss in coming posts). But I had never read his paper before. Now that I have, I see that what he achieved was tremendous progress over the previous practice of ignoring power in research in the behavioral sciences. I now see that his 1992 paper was a compromise, a compromise that was effective in promoting more widespread attention to power in his field.

But now that attention to power has increased in the field, it is time to go further and pay more attention to doing better power and sample size calculations. In other words, Cohen succeeded in changing TTWWADI from “ignore power” to “use Cohen’s 1992 paper and small-medium-large effect sizes.” The challenge now is to move on to better (more accurate) methods of calculating sample size that take into account more than Cohen’s ideas of S, M, or L effect sizes. Pointers in that direction will be the subject of the next posts.



1. Cohen, J. (1992), A Power Primer, Psychological Bulletin, Vol. 112, No. 1, pp. 155–159.

2. Examples:

  • Lenth, Russell (2001), Some practical guidelines for sample size determination, The American Statistician, Vol. 55, No. 3, pp. 187–193, available here
  • Russ Lenth’s Sample Size Page
  • Muller, K. E., and Benignus, V. A. (1992), “Increasing Scientific Power with Statistical Power,” Neurotoxicology and Teratology, 14, 211–219, which says of Cohen’s method (p. 7):

“The great attraction of the method, its lack of dependence on the application, may be considered to be its greatest weakness.”


Beyond the Buzz Part IV: Multiple Testing

In his June 10 Guardian article, Chris Chambers gave a link to an article discussing “questionable research practices” that are common in psychology research. One practice omitted from that article, but that I believe should have been included1, is the practice of performing more than one hypothesis test on the same data without taking into account how this affects the prevalence of Type I errors (falsely rejecting the null hypothesis).

One often-helpful way to look at hypothesis tests is:

If you perform a hypothesis test using a certain significance level (I’ll use 0.05 for illustration), and if you obtain a p-value less than that significance level, then there are three possibilities:

  1. The model assumptions for the hypothesis test are not satisfied in the context of your data.
  2. The null hypothesis is false.
  3. Your sample happens to be one of the 5% of samples satisfying the appropriate model conditions for which the hypothesis test gives you a Type I error.

This way of looking at hypothesis tests helps us see the problem of performing multiple hypothesis tests using the same data:

If you are performing two hypothesis tests using the same data, and if all model assumptions are satisfied, and if also both null hypotheses are true, there is in general no reason to believe that the samples giving a Type I error for one test will also give a Type I error for the other test.

Web simulations2 can help give an idea of the range of possible ways different combinations of tests can give Type I errors.
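In the same spirit as those simulations, here is a minimal sketch (standard library only) that treats each true-null test as rejecting with probability .05, independently across tests and samples:

```python
import random

random.seed(1)

def one_sample_results(n_tests=100, alpha=0.05):
    """One 'sample of data' subjected to n_tests independent true-null
    tests: each test rejects (a Type I error) with probability alpha."""
    return [random.random() < alpha for _ in range(n_tests)]

sample_a = one_sample_results()
sample_b = one_sample_results()
# Each sample trips roughly 5 of the 100 tests ...
print(sum(sample_a), sum(sample_b))
# ... but usually not the same ones: the overlap is much smaller.
overlap = sum(a and b for a, b in zip(sample_a, sample_b))
print(overlap)
```

Re-running this shows the key point: which tests give Type I errors varies from sample to sample, so the more tests you run on one sample, the better the chance that at least one of them rejects by accident.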

Because of the multiple testing problem, we need to look at more than the individual Type I error rate (in this case, alpha = .05) that is applied to each hypothesis test individually; we also need to consider the

Family-wise error rate (FWER): The probability that a randomly chosen sample (of the given size, satisfying the appropriate model assumptions) will give a Type I error for at least one of the hypothesis tests performed.

(The FWER is also called the joint Type I error rate, the overall Type I error rate, the joint significance level, the simultaneous Type I error rate, the experiment-wise error rate, etc.)

There is no perfect way of dealing with multiple testing, but there are some pretty good ways. The simplest is sometimes called the Bonferroni method: If you want a FWER of alpha, and are performing n tests, then use an individual Type I error rate of alpha/n for each test individually. For example, to ensure a FWER of .05 when performing 4 tests, use an individual alpha of .05/4 = .0125 for each test.
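A quick check of the arithmetic: for n independent true-null tests each run at level alpha, the FWER is 1 − (1 − alpha)^n, and Bonferroni’s alpha/n brings it back under the target (for dependent tests, the union bound still caps the corrected FWER at alpha):

```python
def fwer(alpha_individual, n_tests):
    """FWER for n_tests independent true-null tests, each run at
    level alpha_individual."""
    return 1 - (1 - alpha_individual) ** n_tests

# Four tests at .05 each: FWER is about .185, far above .05 ...
print(round(fwer(0.05, 4), 3))      # 0.185
# ... but at the Bonferroni-adjusted .0125 each, it stays below .05.
print(round(fwer(0.05 / 4, 4), 3))  # 0.049
```

So running just four tests at the conventional .05 level already more than triples the chance of at least one false rejection.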

How do the papers on stereotype susceptibility discussed in Part I handle multiple testing?

Two comments before discussing this:

1. The caveat of Part I still applies: the comments I make in this post and the ones following are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.

2. If more than one study was reported in a paper, I will discuss only the first.

Steele and Aronson (1995): Here is a list of the hypothesis tests mentioned in reporting results of their Study 1 (pp. 800-801):

  1. “Chi-squared analyses performed on participants’ response to the postexperimental question about the purpose of the study revealed only an effect of condition …, p < .001”
  2. “The ANCOVA on the number of items participants got correct, using their self-reported SAT scores as the covariate … revealed a significant condition main effect … p < .02 … and a significant race main effect … p < .03…. The race-by-condition interaction did not reach conventional significance (p < .19)”
  3. “Bonferroni contrasts with SAT as a covariate supported [the reasoning of the hypothesized effect] by showing that Black participants in the diagnostic condition performed significantly worse than Black participants in either the nondiagnostic condition … p < .01, or the challenge condition … p <.01, as well as significantly worse than White participants in the diagnostic condition … p < .01”
  4. They also performed another test of interaction, which “reached marginal significance, … p < .08”

My comments on this:

  • Item 1 suggests that they performed more tests on participants’ responses to the post-experimental questions than just the one test reported.
  • So altogether, at least 8 hypothesis tests were performed, using the same sample. Using the simple Bonferroni method discussed above, each test would need individual significance level < .05/8 = .00625.
  • I’m not sure what “Bonferroni contrasts” means, but I think it means that a Bonferroni procedure was used (possibly automatically by the software?) to adjust p-values to take into account the number of contrasts considered – in other words (assuming 3 contrasts), the adjusted p-value would be the ordinary p-value divided by 3.
  • So the only test listed that would be significant using the simple Bonferroni method would be the one reported in item 1.

The upshot: Steele and Aronson may have done a little bit of taking multiple inference into account (possibly only because the software did it automatically?), but did not really consider a FWER. I am not surprised – the game of telephone effect and TTWWADI had probably made disregard of multiple testing fairly standard by 1995.

Shih et al (1999): I found no mention of the problem of multiple testing. However, I counted what appeared to be 12 hypothesis tests performed. P-values were given for only 4; the others were listed accompanied by words that suggested that no significant difference was found. The p-values listed were: p < .05, p < .05, p = .19, p = .01. With 12 hypothesis tests, the simple Bonferroni method to give FWER .05 would require using an individual alpha of .05/12, or about .0042. None of the p-values listed reached this level of significance. It appears that disregard for the problem of multiple testing had become TTWWADI by 1999.

Gibson et al (2014): Again, no mention of the problem of multiple testing (TTWWADI, I would guess). I counted 14 hypothesis tests, with lowest p-value reported as p = .02. This occurred for two tests: 1) difference between all three groups on accuracy, when including only the 127 participants who were aware of the race and gender stereotypes; and 2) in particular, between female-primed subjects and Asian-primed subjects, also when restricted to the same subset. The simple Bonferroni method would require individual significance level .05/14, or about .0035. So again, nothing significant after taking multiple testing into account.

Moon et al (2014): P-values listed were .44, .43, .28, .28, .55, .57, .29, .31, .92, .76. I found no mention of multiple testing, but with p-values this high, there would be no need to adjust for it.

So I propose:

RECOMMENDATION #3 for improving research proposals (and thereby improving research quality):

  • Proposers should include in their proposals plans for how to take multiple testing into account in their methods of data collection, data analysis, and interpretation of results.
  • Reviewers of research proposals should check that proposers have included plans for accounting for multiple testing, and that these plans are appropriate for the aims and methods of the study.

Comment: As I will discuss in some of the following posts, multiple inference enters into research plans in more than just the way outlined in this recommendation.


1. Analogous studies of questionable practices in medical research have included the problem of multiple testing. For example, A. M. Strasak et al (The Use of Statistics in Medical Research, The American Statistician, February 1, 2007, 61(1): 47-55) report that, in an examination of 31 papers from the New England Journal of Medicine and 22 from Nature Medicine (all papers from 2004), 10 (32.3%) of those from NEJM and 6 (27.3%) from Nature Medicine were “Missing discussion of the problem of multiple significance testing if occurred.” These two journals are considered the top journals (according to impact factor) in clinical science and in research and experimental medicine, respectively.

2. See Jerry Dallal’s demo. This simulates the results of 100 independent hypothesis tests, each at 0.05 significance level. Click the “test/clear” button to see the results of one set of 100 tests (that is, for one sample of data). Click the button two more times (first to clear and then to do another simulation) to see the results of another set of 100 tests (i.e., for another sample of data). Notice as you continue to do this that i) which tests give Type I errors (i.e., are statistically significant at the 0.05 level) varies from sample to sample, and ii) which samples give Type I errors for a particular test varies from test to test. (To see the latter point, it may help to focus just on the first column, representing just 10 hypothesis tests.)

Also helpful in reinforcing the point:

For more discussion and further references on multiple testing, see here.