Though Many Social Psychologists May Be Unaware, Multiple Testing Often Leads to Multiple Spurious Conclusions

While browsing through the November 29, 2013 issue of Science a couple of days ago, I noticed the catchy title of the last report (McNulty et al., “Though They May Be Unaware, Newlyweds Implicitly Know Whether Their Marriage Will Be Satisfying,” Science 29 November 2013: 1119-1120). I suspected that such a title would get the article discussed in the popular press (as indeed it has been), and wondered whether, since it appeared in a top-ranked journal, its statistical analysis would be of high quality.

Alas, I was disappointed (but not surprised). In particular, thirteen hypothesis tests (all using the same data) were reported in the article. Eight were declared significant, apparently at an individual .05 significance level, since there was no mention of adjusted p-values, an overall significance level, or anything else that would suggest that the authors took multiple testing into account in reporting “statistical significance.” So I did a quick Bonferroni calculation (i.e., using .05/13 ≈ .0038 as the individual significance level, which ensures an overall, or familywise, significance level of at most .05), and found that only three of these eight tests were statistically significant at that conservative adjusted criterion. (I then tried Holm’s procedure, which is never more conservative than Bonferroni and is sometimes more liberal, but got the same result.) So, accounting for multiple testing, the only hypotheses that could be considered statistically significant at an overall .05 significance level are those that were reported as significant at the .001 level, namely:

  • “… spouses’ marital satisfaction declined significantly over the 4 years of the study”
  • “Spouses’ conscious attitudes … were positively associated with initial levels of marital satisfaction”
  • “… spouses’ perceptions of their marital problems at each assessment significantly negatively predicted changes in their satisfaction from that assessment to the next”

Among the tests that are not supported as being statistically significant at an overall .05 level are the ones crucial to the authors’ assertion that spouses’ automatic attitudes predicted changes in their marital satisfaction. (Actually, I’m being rather generous: the Supplemental Material contains many more significance tests, and counting them would make the adjusted criterion stricter still.)
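For readers who want to see the mechanics, here is a minimal Python sketch of the Bonferroni and Holm calculations described above. The function names and the thirteen p-values are my own inventions for illustration: the p-values are NOT the ones reported in the paper, and were chosen only so that the counts mirror the pattern I found (eight nominally significant, three surviving adjustment). The last calculation assumes independent tests, which tests run on the same data generally are not, so read it as a rough indication only.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure.

    Sort the p-values; compare the smallest with alpha / m, the next
    with alpha / (m - 1), and so on. Stop at the first failure and
    retain that hypothesis and all those with larger p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject


# Hypothetical p-values for 13 tests (illustrative only, not from the paper).
example_p = [0.0005, 0.0008, 0.0009, 0.006, 0.01, 0.02, 0.03, 0.04,
             0.08, 0.15, 0.30, 0.45, 0.70]

unadjusted = [p <= 0.05 for p in example_p]
print("nominally significant:", sum(unadjusted))
print("after Bonferroni:     ", sum(bonferroni(example_p)))
print("after Holm:           ", sum(holm(example_p)))

# Chance of at least one false positive among 13 true null hypotheses,
# each tested at .05, under the simplifying assumption of independence.
familywise_rate = 1 - (1 - 0.05) ** 13
print("familywise error rate: %.3f" % familywise_rate)
```

With these made-up inputs, both adjustments retain only the three smallest p-values, and the unadjusted familywise error rate comes out a little under one half.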

There are other questionable aspects to the paper in addition to the one pointed out above; some are mentioned in Andrew Gelman’s January 1 blog post.

Please note: I do not intend these comments as aimed primarily at the authors of the report. I believe that the most important conclusion to draw from them is that neglect (often out of ignorance) of the problems inherent in performing multiple frequentist hypothesis tests on a single data set, as well as of other common problems with statistical analyses, is so common and so pervasive that it can occur in one of the top-rated science journals. Science (and other top journals) could and should play an important role in improving scientific practice by providing quality guidelines for the use of statistics in analyzing data, including a requirement to take multiple testing into account when claiming statistical significance.