“False positives” in the Medical Literature

Andrew Gelman’s January 24 blog post, I don’t believe the paper “Empirical estimates …, has an interesting discussion of a paper by Jager and Leek that uses modeling to get a smaller estimate than Ioannidis (2005) of the proportion of “false positives” in the medical literature. Gelman and commenters critique the title as well as the methods.

For another discussion of the Ioannidis and Jager/Leek papers, see this Technology Review January 23 blog.

More re Simmons et al, Part III: Huh?

Discussant (8) in the Psychological Science editorial discussion mentioned in my January 23 post wrote,

“The [Simmons et al] paper seems to be based on the possibility of a direct link between the data and the truth of a theory. In my view, it is always the educated reader who needs to be persuaded using convincing methodology. Therefore, I am not interested in the autobiography of the researcher. That is, I do not care whether s/he has actually held the tested hypothesis before learning about the outcomes, I am not interested how many failed studies preceded the submitted paper and I do not want to know whether if the results would have been insignificant with a lower number of subjects.

Finally, I see no reason to report inconsistent findings that were assessed after the main DV …”

If a reader of papers employing frequentist hypothesis testing is indeed [well] educated, s/he should not find methodology convincing unless (among other things) the paper discusses which hypotheses were pre-planned and which were “data snooping;” the history of previous attempts to establish the result; and how multiple testing (including hypothesis testing done after testing the main dependent variable) was taken into account in establishing claims of statistical significance — in other words, exactly the things discussant (8) doesn’t care about/isn’t interested in/doesn’t want to know/sees no reason to report. (Take a look at xkcd, Significant, if you haven’t already seen it, to help drive the point home.)
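The arithmetic behind the xkcd point can be sketched in a few lines. This is a minimal illustration, assuming 20 independent tests of true null hypotheses at the 0.05 level (independence is an assumption made for simplicity; the function names are hypothetical):

```python
import random


def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


def simulate_any_significant(num_tests, alpha=0.05, num_sims=20000, seed=1):
    """Monte Carlo check: under a true null, each p-value is Uniform(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_sims):
        # One "study": num_tests independent null p-values; did any fall below alpha?
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / num_sims


print(round(familywise_error_rate(20), 3))  # exact value under independence
print(simulate_any_significant(20))         # simulated value, should be close
```

Under these assumptions the chance of at least one "significant" result is about 0.64, which is why a reader needs to know how many tests were run and how they were accounted for.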

Ironically, discussant (8) ends with

“Our decision should be less based on the cuteness of the findings and the headlines the[y] might cause in the popular press but more on the answers they provide concerning the underlying psychological process. Less ‘wow!’ and more ‘how?’ might be another guiding principle for the new editorship.”

I agree that editorial decisions should not be based on cuteness of findings nor on headlines they might cause. I agree that less “wow!” and more “how?” sounds like a good guiding principle — but the “how?” needs to include how the results were obtained, and whether or not the methodology is sound. The items that discussant (8) doesn’t care about/isn’t interested in/doesn’t want to know/sees no reason to report are indeed important to the “how?”.

More re Simmons et al, Part II: Not Far Enough

Sanjay Srivastava’s The Hardest Science, January 2, 2012 blog, An editorial board discusses fMRI analysis and “false-positive psychology,”  gives a link to a (summarized, with names redacted) account of an email discussion among the Psychological Science editorial board. The discussion concerned the suggestion that Simmons et al’s recommendations, as well as recommendations from an article concerning Functional Magnetic Resonance Imaging (fMRI) research, be adopted as policy of that journal. Most of the comments about the Simmons et al recommendations said that it would be premature to adopt them without further careful discussion, or that some of them were too rigid to take into account what was appropriate for different situations, both of which seem sensible to me. But some of the comments bring up points that warrant further discussion. Here is one (probably more in later posts):

Discussant (6) said,

“I really do not need 1000 more words of terribly tedious text. Can we just put all these things in one convenient table (See an impromptu example below)?”

Well, if something can be put in a table, that would indeed be better than having to search through paragraphs to find a particular item. So I looked at the example provided. Here’s a snippet:

“Group analysis:

  1. Model used: mixed-effects
  2. Statistical thresholding: …”

Uh-oh — The table has missed some important information: “Mixed effects” covers a whole class of models. The author needs to include more information: Which factors are fixed, which are random? Is there any nesting, and if so what is nested in what? And why are these choices appropriate for the data collected and the question being studied? A table would probably not be adequate for the last question, in particular. (Note: This is not an isolated problem — I have often seen “mixed models” given as the “method of analysis” in methods sections of papers, with no mention of the information in the questions above.)

Continuing with the snippet from discussant (6)’s table:

“2. Statistical thresholding:

a. Adjustment for multiple comparisons employed: Gaussian Random Field theory

b. Threshold: Z > 2.3, p < 0.001”

Well, it’s good that adjusting for multiple comparisons is addressed, but just stating the method and thresholds once again neglects the reasoning: Why was Gaussian Random Field theory chosen as the method? (E.g., why not permutation or bootstrap methods? See Nichols and Hayasaka (2003) for discussion of advantages and disadvantages of each method.) And why were the stated thresholds chosen?
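A quick check shows why the bare thresholds leave questions open. The sketch below assumes both numbers are read as one-sided statements about a single standard normal statistic; that reading is an assumption, since the table does not say, and the helper names are hypothetical:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def one_sided_p(z):
    """Upper-tail probability of a standard normal beyond z."""
    return 1 - std_normal.cdf(z)


def z_for_one_sided_p(p):
    """Z value whose upper-tail probability is p."""
    return std_normal.inv_cdf(1 - p)


print(round(one_sided_p(2.3), 4))          # about 0.0107
print(round(z_for_one_sided_p(0.001), 2))  # about 3.09
```

Read this way, Z > 2.3 and p < 0.001 are not the same cutoff. Presumably the two numbers refer to different things (for example, one threshold for forming clusters and another for declaring a cluster significant), but that is exactly the kind of detail the table omits.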

This omission of the details of the choice of methods, and of the reasoning behind those choices, is also an inadequacy in Simmons et al’s recommendations: They do not address the level of discussion of methods that is necessary to replicate a study, let alone to evaluate the appropriateness of the methods.

Preregistration of Studies and Mock Reports

Andrew Gelman’s 13 January 2013 blog post has some interesting comments on the problem of non-reproduced/non-reproducible/false results. In particular, he refers to the latest issue of Political Analysis, which has a special section, “Symposium on Research Registration,” that includes a commentary by Gelman on a couple of papers in the special section.

In discussing one paper, he expresses a concern that preregistration with pre-decided analysis method, although it has some advantages, may “encourage a sort of robotic data;” he points out things that, indeed, I often complain about, such as not plotting the data.

The other paper reports on a “mock report,” which Gelman recommends be done more often. In this strategy, before analyzing (or perhaps even collecting) real data, the researcher simulates “fake data,” then analyzes it to prepare a publicly released “mock report.” This serves as a sort of “trial run” that helps uncover problems in the planned analysis before performing it on the real data.
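As a rough illustration of the “trial run” idea, here is a minimal sketch, not the procedure from the paper Gelman discusses. It assumes a two-group design with normally distributed outcomes; the effect size, sample size, and function names are all hypothetical. Because the fake data are generated with a known effect, we can check whether the planned analysis behaves as advertised:

```python
import random
from statistics import NormalDist, fmean, stdev


def simulate_fake_data(n_per_group, true_effect, sd, rng):
    """Generate fake control and treatment outcomes with a known true effect."""
    control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
    return control, treated


def planned_analysis(control, treated, level=0.95):
    """The analysis we plan to run on real data: difference in means with a normal-approximation CI."""
    diff = fmean(treated) - fmean(control)
    se = (stdev(control) ** 2 / len(control) + stdev(treated) ** 2 / len(treated)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, (diff - z * se, diff + z * se)


def mock_report(n_per_group=50, true_effect=0.5, sd=1.0, num_runs=2000, seed=2013):
    """Fraction of fake data sets whose interval covers the effect we built in."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(num_runs):
        control, treated = simulate_fake_data(n_per_group, true_effect, sd, rng)
        _, (low, high) = planned_analysis(control, treated)
        covered += low <= true_effect <= high
    return covered / num_runs


print(mock_report())  # should be near the nominal 0.95 if the plan is sound
```

If the coverage were far from 0.95, that would signal a problem in the planned analysis (or in the code) before any real data had been touched.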

More re Simmons et al, Part I: Uncertainty

In my January 9 blog “A Mixed Bag,” I said that one positive aspect of Simmons et al’s 2011 paper is that it has generated a lot of discussion, and said “More on this in a later post.” That will probably expand to two or three posts, at least. This is the first.

Andrew Gelman’s 16 February, 2012 blog “False-positive psychology,” has a number of comments. In particular, Gelman said,

“My main comment on Simmons et al. is that I’m not so happy with the framing in terms of “false positives”; to me, the problem is not so much with null effects but with uncertainty and variation.”

He’s got a good point. Here’s my own elaboration on it: It often seems that social scientists (and to a lesser, but still frequent, extent physical scientists) look at a hypothesis test as something definitive. One thing I’ve tried to stress on my website is that this is far from the case. (See especially the page Expecting Too Much Certainty and the links from it.)

Simmons et al (and many other authors of research papers) discuss hypothesis tests, but do not even mention confidence intervals. I’ve sometimes heard the argument that, since standard errors are routinely given in papers, someone who wants a confidence interval can use the standard errors to construct it. This attitude neglects what I see as an important reason to include confidence intervals in the first place: they keep uncertainty upfront, so readers are less likely to neglect it. Just reporting results of a hypothesis test easily seduces readers into a false sense of certainty.
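For concreteness, here is a minimal sketch of the do-it-yourself construction that argument has in mind, assuming a normal approximation. The estimate and standard error are made-up numbers, and `normal_ci` is a hypothetical helper name:

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval: estimate plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


# Made-up example: an estimate of 0.30 with standard error 0.14
low, high = normal_ci(estimate=0.30, std_error=0.14)
print(round(low, 2), round(high, 2))
```

A result like this would be declared significant at the 0.05 level, yet the interval runs from roughly 0.03 to 0.57, which is a much less definitive picture than “p < 0.05” suggests.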

Moreover, constructing appropriate confidence intervals when using the same data set to consider more than one question  requires taking multiple inference (and its effect on power) into account – but Simmons et al appear disinclined to recommend accounting for multiple inference, arguing (p. 7) that it would introduce “additional ambiguity [that] may make things worse.” This sounds to me like avoidance of uncertainty or other relevant complexities.
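One simple (and conservative) way to take multiple inference into account is a Bonferroni adjustment of the confidence level. The sketch below is only an illustration under a normal approximation, not a method Simmons et al discuss; the numbers are made up and `bonferroni_ci` is a hypothetical helper name:

```python
from statistics import NormalDist


def bonferroni_ci(estimate, std_error, num_intervals, family_level=0.95):
    """Interval for one of num_intervals estimates, widened so that all intervals
    cover their targets simultaneously with probability at least family_level."""
    alpha_each = (1 - family_level) / num_intervals
    z = NormalDist().inv_cdf(1 - alpha_each / 2)
    return estimate - z * std_error, estimate + z * std_error


# Made-up example: estimate 0.30, standard error 0.14
print(bonferroni_ci(0.30, 0.14, num_intervals=1))  # the usual 95% interval
print(bonferroni_ci(0.30, 0.14, num_intervals=5))  # one of five intervals from the same data
```

With five questions asked of the same data, the interval for this made-up estimate widens enough to include zero. That is additional ambiguity, but it is ambiguity that is really there.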

Indeed, Simmons et al appear to be seeking simple “solutions” that ignore, minimize, or dismiss the inherent ambiguity and uncertainty in science, rather than being upfront about this uncertainty.

While I’m on the topic of uncertainty: One interesting recent discussion of the inherent uncertainty in science is Iain Johnston’s article “The chaos within: Exploring noise in cellular biology,” Significance 9 (4) 17 2012.  Johnston discusses the “essential randomness of cellular systems,” including some important causes and effects of this randomness and recent efforts to describe it.

An Illusory Illusion?

Quoidbach et al’s recent paper “The End of History Illusion,” in the January 3, 2013 issue of Science, has gotten a lot of publicity on NPR and the web. For a nice critique, see Do We Really Underestimate How Much We’ll Change? (Or: Absolute Value Is Not Linear!) at the blog Quomodocumque.

Added January 10: See the critique of the critique I added this morning. But I think there’s still some good discussion on the post.

A Mixed Bag

Simmons et al’s 2011 paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, Psychological Science, 22(11), 1359-1366 (available at http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf) has generated a lot of discussion and citations (about 150 in a Google Scholar search).

Here’s a summary of what’s in the paper:

  • A study with a preposterous conclusion.
  • Simulations showing how different researcher choices (adding a dependent variable;  adding more observations per cell; controlling for gender or interaction with treatment;  dropping or including one of three conditions; and combinations of these) can affect the family-wise Type I error rate.
  • Discussion of the effect on Type I error rate of adding additional observations.
  • Proposed requirements for authors and guidelines for reviewers intended to mitigate the problem of false positives.
  • Discussion of how following the proposed requirements would have altered the outcome of the example with a preposterous conclusion.
  • Discussion of criticisms of their suggestions (both “not going far enough” and “going too far”)
  • Discussion of what they call “nonsolutions”
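To give a feel for the kind of simulation involved, here is a minimal sketch of one such researcher choice: testing after 20 observations per cell and, if the result is not significant, adding 10 more per cell and testing again. This is not Simmons et al’s code. For simplicity it uses a z-test with known standard deviation 1 rather than a t-test, and the sample sizes and function names are assumptions for illustration:

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in means, treating sigma as known."""
    se = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(peek, n_start=20, n_extra=10, alpha=0.05,
                        num_sims=10000, seed=2011):
    """Share of simulated null studies declared significant.
    With peek=True, a non-significant first test is followed by more data and a second test."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_sims):
        # Both cells come from the same distribution, so any "effect" is a false positive
        a = [rng.gauss(0, 1) for _ in range(n_start)]
        b = [rng.gauss(0, 1) for _ in range(n_start)]
        significant = z_test_p_value(a, b) < alpha
        if peek and not significant:
            a += [rng.gauss(0, 1) for _ in range(n_extra)]
            b += [rng.gauss(0, 1) for _ in range(n_extra)]
            significant = z_test_p_value(a, b) < alpha
        false_positives += significant
    return false_positives / num_sims


print(false_positive_rate(peek=False))  # close to the nominal 0.05
print(false_positive_rate(peek=True))   # noticeably above 0.05
```

Even this single, innocent-looking extra look at the data pushes the false positive rate above the nominal level; combining several such choices makes it worse.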

In my view, the paper is a mixed bag.

Positive aspects I see in the paper:

  1. It has generated a lot of discussion. (More on this in a later post.)
  2. Most of the recommendations are right on target for good, ethical science.
  3. The “preposterous conclusion” example and the simulations are good for making important points.

Negative aspects I see:

  1. Most importantly, they consider “adjusting alpha levels” as a “non-solution.” Their reasoning seems to be that there is no single, good way to do this. However, there are ways that can at least give a better sense of “worst case” than ignoring the problem of multiple testing. The authors seem to be unaware of the recent literature on the subject, giving only a reference to a 1977 paper by Pocock discussing sequential testing in clinical trials. (See Multiple Inference for discussion and references on multiple testing.)
  2. Their second “requirement for authors,” that “Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” I agree that too-small sample size is a problem, but the authors give no reason for the seemingly arbitrary figure of 20 per cell beyond, “Samples smaller than 20 per cell are simply not powerful enough to detect most effects,” (for which they give no reference). I cannot think of a good justification for this seemingly arbitrary figure – indeed, different experimental designs may require different sample sizes per cell to achieve the same power. What is really needed is a thorough discussion of power – including how power calculations need to take experimental design and multiple testing into account and how “small, moderate and large effect sizes” are at best a sloppy approach to calculating power.
  3. Listing “requirements for authors” may mislead researchers into believing that if they follow the requirements, they are doing what they need to do to have good quality research.  Simmons et al do say, “This solution substantially mitigates the problem,” and that it requires “only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings.” Still, the non-recommendation about adjusting for multiple testing and the weakness of the recommendation regarding sample size, combined with the widespread ignorance about good practices regarding multiple inference and power and other aspects of research, are likely to give researchers false confidence that their methods and reporting are good when they are not, and thereby lead equally naïve readers to believe what they read.
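To illustrate the second point, here is a rough power calculation for the simplest case only: a two-sided comparison of two cell means, using a normal approximation (a t-test would give slightly lower power). The standardized differences plugged in are placeholders, not recommendations, and the function names are hypothetical:

```python
from statistics import NormalDist


def two_sample_power(effect_size, n_per_cell, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.
    effect_size is the true difference in means divided by the common standard deviation."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_cell / 2) ** 0.5
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


def n_per_cell_for_power(effect_size, target_power=0.80, alpha=0.05):
    """Smallest per-cell sample size reaching the target power (simple linear search)."""
    n = 2
    while two_sample_power(effect_size, n, alpha) < target_power:
        n += 1
    return n


for d in (0.3, 0.5, 0.8):
    print(d, round(two_sample_power(d, 20), 2), n_per_cell_for_power(d))
```

With 20 per cell and a true difference of half a standard deviation, power under this approximation is only about 0.35, and the per-cell sample size needed for 80% power changes drastically with the assumed difference. Nothing in the calculation singles out 20, and this is before design complications or multiple testing enter the picture.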

Instructor Resource Website

In an April 27 post, I reported that the best currently available text for our intro applied stats for math majors and minors seemed to be Stats: Data and Models, 3rd edition  by De Veaux, Velleman and Bock (Pearson, 2012). A couple of our instructors decided to try it out this semester, so I decided to make an Instructor Resource Website to accompany the text. It included the following items that may be of use to others teaching intro stats:

Notes on the Textbook: Some aspects of these are tied to our particular course (which has a probability prerequisite), but many could be useful to anyone using the text for the first time, or even using another text for intro stats, since a lot of what I’ve included is based on my experience teaching this and other statistics courses.

Supplementary Materials: Some of these are distinctive to this particular course, but some could be useful for teaching to other audiences as well — and some of the materials intended for math majors may be of interest to instructors who would like to understand the why’s of the subject a little more.

External Links: These are of two sorts: Links to online demos that are useful for in-class demonstrations, and links to articles that could be used for supplemental reading.

The site is at http://www.ma.utexas.edu/users/mks/M358KInstr/M358KInstructorMaterials.html

Another “not recommended” textbook

Michael J. Crawley’s Statistics: An Introduction using R (Wiley, 2005) is intended as “an introduction to the essentials of statistical analysis for students who have little or no background in mathematics or statistics” (p xi). Unfortunately, the book includes many all-too-common mistakes. Examples:

p. 4 “… a low p-value means the hypothesis is unlikely to be true,” a common misunderstanding of p-value.

p. 37 “The mean value y-bar is a parameter estimated from the data. ” No, y-bar is an estimate of the population mean (often called mu), which is a parameter. The estimate y-bar is calculated from the data, but the parameter is not (and cannot be) calculated from the data.

p. 75 “…the probability that the variances are the same is p < 0.002.” No, the p-value is the probability of obtaining a test statistic at least as extreme as the one calculated from the data, assuming that the variances are the same.

pp. 73 – 75 The author recommends using Fisher’s F-test to test for equal variances before using a two-sample t-test. He neglects to discuss robustness of either the F-test or the two-sample t-test. Unfortunately, the F-test is not robust under the circumstances where it is most important to have equal variances for the t-test, so is essentially useless for the purpose recommended.

p. 77 “Our null hypothesis is that the two sample means are the same, and we shall accept this unless the value of Student’s t is so large that it is unlikely that such a difference could have arisen by chance alone.” But there is no discussion of power, so accepting the null hypothesis is not justified.

p. 80 “The two sample means are significantly different.” Once more confusing the sample mean and the population mean.

p. 97 “… when two variables are so perfectly correlated that they are identical …” Being identical is stronger than being perfectly correlated — for example, the random variables X and X + 1 are perfectly correlated, but not identical.
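The last example is easy to check numerically. A minimal sketch with arbitrary made-up values (`pearson_r` is a hypothetical helper written out so the calculation is visible):

```python
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


x = [1.0, 2.0, 4.0, 7.0]
x_plus_1 = [value + 1 for value in x]

print(pearson_r(x, x_plus_1))  # correlation of X with X + 1
print(x == x_plus_1)           # are the two variables identical?
```

The correlation is 1, but the two lists are not equal: every value differs by 1.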

Textbooks for Introductory Statistics

For years I have recommended the textbooks authored or co-authored by David Moore as high quality. However, a couple of months ago, I started hearing complaints about Introduction to the Practice of Statistics, which my department has been using for its undergraduate Applied Statistics course for math majors. It seems that some topics have been omitted, many challenge problems dropped, and some of the writing is of poorer quality than in previous editions. (Note: Moore, although still listed as co-author, no longer is involved in revising the book.) So I volunteered to look for a more suitable text.

The best one I’ve found is Stats: Data and Models, 3rd edition, by De Veaux, Velleman and Bock (Pearson, 2012). It is in many ways in the tradition of Moore’s book, emphasizing the importance of model assumptions. In addition, it has more of the math, which is desirable for the course in question. The writing style is lively, and the organization seems well thought out. (Note: The same trio of authors also have two other books, Intro Stats, which omits much of the math and is appropriate for an intro course for students in non-STEM fields, and Stats: Modeling the World, with Bock as first author, intended for the AP statistics audience.)

One that I definitely do not recommend is Kokoska, Introductory Statistics: A Problem-Solving Approach (Freeman, 2011). The author is well-intended, but misguided. Well-intended: He has identified a few areas where students have difficulty with what he is trying to teach, and has tried to point those out and guide the student past the difficulties. Misguided: Unfortunately, what he is trying to teach might be called “pseudo-statistics” or perhaps “applying statistical theory to a fantasy world.” He misses a lot of the important points in using statistics in the real world. For example, in many exercises labeled “applied,” he says, “assume the underlying distribution is normal.” In fact, his approach is largely cookbook; there is nothing really aimed at making the student an informed consumer of statistics, and a lot that could lead to misuses of statistics. (Example: Section 10.5 consists of several pages on the F-test for equal variances and the related confidence interval for the ratio of the two variances. On p. 500, he says that the test is often used to compare two population variances to decide whether or not the equal variance t-test is appropriate to compare the corresponding means. But he does not mention that this F-test is very sensitive to violations of model assumptions, and in fact is especially likely to be useless in exactly the situation of small sample sizes where it would be most desirable to be able to use the equal variance t-test for comparing means.) This is definitely not a suitable textbook for any intro stat course.
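The sensitivity of the F-test to non-normality mentioned above can be seen in a small simulation. This is a rough sketch under assumptions of my choosing: samples of size 15, a Laplace (double exponential) distribution as the heavy-tailed alternative, and critical values calibrated by simulation under normality as a stand-in for F tables. In every simulated data set the two population variances really are equal, so each rejection is a false alarm:

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def standard_normal(rng):
    return rng.gauss(0.0, 1.0)


def laplace(rng):
    """Laplace (double exponential) draw: heavier tails than the normal."""
    return rng.expovariate(1.0) * rng.choice((-1, 1))


def variance_ratio(draw, n, rng):
    """Ratio of sample variances for two samples of size n from the same distribution."""
    first = [draw(rng) for _ in range(n)]
    second = [draw(rng) for _ in range(n)]
    return sample_variance(first) / sample_variance(second)


def f_test_false_alarm_rates(n=15, num_sims=10000, seed=500):
    rng = random.Random(seed)
    # Calibrate two-sided 5% critical values for the variance ratio under normality
    null_ratios = sorted(variance_ratio(standard_normal, n, rng) for _ in range(num_sims))
    low = null_ratios[int(0.025 * num_sims)]
    high = null_ratios[int(0.975 * num_sims)]

    def rate(draw):
        rejections = 0
        for _ in range(num_sims):
            ratio = variance_ratio(draw, n, rng)
            rejections += ratio < low or ratio > high
        return rejections / num_sims

    return {"normal": rate(standard_normal), "laplace": rate(laplace)}


print(f_test_false_alarm_rates())
```

With normal data the false alarm rate stays near the nominal 0.05, while with the heavy-tailed data it is far higher, even though the variances are equal. A test that behaves this way cannot tell the student whether the equal variance t-test is appropriate.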

A third text is interesting: Rossman and Chance’s Investigating Statistical Concepts, Applications, and Methods.  I have contacted several people who have used it, and my impression is: It is worth trying, if you have experience teaching in an investigative style, and have experience with permutation tests, and can use a classroom with computers for student use. Otherwise, expect a real challenge in using it.