Beyond the Buzz Part III: More on Model Assumptions

My last post discussed the importance of model assumptions in using hypothesis tests, and the potential difficulties in checking them. This post discusses what can (and should) be done to check model assumptions before one plunges into performing a hypothesis test, concentrating on One-Way ANOVA and using examples from the literature on stereotype susceptibility.

Before proceeding, I repeat the caveat from the first post in this series:

Important Caveat: I cannot emphasize enough that the comments I will be making in the posts in this series are not intended, and should not be construed, as singling out for criticism the authors of those two papers or of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.  

Recall from the preceding post that the model assumptions for one-way, fixed effects ANOVA can be stated as:

  1. The groups are independent of each other.
  2. The variances of Y on each group are all the same.
  3. Y is normal on each group.

Checks that can and should be done include [1] (a rough R sketch follows the list):

  1. Plotting [2] the data for each group to check for possible evidence of violations of model assumptions
  2. Checking how similar (or different) sample variances for individual groups are
  3. Being especially cautious if group sizes differ appreciably
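Here is one way such checks might look in R. This is a sketch of my own, using simulated placeholder data (the data frame dat below), not data from any of the papers discussed:

    # Simulated placeholder data: response y and grouping factor group
    dat <- data.frame(y = c(rnorm(20, 10, 2), rnorm(20, 11, 2), rnorm(15, 12, 2)),
                      group = factor(rep(c("A", "B", "C"), times = c(20, 20, 15))))
    stripchart(y ~ group, data = dat, vertical = TRUE, method = "jitter")  # dot plots by group
    boxplot(y ~ group, data = dat)                                         # or box plots
    tapply(dat$y, dat$group, sd)   # how similar are the group standard deviations?
    table(dat$group)               # do the group sizes differ appreciably?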

Unfortunately, none of the papers mentioned in Part I discussed these checks.

Also unfortunately, this is not surprising to me, since model checks are all too often not discussed in the literature (TTWWADI?), although best practice definitely requires such discussion.

However, all of these papers that used ANOVA did provide group sample standard deviations, so it was possible for the reader to check whether variances across groups were fairly constant (item 2 above).

One violation of model assumptions that requires particular attention is if Y is skewed. ANOVA compares means, but for skewed distributions, means are not good measures of what is “typical.” [3] A paper on stereotype threat that I looked at a few years ago reported some means and standard deviations that strongly suggest that the distributions of the response variable in some cases were skewed to the right. For example, one group had mean .04 and standard deviation .13. A normal distribution with this mean and standard deviation would have a substantial proportion of values less than zero, which could not happen with the response variable in these studies. A plot of values would have helped bring attention to this problem.
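As a one-line check of that claim, using the reported mean and standard deviation: a normal distribution with mean .04 and standard deviation .13 puts roughly 38% of its values below zero.

    pnorm(0, mean = 0.04, sd = 0.13)   # approximately 0.379: over a third of values below zero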

Another model assumption check can sometimes be used: Using information about the response variable to help decide whether it is (close to) normal.

Examples:

1. Standardized tests are often constructed to have normal distributions of scores

  • However, scores on such a test cannot be assumed to have a normal distribution on a subgroup.

2. If a random variable is binomial with parameters n and p, then, provided n is reasonably large and p is not extreme (so that both np and n(1 - p) are fairly large), the variable is approximately normal.

3. Variables that are quotients of random variables can be very messy (see, e.g., http://en.wikipedia.org/wiki/Ratio_distribution). They are often skewed, or have kurtosis (a measure often described in terms of the sharpness of the distribution’s “peak,” but better thought of as reflecting the heaviness of its tails) very different from that of a normal distribution. Both of these deviations from normality can affect the alpha-level of the ANOVA test [4]. The accuracy response variable in the stereotype susceptibility studies is a quotient of random variables (number correct over number attempted), and thus might have properties that make an ANOVA test not robust. (A rough simulation illustrating this possibility is sketched below.)
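To illustrate the possible problem, here is a rough simulation of my own of a variable of the form (number correct)/(number attempted). The range of numbers attempted and the per-question success probability are made-up values, not estimates from the studies:

    set.seed(1)
    attempted <- sample(2:12, 10000, replace = TRUE)          # hypothetical numbers attempted
    correct   <- rbinom(10000, size = attempted, prob = 0.6)  # hypothetical success probability
    accuracy  <- correct / attempted
    hist(accuracy, breaks = 30)       # visibly non-normal: discrete-valued, bounded, and lumpy
    c(mean = mean(accuracy), median = median(accuracy), sd = sd(accuracy))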

Again, I do not suggest that lack of attention to model assumptions and robustness is a problem just in the area of research on stereotype susceptibility, or even just in the subject of psychological research; I have seen it frequently in a variety of areas, including many cases in biology. I invite readers to select a few papers of their choice (that use frequentist statistics) and look at how well (or how poorly) they address the problems of model assumptions and robustness.

So I propose:

RECOMMENDATION #2 for improving research proposals (and thereby improving research quality):

1. Proposers should include in their proposals:

  • How the study design is planned to increase chances that the model assumptions of the proposed analysis methods will be satisfied.
  • What checks on model assumptions will be performed after collecting data.
  • Contingency plans in case model assumptions cannot be adequately met.

2. Reviewers of research proposals should check that each of the above points is addressed soundly.

Notes:

1. Two textbooks that I am familiar with that are strong on checks for model assumptions and discussion of robustness are:

DeVeaux, Velleman, and Bock (2012), Stats: Data and Models, Addison Wesley.

The book does not use the term “robustness,” but for each statistical procedure it includes a list of “conditions” (along with the model assumptions) that summarize the practical implications of robustness considerations. It gives such “conditions” for all the hypothesis tests and confidence interval procedures it covers.

Dean and Voss (1999), Design and Analysis of Experiments, Springer

This discusses types of ANOVA other than one-way, as well as some alternatives when model assumptions are not satisfied.

2. This might be done via dot plots, histograms, or box plots.

3. For example, real estate information by locality usually lists median prices, rather than mean prices, since the mean is pulled upward by higher-end houses and so gives a value higher than the typical price, which is better indicated by the median. Similarly, when returning graded exams to a class, I would give the median rather than the “average,” since the latter would be pulled down by the “tail” of a few low-performing students and so would not be typical of class performance overall.

There are alternative hypothesis tests that do compare medians. Also, in some cases, medians can be compared by first taking logs, then using ANOVA on the transformed variable. (If log Y is normal, then the mean of log Y will also be the median of log Y, which will be the log of the median of Y.)
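A quick numerical check of that parenthetical remark (my own illustration, with arbitrary parameter values):

    set.seed(2)
    y <- rlnorm(100000, meanlog = 1.5, sdlog = 0.8)   # log Y is normal by construction
    c(mean_logY = mean(log(y)), median_logY = median(log(y)), log_medianY = log(median(y)))
    # all three agree (up to simulation error), so ANOVA on log Y compares medians of Y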

4. See p. 316 of Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339

BEYOND THE BUZZ PART II: Model Assumptions and Robustness

This is the second in a series of posts discussing concerns that arose in reading two of the papers in the special issue of Social Psychology on Registered Replications. (The first was here; this one will be continued in the next post.)

MODEL ASSUMPTIONS

Every statistical hypothesis test has certain model assumptions. For example, the model assumptions for one-way, fixed effects ANOVA, with k groups and response variable Y, can be stated as follows [1]:

  1. The groups are independent of each other.
  2. The variances of Y on each group are all the same.
  3. Y is normal on each group.

For example, if Y is the score on a certain test, and the groups are groups of students, then the model assumptions would say:

  1. The groups of students are independent.
  2. The variance of the test scores is the same for each group of students.
  3. The distribution of test scores for each group is normal.

The model assumptions are what make the hypothesis test valid – if the model assumptions are not true, we don’t have any assurance that the logic behind the validity of the hypothesis test holds up. (For more detail in a simpler case, see pp. 10 – 26 of this.) In particular:

  • If the model assumptions are not true, then the actual type I error rate (“alpha level”) might be different (smaller or larger) than the intended type I error rate. For example, if the researcher sets alpha = .05 for rejection of the null hypothesis, the actual type I error rate might be smaller than that (e.g., .03), or larger (e.g., .07), just depending on the departures from model assumptions and other particulars of the test.
  • Similarly, if the model assumptions are not true, then power calculations (which are necessarily based on the assumption that the model assumptions are true) are unreliable – actual power could be smaller or larger than calculated. Even if the type of departure from the model assumptions is known, accurate power calculations could be practically impossible to carry out. (A small simulation illustrating the first point appears after this list.)
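As an illustration of the first bullet, here is a small simulation of my own: one-way ANOVA under a true null hypothesis, but with unequal group variances and unequal group sizes. The particular sizes, standard deviations, and number of simulated data sets are arbitrary choices.

    set.seed(123)
    sizes <- c(10, 10, 40)     # unequal group sizes
    sds   <- c(4, 4, 1)        # the large group has the smallest standard deviation
    reject <- replicate(5000, {
      y <- rnorm(sum(sizes), mean = 0, sd = rep(sds, times = sizes))  # null is true: all means 0
      g <- factor(rep(1:3, times = sizes))
      oneway.test(y ~ g, var.equal = TRUE)$p.value < 0.05   # classical one-way ANOVA F test
    })
    mean(reject)   # noticeably above the nominal 0.05 in this configuration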

ROBUSTNESS

However, hypothesis tests might still be fairly credible if the model assumptions are not too far from true. The technical terminology for this is that “the test is robust to some violations of the model assumptions.”

Unfortunately,

  • It is usually impossible to tell whether or not the model assumptions are true in any particular case.
  • There may be lots of ifs, ands, and buts involved in when a hypothesis test is robust to some model violations.

However,

  • There is a lot that is known about robustness. [2]
  • There are some fairly standard “checks” that can often help a researcher make an informed decision as to whether model assumptions are so far off that using the test would be like building a house of cards, or whether it would be reasonable to proceed with the hypothesis test. (More on this in my next post.)
  • There are in many cases alternative tests which have different model assumptions that might apply. [3]

Unfortunately,

  • Many textbooks (and websites and software guides) ignore model assumptions.
  • Some mention them but give “folklore” reasons to ignore them. [4]

In other words, the metaphor of the game of telephone, and TTWWADI, tend to foster lack of attention to model assumptions and robustness in using statistics.

However, there are some textbooks that do a good job of discussing model assumptions and robustness. (More on this in my next post.)

In the next post, I will discuss model assumptions in the context of the papers on stereotype susceptibility that I mentioned in the last post, and will propose some recommendations concerning model assumptions and research proposals. Meanwhile, here is an example of neglecting model assumptions that is also related to the special issue of Social Psychology on replications:

In his May 20 Guardian article Psychology’s “registration revolution”, Chris Chambers quotes psychologist Don Simon as saying that study preregistration “keeps us from convincing ourselves that an exploratory analysis was a planned one.” One commenter responded,

“Why do we even split this stuff. I mean, if I study something and do hypothesis testing and THEN something interesting comes up by a few clicks in SPSS/PSPP, shouldn’t we just integrate it? Why write another research report?”

This comment gives an example of how the “game of telephone” phenomenon has worked to drop model assumptions (as well as other concerns such as multiple testing, to be discussed in a later post) from consideration: SPSS (like other statistical software) can only perform calculations that the user tells it to perform. It has no way of checking whether those calculations are appropriate. In particular, the software just spits out the results of a hypothesis test it is told to do, regardless of whether or not the test is appropriate; it has no way of knowing whether or not the model assumptions fit the context. That is up to the user to figure out. So “something interesting” that “comes up by a few clicks in SPSS/PSPP” may be simply an artifact of the user’s choosing to do tests that are not appropriate in the context of the data being used.

Notes:

1. There are many ways of stating the model assumptions; I have chosen the form above to minimize use of notation. However, some statements of the model assumptions in the literature and (especially) on the web are incorrect. For example, the page http://en.wikipedia.org/wiki/One-way_analysis_of_variance (as of this writing) says, “Response variable residuals are normally distributed.” This would be correct if “residuals” were replaced by “errors”. The problem is that the residuals are calculated from the data (each residual is the difference between the value of Y and the estimated mean of Y for the subgroup), whereas the word “errors” in this context refers to the difference between the value of Y and the true mean of Y on the subgroup. The errors, with this definition, cannot be calculated from the data; they are unknown. (A small simulation at the end of this note illustrates the distinction.)

Also, assumption (1) above is stated in a somewhat fuzzy manner; technically, what it means is that the random variables Y1, Y2, …, Yk are independent, where Yi is the random variable Y restricted to group i. (In the example, Yi would be the test score for students in group i only.)
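For readers who like to see the errors/residuals distinction concretely, here is a small simulation of my own (the group means, sample sizes, and error standard deviation are arbitrary). Because we generate the data ourselves, we know the errors exactly and can compare them with the residuals a fitted model produces:

    set.seed(7)
    true_means <- c(10, 12, 15)                    # hypothetical true group means
    g <- factor(rep(1:3, each = 20))
    errors <- rnorm(60, mean = 0, sd = 2)          # the errors: normal, unknowable in practice
    y <- true_means[as.integer(g)] + errors
    res <- resid(aov(y ~ g))                       # residuals: y minus the *estimated* group means
    head(cbind(errors, residuals = res))           # similar, but not the same thing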

2. For fixed effects ANOVA, see, for example Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339.

3. See, for example, Wilcox, Rand R. (2005 and 2012), Introduction to Robust Estimation and Hypothesis Testing, Elsevier, and Huber and Ronchetti (2009) Robust Statistics, Wiley

4. For example, Wilcox (2005) (see note 3 above) comments (p. 9), “For many years, conventional wisdom held that standard analysis of variance (ANOVA) methods are robust, and this point of view continues to dominate applied research,” and explains how that misunderstanding appeared to have come about.

Beyond the Buzz on Replications Part I: Overview of Additional Issues, Choice of Measure, the Game of Telephone, and TTWWADI

OVERVIEW OF ADDITIONAL ISSUES

Understandably, the popular press can’t be expected to go into much detail in discussing the issues involved in quality and replication of scientific findings. Although the four popular press articles mentioned in my two preceding posts (here and here) are important in bringing public attention to the issues those articles raise, improving the quality of the research literature involves much more than the points raised in the popular press articles – and in particular, much more than having registered reports.

Important Caveat: I cannot emphasize enough that the comments I will be making in this post and the ones following are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.  

My intent in this post and the following ones is to point out, in the context of a specific area addressed by two articles in the special issue of Social Psychology, why I believe this is necessary, and some specific points that need to be addressed in the reviewing process – and probably also in guidelines for submission of proposals.

Here is how I became convinced that registered reports are not enough:

After hearing the NPR story about the special issue of the journal Social Psychology, I decided to look at some of the papers in the special issue. The NPR article mentioned that two of the papers were about the topic of stereotype threat.  Since this is a topic with which I have some familiarity*, it made sense to look at those two.

Initial reading of these two papers showed some common problems in using statistics (more detail on these as I continue this sequence of posts). So I decided to look up the replication proposal [https://osf.io/jgh3c/]. This states in part, “The replication proposed here will be as close to the original study as possible. The same measures and procedures will be used.”

This suggested that perhaps some of my concerns would apply to the original study (Shih, M., Pittinsky, T.L., & Ambady, N. (1999). Stereotype susceptibility: Identity salience and shifts in quantitative performance. Psychological Science, 10(1), 80-83). So I looked this up – and indeed, it did prompt most of the same concerns.

In this post I will discuss just one of my concerns about the papers: the choice of one of the dependent variables. I will discuss other concerns in subsequent posts. But I have no reason to believe that these concerns apply only to research on stereotype threat; indeed, I have found these problems in a wide variety of research studies. I invite readers of this series of posts to keep an eye out for these less-than-best practices in other papers in the special issue of Social Psychology, and in other published research or research proposals.

CHOICE OF MEASURE

In the two replications on stereotype threat that I looked at (and in the original study that was being replicated), two dependent variables were considered. The first was number of correct answers on a certain math exam. That made sense; it is a commonly used measure of performance on an exam. The second was number of correct answers divided by number of questions attempted, called “accuracy” in the original Shih et al paper and in the replications. This didn’t (and still doesn’t) make sense to me as a choice of measure of exam performance.

For example, if subject A attempts eight questions and answers six of them correctly, her accuracy score is 0.75. If subject B attempts just two questions and answers both of them correctly, her accuracy score is 1. But I would say that subject A (who gets the lower score) has performed better on the exam than subject B (who gets the higher score). So “accuracy,” as defined in the studies, does not seem to be a good measure of performance on the task.

Shih et al (p. 81) do give the following rationale: “Accuracy, however, is a more meaningful dependent variable than the number of questions answered correctly because it takes into account not only the number of questions answered correctly but also the number of questions attempted.”  I can see that it does take into account number of questions attempted, but as the example above shows, it does so in a way that doesn’t agree with usual notions of what is good performance. It would make more sense to me to take off some fraction of a point for every incorrect answer (while counting 0 for questions not attempted). This is the method that has been used by the SAT and is sometimes used by teachers.

So I looked further to see if I could find a more convincing rationale for the definition of accuracy. When Shih et al defined accuracy (also on p. 81), they gave a reference to Steele and Aronson (1995), Stereotype threat and the intellectual test performance of African-Americans, Journal of Personality and Social Psychology 69, 797-811. So I looked up that paper. In it, Steele and Aronson don’t give any justification when they first define this “accuracy index” measure (first column, p. 800), but in the next column, in footnote 1 (to the report of the results of test performance), they say,

“Because we did not warn participants to avoid guessing in these experiments, we do not report the performance results in terms of the index used by Educational Testing Service, which includes a correction for guessing. This correction involves subtracting from the number correct, the number wrong adjusted for the number of response options for each wrong item and dividing this by the number of items on the test. Because 27 of our 30 items had the same number of response options (5), this correction amounts to adjusting the number correct almost invariably by the same number. All analyses are the same regardless of the index used.”

I find this footnote a little hard to parse. Here is my attempt to understand it:

  • First sentence: Gives a rationale for not adopting the ETS scoring method.
  • Second sentence: Explains the ETS scoring method
  • Third sentence: I’m not sure of the relevance here.
  • Fourth sentence: Seems to be saying that they have done the analysis using the ETS method and their “accuracy index” method (and maybe by a third method that they’re alluding to in the third sentence?), and got the same results (i.e., same effects significant, same group performing higher) no matter which index they used.

But I don’t think I can read into the footnote any assertion that their “accuracy index” is mathematically equivalent to the ETS method. My reasoning: Taking the same example as above (subject A attempts eight questions and answers six of them correctly, yielding accuracy score 0.75; subject B attempts just two questions and answers both of them correctly, yielding the higher accuracy score 1), if the questions all have the same number of answers listed, and if r is the fraction subtracted for each wrong answer and N is the total number of questions, then subject A gets ETS score (6 - 2r)/N, and subject B gets ETS score 2/N. So the only way subject B could get a higher ETS score than subject A would be if 2 > 6 - 2r, which could only happen if r > 2 – but surely the “fraction” in the ETS method would be less than 1. (The short check below spells out this comparison.)
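For concreteness, here is a small R check of that arithmetic (my own sketch; the total number of questions N and the penalty fractions r are illustrative values, not those used by ETS):

    ets_score <- function(correct, attempted, r, N) (correct - r * (attempted - correct)) / N
    acc       <- function(correct, attempted) correct / attempted
    N <- 12   # hypothetical total number of questions on the test
    for (r in c(0.2, 0.25, 0.5)) {
      cat("r =", r,
          "| A: accuracy", acc(6, 8), " ETS-style", round(ets_score(6, 8, r, N), 3),
          "| B: accuracy", acc(2, 2), " ETS-style", round(ets_score(2, 2, r, N), 3), "\n")
    }
    # For any r < 2, subject A's ETS-style score exceeds subject B's,
    # even though B's "accuracy" (1.0) exceeds A's (0.75).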

So it looks like what happened is that, somewhere between Steele and Aronson’s paper and Shih et al’s, the (somewhat cryptic) footnote that Steele and Aronson gave got lost, and “accuracy” became a standard dependent variable in studies of stereotype threat, with no further thought as to whether it is appropriate.

This brings me to: THE GAME OF TELEPHONE

You may have played the game of telephone as a kid: Everyone sits in a circle. One person whispers something to the person next to them; that person whispers it to the next person, and so on around the circle. The last person says out loud what they heard. Almost always, it is quite different from what the first person said. It’s fun as a party game, but when the analogous effect happens with information about use of statistics and research practices, it’s more serious. And the analogous effect does happen with information about statistics. One person misses or misunderstands one detail; then that slightly altered understanding gets passed on to another (e.g., teacher to student, or colleague to colleague), and soon large misunderstandings arise. In particular, the game of telephone serves as an apt metaphor for what seems to have happened somewhere along the line between Steele and Aronson’s paper and Shih et al’s.

This then brings me to: TTWWADI

A number of years ago I worked with a group of high school math teachers who were committed to improving secondary math teaching. They had a sign that consisted of the letters TTWWADI with a slash through it. They explained that TTWWADI stood for “That’s the way we’ve always done it”; their point was that that is not a good reason for doing something. It sounds like the use of “accuracy” in stereotype threat studies at some point became a case of TTWWADI. If anyone can give me a good rationale for using it, I’m willing to listen.

This brings me to:

RECOMMENDATION #1 for improving research proposals (and thereby improving research quality):

  • Proposers should give a good reason (not just TTWWADI) for their choices of measures.
  • Reviewers of research proposals should check that the reasons given are sound (not just TTWWADI)

(For more examples on how choice of measure can be problematical, see http://www.ma.utexas.edu/users/mks/statmistakes/Outcomevariables.html)

As mentioned above, reading the two papers on stereotype threat in the special issue raised several concerns about common mistakes in using statistics. I will discuss the others in later posts.

* I am both a woman and a mathematician; I have served on a couple of Math Ed Ph.D. committees where stereotype threat was considered; and I have previously written an article for the Association for Women in Mathematics Newsletter (vol. 41, No.5, Sept-Oct 2011), pp. 10 – 13, cautioning about questionable use of statistics in studies involving math and gender, including stereotype threat.

Replications Buzz Part II

Two more popular press articles on replications have recently appeared:

1. Biological psychologist Pete Etchells had a May 28 Guardian article discussing some of the (sometimes vitriolic) debate over the movement to promote replicability, including the Social Psychology special issue. Indeed, I had found some of the commentary I had seen disturbing in its seeming exaggeration. Etchells rightly emphasizes,

“”failure to replicate” does not imply that the original study was incorrect, poorly conducted, or involved fraud. Likewise, it doesn’t call into question the integrity of any scientists that were involved. It does not, and should not, impact on anyone’s reputation. It simply means that the results of the replication did not match the results from the original study. This is not a bad thing; it’s a fundamental part of the scientific process.

However, if anyone starts to question a researcher’s integrity because of a failure to replicate his or her work, that person should be educated on what the whole point of the process is. It’s not about individual reputations; it’s not even about individuals. It’s about trying to understand the reliability and generalisability of effects that we think exist in the research literature. If some of them don’t actually exist, or at least only occur in certain specific experimental contexts, that’s really useful information to know. It doesn’t make you a bad scientist.”

Etchells includes links to some other commentaries on the discussion.

2. Chris Chambers followed up on both his earlier and Etchells’s Guardian articles with another on June 10: Physics envy: Do ‘hard’ sciences hold the solution to the replication crisis in psychology?

He includes a link to a paper by Leslie K. John et al surveying the prevalence of “questionable research practices” by research psychologists.

He reiterates some of the points mentioned by Etchells regarding the sometimes-contentious debate about replications, and brings up the relevant problem of the sketchiness of methods sections in many publications in psychology.  I can vouch from my own experience that methods sections in biology are often sketchy, particularly when it comes to statistical methods – but my less extensive experience reading papers in psychology indicates that the situation is even worse in that field. (More on this in later posts.)

Chambers also gives perspectives of a few scientists in other fields. He concludes with:

            “Psychology clearly has some growing up to do. Critics may argue that it isn’t fair to judge psychology by the standards of physics, arguably the Olympic athlete of the sciences. On the other hand, perhaps this is precisely the goal we should set for ourselves. In searching for solutions, psychology cannot afford to be too inward looking, imagining itself as a unique and beautiful snowflake tackling concerns about reproducibility for the first time.

Above all, the way psychology responds to the replication crisis is paramount. Other sciences are watching us, as are the public. The last month has seen those who sought to replicate prior work – or bring in transparency reforms – subjected to a barrage of attacks from senior psychologists. They have been called “replication Nazis”, “second stringers”, “mafia”, and “fascists”, to name but a few. The fact that those at the top of our field feel comfortable launching such attacks highlights a pertinent irony. Despite all our claims to understanding human behaviour, psychologists stand to learn the psychology of actually doing science from our older cousins – physical sciences that haven’t studied psychology for a day. We would do well to listen.”

I will follow up in some later posts on more technical questions than were covered in these popular press articles.

Replications Buzz Part I

There’s been some buzz in the popular press the past month about the special issue of the journal Social Psychology devoted to registered replications. This is the first of at least two posts on some such articles I have seen.

1. On May 19, NPR had a story that focused on how the current system of scientific publishing “rewards for publication over accuracy”:

 “if a study confirms an older result, the journals tend to say, well, we knew that already, it’s not a novel finding and they’re less inclined to publish it. Now, if the replication contradicts an earlier finding, where the journal sends out the study to the peer reviewers, some of the peer reviewers might have been the researchers who conducted the original study and they can now find ways to shoot down the study and reject the study and say we shouldn’t be publishing it anyway.

And so what this does is it creates a disincentive for researchers to conduct replications at all.”

2. The next day The Guardian published an article by neuroscientist Chris Chambers, emphasizing that

“Without study registration, it is easy for scientists to (even unconsciously) short-circuit the scientific method by cherry-picking “good results” out of complex data and then presenting them as though they were predicted from the beginning, ”

(AKA “HARKing”: Hypothesizing After the Results are Known), and also making the point (similar to the one in the NPR story) that

“Worst of all, this system reinforces the dogma that the quality of science is best gauged not from the importance of the scientific question or robustness of the methodology, but from the results.”

Chambers includes a brief discussion of some of the debate about whether or not preregistration is a good thing.

Some of the comments on the article make the point that exploratory research is needed as well as research undertaken to test a pre-formed hypothesis or theory. Chambers responds to that with

“I agree with you – a lot of important discoveries stem from pure exploration with no a priori hypothesis. And I should emphasise that the argument for pre-registration is absolutely not an argument against that kind of exploration – it’s simply a call to make the difference clear, because under the current system exploratory research is often shoehorned into an ill-fitting hypothesis-driven framework. All pre-registration does is make the distinction between hypothesis-testing and exploration completely transparent.”

Indeed, an easy web search produced a recent paper by Chambers and co-authors (Instead of “playing the game” it is time to change the rules:  Registered Reports at AIMS Neuroscience and beyond) that has a nice question-and-answer format addressing several concerns that have been raised about registered reports.

David Draper on “Bayesian Model Specification: Toward a Theory of Applied Statistics”

David Draper’s talk, “Bayesian Model Specification: Toward a Theory of Applied Statistics,” at the Harrington Fellow Symposium this past Friday was very good. His (extensive) slides are available online at http://www.ams.ucsc.edu/~draper/draper-austin-2014-final-version.pdf. A brief outline of some major points:

I. Uncertainty in the model specification process is a major problem: In addition to the “first order” uncertainty about parameters, there is also “second order” uncertainty — that is, uncertainty about how to specify your uncertainty about these parameters; this latter can be called “model uncertainty”. It is a problem because ignoring it typically leads to practically significant understatement of your total uncertainty about parameter estimates, resulting in confidence bands that are too small and decisions that are insufficiently hedged against uncertainty.

II. Optimal Bayesian model specification involves optimal prior distribution specification, optimal sampling distribution specification, and optimal action space and utility function specification. (Draper gives extensive discussion of each of these points).

III. Suggestions on what to do when optimal model selection isn’t possible. These include

A. Cross-validation, or better yet, “calibrated cross-validation” (CCV). CCV involves partitioning the data into three sets: M for modeling, V for validation, and C for calibration. M is used to explore plausible models, and V to test them, iterating the explore/test process as needed. Then fit the best model (or use Bayesian model averaging) using M ∪ V, reporting both inferences from this fit and the quality of predictive calibration of this model in C.
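Here is a rough sketch of the data-splitting step as I understand it from Draper’s outline; the 50/25/25 split proportions, the mtcars data, and the particular model are my own arbitrary choices, not Draper’s:

    set.seed(42)
    n <- nrow(mtcars)
    idx <- sample(rep(c("M", "V", "C"), times = round(n * c(0.50, 0.25, 0.25))))
    d_M <- mtcars[idx == "M", ]   # explore plausible models here
    d_V <- mtcars[idx == "V", ]   # test candidate models here (iterating M/V as needed)
    d_C <- mtcars[idx == "C", ]   # held back to assess predictive calibration of the final fit
    fit  <- lm(mpg ~ wt + hp, data = rbind(d_M, d_V))          # refit the chosen model on M and V combined
    pred <- predict(fit, newdata = d_C, interval = "prediction")
    mean(d_C$mpg >= pred[, "lwr"] & d_C$mpg <= pred[, "upr"])  # empirical coverage of 95% intervals in C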

B. Consider modeling as a decision problem; use either Bayes factors or log-scores, choosing the method whose strengths outweigh those of the other method for the particular problem at hand. (Draper gives extensive discussion on these strengths.)

C. Avoid treating a problem as inferential when it’s really decision-theoretic.

Teaching for Reproducibility

One impediment to reproducibility of statistical analyses is that the details of what is done (e.g., what is tried, then what else is tried when that does not seem satisfactory, etc.) are lost in between performing the analysis and writing up the results. In recent years, tools for facilitating the process of integrating statistical analysis and writing up a report have been developed. Sweave was one of the first, developed for use with R. A more recent tool, R Markdown (also, as the name suggests, for use with R), has simpler syntax than Sweave. Recently, a group of mathematics and statistics professors tried using R Markdown in introductory statistics classes. See a short summary of their (positive) experience at Phys.org (http://phys.org/news/2014-02-scientific-young.html), and the complete report (Baumer et al, R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics, Technology Innovations in Statistics Education, 8(1)) at http://escholarship.org/uc/item/90b2f5xh.
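For readers who haven’t seen it, here is roughly what a minimal R Markdown file looks like (a toy example of my own, not taken from the cited report); the code chunk and its output are woven into the rendered report, so the analysis and the write-up stay together:

    ---
    title: "A tiny reproducible analysis"
    output: html_document
    ---

    We check group standard deviations before running one-way ANOVA.

    ```{r anova-checks}
    tapply(iris$Sepal.Length, iris$Species, sd)
    summary(aov(Sepal.Length ~ Species, data = iris))
    ```

One way to render the saved file from R is rmarkdown::render("myreport.Rmd") (assuming the rmarkdown package is installed; the file name here is just a placeholder).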

Bayesian and Frequentist Regression Methods

This morning I finally picked up Jon Wakefield’s Bayesian and Frequentist Regression Methods (Springer, 2013). From a brief glance (which is all it will get for now, since I’ve already renewed it once so had better return it by its due date this week), it looks very good. As the title suggests, Wakefield includes both Bayesian and Frequentist methods. The author’s perspective as stated on p. 23:

“Each of the frequentist and Bayesian approaches have their merits and can often be used in tandem, an approach we follow and advocate throughout this book. If substantive conclusions differ between different approaches, then discovering the reasons for the discrepancies can be informative as it may reveal that a particular analysis is leaning on inappropriate assumptions or that relevant information is being ignored by one of the approaches. Those situations in which one of the approaches is more or less suitable will also be distinguished throughout this book, with a short summary being given in the next section.”

I’ve scanned Section 5 ( Hypothesis Testing and Variable Selection), since these are topics where regression users often go astray. It looks good: Wakefield gives critiques of the Fisherian approach, the Neyman-Pearson approach, and the Bayes Factor approach; discusses Family-Wise Error Rate, False Discovery Rate, and using Bayes factors (with a decision theory approach) for dealing with multiple testing; and gives a summary (with critiques) of various approaches to variable selection and model building.

You can find information about the book (table of contents, editorial reviews, a brief list of errata, R code for tables and figures) at Wakefield’s website for the book.

Added March 29, 2015: See also Andrew Gelman’s (and commenters’) discussion of Wakefield’s book and regression in general.

Handbook of Regression Analysis

Chatterjee and Simonoff’s  Handbook of Regression Analysis (Wiley, 2013) appeared a while ago on the new book shelf. The preface starts (p. xi),

“This book is designed to be a practical guide to regression modeling. There is little theory here, and methodology appears in the service of the ultimate goal of analyzing real data, using appropriate regression tools. As such, the target audience of the book includes anyone who is faced with regression data … and whose goal is to learn as much as possible from that data. …

The coverage, while at an applied level, does not shy away from sophisticated concepts. …

This, however, is not a cookbook that presents a mechanical approach to doing regression analysis …”

I have only scanned the first two chapters (Multiple Linear Regression and Model Building), but those seem better than average at addressing the stated goals. For example, Section 1.3.1 points out some common misinterpretations/overinterpretations of regression coefficients; 1.3.4 distinguishes carefully between prediction and confidence intervals.

Strong points in Chapter 2 include:

  • Pointing out the dual problems of overfitting and underfitting, and relating this to sensible interpretation of the principle of parsimony.
  • How t-tests for coefficients can “give misleading indications of the importance of a predictor” (p. 24)
  • A pretty good yet succinct discussion of the problem of collinearity
  • A discussion of the difference between practical and statistical significance in the context of regression coefficients
  • The dangers of “data dredging”
  • Advocating looking at several measures of model fit, not just one, in model selection.
  • Discussing a weakness of Mallows’ Cp statistic.
  • Cautioning against inference after choosing a model, since doing so can be misleading because it ignores model selection uncertainty.
  • A good yet brief discussion of using holdout/validation samples for model checking.

My main complaint about these sections is the lack of attention to data quality, including missing data and possible bias in data collection.

Though Many Social Psychologists May Be Unaware, Multiple Testing Often Leads to Multiple Spurious Conclusions

While browsing through the November 29, 2013 issue of Science a couple of days ago, I noticed the catchy title of the last report (McNulty et al, “Though They May Be Unaware, Newlyweds Implicitly Know Whether Their Marriage Will Be Satisfying,” Science 29 November 2013: 1119-1120). I suspected that the catchy title would mean the article would be discussed in the popular press (as indeed it has been), and wondered if, since it appeared in a top-ranked journal, it would be of high quality in its statistical analysis.

Alas, I was disappointed (but not surprised). In particular, thirteen hypothesis tests (all using the same data) were reported in the article. Eight were declared significant — apparently at an individual .05 significance rate, since there was no mention of adjusted p-values or overall significance rate or anything else that would suggest that the authors took multiple testing into account in reporting “statistical significance.” So I did a quick Bonferroni calculation (i.e., using .05/13 as the individual significance level to ensure an overall significance rate of 0.05), and found that only three of these eight tests were statistically significant at that conservative adjusted criterion. (I then tried the less conservative Holm step-down procedure, but with the same result; a short R sketch of both adjustments appears below.) So, accounting for multiple testing, the only hypotheses that could be considered statistically significant at an overall .05 significance rate are those that were reported as significant at the 0.001 level, namely:

  • “… spouses’ marital satisfaction declined significantly over the 4 years of the study”
  • “Spouses’ conscious attitudes … were positively associated with initial levels of marital satisfaction”
  • “… spouses’ perceptions of their marital problems at each assessment significantly negatively predicted changes in their satisfaction from that assessment to the next”

Among the tests that are not supported as being statistically significant at an overall .05 level are the ones crucial to the authors’ assertions that automatic attitudes predicted changes in their marital satisfaction. (Actually, I’m being rather generous: The Supplemental Material contains many more significance tests.)
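Here is a sketch of the adjustment described above, using p.adjust in base R. The p-values below are hypothetical stand-ins chosen only to match the pattern reported in the paper (13 tests, 8 with p < .05, 3 with p < .001); they are not the actual values from McNulty et al.

    p_raw <- c(0.0005, 0.0008, 0.0009, 0.011, 0.018, 0.024, 0.031, 0.047,
               0.08, 0.15, 0.22, 0.41, 0.63)           # 13 hypothetical p-values
    sum(p_raw < 0.05)                                   # 8 "significant" at the unadjusted .05 level
    sum(p.adjust(p_raw, method = "bonferroni") < 0.05)  # only 3 survive Bonferroni
    sum(p.adjust(p_raw, method = "holm") < 0.05)        # Holm gives the same count here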

 

There are other questionable aspects to the paper, in addition to the one pointed out above; some are mentioned in Andrew Gelman’s January 1 blog.

 

Please note: I do not intend these comments as aimed primarily at the authors of the report. I believe that the most important conclusion from these comments is that neglect (often out of ignorance) of the problems inherent in performing multiple frequentist hypothesis tests on a single data set (as well as other common problems with statistical analyses) is so common and so pervasive that it can occur in one of the top rated science journals. Science (and other top journals) could and should play an important role in improving scientific practice by providing quality guidelines (including taking multiple testing into account when claiming statistical significance) for use of statistics in analyzing data.