OVERVIEW OF ADDITIONAL ISSUES
Understandably, the popular press can’t be expected to go into much detail in discussing the issues involved in quality and replication of scientific findings. Although the four popular press articles mentioned in my two preceding posts (here and here) are important in bringing public attention to the issues those articles raise, improving the quality of the research literature involves much more than the points raised in the popular press articles – and in particular, much more than having registered reports.
Important Caveat: I cannot emphasize enough that the comments I will be making in this post and the ones following are not intended, and should not be construed, as singling out for criticism the authors of any papers referred to, nor their particular area of research. Indeed, my central point is that registered reports are not, by themselves, enough to substantially improve the quality of published research. In particular, the proposals for registered reports need to be reviewed carefully with the aim of promoting best research practices.
My intent in this post and the following ones is to point out, in the context of a specific area addressed by two articles in the special issue of Social Psychology, why I believe this is necessary, and some specific points that need to be addressed in the reviewing process – and probably also in guidelines for submission of proposals.
Here is how I became convinced that registered reports are not enough:
After hearing the NPR story about the special issue of the journal Social Psychology, I decided to look at some of the papers in the special issue. The NPR article mentioned that two of the papers were about the topic of stereotype threat. Since this is a topic with which I have some familiarity*, it made sense to look at those two.
Initial reading of these two papers showed some common problems in using statistics (more detail on these as I continue this sequence of posts). So I decided to look up the replication proposal [https://osf.io/jgh3c/]. This states in part, “The replication proposed here will be as close to the original study as possible. The same measures and procedures will be used.”
This suggested that perhaps some of my concerns would apply to the original study (Shih, M., Pittinsky, T.L., & Ambady, N. (1999). Stereotype susceptibility: Identity salience and shifts in quantitative performance. Psychological Science, 10(1), 80-83). So I looked this up – and indeed, it did prompt most of the same concerns.
In this post I will discuss just one of my concerns about the papers: the choice of one of the dependent variables. I will discuss other concerns in subsequent posts. But I have no reason to believe that these concerns apply only to research on stereotype threat; indeed, I have found these problems in a wide variety of research studies. I invite readers of this series of posts to keep an eye out for these less-than-best practices in other papers in the special issue of Social Psychology, and in other published research or research proposals.
CHOICE OF MEASURE
In the two replications on stereotype threat that I looked at (and in the original study that was being replicated), two dependent variables were considered. The first was number of correct answers on a certain math exam. That made sense; it is a commonly used measure of performance on an exam. The second was number of correct answers divided by number of questions attempted, called “accuracy” in the original Shih et al paper and in the replications. This didn’t (and still doesn’t) make sense to me as a choice of measure of exam performance.
For example, if subject A attempts eight questions and answers six of them correctly, her accuracy score is 0.75. If subject B attempts just two questions and answers both of them correctly, her accuracy score is 1. But I would say that subject A (who gets the lower score) has performed better on the exam than subject B (who gets the higher score). So “accuracy,” as defined in the studies, does not seem to be a good measure of performance on the task.
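The comparison above can be sketched in a few lines of code (the function name `accuracy` is mine; the definition is the one used in the studies: correct answers divided by questions attempted):

```python
def accuracy(correct, attempted):
    """'Accuracy' as defined in Shih et al.: correct / attempted."""
    return correct / attempted

# Subject A: attempts 8 questions, answers 6 correctly.
# Subject B: attempts 2 questions, answers both correctly.
a_score = accuracy(6, 8)  # 0.75
b_score = accuracy(2, 2)  # 1.0

# B's "accuracy" exceeds A's, even though A answered three times
# as many questions correctly.
print(a_score, b_score)
```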
Shih et al (p. 81) do give the following rationale: “Accuracy, however, is a more meaningful dependent variable than the number of questions answered correctly because it takes into account not only the number of questions answered correctly but also the number of questions attempted.” I can see that it does take into account number of questions attempted, but as the example above shows, it does so in a way that doesn’t agree with usual notions of what is good performance. It would make more sense to me to take off some fraction of a point for every incorrect answer (while counting 0 for questions not attempted). This is the method that has been used by the SAT and is sometimes used by teachers.
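The penalty-scoring alternative I describe can be sketched as follows. The penalty fraction here (1/4 of a point per wrong answer, the value the pre-2016 SAT used for five-option questions) is illustrative; any fraction less than 1 gives the same ranking in this example:

```python
def penalty_score(correct, wrong, penalty=0.25):
    """Formula scoring: deduct a fraction per wrong answer;
    unattempted questions count 0."""
    return correct - penalty * wrong

# Subject A: 6 correct, 2 wrong.  Subject B: 2 correct, 0 wrong.
print(penalty_score(6, 2))  # 5.5
print(penalty_score(2, 0))  # 2.0
```

Under this scoring, subject A outperforms subject B, matching the intuitive ranking of their exam performance.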
So I looked further to see if I could find a more convincing rationale for the definition of accuracy. When Shih et al defined accuracy (also on p. 81), they gave a reference to Steele and Aronson (1995), Stereotype threat and the intellectual test performance of African-Americans, Journal of Personality and Social Psychology 69, 797-811. So I looked up that paper. In it, Steele and Aronson don’t give any justification when they first define this “accuracy index” measure (first column, p. 800), but in the next column, in footnote 1 (to the report of the results of test performance), they say,
“Because we did not warn participants to avoid guessing in these experiments, we do not report the performance results in terms of the index used by Educational Testing Service, which includes a correction for guessing. This correction involves subtracting from the number correct, the number wrong adjusted for the number of response options for each wrong item and dividing this by the number of items on the test. Because 27 of our 30 items had the same number of response options (5), this correction amounts to adjusting the number correct almost invariably by the same number. All analyses are the same regardless of the index used.”
I find this footnote a little hard to parse. Here is my attempt to understand it:
- First sentence: Gives a rationale for not adopting the ETS scoring method.
- Second sentence: Explains the ETS scoring method.
- Third sentence: I’m not sure of the relevance here.
- Fourth sentence: Seems to be saying that they have done the analysis using the ETS method and their “accuracy index” method (and maybe by a third method that they’re alluding to in the third sentence?), and got the same results (i.e., same effects significant, same group performing higher) no matter which index they used.
But I don’t think I can read into the footnote any assertion that their “accuracy index” is mathematically equivalent to the ETS method. My reasoning: Take the same example as above (subject A attempts eight questions and answers six of them correctly, yielding accuracy score 0.75; subject B attempts just two questions and answers both of them correctly, yielding the higher accuracy score 1). Suppose the questions all have the same number of answer options, let r be the fraction subtracted for each wrong answer, and let N be the total number of questions on the test. Then subject A gets ETS score (6 − 2r)/N and subject B gets ETS score 2/N, so subject B’s ETS score would exceed subject A’s only if 2 > 6 − 2r, which requires r > 2 – but surely the “fraction” in the ETS method would be less than 1.
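This algebra can be checked numerically. The sketch below drops the common divisor N (which doesn't affect the comparison) and sweeps the penalty fraction r; the function name is mine:

```python
def ets_numerator(correct, wrong, r):
    """ETS-style corrected score, up to the common divisor N:
    number correct minus r per wrong answer."""
    return correct - r * wrong

# The accuracy index ranks subject B (2/2) above subject A (6/8)...
assert 2 / 2 > 6 / 8

# ...but for any plausible penalty fraction r (0 <= r <= 1),
# the ETS-style score ranks A above B.
for r in [0.0, 0.2, 0.25, 0.5, 1.0]:
    assert ets_numerator(6, 2, r) > ets_numerator(2, 0, r)

# Only an implausible r > 2 reverses the ordering: 6 - 2r < 2 iff r > 2.
assert ets_numerator(6, 2, 2.5) < ets_numerator(2, 0, 2.5)
```

So the two measures can rank the same pair of subjects in opposite directions, which rules out mathematical equivalence.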
So it looks like what happened is that, somewhere between Steele and Aronson’s paper and Shih et al’s, the (somewhat cryptic) footnote that Steele and Aronson gave got lost, and “accuracy” became a standard dependent variable in studies of stereotype threat, with no further thought as to whether it is appropriate.
This brings me to: THE GAME OF TELEPHONE
You may have played the game of telephone as a kid: Everyone sits in a circle. One person whispers something to the person next to them; that person whispers it to the next person, and so on around the circle. The last person says out loud what they heard. Almost always, it is quite different from what the first person said. It’s fun as a party game, but when the analogous effect happens with information about use of statistics and research practices, it’s more serious. And the analogous effect does happen with information about statistics. One person misses or misunderstands one detail; then that slightly altered understanding gets passed on to another (e.g., teacher to student, or colleague to colleague), and soon large misunderstandings arise. In particular, the game of telephone serves as an apt metaphor for what seems to have happened somewhere along the line between Steele and Aronson’s paper and Shih et al’s.
This then brings me to: TTWWADI
A number of years ago I worked with a group of high school math teachers who were committed to improving secondary math teaching. They had a sign that consisted of the letters TTWWADI with a slash through it. They explained that TTWWADI stood for “That’s the way we’ve always done it”; their point was that this is not a good reason for doing something. It sounds like the use of “accuracy” in stereotype threat studies at some point became a case of TTWWADI. If anyone can give me a good rationale for using it, I’m willing to listen.
This brings me to:
RECOMMENDATION #1 for improving research proposals (and thereby improving research quality):
- Proposers should give a good reason (not just TTWWADI) for their choices of measures.
- Reviewers of research proposals should check that the reasons given are sound (not just TTWWADI).
(For more examples of how choice of measure can be problematic, see http://www.ma.utexas.edu/users/mks/statmistakes/Outcomevariables.html)
As mentioned above, reading the two papers on stereotype threat in the special issue raised several concerns about common mistakes in using statistics. I will discuss the others in later posts.
* I am both a woman and a mathematician; I have served on a couple of Math Ed Ph.D. committees where stereotype threat was considered; and I have previously written an article for the Association for Women in Mathematics Newsletter (Vol. 41, No. 5, Sept–Oct 2011, pp. 10–13) cautioning about questionable use of statistics in studies involving math and gender, including stereotype threat.