These and similar headlines followed the results of a large-scale initiative called the Reproducibility Project, recently published in Science magazine, which appeared to show that a majority of findings from a sample of 100 psychology studies did not hold up when independent labs attempted to replicate them. (A similar initiative is underway in cancer biology and other fields: Challenges with replication are not unique to psychology.)
Headlines tend to run a little hot. So the media’s dramatic response to the Science paper was not entirely surprising given the way these stories typically go. As it stands, though, it is not at all clear what these replications mean. What the experiments actually yielded in most cases was a different statistical value or a smaller effect-size estimate compared with the original studies, rather than positive evidence against the existence of the underlying phenomenon.
This is an important distinction. Although it would be nice if it were otherwise, the data points we collect in psychology don’t just hold up signs saying, “there’s an effect here” or “there isn’t one.” Instead, we have to make inferences based on statistical estimates, and we should expect those estimates to vary over time. In the typical scenario, an initial estimate turns out to be on the high end (that’s why it ends up getting published in the first place — it looks impressive), and then subsequent estimates are a bit more down to earth.
The true effect size, whatever it is, is something that emerges only after many replications (and even then as an approximation) as we repeat the experiment over and over.
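To make that concrete, here is a minimal simulation sketch in Python. It assumes a modest true effect (a Cohen's d of 0.3), small samples, and a literature that only "publishes" results crossing the conventional significance threshold; all of those numbers are ours, chosen purely for illustration.

```python
# Minimal simulation (illustrative only): why the first published estimate of an
# effect tends to run high, and why pooling many replications homes in on the
# true value. All numbers are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3        # hypothetical true effect size (Cohen's d)
n_per_group = 20    # small samples, typical of early studies

def one_study():
    """Run one two-group experiment; return its effect-size estimate and p-value."""
    treated = rng.normal(true_d, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treated.mean() - control.mean()) / pooled_sd
    p = stats.ttest_ind(treated, control).pvalue
    return d, p

# The "original" finding: keep running studies until one crosses p < .05,
# mimicking publication bias in favor of positive results.
d_published, p = one_study()
while p >= 0.05:
    d_published, p = one_study()

# One hundred direct replications, published or not.
replication_ds = [one_study()[0] for _ in range(100)]

print(f"first published estimate:  d = {d_published:.2f}")
print(f"mean of 100 replications:  d = {np.mean(replication_ds):.2f}")
print(f"true effect:               d = {true_d:.2f}")
```

Run it and the first "published" estimate will typically land well above 0.3, while the average across the replications sits close to the true value: exactly the pattern described above, with no scandal required.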
Now, it isn’t all bunnies and roses. There are some serious problems to address — and most of them are fairly well known. There is the frequent use of overly small sample sizes; the widespread reliance on inappropriate (or just plain wrong) statistical procedures; incomplete reporting of experiments; questionable research practices and even (rarely!) outright fraud; publication bias in favor of “positive” results; the related “file drawer” problem, whereby failed replications and other negative results are simply filed away in a researcher’s bottom drawer; systematic problems with peer review, including its susceptibility to politicking and other forms of abuse — and so on.
Whether this amounts to a crisis or not, these issues cannot be breezily dismissed. And they do have implications for replication. Specifically, if you consider them all together, the likelihood that any given finding from the literature actually is a false alarm goes way up.
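To see how quickly the numbers move, here is a back-of-the-envelope sketch in the spirit of Ioannidis's well-known positive-predictive-value argument. The inputs (the prior odds that a tested hypothesis is true, the statistical power, and a "bias" term standing in for questionable research practices) are assumptions we have picked only to show the direction of the effect, not estimates for any real field.

```python
# Back-of-the-envelope sketch, in the spirit of Ioannidis-style calculations:
# low power, long-shot hypotheses, and a bit of bias together make it much more
# likely that a published "positive" finding is a false alarm. All inputs are
# assumed values, chosen only to show the direction of the effect.

def false_alarm_rate(prior, power, alpha=0.05, bias=0.0):
    """Share of positive findings that are false.

    prior: fraction of tested hypotheses that are actually true
    power: probability of detecting a true effect
    alpha: nominal false-positive rate
    bias:  fraction of would-be negative results reported as positive anyway
           (p-hacking, selective reporting, and so on)
    """
    true_positives = prior * (power + bias * (1 - power))
    false_positives = (1 - prior) * (alpha + bias * (1 - alpha))
    return false_positives / (true_positives + false_positives)

# A tidy field: plausible hypotheses, well-powered studies, no bias.
print(round(false_alarm_rate(prior=0.5, power=0.80), 2))            # ~0.06
# The problems listed above, combined: roughly four in five positives are false.
print(round(false_alarm_rate(prior=0.1, power=0.35, bias=0.2), 2))  # ~0.82
```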
This doesn’t mean that we shouldn’t tolerate a certain amount of error in our publications; after all, failures of various sorts in science are often the wellspring of later discoveries and other important innovations — and the “right” amount of failure might be in the ballpark of what we are seeing.
But taking these problems seriously does mean that high-quality replication attempts — if they consistently fail to show the original phenomenon — should make us think twice about whether that phenomenon really exists. In other words, it’s complicated. And it often comes down to the details.
Let’s take an example. In their famous “walking time” study, John Bargh, Mark Chen, and Lara Burrows provided evidence that participants who were primed with the elderly stereotype (that is, subtly exposed to words that were intended to activate the idea of “old people” on an unconscious level) walked more slowly down the hallway compared with participants who were exposed to a more neutral set of words.
In a well-known replication attempt using infrared sensors (as opposed to students with stopwatches) to time participants as they walked down the hall, Stéphane Doyen and colleagues failed to find evidence of this effect in a sample of 120 undergraduates at the University of Brussels.
A great brouhaha ensued. But this is where it gets tricky. As Michael Ramscar, Cyrus Shaoul, and R. Harald Baayen point out in a forthcoming paper, the replication effort by Doyen and colleagues involved “at least one significant change to Bargh et al.’s methods that has thus far gone unnoticed in the burgeoning literature devoted to this debate.”
Specifically, the replication experiment involved French-speaking participants rather than English-speaking participants, and all of the study materials were translated into French.
You might think that this shouldn’t make a difference. But as it turns out, French and English are very different from each other in terms of the relationship between adjectives (which are used to “prime” mental associations) and nouns (in this case, the stereotype of an old person). Even more to the point, Ramscar and colleagues demonstrated that the specific adjectives used in the original English-language experiment by Bargh and colleagues — based on their frequency in common usage, and their relationship to the targeted stereotype — are actually much more likely to trigger the concept of old people than their translated French equivalents.
When it comes to their experience of encountering the specific words used to prime the elderly stereotype “in contexts where they actually served as primes to nouns,” Ramscar and his colleagues wrote, “we can expect that the subjects in Bargh et al.’s study will have had something of the order of six times more experience” than the subjects in the replication study.
Among other factors, then, this change in linguistic context could explain why the replication didn’t “work.”
Of course, it might also be the case that the original effect is illusory (although we should point out that a recent meta-analysis shows that similar effects are pretty well-supported). The lesson here is that a single failed replication tells us almost nothing about which interpretation is correct. Instead, it is the accumulation of evidence — paying close attention to the specific conditions that might be relevant to producing the finding, and being careful not to make any relevant changes — that should shape our confidence (one way or the other) over time.
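One way to picture that accumulation is a simple fixed-effect meta-analysis, which pools the estimates from an original study and its direct replications, weighting each by its precision. The sketch below uses five invented (effect size, standard error) pairs; nothing about it is specific to the walking-time literature.

```python
# Sketch of evidence accumulating across studies: a fixed-effect meta-analysis
# that pools effect-size estimates, weighting each by its precision. The five
# (estimate, standard error) pairs below are invented for illustration:
# one original study plus four direct replications.
import numpy as np

studies = [(0.60, 0.25), (0.15, 0.12), (0.25, 0.10), (0.05, 0.15), (0.30, 0.11)]

effects = np.array([d for d, _ in studies])
weights = 1.0 / np.array([se for _, se in studies]) ** 2   # inverse-variance weights

pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled estimate: {pooled:.2f} +/- {1.96 * pooled_se:.2f} (95% CI)")
# No single study, original or replication, settles the question on its own;
# it is the pooled estimate and its interval that should move our confidence.
```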
In other words, we need a lot of replications. And not just any old replications, but good-faith, high-quality, “exact” replications (meaning replications that attempt to be as close to the original as possible) with large sample sizes. But with the notable exception of the Reproducibility Project (which still involved only a single replication of each of the 100 selected studies — so not enough for us to draw any meaningful conclusions), “exact” replication in psychology is like an endangered species: It’s incredibly rare, and indeed it has been from the beginnings of the discipline.
If we want to know which findings from the literature are reliable, then, we need to make replication “normal.” We need to make it par for the course. And to do that, we need to know why it currently isn’t.
Of course, psychologists do a lot of what is called conceptual replication, where certain materials or procedures are changed from the original in various ways, but as we just saw in the “walking time” example, this makes it hard to interpret the results. Psychology needs exact replications, then, but as we said, these are rarely carried out.
The reasons for this are actually pretty simple: The more time I spend trying to replicate your work, the less time I have for my own projects. And the less time I have for my own projects, the less chance I have of scoring research grants, getting tenure, and so on. It’s a classic social dilemma — a situation where collective interests are at odds with private interests.
To make the point a slightly different way: While it is in everyone’s interest that high-quality, direct replications of key studies in the field are conducted (so that we can know what degree of confidence to place in previous findings), it is not typically in any particular researcher’s interest to spend her time conducting such replications.
As Huw Green, a Ph.D. student at the City University of New York, recently put it, the “real crisis in psychology isn’t that studies don’t replicate, but that we usually don’t even try.”
What is needed is a “structural solution” — something that has the power to resolve collective-action problems like the one we’re describing. In simplest terms, if everyone is forced to cooperate (by some kind of regulation), then no single individual will be at a disadvantage compared to her peers for doing the right thing.
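For readers who like the dilemma spelled out, here is a toy payoff table with invented numbers. It is only a sketch of the incentive structure we have in mind, not a model of any real department.

```python
# Toy payoff sketch of the social dilemma described above, with invented numbers:
# whatever colleagues do, an individual researcher's career does better if she
# skips replication work, even though the field does best when everyone replicates.

payoff = {
    # (my choice, everyone else's choice): my career payoff (arbitrary units)
    ("replicate", "replicate"): 3,   # reliable literature, but I publish less novel work
    ("novel",     "replicate"): 5,   # I free-ride on others' replications
    ("replicate", "novel"):     1,   # I sacrifice output and the literature stays shaky
    ("novel",     "novel"):     2,   # the status quo: everyone chases novelty
}

for others in ("replicate", "novel"):
    best = max(("replicate", "novel"), key=lambda me: payoff[(me, others)])
    print(f"if everyone else does {others} work, my best reply is {best} work")
# Both best replies are "novel", even though all-replicate beats all-novel (3 > 2),
# which is why a rule that binds everyone at once can leave everyone better off.
```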
There are lots of ways of pulling this off — and we don’t claim to have a perfect solution. But here is one idea. As we proposed in a recent paper, graduate students in psychology should be required to conduct, write up, and submit for publication a high-quality replication attempt of at least one key finding from the literature (ideally focusing on the area of their doctoral research), as a condition of receiving their Ph.D.s.
Of course, editors would need to agree to publish these kinds of submissions, and fortunately a growing number of journals — led by PLoS ONE — are willing to do just that.
Because these replication attempts would be a requirement for all students (at least in accredited programs), they would not put an unfair burden on any individual student. Then, once students got their Ph.D.s (with the clock now ticking for tenure and promotions), they could feel free to spend their precious research hours working on whatever projects they believed would be the most advantageous, personally or professionally.
If this policy were put into place, it would create a massive, constantly renewing source of high-quality replications, as a new wave of graduate students came through every year.
There is of course the question of how this would actually happen. For starters, our idea would have to be floated at the right level of decision-making authority, so that the new requirement could be instituted as a matter of policy for all accredited programs simultaneously. (This is because the problem with social dilemmas, in general, is that they can’t be resolved in a piecemeal fashion: Each individual who sticks her neck out to cooperate typically pays a price for doing so.) But before that could happen, the basic soundness of the idea would have to be established. And that means subjecting it to critical scrutiny.
Since our paper was featured several weeks ago in Nature, we’ve begun to get some constructive feedback. As one psychologist wrote to us in an email (paraphrased):
Your proposed solution would only apply to some fields of psychology. It’s not a big deal to ask students to do cheap replication studies involving, say, pen-and-paper surveys — as is common in social psychology. But to replicate an experiment involving sensitive populations (babies, for instance, or people with clinical disorders) or fancy equipment like an fMRI machine, you would need a dedicated lab, a team of experimenters, and several months of hard work — not to mention the money to pay for all of this!
That much is undoubtedly true. Expensive, time-consuming studies with hard-to-recruit participants would not be replicated very much if our proposal were taken up.
But that is exactly the way things are now — so the problem would not be made any worse. On the other hand, there are literally thousands of studies that would benefit from being replicated and that can be tested relatively cheaply, at a skill level commensurate with a graduate student’s training. In other words, having students perform replications as part of their graduate work is very unlikely to make the problem of not having enough replications any worse, but it has great potential to help make it better.
Beyond this, there is a pedagogical benefit. As Michael C. Frank and Rebecca Saxe have written: In their own courses, they have found “that replicating cutting-edge results is exciting and fun; it gives students the opportunity to make real scientific contributions (provided supervision is appropriate); and it provides object lessons about the scientific process, the importance of reporting standards, and the value of openness.”
At the end of the day, replication is indispensable. It is a key part of the scientific enterprise; it helps us determine how much confidence to place in published findings; and it will advance our knowledge in the long run.
Here is what we have to remember. It’s actually OK if some (or even many) of these replications turn out, in the end, to be “failures.” That’s how science works. We want scientists to take risks, to make discoveries, to explore the unknown — and that means getting some things wrong along the way. So there is a public-education aspect to this as well. It’s vital that nonscientists understand that science is messy (even when it’s working as it should), so that they don’t flat-out reject it when some of that mess starts to show. Headlines suggesting that psychological science is fundamentally broken are probably not helping.
Nevertheless, the days of automatic deference to people in white lab coats are over. We have to be honest about the limitations of research, and we do have a lot of work to do to cut down on questionable research practices, sloppy statistics, publication bias, and ineffective peer review. In the meantime, there is no need to panic each time we get a different statistical estimate the second time we run an experiment. Science takes time. It is the accumulation of evidence that counts. Now let’s roll up our sleeves and get to work.