At first, it looked like a paradigm of science done right. A group of behavioral scientists had repeated the same experiments over and over in separate labs, following the same rigorous methods, and found that 86 percent of their attempts had the results they expected.
In a field where the seemingly constant collapse of influential discoveries over the past decade has triggered a reproducibility crisis, this finding was welcome news. The study’s authors included heavy hitters in the science-reform movement, and it appeared in a top journal, Nature Human Behaviour, in November.
“The high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries,” concluded the paper, which has been cited more than 70 times, according to Google Scholar. “The reforms are working,” a press release declared, and a news story asked: “What reproducibility crisis?”
But now the paper has been retracted, following a monthslong journal investigation into concerns about how it had been designed and written. For starters, it claimed that every aspect of the project had been “preregistered,” referring to the increasingly popular practice of publicly describing plans for a study ahead of time. The hope is that when researchers are held accountable to an original plan, they won’t be tempted to change course later to make their data “fit.”
But this project’s core premise — that it was testing whether rigor-enhancing practices like preregistration increase replicability — did not itself appear to have been preregistered. Other documents that have been posted online since publication make clear that the study was initially conceived to explore another subject altogether. And when their hypothesis did not pan out, the scholars appeared to overhaul their paper’s focus while playing down their original intentions.
In explaining the retraction, the journal cited a “lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.” As a result, the journal’s editors stated that they “no longer have confidence in the reliability of the findings and conclusions reported in this article.”
Critics say that those issues cut against the paper’s message — and some of its authors’ reputations as champions of scientific rigor. Brian A. Nosek, a psychologist at the University of Virginia, conducted some of the seminal studies about the reproducibility crisis and directs the Center for Open Science, a nonprofit that advocates for transparency in science and operates a database of preregistrations, a concept he pioneered. Another collaborator, the University of California at Berkeley business professor Leif Nelson, is part of a trio of scholars whose blog, Data Colada, has rooted out problems in the work of behavioral-science stars like Amy Cuddy, Dan Ariely, and Francesca Gino.
Some observers noticed on Monday that the paper was initially labeled as retracted with no explanation. (A spokesperson for Nature Human Behaviour said by email that “a technical issue meant that the article was flagged as being retracted before the retraction notice was published.”) The retraction notice posted on Tuesday stated the authors have been invited to submit a new manuscript for peer review. It also stated that “all authors agree to this retraction due to incorrect statements of preregistration for the meta-study as a whole but disagree with other concerns listed in this note.”
In a statement released after the retraction notice was posted, six of the paper’s authors said in part, “We are embarrassed by our erroneous statement about what was preregistered.” The paper listed 17 total authors, led by John Protzko of Central Connecticut State University and Jonathan W. Schooler of the University of California at Santa Barbara.
“It seems like a lot of the aspects of this paper and its process sort of mirror the original concerns that motivated reforms in the first place,” said Joseph Bak-Coleman, a computational social scientist affiliated with the University of Konstanz, in Germany, and one of the researchers who raised concerns about the study. He added, “Here, in the best-case scenario — in a paper about the importance of embracing these reforms, by the experts who developed these reforms — the reforms themselves haven’t been well-embraced.”
He and other critics also said that they were surprised to have been met with silence and resistance from the open-science community. A decade into a reform movement that many agree is badly needed, the incident raises the question of how willing its members are to police their own. The whole experience, Bak-Coleman said, has been “like being in the upside-down.”
Data for the paper was collected over five years by labs at Santa Barbara, Stanford University, Berkeley, and UVA. According to the paper, as each of the four labs was conducting its typical research into topics like psychology, marketing, and decision making, it chose four discoveries that hadn’t previously been reported in the scientific literature to test as pilot studies.
The originating lab conducted a self-confirmatory test of each pilot, then attempted a replication. The other three labs also tried to replicate each pilot, following written instructions from the original lab. There were, in total, 16 self-confirmatory studies and 64 replications. Each confirmatory test and replication was done on groups of at least 1,500 people, the paper stated. And, it said, all the tests, including the pilots and the project as a whole, were preregistered.
Fifty-five of the 64 replications — 86 percent — succeeded, the paper reported, noting that rates in previous replication efforts had ranged from 30 percent to 70 percent. “The present results are reassuring about the effectiveness of what we think of as best practices in scientific investigations,” the scientists wrote. When discoveries were tested in preregistered, large confirmatory tests with transparent methods, “the observed rate of replication was high.”
But Berna Devezer, a meta-scientist at the University of Idaho, doesn’t think the study proves that “best practices” caused the high rate. “It wasn’t designed to justify these conclusions, and it can’t,” she said. To gauge whether factors like sample size affect the chances of replicability, she said, the scientists would have had to alter those variables across different rounds of experiments and compare the outcomes, instead of applying the same factors to every test.
When Devezer and Bak-Coleman began looking at the study last year, they came to believe the 86-percent replication rate had been calculated in an unusual way, as they wrote in a critique published alongside the retraction notice. For example, a replication was considered successful if its result was consistent with the hypothesis observed in the pilot stage. But the team disclosed little about how it had made the initial discoveries and how it had chosen them for replication attempts. “They’re basing success on part of the study that they didn’t make open or share,” Bak-Coleman said.
Jessica Hullman, a computer-science professor at Northwestern University, said that without more information about whether and how rigor-enhancing practices had been applied to the foundational studies, it’s difficult to conclude that those practices had been responsible for the high replication rate. “We just don’t know enough,” she said.
Hullman’s questions about the pilot studies could be addressed by their preregistrations. If only she could find all of them.
The paper states that “all confirmatory tests, replications, and analyses were preregistered,” including “for this meta-project.” It provides a link to a page on the Center for Open Science’s database. But when Bak-Coleman clicked it last year, the materials he found did not match what the paper claimed to have tested and analyzed.
On November 13, he emailed Nosek and the lead author, Protzko, an assistant professor of psychological science at Central Connecticut State, to ask where the preregistration was. Nosek deferred to Protzko, who looped in a third collaborator at the University of Wisconsin at Madison, emails show. The authors said they were traveling and weren’t sure, and ultimately didn’t produce the documentation. A week later, Bak-Coleman and Devezer posted a critique on a preprint server, titled “Causal Claims About Scientific Rigor Require Rigorous Causal Evidence.” A few weeks after that, the journal added an editor’s note that said it was investigating criticisms of the paper.
The authors then uploaded a trove of documents that revealed how the project had evolved over more than a decade from its original target: a phenomenon known as the “decline effect.”
Findings in various scientific fields sometimes diminish upon repeat experimentation. Scientists theorize that may happen when the initial discoveries are false positives or errors or are derived through questionable research practices. But Schooler, a psychologist at Santa Barbara, has long entertained a more supernatural explanation: that the mere act of scientific observation somehow causes effects to wear off over time.
The 2023 study he helped organize was funded by the Fetzer Franklin Fund, a group that describes itself as investing in “scientific methodologies for both conventional and frontier research.” According to documents uploaded in 2018 and 2019, one of which is titled “Proposed Overarching Analyses for the Decline Effect,” Schooler’s study always consisted of four labs attempting multiple confirmation studies and replications of new hypotheses. Initially, the preregistrations state, the scientists wanted to analyze whether the decline effect occurred when participants were tested in sequential groups, or as more replications accumulated. (In their statement on Tuesday, the six scientists wrote that “the unconventional explanations were not considered plausible (or even possible) to most of the team,” but nevertheless, they agreed to include “tests of those possibilities within a so-called ‘best practices’ proof-of-concept study pursuing high replicability.”)
But the decline effect failed to materialize. Then the manuscript took a new direction, as observers have pointed out.
Subsequent files describe an analysis that is not mentioned in the preregistrations, the analysis that would ultimately define the paper: a comparison of the outcomes of the various confirmatory tests and replications. In a code file last updated in 2020, the analysis is labeled as “exploratory,” a term for an analysis run without a hypothesis fixed in advance. In the final version, it’s labeled as “confirmatory” — a term indicating that it tested a hypothesis specified from the start.
As the paper wound toward publication, the authors continued to distance themselves from the decline effect. In a lengthy and searing peer review for Nature, Tal Yarkoni, then a research associate professor of psychology at the University of Texas at Austin, wrote: “I found the recurring ‘decline’ theme throughout the article fairly puzzling” and “at odds with the overall framing.” In their response, the authors explained that “some have posited” a supernatural theory “at the fringes of scientific discourse,” but they’d thought it “prudent” to study the possibility anyway. They said, though, that they’d relegated those references to a supplementary section of the paper.
Yarkoni also raised other issues that would later draw heat. He criticized the authors for providing “essentially no explanation” of how they’d conducted and chosen the 16 pilot studies. (The authors said that their evaluation of the rigor-enhancing practices’ effects began after the pilot stage of testing.)
And Yarkoni wrote: “Given the strong emphasis on preregistration and its benefits, I found it a bit worrisome that the authors did not prominently link to a preregistration document for *this* project as a whole.” After that comment came in, Protzko revised the manuscript to add the claim that the study had been preregistered, with a link to the “decline effect” plans, according to documents made public as part of the investigation.
Another peer reviewer was Daniël Lakens, a psychologist and meta-scientist at the Eindhoven University of Technology, in the Netherlands. In an interview, he acknowledged that not everything in the paper had been preregistered, and agreed the paper should have made clear that it was originally about the decline effect. But he said that he’d sided with the scientists’ explanation for their 86-percent replication rate. He also said that, unlike others, he did not interpret the study as explicitly claiming that rigor-enhancing practices had caused the high rate.
The paper was ultimately rejected by Nature but accepted by Nature Human Behaviour. A peer reviewer at the final stage was Malte Elson, a psychologist at the University of Bern, in Switzerland, and the co-creator of a program that pays sleuths to root out scientific errors. His evaluation gave a thumbs-up. The high replication rate, he suggested, could be because “the authors are probably just better scientists than the average person contributing to ‘the literature.’”
In an interview, Elson said that his role had been “totally unclear” to him. The journal had told him that the paper had been rejected by Nature, and shared the existing peer reviews and authors’ responses, he said. Because he was under the impression that the editors were fine with the paper as it stood, Elson said he’d decided to add comments that differed from what had already been said. “Had I been a reviewer in Round 1, my review would look differently,” he said, adding that he agreed with many of Yarkoni’s criticisms.
Unraveling this trail reminded Bak-Coleman of the wayward practices that birthed the replication crisis in the first place — where “people test a fanciful idea, get a null finding, and then convert it into a positive finding aligned with their previous work,” he said. “It’s a clear-cut case of outcome-switching.”
The open-science movement generally encourages the practice of scrutinizing research. But critics of the replicability paper say that they’ve been surprised that some open-science proponents have been less than receptive to their concerns.
On November 22, two days after Bak-Coleman and Devezer posted their critique on the preprint server, Bret Beheim, an ecologist at the Max Planck Institute for Evolutionary Anthropology, posted on Bluesky about what he called “the public pile-on of Protzko et al.’s recent preregistration study,” and linked to a paper about “promoting civility” in science.
“Have you really seen a pile-on?” Devezer replied. “I have not seen anything personal.”
Beheim said he thought she and Bak-Coleman had given the authors too few days to respond.
“This is quite a strange reaction to me,” Devezer wrote, explaining that they had first contacted the scientists privately, then sent them their write-up before publishing it.
Beheim insisted he was all for “healthy criticism.” He himself had criticized a paper that got retracted in the past, he said, but had given the authors nearly a month to respond. In this case, he worried that the pair had rushed to judgment. “If the critics just misunderstood something,” he wrote, “the subsequent public ridicule was undeserved.”
Devezer said in an interview that, if anything, the open-science community appeared to say very little about the criticisms. Perhaps some individuals were scared to voice support, she said, adding that she was speculating. “Some people just have an a priori confidence and trust in the authors, and they expect that they cannot do any wrong or anything, and some people just don’t want to hear any criticism of open-science practices,” she said.
In the spring, Hullman, the Northwestern computer scientist, wrote a detailed analysis of the study on a blog popular with statisticians and social scientists. Her first post on the matter, written months prior, had chided the authors for not locating their preregistration in a timely manner. This time, she cited the slew of changes and deletions in the historical documents she’d spent hours poring over.
“I want to believe that these practices do work, and that the open-science movement is dedicated to honesty and transparency,” she wrote. “But if papers like the Nature Human Behaviour article are what people have in mind when they laud open-science researchers for their attempts to rigorously evaluate their proposals, then we have problems.”
By then, Bak-Coleman had submitted to the journal a more detailed assessment expanding on the critique he’d written with Devezer, one that he’d prepared on his own and intended to keep private, he said in an interview. And Lakens, the psychologist who’d served as a peer reviewer at Nature, had publicly disclosed that he was helping Nature Human Behaviour evaluate criticisms of the paper. “I submitted an 11-page document with what I think of this, after the editors asked me to look at all documents,” he wrote in an exchange on X in March.
When Hullman posted her write-up on X later that month, Lakens chimed in and made references to developments in the ongoing investigation. “You say you are annoyed the authors have not responded after four months,” he wrote. “But you know why they did not respond and it is not the authors’ fault.” He wrote that “Bak-Coleman and Devezer” had submitted a new critique that “triggered a more extensive review process.” He added: “You know this all, I assume, but do not mention it. Which is weird.”
Hullman was not helping with the investigation at the time — though the journal asked her to participate days later, she says — so she was surprised by Lakens’s accusations, she said in an interview. All she knew of the authors’ side of the story was that Nosek had said on social media, shortly after publication, that they were looking for the preregistration. On that front, Hullman says she has little sympathy for the argument that the authors didn’t have enough time to respond. “When you publish something, at the time that it is published, what you’re saying should hold,” she said. “And if it doesn’t, then critics like me have every right to ask you about it.”
Devezer said it was “careless” of Lakens to incorrectly name her. And Bak-Coleman was upset at being identified. “This is a community where there’s been a lot of discussion of the need to protect whistle-blowers,” he said, pointing to the outpouring of support that the Data Colada scholars received in the face of a defamation lawsuit. Lakens’s post, given that he was assisting with the investigation, was “beyond the pale of what’s normal and acceptable in science,” he said.
“I should have not tweeted about it,” Lakens said in an interview, referring to the investigation, “and only stuck with what was publicly known.” He said that he regretted erroneously naming Devezer, and that he had deleted his replies to Hullman after the journal told him that he’d violated confidentiality rules. “It was a super-stupid thing to do,” he said.
Lakens has previously worked with some of the study’s authors. He and Protzko, along with others, are co-authors of a preprint released this summer. (Lakens said that he only found out in early July that Protzko was being added to the paper, and that they never communicated while the study was being done.) He said those relationships didn’t affect his ability to evaluate the Nature Human Behaviour study. “I don’t have a problem with criticizing these people,” he said.
The takeaway, he said, is not that the authors are hypocrites. “If we want to put these people on some sort of pedestal where they are infallible, that makes no sense,” Lakens said.
And Elson points out that the story is ultimately a testament to the value of preregistering. “We have records to argue about,” he said. “All of this would have been guesswork, basically, had there not been preregistration as a practice in those labs.”
To Bak-Coleman, the situation is a “very, very human” demonstration of just how tough it is to clean up science. “Procedural reforms don’t guarantee that people will do them in the way they’re intended,” he said. “It doesn’t guarantee that reviewers will read them. It doesn’t guarantee that people who want to speak up will be protected from harassment and have their identity protected.”
But “if we don’t have a solution for these problems in the best-case scenario,” he added, “how well would they play out elsewhere?”