Research

A New Theory on How Researchers Can Solve the Reproducibility Crisis: Do the Math

June 28, 2017

Lionel Cironneau, AP Images
Jeanne Calment of France was believed to be the world’s longest lived person when she died in 1997 at age 122. A recent headline-grabbing study about the limits of the human life span has drawn rebuttals with implications for how universities and scientists might approach the reproducibility crisis in research.

From the beginning, it seemed like a difficult prediction.

In an article published last October in Nature, three researchers affiliated with the Albert Einstein College of Medicine in New York City said they had crunched the numbers and concluded that humans will never consistently live much beyond 115 years.

"From now on, this is it," one of the three authors, Jan Vijg, a professor of genetics at Albert Einstein, told The New York Times, one of several major news outlets that helped promote the sobering news. "Humans will never get older than 115."

But almost immediately, the conclusion was attacked by numerous critics citing various problems with the Albert Einstein team’s statistical analysis. That criticism cascaded Wednesday when Nature published another five rebuttals.

Among the allegations: Mr. Vijg and his partners failed to properly consider what statisticians call the "null hypothesis." In this case, Bryan G. Hughes and Siegfried Hekimi of McGill University explained in one of the critiques, applying the null hypothesis means statistically including the possibility that the maximum human life span actually will continue to increase.

"There are strong statistical grounds to question the validity of their conclusions," wrote another team, comprising Maarten P. Rozing, Thomas B.L. Kirkwood, and Rudi G.J. Westendorp of the University of Copenhagen. "There might be a limit to human lifespan, but we believe that their results provide no evidence," wrote a third, Adam Lenart and James W. Vaupel of the University of Southern Denmark.

Mr. Vijg stands by his work. The "real problem," he said, "is that some people get hysterical when someone openly sheds doubt on the idea that we can live forever, or at least much longer than we do now."

Statistical Shortcomings

As scientists across various fields move through a period of soul-searching over the disturbing number of studies that apparently cannot be reproduced, the leading suspects include industry bias, financial and career pressures, poor study design, and wide variations in research methodologies, equipment, and standards.

But the conversation over the study on aging points to another possibility: that too much research is hamstrung by a lack of pure statistical ability. Universities, scientists, and advocacy groups may have overlooked the seriousness of that problem as they hunt for more complex or nefarious causes of the reproducibility crisis.

Cory Fournier, an adjunct instructor in mathematics at the University of Massachusetts at Lowell, came to that conclusion earlier this year, after he cobbled together $1,000 in scarce union funds to journey to a big national conference on scientific reproducibility.

Mr. Fournier said he made the trip to the National Academy of Sciences headquarters in Washington, D.C., expecting to commune with fellow statisticians. After all, he reasoned, there are lots of ways that research errors can be tied to poor statistical analyses — including haste-induced shortcuts, technical confusion, and outright manipulation.

Instead, upon arrival in the conference hall, he noticed a strange absence. "I don’t believe that I met any other statisticians," he said.

At least one did speak at the three-day event — Giovanni Parmigiani, a professor of biostatistics at Harvard University. And Mr. Parmigiani and other experts assembled by the National Academies did cite statistical rigor as one of the key areas needing improvement.

But Mr. Fournier sees an oversight at a more fundamental level. In all fields, he said, researchers need either to develop a working knowledge of statistics or to include someone with statistical expertise on their research teams.

And with that expertise, scientists should think with more statistical nuance about questions such as whether a research finding is statistically significant, Mr. Fournier said.

Many studies answer that question with a simple "yes" or "no," relying on a calculation called a p-value to do so. With a cutoff of .05, as is typical, a study’s finding is deemed significant if results at least as extreme as those observed would turn up less than 5 percent of the time were chance alone at work.

More useful, Mr. Fournier said, would be a practice in which yes-or-no declarations would be replaced in journal articles by more specific estimates of how likely it is that a particular research observation arose by chance alone: 1 in 20, for example, or 1 in 100, or 1 in 1,000.
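The difference between a yes-or-no verdict and a stated chance can be made concrete with a short sketch. The example below is only an illustration under assumed conditions, not a method taken from any study mentioned here: the trial data (16 improvements among 20 patients), the 50 percent baseline, and the function names are all hypothetical, and an exact one-sided binomial test stands in for whatever analysis a real study would use.

```python
from math import comb


def one_sided_binomial_p(successes: int, trials: int, null_rate: float = 0.5) -> float:
    """Chance of at least `successes` out of `trials` if only the null rate is at work."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def yes_or_no(p: float, threshold: float = 0.05) -> str:
    """The conventional binary verdict against a fixed cutoff."""
    return "significant" if p < threshold else "not significant"


def as_odds(p: float) -> str:
    """Phrase the same p-value as 'about 1 in N'."""
    if p <= 0:
        raise ValueError("p must be positive")
    return f"about 1 in {round(1 / p):,}"


# Hypothetical data: 16 of 20 patients improved, against an assumed 50% baseline.
p = one_sided_binomial_p(16, 20)
print("Binary verdict:", yes_or_no(p))
print("Chance of a result this extreme under the null:", as_odds(p))
```

Both lines describe the same calculation. The second simply keeps the number that the first throws away.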

That numerical specificity of estimates may already exist inside many articles, Mr. Fournier said. But highlighting it in words, he said, should help emphasize what statisticians know to be true — science cannot make definitive yes-or-no declarations in most cases — and perhaps also encourage the publication of studies now abandoned in the belief they failed to show a useful outcome. Better statistical expertise also could help scientists construct experiments that are more likely to be reliable in the first place, he said.

One of the conference organizers, Victoria Stodden of the University of Illinois at Urbana-Champaign, said she recognizes the ways that biases of various types — financial conflicts of interest, academic promotion incentives, and the allure of fame — can contribute to irreproducibility problems in science.

But Ms. Stodden, an associate professor of information sciences, said she agrees that the ongoing misuse of statistics is a broader problem. While researchers may need to work harder to include statisticians on their teams, she said, statisticians also must work harder to study how they could be more helpful to their interdisciplinary colleagues.

"Developing a research agenda within the statistical community to address issues surrounding reproducibility is imperative," she said.

‘A Tool, Nothing More’

For his part, Mr. Vijg isn’t convinced his team failed basic statistical analysis. His paper used records from sources that included the International Database on Longevity and the Human Mortality Database. It then made calculations suggesting that, while average human life expectancy may continue to increase, the maximum age of the oldest surviving humans will not substantially move beyond about 115 years.

"We went through a highly experienced and reputed statistician before submitting the work," Mr. Vijg said in a written exchange about the criticisms. At the same time, he argued that resolving differences in findings between competing labs is less a matter of procuring advanced statistical expertise and more a matter of the two groups getting together and identifying variations in their experimental conditions.

"Look, statistics is a tool, nothing more," he said. "It certainly is not the arbiter of scientific truth."

An author of another of the five critiques published Wednesday by Nature, Nicholas J.L. Brown of the University of Groningen, said the case exhibits multiple problems seen across science — including statistical errors and some researchers’ basic pursuit of fame.

The statistical errors, wrote Mr. Brown and his colleagues at Groningen, included a failure by Mr. Vijg’s team to compare the fit of its model to alternatives, and the use of small sample sizes that failed to properly handle the case of a lone outlier, Jeanne Calment of France, who died in 1997 at the record age of 122.

Mr. Vijg said repeatedly that his Nature paper made no "definitive statement" about a maximum human age and that he felt "amazement" that anyone might think otherwise. But he acknowledged approving a news release about his study issued by Albert Einstein College with the headline: "Maximum human lifespan has already been reached, Einstein researchers conclude."

The scientific question at hand never even seemed to make much sense, said Mr. Brown, a doctoral student in health psychology at Groningen, because advances in average human lifespan are far more important than the future maximum age of a single person. "The whole article might as well have been designed to create clickbait headlines," he said.

That type of low-value scientific pursuit is only becoming more common with the advent of modern computer-processing capabilities, Mr. Brown said. Computers let people "explore a half-million alternative realities in 10 minutes," and then pick out something that seems interesting, without spending too much time on developing meaningful hypotheses, he said.
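Why sifting through many "alternative realities" is risky can be shown with a small simulation. This is a rough sketch under assumed conditions, not a reconstruction of any analysis discussed here: every simulated "study" is pure noise (100 fair coin flips), the test is a simple normal approximation, and the function name, seed, and counts are hypothetical.

```python
import math
import random


def count_spurious_hits(n_studies: int, flips: int = 100,
                        threshold: float = 0.05, seed: int = 0) -> int:
    """Test many pure-noise 'studies' and count how many look significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each random bit is one fair coin flip; count the heads.
        heads = bin(rng.getrandbits(flips)).count("1")
        # Normal approximation to the binomial under the null (fair coin).
        z = (heads - flips / 2) / math.sqrt(flips / 4)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        if p < threshold:
            hits += 1
    return hits


n = 2000
print(f"{count_spurious_hits(n)} of {n} noise-only studies passed p < .05")
```

No real effect exists in any of these simulated studies, yet roughly 1 in 20 should still clear the cutoff. Picking out only the results that "seem interesting" after the fact would surface exactly those.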

Without qualified statistical experts to guide them, researchers will continue to encounter big problems, Mr. Brown said. "Statistics is demanding in the same way as flying a plane, but many scientists only have the equivalent of a driver’s license," he said. "As a result, they’re crashing into the side of a mountain on a rather regular basis."

Paul Basken covers university research and its intersection with government policy. He can be found on Twitter @pbasken, or reached by email at paul.basken@chronicle.com.