Fraud Scandal Fuels Debate Over Practices of Social Psychology

Even legitimate researchers cut corners, some admit

One recent study said exposure to an image of the American flag—even months earlier—could push people toward voting Republican. It made a good headline, and Travis Carter, one of its authors, says it was also good science: "We don't have a big file drawer of failed studies."
November 13, 2011

The discovery that the Dutch researcher Diederik A. Stapel made up the data for dozens of research papers has shaken up the field of social psychology, fueling a discussion not just about outright fraud, but also about subtler ways of misusing research data. Such misuse can happen even unintentionally, as researchers try to make a splash with their peers—and a splash, maybe, with the news media, too.

Mr. Stapel's conduct certainly makes him an outlier, but there's no doubt he was a talented mainstream player of one part of the academic-psychology game: The now-suspended professor at Tilburg University, in the Netherlands, served up a diet of snappy, contrarian results that reporters lapped up.

Consider just two of his most recent papers: "Power Increases Infidelity Among Men and Women," from Psychological Science, and "Coping With Chaos: How Disordered Contexts Promote Stereotyping and Discrimination," from Science—two prestigious journals. The first paper upended a gender stereotype (alpha-female politicos philander, too?!), while the second linked the physical world to the psychological one in a striking manner (a messy desk leads to racist thoughts!?). Both received extensive news coverage.

Even before the Stapel case broke, a flurry of articles had begun appearing this fall that pointed to supposed systemic flaws in the way psychologists handle data. But one methodological expert, Eric-Jan Wagenmakers, of the University of Amsterdam, added a sociological twist to the statistical debate: Psychology, he argued in a recent blog post and an interview, has become addicted to surprising, counterintuitive findings that catch the news media's eye, and that trend is warping the field.

"If high-impact journals want this kind of surprising finding, then there is pressure on researchers to come up with this stuff," says Mr. Wagenmakers, an associate professor in the psychology department's methodology unit.

Bad things happen when researchers feel under pressure, he adds—and it doesn't have to be Stapel-bad: "There's a slippery slope between making up your data and torturing your data."

In September, in comments quoted by the statistician Andrew Gelman on his blog, Mr. Wagenmakers wrote: "The field of social psychology has become very competitive, and high-impact publications are only possible for results that are really surprising. Unfortunately, most surprising hypotheses are wrong. That is, unless you test them against data you've created yourself."

Is a desire to get picked up by the Freakonomics blog, or the dozens of similar outlets for funky findings, really driving work in psychology labs? Alternatively—though not really mutually exclusively—are there broader statistical problems with the field that let snazzy but questionable findings slip through?

Statistical Significance

Discovering important results in small samples of test subjects is always a tricky business, and psychologists who want to reform the field's practices have noted that much hinges on the statistical tools used.

To show just how easy it is to get a nonsensical but "statistically significant" result, three scholars, in an article in November's Psychological Science titled "False-Positive Psychology," first showed that listening to a children's song made test subjects feel older. Nothing too controversial there.

Then they "demonstrated" that listening to the Beatles' "When I'm 64" made the test subjects literally younger, relative to when they listened to a control song. Crucially, the study followed all the rules for reporting on an experimental study. What the researchers omitted, as they went on to explain in the rest of the paper, was just how many variables they poked and prodded before sheer chance threw up a headline-making result—a clearly false headline-making result.

The odds of statistical bogosity grow when researchers don't have to report all the ways they manipulated their data in exploratory fashion. For example, the researchers "used father's age to control for baseline age across participants," thereby fudging the subjects' actual ages. They factored in lots of completely irrelevant data. And, rather than establish from the outset how many subjects they would test, they tested until they obtained the false result.
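The last of those tactics, testing until a result appears, can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not the procedure from the "False-Positive Psychology" paper: the two groups are drawn from the same distribution, so any "effect" is a fluke, and the function names, sample sizes, and simple z-test are all hypothetical choices made for illustration.

```python
import math
import random

def p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means, assuming equal
    group sizes and a known population standard deviation of 1."""
    n = len(group_a)
    diff = sum(group_a) / n - sum(group_b) / n
    z = diff / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek, trials=2000, start=10, stop=50, seed=1):
    """Share of simulated studies of a nonexistent effect that reach p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(stop)]
        b = [rng.gauss(0, 1) for _ in range(stop)]
        # A peeking researcher tests after every added subject and stops at
        # the first "significant" result; a fixed design tests once at the end.
        sizes = range(start, stop + 1) if peek else [stop]
        if any(p_value(a[:n], b[:n]) < 0.05 for n in sizes):
            hits += 1
    return hits / trials

print("fixed sample size:", false_positive_rate(peek=False))
print("test until significant:", false_positive_rate(peek=True))
```

Under these assumptions, the fixed-sample rate should hover near the nominal 5 percent, while the test-until-significant rate comes out noticeably higher, even though no individual test breaks any reporting rule.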

The authors of that provocative paper were Joseph P. Simmons and Uri Simonsohn of the University of Pennsylvania, and Leif D. Nelson of the University of California at Berkeley. "Many of us," they wrote—"and this includes the three authors of this article"—end up "yielding to the pressure to do whatever is justifiable to compile a set of studies that we can publish. This is driven not by a willingness to deceive but by the self-serving interpretation of ambiguity. ... "

In a forthcoming paper, also to appear in Psychological Science, Leslie K. John, an assistant professor at Harvard Business School, and two co-authors report that about a third of the 2,000 academic psychologists they surveyed admit to questionable research practices. Those don't include outright fraud, but rather such practices as stopping the collection of data when a desired result is found, or omitting from the final paper some of the variables tested.

And Mr. Wagenmakers himself was an author of a paper this year, "Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi." It appeared in the Journal of Personality and Social Psychology, inspired by that journal's publication of a much-discussed, and much-ridiculed, paper on "psi," or psychic phenomena, like "precognition," or perceiving an event before it occurs.

The Cornell University psychologist Daryl Bem had reported evidence that people could predict the future at a better-than-chance rate under some circumstances—whether an image would appear on the left or right side of a screen, for instance. That such a hypothesis could be "proved" in labs, even though clearly no one is getting rich by deploying psi in casinos, was more than a little problematic, Mr. Wagenmakers argued. Only dubious statistics could explain such a finding, he said.

The technical complaints about current statistical testing in psychology are by now familiar to those in the field. The standard measure of "statistical significance" is the "P value," which indicates how likely a result at least as striking would be if chance alone were at work. By convention, a P value below 0.05 counts as significant, which means that about 1 in 20 tests of a nonexistent effect will still produce a fluke finding. Add the researcher's freedom to explore multiple variables without reporting the extent of the searching in the final paper, and problems multiply. Add the so-called file-drawer effect (failed attempts to establish correlations seldom get published, but the odd lucky strike will), and the problems multiply further.
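The arithmetic behind that multiplication is easy to sketch. Assume, purely for illustration, that a researcher runs several independent tests of effects that do not exist, each judged against the 0.05 threshold. The function name and the numbers of tests below are hypothetical and are not drawn from any of the papers discussed here.

```python
def chance_of_fluke(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests of a
    nonexistent effect crosses the significance threshold alpha."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests: {chance_of_fluke(k):.0%} chance of at least one fluke")
```

With a single test the chance of a fluke is 5 percent. With five tests it is about 23 percent, and with 20 tests about 64 percent. Real analyses are rarely fully independent, so these figures are a rough guide rather than a measurement of any actual study.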

The Great Headline

Mr. Wagenmakers adds an argument involving a feedback loop between researchers looking for surprising findings and news media hungry to report them.

Unlike most other critics, he's not afraid to call out specific papers that he thinks are bogus: "Through prestigious publications and extensive media coverage," he writes in a draft of a new paper, a portion of which he shared with The Chronicle, "the general public has been informed that engineers have more sons and nurses have more daughters, ... that people choose spouses, places to live, and professions because they share letters with their name (e.g., Jenny marries Jim, Phil moves to Philadelphia, and Dennis becomes a dentist), ... that people make better decisions when their bladder is full, ... that ovulation makes it easier for women to distinguish heterosexual from homosexual men, ... and that brief exposure to an image of the American flag can push people toward the Republican end of the U.S. political spectrum, even when the flag image was presented eight months earlier."

He can't swear all those studies are wrong. "But even using common sense, a lot of these hypotheses are unlikely, a priori, and you should collect a lot more evidence in order for them to be accepted."

Needless to say, the authors of the studies he alludes to demur. "I am insulted," writes Mirjam A. Tuk, author of "Inhibitory Spillover: Increased Urination Urgency Facilitates Impulse Control in Unrelated Domains," in an e-mail. The paper was published this year in Psychological Science. The idea that self-control in one area might contribute to self-control in a different arena is one rooted in neurological theory, explains Ms. Tuk, of the University of Twente, in the Netherlands. "Conducting serious, theoretically sound research is my primary aim, and by no means one I would ever trade off [for] press attention."

Travis Carter, a postdoctoral fellow at the Center for Decision Research at the University of Chicago's Booth School of Business, co-wrote the article on how exposure to the American flag affects voting behavior, which also appeared in Psychological Science this year. He says his team has done several studies that confirm the effects of flag exposure on political views, some of which may yet be published elsewhere, and adds, "We don't have a big file drawer full of failed studies."

Yet, interestingly, he does not reject Mr. Wagenmakers's broader argument: "I absolutely agree that people strive for the kind of studies that get media attention." Those studies are problematic, he says, in part because they often don't grow out of a broader theory, but rather amount to little more than, "Here's a quick little effect that we can show." Studies like that "are more likely to be flukes," he says.

"I want to publish very high-quality work," he says, "but there's certainly a push to get more stuff out there. The temptations to cut corners are certainly there."

Eliot R. Smith, new editor of the Journal of Personality and Social Psychology, says the talk about psychologists pursuing "sexy" findings is way overblown. "Go through five issues of mainstream psychological journals," says Mr. Smith, a social psychologist at Indiana University at Bloomington. "You'll see maybe five articles out of 50 that are big counterintuitive findings that your grandmother would be interested in."

For most of the others, no one outside the relevant subfield would even understand the point of the experiment, let alone say "wow" at the result. He also doesn't see why someone interested in cutting corners would be any more likely to do so on a colorful topic than a "dull" one, of interest only to specialists. A publication is a publication, after all.

Robert V. Kail, editor of Psychological Science, says he's never heard of the likelihood of press attention being used as a reason to publish a researcher's work. Rather, he says, he asks his reviewers: "If you are a psychologist in a specialty area, is this the kind of result that is so stimulating or controversial or thought-provoking that you'd want to run down the hall and tell your colleagues in another subfield, 'This is what people in my field are doing, and it's really cool.'?

"To me that's not 'sexy.' It's the most interesting science that we're doing," says Mr. Kail. And it might have to do with reaction times or perception, not anything you'd read about in The Wall Street Journal or The New York Times. Moreover, the eye-catching studies may well be rooted in sound psychological theory, which Mr. Wagenmakers fails to mention in his drive-by attacks on specific papers, Mr. Smith says.

Research Reform

Since the extent of Mr. Stapel's misdeeds is not yet clear, it's too early to say what, if any, steps might be put in place to prevent future occurrences.

Still, reforms are in the works. Mr. Wagenmakers advocates an alternative to P-value testing, called Bayesian statistics, which incorporates such information as prior expectations that a hypothesis is true. (It's complex, but the bar for accepting something like psi would be higher, for starters.) That approach has some supporters, but it's not universally accepted, and it would require retraining both graduate students and the professors who teach them.
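Why a prior expectation raises the bar can be shown with the textbook form of Bayes' rule. This is a deliberately simplified sketch, not the analysis Mr. Wagenmakers uses in his paper; the function name, the prior probabilities, and the 80-percent detection rate are all assumptions chosen for illustration.

```python
def chance_finding_is_real(prior, power=0.8, alpha=0.05):
    """Bayes' rule: probability a hypothesis is true given a significant result.

    prior -- assumed probability the hypothesis is true before the study
    power -- assumed chance the study detects an effect that really exists
    alpha -- chance a study of a nonexistent effect is significant anyway
    """
    true_hits = prior * power
    false_hits = (1 - prior) * alpha
    return true_hits / (true_hits + false_hits)

print(f"plausible hypothesis (prior 0.5):  {chance_finding_is_real(0.5):.1%}")
print(f"psi-like hypothesis (prior 0.001): {chance_finding_is_real(0.001):.1%}")
```

Under these assumed numbers, a significant result for a hypothesis given even odds beforehand is real about 94 percent of the time, while the same result for a hypothesis given a 1-in-1,000 chance is real less than 2 percent of the time. The same evidence supports very different conclusions depending on how plausible the claim was to begin with.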

Mr. Simmons, Mr. Nelson, and Mr. Simonsohn, of the "When I'm 64" paper, recently met with the new editor of Psychological Science, Eric Eich, of the University of British Columbia, to push for some of the reforms they advocated for in their paper—namely, fuller descriptions of research protocols, and more tolerance of imperfections in initial papers. When the data are supposed to support a thesis perfectly, the incentives to cut corners increase.

Mr. Smith, the Journal of Personality and Social Psychology editor, describes such reforms as a natural part of any science. "There are problems with the way the field of psychology approaches statistical analysis," he says, "but my impression is there is not a clear consensus that the whole field is doing it wrong and we should change."

And it should be said that other fields are convulsed with similar internal criticisms. For example, John Ioannidis, an epidemiologist at Stanford Medical School, has suggested that most medical studies are statistically flawed.

Mr. Wagenmakers says reform needs to happen more quickly. "The field is slowly being polluted by these errors," he says of the false positives. And social psychology is in danger of becoming risible. The article on urination and self-control, published in the flagship journal of the Association for Psychological Science, won an Ig Nobel Prize this year, a tongue-in-cheek recognition given by the magazine Annals of Improbable Research for achievements "that first make people laugh, and then make them think." But the prizes tend to be bestowed on trivial-seeming work.

"If the work in key psychology journals starts to get these Ig Nobel prizes," he says, "it's something we have to worry about."