Psychology is in a bit of trouble these days. It has made headlines for questionable interpretations of statistics and for well-known studies that can’t be replicated. Other quantitative social and behavioral sciences are in similar trouble, but much of the attention has focused on psychological research, probably because the discipline is so vast.
Making things worse, a genuine heavyweight in statistical analysis — the Columbia professor Andrew Gelman — recently declared in a blog post that the situation is even more dire than we thought. We were warned of this “new” problem all the way back in 1967, he wrote, by a prominent psychologist named Paul Meehl, who noted that the method researchers in the field use to determine whether a result is significant — null-hypothesis testing — is inherently flawed: As measurement procedures improve, it gets easier to declare results to be significant.
Think about that. Better tools usually create the expectation of better performance: If a carpenter gets a better saw, you expect straighter, more accurate cuts. Under null-hypothesis testing, though, improved measurement makes it easier for ever-weaker effects to count as significant evidence. Meehl contrasted this with the situation in physics, where researchers set out in advance to test a specific result that they expect their data to reflect (rather than being satisfied with any effect that is “not zero”). Thus it gets more difficult for them to confirm the theory as measurements get better.
Gelman quotes Meehl’s nightmare scenario of the researcher who strings null-hypothesis tests across multiple studies:
a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network…. [He is] a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.
Having rediscovered Meehl’s damning critique of null-hypothesis testing, Gelman then wonders aloud how it could be that “even though Meehl was saying this over and over again, we weren’t listening.”
I am not here to refute either Meehl or Gelman. Indeed, in the main, I agree with both of them. What I would like to do, though, is expand on Gelman’s implied question — why didn’t we listen to Meehl? — and add some context about how things arrived at this rather unfortunate juncture.
First, a little more explanation of the null hypothesis and how it relates to other flaws in the typical way psychology researchers have handled statistical analysis over the past several decades. Roughly, our research often comes down to finding differences between groups of participants on some variable of interest (Are boys better at math than girls?) or correlations among two or more variables (Does suicide risk rise as poverty increases?). The null hypothesis is the assumption we then make that there are no such differences or correlations in the populations from which our data were drawn, and we compute the probability that we would have found data with differences or correlations as large as we actually did if the null hypothesis were true. If that probability (called a “p-value”) is less than some specified value, typically .05, then we reject the null hypothesis and conclude that the differences or correlations we found are probably real. This is known as a statistically “significant” result. If the p-value is greater than .05, then we fail to reject the null hypothesis, for lack of convincing evidence otherwise.
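To make the mechanics concrete, here is a minimal sketch in Python (not from any study discussed here; the scores are simulated and the group labels hypothetical) of the kind of two-group comparison just described, with the p-value coming from an ordinary t-test:

```python
# Illustrative sketch: a two-group comparison of the sort described above,
# run on simulated "math scores" with no true difference between the groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
boys = rng.normal(loc=100, scale=15, size=50)    # simulated scores
girls = rng.normal(loc=100, scale=15, size=50)   # drawn from the same population

# Null hypothesis: the two populations have the same mean.
t_stat, p_value = stats.ttest_ind(boys, girls)

# Conventional decision rule: reject the null if p < .05 and call the
# difference statistically "significant"; otherwise, fail to reject.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant: {p_value < 0.05}")

# Because there is no true difference here, roughly 95 percent of such
# simulations will (correctly) fail to reject the null.
```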
Research that fails to reject the null hypothesis rarely gets published in scientific journals. Likewise, research that merely replicates rejections that were found in earlier research rarely gets published. As a consequence, there is a high professional premium on testing questions that have not been previously published, and on rejecting the null hypothesis.
The widespread adoption of these customs has unintentionally encouraged a whole series of inadvisable research practices, which have led to trouble. For instance, because most studies that do not find significant effects go unpublished, we don’t know whether the findings of the published research are real or just the mistakes that our chosen p-value allows to slip through (the “file-drawer problem”). This problem might be attenuated if we published efforts to replicate previous findings, but that kind of research is rarely accepted by journals and, thus, until recently, was rarely even conducted (even though we have long been taught that replicability is the very soul of science). Another problem is that some researchers, eager to make their research highly “efficient,” repeatedly test their data as it comes in, stopping the study the moment they reach the magic p-value of .05. Or they do hundreds of tests on dozens of variables and cherry-pick those few that come up significant. These practices, called “p-hacking,” undermine the logic of null-hypothesis testing, raising the number of false positives that get published.
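A small simulation makes the damage from optional stopping concrete. This is an illustrative sketch, not a reanalysis of any real study: both groups are drawn from the same population, so every rejection it produces is a false positive, yet checking the p-value repeatedly and stopping at the first p < .05 pushes the error rate far above the nominal 5 percent.

```python
# Illustrative sketch: "peeking" at accumulating data and stopping at the
# first p < .05 inflates the false-positive rate. There is no true effect,
# so every rejection counted below is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, start_n, max_n, step = 2000, 10, 100, 5
false_positives = 0

for _ in range(n_studies):
    a = rng.normal(size=max_n)            # group A: no true effect
    b = rng.normal(size=max_n)            # group B: same population as A
    for n in range(start_n, max_n + 1, step):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1          # stop the "study" and declare success
            break

print(f"False-positive rate with optional stopping: {false_positives / n_studies:.2f}")
# Typically several times the nominal .05 level.
```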
Until recently the focus on p-values was so intense that researchers ignored other important aspects of their data. For instance, it is sometimes easy to get a technically significant effect that is so small as to be of little psychological import. Alternatively, it is possible to miss important effects because the “power” of the statistical test is not adequate. Thus measures of “effect size” and “power” are at least as important as p-values. After decades of criticism of psychology’s reliance on null-hypothesis testing, statistical practices have slowly, grindingly begun to change over the past 10 years or so. That is good, but it hasn’t happened quickly enough or widely enough to dispel psychology’s trouble. And now Gelman has called out the discipline for failing to listen to Meehl’s warning nearly 50 years ago.
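The effect-size point can be illustrated the same way. In this hypothetical sketch the true difference between groups is three-hundredths of a standard deviation, trivial on any psychological scale, yet with a large enough sample it will almost always clear the p < .05 bar:

```python
# Illustrative sketch: with a very large sample, a trivially small true
# difference (0.03 standard deviations) is almost always "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50_000                                     # very large sample per group
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.03, scale=1.0, size=n)    # true effect: 0.03 SD

t_stat, p_value = stats.ttest_ind(a, b)

# Cohen's d: the mean difference in pooled-standard-deviation units,
# one common measure of effect size.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd

print(f"p = {p_value:.4g}  (almost certainly below .05)")
print(f"Cohen's d = {d:.3f}  (far too small to matter psychologically)")

# The flip side is power: with a small sample, even a sizable real effect
# would often be missed entirely.
```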
I’m not sure that no one was listening. When I was a psychology graduate student, in the 1980s, nearly everyone knew Meehl’s work. (Statisticians like Gelman might not have known of him then.) Meehl is probably less well known among today’s students, but he is remembered by professors of a certain age. The article that Gelman quotes may not have been read by many psychologists, because it was published in Philosophy of Science, but a better-known article that made the same points was published in a psychology journal in 1978.
So, if psychologists listened, why didn’t they act? That is a complicated story. It is partly a question of the kind of statistical education that psychologists typically get (or, rather, that is typically demanded of them). There are often two years of statistics courses in graduate school. By and large, statistics instructors try not to teach “cookbook” courses (ones that comprise merely a series of isolated statistical “recipes,” without the relevant theoretical background), but the cookbook is what students who are just getting by often acquire nevertheless. Statistics examinations may contain some theoretical questions, but mostly they are about how to solve problems. So if it is possible to pass the exam without getting the few theory questions right, some students will do just that. And, of course, there’s the question of how much of the theoretical background will stick once they are out analyzing their own data.
An even larger part of the story has to do with the way in which success in psychology has been structured for the past half-century or so. Even if you know and understand everything Meehl had to say, what are you going to do about it? To graduate, to get a job, to get tenure (in research-oriented colleges) you need to publish original research. The standards of the journals are set up such that getting a significant effect (in a topic that is considered worthy) is what is required to get published. Why don’t journal editors insist on better statistical analysis in the research they publish? Partly because they were not trained as statisticians and often suffer themselves from the very educational shortcomings noted above. Their focus is on the content area, not so much on the analysis (provided the analysis follows what they understand to be conventional practice).
The same goes for most journal reviewers, grant reviewers, doctoral-committee members, and so on. To be fair, statistical standards in psychology have become more stringent over the past 10 or 15 years in response to criticism of null-hypothesis testing since the 1960s. Now many journals insist on effect sizes, power estimates (how probable it is that one will reject the null hypothesis when it is false), confidence intervals (a range within which the true effect plausibly lies), etc. — but these still don’t directly address the issue raised by Meehl.
So we come to the core of the problem, which has to do with the general level of theoretical sophistication in most of psychology and the other social sciences. To do what Meehl demanded — to make psychology more like physics — psychologists would have to generate predictions that don’t simply say: “If I adjust variable X thus, variable Y will go up.” They would have to predict specific quantities: “If I adjust variable X thus, variable Y will go up by exactly five points.” That would be a fantastic advance for psychology but, in its current state of development, there are few theories designed to generate such precise predictions. In physics, there are mathematical formulas — for example, F=ma — that enable us to simply compute what F should be for a given m and a. If F turns out not to be what physicists predicted, then they revise the formula until it comes out consistently right.
Almost no area of psychology has formulas of that type, because we simply don’t have the necessary level of knowledge. (Indeed, some psychologists argue that physics is the wrong model to follow, that psychological phenomena do not have the deterministic character required for formulas of that sort to operate successfully.) Imagine that I give you an energy drink that I predict will raise your IQ temporarily. How many points of IQ exactly? I don’t know. I don’t have a mathematical formula mapping drug doses onto IQ points. I’m just going to see, based on some theoretical considerations, whether it has any effect. If it does — hooray! I reject the null hypothesis and publish. If it doesn’t, I don’t publish, and I start altering the composition of the drug until I get an effect.
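Meehl’s contrast can be made concrete with one more sketch. Everything here is hypothetical: suppose a theory says the drink raises IQ by exactly five points, the true gain is actually 4.5 points, and the toy data are constructed (not sampled) so that only the measurement noise differs between the two runs.

```python
# Illustrative sketch: rejecting a null of "no effect at all" versus testing
# a precise prediction ("exactly five IQ points"), under crude and precise
# measurement. The data are constructed so the sample mean is exactly 4.5.
import numpy as np
from scipy import stats

spread = np.linspace(-1.5, 1.5, 100)      # fixed, zero-mean "error" pattern

for noise in (10.0, 1.0):                 # crude vs. precise measurement (IQ points)
    gains = 4.5 + noise * spread          # 100 observed IQ gains

    p_vs_zero = stats.ttest_1samp(gains, popmean=0.0).pvalue  # any effect at all?
    p_vs_five = stats.ttest_1samp(gains, popmean=5.0).pvalue  # exactly five points?

    print(f"noise SD ~{noise:>4}: p vs. 'no effect' = {p_vs_zero:.2g}, "
          f"p vs. 'exactly five' = {p_vs_five:.2g}")

# With crude measurement both tests look fine: the zero null is rejected and
# the five-point prediction survives. With precise measurement, rejecting the
# zero null is trivial, but the precise prediction is refuted. Better
# measurement makes the point prediction a riskier, more informative test.
```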
Perhaps what should happen is that I use null-hypothesis tests until I get a significant effect and, once I do, I get right to work building a formula connecting drug dose to effect size. But that is not how (most of) psychology works at present. It would require more mathematical know-how than most psychologists have. And it would also curtail the “productivity” of researchers who have built careers on churning out research based on significance tests at high speed.
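Here is what that next step might look like, in the same invented setting. The doses, scores, and linear form below are all made up for illustration; the point is only that a fitted formula yields predictions specific enough to be refuted.

```python
# Illustrative sketch: once an effect is established, model it quantitatively.
# Hypothetical observations: dose (in mg) and measured change in IQ points.
import numpy as np
from scipy import stats

dose      = np.array([0, 50, 100, 150, 200, 250], dtype=float)
iq_change = np.array([0.1, 1.2, 2.3, 2.9, 4.2, 5.1])

fit = stats.linregress(dose, iq_change)
print(f"IQ change ≈ {fit.intercept:.2f} + {fit.slope:.3f} × dose")

# A formula like this makes point predictions that future studies can refute,
# which is the riskier, more informative kind of test Meehl had in mind.
print(f"Predicted change at 120 mg: {fit.intercept + fit.slope * 120:.1f} IQ points")
```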
These problems can be fixed, in principle, but it would take a titanic shift in the professional and intellectual expectations of the entire discipline to change things that much.
Christopher D. Green is a professor of psychology at York University, in Toronto, and a former editor of the Journal of the History of the Behavioral Sciences. He has taught statistics for more than 20 years.