How Do Randomized Experiments Contribute to Educational Research?

Interpreting the research findings reported in the Chronicle and other journalistic sources that keep us up-to-date on the latest happenings in and knowledge about education is a challenge for well-trained researchers. It is virtually impossible for anyone else. Does the headline accurately represent the study’s findings? Is this really a study or just a summary of existing data? Is this disinterested research and analysis, or advocacy disguised as research to make it seem objective?

These questions are difficult enough without expecting readers to fully understand the differences among research methodologies. But we think greater insight into some of the common approaches to increasing our knowledge of the causes of and solutions to problems in education is both possible and vital for those concerned with education policy. In this post, we shed some light on randomized controlled trials (RCT’s)—frequently labeled the “gold standard” in research about causes and effects.

Long the preferred approach to testing the effectiveness of new drugs, RCT’s have recently found their way into the social sciences generally and into education particularly. For certain types of questions, and in some circumstances, RCT’s are clearly a very powerful method of analysis. But the technique also has limitations, and it is important to recognize them and to approach the promise of RCT’s to deliver definitive answers with appropriate humility.

The basic idea of RCT’s is that the effectiveness of a potential intervention can be tested by “treating” one group of people while using another group as the “control.” Individuals (or schools or other units of study) must be randomly assigned to the treatment and control groups from a pool of eligible and willing subjects. The average values of the pre-specified outcomes are then compared for the two groups. There are lots of pitfalls along the way, and deviation from the proper procedures can invalidate the results. But the idea is that if a group of people who took a certain drug has much better health outcomes than an identical group who did not take the drug, we can conclude that the drug caused the difference. Or if a group of students who took an online class learned more (or less) than an identical group who studied the same material in a traditional classroom, the mode of learning caused the difference.
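For readers who like to see the logic laid out, here is a minimal sketch in Python of the procedure just described: random assignment from a pool of subjects, followed by a comparison of average outcomes. All of the specifics (the pool size, the outcome model, the built-in effect of 5 points) are invented for illustration and are not drawn from any actual study.

```python
import random
import statistics

# A hypothetical illustration, not data from any real study: simulate a
# pool of eligible, willing subjects, assign each at random to treatment
# or control, then compare the groups' average outcomes.

random.seed(42)  # fixed seed so the sketch is reproducible

POOL_SIZE = 200
pool = list(range(POOL_SIZE))
random.shuffle(pool)                  # the randomization step
treatment = pool[:POOL_SIZE // 2]
control = pool[POOL_SIZE // 2:]

def outcome(treated: bool) -> float:
    """Invented outcome model: a noisy baseline plus a +5 effect if treated."""
    baseline = random.gauss(50, 10)
    return baseline + (5.0 if treated else 0.0)

treated_scores = [outcome(True) for _ in treatment]
control_scores = [outcome(False) for _ in control]

# Because assignment is random, the difference in average outcomes
# estimates the causal effect of the treatment.
effect = statistics.mean(treated_scores) - statistics.mean(control_scores)
print(f"Estimated treatment effect: {effect:.2f}")
```

The only point here is the structure: because assignment is random, the two groups are alike on average in every other respect, so the difference in mean outcomes can be read causally.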

One obvious issue is that comparing teaching methods or counseling approaches or other education strategies is harder than comparing a pill to no pill. It’s not easy to ensure that every person involved in delivering a “treatment” is doing it exactly the same way.

But even if the study is perfectly designed, interpreting the results is not as easy as saying “it works” or “it doesn’t work.” If it works with low-income students in the rural South, does that mean it will work with middle-income students in the urban North? If information made a difference when delivered in a classroom in the morning, does that mean it will work when delivered in the auditorium after school? (Historically, drug companies often tested drugs only on men and assumed the results applied to women.)

Of course, it’s rarely possible for the randomized controlled methodology to be flawless, and if we couldn’t make reasonable assumptions about the effectiveness of an intervention in a somewhat different form with a somewhat different population, these expensive experiments would not be of much value. We always have to ask practical questions: Is the population on which the study was done enough like the population we want to apply it to that the results are of use? Is the treatment in the field trial sufficiently similar to the one we plan to use that the results have a good chance of being applicable? Are the departures from the theoretical ideal that arise in the use of the RCT more or less serious than those that would arise from using other kinds of evidence in this case?

A problem is that terms such as “enough like,” “sufficiently similar,” and “good chance” don’t have a place in the theory of randomized design. As far as the theory is concerned, either you followed all the rules or you didn’t—the theory doesn’t give you any guidance in judging whether you were “close enough.” All of these matters of “more or less” fall into the realm of practical judgment and common sense, which ironically is the world that the most enthusiastic proponents of RCT’s would like to transcend.

RCT’s can make a significant contribution to our knowledge. In some cases they are clearly the best approach. But calling the methodology the “gold standard” and relegating all other social-science research to the second or third tier is ill-advised. We could spend a fortune and wait years to reach any conclusions, only to realize that our evidence is of limited applicability. The kind of real-world RCT’s that can be conducted in the field need to take their place alongside all the other imperfect methods we have developed for making (inevitably risky) causal inferences, ranging from so-called “quasi-experiments” to studies of “naturally occurring data,” case studies, and so on. Which studies are most trustworthy in which case is a matter of judgment, not something guaranteed in advance by “the scientific method.”

Thinking more creatively about RCT’s can expand the usefulness of this approach. RCT’s are generally designed just to tell us “what works.” But they can also lead to more general understanding of individual and social behaviors. An excellent example of this more theoretically ambitious use of RCT’s in education is the experiment Bridget Terry Long, Eric Bettinger, and Phil Oreopoulos undertook to see whether the challenge of applying for federal aid was keeping some people out of college. They worked with H&R Block to arrange that, at random, some customers who came in to do their taxes would also get a free offer of help in filling out the federal financial-aid form. It turned out that those who got the help were substantially more likely to go to college than those who didn’t.

The researchers didn’t think they were testing a “treatment” that would then be applied in other places—H&R Block certainly wasn’t planning to go out and offer people free help in filling out the FAFSA. Instead, they were testing an idea—the idea that a lack of financial information and guidance is a substantial obstacle to college attendance. Most of us believed in that idea before the experiment, but the results provided strong evidence. The lesson was not that we should use this same treatment everywhere but that we should think of practically useful ways to deploy the learning that came out of the experiment.

The distinction between “what works” and “theory testing” uses of RCT’s is not hard and fast, but it is quite important. The “what works” outlook suggests that the user of the RCT should try to duplicate the particular treatment or intervention that was tested as faithfully as possible in other settings. The “theory testing” outlook suggests instead isolating the driving idea behind the treatment and looking for practicable ways to take advantage of that idea in other settings.

The “what works” framework invites the question of how to “scale up” the intervention by applying it in more settings. The “theory testing” framework invites the question of how to “bake in” the idea behind the intervention in more settings, with more attention to appropriate modifications.

RCT’s have made important contributions to the field of education, and we expect they will make many more. But discounting the value of other approaches is a mistake. Some enthusiasts for RCT’s go so far as to claim that RCT’s are the only way to learn about causation. But this just can’t be. There are whole sciences that simply don’t have the opportunity to create “treatment” and “control” groups. For example, the claim that many earthquakes are caused by the motion of tectonic plates looks to be a perfectly plausible and very likely true causal claim with nary an RCT in sight. And Newton was onto something when he claimed that it is the mutual attraction of gravity that keeps the moon from flying away from the Earth.

Historians of science tell us that there is no one crisply formulated “scientific method” with universal applicability. Perhaps the closest we can get to a scientific method that applies to the full range of systematic studies of the natural and social world is through Teddy Roosevelt’s motto: “Do what you can, with what you have, where you are.”
