The implications of the study recently released with the book Academically Adrift: Limited Learning on College Campuses (University of Chicago Press), by Richard Arum and Josipa Roksa, have been portrayed in The Chronicle and other news media in apocalyptic terms: “extremely devastating in what it says about American higher education today,” an educational expert said in The Chronicle; “A Lack of Rigor Leaves Students ‘Adrift’ in College,” NPR’s Morning Edition headlined its report. Political pundits who have an ax to grind with American higher education are having a field day. The study “regrettably confirms what the American Council of Trustees and Alumni has been saying for some time: Many students aren’t learning very much at all in their first two years of college,” Anne D. Neal, the group’s president, blogged in The Washington Post.
The principal finding giving rise to such opinions is the claim that 45 percent of the more than 2,000 students tested in the study failed to show significant gains in reasoning and writing skills during their freshman and sophomore years. It is not my custom to offer technical critiques of other researchers’ work in a public forum, but the fact that this 45-percent conclusion is well on its way to becoming part of the folklore about American higher education prompts me to write. While there are many other aspects of this study that merit close scrutiny—especially the claim that the Collegiate Learning Assessment test used is an adequate measure of what students are supposed to learn during the first two years of college—I would like to comment specifically on the 45 percent. At the outset, I have to apologize to nonstatistically inclined readers that a good deal of my critique refers to statistical concepts. But Adrift is at root a statistical study, and there is simply no way to evaluate its claims without examining the underlying statistical procedures used to justify them.
What the authors—a professor of sociology at New York University and an assistant professor of sociology at the University of Virginia—did was to compare each student’s first-semester test results from the fall of 2005 to their fourth-semester results on the same test in the spring of 2007. The goal was to see how much they had improved, and to use certain statistical procedures to make a judgment about whether the degree of improvement shown by each student was “statistically significant.”
Even before considering the procedures, I should note the authors’ baffling failure to report certain other basic data that any reasonable reader would like to know. For example, nowhere do they indicate how many students (or what percentage) showed any degree of improvement. Nor do they report each student’s actual scores on the test, but only whether significant improvement was made. Given that people familiar with the Collegiate Learning Assessment might like to know how much improvement individual students showed, it is difficult to understand why the authors provide no information on the distribution of actual scores, which would indicate what proportion of students showed various amounts of improvement. Indeed, they also fail to report how many students’ scores declined (or by how much), something that would certainly be of interest to educators. With such a large sample, there must surely have been at least some students whose scores got worse.
Those issues aside, the method used to determine whether a student’s sophomore score was “significantly” better than his or her freshman score is ill suited to the researchers’ conclusion. The authors compared the difference between the two scores—how much improvement a student showed—with something called the “standard error of the difference” between his or her two scores. If the improvement was at least roughly twice as large as the standard error (specifically, at least 1.96 times larger, which corresponds to the “.05 level of confidence”), they concluded that the student “improved.” By that standard, 55 percent of the students showed “significant” improvement—which led, erroneously, to the assertion that 45 percent of the students showed no improvement.
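To make the decision rule concrete, here is a minimal sketch in Python of the kind of test described above. The function name, the score values, and the standard error are all invented for illustration; the book does not publish the per-student figures that would be needed to reproduce its actual calculations.

```python
# A hypothetical illustration of the rule described above: a student's gain
# counts as "significant" only if it is at least 1.96 times the standard
# error of the difference between the two scores (the .05 level of confidence).

def significant_gain(freshman_score, sophomore_score, se_difference, critical_value=1.96):
    """Return True if the observed gain exceeds critical_value * se_difference."""
    gain = sophomore_score - freshman_score
    return gain >= critical_value * se_difference

# Invented numbers, purely for illustration:
print(significant_gain(1080, 1140, se_difference=50))  # gain of 60 < 1.96 * 50 -> False
print(significant_gain(1080, 1200, se_difference=50))  # gain of 120 > 1.96 * 50 -> True
```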
The first thing to realize is that, for the purposes of determining how many students failed to learn, the yardstick of “significance” used here—the .05 level of confidence—is utterly arbitrary. Such tests are supposed to control for what statisticians call “Type I errors,” the type you commit when you conclude that there is a real difference, when in fact there is not. But they cannot be used to prove that a student’s score did not improve.
That point is important, and it deserves a bit more explanation. Keep in mind that a test score has no intrinsic value. We instead use it as an indicator of something else we value. In the case of a student’s CLA score, the authors of Adrift have used it as an indicator of the student’s reasoning and writing skills. In drawing a conclusion about an observed improvement in any student’s score on that test, they could have committed either of two kinds of errors: If they concluded that the student’s reasoning and writing skills improved, when in fact they did not, they would have committed a “Type I” error. But if they concluded that the student’s reasoning and writing skills failed to improve when those skills actually did improve, they committed a Type II error. The authors’ conclusion that “45 percent of the students failed to improve,” and the resulting sweeping claims being made in the media, are both subject to Type II errors.
The basic problem is that the authors used procedures that have been designed to control for Type I errors in order to reach a conclusion that is subject to Type II errors. In plainer English: Just because the amount of improvement in a student’s CLA score is not large enough to be declared “statistically significant” does not prove that the student failed to improve his or her reasoning and writing skills. As a matter of fact, the more stringently we try to control Type I errors, the more Type II errors we commit.
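A small simulated example, not drawn from the study’s data, makes the point vivid. Suppose every student in a cohort genuinely improves, but only by an amount equal to one standard error of the difference; under the .05 rule, most of those real gains would still be declared “not significant.”

```python
import random

random.seed(0)

# Purely illustrative simulation (not the study's data): every simulated
# student truly improves by 1.0 standard error of the difference, but the
# observed gain also contains measurement noise with that same standard error.
n_students = 10_000
true_gain_in_se_units = 1.0

observed_gains = [true_gain_in_se_units + random.gauss(0, 1) for _ in range(n_students)]

# The .05 rule: count a gain as "significant" only if it is at least 1.96 SEs.
flagged = sum(g >= 1.96 for g in observed_gains)
print(f"Declared 'significant': {flagged / n_students:.0%}")
print(f"Declared 'no significant gain' despite real learning: {1 - flagged / n_students:.0%}")
# Roughly 83 percent of students who truly improved are declared "not
# significant" -- every one of them a Type II error.
```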
To show how arbitrary the “.05 level” standard is when used to prove how many students failed to learn, we only have to realize that the authors could have created a far more sensational report if they had instead employed the .01 level, which would have raised the 45 percent figure substantially, perhaps to 60 or 70 percent! On the other hand, if they had used a less-stringent level, say, the .10 level of confidence, the “nonsignificant” percent would have dropped way down, from 45 to perhaps as low as 20 percent. Such a figure would not have been very good grist for the mill of higher-education critics.
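The sensitivity to the chosen level is easy to demonstrate with the same kind of toy simulation. The exact percentages above are necessarily speculative, and the numbers produced below are illustrative only; all that matters is the direction of the effect.

```python
import random

random.seed(0)

# Same illustrative setup as before: every simulated student truly gains 1.0
# standard error of the difference, plus unit-variance measurement noise.
n_students = 10_000
observed_gains = [1.0 + random.gauss(0, 1) for _ in range(n_students)]

# Critical values corresponding to different two-sided confidence levels.
critical_values = {".01 level": 2.576, ".05 level": 1.96, ".10 level": 1.645}

for label, cutoff in critical_values.items():
    not_significant = sum(g < cutoff for g in observed_gains) / n_students
    print(f"{label}: {not_significant:.0%} of students declared 'no significant gain'")
# The share of students labeled as failing to improve rises or falls sharply
# with the arbitrary choice of cutoff, even though the underlying "learning"
# in this simulation is identical for everyone.
```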
Another potentially serious problem is that the Collegiate Learning Assessment test that was used is, like any other test, subject to a certain amount of what we call “measurement error.” The negative effect of that kind of error is compounded when your measure of learning consists of a difference score (i.e., between the first and the second testing). Since both scores contain error, the difference score is even more error-prone.
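The way error compounds in a difference score can be written down directly. In classical test theory the standard error of measurement of a single score is the score’s standard deviation times the square root of one minus the reliability, and if the errors of the two testings are independent they add in quadrature. The sketch below uses invented numbers, since the book reports no individual-level reliability for the CLA, to show how quickly this inflates the yardstick against which each student’s gain is judged.

```python
import math

# Classical test theory, with invented illustrative numbers (assumptions,
# not figures reported in the book):
score_sd = 150.0      # hypothetical standard deviation of CLA scores
reliability = 0.80    # hypothetical individual-level reliability

# Standard error of measurement of a single score: SD * sqrt(1 - reliability)
se_single = score_sd * math.sqrt(1 - reliability)

# If the errors of the two testings are independent, they add in quadrature,
# so the standard error of the freshman-to-sophomore difference is larger:
se_difference = math.sqrt(se_single**2 + se_single**2)

print(f"SE of one score:      {se_single:.1f}")
print(f"SE of the difference: {se_difference:.1f}")
print(f"Gain needed for '.05 significance': {1.96 * se_difference:.1f} points")
# The less reliable the test, the larger this hurdle becomes, and the more
# genuine gains get labeled "not significant."
```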
Here’s the dilemma: The more error inherent in the test, the less likely you are to conclude that any given student’s improvement is “significant.” But how much error is present in the CLA? The standard means for evaluating the amount of error a test contains is to calculate its “reliability”—the greater the degree of error in the scores, the lower the reliability. But the authors provide no information on the CLA’s reliability at the individual student level, so there is no way to estimate how much bias test error introduces into their determination of the “significance” of the change in each student’s score. Taken together, these problems make it clear that the percentage of score changes that the authors have determined to be “insignificant” has been inflated to some unknown degree by test unreliability.
I even looked up several technical articles referred to in the authors’ Methodological Appendix, but was still unable to find any information on the reliability of individual students’ scores. In fact, one of those technical reports flatly states, without explanation, that “student-level reliability coefficients are not computed for this study.” Why? A close inspection of how the CLA correlates with other measures of skill in writing and critical thinking provides a possible answer. Such correlations suggest that the reliability of the CLA may be low, especially in comparison with other tests that purport to measure similar qualities. In fact, in discussing the issue of test reliability, the authors themselves admit, “The precision of the individual-level measurement of CLA performance thus is not ideal.” For some unexplained reason, that concern does not deter them from using even less “precise” changes in students’ individual-level scores as a basis for what the reading public now probably takes to be the authors’ major conclusion.
Another telling admission is buried in a footnote toward the end of the book: “A test such as the CLA ... may face challenges of reliability, raising the possibility that some of the students showing no gain may actually be learning” (the italics are mine). The authors then attempt to soften the implications of that statement by suggesting—incorrectly—that unreliability works both ways: that “Type II” errors are “balanced out” by “Type I” errors. They speculate that “some of the students reporting gains may not actually be learning much.” The fact is that Type I errors are precisely controlled for by setting a confidence level (.05, .01, etc.), while lack of reliability in the measuring instrument (the Collegiate Learning Assessment) invariably increases the number of Type II errors. In fact, with a completely unreliable test, 95 percent of the students would fail to show “significant” improvements, no matter how much learning actually took place.
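That last scenario can be checked with one more toy simulation. If the test were completely unreliable, each student’s observed “gain” would be nothing but noise, and under the .05 rule only a small handful of students would clear the 1.96-standard-error bar by chance, no matter how much real learning occurred. The setup below is illustrative, not a reanalysis of the study’s data.

```python
import random

random.seed(0)

# Illustrative only: a completely unreliable test. Observed "gains" are pure
# measurement noise, regardless of how much each student actually learned.
n_students = 10_000
observed_gains = [random.gauss(0, 1) for _ in range(n_students)]  # noise, in SE units

flagged_as_improved = sum(g >= 1.96 for g in observed_gains) / n_students
print(f"Show a 'significant' gain: {flagged_as_improved:.1%}")
print(f"Fail to show a gain:       {1 - flagged_as_improved:.1%}")
# Only a few percent clear the bar by chance, so roughly 95 percent or more of
# the students would be declared non-improvers -- whatever they actually learned.
```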
In short, these considerations suggest that the claim that 45 percent of America’s college undergraduates fail to improve their reasoning and writing skills during their first two years of college cannot be taken seriously. With a different kind of analysis, it may indeed be appropriate to conclude that many undergraduates are not benefiting as much as they should from their college experience. But the 45-percent claim is simply not justified by the data and analyses set forth in this particular report.
Alexander W. Astin is a professor emeritus of higher education and organizational change at the University of California at Los Angeles. Among his books is Assessment for Excellence: The Philosophy and Practice of Assessment and Evaluation in Higher Education (American Council on Education, 1990).