I was a member of the National Research Council’s committee to evaluate American Ph.D. programs, which released its long-anticipated assessment last fall. In the 1990s, I also served on a committee studying the same subject. After more than seven years of trying to improve the quality of the data in the recent study, I was alone in refusing to sign off on it. Rather than write a dissent, which would have further delayed an already late report, I resigned from the committee before the assessment’s publication. The report was not worthy of publication, nor, I believe, did it live up to the standards that the National Academy of Sciences, which sponsored it, should set.
With the evaluations continuing to generate interest and controversy—indeed, the committee has just announced that it is releasing new data tables because of four “substantive errors”—it is time for me to say why I believe that the NRC study failed. More important, why was it not allowed to be seen as having failed? These questions raise a generic set of issues: How is politically relevant and multidisciplinary research derailed by social pressures? Why are unreliable results declared “better than none”?
The production and misrepresentation of misleading or inaccurate results in such cases are not the outcome of fraud, finagling, or other forms of scientific misconduct. They are not the products of willful deception by researchers intent on violating the fundamental norms of scientific research. They are often the result of a false definition of the issues, leading to a series of poor decisions early in the research process. The NRC study is a case in point. My aim is not to lay blame but to avoid similar problems in future research studies.
The committee and the research council thought the study had to be published for several reasons. Too much money had been spent (well over $4-million, obtained from major foundations, the government, and the universities themselves) to publish nothing. Although the funds pale in comparison with those spent on clinical drug trials at universities—also often too big to fail—the sum was sufficiently large and the donors sufficiently important that the academy needed to publish the results in some form, regardless of their quality.
The project had also involved too much time and effort to admit failure, despite the fact that the data were rapidly becoming obsolete. To be fair, some committee members believed that the data and analysis were worth publication. But at some point, exhaustion set in and interest ran out: new data were collected; universities were asked to rework the data in light of errors surfacing in their submissions; the results kept changing (sometimes significantly); and it was no longer clear which version of the data the committee was examining. Eventually people wanted to get on with their regular, time-consuming jobs, and they accepted negotiated results that were not worth publishing.
Further, there were huge expectations: This would be the definitive study of the quality of research programs, finally putting to shame evaluations like those of U.S. News & World Report. That made it virtually impossible for the committee or the academy to say: “Unfortunately, we could not obtain meaningful results with our data set.” Indeed, there was increasing frustration expressed by university administrators, to say nothing of the news media, at the delays in reporting and the promises that publication was just around the corner.
My assertion that the NRC study was not permitted to fail is predicated on the claim that the study was, in fact, a failure. In an effort to meet criticisms (some justified and others misguided) of the prior study, which was issued in 1995, the research-council committee spent a substantial amount of time early on debating a few important topics. In each case, I believe, it reached the incorrect conclusion, based on faulty assumptions, poor analysis, political pressure from the academy, and unexamined preconceptions. Each factor increased the probability that its study would end in failure.
The decisions reflected oversensitivity to political concerns within the committee. A minority claimed that the rankings in prior efforts had been captured by administrators from elite universities who wanted to maintain the status quo. Wanting to avoid the awkward position of defending elite—although not necessarily elitist—institutions, the majority became overly egalitarian. There would be no acknowledgment that some research programs were far better than others, or that the best were concentrated at a relatively small number of universities.
Political orthodoxy ran through the discussion of the report’s results. Consider only one example: the attempt to correlate the racial diversity of faculty with the quality of programs. When it turned out that there was a slight negative correlation between the percentage of minorities and a program’s assessed quality, an effort was made to play down the finding, essentially burying rather than analyzing it.
Indeed, one of the study’s primary failures, one that has plagued previous NRC rankings as well, was that the focus on data collection, correction of errors, and classifications, as well as decisions about what fields to include, left too little money toward the end of the study for adequate analysis of the findings. Reports like these become perfunctory. The usual response is, “We’ll finance a follow-up volume of analysis.” Those volumes rarely appear—letting stand weak reports and unwarranted conclusions.
Unfortunately, the lack of intellectual courage in the academic world is often manifested in an unwillingness to acknowledge that an experiment or study has not worked. There are many examples in the scientific (and nonscientific) literature of large studies thought too big to fail, despite insufficient data. One can find such failures, which lead to policy changes based on poor data, in medical experiments about drugs, public-health studies, and a variety of assessments of health risks: for example, conclusions about the impact of diet on cholesterol or the harm of secondhand smoke. A crowd psychology emerges within committees where individuals don’t want to be perceived as spoilers. Discussion of why a study has failed is truncated—to the detriment of learning from our mistakes. An attitude has developed in American society, and it plagues research efforts as well. The mantra is, “It’s better than nothing.” But is it?
Perhaps the most significant misguided decision in the recent study came when the committee voted against including any measures of reputational standing or perceived quality of programs. That decision was not based on substantive evidence. Program reputations were too subjective, some major academic groups argued. Eliminating reputational measures would avoid enhancing the standing of the older private and public universities that had had high reputational standing for decades. Conversely, it would give greater visibility to undervalued newer and less well-known programs. True, although prior NRC evaluations had listed various factors assessing quality, the public and academic leaders had tended to focus on reputation alone.
The majority of committee members, however, would not grant that reputations are real—and have real consequences. Students and faculty members take reputations seriously when making choices about graduate education or jobs. Reputations are not constructed out of whole cloth; they have been earned, for the most part over decades. They are, in a sense, a composite of the other variables in the study, like publication and citation information, as well as an indicator of some less-tangible variables, like the general intellectual tone of a university.
Many committee members juxtaposed so-called subjective measures with supposedly more-precise objective measures, like citation counts. As one of a small set of social scientists who published (back in the 1960s) about citations as a measure of impact, above and beyond their value as a bibliographic tool, I know many citations (excluding self-citations) are based, in unpredictable ways, on the reputations of authors. Citation counts have great value—especially if they are used to compare groups (like the average number of citations of academy members compared with the rank and file in academe) rather than individuals. They ought to be included in the study of program quality. But citations are no more objective than reputational assessments. The dichotomy is false. Many authors will cite papers or techniques without even being familiar with them or their authors, for fear of being labeled by peer reviewers as insufficiently familiar with the field. Moreover, total counts or average citations per publication can be misleading if used for interdisciplinary programs or for comparing different subfields, where citation rates can vary greatly among subgroups and disciplines. By excluding indicators of reputation, the committee eliminated an easily and accurately measured variable that is actually important in academe.
Another problem that doomed the NRC rankings from the beginning: the choice to use random faculty assessments of what is important in program quality. That would have been appropriate had the sampled faculty been reasonably informed about Ph.D. programs at other universities. However, the faculty chosen to evaluate quality were not trained in making those judgments, and few had any experience in doing so. Prior studies had relied on universities to provide the names of people on their campuses who were particularly knowledgeable about Ph.D. programs throughout the United States. This time, all faculty members, regardless of rank or knowledge about doctoral programs, were treated essentially as equally informed. The temptation to use extremely large samples of evaluators, simply because new technology made them easy to reach, stripped the study of expertise. There was no effort to assess how much faculty members actually knew about the quality of programs.
To create a single ranking, moreover, the committee needed to determine what weight should be given to each of 20 variables evaluated by the sampled faculty. At one level, that was solved, since the faculty members said which variables they thought were the most important. And there were no surprises: Faculty productivity, citations, research grants, and honorific recognition were among the variables identified. But in trying to create a single set of rankings, the committee needed to have a dependent variable to use in regression analysis to establish weights attached to independent variables. And there the committee made a mistake in judgment.
Having rejected open assessments of reputation or perceived quality, it chose to sneak reputations in through the back door. Instead of measuring reputations for all programs based on the responses of a reasonable, qualified sample of raters, the committee chose to obtain ratings on a subsample of programs from a small group of faculty members drawn randomly from a huge pool. For example, of the roughly 120 programs in English and literature, 50 had their reputations rated by a relatively small sample of about 46 faculty members (drawn from thousands). Raters were given a standard scale to assess reputations (the same one used in prior studies), and they did so. The final scores, however, came from a regression analysis: the sampled ratings were regressed on the independent variables, and the resulting regression weights were used to develop the ratings. Many programs, indeed a majority in most fields, were never assessed in terms of reputation at all. Their reputational standing was determined by plugging the inferred weights of the variables into the regression equation for each program to produce a reputational estimate.
In fact, the committee went so far as to give even surveyed programs the reputation scores coming from the algorithm determined by its analysis rather than from what raters actually said about the reputations of the programs. So, for example, a sociology program in the survey that actually received a mean rating of 4.52 out of a maximum of 5 could have been given a rating of 3.95 because of the way the algorithm interpreted the weights associated with the independent variables for that program. In such cases, the algorithm trumped the actual ratings by faculty. Tellingly, the committee never revealed what programs had been actually assessed, nor what proportion of total eligible raters were sampled, nor the demographics of the raters.
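For readers who want to see the mechanics, here is a minimal sketch, in Python with made-up numbers, of the kind of two-step procedure described above: fit a regression on the rated subsample, then let the fitted equation supply “reputation” scores for every program, including those that were actually rated. The data, variables, and model here are hypothetical; this is not the committee’s actual code, data, or weights.

```python
import numpy as np

# Hypothetical sketch of the procedure described above: only a subsample
# of programs receives actual reputation ratings, a regression is fit on
# that subsample, and the fitted equation then supplies "reputation"
# scores for every program -- including the ones that were rated.

rng = np.random.default_rng(0)

n_programs, n_vars = 120, 20                 # e.g., ~120 English programs, 20 variables
X = rng.normal(size=(n_programs, n_vars))    # standardized program variables (made up)

rated = rng.choice(n_programs, size=50, replace=False)   # the rated subsample
# Stand-in for the mean of raters' 1-to-5 reputation scores for each rated program.
mean_ratings = 3.0 + 0.4 * X[rated, :3].sum(axis=1) + rng.normal(scale=0.3, size=50)

# Fit ordinary least squares on the rated subsample to infer the weights.
X_rated = np.column_stack([np.ones(len(rated)), X[rated]])
weights, *_ = np.linalg.lstsq(X_rated, mean_ratings, rcond=None)

# Plug the inferred weights into the equation for *all* programs.
X_all = np.column_stack([np.ones(n_programs), X])
estimated_reputation = X_all @ weights

# Even a program that was actually surveyed gets the model's estimate,
# not its raters' mean score -- the algorithm trumps the survey.
print(f"actual mean rating:   {mean_ratings[0]:.2f}")
print(f"algorithmic estimate: {estimated_reputation[rated[0]]:.2f}")
```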
The distinguished University of Chicago statistics professor Stephen M. Stigler, who participated in the 1995 research-council study and has written on the problems of ranking efforts, pointed out in a Chronicle blog posting that the current report “gives ranks (as ranges), but it does not give the index values being ranked, which previous studies had included. Had the index values been reported, the readers would have seen the trivial—minute in many cases—differences between most of them, an indication that these indexes (simple weighted averages of standardized variables) do not discriminate among the programs in a useful way, that they tell us little of what we may wish to know.”
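His point about undifferentiated index values is easy to reproduce with toy numbers. The short sketch below builds the kind of index he describes (a simple weighted average of standardized variables) from entirely hypothetical data and measures how little separates adjacently ranked programs.

```python
import numpy as np

# Hypothetical illustration of Stigler's point: index values built as
# simple weighted averages of standardized variables can sit so close
# together that the resulting ranks carry little information.

rng = np.random.default_rng(1)

n_programs, n_vars = 120, 20
raw = rng.normal(size=(n_programs, n_vars))               # made-up program data
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

weights = rng.dirichlet(np.ones(n_vars))                  # weights summing to 1
index = standardized @ weights                            # the composite index

ranked = np.sort(index)[::-1]
gaps = ranked[:-1] - ranked[1:]
print(f"median gap between adjacently ranked programs: {np.median(gaps):.4f}")
print(f"spread of the whole index: {ranked[0] - ranked[-1]:.2f}")
# Most adjacent gaps are a tiny fraction of the overall spread, so small
# perturbations in the data or the weights can reshuffle many ranks.
```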
Many other problems of data collection also arose that were not dealt with in a precise way. For example, values given to honorific awards were based upon seat-of-the-pants ideas about their importance rather than on any empirical basis. There were few attempts to understand the great distance between, say, a local university award and election to the American Philosophical Society. The committee failed, as the American Sociological Association recently pointed out, to count books or citations of books in sociology, despite the fact that book publication remains a central part of the academic careers of scholars in the field. The same is true for disciplines like anthropology and political science.
Moreover, two distinguished groups within a field can exhibit very different publication (and thus citation) practices. A program with more mature scholars may tend to publish more books, book chapters, and general essays than articles in high-impact journals (which they did when they were younger and had to establish their bona fides). Of course, data were not available for citations for humanities programs, thus limiting the value of that variable for many Ph.D. programs. And finally, there was an arbitrariness to the responses that the committee made—to shouts of “foul play” by various disciplines. When computer scientists pointed out that conference papers are often as important as journal articles, the committee said it would recalculate its productivity measure for computer-science programs. As far as I know, it is not doing the same for other disciplines in which it omitted important data.
The 1995 study had allowed universities to identify faculty by program, while not restricting the inclusion of some faculty in multiple programs (many faculty members participate in multiple Ph.D. programs and are very active in them). This time around, the committee was afraid that universities would place their heavy hitters in multiple programs, so it attempted to divide each faculty member’s efforts among departments, with the total time adding up to 100 percent. The problem is that dividing research time or productivity into fractions that total 100 percent misunderstands the role that some faculty members play in the university community. Not all participate equally. Some spend 70 hours a week working in multiple programs, training undergraduates as well as doctoral students and postdoctoral fellows, and do more in each program to further its objectives than many faculty members who work in only one program.
The methodology adopted failed to capture the more-than-trivial number of cases in which some members of, for example, an English department supervised 17 doctoral dissertations in their home department and 10 more in another interdisciplinary program; or the cases of other department members who supervised only one or two doctoral students. By creating fractions for citation and publication counts, the NRC may have stopped universities from “gaming” the system, but it truncated reality.
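A purely hypothetical sketch makes the distortion concrete. Here supervised dissertations stand in for how a member’s effort might be split across programs; that proxy is my assumption, not the committee’s actual allocation rule.

```python
# Purely hypothetical sketch of how a fractional-allocation rule flattens
# real differences in effort. Two faculty members: one supervises 27
# dissertations across two programs, the other supervises 2 in one program.

faculty = {
    # name: {program: dissertations supervised}  (invented numbers)
    "prof_a": {"english": 17, "interdisciplinary": 10},
    "prof_b": {"english": 2},
}

def fractional_credit(person):
    """Split the person's credit so their allocations sum to 1.0,
    regardless of how much total work they actually do."""
    total = sum(person.values())
    return {prog: n / total for prog, n in person.items()}

for name, load in faculty.items():
    print(name, fractional_credit(load))

# prof_a's 17 English dissertations become a 0.63 share of one person;
# prof_b's 2 dissertations become a 1.0 share. Once publications and
# citations are multiplied by such fractions, the heavier contributor
# can end up counting for less in each program than the lighter one.
```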
The committee also produced a rather narrow definition of which faculty members were eligible to be studied: only those who were on dissertation committees or who taught graduate students. Further, it tried to distinguish between “core” members and those who were only “associated” with a program. But universities use different criteria in determining who is actually participating in a Ph.D. program. Since faculty size turned out once again to be an important variable in calculating rankings, this unevenly applied variable led not only to confusion but also to errors in the data and subsequent rankings.
The committee disregarded important differences among programs that were classified as being in the same field. For example, suppose we have two world-class statistics programs: one that concentrates its work on theoretical issues related to statistics and one that is essentially made up of biostatisticians. The theoretical statisticians might publish almost exclusively in journals with limited circulation and, therefore, with limited potential for citations, compared with the program that emphasizes biostatistics and speaks to a vast audience. The results of a citation analysis of the two distinguished departments would be misleading.
Then there was the committee’s inability, or unwillingness, to deal adequately with emerging fields, despite their increasing academic importance. Many fields, like East Asian and Middle Eastern studies, were excluded because they did not exceed an arbitrary threshold for the number of Ph.D.’s produced over the past five years. New interdisciplinary fields were passed over because the committee felt that there was insufficient uniformity in their content. Some universities were allowed to submit data on multiple programs for the same basic field (Harvard University, for example, obtained assessments for three economics programs that existed in three of its schools; because each was strong, Harvard could dominate the top rankings).
Any experiment or study designed by a committee whose members are chosen not for their particular expertise but for their disciplinary balance is more likely to fail than one that selects members solely for their expertise in the subject under study. This committee, which was made up of committed and distinguished individuals, had relatively few with significant experience with social-science data and the statistical methods used to create the rankings. That increased the probability of failure.
Indeed, the committee recently announced that it had wrongly excluded some faculty honors and awards; that it had made an error in tabulating citations for papers published in 2002; and that some of its figures for job-placement rates were inaccurate and misleading. Now, as I write, we await the April 21 announcement of the four “substantive errors” that require the NRC to issue new data tables for four variables: average citations per publication, awards per faculty member, percent of programs with academic plans, and percent of first-year students with full financial aid. “These changes typically did not have a large effect on the range of rankings for individual programs,” the NRC says, adding that some recalculations will be made. But only some of the errors will be dealt with. And while it is fine for the NRC to acknowledge errors after it published its report, most people will never see the revised results.
Thus we come to this: Universities collected large quantities of information about their own programs, and the committee gathered substantial amounts of data, like citation and publication data, from publicly available sources. A significant amount of data from independent sources, for example on the proportion of women in the academic community and trends in their treatment over time, was poorer than data already produced in NRC studies. Some of the data were reliable and useful, but a good deal, unfortunately, contained significant errors, making it difficult to say just how valuable the data would be to individual institutions. Data on time to degree, for example, which varied little from one institution to another, were poorly measured and suspect. Similarly, data on the support of graduate students had been either incorrectly interpreted by institutions or not properly classified to show variations in levels and kinds of support.
The large difference in results from the two methods used to calculate rankings (for the statistically minded, the so-called R and S ranges) can only make readers of the report wonder what could have caused such variations. The methodological description of the study was too technical for most readers to understand. Despite herculean efforts by the staff and several committee members both to improve the quality of the data and to perform some very preliminary analyses of them, there were scores of anomalous results that could not be accounted for and lacked basic face validity.
For these reasons, among others, I consider the study a failure. It is difficult to invest time, energy, and resources in something and then conclude that the results should not be published. The great basketball player Michael Jordan said of the role of failure in his work: “I’ve missed more than 9,000 shots in my career. I’ve lost almost 300 games. Twenty-six times I’ve been trusted to take the game-winning shot and missed. I’ve failed over and over and over again in my life. And that is why I succeed.” Whether in sports or science, experiments and studies fail. We can learn from our mistakes. We missed that opportunity with the NRC study.