Scholars Take Aim at Student Evaluations’ ‘Air of Objectivity’

By Dan Berrett

September 18, 2014

Student course evaluations are often misused statistically and shed little light on the quality of teaching, two scholars at the University of California at Berkeley argue in the draft of a new paper.

“We’re confusing consumer satisfaction with product value,” Philip B. Stark, a professor of statistics at Berkeley, said in an interview.

“An Evaluation of Course Evaluations,” which he wrote with Richard Freishtat, senior consultant at Berkeley’s Center for Teaching and Learning, lays out a mathematical critique of the evaluations and describes an alternative vision for analyzing and improving teaching.

Even though evaluations have become ubiquitous in academe, they remain controversial because they often assume a high-stakes role in determining tenure and promotion. But they persist because they are easy to produce, administer, and tabulate, Mr. Stark said. And because they are based on Likert scales whose results can be added and averaged, he said, they offer the comfort of a number. But it is a false kind of security. “Averages of numerical student ratings have an air of objectivity,” the authors write, “simply because they are numerical.”

Some of what Mr. Stark and Mr. Freishtat write repeats critiques by other researchers: that evaluations often reflect snap judgments or biases about an instructor’s gender, ethnicity, or attractiveness; and that they fail to adequately capture teaching quality. While economists, education researchers, psychologists, and sociologists have weighed in on the use and misuse of these tools, it is relatively unusual for a statistician to do so.

Mr. Stark and Mr. Freishtat find fault with the mathematics underlying the evaluations. Response rates, for example, often vary widely and can bias the results.

The authors are also troubled by the common practice of averaging and comparing scores. Such a practice presumes that a five on a seven-point scale means the same thing to different students, or that a three and a seven somehow balance out to mean the same thing as two fives.

“For teaching evaluations, there is no reason any of those things should be true,” they write. “Such averages and comparisons make no sense, as a matter of statistics.”
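The arithmetic behind that objection can be shown in a few lines. The sketch below is illustrative only: the ratings are hypothetical, not data from the paper, and the variable names are invented for the example. It shows two sets of responses that share an average yet describe different classroom experiences.

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings on a seven-point scale (not taken from the paper).
split_ratings = [3, 7]   # one lukewarm student, one enthusiastic student
even_ratings = [5, 5]    # two students who gave the same score

# Averaging collapses both sets of responses into the same number.
mean_split = mean(split_ratings)
mean_even = mean(even_ratings)

# Counting how often each score occurs keeps the difference visible.
counts_split = Counter(split_ratings)
counts_even = Counter(even_ratings)

print("averages:", mean_split, mean_even)
print("counts:", dict(counts_split), dict(counts_even))
```

The averages match, but the counts do not, which is the information an average alone discards.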

What Students Can Judge

Student course evaluations have their defenders, who argue that students’ experience in the classroom can offer useful information.

Mr. Stark doesn’t dispute that. Instead of averaging the scores, he suggests reporting their distribution and students’ response rates. A clustering of scores, in which a professor is commonly rated either a two or a seven, for example, might indicate that he or she is polarizing or perhaps good with particular kinds of students.
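A report along those lines might look something like the following sketch. It assumes a seven-point scale and a known enrollment figure; the function name `summarize_ratings`, its parameters, and the sample ratings are all hypothetical, chosen for illustration rather than drawn from the paper.

```python
from collections import Counter

def summarize_ratings(ratings, enrolled, scale=range(1, 8)):
    """Report how often each score occurs and what share of the class responded.

    Hypothetical helper: assumes integer ratings on the given scale and
    that `enrolled` is the number of students who could have responded.
    """
    counts = Counter(ratings)
    # Include every point on the scale so empty scores show up as zero.
    distribution = {score: counts.get(score, 0) for score in scale}
    response_rate = len(ratings) / enrolled if enrolled else 0.0
    return {"distribution": distribution, "response_rate": response_rate}

# Made-up polarized class: eight of 20 students responded, split between 2 and 7.
ratings = [2, 2, 2, 7, 7, 7, 2, 7]
summary = summarize_ratings(ratings, enrolled=20)

print(summary["distribution"])
print(f"response rate: {summary['response_rate']:.0%}")
```

In this made-up class the average would be 4.5, a score no student actually gave, while the distribution shows the split and the response rate shows that most of the class did not weigh in.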

The authors also criticize evaluation questions that are too broad or ask students to cast judgments for which they are not equipped, such as whether the instructor was effective or the course was valuable.

Instead, Mr. Stark prefers to ask students about things on which they’re experts: Did you enjoy the class? Did you leave it more enthusiastic or less enthusiastic about the subject matter? Could you hear the instructor during lectures? Was the instructor’s handwriting legible?

“It’s totally valuable to ask them about their experience,” he said, “but it’s not synonymous with good teaching.”

Mindy S. Marks, an associate professor of economics at the University of California at Riverside, agrees that evaluations can often reflect bias in the minds of the students or fail to adequately capture the full range of students’ opinions. But she believes that the comments are often valuable and that the quantitative data can reflect how much students learn.

In a 2010 paper, she and her co-authors found a small but statistically significant relationship between students’ ratings of their instructors in a remedial mathematics course and how much their scores improved between a pretest and the final examination.

The evaluation questions might not be perfect, she said, as students tend to see them as asking a broadly similar question.

“They read all the questions as ‘Did I like the professor?’” Ms. Marks said. And the resulting rating, she added, “does have a statistically significant relationship to learning.”

Looking at the Classroom

To Mr. Stark, the evaluations as they are now used can paint only a limited picture. In the second part of his paper with Mr. Freishtat, he advocates a system of judging faculty members’ teaching that plays down the averaged scores on student evaluations.

Instead, the system adheres to a set of recommendations that are laid out in many policy handbooks but are seldom truly followed at large research universities, he said. It mirrors the system used by Berkeley’s statistics department, where Mr. Stark is chairman.

Candidates for tenure and promotion produce a teaching portfolio, syllabi, notes, websites, assignments, exams, videos, and statements on mentoring, along with students’ comments on course evaluations and their distribution.

Faculty members also visit one another’s classes and write reports.

“If we want to understand what’s going on in the classroom, we actually have to look at it,” he said. “You can’t subcontract the evaluation of teaching to students.”