In his research, Philip B. Stark pinpointed something that he believed professors already suspected to be true: that student evaluations of their teaching are biased.
Philip B. Stark, associate dean of the division of mathematical and physical sciences at the U. of California at Berkeley (Courtesy of Philip Stark)
Mr. Stark and several other researchers recently examined student evaluations in online courses and found that implicit gender bias had seeped into end-of-semester evaluations. The students routinely rated professors higher when they thought they were male, even though the classroom experiences were standardized and students and professors never interacted in person.
The scores also did not correlate with how much students actually learned, as measured by the final examination.
“Whatever it is the students are responding to, it’s certainly not what they’re learning,” said Mr. Stark, who is associate dean of the division of mathematical and physical sciences at the University of California at Berkeley.
Mr. Stark’s research built on existing studies that suggest a professor’s race, age, accent, and even physical attractiveness could alter evaluation scores.
When he was chair of the statistics department, Mr. Stark analyzed those studies and eventually published a paper concluding that student-evaluation surveys were a poor measure of effective teaching. He was also aware of Berkeley’s reliance on survey feedback during the faculty-review process.
Every semester students ranked their professors’ teaching effectiveness on a scale of one to seven. Department and university committees used an average of that score — and sometimes little else — to inform their decisions. (At Berkeley, professors undergo assessments every two to three years at the start of their careers, then less frequently as they progress.)
As chair, Mr. Stark revamped the process. He had professors submit portfolios of materials they had created for their classes, including syllabi, exams, and lecture notes, as well as examples of student work. He sent other professors into classrooms to observe their peers before major reviews and write up assessments that those being evaluated could read and respond to. Student evaluations were not eliminated, and their input was still valued, said Mr. Stark. He just aimed to widen the lens through which to view a professor’s teaching.
Deandra Little, director of the Center for the Advancement of Teaching and Learning at Elon University, said many colleges are bolstering their assessment processes with measures beyond student-evaluation scores. Mr. Stark's system stands out, she said, because few departments recommend peer evaluations that frequently.
Now, armed with statistical evidence of bias in student evaluations, Mr. Stark wants to graft a similar approach onto the entire mathematical- and physical-sciences division, which encompasses five departments, for next fall. He and others in the division agree that the evaluations are flawed. But how to mitigate those flaws is still up for debate.
Out With the Old
Elizabeth Purdom, an assistant professor in the statistics department, started teaching at Berkeley in 2009. She remembers that her first evaluations were fairly negative. The class was not smooth sailing, she said.
But even as Ms. Purdom gained experience, the numbers on her evaluations stayed low. And the written portion and numerical rating often did not align, making it difficult to establish any trend. One student wrote that the course was the best stats class she had ever taken, then gave Ms. Purdom a five out of seven on the teaching-effectiveness question.
“Well, that number is not really useful,” Ms. Purdom thought at the time.
The departmental committee that reviews professors brought up those low scores even after her ratings had improved, Ms. Purdom said. The people who conducted her reviews also typically relied on her average score instead of the median, which meant one low rating could tank — or at least drag down — a large pool of high marks.
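To illustrate that point with hypothetical numbers (not drawn from Ms. Purdom's actual evaluations), a single outlying score shifts the average far more than it shifts the median:

```python
from statistics import mean, median

# Hypothetical ratings on the one-to-seven teaching-effectiveness scale:
# nine students award a seven, one awards a one.
ratings = [7, 7, 7, 7, 7, 7, 7, 7, 7, 1]

print(mean(ratings))    # 6.4 -- the lone low rating drags the average down
print(median(ratings))  # 7   -- the median is unaffected
```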
Ms. Purdom was eager to receive any feedback that might be more useful, so in 2013 she agreed to act as a guinea pig for Mr. Stark’s new evaluation system.
A professor in another department observed one of her classes and wrote up a synopsis. Ms. Purdom said that professor gave her a wealth of positive feedback and several concrete suggestions, which gave her confidence in her teaching for the first time.
“Up until that time I was sort of like, OK, maybe I’m not one of these people who is good at teaching,” Ms. Purdom said.
The written observation, along with a teaching portfolio she had constructed, went into her dossier for her midcareer review. Those materials were a stronger foundation than just her student-evaluation scores and a brief teaching statement — the documents typically used to judge a professor at that time, Ms. Purdom said.
The statistics department still uses peer evaluations, as well as teaching portfolios, in tandem with the student scores to evaluate professors for their major career reviews. L. Craig Evans, interim chair of the mathematics department, said that process would have benefited him last fall.
As chair, he reviewed multiple professors’ promotion cases with little more than “a single number and raw teaching comments from the students,” Mr. Evans said. He wished he had had a fuller perspective.
“When students evaluate how a course went, they have a view. I don’t think it’s an entire view,” Mr. Evans said.
In With the New
Though Berkeley has cautioned for several years against relying too heavily on student evaluations, the practice still happens, and the university has struggled to avoid it, said Frances Hellman, dean of the division of mathematical and physical sciences.
“All of us cling to this hope that it will be a reasonable metric,” Ms. Hellman said.
Ms. Hellman knows firsthand that student evaluations can be unreasonable, or occasionally “kind of merciless,” she said. (She remembers one student remarking that her hair made her look as if she stuck her finger in a light socket every morning.)
The Committee on Teaching for Berkeley’s Academic Senate reviewed the universitywide policy for evaluating teaching and, in 2015, published its findings. The committee concluded that “student course evaluations alone do not portray a complete picture on which to conduct an evaluation.” The group recommended requiring a teaching dossier that would include peer observation as part of a professor’s merit and promotion materials.
Juan M. Pestana, a professor in the department of civil and environmental engineering and chair of the Academic Senate’s teaching panel, said it was too early to tell if departments were heeding the panel’s suggestions. But there is an active conversation on the campus about the best ways to measure effective teaching, he said.
Ms. Hellman said she supports drafting and circulating new suggestions on how to evaluate teaching to the five departments in her division for the fall. But she said she’s not convinced that peer evaluations would be less influenced by implicit biases than student evaluations are. And she’s skeptical that asking faculty members to watch one of their peers’ lectures would do much to strengthen the observed professor’s teaching.
Mr. Stark also understands the potential shortcomings of peer evaluations, but for a different reason. Asking faculty members to sacrifice time and energy to perform additional duties is “a hard sell,” he said. But he added that such work is key to actually improving teaching, not just assessing it.
Department chairs in Ms. Hellman’s division will talk with Mr. Stark throughout the summer to hammer out the specifics of how a department might put peer-assessment and teaching-portfolio requirements into practice. What teaching criteria to examine, how often to prescribe evaluations, and which professors are qualified to do the assessing are all potential points of discussion. She foresees a process that blends all options — student, peer, and self evaluations — to paint a richer portrait of a professor. She hopes it will measure how hard professors are trying to be effective instructors.
“Effort, by and large, will lead to better teaching,” said Ms. Hellman. “Just like it leads to better everything else.”
Emma Pettit is a senior reporter at The Chronicle who covers the ways people within higher ed work and live — whether strange, funny, harmful, or hopeful. She’s also interested in political interference on campus, as well as overlooked crevices of academe, such as a scrappy puppetry program at an R1 university and a charmed football team at a Kansas community college. Follow her on Twitter at @EmmaJanePettit, or email her at emma.pettit@chronicle.com.