Measuring College-Teacher Quality

David Glenn’s Chronicle article on using course sequence grades to estimate teacher quality in higher education illustrates a crucial flaw in the way education researchers often think about the role of evidence in education practice.

The article cites a recent study of Calculus grades at the Air Force Academy. All students there are required to take Calculus I and II. They’re randomly assigned to instructors who use the same syllabus. Students all take the same final, which is collectively graded by a pool of instructors. These unusual circumstances control for many external factors that might otherwise complicate an analysis of teacher quality.

The researchers found that students taught by permanent faculty got worse grades in Calculus I than students taught by short-term faculty. But the pattern reversed when those students went on to Calculus II: those taught by permanent faculty earned better grades in the more advanced course, suggesting that short-term faculty might have been “teaching to the test” at the expense of deeper conceptual understanding. Students taught by permanent faculty were also more likely to enroll in upper-level math in their junior and senior years. In addition, the study found that student course evaluations were positively correlated with grades in Calculus I but negatively correlated with grades in Calculus II.

All of which suggests that analyzing course-sequence grades is a fruitful way to evaluate the quality of teaching in higher education. There are a lot of lower-division undergraduates out there taking a relatively small number of core courses in a predictable order. Yet a number of university-based experts quoted in the article voiced deep skepticism about the idea, essentially arguing that without pristine Air Force Academy-like conditions, one couldn’t adequately control for external factors and produce a reliable estimate of teacher effects.

Here’s the problem: There’s a huge difference between the minimum standards of accuracy necessary for information to be valid as scholarship and the minimum accuracy necessary for it to be useful for making decisions about running a college.

The former is much greater than the latter, and rightly so: There should be a high bar for findings to enter the canon of human knowledge. But if you’re trying to evaluate teacher effectiveness for the purposes of deciding who is most likely to help students learn, the information only needs to be accurate enough that the decisions you make are likely to be better than those you would have made without it. If, for example, you had to choose between hiring Teacher A and Teacher B, and you had evidence that Teacher A was much more effective, evidence that met a P < .10 standard of accuracy but not P < .05, it might not be good enough to get into a peer-reviewed journal, but you’d be an idiot to ignore it in choosing whom to hire. That’s because while evidence of teacher effects can theoretically wait forever until it’s good enough to enter the scholarly record, someone needs to be hired to teach today.
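To put rough numbers on that hiring example, here is a minimal sketch, not anything taken from the Air Force study. It assumes the estimated gap between two teachers is normally distributed around the true gap, and the function names and the example gap sizes are made up for illustration. Under those assumptions it computes how often "hire whoever looks better in the data" picks the teacher who really is better, and what two-sided p-value a given z-score corresponds to.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prob_correct_choice(true_gap: float, std_error: float) -> float:
    """Chance that the teacher with the higher estimated effect is truly better.

    Illustrative assumption: the estimated gap is normally distributed
    around the true gap with the given standard error.
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    return STANDARD_NORMAL.cdf(abs(true_gap) / std_error)


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under the standard normal."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical case: the observed gap is 1.8 standard errors.
    # That clears P < .10 but misses P < .05.
    print(f"p-value at z = 1.8: {two_sided_p(1.8):.3f}")

    # If the true gap really were that size, following the data would
    # pick the better teacher far more often than a coin flip would.
    print(f"chance of the right hire: {prob_correct_choice(1.8, 1.0):.3f}")

    # Even a much noisier comparison still beats guessing.
    print(f"with a gap of 0.5 SE: {prob_correct_choice(0.5, 1.0):.3f}")
```

The point of the sketch is only the comparison with 0.5: in this setup, any real gap, however noisily measured, makes the data-informed choice better than a coin flip, even when the evidence falls short of a journal's threshold.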

Yet college hiring and promotion standards are weirdly dichotomous when it comes to accuracy and evidence. In some respects they’re overly biased toward accuracy at the expense of relevance, as with the use of student evaluations, a presumably accurate measurement of student opinions that, per the Air Force study and others, may very well signal the opposite of teacher quality. They also use scholarly publishing and citation records, which have nothing to do with teaching but are easy to count. These are then combined with factors like “collegiality” that are so wildly subjective and non-empirical that they can’t even be talked about in the same way. Meanwhile, course-sequence grade data that’s literally just sitting there for the taking is ignored.

In other words, you’re better off using reasonably accurate information about the right thing than extremely accurate information about the wrong thing. And if you step back for a minute and think about how the day-to-day decisions driving well-functioning organizations are made, they all flow from this common-sense approach. But because universities correctly apply a very stringent standard of accuracy to their scholarship, they’re ignoring information that would be useful for their teaching and operating suboptimally as a result.
