Critics of standardized tests argue that the written portion of those assessments can short-circuit the process of developing ideas in writing. Using machines to grade those tests further magnifies their negative effects, according to a statement adopted last month by the National Council of Teachers of English.
As high-school students prepare for college, the statement reads, they “are ill served when their writing experience has been dictated by tests that ignore the evermore complex and varied types and uses of writing found in higher education.”
The statement is unlikely to quell controversy over the use of automated grading tools to assess a new wave of standardized tests of writing that are being developed for students at elementary and secondary levels.
The intent of the statement, which was passed unanimously by the council’s executive committee, is to prompt policy makers and designers of standardized tests to think more fully about the pitfalls of machine scoring, said Chris M. Anson, director of the writing-and-speaking program at North Carolina State University. Mr. Anson is also chair of the committee that drafted the statement for the council, a 35,000-member organization that seeks to improve the teaching and learning of English at all levels of the education system.
Chief among the council’s concerns, said Mr. Anson, is that machine grading tends to recognize, and therefore encourage, writing that appears superficially competent but lacks meaning or context.
Machines also cannot judge some of the most valuable aspects of good writing, the statement reads, including logic, clarity, accuracy, style, persuasiveness, humor, and irony.
To judge writing, machines analyze a text using an algorithm that predicts what a good answer looks like, largely on the basis of whether it contains certain words. The machines cannot recognize whether an argument is coherent or even true, he said. Rarely used and multisyllabic words can boost a test-taker’s score even when they are included in an essay pointlessly.
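For readers curious how such word-based scoring can reward empty prose, the toy sketch below illustrates the kind of surface-feature scoring Mr. Anson describes. The word list, weights, and function names are invented for illustration and do not represent any actual grading product.

```python
# Illustrative sketch only: a toy scorer that rewards surface features
# (length and "fancy" vocabulary) in the way critics say some grading
# algorithms do. The word list and weights are invented for this example.

RARE_WORDS = {"cadre", "defenestration", "plethora", "quintessential"}

def toy_essay_score(text: str) -> float:
    """Score an essay on surface features alone: word count and the
    presence of rare, multisyllabic words. No meaning is checked."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    length_points = min(len(words) / 50, 4.0)              # longer essays score higher, capped
    rare_points = sum(1.0 for w in words if w in RARE_WORDS)
    long_word_points = 0.1 * sum(1 for w in words if len(w) >= 10)
    return round(length_points + rare_points + long_word_points, 2)

if __name__ == "__main__":
    vacuous = ("The cadre of defenestration is a quintessential plethora "
               "of paradigmatic considerations. ") * 5
    plain = "The author argues clearly that testing narrows what students write."
    print(toy_essay_score(vacuous))  # high score despite being nonsense
    print(toy_essay_score(plain))    # low score despite being coherent
```

A scorer built this way would hand Mr. Anson’s “defenestration” essay a high mark while ignoring whether the sentences mean anything, which is precisely the failure mode the council’s statement warns about.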
“By using the word ‘cadre’ or ‘defenestration,’” he said, “the computer will think that’s good.”
Mr. Anson also worries about the larger message that machine grading sends. It tells students “that writing is so unimportant that we’re not willing to take the time to read it,” he said.
If machines value writing that has the veneer of coherence but lacks substance, he said, that bias is also likely to shape the kinds of writing exercises teachers assign. In his courses, he sometimes asks students to write an imagined dialogue between scholars, such as B.F. Skinner and Noam Chomsky, or Sigmund Freud and Karl Marx. A machine would not be able to handle such an assignment, he said, and faculty members might be dissuaded from devising creative exercises.
“It sends a message to teachers,” Mr. Anson said, “to design the most stereotypical, dull assignments that can be graded by a machine.”
‘Already Mechanical’
Machine grading is a more urgent issue in elementary and secondary education than it is for colleges. Such scoring is being considered to grade the assessments being developed for the Common Core State Standards, which have been adopted in 45 states and the District of Columbia. Those standards are intended to improve students’ readiness for the transition to college or the work force.
Still, the notion of machine grading is not foreign to higher education. The grading of the written portion of the Collegiate Learning Assessment, for example, is “almost exclusively” automated, said Jeffrey Steedle, a measurement scientist for the Council for Aid to Education, which created the CLA.
While the statement from the National Council of Teachers of English is likely to find a favorable audience among many faculty members, Mr. Anson concedes a point often made by critics of the status quo: Human evaluators do not always practice the kind of close, careful, and nuanced reading the council’s statement champions.
“Machines can reproduce human essay-grading so well because human essay-grading practices are already mechanical,” Marc Bousquet wrote in a blog post for The Chronicle last year.
Although multiple human raters may collectively catch mistakes, individuals tend to make more grading errors than machines do, said Mark D. Shermis, a professor of educational foundations and leadership at the University of Akron. He is the lead author of a study, supported by the William and Flora Hewlett Foundation, that found that machines were about as reliable as human graders in evaluating short essays written by junior-high and high-school students.
In a statement provided to The Chronicle, he dismissed the council’s announcement as “political posturing” and a “knee-jerk reaction,” and faulted its analysis.
“There is no evidence to suggest that scoring models for longer writing products are somehow ‘worse’ than for short, impromptu prompts,” he wrote.
In an interview, Mr. Shermis agreed that a computer cannot understand context or judge meaning. It can, however, conduct a coherence analysis, which renders a probability-based verdict on the accuracy of an argument. The grading software, he said, can identify a number of words and phrases that are highly likely to lead to the correct conclusion.
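As a rough illustration of what a probability-based “coherence analysis” might look like in spirit, the sketch below averages likelihoods for key phrases drawn from previously scored essays. The phrase table, numbers, and function are hypothetical and do not reflect Mr. Shermis’s research software or any commercial engine.

```python
# Hedged illustration, not any vendor's actual method: one simple way a
# "coherence analysis" could work is to look up how often key phrases
# co-occurred with correct conclusions in previously scored essays, then
# combine those probabilities. The phrase table below is invented.

PHRASE_LIKELIHOODS = {
    # phrase -> assumed probability it appears in essays that reach
    # the correct conclusion (made-up numbers for illustration)
    "supply and demand": 0.85,
    "opportunity cost": 0.80,
    "prices rise": 0.70,
    "free stuff": 0.20,
}

def coherence_estimate(text: str) -> float:
    """Average the likelihoods of the known phrases found in the text.
    This measures association with correct answers, not truth."""
    text = text.lower()
    found = [p for p in PHRASE_LIKELIHOODS if p in text]
    if not found:
        return 0.5  # no evidence either way
    return sum(PHRASE_LIKELIHOODS[p] for p in found) / len(found)

if __name__ == "__main__":
    essay = "When supply and demand shift, prices rise for consumers."
    print(round(coherence_estimate(essay), 2))  # high, because the phrases match
```

The approach rewards essays whose wording statistically resembles correct ones, which is why critics such as Mr. Perelman can still game it with fluent but factually botched prose.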
Critics of Mr. Shermis’s work, such as Les C. Perelman, of the Massachusetts Institute of Technology, have faulted his methodology and shown how a computer can be dazzled by big but meaningless words.
Someone like Mr. Perelman, who understands the algorithms, can write a bad essay full of botched facts and still get a good score, said Mr. Shermis. “The average writer doesn’t operate that way.”
An important distinction, some analysts say, is whether machines are used for high-stakes tests or for formative evaluations, which are likely to be low-stakes assignments that a student can revise and resubmit.
The council’s statement fails to recognize the distinction, according to Edward E. Brent, a professor of sociology at the University of Missouri at Columbia and president of Idea Works, which developed grading software called SAGrader.
“We need to be having a good healthy discussion of different views regarding the use of computers for assessing writing,” he wrote in an e-mail. “But it is important that the conversation be broadened to encompass the full range of such programs and their differing strengths and weaknesses.”
Ultimately, said Mr. Shermis, writing and measurement experts should work together to define what matters most in the craft of writing. The technology to measure it will continue to be developed.
“There’s nothing we can do to stop it,” he said. “It’s whether we can shape it or not.”