Creating an ideal online-learning environment takes time and technology that many faculty members find challenging, if not out of reach. When we set out to create a MOOC for the College of Pharmacy at the University of Texas at Austin, we envisioned a unique learning environment that would challenge a global audience to think deeply about pharmaceutical issues. We pictured 20,000 essay-waving students taking our massive open online course. But wait: How could we grade that many open-response questions and evaluate that many students? Could automated graders fill the gap?
We recently tested this approach and were surprised by how closely the results from automated graders matched those from human ones. While automated grading wasn’t perfect, we believe it will be useful in the future, and we plan to use our experience to improve the process and reliability of results.
Our course, “Take Your Medicine,” launched in the fall of 2013. We had begun developing it nearly a year before, just after our university joined the nonprofit online-learning platform edX. At that time, one of us (Janet) was collaborating with Donna Kidwell, an online and gaming expert who brought her skills in e-learning pedagogy to the course. Donna’s excitement over the call for MOOC proposals spread quickly to Janet, whose reaction was, “Great! But what’s a MOOC?”
Much has changed in the MOOC world since we began planning our course, which explores how research innovations are developed into therapeutic medicines, as well as how to be a savvy consumer and patient. The course was designed to attract students from a broad range of backgrounds and interests, incorporating bite-sized video clips to engage them and assessment tools to gauge their progress. While students could not earn credit, they could get a certificate stating they had passed the course, which included an evaluation module.
We knew student engagement needed to be a key component of the course. But we had one major question: How could we measure what students actually learned from the multiple online resources embedded in the course, such as videos, animations, and lectures? Like most instructors, we had a strong desire to move beyond the traditional multiple-choice format and involve students in at least one written and graded assignment. We wanted them to engage with the course content in a way that was personal, challenging, and deeper than answering standard true-false questions.
We decided to use automated-grading software, including software developed by edX, to evaluate student responses to a single open-response essay question in our course. We told students in advance that their essays would be machine graded, and we gave them an opportunity to see how the system worked. That turned out to be an important step.
After we hand-graded 100 essays according to a rubric developed by edX, the automated-grading system evaluated the remaining essays, assigning each a score out of a possible eight points. We then randomly chose 206 of those essays and regraded them by hand. The results were very close, but further investigation revealed both strengths and weaknesses in the automated-grading system.
The automated-grading system was not always easy to navigate, and it did not always compute students’ exact scores correctly according to the rubric. When we compared total scores on an assignment, the automated grader’s score tended to fall within one or two points of the instructor’s. Such a difference may be acceptable for lower-stakes learning activities, but it poses a problem for assignments that account for a larger share of a student’s overall grade.
Still, the results surprised us. We had expected enormous differences between human and machine grading. In more than 79 percent of assignments, the score assigned by the instructor and the score assigned by the automated system agreed.
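For readers who want to run a similar comparison in their own courses, here is a minimal sketch, in Python, of how paired instructor and machine scores might be tallied for exact and near agreement. The scores below are hypothetical, not our course data; each essay is scored out of eight points.

```python
# A minimal sketch (not our actual analysis) of comparing paired
# instructor and machine scores. The values below are hypothetical;
# each essay is scored out of 8 points.

instructor_scores = [8, 6, 7, 5, 8, 4, 6, 7, 3, 8]
machine_scores    = [8, 5, 7, 6, 8, 4, 7, 7, 2, 6]

pairs = list(zip(instructor_scores, machine_scores))

# Exact agreement: both graders assigned the identical score.
exact = sum(1 for human, machine in pairs if human == machine)

# Near agreement: scores differ by no more than two points.
near = sum(1 for human, machine in pairs if abs(human - machine) <= 2)

print(f"Exact agreement: {exact / len(pairs):.0%}")
print(f"Within two points: {near / len(pairs):.0%}")
```

A simple spreadsheet can do the same arithmetic; the point is to track both exact matches and the size of the typical gap before deciding how much weight the machine-graded assignment should carry.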
We were also interested in students’ perceptions of the automated system. A survey of students who took other MOOCs that incorporated an automatically graded question (and that were developed at Texas with edX) found that the students were generally neutral about trusting essay scoring, both by the automated system and by instructors. However, the students in our “Take Your Medicine” course were significantly more likely than other MOOC students to respond that the two systems were comparable. We believe discussing automated grading in advance made the difference.
So we return to our original question. Given the crush of students in MOOCs, robo-graders will continue to be in demand, built, and used. But is the technology there yet? Can machines really grade essays? And does that question need an absolute yes-or-no answer?
As we sort out the answers, here are some suggested practices that we believe can improve the experience of online, automated grading, both for instructors and students:
- Provide students with clear instructions regarding the essay and grading rubric. Include directions if source citations are required, and remind students that plagiarism will be monitored.
- Create an example essay question along with the grading rubric that will be used to evaluate responses. Allow students to participate in a practice session in which they evaluate their own response against the rubric’s feedback before submitting an essay for automated grading.
- Test and finalize the grading rubric. Even the most robust rubric should be tested by hand-grading a sample of essays on a spreadsheet; that process lets the instructor refine the rubric, find key words and phrases to add to the final version (see the keyword-matching sketch after this list), and identify potential plagiarism issues.
- Use a plagiarism-detection tool to make sure students are not merely finding answers on the Internet without citing the source or putting them in their own words. After this experience, we believe the importance of monitoring and discouraging student plagiarism when using automated grading in a MOOC cannot be overstated.
- Incorporate sensitivity to non-native English speakers. For example, check for alternate spellings of key words (“color” versus “colour”), and discourage the use of translation tools for entire sentences and paragraphs, as we found they yielded poor essay scores. Additional research on the use of translation tools would help determine their impact, both positive and negative.
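To make the keyword-and-spelling advice above concrete, here is a minimal sketch, in Python, of how a rubric’s key words and phrases might be checked against an essay while allowing for alternate spellings. The rubric concepts, variants, and sample essay are hypothetical, and this is not how the edX grader works internally.

```python
# A minimal sketch, not the edX grader, of scanning an essay for rubric
# key words while accepting alternate spellings ("color"/"colour").
# All rubric concepts, variants, and essay text below are hypothetical.

RUBRIC_KEYWORDS = {
    "clinical trial": ["clinical trial", "clinical trials"],
    "adverse effect": ["adverse effect", "side effect", "side-effect"],
    "labeling": ["labeling", "labelling"],
}

def concepts_found(essay_text):
    """Return the rubric concepts mentioned anywhere in the essay."""
    text = essay_text.lower()
    return {
        concept
        for concept, variants in RUBRIC_KEYWORDS.items()
        if any(variant in text for variant in variants)
    }

sample_essay = "The labelling must warn patients about side effects."
print(concepts_found(sample_essay))
# {'labeling', 'adverse effect'}
```

An instructor could build such a variant list while hand-grading the spreadsheet sample described above, then fold the additional spellings and phrasings into the final rubric.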
Based on our experience, we believe that automated grading can work well in assessing low-stakes assignments. Such a system might also be used as an effective self-evaluation tool, providing immediate feedback to students.
Given that automated grading is probably here to stay, the educational technologists, programmers, and software engineers who are working on such assessment tools would be wise to enlist faculty help in developing them. The continued development of automated-teaching tools is vital, as faculty members who teach large online courses seek new ways to monitor student progress, make course corrections as needed, and experiment with ways to keep online students involved and engaged.