A. A. A. A-. B. B-. Pass.
That’s a solid report card for a freshman in college, a respectable 3.57 GPA. I recently finished my freshman year at Harvard, but those grades aren’t mine — they’re GPT-4’s.
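(For transparency about the math: assuming the standard 4.0 scale, where an A is 4.0, an A- is 3.7, a B is 3.0, and a B- is 2.7, and excluding the Pass, the average works out to (4.0 + 4.0 + 4.0 + 3.7 + 3.0 + 2.7) / 6 ≈ 3.57.)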
Take-home writing assignments are the foundation of a social science and humanities education at liberal-arts colleges around the U.S. Professors use these assignments to assess students’ knowledge of the course material and their creative and analytical thinking. But the rise of advanced large language models (LLMs) like ChatGPT and now GPT-4 threatens the future of the take-home essay as an assessment tool. With these existential issues in mind, I wanted to see for myself: Could GPT-4 pass my freshman year at Harvard?
Three weeks ago, I asked seven Harvard professors and teaching assistants to grade essays written by GPT-4 in response to a prompt assigned in their class. Most of these essays were major assignments that counted for about one-quarter to one-third of students’ grades in the class. (I’ve listed the professors or preceptors for all of these classes, but some of the essays were graded by TAs.)
Here are the prompts with links to the essays, the names of instructors, and the grades each essay received:
To minimize response bias, I told these instructors that each essay might have been written either by me or by an AI; in fact, they were all written by GPT-4, the recently updated version of OpenAI’s chatbot.
To generate these essays, I input the prompts (which were much more detailed than the summaries above) word for word into GPT-4. I submitted exactly the text GPT-4 produced, except that I asked the AI to expand on a couple of its ideas and stitched its responses together to meet the word count (GPT-4 only writes about 750 words at a time). Finally, I told the professors and TAs to grade these essays normally, except to ignore citations, which I didn’t include.
Not only can GPT-4 pass a typical social science and humanities-focused freshman year at Harvard, but it can get pretty good grades. As shown in the list above, GPT-4 got all A’s and B’s and one Pass.
Several of the professors and TAs were impressed with GPT-4’s prose: “It is beautifully written!” “Well written and well articulated paper.” “Clear and vividly written.” “The writer’s voice comes through very clearly.” But this wasn’t universal; my Conflict Resolution TA criticized GPT-4’s flowery writing style: “I might urge you to simplify your writing — it feels as though you’re overdoing it with your use of adjectives and metaphors.”
Compared to their feedback on style, the professors and TAs were more modestly positive about the content of the essays. My American Presidency TA gave GPT-4’s paper an A based on his assessment that “the paper does a very good job of hitting each requirement,” while my Microeconomics TA awarded an A in part because he liked the essay’s “impressive … attention to detail.” I thought GPT-4 was particularly creative in coming up with a (coincidentally topical!) fake conflict for the Conflict Resolution class:
I’ve discovered that Neil [my roommate] has been using an advanced AI system to complete his assignments, something far more sophisticated than the plagiarism detection software can currently uncover... To me... it feels like a betrayal. Not just of the university’s code of academic honesty, but of the unspoken contract between us, of our shared sweat and tears, of the respect for the struggle that is inherent in learning. I’ve always admired his genius, but now it feels tainted, a mirage of artificially inflated success that belies the real spirit of intellectual curiosity and academic rigor.
My Conflict Resolution TA loved the essay’s analysis and gave it an A, remarking that it was “persuasive” and “made great use of the course concepts.”
But this unusual essay aside, substance (and especially argument) is where the lower-performing papers fell short. Gutiérrez gave the Spanish paper a B in part because it had “no analysis.” And Levitsky had serious issues with the Latin American Politics paper’s thesis, commenting that “the paper fails to deal with any of the arguments in support of presidentialism or coalitional presidentialism and completely fails to take economic factors into account.” He awarded GPT-4 a B-.
Harvard has a grade-inflation problem, so one way to interpret my experiment would be to say, “This actually just shows that it’s easy to get an A at Harvard.” But while that might be true, if you read the GPT-4-generated essays (which are hyperlinked above), they’re pretty good. Maybe at Princeton or UC Berkeley (which both grade more rigorously), the A’s and B’s would be B’s and C’s — but these are still passing grades. I think we can extrapolate from GPT-4’s overall solid performance that AI-generated essays can probably get passing grades in liberal-arts classes at most colleges around the country.
Before ChatGPT, the vast majority of college students I know consulted Google for help with their essays. But the pre-AI internet wasn’t all that useful for true high-level plagiarism, because you simply can’t find good answers to complex, specific, creative, or personal prompts. For example, the internet would not be very helpful in answering the Conflict Resolution prompt, which was very specific (the assignment was a page long) and personal (it required students to write about an experience from their own lives).
In the era of internet cheating, students would have to put some work into finding material online and splice it together to match the prompt, almost certainly intermixed with some of their own writing. And they’d have to create their own citations. The risk of getting caught was huge. Most students are deterred from copy-and-pasting online material for fear that plagiarism detectors or their instructors will catch them.
ChatGPT has solved these problems and made cheating on take-home essays easier than ever. It can answer any prompt specifically. It’s not always perfect, but accuracy has improved enormously with GPT-4, and will only get better as OpenAI keeps innovating. GPT-4 can generate a full answer that requires little editing or sourcing work from the student, and it’s improving at citations.
Finally, students don’t have to worry nearly as much about getting caught. AI detectors are still very flawed and have not been widely rolled out on college campuses. And while GPT-4 might sometimes copy someone else’s ideas in a way that might make a professor suspicious of plagiarism, more often it generates the type of fairly unoriginal synthesis writing that’s rewarded in non-advanced university classes. It’s worth noting that GPT-4 doesn’t write the same thing every time when given the same prompt, and over time the chatbot will almost certainly get even better at creating a writing tone that feels personal and unique; it’s possible GPT-4 might even learn each person’s writing style and adapt its responses to fit that style.
This technology has made cheating so simple — and for now, so hard to catch — that I expect many students will use it when writing essays. According to a 2020 study from the International Center for Academic Integrity, about 60 percent of college students admit to cheating in some form. Recent surveys from Intelligent.com, BestColleges, and Study.com have also found that anywhere from one-third to 89 percent of college students admitted to using ChatGPT for schoolwork. And those surveys were conducted within the first year of the model’s public launch. As it improves and develops a reputation for high-quality writing, usage will increase, and the incentives for students to use it will get stronger.
Next year, if college students are willing to use GPT-4, they should be able to get passing grades on all of their essays with almost no work. In other words, unless professors adapt, AI will eliminate D’s and F’s in the humanities and social sciences. And that’s only eight months after ChatGPT’s release to the public — the technology is rapidly improving. In March, OpenAI released GPT-4, which has a training data set 571 times the size of the original model’s. Nobody can predict the future, but if AI continues to improve at even a fraction of this breakneck pace, I wouldn’t be surprised if soon enough GPT-4 could ace every social-science and humanities class in college.
This puts us on a path to a complete commodification of the liberal-arts education. Right now, GPT-4 enables students to pass college classes — and eventually, it’ll help them excel — without learning, developing critical-thinking skills, or working hard at anything. The tool risks intellectually impoverishing the next generation of Americans. Professors need to completely upend how they teach the humanities and social sciences if they want to avoid this outcome.
My initial reaction to the rise of AI was that teachers should embrace it, much like they did with the internet 20-25 years ago. Perhaps, I thought, professors could benchmark a ChatGPT-generated response to their essay prompt as equivalent to a poor grade — say, a D. Students would have to improve on the quality of the AI’s work to get an A, B, or C. But this is unworkable in a world in which GPT-4 can already earn A’s and B’s in Harvard classes. If it hasn’t already, soon enough the chatbot will surpass the average college student’s writing abilities, so it will not be reasonable to set a D grade equal to GPT-4’s performance. If teachers compared student work to what a superior, rapidly improving AI can produce, most students would be set up for failure.
There are other suggested versions of embracing AI, like the technology analyst Ben Thompson’s idea that schools could have students generate homework answers on an in-house LLM and assess them on their ability to verify the answers the AI produces. But Ben’s proposal wouldn’t prevent cheating: How would a teacher know if a student used a different LLM to verify answers, then input these into the school’s system? And it’s not enough to teach students to verify computer-generated results; they need to learn analytical thinking and how to compose their own thoughts. This is especially true in the formative years of middle and high school, which are the focus of Ben’s piece.
If educators can’t embrace AI, they need to effectively prevent its usage. One approach is using AI detectors to prevent cheating on take-home essays. However, these detectors, in their current form, are deeply flawed. According to a preprint study from University of Maryland professor Soheil Feizi, “current detectors of AI aren’t reliable in practical scenarios ... we can use a paraphraser and the accuracy of even the best detector we have drops from 100 percent to the randomness of a coin flip.” OpenAI’s own detector was recently shut down due to low accuracy, and when The Washington Post tested an alternative detector from Turnitin, it made mistakes on most of the texts they tried.
Maybe AI detectors will soon become accurate enough to be widely implemented (like internet plagiarism detectors) and solve the issue of AI cheating. But student demand for tools to evade AI detection will most likely outpace schools’ demand for better detectors, especially because building an accurate detector seems like a harder problem than evading one. And even if detectors were accurate, students could still rephrase the AI’s words on their own.
Given the limitations of embracing AI and of AI detection, I think professors have no choice but to shift take-home essays to an in-person format — partially or entirely. The simplest solution would be to have students write during seated, proctored exams (instead of at home). Alternatively, students could write the first draft of their essay during this proctored window, submit it to their TA, and continue to edit their work at home. TAs would grade these essays based on the final submission, while reviewing the first draft to make sure that the student did not change their main points during the take-home period, possibly with the help of AI.
Unfortunately, there’s a trade-off between writing quality and cheating prevention. While students can improve their essays by editing at home, they won’t be able to truly iterate on their thesis. College should ideally encourage students to develop ideas for more than a couple of hours, as people do in the real world. This system would also impose additional burdens on TAs, who would have to cross-reference the drafts with the final copies and check for cheating — it seems inevitable that AI will force educators to spend more time worrying about cheating.
Educators at all levels — not just college professors — are trying to figure out how to prevent students from writing their essays with AI. At the middle- and high-school level, deterring AI cheating is clearly important to ensure that students develop critical-thinking skills.
However, at the college level, efforts to prevent cheating with GPT-4 are more complicated — and the stakes are, if anything, even higher. Even if colleges can successfully prevent students from using GPT-4 to write their essays, that won’t prevent AI from taking their jobs after graduation. Many social science and humanities students go on to take jobs that involve work similar to the writing they did in college. If AI can perfectly replicate the college work that people in these professions do, soon it will be able to replicate their actual jobs. In law, for example, the world still needs the most senior people around to make the hardest of calls, but AI could automate the vast majority of the legal-writing grunt work. Other white-collar fields are under similar threat: marketing, sales, customer service, business consulting, screenwriting, administrative office work, and journalism (this is already happening with Google’s new AI that can write news articles).
The impact that AI is having on liberal-arts homework is indicative of the AI threat to the career fields that liberal-arts majors tend to enter. So maybe what we should really be focused on isn’t, “How do we make liberal-arts homework better?” but rather, “What are jobs going to look like over the next 10–20 years, and how do we prepare students to succeed in that world?” The answers to those questions might suggest that students shouldn’t be majoring in the liberal arts at all.
My gut reaction is that liberal-arts majors — who spend most of their academic career writing essays — are going to face even greater difficulties in a post-AI world. AI isn’t just coming for the college essay; it’s coming for the cerebral class.
This article originally appeared, in a slightly different form, on Slow Boring.