The result was astounding. “Pretty wild achievement,” tweeted a machine-learning engineer. An account devoted to artificial-intelligence news declared it a “groundbreaking study.” The study in question found that ChatGPT, the popular AI chatbot, could complete the Massachusetts Institute of Technology’s undergraduate curriculum in mathematics, computer science, and electrical engineering with 100-percent accuracy.
It got every single question right.
The study, posted in mid-June, was a preprint, meaning that it hadn’t yet passed through peer review. Still, it boasted 15 authors, including several MIT professors. It featured color-coded graphs and tables packed with statistics. And considering the remarkable feats performed by seemingly omniscient chatbots in recent months, the suggestion that AI might be able to graduate from MIT didn’t seem altogether impossible.
Soon after it was posted, though, three MIT students took a close look at the study’s methodology and at the data the authors used to reach their conclusions. They were “surprised and disappointed” by what they found, identifying “glaring problems” that amounted to, in their opinion, allowing ChatGPT to cheat its way through MIT classes. They titled their detailed critique “No, GPT4 can’t ace MIT,” adding a face-palm emoji to further emphasize their assessment.
What at first had appeared to be a landmark study documenting the rapid progress of artificial intelligence now, in light of what these students had uncovered, seemed more like an embarrassment — and perhaps a cautionary tale, too.
One of the students, Neil Deshmukh, was skeptical when he read about the paper. Could ChatGPT really navigate the curriculum at MIT — all those midterms and finals — and do so flawlessly? Deshmukh shared a link to the paper on a group chat with other MIT students interested in machine learning. Another student, Raunak Chowdhuri, read the paper and immediately noticed red flags. He suggested that he and Deshmukh write something together about their concerns.
The two of them, along with a third student, David Koplow, started digging into the findings and texting each other about what they found. After an hour, they had doubts about the paper’s methodology. After two hours, they had doubts about the data itself.
For starters, it didn’t seem as if some of the questions could be solved given the information the authors had fed to ChatGPT. There simply wasn’t enough context to answer them. Other “questions” weren’t questions at all, but rather assignments: How could ChatGPT complete those assignments and by what criteria were they being graded? “There is either leakage of the solutions into the prompts at some stage,” the students wrote, “or the questions are not being graded correctly.”
The study used what’s known as few-shot prompting, a technique commonly used to coax large language models like ChatGPT into performing a task. It involves showing the chatbot several worked examples so that it can better understand what it’s being asked to do. In this case, the examples were so similar to the answers themselves that it was, the students wrote, “like a student who was fed the answers to a test right before taking it.”
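Here is a minimal sketch of the idea (not the study’s code; the calculus questions are invented for illustration), showing how a few-shot prompt is assembled and how a near-duplicate example can hand the model the answer:

```python
# Illustrative sketch of few-shot prompting (not the study's actual code).
# A prompt is assembled from a few worked examples plus the new question;
# the model then imitates the pattern it has just been shown.

def build_few_shot_prompt(examples, question):
    """Concatenate (question, solution) pairs, then append the new question."""
    parts = []
    for q, solution in examples:
        parts.append(f"Q: {q}\nA: {solution}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

examples = [
    ("Differentiate f(x) = x^2.", "f'(x) = 2x"),
    ("Differentiate f(x) = sin(x).", "f'(x) = cos(x)"),
]

# The students' complaint, in miniature: if an "example" is nearly identical
# to the exam question being graded, its solution leaks into the prompt and
# the model is effectively handed the answer.
print(build_few_shot_prompt(examples, "Differentiate f(x) = x^2 + 1."))
```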
They continued to work on their critique over the course of one Friday afternoon and late into the evening. They checked and double-checked what they found, worried that they’d somehow misunderstood or weren’t being fair to the paper’s authors, some of whom were fellow undergraduates, and some of whom were professors at the university where they are enrolled. “We couldn’t really imagine the 15 listed authors missing all of these problems,” Chowdhuri says.
They posted the critique and waited for a reaction. The trio was quickly overwhelmed with notifications and congratulations. The tweet with the link to their critique has more than 3,000 likes and has attracted the attention of high-profile scholars of artificial intelligence, including Yann LeCun, the chief AI scientist at Meta, who is considered one of the “godfathers” of AI.
For the authors of the paper, the attention was less welcome, and they scrambled to figure out what had gone wrong. One of those authors, Armando Solar-Lezama, a professor in the electrical-engineering and computer-science department at MIT and associate director of the university’s computer-science and artificial-intelligence laboratory, says he didn’t realize that the paper was going to be posted as a preprint. Nor, he says, did he know about the claim that ChatGPT could ace MIT’s undergraduate curriculum. He calls that idea “outrageous.”
Solar-Lezama thought the paper was meant to do something much more modest: help determine which prerequisites should be mandatory for MIT students. Sometimes students will take a class and discover that they lack the background to fully grapple with the material. Maybe an AI analysis could offer some insight. “This is something that we continually struggle with, deciding which course should be a hard prerequisite and which should just be a recommendation,” he says.
The driving force behind the paper, according to Solar-Lezama and other co-authors, was Iddo Drori, an associate professor of the practice of computer science at Boston University. Drori had an affiliation with MIT because Solar-Lezama had set him up with an unpaid position, essentially giving him a title that would allow him to “get into the building” so they could collaborate. The two usually met once a week or so. Solar-Lezama was intrigued by some of Drori’s ideas about training ChatGPT on course materials. “I just thought the premise of the paper was really cool,” he says.
Solar-Lezama says he was unaware of the sentence in the abstract that claimed ChatGPT could master MIT’s courses. “There was sloppy methodology that went into making a wild research claim,” he says. While he says he never signed off on the paper being posted, Drori insisted when they later spoke about the situation that Solar-Lezama had, in fact, signed off.
The problems went beyond methodology. Solar-Lezama says that permissions to use course materials hadn’t been obtained from MIT instructors even though, he adds, Drori assured him that they had been. That discovery was distressing. “I don’t think it’s an overstatement to say it was the most challenging week of my entire professional career,” he says.
Solar-Lezama and two other MIT professors who were co-authors on the paper put out a statement insisting that they hadn’t approved the paper’s posting and that permission to use assignments and exam questions in the study hadn’t been granted. “[W]e did not take lightly making such a public statement,” they wrote, “but we feel it is important to explain why the paper should never have been published and must be withdrawn.” Their statement placed the blame squarely on Drori.
Drori didn’t agree to an interview for this story, but he did email a 500-word statement providing a timeline of how and when he says the paper was prepared and posted online. In that statement, Drori writes that “we all took active part in preparing and editing the paper” via Zoom and Overleaf, a collaborative editing program for scientific papers. The other authors, according to Drori, “received seven emails confirming the submitted abstract, paper, and supplementary material.”
As for the data, he argues that he did not “infringe upon anyone’s rights” and that everything used in the paper is either public or is accessible to the MIT community. He does, however, regret uploading a “small random test set of question parts” to GitHub, a code-hosting platform. “In hindsight, it was probably a mistake, and I apologize for this,” he writes. The test set has since been removed.
Drori acknowledges that the “perfect score” in the paper was incorrect, and he says he set about fixing issues in a second version. In that revised paper, he writes, ChatGPT got 90 percent of the questions correct. The revised version doesn’t appear to be available online, and the original version has been withdrawn. Solar-Lezama says that Drori no longer has an affiliation at MIT.
Even without knowing the methodological details, the paper’s stunning claim should have instantly aroused suspicion, says Gary Marcus, professor emeritus of psychology and neural science at New York University. Marcus has argued for years that AI, while both genuinely promising and potentially dangerous, is less smart than many enthusiasts assume. “There’s no way these things can legitimately pass these tests because they don’t reason that well,” Marcus says. “So it’s an embarrassment not just for the people whose names were on the paper but for the whole hypey culture that just wants these systems to be smarter than they actually are.”
Marcus points to another, similar paper, written by Drori and a long list of co-authors, based on a dataset taken from MIT’s largest mathematics course. That paper, published last year in the Proceedings of the National Academy of Sciences, purports to “demonstrate that a neural network automatically solves, explains, and generates university-level problems.”
A number of claims in that paper were “misleading,” according to Ernest Davis, a professor of computer science at New York University. In a critique he published last August, Davis outlined how that study uses few-shot learning in a way that amounts to, in his view, allowing the AI to cheat. He also notes that the paper has 18 authors and that PNAS must have assigned three reviewers before the paper was accepted. “How did all these sloppy errors get past all these readers?” he wonders.
Davis was likewise unimpressed with the more recent paper. “It’s the same flavor of flaws,” he says. “They were using multiple attempts. So if they got the wrong answer the first time, it goes back and tries again.” In an actual classroom, it’s very unlikely that an MIT professor would let undergraduates taking an exam attempt the same problem several times, and then award a perfect score once they finally stumbled onto the correct solution. He calls the paper “way overblown and misrepresented and mishandled.”
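A toy simulation makes the problem concrete. The numbers below are assumptions chosen for illustration, not the paper’s data, but they show how a “best of several attempts” grading scheme inflates a score:

```python
# Toy simulation (assumed numbers, not the paper's data) of why grading
# "best of N attempts" inflates accuracy: a model that answers a question
# correctly only 30% of the time looks near-perfect after enough retries.

import random

random.seed(0)

P_CORRECT = 0.30   # assumed chance of a correct answer on any single attempt
QUESTIONS = 1000   # number of simulated exam questions

def solved_within(attempts):
    """Return True if any of `attempts` independent tries succeeds."""
    return any(random.random() < P_CORRECT for _ in range(attempts))

for max_attempts in (1, 3, 5, 10):
    score = sum(solved_within(max_attempts) for _ in range(QUESTIONS)) / QUESTIONS
    print(f"best of {max_attempts:2d} attempts: {score:.0%}")
```

Under these assumed numbers, the simulated score climbs from roughly 30 percent on a single attempt to well above 90 percent with ten, without the model getting any better at the underlying material.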
That doesn’t mean that it’s not worth trying to see how AI handles college-level math, which was seemingly Drori’s purpose. Drori writes in his statement that “work on AI for education is a worthy goal.” Another co-author on the paper, Madeleine Udell, an assistant professor of management science and engineering at Stanford University, says that while there was “some sort of sloppiness” in the preparation of the paper, she felt that the students’ critique was too harsh, particularly considering that the paper was a preprint. Drori, she says, “just wants to be a good academic and do good work.”
The three MIT students say the problems they identified were all present in the data that the authors themselves made available and that, so far at least, no explanations have been offered for how such basic mistakes were made. It’s true that the paper hadn’t passed through peer review, but it had been posted and widely shared on social media, including by Drori himself.
While there’s no doubt at this point that the withdrawn paper was flawed — Drori acknowledges as much — the question of how ChatGPT would fare at MIT remains. Does it just need a little more time and training to get up to speed? Or is the reasoning power of current chatbots far too weak to keep pace with undergraduates at a top university? “It depends on whether you’re testing for deep understanding or for sort of a superficial ability to find the right formulas and crank through them,” says Davis. “The latter would certainly not be surprising within two years, let’s say. The deep understanding may well take considerably longer.”