Every night for nearly a month last year, Reed Roberts woke up at 3 a.m. to check the newest real-time data on something not very many people think about: the melting of Arctic sea ice.
This unusual interest is part of what makes Roberts — then a Ph.D. candidate in organic chemistry at the University of Cambridge and now an analyst with the Economist Intelligence Unit — a superforecaster, or a nonexpert who can make extraordinarily accurate predictions about future political events. His existence and methodology have been a curiosity of sorts for researchers like Philip E. Tetlock, the psychologist who ran experiments proving that people like Roberts not only exist but can be trained to become even more accurate.
At first glance, it might seem surprising that Tetlock, a professor at the University of Pennsylvania, is the one to champion such ideas. He has been best known for telling the world, in his 2005 book Expert Political Judgment: How Good Is It? How Can We Know?, that the average expert is roughly as accurate as a dart-throwing chimp.
But for the past five years, he has been involved with, and has studied, the Good Judgment Project, a team in an annual forecasting tournament funded by a government agency called Iarpa, for Intelligence Advanced Research Projects Activity. Led by Barbara Mellers, also a psychology professor at Penn, who is Tetlock’s wife, the researchers found that people are not quite as hopeless at prediction as initially thought. Tetlock himself has become an “optimistic skeptic.”
“The earlier work focused on cursing the darkness, and the Iarpa tournament focused on lighting candles,” says Tetlock, whose book on the tournament research, Superforecasting: The Art and Science of Prediction (Crown; written with a journalist, Dan Gardner), came out last month.
Tetlock has been interested in forecasting since the 1980s, he says during an interview at his home in Philadelphia. He’s soft-spoken, gestures frequently with his hands, and often talks in terms of trade-offs: rigor versus relevance when generating questions, or seeing belief in a “true/false” framework versus seeing it as a continuum that always needs to be updated. He’s also careful to separate what we do know from what we can only speculate about.
Though Tetlock is cautious about overgeneralizing, his research has shown that, in certain conditions, people with no specific background knowledge can outperform specialists with access to classified information. People like Roberts, who says, “For me and for most ‘supers,’ the method of making predictions always starts the same way: The primary resource is Google.”
In 2010, Jason Matheny, director of Iarpa, invited Mellers and Tetlock to participate in the forecasting tournament, which began in 2011 and ran until earlier this year. It pitted five research teams, including Tetlock’s, against one another and against a control team. Tetlock’s team comprised members crowdsourced from around the world, willing to work hours each week for little more than an Amazon gift card.
All the teams made predictions about key events in the so-called Goldilocks zone of questions: both precise enough to answer and relevant to national security. For example, What will be the highest reported monthly average of Mexican oil exports to the United States between February 5, 2014, and April 1, 2014? Will Angela Merkel win the next election for chancellor of Germany?
The hope was that the teams could beat the combined “wisdom of the crowd” forecast of the control group by 20 percent in the first year and 50 percent in the fourth year. Tetlock’s team emerged the clear winner, beating Iarpa’s 50-percent goal in the first and subsequent years.
Within all the teams, researchers ran experiments — for example, pitting individuals against groups — to see which methods improved accuracy. The essential insight? Prediction accuracy is possible when people participate in a setup that rewards only accuracy — and not the novelty of the explanation, or loyalty to the party line, or the importance of keeping up your reputation. It was under those conditions that the “supers,” the top 2 percent of each group, emerged.
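What “rewards only accuracy” means in practice is a Brier-style scoring rule, the kind such tournaments typically use: a forecaster’s score is the squared gap between the probability she assigned and what actually happened, averaged over many questions. The sketch below is a minimal illustration of that logic, with a made-up question and made-up forecasters, not the tournament’s own scoring code.

```python
def brier_score(forecast_prob: float, outcome: bool) -> float:
    """Squared gap between the forecast probability and the 0/1 outcome.
    Lower is better: 0.0 is a perfect call, 0.25 is what permanent
    50/50 hedging earns, and a confident wrong call approaches 1.0."""
    return (forecast_prob - float(outcome)) ** 2

# A made-up question, resolved "yes", and three styles of forecaster.
outcome = True
forecasts = {
    "hedger (always 50%)": 0.50,
    "confident and right": 0.90,
    "confident and wrong": 0.10,
}
for name, prob in forecasts.items():
    print(f"{name:>22}: Brier = {brier_score(prob, outcome):.2f}")
# Only calibrated, decisive probabilities score well; there is no credit
# for a clever rationale, a party line, or a reputation to protect.
```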
Every prominent pundit who was invited to participate in the tournament declined. Put it this way, Tetlock says: “If you have an influential column and a Nobel Prize and big speaking engagements and you’re in Davos all the time — if you have all the status cards — why in God’s name would you ever agree to play in a tournament where the best possible outcome is to break even?”
Over the past decade, an unfortunate game of telephone has mangled the message of Expert Political Judgment, in which Tetlock argued that experts were, to put it mildly, not very accurate. He did not claim that the public and the experts were equally knowledgeable, says Bryan Caplan, an economist at George Mason University who studies expertise and has been influenced by Tetlock. “He very clearly said that he asked the experts harder questions, and ones that he wasn’t sure they were going to get right. So what he shows is that experts are overconfident and could do a lot better, but not that they’re useless.”
Accordingly, the optimism of the newer research doesn’t contradict the assertion that experts can’t see what will happen in 2025. In the Iarpa tournament, nearly all of the questions asked forecasters to look ahead one year or less. Previous work had asked experts to look three to five years into the future.
“The biggest limitation of the earlier work was that it really wasn’t designed to let people shine,” Tetlock says. “The Iarpa work gives people an opportunity to see how good they can become when they’re allowed to make short-term probability judgments and update those judgments in response to news, and when they’re not given hopelessly difficult tasks like what the state of the global economy is five years from now.”
Scholars have long viewed forecasting with skepticism. Since even the best theory can be thrown off by outlier events in the real world, Tetlock notes, theorists are understandably reluctant to put their intellectual reputations at risk in forecasting exercises. Forecasting tournaments, however, are not about testing particular social-science theories. They are about “testing the ingenuity of individual human beings in cobbling together good explanations as opportunistically and effectively as possible.”
Compared with other major figures in the field, Tetlock is more sanguine about the powers of prediction. He is less convinced than Daniel Kahneman — the psychologist known for his work on cognitive biases, who has collaborated with Tetlock — that training people to let go of faulty beliefs is a losing game. And while he agrees with the risk analyst Nassim Nicholas Taleb that outliers, or “black swans,” can throw off a prediction, he still thinks that the pursuit of prediction, as opposed to “anti-fragilizing,” or making everything immune to collapse, is a worthy goal.
Just as not everyone can be a Mozart or an Einstein, probably not everyone can be a superforecaster. Superforecasters think in very granular, or detailed, ways. While a normal person making a forecast will adjust the likelihood of her prediction’s coming true up or down 20 percentage points with each new development, a superforecaster tends to work in the realm of single digits and decimal points. The intelligence community suggests that the average person can distinguish between three and seven levels of uncertainty (such as “not going to happen,” “maybe,” and “probably going to happen”). Superforecasters can distinguish far more.
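The value of that granularity can be made concrete with a little arithmetic. If an event’s true chance is, say, 63 percent, the expected Brier score splits into irreducible uncertainty plus a penalty equal to the square of the gap between the reported number and the truth, so a forecaster confined to a few coarse levels of uncertainty pays a measurable price on every question. The figures below are purely illustrative, not drawn from the tournament data.

```python
def expected_brier(reported: float, true_prob: float) -> float:
    """Expected Brier (squared-error) score when the event occurs with
    probability true_prob but the forecaster reports `reported`.
    Algebraically: true_prob*(1 - true_prob) + (reported - true_prob)**2,
    i.e. irreducible noise plus a penalty for coarseness."""
    return true_prob * (reported - 1) ** 2 + (1 - true_prob) * reported ** 2

TRUE_PROB = 0.63  # hypothetical underlying chance of the event

# A forecaster with only three levels of uncertainty snaps to the nearest
# of roughly "unlikely", "maybe", "likely"; a granular forecaster need not.
three_level = min([0.2, 0.5, 0.8], key=lambda q: abs(q - TRUE_PROB))
granular = 0.63

for label, q in [("three-level forecaster", three_level),
                 ("granular forecaster", granular)]:
    print(f"{label:>22}: reports {q:.2f}, expected Brier = "
          f"{expected_brier(q, TRUE_PROB):.4f}")
# The gap is (0.63 - 0.50)**2, about 0.017 per question: a small edge
# that compounds over hundreds of forecasts and years of scoring.
```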
They are actively open-minded and score higher than average, though still far from genius territory, on measures of fluid intelligence: the ability to think logically and match patterns. They are highly numerate and enjoy wrestling with hard intellectual problems. It’s a group with specific skills, yes, and also a highly self-selecting one.
Take Ryan Adler, a budget analyst for the city government of Arvada, Colo. “At one point, I saw a question on parliamentary elections in Guinea-Bissau, and I realized that this is great, this will be the first opportunity I get to have reason to learn more about Guinea-Bissau,” he says. “It’s like my dream version of fantasy football.”
Adler achieved superforecaster status after his first season in the tournament. Others, like Jennifer L. Erickson, an assistant professor of political science at Boston College, became supers after multiple seasons, suggesting that accuracy really can be improved by learning the ropes.
One year, a 60-minute tutorial outlining basic concepts about cognitive biases improved participants’ accuracy by about 10 percent. One module reminded them to be aware of “duration neglect,” or the tendency not to take duration into account when making judgments. It suggested breaking down a three-month question into one-month increments.
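One concrete way to follow that advice, assuming for the sake of illustration that the months are roughly independent, is to estimate the chance of the event in each month separately and then combine them, rather than guessing a three-month number in one go; the decomposition forces the length of the window to matter. The monthly figures below are made up.

```python
# Hypothetical decomposition of "Will X happen in the next three months?"
# into monthly estimates, assuming (for illustration) rough independence.
monthly_estimates = [0.10, 0.15, 0.20]   # made-up chance for each month

prob_no_event = 1.0
for p in monthly_estimates:
    prob_no_event *= (1.0 - p)           # it must fail to happen every month

prob_event = 1.0 - prob_no_event
print(f"Implied three-month probability: {prob_event:.2f}")   # about 0.39
```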
Over the four years that Erickson participated, she learned that her best strategy included forecasting a little more aggressively than felt comfortable, since she tended to be underconfident, and staying away from questions when she “knew enough to know that the question was too messy to deal with.”
The setup of the tournament, which rewards only accuracy, was invaluable in teaching supers like Erickson to pick the best strategy. It also led to a finding that teams are more successful than individuals. Researchers divided the Good Judgment Project team into individuals competing alone and groups competing together. Teams were, on average, 23 percent more accurate in their predictions.
Tetlock attributes team success to “life skills” training on how to give constructive criticism, and to teams’ ability to become resources without producing groupthink. “To get those nasty groupthink effects, you have to have an opinionated leader, and here you didn’t have opinionated leaders because the only thing that mattered was the accuracy score,” he says. There was little else to prove, and no need to come up with detailed explanations, which often do more harm than good.
Thanks to the Good Judgment Project, the forecasting tournament has been solidified as one of the best modes of prediction, says Matheny, the Iarpa director. “Tetlock’s research really did help to inform how we should run forecasting tournaments,” he says. “Not just that we should run them, but that there’s best practices in how to run them, and especially in picking questions that are neither too easy nor too hard.”
Though Iarpa’s funding has ended, the team has formed a for-profit entity, Good Judgment Incorporated, which is recruiting members for a public tournament to begin later this fall. Corporate, nonprofit, government, and media clients can sponsor forecasting “challenges” on the public site, and the company will offer custom forecasts and training. It is also studying the potential of machine-human hybrids — like having IBM’s Deep Blue collaborate with Garry Kasparov in chess — that could prove more accurate than either one alone.
Now that we know some limitations, and strengths, of forecasters, Tetlock wants to focus on asking the right questions. He hopes to create what Kahneman has called “adversarial collaboration tournaments” — for instance, bringing together two politically opposed groups to discuss the Iran nuclear deal. One group thinks it’s great, one group thinks it’s terrible, and each must generate 10 questions that everyone will answer.
The idea is that each side will generate questions with answers that favor their position, and that, with everyone forced to consider all questions, a greater level of understanding will emerge. Maybe, in time, this will become the new norm for punditry, public debate, and policymaking.
The ultimate goals? Intellectual honesty. Better predictions. And, says Tetlock, “I hope we can avoid mistakes of the Iraq-war magnitude.”