What can the predictive enterprise learn from Trump vs Clinton predictive failures?
- Summary:
- The presidential election was a cautionary tale of predictive failures. What can the enterprise learn from the flaws of "election science"? Here's my rundown of seven predictive lessons.
To be fair, Silver's predictive method indicated enough Trump support in the run-up to alter his percentages. Silver took plenty of heat from those who thought he was over-weighting Trump's chances.
If you tracked Silver pre-election, you could see Clinton's electoral scenarios were more precarious than Obama's. Either way, Silver didn't look nearly as awkward as the New York Times, which had Clinton's chances upwards of 90 percent on the night of the election.
What went wrong? It's a question we can't fully answer yet - we're still waiting on important data. Some outlets, like the New York Times, have given it a shot. In the digital enterprise, "predictive" is the sex appeal that sells newfangled big data solutions. Is the sheen of predictive hurt by such obvious shortcomings?
Let's get one lesson out of the way: while some of the predictions were wrong, the vast majority of national polls were within 3 points of the result. That's a larger miss than in past elections, but still within most poll watchers' margin of error. But if I stopped there, I wouldn't have a blog post. So let's press on, and see what we can learn from people who were much "wronger" than others.
Seven predictive enterprise lessons from Trump 2016
1. Tools and experts don't insulate us from predictive failure. In the appropriately humble How Did the Media — How Did We — Get This Wrong?, New York Times writer Michael Barbaro and two of his podcast colleagues said that assessing failure is difficult, because the auditing tools are part of the problem:
So, what just happened? “We don’t know what happened, because the tools that we would normally use to help us assess what happened failed,” Ms. Haberman says. “The polling on both sides was wrong.”
“I would say this is a failure of expertise on the order of the fall of the Soviet Union or the Vietnam War,” Mr. Confessore says. [emphasis mine]
2. Rigid data narratives are dangerous. Poll watchers clung to a data narrative that caused them to discount other approaches. Nate Silver was a target of peer disdain, such as the Huffington Post's Nate Silver Is Unskewing Polls — All Of Them — In Trump’s Direction (last updated November 5th). HuffPost's Ryan Grim took issue with Silver's algorithms.
Grim points out that a 2012 outlier, Dean Chambers, founder of Unskewed Polls, went contrarian and said Romney would win almost all fifty states. Chambers is out of the forecasting business now:
Chambers has wisely abandoned the field of election forecasting, and this year says he thinks the various models predicting a Hillary Clinton victory are probably accurate. The models themselves are pretty confident. HuffPost Pollster is giving Clinton a 98 percent chance of winning, and The New York Times’ model at The Upshot puts her chances at 85 percent.
Grim aired his indignation:
There is one outlier, however, that is causing waves of panic among Democrats around the country, and injecting Trump backers with the hope that their guy might pull this thing off after all. Nate Silver’s 538 model is giving Donald Trump a heart-stopping 35 percent chance of winning as of this weekend.
He ratcheted the panic up to 11 on Friday with his latest forecast, tweeting out, “Trump is about 3 points behind Clinton, and 3-point polling errors happen pretty often.”
So what was Silver's supposed mistake? He was adjusting the results of polls rather than entering the polls in his model as is. This practice is called "unskewing":
Silver calls this unskewing a trend line adjustment. He compares a poll to previous polls conducted by the same polling firm, makes a series of assumptions, runs a regression analysis, and gets a new poll number. That’s the number he sticks in his model ― not the original number.
He may end up being right, but he’s just guessing. A “trend line adjustment” is merely political punditry dressed up as sophisticated mathematical modeling.
Given Silver's better performance, his tactics may be worth a second look. Grim has yet to add a reflective mea culpa - it would be interesting to hear his reasoning now. In his post-election podcast, What Just Happened?, Sam Wang - whose model grossly over-estimated Clinton's chances - warns that changing your algorithms midstream poses the risk of introducing new bias. But religiously sticking to your model strikes me as a different risk: confirmation bias.
3. Predictive "failure" is often subjective. Despite our whizz-bang tools, predictive failure isn't necessarily easy to quantify. Without Bullshit's Josh Bernoff takes the view that Nate Silver Wasn't Wrong, given that Silver's final forecast had Clinton's chances at 71.4 percent:
The problem here is that although 71.4% is not certainty, it sure felt like it. Our minds have trouble with probabilities. They round things up. Think about it this way. If there was a 29% chance of rain, would you bring an umbrella?
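To make Bernoff's point concrete, here is a minimal sketch that replays an election many times with the favorite at Silver's final 71.4 percent. The function name, trial count, and seed are my own assumptions for illustration; this is a toy coin-flip simulation, not anyone's forecasting model.

```python
import random

def upset_frequency(favorite_prob, trials=100_000, seed=1):
    """Share of simulated elections in which the favorite loses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    upsets = sum(rng.random() >= favorite_prob for _ in range(trials))
    return upsets / trials

# 0.714 is Silver's final Clinton win probability as quoted above.
freq = upset_frequency(0.714)
print(f"Underdog wins in {freq:.1%} of simulated elections")
```

The underdog wins in roughly two of every seven runs. An outcome that shows up that often is not a shock to the model, only to our rounding instincts.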
Quanta Magazine was harder on Silver: "The results of this year’s presidential election made a mockery of analytical election forecast modelers," writes Pradeep Mutalik. He threw "election science" under the bus:
Like everyone else, I am stunned. In my pre-election Abstractions post below, I commented that the “science of election modeling still has a long way to go,” but I must admit that the distance is far beyond what I had imagined. It seems pointless now to try to dissect the statewide predictions of the various models as I had promised to do — none of them were even remotely in the ballpark. It is unclear how long it will take before election forecasting is trusted again.
He didn't give Silver a pass:
You could be kind and say that the election results were not incompatible with the model that showed the most uncertainty (Nate Silver’s), but there is no doubt that all the model builders completely missed the Trump win. His surprise victory took perfect advantage of the vagaries of the electoral vote system, even as the margin in the popular vote was razor-thin in favor of Clinton. But the modelers also missed something more fundamental, and they will have to revise their models to accommodate it.
This was a systemic undetected polling error — a kind of invisible “dark matter” of polling — that underestimated support for Trump in key states by two to six percentage points.
4. Don't discard data outliers without rigorous evaluation. Quanta Magazine wasn't immune from the problem. In a pre-election post, both Silver and Wang were taken to task: Wang for coming in at 99 percent, far above the aggregate Clinton win probability of 85 percent, and Silver for coming in below it (low 70s on November 8). Wang hasn't assessed his missteps on his blog, but he's done so on a podcast I am still listening to. He also kept his word and ate crickets live on CNN - while warning about media trivialization of politics.
Mutalik concludes that aggregates are the easy part; prediction is hard:
Aggregating poll results accurately and assigning a probability estimate to the win are completely different problems. Forecasters do the former pretty well, but the science of election modeling still has a long way to go.
In Wang's case, he made his estimate of 99 percent by calculating the margin of error in hundreds of alternate models (aka "meta-modeling"). Basically, you assign weights to all the assumptions your primary model didn't make. But is there enough valid data to properly weight those assumptions? On Nov 8, Mutalik said no:
While this meta-modeling would put the probability estimate in perspective and be more accurate, notice that in the absence of enough empirical data, the likelihood of alternate assumptions would still be arbitrary. True accuracy would require a complex model that incorporated many more features than current models do, using data from hundreds of presidential elections, and we don’t have that luxury. An aggregation of existing models is the best simulation of such a meta-model that we have today.
Then Mutalik made a fateful leap of his own:
The best we can do is to aggregate models and discard the outliers — FiveThirtyEight and PEC —just as the modelers aggregate polls and discard outliers there.
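Here is a rough sketch of what "aggregate and discard the outliers" does to the numbers quoted in this post. The figures come from different days in the final week, so treat this as an illustration rather than a clean snapshot. The trimming rule (drop the single highest and lowest forecast) is my assumption; Mutalik doesn't spell out a formula.

```python
from statistics import mean

# Clinton win probabilities (percent) as quoted in this post.
forecasts = {
    "HuffPost Pollster": 98.0,
    "NYT Upshot": 85.0,
    "PEC (Wang)": 99.0,
    "FiveThirtyEight": 71.4,
}

def trimmed_aggregate(models):
    """Drop the lowest and highest forecast, then average the rest."""
    ranked = sorted(models.items(), key=lambda item: item[1])
    dropped = [ranked[0][0], ranked[-1][0]]
    kept = [prob for _, prob in ranked[1:-1]]
    return mean(kept), dropped

trimmed, dropped = trimmed_aggregate(forecasts)
print(f"All models: {mean(forecasts.values()):.1f}")
print(f"Trimmed:    {trimmed:.1f} (dropped {', '.join(dropped)})")
```

On these inputs, trimming throws out FiveThirtyEight and PEC and nudges the aggregate up, further from what actually happened. The least confident model was the outlier that deserved a second look.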
5. Bigger data doesn't mean bigger insights - especially if that data is flawed, volatile, or redundant. In the case of presidential elections, we have all three issues: plenty of polls to aggregate, most of them redundant in their errors, and nearly all of them flawed. When the big data you are crunching is flawed, dirty, or volatile, no amount of algorithmic cleverness can compensate.
As Bernoff noted, polls have one fundamental flaw:
You cannot poll someone who chooses not to answer the poll. This creates nonresponse bias, and there is no way to correct for it. If non-respondents favor one candidate over another, pollsters will miss it. Such an error is systematic — it affects all the polls, national and in all states, in the same direction.
The data is still coming in, but there appears to be some basis for the idea that white women voting for Trump were less poll-responsive, which in turn could have skewed crucial polling data in the swing states of Pennsylvania, Michigan, and Wisconsin.
Third-party and undecided voters also flummoxed predictors, with five percent of votes ultimately going to third parties and most of the "undecideds" going to Trump. We may learn that FBI Director James Comey's pre-election announcement on Clinton's email re-investigation had a significant impact on these undecideds, adding to data volatility. In his post-election podcast, Wang urged a re-evaluation of how pollsters reach modern voters, which might improve the completeness and quality of polling data.
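Why does a shared error matter so much? A minimal simulation, under assumptions of my own: three states, the favorite up 3 points in each, and a total polling error with a standard deviation of 3 points. The only thing that changes between the two runs is whether that error is independent per state or mostly shared across states. None of these parameters come from a real forecaster's model.

```python
import math
import random

def sweep_loss_rate(lead, shared_sd, state_sd, n_states=3, trials=20_000, seed=0):
    """Estimate how often a candidate who leads every state poll loses every state."""
    rng = random.Random(seed)
    sweeps = 0
    for _ in range(trials):
        # One error term hits every state in the same direction (nonresponse bias).
        shared_error = rng.gauss(0.0, shared_sd)
        # Each state also gets its own independent sampling noise.
        margins = [lead + shared_error + rng.gauss(0.0, state_sd) for _ in range(n_states)]
        if all(margin < 0 for margin in margins):
            sweeps += 1
    return sweeps / trials

# Same total error variance (9) per state in both runs; only the correlation differs.
independent = sweep_loss_rate(lead=3.0, shared_sd=0.0, state_sd=3.0)
correlated = sweep_loss_rate(lead=3.0, shared_sd=math.sqrt(8.0), state_sd=1.0)
print(f"Independent errors: favorite loses all three in {independent:.1%} of runs")
print(f"Shared errors:      favorite loses all three in {correlated:.1%} of runs")
```

With independent errors, losing all three is a freak event (about 0.16 cubed, or roughly 0.4 percent). When most of the error is shared, the sweep becomes many times more likely. Piling on more polls that share the same blind spot doesn't buy you the safety it appears to.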
6. Adjust models for regional, industry, and cultural variation - the nuances of the electoral college ultimately foiled predictors. Culture is a potent but elusive factor. For example, white women in the Midwest seemed to place other voting values, such as economic priorities, ahead of voting for a female candidate. Trump's news brouhaha on treatment of women was not cited by these voters in exit polling.
Different industries bring different challenges. Sports will never bend to the will of predictive, though well-constructed algorithms can provide superior outcomes to the average, such as machine-calculated "March Madness" brackets that factor in strength of schedule and other variables.
Weather is fascinating; we are making strides on hurricane predictions, though there is still a margin for error as storms hit land and/or interact with other weather patterns. Predicting the exact strength of a hurricane after landfall is a key area where more improvements are needed. But the trend is towards an acceptable level of accuracy for emergency planning and response.
Earthquakes have proven much harder to predict, mostly because scientists haven't been able to establish a definitive list of causative precursors. Another problem: crucial data is either unknown, or hundreds of feet underground where earthquakes originate. This predictive context helps us to determine whether the margin for error is reliable enough for decision-making, or deeply uncertain.
7. Learn from contrarians with different methodologies - Some folks did get the election right, using entirely different approaches. Filmmaker Michael Moore, not what anyone would call a Trump fanboy, published 5 Reasons Trump is Going to Win months ago. Some of Moore's analysis is prescient, in particular: "I believe Trump is going to focus much of his attention on the four blue states in the rustbelt of the upper Great Lakes – Michigan, Ohio, Pennsylvania and Wisconsin."
Then there is history professor Allan Lichtman, who uses a simple 13-question true/false test to predict the election winner. His system has correctly picked the winner of every election since 1984, with the caveat that he picked Al Gore, who only won the popular vote. Using "simple" to describe his system is a bit misleading, given that Lichtman developed his 13 true/false questions through his study of every election from 1860 to 1980.
Lichtman's 13 keys factor in whether there was a challenge to the incumbent party's nomination. He also considers economic recessions, political scandals, and foreign military failures. Lichtman's rule is "six keys and you're out." The race proved too close for Lichtman to call until September, when a sixth and final key turned against the incumbent Democrats: a statistically significant third-party challenge. Lichtman called the race for Trump.
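The decision rule itself fits in a few lines. This is a sketch of the "six keys and you're out" threshold as described above, not Lichtman's own scoring. The function name is mine, and the example key values are hypothetical placeholders, since this post doesn't list which keys fell.

```python
def incumbent_party_loses(keys_against, threshold=6):
    """Apply the 'six keys and you're out' rule to 13 true/false keys.

    keys_against: 13 booleans, True where a key has turned against
    the party holding the White House.
    """
    if len(keys_against) != 13:
        raise ValueError("expected exactly 13 keys")
    return sum(keys_against) >= threshold

# Hypothetical scorecards: five keys against the incumbents, then a sixth falls.
five_against = [True] * 5 + [False] * 8
six_against = [True] * 6 + [False] * 7

print(incumbent_party_loses(five_against))  # False: incumbents hold on
print(incumbent_party_loses(six_against))   # True: challenger predicted to win
```

Note what's missing: no polls, no probabilities, no margin of error. The work is all in judging the thirteen inputs, which is why one late-breaking key could flip the call.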
Final (quick) thoughts
Even Lichtman was wary of this election. He figured Trump was such an unconventional candidate that he might break Lichtman's historical models:
We have never seen someone who is broadly regarded as a history-shattering, precedent-making, dangerous candidate who could change the patterns of history that have prevailed since the election of Abraham Lincoln in 1860.
That's the problem with historical models - not all history is cyclical. I expected Clinton to prevail in a tight race, so I was wrong also. Though once Comey issued his Clinton email announcement, I figured anything could happen, whether that was an accurate weighting on my part or not.
I'm kicking myself for not paying more attention to Ohio. Ohio has backed the winner of every presidential election since 1980. The state is clearly a bellwether, signaling issues in other Midwestern states. Even in mid-October, Ohio was viewed as "up in the air." That should have alerted us all. But we were stuck in other narratives, infatuated with more elaborate algorithms.