A case study.

Based on our own analysis and the analyses of many others, we believe the primary cause of the polling error in 2016 was pollsters' failure to get a representative sample of white non-college voters. While pollsters overall seem to have correctly projected these voters' share of the electorate, they failed to get a representative sample of them, overrepresenting non-college respondents who work in offices and retail settings and underrepresenting those in construction, manufacturing, agriculture, and other labor-intensive work. Pollsters also got too many of their interviews with non-college voters from areas with higher levels of education.

We are taking two steps to prevent a repeat of this problem in 2018:

The first is to use a weighting approach for white non-college voters that weights self-identified education and area-level education simultaneously, rather than weighting each metric individually. This combined weighting approach alone has been found to eliminate over half of the polling error from 2016 by ensuring that polls do not overrepresent white non-college voters living in more educated areas, which was a significant problem in 2016.
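One way to picture the difference between marginal and simultaneous weighting is a sketch of cell-level weighting: rather than matching the education margin and the area-type margin separately, each respondent is weighted so the sample matches a joint education-by-area target. The categories and target shares below are invented for illustration and are not real polling targets.

```python
from collections import Counter

# Hypothetical respondents, each tagged with (self-reported education,
# education level of the area they live in). Purely illustrative.
respondents = [
    ("non_college", "low_ed_area"),
    ("non_college", "high_ed_area"),
    ("non_college", "high_ed_area"),
    ("college", "low_ed_area"),
    ("college", "high_ed_area"),
    ("college", "high_ed_area"),
]

# Assumed joint target distribution for the electorate (shares sum to 1).
targets = {
    ("non_college", "low_ed_area"): 0.40,
    ("non_college", "high_ed_area"): 0.15,
    ("college", "low_ed_area"): 0.10,
    ("college", "high_ed_area"): 0.35,
}

def joint_weights(sample, joint_targets):
    """Weight each respondent so the weighted sample matches the joint
    (cell-level) targets, not just each margin separately."""
    counts = Counter(sample)
    n = len(sample)
    # Each respondent's weight is the ratio of the cell's target share
    # to its observed share in the sample.
    return [joint_targets[cell] / (counts[cell] / n) for cell in sample]

weights = joint_weights(respondents, targets)
```

In this toy sample, non-college voters in more educated areas are overrepresented (2 of 6 respondents versus a 15 percent target), so they receive weights below 1, while non-college voters in less educated areas are weighted up. Weighting the two margins separately would not guarantee this, because the margins can match even when the joint cells are badly off.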

The second step is to increase the share of non-office workers in our polls by conducting more daytime dialing and closely tracking the employment type of respondents. Analyses have found that polls significantly overrepresented office workers among white non-college voters, and, not surprisingly, that the voting patterns of white non-college office workers differ significantly from those of their blue-collar counterparts. (This problem is heavily correlated with the problem, described above, of getting too many white non-college interviews in more educated areas.)

By taking these steps, we believe we can get a much more representative sample of white non-college voters in our polls and correct the biggest cause of the polling error in 2016.

Lessons from 2016

As described above, we believe that the major problem with polling in 2016 was pollsters getting an unrepresentative sample of white non-college voters, and we are taking concrete, data-driven steps to prevent this problem from occurring again in 2018.

Unrepresentative samples of white non-college voters have likely been a problem in polls for years (less educated voters are consistently harder to reach), but the problem did not become fully apparent until 2016 (with the exception of the canaries in the coal mine: the 2015 Kentucky Governor's race and some 2014 races).

This problem became so consequential in 2016 because of the significant expansion of the education gap among white voters. In 2012, the gap in the presidential vote between whites with and without a college degree was only about 12 points, so if a pollster's sample of white non-college voters was off, it likely did not have much of an impact on the overall results.

However, in 2016, the education gap among whites nearly tripled, to 34 points. With a gap this large, an unrepresentative sample of white non-college voters suddenly had a real impact on the overall results.

While unrepresentative samples of white non-college voters explain most of the polling error in 2016, they do not explain all of it. The second biggest cause of the polling error was pollsters' inability to correctly predict how undecided voters would break, particularly those who viewed both Trump and Clinton unfavorably and who ended up breaking significantly for Trump.

This problem highlights the importance of another aspect of polls that we have addressed above: “solidity questions.” By giving pollsters a clear read of voters’ openness to supporting each candidate, these questions are critical to allocating undecideds and detecting late movement. The absence of these questions from many polls made it harder for pollsters to determine how voters would break late, which was a major reason that the DGA has recommended that they be used consistently in polls going forward.

There is also evidence to suggest that the analytics models that many campaigns and pollsters based their samples on were inaccurate, demonstrating the importance of having two independent data sources.