[1] Several months after joining the committee, Kyley McGeeney took a position at PSB, her current employer.
[2] In four of those battleground states – a subset that included Florida and Pennsylvania – the average poll margin was less than a point, signaling that either candidate could win.
[3] Neither candidate actually received exactly that number of electoral votes because several electors voted for candidates other than the one to whom they were pledged.
[4] A list of the microdatasets made available to the committee is provided in Appendix Table A.0.
[5] This analysis includes polls whose final field date fell within 13 days of Election Day (October 26th or later) and whose field period began by October 16th. National poll analysis includes only a polling firm’s final estimate to ensure comparability with historical data. Analysis of state-level polls, by contrast, includes all polls completed within the final 13 days, including multiple surveys from the same firm in the same state. Excluding pre-final estimates from the national poll results provides a clearer historical comparison to analyses by the National Council on Public Polls, which is the source of the 1936-2012 data and only includes final estimates.
[6] Even if U.S. pollsters collectively had a historical tendency to overestimate support for one major party relative to the other (e.g., as seen in U.K. polling since 1992 (Sturgis et al. 2016)), this would not be evidence of partisan behavior on the part of the pollsters. Rather, such a pattern could be entirely (and more plausibly) explained by methodological factors, for example, over- or under-sampling African Americans or cell-phone-only adults. We raise the issue of partisanship here only because it is one of the many criticisms leveled against the polling community from time to time. While the existence of a historical partisan error pattern would not prove partisan behavior, the absence of such a pattern should reassure poll consumers that pollsters are not putting their thumbs on the scale.
[7] Appendix A.A discusses how a poll’s margin of error relates to the accuracy metrics used in this report.
[8] Notable examples of other 2016 pre-election poll designs include internet surveys with a panel recruited offline using a probability-based sample (e.g., the USC Dornsife/LA Times Daybreak Poll) and mail surveys (the Columbus Dispatch Poll).
[9] One notable lesson from reviewing countless methodological reports is that the commonly held assumption that IVR polls dial only landlines (Cassino 2016; Clement 2016; Clinton and Rogers 2013; Cohn 2014; Enten 2012; Jackson 2016; Pew Research Center 2016) is not always correct. At least two pollsters clearly described their methodology as IVR only and yet reported that a noticeable share (10 to 25 percent) of their completed interviews were conducted by cell phone. This detail may help explain why coverage error (e.g., excluding the roughly half of the U.S. population that is cell-phone-only) was not more of a problem for IVR-only polls in 2016.
[10] A note of caution is in order. Not every survey mode (telephone, internet, IVR) was used for every primary, or in the same proportions. This lack of uniformity across the 78 contests means that evaluations of state-level accuracy could be affected by differences in the types of survey mode employed in the states. Likewise, evaluations of the accuracy of survey mode could be affected by the types of states where each mode was used. (For instance, some primaries – typically primaries for which one candidate was heavily favored – lacked a single live phone poll, and if the margin of victory in these primaries was harder to predict, this would reduce our ability to interpret these differences as reflecting the impact of survey mode.)
[11] The estimated 6-point Clinton lead based on pre-election data reflects responses from all registered voters weighted with Pew Research Center’s standard protocol for general population surveys.
[12] This difference is thought to be largely an artifact of the difficulty of treating “leaners” the same way on the phone and on the web (Cohn 2016a). In phone polls, a respondent who declines to say which of the candidates mentioned they plan to vote for is typically asked a follow-up question about which candidate they are leaning toward. This follow-up format tends to yield relatively low levels of nondisclosure. In online polls, by contrast, such a follow-up format is less common.
[13] We also considered defining “over-performance” with respect to the candidate’s estimated share of the total vote, as opposed to using the Republican-Democrat margin as described in the text. We found, however, that vote share was not a suitable framework. Because polls feature undecided voters and tended to overestimate support for third-party candidates, both Donald Trump and Hillary Clinton generally “over-performed” relative to their estimated vote share in polls.
[14] The level of imprecision or, more specifically, the standard error of these estimates is worth considering though not addressed in this report. Given that this analysis yielded essentially a null result, the standard errors are, for practical purposes, a moot point.
[15] As stated in the footnote of Table 9, the Census figures are based on all ages and the CNN/ORC and Pew Research Center figures are based on all adults age 18 or older. We investigated whether that discrepancy confounded the comparison in a noticeable way and concluded it did not. While it seemed possible that rural and other staunchly pro-Trump areas skew slightly older than other parts of the country, we did not see empirical evidence of that. For example, the predominantly rural and overwhelmingly pro-Trump states of Oklahoma and Wyoming each represented the same share of the total U.S. population (1.2% and 0.2%, respectively) as of the U.S. adult population (also 1.2% and 0.2%, respectively). Consequently, we concluded that this small discrepancy has no meaningful impact on the results or conclusions in this analysis.
[16] An early release of the MSU poll reported a 20-point Clinton lead (http://msutoday.msu.edu/_/pdf/assets/2016/state-of-state-survey.pdf). The corresponding microdataset provided to the committee, presumably reflecting the final release, gives a 17-point Clinton lead as shown in Figure 10.
[17] Some online opt-in polls adjust for education at the sampling stage through quotas or sample matching, which can mitigate the need to adjust for education in weighting. To the best of our knowledge, neither UPI/CVoter nor Google Consumer Surveys (shown in Figure 11) adjusted for education at the sampling stage.
[18] The Monmouth microdatasets did not have a variable to distinguish LVs from all RVs, so no weighted RV estimates are presented for those polls.
APPENDIX
Table A.0 Microdatasets Made Available to the Committee

| Organization | Microdata |
| --- | --- |
| ABC News/Washington Post | National tracking poll with n=9,930 fielded Oct 20-Nov 7 |
| CNN/ORC | National poll with n=1,001 fielded Sep 1-4; national poll with n=1,501 fielded Sep 28-Oct 2; national poll with n=1,017 fielded Oct 20-23; AZ poll with n=1,005 fielded Oct 27-Nov 1; CO poll with n=1,009 fielded Sep 20-25; FL poll with n=1,000 fielded Sep 7-12; FL poll with n=1,011 fielded Oct 27-Nov 1; NC poll with n=1,025 fielded Oct 10-15; NV poll with n=1,006 fielded Oct 10-15; NV poll with n=1,005 fielded Oct 27-Nov 1; OH poll with n=1,004 fielded Sep 7-12; OH poll with n=1,008 fielded Oct 10-15; PA poll with n=1,032 fielded Sep 20-25; PA poll with n=1,014 fielded Oct 27-Nov 1 |
| Marquette University | WI state poll with n=1,401 fielded Oct 27-30 |
| Michigan State University | MI state poll with n=1,010 fielded Sep 1-Nov 13 |
| Monmouth University | NV state poll with n=465 fielded Oct 14-17; WI state poll with n=428 fielded Oct 15-18; NC state poll with n=487 fielded Oct 20-23; AZ state poll with n=463 fielded Oct 21-24; NH state poll with n=430 fielded Oct 22-25; IN state poll with n=448 fielded Oct 27-30; MO state poll with n=457 fielded Oct 28-31; PA state poll with n=453 fielded Oct 29-Nov 1; UT state poll with n=445 fielded Oct 30-Nov 2 |
| Pew Research Center | Election Callback Study 2000, 2004, 2008, 2012, 2016; cumulative national polls from 2016 with cumulative n=15,812; 2016 Callback Study with n=1,254 fielded Nov 10-14, 2016 |
| SurveyMonkey | National tracking poll with n=219,431 fielded Oct 4-Nov 7. This dataset also supported state-level analyses. |
| USC/LA Times | 2016 national panel survey with n=4,509 fielded Jul 4-Nov 7 |
| YouGov | Cooperative Congressional Election Study with n=117,123 fielded Oct 4-Nov 6; Economist/YouGov poll with n=4,171 fielded Nov 4-7; other polls across 51 states with n=81,246 fielded Oct 24-Nov 6. These datasets would have supported state-level analyses. No weights were provided. |
Table A.1 Average Absolute and Average Signed Error in 2016 State-Level General Election Polls

| Type of poll | Number of polls in final 13 days | Average absolute error | Average signed error |
| --- | --- | --- | --- |
| National polls | 39 | 2.1 | 1.3 |
| All state polls | 423 | 5.1 | 3.0 |
| All battleground state polls | 207 | 3.6 | 2.3 |
| All non-battleground polls | 206 | 6.4 | 3.3 |
| Wisconsin | 13 | 6.5 | 6.5 |
| Ohio | 13 | 5.2 | 5.2 |
| Minnesota | 5 | 4.9 | 4.9 |
| Pennsylvania | 24 | 4.2 | 4.2 |
| North Carolina | 18 | 4.8 | 4.0 |
| Michigan | 17 | 3.8 | 3.5 |
| New Hampshire | 16 | 5.0 | 3.4 |
| Florida | 23 | 2.9 | 1.3 |
| Arizona | 18 | 2.5 | 1.0 |
| Georgia | 14 | 2.3 | 0.9 |
| Virginia | 14 | 1.9 | -0.2 |
| Colorado | 16 | 2.3 | -1.6 |
| Nevada | 15 | 2.5 | -1.7 |
Figure A.1 Distribution of the Size of the Absolute Error in Primary Polls, 2000-2016. Note – Each bin is 5 percentage points wide.
A.A Poll Accuracy and the Margin of Error
The amount of error found in polls underscores the importance of accurately accounting for, and reporting, the errors involved in polling. Most polls are accompanied by a “margin of error” that is often interpreted as the amount of error in the poll. This is an incorrect interpretation. The margin of error denotes how much error is likely due to sampling variation alone, such that if the survey were redone 100 times under exactly the same conditions, 95 of the estimates would lie within that range. The typical margin of error understates even this assessment because it reflects the amount of variation we expect when estimating a single proportion using a sample of respondents from a larger (fixed) population. In a horserace poll, however, the quantity of interest is often the difference in support for two candidates. The error due to sampling variability alone when estimating the difference in two (correlated) quantities from the same poll is therefore larger than what the margin of error reports.
The amount of sampling variability in a poll is not the same as a description of how close the poll’s prediction may be to the truth. It is wrong to equate the margin of error with the amount of polling error. There are many other sources of error that can affect the accuracy of a poll and whose effects are not reflected in the margin of error. For pre-election polls, these difficulties typically involve issues such as the possibility of systematic non-response, due to either technological (e.g., access to the Internet or a land line telephone) or psychological (e.g., distrust of the media and pollsters leading to a refusal to participate) reasons, as well as the additional difficulty of identifying “likely voters” and what the composition of the electorate will be.
For example, 2016 primary polls had an average of 636 respondents, which yields a margin of error of ±4% using standard assumptions and calculations (which do not account for the loss in precision due to weighting or design departures from a simple random sample of the population). However, the average absolute error in the margin of victory was 9.3% – more than twice the stated margin of error. The fact that the average error was so much greater than the margin of error highlights the importance of better understanding and communicating exactly what the margin of error is and is not. It is not a statement about the potential error that the poll contains, and conflating these concepts does a disservice to our ability to interpret and assess the accuracy of pre-election polls.
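The arithmetic behind this gap can be illustrated with a short calculation. The sketch below is ours, not the committee's; the helper names and the illustrative 45/45 split are assumptions. It contrasts the conventional margin of error for a single proportion with the wider margin of error that applies to the lead between two candidates estimated from the same sample.

```python
import math

def moe_proportion(p, n, z=1.96):
    """Conventional 95% margin of error for a single proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

def moe_lead(p1, p2, n, z=1.96):
    """95% margin of error for the lead (p1 - p2) between two candidates
    estimated from the same sample. The two shares are negatively
    correlated, which makes this roughly twice the single-proportion figure."""
    var_diff = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n
    return z * math.sqrt(var_diff)

n = 636  # average sample size of the 2016 primary polls cited above
print(round(100 * moe_proportion(0.50, n), 1))   # ~3.9 points: the usual reported MOE
print(round(100 * moe_lead(0.45, 0.45, n), 1))   # ~7.4 points: MOE for the candidate lead
```

Even this wider figure reflects sampling variability alone; it says nothing about the non-sampling errors discussed above.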
Table A.2 Regression of Absolute Error on Poll Characteristics

| | Model 1 (Mode): B | Sig | S.E. | Model 2 (Sample): B | Sig | S.E. |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 1.52 | ** | 0.535 | 1.63 | ** | 0.598 |
| Mode | | | | | | |
| Internet | 0.19 | | 0.401 | | | |
| IVR | -1.14 | * | 0.553 | | | |
| IVR/Cell | -0.38 | | 0.628 | | | |
| IVR/Internet | -0.39 | | 0.486 | | | |
| Other | 4.78 | *** | 1.244 | | | |
| Sample source | | | | | | |
| Opt-in | | | | -0.12 | | 0.494 |
| Voter file | | | | -0.56 | | 0.552 |
| Opt-in/Voter file | | | | -0.29 | | 0.635 |
| RDD/Opt-in | | | | -0.93 | | 1.168 |
| Voter file/RDD | | | | 1.17 | | 1.765 |
| Other | | | | -0.82 | | 0.865 |
| State | | | | | | |
| Arizona | 0.76 | | 0.657 | 0.86 | | 0.719 |
| Colorado | 0.70 | | 0.689 | 0.69 | | 0.765 |
| Florida | 1.18 | . | 0.618 | 1.29 | . | 0.672 |
| Georgia | 0.79 | | 0.725 | 0.73 | | 0.771 |
| Michigan | 2.49 | *** | 0.714 | 2.20 | ** | 0.763 |
| Minnesota | 2.93 | ** | 1.104 | 3.08 | ** | 1.164 |
| Missouri | 2.75 | | 2.330 | 1.41 | | 2.965 |
| North Carolina | 3.24 | *** | 0.665 | 3.17 | *** | 0.714 |
| New Hampshire | 3.38 | *** | 0.687 | 3.27 | *** | 0.735 |
| Nevada | 0.91 | | 0.707 | 0.87 | | 0.764 |
| Ohio | 2.52 | *** | 0.749 | 3.17 | *** | 0.782 |
| Pennsylvania | 2.43 | *** | 0.613 | 2.51 | *** | 0.667 |
| Virginia | 0.27 | | 0.720 | 0.27 | | 0.780 |
| Wisconsin | 4.87 | *** | 0.749 | 4.85 | *** | 0.807 |
| Days from mid-date to election | 0.03 | | 0.046 | 0.04 | | 0.049 |
| Adjusted R-squared | .28 | | | .27 | | |

Reference categories: Live phone (Mode), RDD (Sample source), National (geography). Sig: . p<.10; * p<.05; ** p<.01; *** p<.001.
A.B Regression Analysis Examining Effects of Poll Design Features on Accuracy
The focus of our evaluation is on the average overall performance of the polls in a state primary or caucus – not the performance of individual polls or even specific types of polls. Our motivating question is – among the polls conducted and publicly reported in the last two weeks for each contest, how well did the polls do at predicting the margin of victory in each contest on average? Are there characteristics of polls or contests that are related to better or worse performance on average?
To do so, we collected information on all publicly reported polls conducted within the last two weeks of each primary contest, as reported by FiveThirtyEight.com and Pollster.com.
To explore differences in polling performance, for each poll we collected the following: the length of the field period, the firm conducting the poll, the sample size, the target population (“likely” voter or registered voter), the interview mode, the sample source (when possible), the percentage of cell phones in the sample (when possible), the affiliation of the pollster (partisan, sponsored, or nonpartisan), the votes received by each of the leading candidates, and the verified election results for each contest.
There was very little variation for some of these characteristics. Because 441 of the surveys had a target population of “likely voters” and only 14 reported results for registered voters in the time frame we examine, for example, we have no real ability to determine whether likely voter or registered voter samples are more accurate. Other data were hard to collect – even after trying to contact every pollster, we were able to acquire the percentage of cell phone numbers called for only 323 of the publicly available polls.
The performance of 2016 primary polls seems broadly consistent with the performance of polls in earlier primary contests. To delve deeper into the data and characterize how polling performance varied across the 2016 primary contests, we examine how the median absolute polling error varies with the number of polls conducted in a state’s Democratic and Republican primaries. We focus on the median absolute polling error to minimize the impact of extreme outliers, but the takeaways are unchanged. Figure A.2 presents the performance of polls within each primary contest.
Figure A.2 Median Absolute Error in Primary Contests: Note – Republican (red) and Democrat (blue) results for each state are plotted.
Each labeled point in Figure A.2 denotes the median absolute error for the polls conducted in each state contest for the Republicans (red) and Democrats (blue) as a function of the number of polls. Circled states indicate instances in which more than 50% of the polls predicted the wrong winner – something that happened in 9 out of the 78 contests.
Several conclusions are evident from Figure A.2. First, the number of polls conducted varies considerably across contests – ranging from a high of 33 polls in the New Hampshire Republican primary to a low of a single poll in 19 contests. This variation is important for several reasons.
First, insofar as each poll result is an independent estimate of the result, the average absolute error in a contest should be smaller the more polls there are for the same reason that more respondents in a poll lead to a smaller margin of error all else equal. Of course, the polls being averaged vary in important ways that can undermine the assumption that the polls’ estimates are a random sample of the population, but the fact that a smaller average error occurs in states with more polls suggests that there are more similarities than differences. (Note that this relationship is not necessarily evidence of “herding,” whereby polls are weighted to help mimic pre-existing results; if herding occurs, there is no reason to think that it would necessarily be more prevalent in states with more polls.)
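To make the intuition concrete, suppose – purely as a simplification – that the k polls in a contest are independent and share a common error standard deviation $\sigma$. Then the polling average satisfies

```latex
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k}\hat m_i\right) \;=\; \frac{\sigma^{2}}{k}
\qquad\Longrightarrow\qquad
\operatorname{SE}\!\left(\bar{\hat m}\right) \;=\; \frac{\sigma}{\sqrt{k}},
```

where $\hat m_i$ is poll $i$'s estimate of the margin. Under this idealization, quadrupling the number of polls halves the expected error of the polling average; errors that are correlated across polls (shared modes, frames, or likely voter models) would attenuate that gain.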
Second, the variation we observe in the number of polls in each contest highlights an important limitation to our efforts to evaluate the accuracy of polls. Because each pollster decides which contests to poll, this choice can have important implications for evaluating the overall accuracy of polls. If the decision of whether or not to poll depends on the difficulty of polling in the state, the fact that only some pollsters choose to poll a contest can affect our overall assessment of poll quality. To use an analogy, evaluating the accuracy of polls using their performance in the states pollsters choose to poll is akin to evaluating a student’s performance on a test using only those questions that they choose to answer. If students decide to answer only “easy” questions, our evaluation of their ability may be very misleading. Similarly, if pollsters are more likely to poll in states where they are more likely to be successful, our assessment may be overly optimistic. As a result, our results can, at best, inform us of how well the polls that were conducted and publicly released performed in the states where they were conducted. Because not every pollster polls every race, and because the decision to poll – or to publicly release the results – is itself a choice, our results could be affected by the difficulty of polling the race itself if polls are more likely to be conducted in states that are easier to poll.
Finally, highlighting a point made earlier, the median of the median absolute error across the 78 contests with at least one poll conducted in the last two weeks is 9.0. That is, the median amount of error between the estimated and actual margin of victory across all primary contests is 9 points. Thus, while the polls correctly predicted the winner more often than not, on average the predicted margin of victory in polls differed from the official margin on Election Day by nine points.
To analyze poll performance based on their characteristics, we estimate the absolute value of each poll’s error as a function of both poll-level and contest-level characteristics using a linear regression model. The benefit of this approach is that it allows us to directly quantify the average conditional impact of each characteristic holding all other aspects of the poll and contest fixed. This approach provides a high-level overview of the features that are related to larger and smaller errors while quantifying the average overall performance.
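A minimal sketch of such a specification is shown below, using the statsmodels formula interface. The file and variable names are illustrative assumptions – this is not the committee's code – and the covariates are enumerated in the next two paragraphs.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per poll; the file and column names here are illustrative, not the committee's.
polls = pd.read_csv("primary_polls_2016.csv")

formula = (
    "abs_error ~ "
    # poll-level characteristics
    "sample_size + I(sample_size ** 2) + "
    "field_days + I(field_days ** 2) + days_to_election + "
    "C(pollster_party) * C(contest_party) + "
    "uses_live_phone + uses_ivr + uses_internet + "   # non-exclusive mode indicators
    # contest-level characteristics
    "C(state) + n_polls_in_contest + winner_vote_share"
)

fit = smf.ols(formula, data=polls).fit()
print(fit.summary())
```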
In this regression we control for several contest-level features, including: whether it is a Republican or Democratic contest (perhaps it is harder to predict the margin when more candidates are running?), the state in which the contest occurs (to control for potential differences in the difficulty of polling in different states), the total number of polls conducted in the primary contest in the state (to provide a sense of how much other activity was going on in the contest), and the percentage of the vote received by the winner (perhaps it is harder to predict the margin in blow-out contests than in closely fought contests?).
We also account for several poll-specific characteristics that may affect the accuracy of the poll. The variables we control for include: the sample size (and the square of the sample size to allow for a non-linear effect), the length of the field period (and its square) to account for the potential impact of longer and shorter field periods, the number of days between the last field period day and Election Day to account for the possibility that later polls may be more accurate because they capture last-minute changes in opinion, and whether the pollster is affiliated with the Democratic or Republican party. To allow for potential expertise effects, we also interact the partisanship of pollsters with the party of the contest to explore whether, for example, Democratic pollsters are more accurate in Democratic primaries.
The final set of variables involves the mode of survey interview and whether it was done via: interactive voice response (IVR) (86), IVR/Live Phone (9), IVR/Live Phone/Online (3), IVR/Online (66), Internet (47), Live Phone (239) or Live Phone/Online (6). We collapse these into a set of three non-exclusive, but exhaustive, variables depending on whether the poll relies either exclusively or partially on each of the three modes. Given the interest in differences by polling mode, Figure A.3 presents the distribution of polling errors by mode.
Figure A.3 Absolute Error by Mode
Figure A.3 reveals few differences in absolute error by survey mode – the median horserace errors for internet, live phone, and IVR polls are 8%, 7%, and 8%, respectively. Even so, it is hard to make direct comparisons, not only because polls are conducted differently within each mode, but also because not every mode was used in every primary. Some primaries – typically those in which one candidate was heavily favored – lacked a single live phone poll, and if the margin of victory in these primaries was harder to predict, this would limit our ability to interpret these differences as reflecting the impact of survey mode.
To better explain the relative performance of polls it is, therefore, important to control for as many aspects as possible so that we can make a comparison “all else equal.” We use a regression specification that includes the characteristics described above to do so. The results are perhaps best digested graphically: Figure A.4 depicts the coefficient estimates and 95% confidence intervals for the survey characteristics we are able to include in the analysis, given data constraints. Several conclusions are immediately evident. First, while there are slight differences by survey mode – polls using IVR and online methods are associated with slightly larger average absolute errors, all else equal – the differences are small (.21 and .08 points larger than a live phone poll, respectively) and not statistically distinguishable from 0. However, polls conducted further from the election contain a larger error – for every additional day between the last day of the field period and Election Day, the average error is 0.40 points larger, all else equal.
Figure A.4 Marginal Effect of a One-Unit Change in Each Feature on the Absolute Error for 2016 Primary Polls
The partisanship of the pollster also seems to have an interesting effect. Because nonpartisan pollsters are the omitted category, the coefficients for DemPollster and RepPollster reflect the relative performance of Democratic and Republican pollsters, respectively, in a Republican primary contest compared to a nonpartisan poll. While not distinguishable from 0 at conventional levels, the estimates suggest that Democratic pollsters’ error is 1.79 larger than nonpartisan pollsters’ while Republican pollsters’ is 1.22 smaller. The opposite pattern emerges when we look at the performance of partisan pollsters in a Democratic contest. In such cases, Democratic pollsters make errors that are 4.87 smaller on average than a nonpartisan pollster and Republican pollsters make errors that are 4.60 larger. The fact that the performance of partisan pollsters varies, and that error is smaller in the primary contests that match the pollster’s affiliation, suggests that partisan pollsters may have a slightly better ability to predict their own party’s contests – a disparity that is most striking in Democratic contests.[1] That said, this difference is driven by the performance of a few pollsters in a few contests, so it is important not to over-interpret the significance of this finding.
There is also important variation in average poll performance depending on whether the election is a blowout or not, as well as on the number of polls conducted in the state. While distinguishable from zero, the substantive effect of the electoral margin on poll performance is relatively slight – increasing the margin of victory by a standard deviation (12.4 points) is predicted to increase the average horserace error by 0.84 points, all else equal. Similarly, while the average polling error is smaller in contests with more polls, the effect size of -0.38 suggests a substantively slight impact – going from a contest with a single poll to a contest with 33 polls conducted in the last two weeks is associated with a decrease of only 1.27 points in the average absolute horserace error.
Of course, there are also systematic effects that may vary by state. Not every state is equally easy to poll, and in estimating the effect of each characteristic we also control for differences across states. These differences sometimes matter. Polls in Utah, South Carolina, Oregon, Michigan and Kansas, for example, were all off by an average of 10 points, all else equal. While it is impossible for us to diagnose the exact reasons for these systematic errors, controlling for them in the analysis is important because it removes the impact of these state-specific errors from the estimated effects graphed in Figure A.4.
Figure A.5 Average Error in Polls After Controlling for Poll Characteristics
The value of the constant is substantively important, as it reflects the average amount of error in the polls’ horserace estimates after controlling for poll-level and contest-level differences. The estimate is 5.3 with a 95% confidence interval ranging from -2.2 to 12.9. This means that while the average estimate of the margin of victory was off by more than 5 points, we cannot statistically reject the hypothesis that the average error was 0 at conventional significance levels.
What explains the variation in performance across states? To tackle this question, we examine which characteristics predict the average absolute horserace error in each of the 74 state primary contests in which at least one poll was taken in the two weeks prior to the election. We collected data on whether the primary contest is closed, open, or mixed; whether it is a caucus; whether it is a Republican or Democratic contest; how many votes were cast in the election (logged to account for outliers); and the number of polls that were conducted.
Table A.3 Results of Regressing Absolute Poll Error on Contest Characteristics

| | Coefficient (Standard Error) |
| --- | --- |
| Closed Primary | -1.64 (3.41) |
| Open Primary | 2.51 (3.31) |
| Caucus | 9.68* (3.71) |
| Republican Contest | -0.83 (2.09) |
| Number of Polls | -0.21 (0.17) |
| Log(Ballots Cast) | -2.53* (1.21) |
| Number of Contests | 74 |
| R-squared | 0.33 |
The results are instructive. The average absolute horserace error in closed primaries is lower than in open primaries, but the two are not statistically distinguishable from one another.[2] Caucuses are associated with much bigger errors – the average absolute horserace error is nearly 10 points greater in caucuses, a statistically significant difference. Relatedly, larger contests are associated with smaller polling errors – a one-unit increase in the log of ballots cast is associated with a 2.53-point decrease in the average absolute horserace error, all else equal.
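As a rough aid to interpreting the Log(Ballots Cast) coefficient – assuming the natural log, which Table A.3 does not state explicitly – the fitted change in the average absolute error when turnout doubles is

```latex
\Delta(\text{absolute error}) \;\approx\; -2.53 \times \Delta\ln(\text{ballots cast})
\;=\; -2.53 \times \ln 2 \;\approx\; -1.75 \text{ points.}
```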
In general, it is hard to conclude that 2016 primary polls were noticeably worse than primary polls in earlier years, despite some high-profile misses (e.g., polls in the Michigan Democratic primary). Moreover, while some states caused more trouble for pollsters than others, there are few systematic features of either polls or contests related to the average accuracy of polls that lend much guidance going forward. Polls done further from Election Day contained more error, all else equal, as did polls predicting caucus outcomes. Polls seemed to do better when more polls were taken, but it is hard to know whether this reflects that polls were more likely to be conducted in some contests than others. While there will always be outliers, and we have explicitly and intentionally avoided trying to estimate the impact of pollster-specific “house effects,” the analyses reveal very little evidence that the ability of polls to predict the margin of victory varies systematically according to mode of interview, sample size, field period, or proximity to Election Day during the last two weeks.
What the results do suggest is a need for increased sensitivity to the many errors that are present in pre-election polling. The 2016 primary polls did not perform noticeably worse than polls in earlier primary elections, but there is a consistent level of error that is still more than twice the “margin of error” that polls publicly report. A heightened sensitivity to the errors involved in polling seems sensible going forward.
A.C Error by Distance from Election Day
State polls that ended in the final 13 days were conducted slightly earlier than national polls, raising the possibility that state surveys failed to catch a late shift in Trump’s direction. To assess this, the distance between the middle of a poll’s field period and Election Day was calculated for all battleground state and national surveys, allowing errors to be compared among earlier and later polls. The mid-date for state polls ending in the final 13 days averaged 7.8 days from Election Day, while national polls averaged 6.4 days.
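A minimal pandas sketch of this calculation is shown below; the file and column names are illustrative assumptions, not the committee's.

```python
import pandas as pd

# One row per poll; file and column names are illustrative.
polls = pd.read_csv("final_two_weeks_polls_2016.csv",
                    parse_dates=["start_date", "end_date"])
election_day = pd.Timestamp("2016-11-08")

# Midpoint of each field period and its distance (in days) from Election Day.
polls["mid_date"] = polls["start_date"] + (polls["end_date"] - polls["start_date"]) / 2
polls["days_out"] = (election_day - polls["mid_date"]).dt.days

# Average absolute error for later vs. earlier polls within the final two weeks.
polls["late"] = polls["days_out"] < 5
print(polls.groupby(["geography", "late"])["abs_error"].mean().round(1))
```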
National polls with a midpoint less than 5 days before the election (n=16) exhibited slightly higher errors than those conducted earlier in the final two weeks (2.0 vs. 1.6), and the average bias against Trump was apparent only in the final polls before the election (0.8 vs. -0.2).
State surveys with a midpoint less than 5 days before the election had roughly the same average error (3.6) as those conducted earlier in the final two weeks (3.7); the average bias underestimating Trump’s support was slightly higher in polls completed closer to Election Day than in earlier polls (2.6 vs. 2.3). While there was very little difference in accuracy using the five-day cut-off, the 22 state-level surveys with midpoints less than 3.5 days from the election proved more accurate. These surveys averaged a 2.7-point vote margin error and a 1.4-point bias underestimating Trump, providing at least some support for the theory that the inaccuracy of state polls was due to a late shift in preferences.
A.D Poll Performance during the 2016 Presidential Primaries
This section considers the accuracy of primary polls across the 2016 nomination timeline. Previous research indicates that performance during the primaries varies across states and particularly over time (Traugott and Wlezien 2009). What about 2016? Do we observe a similar pattern?
Little scholarship examines the accuracy of the polls during the nomination process. Beniger (1976) considered the relationship between the polls and primary outcomes from 1936 to 1972 and found that being the leader in early polls was the best predictor of electoral victory. While not surprising, it is not clear what it tells us about the current nomination process, which emerged in 1972.
Only two pieces of research explicitly examine the performance of polls in the current nomination system – Bartels and Broh (1989) and Traugott and Wlezien (2009). Bartels and Broh analyzed the performance of three organizations (the CBS News/New York Times poll, the Gallup Organization, and the Harris Poll) in the 1988 primaries, during which polling efforts were limited. Bartels and Broh also found inconsistencies in the reporting of the poll numbers. Still, they made some observations, the most noteworthy of which is that the polls underestimated the support for each candidate (with the exception of Senator Robert Dole).
Two decades later, Traugott and Wlezien (2009) studied poll performance over the course of the 2008 nomination process. Their poll data came from published state-level results of public pollsters from the week preceding each primary or caucus – 258 polls in 36 different Democratic events and 219 polls in 26 Republican events – and their analysis focused on the gap between the winning candidate’s vote share and poll share. They found that the vote share almost always exceeded the poll share while the race remained competitive, particularly early in the nomination process. In an unusual perspective made possible by the length of the contest, particularly on the Democratic side, this could be observed through most of the primaries; it was not the case in the Republican events after John McCain became the presumptive nominee. The analysis also showed that there are state-specific contextual factors at work that can affect the quality of the estimates that public pollsters make.
Less directly relevant, though worthy of note, is Hopkins’ (2009) brief study of a Wilder effect and a Whitman effect – the tendency for polls to overstate support for African American candidates and understate support for female candidates in statewide elections for Governor and U.S. Senator across the period from 1989 to 2006. His analysis of general election polls found a tendency to overstate support for African American candidates early in this period, but it disappeared after 1996, and polls never underestimated support for women. He extended his analysis to the 2008 Democratic primary series, looking specifically at the difference between poll support for Barack Obama and Hillary Rodham Clinton and their vote shares, and found that Obama consistently did slightly better in the elections than the polls suggested. This varied across states with the proportion of black voters: the polls were generally accurate in primary states with few black voters but consistently understated Obama support in states with many black voters. This comports with what Traugott and Wlezien (2009) found and is the opposite of the “Wilder effect” that would have been predicted among white voters. Hopkins did not observe any “Whitman effect” for Clinton during the 2008 primaries.
The analysis relies on data identified for this report, and focuses entirely on published state-level results from the two weeks preceding each primary or caucus for which polls were available. This means that we do not have data for all states. All told, there are 457 polls, 210 of which relate to the 38 Democratic elections and 247 to 36 Republican events.[3] The polls that we do have also are not equal, as there is great variation in survey practices, including survey mode, question wording, likely voter modeling, weighting procedures, and sample size. This analysis does not attempt to take account of these differences, in part because of the difficulty of obtaining complete information. Other analysis in the report does address some of these issues, and demonstrates fairly minimal effects. The poll estimates used in the analysis are simple averages of the results for each event. The specific variable of interest is the difference between the vote margin of the two leading candidates and the poll margin in the preceding two weeks:
(1st place vote – 2nd place vote) – (1st place poll – 2nd place poll).
Thus, the variable is positive when the winner outperforms the polls and negative when the winner underperforms, and it takes the value of “0” when the margins are equivalent. It is important to use a signed error term in place of the absolute error because this is informative about patterns of poll performance over time, as we will see.
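In code, the error variable just defined amounts to the following (a sketch of ours with illustrative names and numbers):

```python
def vote_poll_margin_gap(vote_1st, vote_2nd, poll_1st, poll_2nd):
    """Signed error: (1st-place vote - 2nd-place vote) minus the corresponding
    pre-election poll margin. Positive when the winner outperforms the polls."""
    return (vote_1st - vote_2nd) - (poll_1st - poll_2nd)

# Illustrative values: the winner took 55% to the runner-up's 40%, while polls in
# the final two weeks averaged 48% to 41% -- an 8-point underestimate of the margin.
print(vote_poll_margin_gap(55, 40, 48, 41))  # 8
```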
We start with basic descriptive statistics of poll errors during 2016. Table A.4 summarizes means (and standard deviations) for signed and absolute errors, first for all primaries and caucuses taken together and then for Democratic and Republican events separately. The signed errors in the first row indicate that the vote margin tended to exceed the poll margin across primaries and caucuses, by about 6.8 percentage points on average. This comports with the previous research, particularly Traugott and Wlezien (2009) but also Bartels and Broh (1989). The pattern was particularly pronounced for the Democrats, where the mean error in the vote-poll margin was 9.6 points, compared with only 3.8 points in Republican events. The absolute errors in the second row of Table A.4 reveal that this partisan “bias” in errors did not produce proportionately greater absolute error; indeed, the mean error for Democratic events was only 1.5 points higher on average, 13.1 vs. 11.6. That the polls performed about as well in absolute terms across the parties implies that signed errors tended to cancel out more for the Republicans than for the Democrats.
Table A.4 Primary Poll Performance in 2016: Mean Difference between Winner’s Vote and Poll Margins

| | All | Democrat | Republican |
| --- | --- | --- | --- |
| Signed error | 6.8 (14.6) | 9.6 (15.1) | 3.8 (10.8) |
| Absolute error | 12.4 (10.0) | 13.1 (12.5) | 11.6 (8.1) |
| n | 76 | 38 | 36 |

Note – Standard deviations in parentheses.
Timing is not everything, of course. Poll performance can depend on other factors, including the level of support in the polls itself. That is, in states where a candidate is dominating in the polls, we might expect a very big lead to shrink. Traugott and Wlezien (2009) observed such a pattern in the 2008 primaries, and they also revealed that the poll margins themselves varied over time.[4] Table A.5 shows bivariate correlations between the timing of the primary, the difference between the vote and poll margins, and the poll margins themselves. The top part of the table contains results for all 74 primaries and caucuses. Here we see that the vote-poll margin is negatively related to the winner’s poll margin, just as Traugott and Wlezien (2009) found. The error also is positively related to the number of days into the election year the primary occurs. The winner’s poll margin itself does not appear to increase (or decrease) over the process.
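The entries in Table A.5 below are Pearson correlations with two-tailed p-values. A sketch of how such figures could be computed – with illustrative file and column names, not the committee's code – is:

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per contest; file and column names are illustrative.
contests = pd.read_csv("primary_contest_errors_2016.csv")

def corr_with_p(x, y):
    """Pearson correlation and two-tailed p-value, the quantities reported in Table A.5."""
    r, p = pearsonr(x, y)
    return round(r, 2), round(p, 2)

for label, grp in [("All", contests), *contests.groupby("party")]:
    print(label,
          corr_with_p(grp["vote_poll_margin_gap"], grp["winner_poll_margin"]),
          corr_with_p(grp["vote_poll_margin_gap"], grp["days_into_year"]))
```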
Table A.5 Selected Correlates of Primary Poll Performance

| | Winner's vote-poll margin | Winner's poll margin |
| --- | --- | --- |
| All Primaries | | |
| Winner's poll margin | -0.30 (.01) | – |
| Number of days into election year | 0.20 (.09) | -0.03 (.79) |
| Democratic Primaries | | |
| Winner's poll margin | -0.33 (.04) | – |
| Number of days into election year | 0.02 (.88) | -0.28 (.09) |
| Republican Primaries | | |
| Winner's poll margin | -0.26 (.13) | – |
| Number of days into election year | 0.42 (.01) | 0.43 (.01) |

Note – Two-tailed p-values in parentheses.
The overall set of results conceals differences between the parties. First, the vote-poll margin is negatively related to the winner’s poll margin for both the Democrats and the Republicans, though only significantly so for the Democrats. Second, the vote-poll margin is positively related to the primary date for both parties, though the relationship is strong and statistically significant only for the Republicans, much as we would expect given Figures A.6 and A.7. Third, the winner’s poll margin also varies with the timing of the primary for both parties, though the relationship differs dramatically by party. That is, the poll margin for the Democratic winner tended to decrease over time whereas the poll margin of the Republican winner (Trump) tended to increase. This difference may – at least in part – reflect differences in the competitiveness of the race over time.
Figure A.6 The Polls and the Vote in the 2016 Democratic Presidential Primaries. Notes – Each entry in the figure is the difference in a state between the winner’s actual vote share and the share of the second place candidate minus the corresponding pre-election poll margin in the two weeks leading up to the election. A “C” indicates a Clinton win and an “S” represents a win by Sanders.
Figure A.7 The Polls and the Vote in the 2016 Republican Presidential Primaries.
Notes – Each entry in the figure is the difference in a state between the winner’s actual vote share and the share of the second place candidate minus the corresponding pre-election poll margin in the two weeks leading up to the election. A “T” indicates a Trump victory and an “O” is used to represent a win by some other candidate.
The bivariate analyses are useful, but they only take us part of the way toward explaining the estimation errors in the pre-primary polls; a multivariate analysis is required. Results of this analysis for the Democratic primaries are displayed in Table A.6. The first column contains results of a simple baseline regression containing the winner’s poll margin. As expected given Table A.5, we see that poll leads have a significant negative impact on the vote-poll margin difference. The coefficient (-.26) should not be taken to imply that poll leads generally shrink, as we have already seen. Rather, the greater the poll margin, the less the winner’s vote margin exceeded it – for each additional four points in poll margin, the winner’s vote-poll gap declines by about one percentage point. With a poll margin of about 50 points, we would predict no real difference between the vote and poll margins; with even larger margins, we would expect the poll margin to shrink by Election Day. The second column of Table A.6 adds the election timing variable. These results also are expected given what we have seen, as the campaign date just does not appear to matter for the vote-poll error in the Democratic primaries.
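In equation form, the Model 1 baseline fit reported in Table A.6 (below) is approximately

```latex
\widehat{\text{vote--poll gap}} \;=\; 13.53 \;-\; 0.26 \times (\text{winner's poll margin}),
```

which equals zero when the poll margin is about 13.53/0.26 ≈ 52 points; below that threshold the winner tends to outperform the polls, and above it the poll margin would be expected to shrink by Election Day.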
Table A.6 Regressions Predicting Winner’s Vote Margin Minus Poll Margin, Democratic Primaries

| | Model 1 (Baseline) | Model 2 |
| --- | --- | --- |
| Winner's poll margin | -0.26* (0.12) | -0.28* (0.13) |
| Number of days into election year | – | -0.04 (0.08) |
| Intercept | 13.53* (2.53) | 16.76* (7.72) |
| R-squared | 0.11 | 0.12 |
| Adj. R-squared | 0.09 | 0.07 |
| Root MSE | 14.43 | 14.60 |

Notes – N = 38; * p < .05 (two-tailed).
Table A.7 shows a slightly different structure on the Republican side. In the first column, the winner’s poll margin has a negative effect on the error – not statistically significant, but with virtually the same coefficient that we saw for the Democrats (-0.27 vs. -0.26). In the second column, we can see the strong association noted earlier between the election date and the vote-poll margin. Indeed, the coefficient implies that we expect the signed error to increase by one-third of a percentage point for each day of the nomination process. Given the intercept (-13.02), the result implies that the signed error would tend toward 0 through mid-February and then become increasingly positive, much as we saw in Figure A.7. When the campaign date is included, the effect of the winner’s poll margin doubles in size and easily exceeds even stringent levels of statistical significance. Based on these results, the errors in polls varied in fairly predictable ways in the 2016 nomination process, particularly in the Republican contests.[5]
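Written out, the Model 2 fit reported in Table A.7 (below) is approximately

```latex
\widehat{\text{vote--poll gap}} \;=\; -13.02 \;+\; 0.33 \times (\text{days into the year}) \;-\; 0.58 \times (\text{winner's poll margin}).
```

Setting the poll-margin term aside, the fitted gap crosses zero around day 13.02/0.33 ≈ 39 of the election year (in February) and grows increasingly positive thereafter, consistent with the pattern in Figure A.7.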
Table A.7 Regressions Predicting Winner’s Vote Margin Minus Poll Margin, Republican Primaries

| | Model 1 (Baseline) | Model 2 |
| --- | --- | --- |
| Winner's poll margin | -0.27 (0.18) | -0.58* (0.16) |
| Number of days into election year | – | 0.33* (0.07) |
| Intercept | 8.00* (3.57) | -13.02* (5.55) |
| R-squared | 0.07 | 0.41 |
| Adj. R-squared | 0.04 | 0.38 |
| Root MSE | 13.28 | 13.68 |

Notes – N = 36; * p < .05 (two-tailed).
Though polling misses in primary elections may be the rule more than the exception, there is a good amount of pattern to the errors we observed in 2016. To begin with, the polls tended to underestimate the winner’s vote margins. This tendency varied across candidates, being much more pronounced for insurgents, particularly early in the process. More generally, performance tended to vary across space and time. The larger the poll lead in a particular state, the less the vote margin exceeded the poll margin, and timing also mattered, at least for the Republicans. Other features of context might matter as well, and separate analysis suggests that the black population of a state positively influenced Clinton’s vote margin given the poll margin. (This parallels what Hopkins (2009) and Traugott and Wlezien (2009) found for Obama in 2008.) No such patterns were observed on the Republican side. While there are differences in the details, the general lesson is clear: poll performance in primaries differs from what we observe in general elections, and that performance itself varies in understandable ways across the electoral calendar, the level of support in each state, and the specific characteristics of the state as well.
A.E Testing for Shy Trump Mode Effects in National Polls Conducted September 1st and Later
The main characteristics that differentiate the polls are the number of days in the field, the use of a likely voter model, and whether the poll is a tracking poll. The number of days in the field has been considered an indicator of higher response rates and quality (Lau, 1994). The use of a likely voter model – instead of estimates based on registered voters – is thought to lead to better estimates, given the socio-demographic determinants of turnout. Finally, tracking polls use small samples every day and publish moving-average estimates; the generally small size of daily samples may have an impact on these estimates.
Table A.8 Profile of Polls by Mode of Administration

| | Total | Live phone | Web | IVR/Online |
| --- | --- | --- | --- | --- |
| Number of days in field | 4.2 | 4.5 | 4.2 | 2.9 |
| Use of LV model | 93% | 97% | 89% | 94% |
| Prop. tracking | 31% | 13% | 37% | 61% |
| Prop. nondisclosure | 6.6% | 4.3% | 8.5% | 5.6% |
As shown in Table A.8, these characteristics are related to mode of administration. Among the 160 polls conducted during the period under study, the average number of days in the field is 4.5 for live phone polls, 4.2 for online polls, and 2.9 for IVR + Internet polls. In addition, the incidence of tracking polls varied widely by mode, from 13 percent of live phone polls to 61 percent of IVR + Internet polls.[6] Finally, more than 90 percent of the polls used a likely voter model, and there was no difference between modes on this factor.[7] Table A.9 shows the impact of change over time and of design features on estimates of support for Trump over the two main candidates and of support for all the candidates. The sample of 160 polls is reduced to 156 because of missing information for four polls. Table A.9 shows that the change in voting intention during the period is best portrayed using a cubic model, at least in the case of support for Trump and for the third party candidates; support for Clinton follows a quadratic curve (an inverted U). This change over time explains around 13% to 15% of the variation in the estimates.
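A sketch of the kind of specification described here – a cubic time trend plus design features, with IVR + Internet as the reference mode – is shown below. The file and variable names are illustrative assumptions, not those used to produce Table A.9.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per national poll conducted September 1 or later; names are illustrative.
polls = pd.read_csv("national_polls_sep_onward.csv")
# 'time' = days elapsed since September 1; 'trump_two_way' = Trump's share of the
# two-candidate (Trump + Clinton) preference.

formula = (
    "trump_two_way ~ time + I(time ** 2) + I(time ** 3) + "   # cubic change over time
    "days_in_field + used_lv_model + tracking_poll + "        # design features
    "C(mode, Treatment(reference='IVR/Online'))"              # IVR + Internet as reference
)
fit = smf.ols(formula, data=polls).fit()
print(fit.params.round(2), fit.rsquared)
```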
Whatever the estimate used, the use of a likely voter model is not related to the estimates of support.[8] However, the number of days in the field is positively related to estimates of support for both Trump (+.38 per day) and Clinton (+.23 per day) and negatively related to support for the third party candidates (-.62), which means that polls with longer interviewing periods tended to record less support for third party candidates. Since support for these candidates tended to be too high relative to the vote, the results are in line with the idea that longer field periods indicate better methodology.
In addition, tracking polls estimate support for Trump more than 0.8 points higher than other polls when his support is measured as a share of the two main candidates, and almost one point higher when support is measured across all the candidates, with the difference spread across the estimates for the other candidates.
Finally, the impact of mode – i.e., web and live phone compared to IVR + Internet – after controlling for change over time and the different methodological features, is somewhat more complex. The coefficients show that live phone polls’ estimates of Trump’s support over the two main candidates are more than 1.7 points lower than IVR + Internet polls’ estimates. Web polls, however, also show a lower estimate of support for Trump, by more than two points.
The situation is somewhat different when we examine the impact of mode of administration on support measured across all the candidates: Web polls’ estimates for Trump are 2.5 points lower than IVR + Internet polls’ estimates, but there is no significant difference between live phone polls and IVR + Internet polls. Analyses of support for Clinton and for the third party candidates show a significant difference between live phone and IVR + Internet estimates of support for Clinton (+2.3) and for the other candidates (-1.5). However, there is no difference between web polls and IVR + Internet polls for these candidates. In summary, Trump systematically fared worse in Web polls, while Clinton fared better in live phone polls and the third party candidates fared better in IVR + Internet polls. Methodological characteristics explain 10%-13% of the variance in estimates.
We may therefore conclude that there is a difference between modes, but not one that would validate a Shy Trump hypothesis. For Trump, estimates differ mostly between the two types of self-administered polls, while for the other candidates the difference is between the interviewer-administered and the self-administered modes. It is, however, possible that these differences according to mode are due to different causes – that is, that the lower live phone estimates are due to Shy Trump supporters but the lower web estimates are due to other factors, such as sampling.
Another possibility is that Trump supporters were more likely than other respondents either to report being undecided or to refuse to reveal their preference. In this case, there should be a relationship between the proportion of nondisclosers (all those who do not reveal their preference) and the proportion of Trump supporters in the polls, and no such relationship for Clinton. However, we need to be careful because the proportion of nondisclosers is related to the methodological characteristics of the polls. First, as shown in Table A.8, there is a significant relationship between the proportion of nondisclosers and the mode of administration. The proportion of nondisclosers is 6.6% on average, but it is 4.3% for live phone polls, 5.6% for IVR/online polls, and 8.5% for Web polls.
Table A.10 Determinants of the Proportion of Nondisclosers

| | Coefficient | Sig |
| --- | --- | --- |
| Intercept | 5.33 | *** |
| Time | 0.01 | |
| Time squared | 0.00 | |
| Time cubic | 0.00 | |
| Days in field | -0.37 | ** |
| Use of LV model | 0.78 | |
| Tracking poll | 1.53 | ** |
| Live phone | 0.08 | |
| Online poll | 3.67 | *** |
| Explained variance | 41.5% | |

*: p<.05; **: p<.01; ***: p<.001
A regression with the proportion of nondisclosers as the dependent variable, presented in Table A.10, shows that there is no significant change in the proportion of nondisclosers over time. Two characteristics of the polls appear related to the proportion of nondisclosers: polls that stay in the field a greater number of days tend to show a lower proportion of nondisclosers (-.75 point per day in the field), while Web polls show proportions that are on average more than 3.6 points higher than those of IVR/online polls. These features explain a total of 41.5% of the variance in the proportion of nondisclosers.
If we control for all these features, what is the relationship between proportion of nondisclosers in the polls and estimates of support for the different candidates? The analysis shows that the proportion of nondisclosers is not related to the estimates of the support for Trump compared to Clinton. However, if we consider the estimates for all the candidates, we see that the higher the proportion of nondisclosers in a poll, the higher the estimates of support for Trump and for Clinton and the lower the support for the third party candidates. The proportion of nondisclosers explains more than 8% of the variance in estimates of the third party candidates.
Live phone polls use sampling frames that overlap: they combine lists of landline and cell phone telephone numbers. Since some cell phone users can also be reached by landline, they are more likely to be sampled. Although this overlap may be corrected by weighting, the procedure may not fully compensate. Do people who can be reached both by landline and by cell phone have different characteristics that are related to vote intention? If so, this could explain the fact that support for Clinton is higher in live phone polls than in IVR/online polls, where there is no such overlap of sampling frames.
If this hypothesis is true, there should be a relationship between the proportion of cell phones in the samples and estimates of support. More than a third of the polls (n=58) use the live phone mode and 57 provided the information on the proportion of cell phones in their samples. The average proportion of cell phones in these polls was 57% with a standard deviation of 10.6. The lowest proportion was 25% and the highest, 75%. The most common proportion was 65% (22% of the polls).
There is no consensus regarding the proportion of cell phones that should be included in the samples.
Table A.11 and Table A.12 show the results of the analyses of support for the candidates, controlling for change over time and for the other methodological features of the polls. They show that the proportion of cell phones has no impact on estimates for any of the candidates, whatever comparison is used. The variables that are significant – essentially the time variables and whether the poll is a tracking poll – explain between 26% and 37% of the variation in estimates of support for Trump and around 10%-18% of the variation in support for Clinton and for the other candidates. Therefore, the overlap in the sampling frames cannot explain the differences between the live phone polls and the IVR/online polls. This finding does, however, raise questions about the impact of including cell phones in the samples.
Having placed all the polls conducted online in the same category, we have seen that, on average, web polls trace a different portrait of change over time in support for Trump relative to the two main candidates. Tables A.11 and A.12 confirm that the estimates of change over time differ between live phone and web polls; the estimated linear and quadratic trends for Web polls are less than half the size of those for live phone polls.
Table A.11 Support for Trump from September 1st to Election Day

| | Sum of two main candidates – Live phone only | Sum of two main candidates – Web only | Sum of all candidates – Live phone only | Sum of all candidates – Web only |
| --- | --- | --- | --- | --- |
| Intercept | 46.49*** | 45.52 | 40.08 | 38.01 |
| Time variables | | | | |
| Time | -0.11*** | -0.06** | -0.12** | -0.05* |
| Time squared | 0.00* | 0.00* | 0.00* | 0.00 |
| Time cubic | 0.00* | 0.00* | 0.00* | 0.00** |
| Explained variance | 25.1% | 15.4% | 23.6% | 4.7% |
| Methods variables | | | | |
| Days in field | -0.17 | 0.16 | -0.23 | 0.42*** |
| Used LV model | -0.50 | -0.59 | 0.18 | -0.68 |
| Tracking poll | 2.56*** | 1.00 | 1.50 | 2.70*** |
| Explained variance | 37.9% | 15.3% | 26.3% | 25.5% |
| Variables specific to modes | | | | |
| % cell (live phone) | 0.02 | | 0.06 | |
| Panel (online) | | 1.50*** | | 2.71*** |
| River sampling | | -0.16 | | 0.48 |
| Explained variance | 37.6% | 28.4% | 30.0% | 53.5% |
| N | 58 | 80 | 57 | 80 |

*: p<.05; **: p<.01; ***: p<.001
Table A.12 Support for Clinton and Other Candidates from September 1st to Election Day

| | On the sum of the two main candidates | | On the sum of all candidates | |
|---|---|---|---|---|
| | Live phone only | Web only | Live phone only | Web only |
| Intercept | 45.68 *** | 45.98 *** | 14.24 *** | 6.01 *** |
| Time variables | | | | |
| Time | 0.08 | 0.06 ** | 0.04 | -0.01 |
| Time squared | 0.08 | 0.00 ** | 0.00 | 0.00 |
| Time cubic | 0.00 | 0.00 | 0.00 | 0.00 *** |
| Explained variance | 9.4% | 28.3% | 3.3% | 13.4% |
| Methods variables | | | | |
| Days in field | 0.06 | 0.15 | 0.17 | -0.58 *** |
| Used LV model | 1.30 | 0.30 | -1.48 | 0.38 |
| Tracking poll | -3.17 *** | 0.93 + | 1.66 | -3.63 *** |
| Explained variance | 17.8% | 47.5% | 3.3% | 54.7% |
| Variables specific to modes | | | | |
| % cell (live phone) | 0.03 | | -0.09 | |
| Panel (online) | | 0.05 | | -2.76 *** |
| River sampling | | 0.91 | | -1.39 * |
| Explained variance | 17.2% | 48.3% | 4.4% | 75.1% |
| N | 57 | 80 | 57 | 80 |

*: p<.05; **: p<.01; ***: p<.001
+ Highly significant before entering panel and river sampling (14% of variance, b=1.67)
However, the general category of online polls is quite heterogeneous. Our hypothesis was that since most online polls use panels, their pools of respondents may be more homogeneous, and that this may be related to estimates that differ from those produced by other modes of administration. Two characteristics of online polls bear on this homogeneity. Most online polls interview members of a standing panel, while others sample pools of respondents that differ with each poll; in addition, some pollsters complement their samples with river sampling[9]. Drawing fresh pools of respondents and using river sampling are both indicators of greater heterogeneity. The fact that panel recruitment is probability-based instead of opt-in could also play a role, but only one pollster, conducting two polls during the period, used a probability-based panel, so it is impossible to test this possibility. Finally, the number of panel members could also be related to homogeneity, but we could not obtain this information for all the pollsters.
Table A.11 shows that, all else being equal, using a panel to conduct online polls leads to estimates of support for Trump that are 1.5 points higher if we consider the sum of the two main candidates and 2.7 points higher when we consider all the candidates. The results of the two analyses differ. In the first case, the mode-specific variables by themselves explain an additional 13 percentage points of variance (28.4% − 15.3%), and only the use of a panel is significant; none of the other poll characteristics is. In the latter case, estimates of support for Trump are also about half a point higher per additional day in the field and 2.7 points higher in tracking than in non-tracking polls, and the mode-specific factors explain an additional 28 percentage points of variance.
When we examine the estimates of support for Clinton and for the third party candidates presented in Table A.12, we find that the use of a panel and of river sampling is not related to estimates of support for Clinton; it is associated only with estimates of support for the third party candidates. The coefficients show that estimated support for the third party candidates is more than half a point lower per additional day in the field and 3.6 points lower in tracking polls. It is also 2.8 points lower in surveys using panels and 1.4 points lower when river sampling is used. The mode-specific factors explain close to an additional 20 percentage points of variance.
Therefore, we may conclude that the indicators of homogeneity are related only to estimates of the relative share of support for Trump compared with the third party candidates. Greater likely homogeneity is associated with higher support for Trump and lower support for the third party candidates, an estimate that is closer to the final vote. It may be that web polls using panels have more control over their samples or over their weighting/adjustment procedures.
A.F Polling Aggregators in the Primaries
We examined three polling aggregators, resulting in information from five different estimation methods.
Aggregator methods
First, FiveThirtyEight had three different approaches to calculating predictions. For races where limited polling information was available, the FiveThirtyEight Polling Average is a simple weighted average of the available polls. For example, in the Missouri Republican Primary, only one poll was conducted during 2016 (March 3-10) and used in the FiveThirtyEight Polling Average (two other polls were conducted in 2015 – one in December and one in August).[10]
The FiveThirtyEight Polls Only predictions are based only on data that come from the polls themselves; FiveThirtyEight Polls Plus incorporates additional information into the prediction, including information from national polls and endorsements. We focus primarily on the FiveThirtyEight Polls Only and FiveThirtyEight Polls Plus predictions. FiveThirtyEight includes all polls unless the poll was conducted by a campaign or an affiliated PAC or super PAC, or the pollster is on FiveThirtyEight's list of banned pollsters. If a poll publishes estimates based on multiple populations (likely voters, registered voters, all adults), FiveThirtyEight limits the poll estimates to the closest representation of likely voters reported by the poll.
Second, Huffington Post Pollster uses information from the polls and the Cook Political Report to develop its predictions. To be included in the Pollster estimates, a poll has to meet a set of methodological disclosure requirements from the National Council on Public Polls, including dates of study, information about the sponsor, field dates, mode, sample frame, sample size, and question wording, among other information. Pollster excludes landline-only polls as well as polls that do not meet an editorial evaluation of adequate quality. Furthermore, HuffPollster excludes polls that do not ask about the horserace with closed-ended questions, and uses only unique sample points (excluding overlapping samples in rolling averages). If a poll publishes estimates based on multiple populations (likely voters, registered voters, all adults), HuffPollster also limits the poll estimates to the closest representation of likely voters reported by the poll. Finally, RealClearPolitics uses a simple unweighted average of polls. RealClearPolitics does not have a clear statement about which polls are considered eligible for inclusion in or exclusion from its estimates.
Aggregator Errors
Overall, the aggregators underestimated the size of the margin between the top two candidates. As shown in Table A.13, looking at all races, the average signed error in the horserace margin was -4.65, indicating that the predictions underestimated the margin by 4.65 percentage points. When the analysis is restricted to the contests where four estimates were made, the signed error decreases slightly to -3.99, so that the aggregators underestimated the margin by 3.99 percentage points. There were no significant differences between the aggregators in their signed prediction accuracy (F=0.04, p=0.99).
Turning to the absolute error, for all races, the average error across all of the aggregators was 8.32, indicating that the average difference between the margin calculated by the aggregators and the actual margin for the winner was 8.32 percentage points. When the analysis is restricted only to the contests where four estimates were made, the absolute error drops by about one percentage point to 7.34. Although significant differences appear across the aggregators in the absolute horserace error overall (F=5.92, p=0.0002), this is explained by different aggregators making predictions in different races. When only common races are examined, there are no significant differences across the aggregators in the average absolute error (F=0.09, p=0.96).
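For readers who want to reproduce these metrics, the sketch below shows one definition of the signed and absolute horserace errors that is consistent with the description above (the predicted winner's margin minus the actual margin, with the absolute error being its magnitude), along with a one-way ANOVA across aggregators like the F tests reported in Table A.13. The file and column names are hypothetical, not the committee's actual data layout.

```python
# Signed error: predicted winner's margin minus actual margin (negative values
# mean the margin was understated). Absolute error: magnitude of the signed error.
import pandas as pd
from scipy.stats import f_oneway

preds = pd.read_csv("aggregator_predictions.csv")    # one row per prediction
preds["signed_error"] = preds["predicted_margin"] - preds["actual_margin"]
preds["abs_error"] = preds["signed_error"].abs()

print(preds[["signed_error", "abs_error"]].mean())   # overall averages

# One-way ANOVA for differences across aggregators, as in the reported F tests
groups = [g["signed_error"].to_numpy() for _, g in preds.groupby("aggregator")]
print(f_oneway(*groups))
```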
Table A.13 Mean Signed and Absolute Horserace Error Overall and by Aggregator, All Contests and Contests with Predictions from All Firms

| | Signed Horserace Error | | | | | Absolute Horserace Error | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | N | Mean | Std. Dev. | Min. | Max | N | Mean | Std. Dev. | Min. | Max |
| All contests | | | | | | | | | | |
| Overall | 230 | -4.65 | 10.76 | -61 | 18 | 230 | 8.32 | 8.25 | 0 | 61 |
| FiveThirtyEight Polling Average | 4 | -18.75 | 31.91 | -52 | 11 | 4 | 27.25 | 22.29 | 6 | 52 |
| FiveThirtyEight Polls Only | 59 | -4.64 | 9.60 | -35 | 12 | 59 | 7.90 | 7.12 | 0 | 35 |
| FiveThirtyEight Polls Plus | 59 | -4.23 | 9.77 | -41 | 11 | 59 | 7.66 | 7.35 | 0 | 41 |
| Huffington Post | 53 | -3.77 | 11.99 | -61 | 18 | 53 | 8.57 | 9.13 | 1 | 61 |
| RealClearPolitics | 55 | -4.93 | 8.88 | -32 | 12 | 55 | 7.87 | 6.35 | 0 | 32 |
| F Test | F(4,225)=1.32, p=0.27 | | | | | F(4,225)=5.92, p=0.0002 | | | | |
| Common contests | | | | | | | | | | |
| Overall | 183 | -3.99 | 8.69 | -41 | 18 | 183 | 7.34 | 6.11 | 0 | 41 |
| FiveThirtyEight Polls Only | 46 | -4.13 | 8.84 | -35 | 12 | 46 | 7.26 | 6.45 | 1 | 35 |
| FiveThirtyEight Polls Plus | 46 | -3.98 | 9.48 | -41 | 11 | 46 | 7.33 | 7.15 | 0 | 41 |
| Huffington Post | 45 | -3.64 | 8.89 | -23 | 18 | 45 | 7.73 | 5.61 | 1 | 23 |
| RealClearPolitics | 46 | -4.20 | 7.77 | -23 | 12 | 46 | 7.07 | 5.23 | 0 | 23 |
| F Test | F(3,179)=0.04, p=0.99 | | | | | F(3,179)=0.09, p=0.96 | | | | |

Note: The Huffington Post did not make predictions for the Republican contest in California for candidates other than Trump on the final prediction date.
Error by Number of Aggregators Predictions. In general, contests for which more of the aggregators made predictions had smaller signed horserace errors than contests in which only one or two aggregators made predictions (Table A.14). The contests for which only one or two predictions were made included the Republican contests in Alaska (March 1), Alabama (March 1), Kansas (March 5), Kentucky (March 5), Missouri (March 15), New Jersey (June 7), Tennessee (March 1) and West Virginia (May 10) and the Democratic contests in Alabama (March 1), Missouri (March 15), and Utah (March 22).
Table A.14 Mean Signed and Absolute Horserace Error Overall and by Contest, by Number of Aggregators with Predictions

| Number of predictions | Overall | | | Republican | | | Democrat | | |
|---|---|---|---|---|---|---|---|---|---|
| | N | Mean | SD | N | Mean | SD | N | Mean | SD |
| Signed Error | | | | | | | | | |
| 1 | 4 | -9.25 | 30.02 | 3 | 5.00 | 11.53 | 1 | -52.00 | – |
| 2 | 13 | -14.00 | 21.94 | 9 | -18.00 | 25.43 | 4 | -5.00 | 6.68 |
| 3 | 30 | -4.03 | 10.11 | 9 | -6.11 | 12.33 | 21 | -3.14 | 9.20 |
| 4 | 183 | -3.99 | 8.70 | 91 | -3.49 | 9.31 | 92 | -4.48 | 8.06 |
| F test | | F=3.92, p=0.009 | | | F=5.04, p=0.003 | | | F=11.25, p<.0001 | |
| Absolute Error | | | | | | | | | |
| 1 | 4 | 20.25 | 21.34 | 3 | 9.67 | 5.51 | 1 | 52 | – |
| 2 | 13 | 18.15 | 18.35 | 9 | 23.78 | 19.37 | 4 | 5.50 | 6.14 |
| 3 | 30 | 8.43 | 6.75 | 9 | 8.78 | 10.36 | 21 | 8.29 | 4.79 |
| 4 | 183 | 7.34 | 6.11 | 91 | 7.36 | 6.65 | 92 | 7.33 | 5.57 |
| F test | | F=11.11, p<.0001 | | | F=10.06, p<.0001 | | | F=22.39, p<.0001 | |
Looking at all of the contests together, the average signed horserace errors for races with only one or two predictions were -9.25 and -14.00, respectively, compared to -4.03 and -3.99 for contests with three or four predictions (F=3.92, p=0.009). When we look at the Republican contests only, 91 of the 112 contests (81.25%) had predictions from all four aggregators. These 91 contests had a mean signed horserace error of -3.49, compared to -6.11 for the 9 contests with three predictions and larger errors for Republican contests with two or fewer predictions (F=5.04, p=0.003). There were 92 Democratic contests with predictions from all four aggregators, making up 77.97% of the total. Here, the average signed horserace error was -4.48 for contests with four predictions and -3.14 for contests with three predictions, with larger errors for the few contests with fewer predictions (F=11.25, p<.0001).
We see a similar pattern when examining absolute horserace error by the number of predictions – contests with more predictions had lower absolute error, on average, than contests with fewer predictions. Overall, the absolute error for contests with only one prediction was 20.25 percentage points, falling slightly to 18.15 percentage points for contests with two predictions, 8.43 percentage points for contests with three predictions, and 7.34 for contests with four predictions (F=11.11, p<.0001). This pattern generally held for both the Republican (F=10.06, p<.0001) and Democratic (F=22.39, p<.0001) contests. The average absolute error for the Republican contests with four predictions was 7.36 percentage points, compared to 8.78 for contests with three predictions, 23.78 for contests with two predictions, and 9.67 for contests with only one prediction. On the Democratic side, the average absolute error was 7.33 for contests with four predictions, compared to 8.29, 5.50, and 52 for contests with three, two, and one predictions, respectively.
Error by Contest and State. There was no significant difference in the signed horserace error between the Democratic and Republican contests for the aggregators, either overall (Republican contests = -4.64; Democratic contests = -4.66; t=-0.01, p=0.99) or for common contests (Republican contests=-3.49; Democratic contests = -4.48, t=-0.76, p=0.45).
There were, however, significant differences in the signed horserace error across states. Figure A.8 shows the signed horserace error, averaged across aggregators, for each state for the Republican and Democratic contests, by the number of predictions in each state. In most states, the average of the aggregator predictions understated the actual margin: among states with four predictions, the average across the aggregators overstated the margin between the two candidates in only six states for the Republican race and in only five states for the Democratic race.
Figure A.8 Average Signed Error Across Aggregators, by Contest, State, and Number of Predictions.
Similarly, there were no significant differences in absolute error of the aggregators’ predictions between the Democratic and Republican contests overall (Republican contests = 8.86, Democratic contests = 7.81, t=-0.96, p=0.34) or for the common contests (Republican contests = 7.36, Democratic contests = 7.33, t=-0.04, p=0.97).
As with signed error, there are significant differences in absolute error by state. Figure A.9 shows the absolute error for each state across the Republican and Democratic contests, by the number of predictions in each state. On the Republican side, there were four states with four predictions in which the average absolute error of the predictions was at least 10 percentage points – California (25.67), Louisiana (13), Oklahoma (19.75), and Pennsylvania (13.25). On the Democratic side, six states with four predictions had an average absolute error of at least 10 percentage points – Indiana (13.25), Maryland (10.50), Michigan (22.25), Oklahoma (10.50), South Carolina (16.5), and Wisconsin (10.50).
Figure A.9 Average Absolute Error Across Aggregators, by Contest, State and Number of Predictions
Error by winner’s percentage. Looking only at the 46 state contests with four predictions, there is a negative association between the winner’s percentage of the final vote and the signed horserace error. Overall, the correlation is -0.34 (p=0.02); that is, the larger the winner’s percentage, the more the prediction underestimated the difference between the percentage for the winner and the runner-up. Examining the 23 Republican and 23 Democratic races separately, the correlation is r=-0.45 (p=0.03) for the Republican races and a non-significant r=-0.22 (p=0.31) for the Democratic races (Figure A.10).
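The correlations reported here are simple Pearson correlations computed at the contest level. A minimal sketch of that calculation, using hypothetical file and column names, is shown below.

```python
# Sketch: winner's actual vote share vs. the mean signed error across
# aggregators, overall and separately by party (hypothetical data layout).
import pandas as pd
from scipy.stats import pearsonr

contests = pd.read_csv("contest_level_errors.csv")   # one row per contest
r, p = pearsonr(contests["winner_pct"], contests["mean_signed_error"])
print(f"overall: r={r:.2f}, p={p:.2f}")

for party, grp in contests.groupby("party"):
    r, p = pearsonr(grp["winner_pct"], grp["mean_signed_error"])
    print(f"{party}: r={r:.2f}, p={p:.2f}")
```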
Figure A.10 Average Signed Error Across Aggregators by Percent of Actual Vote for Winner, by Contest
We see a slightly different story for the association between absolute error and the percent of the actual vote for the winner, as shown in Figure A.11. Overall, there is no association between absolute error and the percent of the actual vote for the winner (r=0.15, p=0.32). We also see no association between absolute error and the percent of the actual vote for the winner in the Republican (r=0.26, p=0.22) and Democratic races (r=0.06, p=0.79).
Figure A.11 Average Absolute Error Across Aggregators by Percent of Actual Vote for Winner, by Contest
Correct Projection. Across all 232 projections, 23 (9.91%) were called incorrectly. Incorrect calls were equally likely to happen on both the Republican and Democratic sides of the primaries (χ²=0.02, p=0.90): 9.65% of the projections on the Republican side (11 out of 114) and 10.17% on the Democratic side (12 out of 118) were called for the wrong candidate. When looking only at states with four predictions, 18 of the 184 predictions (9.78%) were called incorrectly. As with the overall results, incorrect calls were equally likely to occur for both Republican (8 of 92, or 8.70%; all of the calls in Iowa and Oklahoma) and Democratic (10 of 92, or 10.87%; all of the calls in Indiana and Michigan, and two of the calls in Oklahoma) contests. Each of the aggregators was equally likely to have an incorrect call overall (χ²=0.65, p=0.96) and for commonly called contests (χ²=0.25, p=0.97).
As shown in Table A.15, the average signed error of the aggregator predictions of the margin is significantly larger in magnitude for contests that were called incorrectly (-15.74) than for contests that were called correctly (-3.42; t=5.53, p<.0001), both overall and for common contests only (-15.22 vs. -2.76; t=6.37, p<.0001). This pattern, and its magnitude, holds for both measures of error and for both the Republican and Democratic contests.
Table A.15 Average Signed and Absolute Error Overall and by Contest, by Whether the Contest Was Called Correctly

| | Signed Error | | | Absolute Error | | |
|---|---|---|---|---|---|---|
| | Incorrect calls | Correct calls | t | Incorrect calls | Correct calls | t |
| All contests | | | | | | |
| Overall | -15.74 | -3.42 | 5.53**** | 15.74 | 7.48 | -4.76**** |
| Republican | -15.64 | -3.45 | 3.29*** | 15.64 | 8.12 | -2.54* |
| Democratic | -15.83 | -3.40 | 4.81**** | 15.83 | 6.91 | -4.69**** |
| Common contests | | | | | | |
| Overall | -15.22 | -2.76 | 6.37**** | 15.22 | 6.48 | -6.35**** |
| Republican | -13.75 | -2.51 | 3.45*** | 13.75 | 6.75 | -2.96** |
| Democratic | -16.40 | -3.02 | 5.76**** | 16.40 | 6.22 | -6.61**** |

Note: *p<.05, **p<.01, ***p<.001, ****p<.0001
Errors and white non-college population in each state. One hypothesis for why the poll estimates differed from the actual margin is differential nonresponse bias from systematically missing white voters without a college degree (Silver 2016b). If this was the case in the primary contests, then the errors in the margin should be systematically associated with the percent of white non-college voters in each state.
Overall, the correlation between the share of white noncollege voters in each state and the signed horserace error is small and not statistically different from zero (r=-0.05, p=0.66). When we look at the Republican and Democratic races separately, we see similar results. For the 34 Republican contests, there is a nonsignificant small positive association (r=0.06, p=0.75) between the share of white noncollege voters and the mean signed error across the aggregators. For the Democratic contests, there is a nonsignificant negative association (r=-0.20, p=0.28) between the share of white noncollege voters and the mean signed error across the aggregators.
When we instead examine the actual values for each of the 112 Republican predictions and the 118 Democratic predictions from each aggregator, rather than the mean horserace prediction error across aggregators, the picture changes. As visualized in the right panel of Figure A.12, the correlation between the signed horserace error and the percent of white noncollege voters weakens for the Republican contests (r=-0.007, p=0.94), while the correlation for the Democratic contests reaches statistical significance (r=-0.19, p=0.03). Because the signed errors start close to zero, this negative correlation indicates that predictions in states with more white non-college voters were more likely to underestimate the margin on the Democratic side; this pattern did not hold on the Republican side.
Figure A.12 Signed Error Averaged Across Aggregators and for Each Aggregator by Percent of White Noncollege Voters, by Contest
The left panel of Figure A.13 examines the absolute horserace error for Republican and Democratic primary contests separately, averaged across the aggregators. Here, as with the average signed horserace error, there is no correlation overall between the share of white non-college voters and the absolute horserace error averaged over all of the aggregators (r=0.11, p=0.40). Furthermore, there is no correlation between the average absolute error across the aggregators and the share of white non-college voters in either the Republican (r=0.03, p=0.86) or Democratic races (r=0.19, p=0.29).
When we instead examine the correlation between the share of white non-college voters and the absolute prediction errors for each of the aggregators, rather than the average across the aggregators, we see a similar pattern as for signed error – no correlation for the Republican contests (r=-0.01, p=0.91) and a significant positive correlation for the Democratic contests (r=0.21, p=0.02). That is, absolute errors were greater for Democratic contests with more white non-college voters, but not for Republican contests. This pattern is shown in the right panel of Figure A.13.
Figure A.13 Absolute Error Averaged Across Aggregators and for Each Aggregator by Percent of White Noncollege Voters, by Contest
We now consider all of these factors simultaneously in a hierarchical linear model with the 230 predictions nested within 36 states across both the Republican and Democratic contests, with states as random effects, estimated using Stata’s xtmixed command. We then add the variables examined above (except for whether a correct prediction was made) to account for any confounding between the predictor variables.
Table A.16 presents regression coefficients and standard errors predicting both the signed horserace error and the absolute horserace error. For signed errors, the intraclass correlation coefficient (ICC) is 0.374, indicating that 37.4% of the variance in the signed errors lies between states; for absolute errors, the ICC is 0.295, indicating that 29.5% of the variance in absolute errors lies between states. The remainder in each case is within states, across aggregator predictions.
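A rough analogue of this specification, written in Python rather than Stata, is sketched below: state random intercepts with the prediction-level covariates added in a second step, and the ICC taken from the null model's variance components. This is illustrative only; the file and column names (abs_error, state, democratic, pct_white_noncollege, winner_pct, n_predictions, aggregator) are hypothetical.

```python
# Hierarchical linear model sketch: predictions nested within states,
# state random intercepts; ICC = between-state share of total variance.
import pandas as pd
import statsmodels.formula.api as smf

preds = pd.read_csv("aggregator_predictions.csv")    # 230 predictions, 36 states

# Null model: random intercept for state only
null = smf.mixedlm("abs_error ~ 1", data=preds, groups=preds["state"]).fit()
state_var = null.cov_re.iloc[0, 0]                   # between-state variance
resid_var = null.scale                               # within-state variance
print("state ICC:", state_var / (state_var + resid_var))

# Main-effects model adds the contest- and prediction-level covariates
main = smf.mixedlm(
    "abs_error ~ democratic + pct_white_noncollege + winner_pct + "
    "n_predictions + C(aggregator)",
    data=preds, groups=preds["state"]).fit()
print(main.summary())
```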
To be consistent with the multivariate discussion in the state-level polling section, we focus our interpretation of the multivariate models on the absolute horserace errors rather than the signed errors. Overall, Democratic contest predictions fared differently from Republican contest predictions, with Democratic contest predictions being more accurate (a negative coefficient on absolute error indicates smaller errors, that is, greater accuracy). Although the association between absolute horserace prediction errors and the percent of white non-college voters in a state was fairly flat for the Republican contests, errors increased in Democratic contests in states with a higher share of white non-college educated voters. Said another way, the Democratic contest predictions were less accurate in states with higher concentrations of white non-college voters.
As seen in the bivariate analyses, absolute errors were greater in states where the winner took a larger percentage of the vote. Additionally, states with more predictions available had smaller absolute errors. The FiveThirtyEight Polling Average fared worse than the other aggregator predictions, but that estimate was produced only for contests where few polls were available.
Table A.16 Regression Coefficients and Standard Errors (in parentheses) from Hierarchical Linear Model Predicting Signed Error and Absolute Error, Aggregators Only

| | Signed horserace error | | | Absolute horserace error | | |
|---|---|---|---|---|---|---|
| | Null | Main effects | Interaction | Null | Main effects | Interaction |
| Democratic contests | | 5.69**** (1.24) | 5.52**** (1.24) | | -3.70**** (1.00) | -3.49**** (0.99) |
| % white non-college voter population | | -0.10 (0.14) | -0.01 (0.15) | | 0.06 (0.10) | -0.05 (0.11) |
| Democratic contests × % white non-college voter population | | | -0.16 (0.10) | | | 0.18* (0.08) |
| % for winner | | -0.48**** (0.06) | -0.47**** (0.06) | | 0.25**** (0.05) | 0.24**** (0.05) |
| Number of predictions | | 2.81* (1.32) | 3.04* (1.33) | | -4.19**** (1.02) | -4.54**** (1.03) |
| Firm (reference=RCP) | | | | | | |
| FiveThirtyEight Polling Average | | -3.13 (4.56) | -2.65 (4.53) | | 8.65* (3.65) | 8.01* (3.60) |
| FiveThirtyEight Polls-Only | | 0.33 (1.38) | 0.38 (1.36) | | -0.02 (1.11) | -0.06 (1.09) |
| FiveThirtyEight Polls-Plus | | 0.74 (1.38) | 0.78 (1.36) | | -0.25 (1.11) | -0.30 (1.09) |
| Huffington Post | | 0.88 (1.43) | 0.88 (1.41) | | 0.70 (1.15) | 0.69 (1.13) |
| Intercept | -4.95**** (1.29) | -7.68 (6.20) | -18.58**** (4.96) | 8.74**** (0.91) | 11.52* (4.83) | 13.22** (4.84) |
| Variance components | | | | | | |
| SD State random effects | 6.76**** | 7.15**** | 7.25**** | 4.57**** | 4.99**** | 5.22**** |
| SD Residual variance | 8.74 | 7.29 | 7.22 | 7.07 | 5.89 | 5.78 |
| State ICC | 0.374 | 0.491 | 0.502 | 0.295 | 0.418 | 0.449 |

Note: n=230 predictions in 36 states; *p<.05, **p<.01, ***p<.001, ****p<.0001; % White non-college population grand mean centered at 44.18879; Winner percent grand mean centered at 53.67931.
A.G Using Callback Studies to Look at the Stability of Vote Preference
With tracking data collected on independent samples of respondents, it is difficult to disentangle changes in sample composition from changes in individual voting intentions. Both Pew Research Center’s American Trends Panel (ATP) and YouGov’s Cooperative Congressional Election Study (CCES) recontacted, after November 9, respondents who had been interviewed before the election. Table A.17 shows the relationship between pre-election voting intention and post-election vote report for registered voters who were interviewed in the October and November waves of Pew’s ATP. Table A.18 performs the same analysis for YouGov’s CCES. Data are not weighted and exclude respondents who missed either wave, were undecided, or did not plan to vote.
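A table of this form is, in essence, a crosstab of post-election reported vote by pre-election intention expressed as percentages of all retained respondents. A minimal sketch, using hypothetical file and column names and unweighted data as in the tables, is shown below.

```python
# Sketch: build a Table A.17/A.18-style crosstab from recontact data.
import pandas as pd

panel = pd.read_csv("recontact_panel.csv")
keep = panel.dropna(subset=["pre_intent", "post_vote"])   # answered both waves

table = pd.crosstab(keep["post_vote"], keep["pre_intent"],
                    normalize="all", margins=True) * 100   # % of all respondents
print(table.round(1))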
The pattern of stability is similar in Tables A.17 and A.18. There was hardly any switching between the two major party candidates, though fewer Clinton supporters in the pre-election wave turned out to vote than Trump supporters. There were losses for the Libertarian and Green candidates: Johnson supporters broke toward Trump, while Stein supporters broke toward Clinton, but the overall numbers were relatively small. These data indicate that there was a small gain for Trump between the pre- and post-election interviews, but that the amount of switching was small compared to the swings seen in pre-election tracking surveys. This suggests that a substantial portion of the variation in pre-election tracking was compositional. Actual switches in voting intention were rare. Only about 1-in-20 respondents reported voting for a different candidate in the post-election interview from the one they supported in their pre-election interview. The net shift toward Trump was quite small and occurred mostly through differential turnout.
One cautionary note about this analysis, however, is the possibility of time-in-panel effects. The ATP panelists were asked about their vote preference roughly monthly during 2016, and the CCES panelists were asked multiple times as well. There is some concern that answering the vote choice question again and again could have made the election more salient for these panelists than for typical adults. If the panelists were more engaged and more likely to have solidified their vote choice, then the data here would potentially understate the real level of vote choice fluctuation among the electorate. It seems unlikely that time-in-sample effects would meaningfully undermine these data, but the possibility deserves mention.
Table A.17 Differences between Pre-election Vote Intention and Reported Vote (Pew ATP)

| | Pre-election interview | | | | |
|---|---|---|---|---|---|
| Post-election interview | Trump | Clinton | Johnson | Stein | Total |
| Trump | 36.4 | 0.5 | 0.9 | 0.3 | 38.0 |
| Clinton | 0.2 | 46.0 | 0.5 | 0.5 | 47.2 |
| Johnson | 0.4 | 0.2 | 3.4 | 0.2 | 4.1 |
| Stein | 0.1 | 0.1 | 0.1 | 1.2 | 1.3 |
| Did not vote | 2.9 | 3.9 | 1.7 | 0.8 | 9.3 |
| Total | 40.0 | 50.6 | 6.5 | 2.9 | 100.0 |
Table A.18 Differences Between Pre-election Vote Intention and Reported Vote (YouGov CCES)

| | Pre-election interview | | | | |
|---|---|---|---|---|---|
| Post-election interview | Trump | Clinton | Johnson | Stein | Total |
| Trump | 30.6 | 0.5 | 0.7 | 0.4 | 32.2 |
| Clinton | 0.4 | 35.1 | 0.5 | 0.5 | 36.5 |
| Johnson | 0.3 | 0.3 | 3.2 | 0.2 | 4.1 |
| Stein | 0.0 | 0.0 | 0.1 | 1.3 | 1.4 |
| Did not vote | 7.3 | 12.1 | 3.8 | 2.6 | 25.8 |
| Total | 38.5 | 48.2 | 8.2 | 5.1 | 100.0 |
A.H Adjusting on a More or Less Detailed Education Variable
Table A.19 compares the education distribution of registered voters in four samples (the October wave of the ATP, the combined ABC News/Washington Post daily tracking data, the CCES pre-election wave, and the combined Survey Monkey (SM) tracking data) with the 2012 CPS Voting and Registration Supplement. The data have been weighted using the post-stratification weight created by each survey organization. Except for the Pew ATP (which slightly over-represents registered voters without a high school degree), the samples have too few high school dropouts. The two online samples have collapsed “less than high school” and “high school graduates” into a single weighting cell, so they substantially underrepresent the number of persons in the lowest education category. This is offset by an over-representation of the next education category (high school graduates).
Table A.19 Weighted Percentage of Education Level in Five Samples of Registered Voters

| | Education | | | | |
|---|---|---|---|---|---|
| Survey | No HS Degree | HS Graduate | Some College | College Graduate | Post Graduate |
| CPS 2012 | 7.2 | 27.2 | 31.4 | 22.2 | 12.0 |
| Pew Research Center | 9.0 | 27.5 | 33.4 | 18.6 | 11.5 |
| ABC News/Washington Post | 6.1 | 25.1 | 23.0 | 32.1 | 13.7 |
| CCES 2016 | 2.7 | 32.6 | 32.6 | 19.9 | 12.2 |
| SurveyMonkey | 2.6 | 30.1 | 31.7 | 19.0 | 16.7 |
In the 2016 election, there was a strong correlation between education and candidate preference among white voters. (According to the NEP exit poll, Trump won whites without a college degree 66-29, compared to 48-45 among whites with a college degree. This is almost three times as large a difference as in the 2012 presidential election.) While this could potentially affect vote estimates, it does not appear that it was a significant source of error. A simple correction, post-stratifying the survey weights by five education categories, results in only small changes in the pre-election vote estimates (less than 0.4 percent) and no systematic improvement.
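The "simple correction" referred to here is a ratio adjustment of the existing weights so that five education cells match the CPS registered-voter distribution. A minimal sketch, assuming hypothetical file and column names and using the CPS 2012 shares from Table A.19 as targets, follows.

```python
# Sketch: post-stratify survey weights so five education categories match CPS.
import pandas as pd

cps_targets = pd.Series({                 # CPS 2012 shares from Table A.19
    "No HS degree": 0.072, "HS graduate": 0.272, "Some college": 0.314,
    "College graduate": 0.222, "Post graduate": 0.120})

svy = pd.read_csv("poll_microdata.csv")   # has 'weight', 'educ5', 'vote'
current = svy.groupby("educ5")["weight"].sum()
current = current / current.sum()         # current weighted education shares
svy["weight_ps"] = svy["weight"] * svy["educ5"].map(cps_targets / current)

# Vote estimates before and after the education post-stratification
for w in ("weight", "weight_ps"):
    shares = svy.groupby("vote")[w].sum() / svy[w].sum()
    print(w, (100 * shares).round(1).to_dict())
```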
A.I Adjusting on Political Party Affiliation
In pre-election polls, there is the possibility of non-demographic sample skews, such as party identification. If a sample contains too many respondents of one party, it is almost certain to overestimate the vote for that party, and weighting the sample to the population proportion of party identifiers could measurably improve the accuracy of vote estimates. Party identification (ID) weighting, however, is controversial. First, there is no widely accepted benchmark for party ID. Second, even if there were an authoritative source that could be used, party ID varies over time, so weighting to an out-of-date target could mask party ID swings and make the party-weighted estimates less, rather than more, accurate.
Even though the population distribution of party ID is not available and party ID changes over time, different distributions of party ID in data collected over the same period are evidence of partisan selection effects. Day-to-day variations in sample composition could be due to shifts in party ID, but the observed daily shifts are implausibly large. Panel data on three-category party ID exhibit a high degree of stability and showed no trend toward one party or the other during the 2016 campaign. For example, in Pew Research Center’s American Trends Panel (ATP), 90 percent of Democrats and Republicans chose the same party ID in the post-election survey as in the pre-election survey. The instability was almost entirely to or from the independent category, with offsetting movements in each direction. Hardly anyone (27 out of 3,961 respondents) changed from one party to the other.
There is, however, evidence that differences between the partisan composition of different polling samples—even after demographic weighting—resulted in substantial differences in vote estimates. Figure A.14 shows the relationship between the (weighted) sample party ID distribution and daily vote estimates in two tracking polls (the online survey is SurveyMonkey and the phone tracking is ABC). Party ID is measured by the percentage of Democrats in the sample, less the percentage of Republicans. Vote intention is the percentage of voters intending to vote for Clinton, less the percentage intending to vote for Trump. The online survey had much larger sample sizes (over 5,000 most days, and over 10,000 per day for most of the final week of the campaign) than the phone tracking (about 264 respondents on the average day). As a consequence, the phone tracking is much noisier, but the pattern is very simple in both samples: an extra percent of Democratic party identifiers in the sample increases Democratic vote intention by about one percent. That said, it is important to note that neither SurveyMonkey nor ABC/Washington Post published estimates for single day interviewing. Their weighting protocols are not designed for that purpose. So in a sense, this analysis overstates how much of a problem daily variation is for such polling organizations.
Figure A.14 Relationship Between Sample Composition of Party Identifiers and Vote Intention for an Online Survey (Left Panel) versus Live Phone Survey (Right Panel). Source: Left panel is SurveyMonkey data; Right panel is ABC/Washington Post data
Note – Neither SurveyMonkey nor ABC/Washington Post published estimates for single day interviewing.
This is similar to the pattern found by Gelman and colleagues (2016) in 2012 Presidential election polling. On some days, the samples contained as much as 20 percent more Democrats than Republicans, while on other days there were more Republicans than Democrats. These partisan surges were associated with campaign events and gave the impression that there were large swings in voting intentions, despite the use of demographic weighting. In 2016, similar large swings were seen in both online and telephone tracking (as shown in the left panel of Figure A.15). On some days, Clinton led by as much as 15 percent, while on others Trump led by nearly as much. The size of these swings is much larger than could possibly be due to sampling variability. (The standard error of daily differences in lead is approximately 2 percent for the large online samples and about 8 percent for the smaller daily phone tracking samples.)
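The sampling-error figures in the parenthetical above can be checked with a back-of-envelope calculation: under simple random sampling with the two leading candidates each near 50 percent, the variance of the lead (Clinton minus Trump) is roughly 1/n, so the difference between two independent daily leads has a standard error of about sqrt(2/n). The sketch below applies this to the approximate daily sample sizes cited; accounting for third-party share would bring the second number down slightly, and weighting design effects would push both up somewhat.

```python
# Back-of-envelope check of the quoted standard errors: Var(lead) ~ 1/n when
# both leading candidates are near 50 percent, so the day-to-day change in
# the lead has SE of roughly sqrt(2/n).
import math

for n in (5000, 264):    # large online daily sample vs. typical daily phone sample
    se = 100 * math.sqrt(2.0 / n)
    print(f"n={n}: SE of daily change in lead is about {se:.1f} points")
# n=5000 -> about 2 points; n=264 -> just under 9 points, close to the
# 2 percent and 8 percent figures quoted above.
```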
Figure A.15 Daily Clinton Lead Using Survey Weights (left) and Post-stratified by Party Identification (right). Notes – Online survey shown in red, phone tracking in blue. Neither SurveyMonkey nor ABC/Washington Post published estimates for single day interviewing.
The weights can be adjusted to reflect a constant distribution of party ID, rather than a dynamic party ID target. To explore this, we pooled each tracking sample to estimate the proportions of Democrats, Republicans, and Independents in the organization’s tracking data. The base survey weights were then post-stratified by the proportions in each party ID group, so that the weighted proportions of Democrats and Republicans would be the same each day. Daily leads were computed using these weights and are shown in the right panel of Figure A.15. About half of the daily variation in the Clinton lead is removed by reweighting with a static party ID target. The online sample shows no evident trend over the last month of the campaign. The phone tracking data is noisier, but the daily differences are now in the range that could be explained by normal sampling variation.
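A minimal sketch of this reweighting, assuming a hypothetical microdata layout ('date', 'party3', 'weight', 'vote'), is shown below: pool the tracking data to obtain a fixed party-ID distribution, then post-stratify each day's base weights so the weighted Democrat/Republican/Independent shares are constant across days.

```python
# Sketch: post-stratify each day's base weights to a static party-ID target.
import pandas as pd

track = pd.read_csv("tracking_microdata.csv")   # 'date', 'party3', 'weight', 'vote'

# Static target: weighted party-ID shares over the whole tracking series
target = track.groupby("party3")["weight"].sum()
target = target / target.sum()

def poststratify_day(day):
    shares = day.groupby("party3")["weight"].sum()
    shares = shares / shares.sum()
    out = day.copy()
    out["weight_pid"] = out["weight"] * out["party3"].map(target / shares)
    return out

adj = track.groupby("date", group_keys=False).apply(poststratify_day)

def clinton_lead(df, w):
    shares = df.groupby("vote")[w].sum() / df[w].sum()
    return 100 * (shares.get("Clinton", 0.0) - shares.get("Trump", 0.0))

daily = adj.groupby("date").apply(
    lambda d: pd.Series({"raw": clinton_lead(d, "weight"),
                         "party_adjusted": clinton_lead(d, "weight_pid")}))
print(daily.round(1))
```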
While this analysis demonstrates that adjusting for party ID can reduce day-to-day variability in poll estimates, it is not clear that such an adjustment reduces bias. For at least one poll, adjusting on a close cousin – political ideology – did not help. According to Blumenthal (2016), SurveyMonkey’s practice of weighting to match its own previous results for ideology had the effect of boosting Clinton’s vote total in the final week of the campaign. Dropping the ideology smoothing would have reduced Clinton’s margin in the final week from +6.0 to +4.8 (relative to a vote outcome of +2.1). It is also worth noting that the approach tested here (i.e., pooling all the tracking data collected by a firm during an election season) is not something that can be implemented in practice. In practice, the pollster only has data for the interviewing conducted to date, as opposed to the entire tracking series, which includes interviewing conducted in the future.
A.J Approaches to Likely Voter Modeling
The assumptions that pollsters make about turnout and the methods they use to measure and model the likely electorate vary widely. More than a quarter century after Irving Crespi (1988) described identifying likely voters as “a major measurement problem in pre-election polling,” this aspect of survey design remains a combination of science and art, with few pollsters taking the same approach. While a complete assessment of the various pollster likely voter models is beyond the scope of this report, we can summarize some of the most common approaches taken. Some pollsters make direct assumptions about the demographic and geographic composition of the likely electorate, and apply quotas or weights (or, more formally, pre or post-stratification) to assure that their final samples match these assumptions. One pollster, for example, weighted their Pennsylvania poll “to match expected turnout demographics for the 2016 General Election.” While easier to explain and understand, this relatively direct approach is not the most typical.
More often, the assumptions that pollsters make about turnout are not about voter demographics directly, but rather about the techniques and mechanisms they use to select, screen for or otherwise model the likely electorate. The voter demographics that result are more a byproduct of their respective approaches than some deliberate and explicit set of assumptions. Again, the specific techniques vary widely. Many begin by attempting to interview a random sample of all adults. They will weight their full adult sample to match the known demographics of the adult population as measured by the U.S. Census. They will then use some mechanism to select or model the “likely voters” from within their sample of all adults, and allow their demographics to vary without additional weighting. This selection process can be a straightforward screen based on the answers to one or more survey questions, or it can be based on an index constructed from as many as seven or eight questions, with a cut-off between likely and unlikely voters made at some level of the index. Some attempt to calibrate their cut-off point to some “assumption” about the level of coming turnout: pollsters will select a smaller fraction of their sample of adults as likely voters if they expect a lower turnout, and a larger fraction if their assumptions point to a bigger turnout. A sketch of this cut-off approach appears below.
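The sketch below illustrates the general shape of such a cut-off model: an index built from several screening items, with the share of adults retained as "likely voters" calibrated to an assumed turnout level. The item names, scoring, and turnout assumption are all hypothetical and are not any particular pollster's actual model.

```python
# Sketch: cut-off style likely-voter selection from an index of screening items.
import pandas as pd

adults = pd.read_csv("adult_sample.csv")     # already weighted to Census adult totals

items = ["intends_to_vote", "voted_in_2012", "knows_polling_place",
         "follows_campaign", "is_registered"]        # 0/1 screening indicators
adults["lv_index"] = adults[items].sum(axis=1)

assumed_turnout = 0.55                       # assumed turnout among adults
cutoff = adults["lv_index"].quantile(1 - assumed_turnout)
likely = adults[adults["lv_index"] >= cutoff]        # demographics now float freely

print("share of adults retained:", round(len(likely) / len(adults), 2))
# Ties at the cutoff mean the retained share only approximates the assumption.
```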
Other pollsters screen for registered or likely voters during the interview, retaining no demographic information about the non-voters they screen out. For the purposes of weighting, such pollsters are far more likely to make direct assumptions about the demographics of the electorate since they cannot weight to match all adults. Some will weight to match the geographic distribution of likely voters based on previous vote counts at the county or town level (on the theory that such data is both readily available and precise), but not weight or adjust the demographics of selected likely voters (on the theory that benchmarks of past demographics are often conflicting and less reliable).
Pollsters who weight to match “expected” demographics often differ in the sources they use to set their weighting targets, drawing variously from past exit polls, the CPS Voting Supplement surveys, estimates drawn from official “voter file” records, or some combination of the three.
Some pollsters sample directly from voter files as a means of more accurately selecting likely voters, by restricting potential respondents to those actually registered to vote or with some past history of voting. Among pollsters who use RBS, some may only use the list to identify the households of registered voters, using survey questions and random methods to select a “likely voter” within each household. Others may request a specific voter by name, with that person selected based on their prior history of voting, sometimes determined from a complex statistical model.
In recent years, some pollsters have moved to increasingly advanced and complex efforts to model the likely electorate. These include the so-called “analytics” surveys, which leverage techniques like multilevel regression and poststratification (MRP). Examples include YouGov (Rivers and Lauderdale 2016), Morning Consult (2016) and the approach used by Corbett-Davies, Gelman and Rothschild to model a New York Times Upshot survey in Florida (Cohn 2016c). Again, this listing covers only some of the more prominent features of the methods used to model likely voters. Examine the methods used by any one pollster, and you will likely find combinations of the approaches listed above, where the explicit assumptions range from relatively scant and hands-off to heavy and highly complex.
References
Bartels, L. M., and Broh, C. A. (1989), “A Review: The 1988 Presidential Primaries,” Public Opinion Quarterly, 53(4), 563-589.
Beniger, J. R. (1976), “Winning the Presidential Nomination: National Polls and State Primary Elections, 1936-1972,” Public Opinion Quarterly, 40(1), 22-38.
Blumenthal, M. (2016), “The Latest Data and Methodological Information on How SurveyMonkey Measures Shifts in Voter Sentiment.” Retrieved from https://blog.electiontracking.surveymonkey.com/2016/12/22/looking-back-at-2016-what-weve-learned-so-far/.
Blumenthal, M., Cohen, J., Clinton, J., and Lapinsky, J. (2016), “Why The NBC News/SurveyMonkey Poll Now Tracks Likely Voters,” NBC News, September 10, 2016.
Cohn, N. (2016), “We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results,” New York Times, September 20, 2016. Retrieved from https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?_r=0.
Crespi, I. (1988), Sources of Accuracy and Error in Pre-Election Polling. New York: Sage.
Gelman, A., Goel, S., Rivers, D., and Rothschild, D. (2016), “The Mythical Swing Voter,” Quarterly Journal of Political Science, 11(1), 103-130.
Hopkins, D. J. (2009), “No More Wilder Effect, Never a Whitman Effect: When and Why Polls Mislead about Black and Female Candidates,” The Journal of Politics, 71(3), 769-781.
Lau, R. R. (1994), “An Analysis of the Accuracy of ‘Trial Heat’ Polls During the 1992 Presidential Election,” Public Opinion Quarterly, 58(1), 2-20.
Morning Consult (2016), “How We Constructed Our 50-State Snapshot.” Retrieved from https://morningconsult.com/2016/09/08/constructed-50-state-snapshot/.
Rivers, D., and Lauderdale, B. (2016), “The YouGov Model: The State of the 2016 Election,” October 4, 2016. Retrieved from https://today.yougov.com/news/2016/10/04/YouGov-Model-State-of-2016/.
Silver, N. (2016b), “Pollsters Probably Didn’t Talk to Enough White Voters Without College Degrees,” FiveThirtyEight.com, December 1, 2016. Retrieved from https://fivethirtyeight.com/features/pollsters-probably-didnt-talk-to-enough-white-voters-without-college-degrees/.
Traugott, M. W., and Wlezien, C. (2009), “The Dynamics of Poll Performance During the 2008 Presidential Nomination Contest,” Public Opinion Quarterly, 73, 866-894.
[1] Note that there are 36 polls by a Democratic pollster in the sample and 24 are taken in a Democratic primary contest. These 24 polls were all done by PPP using an IVR methodology. There are 60 polls by a Republican affiliated pollster, and 41 of those are taken in a Republican primary. Republican polls in Democratic contests were done by a variety of firms including: Gravis, Magellan Strategies, TargetPoint, Landmark and Mitchell.
[2] Mixed primaries are the omitted category.
[3] We do not have polls in the last two weeks for both the Democratic and Republican events in the following states: Alaska, Hawaii, Maine, Minnesota, Montana, Nebraska, New Mexico, North Dakota, South Dakota, Washington and Wyoming. Polls are also missing for the Democratic primary in Kentucky and the Republican events in California, Colorado, and New Jersey.
[4] That said, it is important to note that they focused on the winner’s share of the top two candidates’ poll shares in each primary.
[5] Analysis incorporating an interaction between the number of days and the winner’s poll margin significantly improves the fit of the model and increases the estimated effect of that margin, but indicates that its impact may decrease over time.
[6] Note that tracking polls are entered in the database only once for a given field period in order to avoid any dependency in the data.
[7] When pollsters published two types of estimates, only the likely voter estimate was retained in this analysis. Therefore, the analysis performed here does not compare likely voter and registered voter estimates for the same polls, but rather for different polls, usually conducted by different pollsters.
[8] This is congruent with Blumenthal, Cohen, Clinton and Lapinsky (2016) who showed little difference between likely voters and registered voters.
[9] River sampling is a way of recruiting respondents who are not in the original samples, using a procedure that asks internet users selected at random (for example, through online intercepts) to complete the survey.
[10] http://projects.fivethirtyeight.com/election-2016/primary-forecast/missouri-republican/