License: CC BY 4.0
arXiv:2312.15179v1 [stat.ME] 23 Dec 2023

Evaluating District-based Election Surveys with Synthetic Dirichlet Likelihood

Adway Mitra, Palash Dey

Indian Institute of Technology Kharagpur
Abstract

In district-based multi-party elections, electors cast votes in their respective districts. In each district, the party with maximum votes wins the corresponding “seat” in the governing body. Election surveys try to predict the election outcome (vote shares and seat shares of parties) by querying a random sample of electors. However, the survey results are often inconsistent with the actual results, which could be due to multiple reasons. The aim of this work is to estimate a posterior distribution over the possible outcomes of the election, given one or more survey results. This is achieved using a prior distribution over vote shares, election models to simulate the complete election from the vote share, and survey models to simulate survey results from a complete election. The desired posterior distribution over the space of possible outcomes is constructed using Synthetic Dirichlet Likelihoods, whose parameters are estimated from Monte Carlo sampling of elections using the election models. We further show that the same approach can also be used to evaluate the surveys (whether they were biased or not) based on the true outcome once it is known. Our work offers the first-ever probabilistic model to analyze district-based election surveys. We illustrate our approach with extensive experiments on real and simulated data of district-based political elections in India.

1 Introduction

Elections are conducted by almost all democratic countries to choose representatives for governing bodies, such as parliaments. A common democratic setup is the district-based system in which the country is spatially divided into a number of regions called districts (or constituencies). There is a seat in the governing body corresponding to each district. The residents of each district elect a representative from a set of candidates, according to any voting rule. In many countries, these candidates are representatives of political parties, and electors may cast their votes in favour of the parties rather than individual candidates.

The election results are understood in terms of the number of seats won by different parties, rather than the total number of votes obtained by them. If the relative popularity of the different parties is spatially homogeneous across all the districts, then the most popular party may win all the seats. But this is very rarely the case. One reason is that the individual popularity of candidates may vary. But a more complex reason is the spatial variation of demography across the country, since the popularity of different parties often varies with demography [6]. Demography varies spatially as people usually prefer to choose residences based on social identities, such as race, religion, language, caste, profession and economic status. This process is sometimes called “ghettoization”, where people with similar social identities huddle together in pockets [7, 8]. Such ghettoization plays a very important role in district-based elections if different political parties represent the interests of different social groups. Even if a political party is not popular overall, it can win a few seats if its supporters are densely concentrated in a small number of districts, which form strongholds of the party. On the other hand, a party which is quite popular overall may fail to win many seats if its supporters are spread all over without concentration. Also, electors often vote according to the advice of local community leaders and other local factors [5], which causes “polarization” of voters in favour of one or two parties inside each district.

Surveys are often carried out to forecast the election results. These surveys may be conducted by various agencies before or after the election. Usually a survey involves a small sample of the electorate, based on whose responses the vote share of the different parties is estimated. The number of seats to be won by the different parties can be estimated as well from this sample. However, the accuracy of these estimates depends on how well these samples represent the entire population. For example, the chosen samples may cover only a few districts, or misrepresent the true vote share of the different parties. This may arise either due to practical constraints (such as the difficulty of reaching certain geographical areas) or due to malicious intent or partisan bias of the survey agency. This gives rise to two complementary questions: i) Given a survey method and results, can we predict the true results of the election? ii) Once the full results of the election are known, can we figure out if the estimated result from any survey is consistent with a particular survey method?

A significant amount of research exists on predicting election results from a survey under different conditions. Most of these works, such as [4, 18, 9, 14, 13], focus on finding the minimum number of samples needed by a survey to forecast the winner and/or the margin of victory with a given confidence, and on efficient algorithms for the same. [12] extends this analysis to district-based settings, and provides algorithms to carry out the survey over a limited number of districts and a limited number of persons in each district. However, none of these works, to the best of our knowledge, predicts the number of districts won by the parties in either a deterministic or a probabilistic way. Nor are we aware of any attempt to evaluate whether a given survey result is consistent with the actual results.

The aim of this work is threefold. First of all, we attempt to provide a probability distribution over the space of all possible results, given a particular survey result and its various parameters. Here, an election result indicates both the vote share and seat share of different parties. Secondly, given the actual results, we attempt to provide a distribution over the space of possible survey results. This in turn can be used to check whether a given survey result is conceivable or not. Our final aim is to evaluate the above for actual district-based elections held in India.

Our approach depends heavily on the simulation of election outcomes. There are relatively few statistical models for this purpose. Eggenberger and Polya used the concept of Polya’s urn to propose a statistical voting model, which simulates the effect that if one candidate gets a vote, they are likely to get more [3]. There have been attempts to extend this to multiple districts [20]. Another popular approach is the Mallows model, which assumes a ‘central’ ranking over the candidates, and simulates individual votes by perturbing it. More recently, there have been attempts to systematically represent various aspects of district-based elections through voter-centric agent-based statistical models [16, 17]. In this work, we utilize some of these models to simulate complete election results.

The main contribution of the work is to cast the problem in a Bayesian setting by defining the conditional distribution of the actual outcome given the survey, and vice versa. These are modelled as Dirichlet distributions, whose parameters can be estimated from samples of election surveys, drawn from complete election outcomes. Our second contribution is a probabilistic model for surveys, based on the complete election outcome. Our third contribution is an algorithm based on Approximate Bayesian Computation to identify the modal (most likely) outcomes, given a survey result. Next, we show how the above framework can be used to evaluate survey results using actual outcomes, to test whether they are feasible and consistent with the uniform sampling paradigm. Finally, we validate this approach through extensive experiments over both simulated and real data. This involves political elections in India covering millions of voters and multiple parties. The novelty of the work lies in the aims, approach and the empirical analysis.

2 Notations and Problem Definition

We consider district-based 1-plurality elections, i.e. the candidate/party with maximum votes in a district wins the corresponding seat. Consider $N$ voters divided among $S$ districts as $\{N_1,\dots,N_S\}$. There are $K$ parties in fray, each of whom has a candidate in each district. Denote by $\theta_{sk}$ the votes received by party $k$ in district $s$, and by $\theta_k$ its overall vote. Also denote by $V_k$ the number of districts where the candidate from party $k$ is the winner with maximum number of votes. Clearly, $\sum_k \theta_k = N$ and $\sum_k V_k = S$.

Denote by $X$ the actual electoral outcome. It has two parts: $X=\{X_1,X_2\}$, where $X_1=\{\frac{\theta_1}{N},\dots,\frac{\theta_K}{N}\}$ and $X_2=\{\frac{V_1}{S},\dots,\frac{V_K}{S}\}$, i.e. the vote shares and seat shares of the parties. Denote by $Y$ the projected results based on the surveys, which also has two parts $\{Y_1,Y_2\}$: the projected vote shares and seat shares of all the parties.

Denote by $Z$ the complete election, where $Z=\{Z_1,Z_2,\dots,Z_S\}$ and $Z_s=\{\theta_{s1},\dots,\theta_{sK}\}$ denotes the vote share of the parties in district $s$. Note that the overall vote share and seat share of all parties can be easily calculated given $Z$. An election simulation model generates $Z$ given $X_1$ (note that $X_2$ can be calculated easily from $Z$). A survey model simulates $Y$ from $Z$.

The first task is: given a set of $M$ surveys $y^1,\dots,y^M$, calculate a posterior distribution $p(X \mid \{y^1,\dots,y^M\})$, at least up to a proportionality constant. Even if the normalization factor cannot be calculated, we should still be able to compare different candidate outcomes. A related aim is to estimate the mode $\arg\max_X p(X \mid \{y^1,\dots,y^M\})$, i.e. the most likely outcome.

The second task is the reverse: given the results $x$, calculate the distribution $p(Y \mid x)$. This shows how likely a survey (done under certain conditions) is to produce a particular projection. If the projected result of a survey (claimed to have been done under the same conditions) has very low density under this distribution, then we can doubt its actual methodology.

3 Model

Now, we describe the model in full detail. It has three building blocks: the posterior construction using Dirichlet synthetic likelihood, the election simulation models and the survey models. Below, we discuss each of these aspects in detail.

3.1 Constructing the Posterior

Our main aim is to model the probability distribution $p(X \mid Y)$ over possible outcomes $X$, given survey projection results $Y$. Using the Bayes Theorem, we can write $p(X \mid Y) \propto q(Y \mid X)\, r(X)$.

The prior $r(X)$ on $X$ can be written as $r(X) = g(X_1) \cdot f(X_2 \mid X_1)$. Since $X_1$ satisfies the definition of a PMF (vote proportion of the $K$ parties), it is intuitive to use the Dirichlet distribution here. So we write $g(X_1) = Dir(\gamma_1,\dots,\gamma_K)$, where $(\gamma_1,\dots,\gamma_K)$ are hyper-parameters that indicate our prior beliefs about the relative popularity of the different parties (maybe based on past elections).

Now we introduce the complete election $Z$ through an election model, which represents $h(Z \mid X_1)$, and a survey model, which represents $q(Y \mid Z)$. Using them, we can write the posterior as follows:

$$p(X \mid Y) \propto \int_Z q(Y \mid Z)\, f(X_2 \mid Z)\, h(Z \mid X_1)\, g(X_1) \tag{1}$$
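As a reading aid, the integral in (1) can be unpacked step by step. The sketch below assumes the dependency structure implied by the model: $Y$ and $X_2$ depend on $X_1$ only through $Z$, and $Y$ and $X_2$ are conditionally independent given $Z$. These conditions are an interpretation of the model structure, not something stated explicitly above.

```latex
\begin{aligned}
p(X \mid Y) &\propto p(Y, X_1, X_2) \\
            &= \int_Z p(Y, X_2, Z \mid X_1)\, g(X_1)\, dZ \\
            &= \int_Z q(Y \mid Z)\, f(X_2 \mid Z)\, h(Z \mid X_1)\, g(X_1)\, dZ
\end{aligned}
```

The first line drops the factor $1/p(Y)$, which does not depend on $X$; the last line applies the chain rule together with the assumed conditional independences.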

Note that $f(X_2 \mid Z)$ is deterministic, i.e. if we know the complete election result, then we can easily calculate the number of seats won by the parties. Now, both the election model and the survey model are simulation-based, i.e. we can sample $Z$ given $X_1$ and $Y$ given $Z$ respectively, but we have no analytical representation for $q$ and $r$. So the integration is intractable, and hence we need to use Approximate Bayesian Computation based on Monte Carlo sampling, as follows:

$$p(X \mid Y) \propto \frac{1}{M}\sum_{i=1}^{M} q(Y \mid Z_i)\, f(X_2 \mid Z_i)\, g(X_1) \tag{2}$$

where $Z_i$ are sampled from the election model $h(Z \mid X_1)$.

Note that $Y$ has two parts $\{Y_1,Y_2\}$, the vote share and the seat share of the parties. In the absence of a theoretical representation of $q(Y \mid Z)$, we can consider Synthetic Likelihood for them, like several works on Approximate Bayesian Inference [10, 11]. As both of them are proportions, the Dirichlet distribution is a sensible choice for such synthetic likelihood. The parameters $\alpha=\{\alpha_1,\dots,\alpha_K\}$ and $\beta=\{\beta_1,\dots,\beta_K\}$ of these distributions need to be estimated, based on samples of $Z$.

$$q(Y \mid Z) = Dir(Y_1 \mid \alpha(Z)) \cdot Dir(Y_2 \mid \beta(Z)) \tag{3}$$

We can write this because given $Z$, $Y_1$ and $Y_2$ can be considered as conditionally independent. This is ensured by the way that the survey model works. Here $\alpha(Z_i), \beta(Z_i)$ are complex functions of $Z_i$. One possibility might be to represent them using Neural Networks, but here we again use another Monte Carlo approach:

$$\begin{aligned}
\alpha(Z_i) &= \arg\max_{\alpha} \prod_{j=1}^{L} Dir(y_{1j} \mid \alpha) \text{ and} \\
\beta(Z_i) &= \arg\max_{\beta} \prod_{j=1}^{L} Dir(y_{2j} \mid \beta) \\
\text{where } y_j &\sim q(y_j \mid Z_i)
\end{aligned} \tag{4}$$

Here, $y_j$ are $L$ sample surveys drawn from the true election $Z_i$ according to the survey model $q$. Estimated vote shares $y_{1j}$ and seat shares $y_{2j}$ are obtained from them. Our synthetic Dirichlet likelihood is applicable for them too. Using these samples, maximum-likelihood estimates of $(\alpha,\beta)$ are calculated, using the algorithms discussed in [15]. These ML estimates are used to calculate the likelihood of the actual survey $Y$, using the synthetic Dirichlet likelihood again.
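The fit-then-evaluate step can be illustrated with a minimal sketch. It makes several assumptions that are not part of the method above: a simple moment-matching estimator is substituted for the maximum-likelihood algorithms of [15], plain Dirichlet draws stand in for the $L$ simulated survey vote shares $y_{1j}$, and all function names are hypothetical. Only the Python standard library is used.

```python
import math
import random
from typing import List, Sequence


def dirichlet_logpdf(y: Sequence[float], alpha: Sequence[float]) -> float:
    """Log density of Dir(alpha) at y; requires every y_k > 0."""
    return (math.lgamma(sum(alpha))
            - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(p) for a, p in zip(alpha, y)))


def fit_dirichlet_moments(samples: Sequence[Sequence[float]]) -> List[float]:
    """Moment-matching estimate of alpha (a stand-in for the ML fit of [15])."""
    n, k = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(k)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / (n - 1) for i in range(k)]
    # For a Dirichlet, Var(y_i) = m_i (1 - m_i) / (alpha_0 + 1); solve for alpha_0
    # per component and average the results.
    precisions = [mean[i] * (1 - mean[i]) / var[i] - 1 for i in range(k) if var[i] > 0]
    alpha0 = sum(precisions) / len(precisions)
    return [alpha0 * m for m in mean]


def sample_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Draw from Dir(alpha) by normalizing independent Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


rng = random.Random(0)
true_alpha = [20.0, 10.0, 5.0]          # illustrative values only
simulated = [sample_dirichlet(true_alpha, rng) for _ in range(2000)]
alpha_hat = fit_dirichlet_moments(simulated)
observed = [0.55, 0.30, 0.15]           # a hypothetical survey vote share Y_1
print(alpha_hat, dirichlet_logpdf(observed, alpha_hat))
```

Seat shares often contain exact zeros, for which the Dirichlet log density is undefined; a practical implementation would need some smoothing convention, which this sketch does not attempt.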

3.2 Election Models

Suppose we know the total number of voters in support of the different parties. However, the outcome of the election is unknown, as it depends on how these voters are distributed across the districts. To take a small example, let us consider two parties A and B, which have 15 and 10 supporters respectively. These 25 voters are spread over 5 districts, each of which has 5 voters. Now if the spread is uniform, i.e. each district has 3 voters for party A and 2 voters for party B, then party A wins all 5 districts. On the other hand, if all voters in 3 districts support A while all voters in the other 2 districts support B, then A wins 3 districts and B wins 2. But if two districts have only A voters, while the remaining 5 A voters are spread across the remaining 3 districts as (2,2,1), then party A wins only the first 2 districts, while party B wins the remaining 3 districts despite having fewer supporters. To explore the space of possible electoral outcomes, it is thus necessary to consider different possible spatial distributions of the voters, given the overall popularities of the parties $\{\theta_1,\dots,\theta_K\}$. The aim of the election model is to achieve this through sampling.
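The three scenarios of this small example can be checked directly. The sketch below is only an illustration; the helper name `seats_won` and the variable names are hypothetical.

```python
from typing import List, Sequence


def seats_won(districts: Sequence[Sequence[int]]) -> List[int]:
    """Count plurality wins per party; each row holds one district's votes per party."""
    wins = [0] * len(districts[0])
    for votes in districts:
        wins[votes.index(max(votes))] += 1  # no ties occur in these scenarios
    return wins


# Each row is (votes for A, votes for B) in one district of 5 voters.
uniform = [(3, 2)] * 5
segregated = [(5, 0)] * 3 + [(0, 5)] * 2
over_concentrated = [(5, 0), (5, 0), (2, 3), (2, 3), (1, 4)]

for scenario in (uniform, segregated, over_concentrated):
    # Every scenario uses the same overall support: 15 voters for A, 10 for B.
    assert sum(a for a, _ in scenario) == 15 and sum(b for _, b in scenario) == 10

print(seats_won(uniform))            # [5, 0]
print(seats_won(segregated))         # [3, 2]
print(seats_won(over_concentrated))  # [2, 3]
```

The overall vote shares are identical in all three cases, so the difference in seats comes entirely from the spatial distribution of the voters.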

While simulating the spread of voters across districts, it is necessary to make sure that these distribution patterns are realistic. Real-world political elections have certain characteristics, such as i) In a district, most of the voters support a small subset of parties in fray, ii) People supporting any party are more likely to be staying in the same districts. These happen due to various sociological factors that influence electoral preferences, especially in a heterogeneous society where political preferences often depend on social identity. An Election Model should be able to produce these features in its simulation.

One of the most well-known election simulation models that partially captures the first aspect mentioned above is the Polya Urn model, which works on the idea that if one voter chooses a candidate, then the probability of subsequent voters choosing the same candidate increases. However, this is restricted to the single-district case. We consider the agent-based models proposed in [16] for district-based elections. These models focus on each voter as an agent, and assign them to a district and/or party according to a probabilistic process to maintain the above two properties.

We first consider the Districtwise/Seatwise Polarization Model (SPM) that has a single parameter $\gamma$, called concentration parameter. The idea is based on Chinese Restaurant Process [19] similar to Polya’s Urn. Each voter in a district is likely to choose a party according to its local popularity (number of votes it has already received in same district) with probability $\gamma$, while with the remaining probability $1-\gamma$ they can choose a party according to the overall popularity. In general, high value of $\gamma$ causes concentration of support of parties in specific districts, so that the seat share is a reflection of the overall popularities of the parties. On the other hand, low value of $\gamma$ causes the vote share in each district to reflect the overall popularities (vote shares) of the parties, and thus the most popular party wins almost all the seats.
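A minimal sketch of this sampling scheme follows. It is a simplified reading of the description above, not the exact procedure of [16]: voters are processed one at a time within each district, the first voter of a district (who has no local history) always follows the overall popularity, and the realized overall vote shares match $X_1$ only in expectation. The function name `simulate_spm` is hypothetical.

```python
import random
from typing import List, Sequence


def simulate_spm(x1: Sequence[float], district_sizes: Sequence[int],
                 gamma: float, rng: random.Random) -> List[List[int]]:
    """Return district-wise vote counts under a simplified SPM-style process."""
    parties = list(range(len(x1)))
    election = []
    for size in district_sizes:
        counts = [0] * len(x1)
        for _ in range(size):
            if sum(counts) > 0 and rng.random() < gamma:
                weights = counts   # follow local popularity within the district
            else:
                weights = x1       # follow overall popularity
            k = rng.choices(parties, weights=weights)[0]
            counts[k] += 1
        election.append(counts)
    return election


rng = random.Random(0)
overall = [0.5, 0.3, 0.2]          # illustrative vote shares X_1
sizes = [200] * 10                 # illustrative district sizes
polarized = simulate_spm(overall, sizes, gamma=0.95, rng=rng)
homogeneous = simulate_spm(overall, sizes, gamma=0.0, rng=rng)
print(polarized[:3])
print(homogeneous[:3])
```

With `gamma=0.0` every district is an independent draw from the overall vote shares, whereas values close to 1 let early votes in a district dominate its final result.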

It often happens that a party with high vote share wins fewer seats than a less popular party, because its voters are either too concentrated (reducing spatial spread) or too diffuse (failing to achieve adequate concentration to win any district). This phenomenon cannot be captured by the SPM. So we consider the Partywise Concentration Model (PCM) with party-specific concentration parameters $\{\gamma_1,\dots,\gamma_K\}$. This model places each voter in a district which already has other voters who support the same party $k$, with probability $\gamma_k$. However, with probability $1-\gamma_k$, the voter is placed in any district uniformly. Different combinations of high/low values of these party-specific parameters can create widely differing and unexpected results. The PCM model is much richer than SPM as it can simulate a much broader spectrum of results, but is also more difficult to calibrate as it has $K$ parameters.
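The placement rule of the PCM can be sketched in the same spirit. The version below is an assumption-laden simplification rather than the model of [16]: voters arrive in random order, a party's first supporter is placed uniformly, the "already has supporters" step picks a district in proportion to the party's current support there, and district sizes are not constrained. The function name `simulate_pcm` is hypothetical.

```python
import random
from typing import List, Sequence


def simulate_pcm(party_votes: Sequence[int], num_districts: int,
                 gammas: Sequence[float], rng: random.Random) -> List[List[int]]:
    """Place each party's supporters into districts; returns counts[s][k]."""
    num_parties = len(party_votes)
    election = [[0] * num_parties for _ in range(num_districts)]
    districts = list(range(num_districts))
    # One entry per voter, labelled by party, visited in random order (an assumption).
    voters = [k for k, n in enumerate(party_votes) for _ in range(n)]
    rng.shuffle(voters)
    for k in voters:
        local = [election[s][k] for s in districts]
        if sum(local) > 0 and rng.random() < gammas[k]:
            # Join a district in proportion to existing support for party k.
            s = rng.choices(districts, weights=local)[0]
        else:
            s = rng.choice(districts)  # uniform placement
        election[s][k] += 1
    return election


rng = random.Random(0)
result = simulate_pcm([600, 400], num_districts=10, gammas=[0.2, 0.9], rng=rng)
for row in result:
    print(row)
```

Unlike the SPM sketch, the overall vote totals of the parties are preserved exactly here, since only the placement of voters is random.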

3.3 Survey Models

The aim of a survey is to estimate the underlying reality by examining a small number of samples. In this case, the underlying reality is the actual voting preference of all voters, i.e. $Z$, and the aim of the survey is to predict the vote shares $X_1$ and seat shares $X_2$. This is obtained by selecting a small subset of the voters and finding out their preferences (it is assumed that they respond truthfully).

The main question here is how to choose these respondents. As already discussed, the preferences may vary from district to district. While it may not be possible to cover all districts, an unbiased survey can be considered to choose districts uniformly at random, and also choose respondents uniformly at random from these districts. This approach of uniform sampling has been discussed by other works like [12], which provided lower bounds on the fraction of districts to be sampled, and the number of people to be queried in each district to be able to predict the winner correctly. In our model, we leave these as parameters $f_s$ and $f_n$. We further assume that an equal number of people are queried in all the chosen districts.

Suppose in district $j$, we find $\{W_{j1},\dots,W_{jK}\}$ respondents in favour of the $K$ parties. Clearly, this follows a Multinomial Distribution with parameters $\{N_j f_n, (\theta_{j1},\dots,\theta_{jK})\}$. The next question is, given the survey results, how to predict the outcome $\{X_1,X_2\}$. Our model estimates the total vote share by simply aggregating the number of respondents across all districts, who expressed preferences for different parties. In other words, $Y_1(k)=\frac{\sum_j W_{jk}}{N f_n}$ ($N f_n$ is the total number of respondents) for party $k$. Next, in each of the $S f_s$ districts where we carried out the survey, we identify the party with maximum number of votes among the respondents from that district. Thus, we find the number of districts $\{v_1,\dots,v_K\}$ “won” by the different parties, and we use this as our estimate $Y_2$ of the overall seat share, i.e. $Y_2(k)=\frac{v_k}{S f_s}$.
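The survey procedure can be sketched as follows, under stated assumptions: the number of respondents per surveyed district is passed in directly (standing in for the $N_j f_n$ term), respondents are drawn with replacement in proportion to the district's vote counts, ties within a district go to the lowest-indexed party, and vote shares are normalized by the number of respondents actually queried. The function name `simulate_survey` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def simulate_survey(z: Sequence[Sequence[int]], f_s: float,
                    respondents_per_district: int,
                    rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return projected (vote shares Y1, seat shares Y2) from one simulated survey."""
    num_parties = len(z[0])
    num_sampled = max(1, round(len(z) * f_s))
    chosen = rng.sample(range(len(z)), num_sampled)  # districts, uniformly at random
    totals = [0] * num_parties
    wins = [0] * num_parties
    for j in chosen:
        # Multinomial draw: each respondent supports party k with probability
        # proportional to that party's votes in district j.
        picks = rng.choices(range(num_parties), weights=z[j],
                            k=respondents_per_district)
        w = [picks.count(k) for k in range(num_parties)]
        for k in range(num_parties):
            totals[k] += w[k]
        wins[w.index(max(w))] += 1   # party leading among this district's respondents
    num_respondents = num_sampled * respondents_per_district
    y1 = [t / num_respondents for t in totals]
    y2 = [v / num_sampled for v in wins]
    return y1, y2


rng = random.Random(0)
election = [[60, 40], [45, 55], [70, 30], [30, 70], [52, 48], [80, 20]]
y1, y2 = simulate_survey(election, f_s=0.5, respondents_per_district=25, rng=rng)
print(y1, y2)
```

Repeating this call many times on the same election yields the sample surveys needed to fit the synthetic Dirichlet likelihood.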

4 Analysis of Elections

As already discussed, our aims in this paper are twofold: prediction of the results based on the surveys, and evaluation of the surveys based on the results. We now discuss how these can be achieved using the model discussed above.

4.1 Prediction from Surveys

Consider the situation where $M$ surveys have been conducted, with results $\{y_1,\dots,y_M\}$, where $y_i=\{y_{i1},y_{i2}\}$, and we aim to estimate $X$ from them. We have already described our approach to construct the posterior $p(X \mid y_1,\dots,y_M)$. However, this construction does not account for the normalization factor $\frac{1}{p(Y)}$. Even if it were known, it would be difficult to visualize the infinite space of possible outcomes.

We discuss two ways to utilize this posterior on possible outcomes. The first one is comparison of a finite number of candidate outcomes. We are often interested in very specific questions, such as how many votes a particular party may win or which party can win the most seats, rather than the exact vote and seat shares of all parties. Accordingly, we can construct a few representative outcomes $x_1,\dots,x_k$, and compare their relative likelihoods through $p(x_i|y_1,\dots,y_M)$.

Also, the seat share is often more important than the vote share, and there are only a finite number of seat shares (based on how $S$ seats can be distributed among $K$ parties). So a PMF can be constructed by calculating the posterior measure for each possible seat share and normalizing.
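The normalization over the finite set of seat shares can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: `seat_allocations`, `seat_share_pmf` and the placeholder score `toy_score` are hypothetical names, and `toy_score` merely stands in for the unnormalized posterior measure of a seat allocation.

```python
from itertools import combinations

def seat_allocations(S, K):
    """Yield every way to split S seats among K parties (stars and bars)."""
    for bars in combinations(range(S + K - 1), K - 1):
        prev = -1
        alloc = []
        for b in bars:
            alloc.append(b - prev - 1)   # seats between consecutive bars
            prev = b
        alloc.append(S + K - 2 - prev)   # seats after the last bar
        yield tuple(alloc)

def seat_share_pmf(S, K, unnormalized_posterior):
    """Evaluate an unnormalized posterior on each allocation and normalize."""
    allocs = list(seat_allocations(S, K))
    weights = [unnormalized_posterior(a) for a in allocs]
    total = sum(weights)
    return {a: w / total for a, w in zip(allocs, weights)}

def toy_score(alloc, target=(3, 1, 1)):
    """Placeholder score (assumption for illustration): decays with distance from target."""
    return 2.0 ** (-sum(abs(a - t) for a, t in zip(alloc, target)))

pmf = seat_share_pmf(5, 3, toy_score)
```

For $S=5$ and $K=3$ the enumeration yields $\binom{7}{2}=21$ allocations. The number of allocations grows quickly with $S$ and $K$, so exhaustive enumeration is only practical for small elections.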

If we need a distribution for an individual party’s vote share or seat share, it is difficult to calculate it analytically from the above model, because the constructed posterior does not follow a known family of distributions. However, we can again use a Monte Carlo approach if we can draw samples from an approximate form of the posterior. The proposed approach is as follows:

  1. Initialize the sample set $\mathcal{S}=\emptyset$

  2. Draw a sample $x_1$ from the prior $r$

  3. Simulate an election $z$ based on $x_1$ using the Election Model

  4. Calculate $x_2$ from $z$

  5. Simulate a survey $y$ from $z$ using the Survey Model

  6. If $y$ is close enough to the observed surveys $\{y_1,\dots,y_M\}$, ACCEPT the sample, else REJECT it

  7. If the sample is ACCEPTED, add $\{x_1,x_2\}$ to $\mathcal{S}$

  8. Repeat till we have sufficient samples
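The steps above can be sketched as a simple rejection sampler. The sketch makes several assumptions that are not part of the model description: `toy_election` and `toy_survey` are hypothetical stand-ins for the Election Model and Survey Model (they do not implement SPM/DPM/PCM), a simulated survey is accepted only if it is within the K-L thresholds `eps1` and `eps2` of every observed survey, and the loop is bounded by `max_tries`.

```python
import math
import random

def dirichlet(alpha, rng):
    """Draw one Dirichlet sample via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def kl(p, q, eps=1e-9):
    """Smoothed K-L divergence between two proportion vectors."""
    return sum((a + eps) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def toy_election(x1, S, rng, concentration=50.0):
    """Hypothetical stand-in for the Election Model: per-district vote shares."""
    return [dirichlet([concentration * v for v in x1], rng) for _ in range(S)]

def seat_share(z):
    """Fraction of districts won by each party under plurality."""
    K = len(z[0])
    wins = [0] * K
    for district in z:
        wins[max(range(K), key=lambda k: district[k])] += 1
    return [w / len(z) for w in wins]

def toy_survey(z, n_resp, rng):
    """Hypothetical stand-in for the Survey Model: poll n_resp electors per district."""
    K = len(z[0])
    votes, wins = [0] * K, [0] * K
    for district in z:
        counts = [0] * K
        for k in rng.choices(range(K), weights=district, k=n_resp):
            counts[k] += 1
        wins[max(range(K), key=lambda k: counts[k])] += 1
        votes = [v + c for v, c in zip(votes, counts)]
    y1 = [v / sum(votes) for v in votes]   # projected vote share
    y2 = [w / len(z) for w in wins]        # projected seat share
    return y1, y2

def abc_posterior_samples(observed, prior_alpha, S, n_resp,
                          eps1, eps2, n_keep, max_tries, rng):
    samples = []                                     # step 1
    for _ in range(max_tries):                       # step 8 (bounded)
        if len(samples) >= n_keep:
            break
        x1 = dirichlet(prior_alpha, rng)             # step 2
        z = toy_election(x1, S, rng)                 # step 3
        x2 = seat_share(z)                           # step 4
        y1, y2 = toy_survey(z, n_resp, rng)          # step 5
        accepted = all(kl(y1, o1) < eps1 and kl(y2, o2) < eps2
                       for o1, o2 in observed)       # step 6
        if accepted:
            samples.append((x1, x2))                 # step 7
    return samples

rng = random.Random(0)
truth = toy_election([0.5, 0.3, 0.2], 10, rng)
observed = [toy_survey(truth, 30, rng) for _ in range(2)]
accepted = abc_posterior_samples(observed, [2.0, 2.0, 2.0], S=10, n_resp=30,
                                 eps1=0.05, eps2=0.3, n_keep=20,
                                 max_tries=500, rng=rng)
```

Tighter thresholds give samples that are more faithful to the surveys but lower the acceptance rate, so the thresholds and the number of tries have to be balanced against the available compute.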

Step 6 ensures that the accepted samples are consistent with the surveys. Any suitable measure like Kullback-Leibler (K-L) divergence can be used to compare $y$ with $\{y_1,\dots,y_M\}$. The ranks of the different parties with respect to the different estimates should also be compared.

Once we have enough samples of $X|\{y_1,\dots,y_M\}$, we can fit another synthetic likelihood on $X$. Once again, we use a Dirichlet likelihood, as $X_1$ and $X_2$ are both proportions over $K$ parties, and the parameters $\gamma=\{\gamma_1,\dots,\gamma_K\}$ and $\eta=\{\eta_1,\dots,\eta_K\}$ can be estimated using [15]. The marginal distribution of each variate in a Dirichlet distribution is a Beta distribution. Using this property, we can easily calculate the marginal distribution over the vote share and seat share of any party $k$, as follows:

$$X_{1k}\sim \mathrm{Beta}\Big(\gamma_k,\ \sum_{j=1}^{K}\gamma_j-\gamma_k\Big),\qquad X_{2k}\sim \mathrm{Beta}\Big(\eta_k,\ \sum_{j=1}^{K}\eta_j-\eta_k\Big) \tag{5}$$
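As an illustration of Eq. (5), the following sketch fits Dirichlet parameters to a set of sampled vote shares and reads off the Beta marginal of one party. Note the substitution: the fit below uses a simple method-of-moments estimator, not the estimator of [15], and the function names and the synthetic samples are assumptions made only for this example.

```python
import random

def fit_dirichlet_moments(samples):
    """Method-of-moments Dirichlet fit (a simple substitute for the estimator of [15])."""
    n, K = len(samples), len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(K)]
    precisions = []
    for k in range(K):
        var = sum((s[k] - means[k]) ** 2 for s in samples) / (n - 1)
        # For a Dirichlet, Var(X_k) = m_k (1 - m_k) / (alpha_0 + 1)
        precisions.append(means[k] * (1.0 - means[k]) / var - 1.0)
    alpha0 = sum(precisions) / K
    return [m * alpha0 for m in means]

def beta_marginal(alpha, k):
    """Parameters (a, b) of the Beta marginal of component k, as in Eq. (5)."""
    return alpha[k], sum(alpha) - alpha[k]

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

# Synthetic "accepted samples" of vote share, drawn from a known Dirichlet
rng = random.Random(0)
true_gamma = [8.0, 5.0, 3.0]
samples = []
for _ in range(5000):
    g = [rng.gammavariate(a, 1.0) for a in true_gamma]
    samples.append([v / sum(g) for v in g])

gamma_hat = fit_dirichlet_moments(samples)
a, b = beta_marginal(gamma_hat, 0)     # marginal of the first party's vote share
mean, var = beta_mean_var(a, b)
```

The same two helper calls applied to the fitted $\eta$ give the marginal for a party's seat share.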

4.2 Investigating the Surveys

An election survey is supposed to be uniform and unbiased. Once the election result $x$ is known, we want to verify if the reported survey result $y$ was consistent with it. In other words, is the probability $p(Y=y|X=x)$ high enough, if the uniform survey approach was indeed followed? If not, the survey result may be considered dubious.

We have already discussed the use of synthetic Dirichlet likelihood for $q(Y|Z)$. Given the observations $x$, we generate many samples of $Z$ (the complete election) using the Election Model, generate the projected result $Y$ for each of them using the Survey Model, and then estimate the Dirichlet parameters $(\alpha,\beta)$. Accordingly, we can calculate $p(Y=y_1|x)=Dir(\alpha)$ and $p(Y=y_2|x)=Dir(\beta)$.

To understand whether $p(Y=y|x)$ is high enough for $y$ to be considered consistent with $x$, one possible approach is to consider the likelihood ratio, as considered in several works on Sampling-based Approximate Inference [21]. This ratio is $\frac{p(Y=y|x)}{p(Y=y)}$. If this ratio is greater than 1, it means that the projected results are more likely than usual when conditioned on the actual result, which is an affirmation of the survey. On the other hand, a ratio of 1 or less suggests that the projected results may be dubious or independent of the actual results.

However, calculating $p(Y=y)$ is computationally expensive as it involves marginalizing over both $Z$ and $X$. Unlike $p(Y|X)$, we cannot express $p(Y)$ as a Dirichlet distribution, as the possible values of $Y$ and their respective probabilities are too varied to be expressed by a single distribution. A possible approach is the Dirichlet Process Mixture Model (DPMM) with a Dirichlet base distribution, but even then, calculating the marginal likelihood is very difficult [2]. So we adopt an alternate non-parametric approach based on Monte Carlo sampling, similar to the procedure for sampling from $p(X|Y)$ discussed in Sec 4.1.

  1. Draw $N$ candidate samples of vote share $\{X^c_{11},\dots,X^c_{1N}\}$ from the prior $g(X)$

  2. From each of them, sample an election $\{Z^c_{11},\dots,Z^c_{1N}\}$ using the Election Model

  3. Simulate surveys on them using the Survey Model and obtain projected vote shares $\{Y^c_{11},\dots,Y^c_{1N}\}$ and seat shares $\{Y^c_{21},\dots,Y^c_{2N}\}$

  4. Find the number of samples of $Y$ that are within a specified distance of both $y_1$ and $y_2$

So, the density at any arbitrary projection $y=\{y_1,y_2\}$ can be obtained as $p(y)\approx\frac{1}{N}\sum_{i=1}^{N} I\big(KL(y^c_{1i},y_1)<\epsilon_1\big)\, I\big(KL(y^c_{2i},y_2)<\epsilon_2\big)$.

Similarly, $p(y|X)$ is obtained in the same way, but by considering only those samples from $\{X^c_{11},\dots,X^c_{1N}\}$ which are close enough to $X_1$, and for which the corresponding $\{X^c_{21},\dots,X^c_{2N}\}$ are also close enough to $X_2$. Closeness is once again measured in terms of K-L divergence. We call the ratio $\frac{p(y|X)}{p(y)}$ thus obtained the nonparametric likelihood ratio.
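A minimal sketch of this counting estimate is given below, assuming the simulated tuples have already been generated. The dictionary keys `x1`, `x2`, `y1`, `y2`, the smoothing constant in `kl`, and the choice to return `None` when either count is zero are assumptions of this sketch.

```python
import math

def kl(p, q, eps=1e-9):
    """Smoothed K-L divergence between two proportion vectors."""
    return sum((a + eps) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def nonparametric_likelihood_ratio(sims, y, x, eps_y, eps_x):
    """Estimate p(y|X) / p(y) by counting simulated samples.

    sims  : list of dicts with keys 'x1', 'x2', 'y1', 'y2' (one per simulated election)
    y, x  : (vote share, seat share) pairs for the reported survey and the actual result
    eps_y : (eps_1, eps_2) K-L thresholds for the projections
    eps_x : K-L thresholds deciding which simulated outcomes count as close to x
    """
    hits_y = [kl(s["y1"], y[0]) < eps_y[0] and kl(s["y2"], y[1]) < eps_y[1]
              for s in sims]
    near_x = [kl(s["x1"], x[0]) < eps_x[0] and kl(s["x2"], x[1]) < eps_x[1]
              for s in sims]
    p_y = sum(hits_y) / len(sims)             # indicator average over all samples
    n_near = sum(near_x)
    if p_y == 0 or n_near == 0:
        return None                           # not enough samples to form the ratio
    p_y_given_x = sum(h and n for h, n in zip(hits_y, near_x)) / n_near
    return p_y_given_x / p_y
```

Because both probabilities are estimated by counting, the ratio is only meaningful when enough simulated samples fall within the chosen thresholds.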

An alternate approach is to calculate $\frac{p(Y=y|x)}{\max_Y p(Y|x)}$, i.e. how likely the projected results are compared to the most likely projections from an ideal survey. The denominator can be easily calculated using the estimated Dirichlet parameters of $p(Y|x)$. We call this ratio the likelihood mode ratio.
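For a Dirichlet with all parameters greater than 1, the density is maximized at the mode $(\alpha_k-1)/(\sum_j\alpha_j-K)$, so the likelihood mode ratio has a closed form. The sketch below assumes this case and strictly positive proportions in $y$; the function names are illustrative.

```python
import math

def dirichlet_logpdf(y, alpha):
    """Log-density of Dirichlet(alpha) at a strictly positive proportion vector y."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, y))

def dirichlet_mode(alpha):
    """Mode of Dirichlet(alpha); defined here only when every alpha_k > 1."""
    if any(a <= 1.0 for a in alpha):
        raise ValueError("interior mode requires all alpha_k > 1")
    denom = sum(alpha) - len(alpha)
    return [(a - 1.0) / denom for a in alpha]

def likelihood_mode_ratio(y, alpha):
    """p(Y = y | x) / max_Y p(Y | x) under a fitted Dirichlet(alpha)."""
    log_ratio = dirichlet_logpdf(y, alpha) - dirichlet_logpdf(dirichlet_mode(alpha), alpha)
    return math.exp(log_ratio)

alpha_hat = [41.0, 36.0, 26.0]                 # illustrative fitted parameters
ratio = likelihood_mode_ratio([0.5, 0.3, 0.2], alpha_hat)
```

The ratio lies in $(0,1]$ and equals 1 only at the mode. Projected seat shares that contain an exact zero would need smoothing before the log-density can be evaluated.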

5 Experimental Evaluation

In this section, we discuss detailed validation of the concepts discussed above on simulated data, and then proceed to evaluate actual political elections and surveys held in India. The main questions we wish to answer here are as follows: i) Does the Survey Model produce realistic results from an election? ii) How does the accuracy of a survey depend on its scale? iii) Can the constructed Dirichlet posterior $q(Y|Z)$ distinguish between fair and biased surveys? iv) Can we predict the election results from fair surveys using the constructed posterior? v) Can we estimate the performance of a party based on fair surveys? vi) Can we evaluate actual political elections using this setting? Below, we describe detailed experiments to answer these questions.

5.1 Survey Model Evaluation

While a single survey’s result $Y$ is stochastic (depending on the sample of respondents and districts chosen), we can construct the distribution over projected results by Monte Carlo sampling using the Survey Model. To understand this, we construct a small experiment over $N=10000$ electors, $S=5$ districts and $K=3$ parties. These 5 seats can be divided among the 3 parties in 21 ways ((5,0,0), (2,3,0), (1,1,3) etc). We consider two different vote shares: $(0.4,0.35,0.25)$ and $(0.5,0.4,0.1)$ over the 3 parties. In the first case there is a close contest, while in the second case there is a prominent winner and loser. However, these votes may be distributed across the districts in different ways, resulting in different seat shares, from (2,2,1) to (5,0,0). The question is, can the survey results reflect these? Are the modes of the survey distributions located at these outcomes? If not, how far from the modes are they?

The complete election results $Z_1$ and $Z_2$ (corresponding to these two vote shares) are generated using the DPM/SPM Election Model with concentration parameter 0.9. The seat shares obtained are $(2,2,1)$ and $(3,1,1)$ respectively.

The Survey Model is then applied on both $Z_1$ and $Z_2$ 1000 times, and the projected seat shares are recorded in each case. We consider $f_n=0.1$ and $f_s=1$ (i.e. 10% of the people are queried from all of the districts uniformly). Thus, we obtain empirical frequency distributions over the 21 possible seat distributions. It is found that for $Z_1$, the accurate seat projection rate is 65.4%, i.e. the projected seat share matches the true seat share 65.4% of the time. Other results which had significant probability under the survey were $(3,1,1)$ and $(3,2,0)$, both close to the correct result. For $Z_2$, this figure is 53.7%.

We scale up the experiments to $N=1000000$, $S=100$ and repeat for other values of the concentration parameter of SPM. The $(f_n,f_s)$ parameters are held at $(0.1,1)$. The accurate seat projection rates for both vote shares and different concentration values are shown in Fig. 1, for different margins of error (for example, if the true seat distribution is $(50,30,20)$ and the projected one is $(51,28,21)$, we can say that the error is within 2). It is observed that in case of Fig 1a (comparable vote shares), performance is better for higher values of concentration, i.e. when the seat share reflects the vote share more closely. In case of Fig 1b (diverse vote shares), the relation is less clear, but the accurate seat projection rate is significantly higher compared to Fig 1a. This means that when the election is closely contested in terms of vote share, surveys are more likely to be accurate if the seat shares are compatible with the vote shares. In case of lopsided elections in terms of vote share, surveys are generally expected to be more accurate.
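The accurate seat projection rate with a margin of error can be computed as in the sketch below. It assumes that the error of a projection is the largest per-party difference in seats, which is consistent with the example above but is an interpretation rather than a stated definition; the function names are illustrative.

```python
def within_margin(true_seats, projected_seats, margin):
    """True if no party's projected seat count is off by more than `margin`."""
    return max(abs(t - p) for t, p in zip(true_seats, projected_seats)) <= margin

def accurate_projection_rate(true_seats, projections, margin=0):
    """Fraction of simulated surveys whose projection is within the margin."""
    hits = sum(within_margin(true_seats, p, margin) for p in projections)
    return hits / len(projections)

# Toy projections from four hypothetical survey runs
projections = [(50, 30, 20), (51, 28, 21), (47, 33, 20), (55, 25, 20)]
rate_exact = accurate_projection_rate((50, 30, 20), projections, margin=0)
rate_within_2 = accurate_projection_rate((50, 30, 20), projections, margin=2)
```

With `margin=0` this reduces to the exact-match rate used in the small five-district experiment.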

Figure 1: Comparison of Accurate Seat Projection Rate for different vote and seat shares. Fig 1a (left): close contest with vote shares (0.4, 0.35, 0.25), Fig 1b (right): lopsided contest with vote shares (0.5, 0.4, 0.1)

Should a survey go wider (cover more districts) or deeper (ask more people in each district)? We study how the accurate seat projection rate varies with the scale of the survey, i.e. with $f_n$ and $f_s$. We repeat this experiment for both the aforementioned vote shares, and also for two SPM concentration parameters (0.9 and 0.7), resulting in different seat shares. High concentration causes the seat share to reasonably resemble the vote share, while low concentration maximizes the seat share of the party with the highest vote share. The results are shown in Fig. 2. In Fig 2a (left) the number of people surveyed is varied while keeping the district coverage unchanged (50%), while in Fig 2b (right) the district coverage is varied while keeping the people coverage unchanged (10%).

It is observed that covering more people has no clear impact when the concentration is high, i.e. seat share reflects the vote share. But for low concentration, covering more people clearly improves the survey performance. On the other hand, covering more districts is clearly more beneficial in case of high concentration, but not so much in case of low concentration.

Figure 2: Comparison of Accurate Seat Projection Rate for surveying different fractions of the population (fig 2a-left), and districts (fig 2b-right) on 4 vote share/seat share combinations (see legend). Error limit: 3%

The above observations are validated further on actual political elections held in India. We consider four states of India (Tripura, Himachal Pradesh, Gujarat, Karnataka) that had elections in the past year. All of these were essentially tripartite contests, where the vote shares and seat shares of the three main parties are provided in Table 1. To avoid needless controversies, we have anonymized the parties. In each case, we refer to the party with most votes as P1, second most as P2 etc.

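| State | N | S | Vote share P1 | Vote share P2 | Vote share P3 | Seat share P1 | Seat share P2 | Seat share P3 |
|---|---|---|---|---|---|---|---|---|
| Tripura | 2.4M | 60 | 0.42 | 0.38 | 0.20 | 0.55 | 0.23 | 0.22 |
| Himachal | 4.2M | 68 | 0.45 | 0.43 | 0.12 | 0.59 | 0.37 | 0.04 |
| Gujarat | 29M | 182 | 0.56 | 0.30 | 0.14 | 0.88 | 0.09 | 0.03 |
| Karnataka | 36M | 224 | 0.46 | 0.39 | 0.15 | 0.62 | 0.30 | 0.08 |

Table 1: Summary of 4 recent state assembly elections in India. Parties anonymized and ranked in order of vote share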

Surveys are simulated by the Survey Model using the complete election data obtained from [1]. Once again we vary $f_n$ and $f_s$ as above, though $f_n$ is now kept to smaller values (0.1-5% of the total population) due to the huge sizes of the electorate. The results are shown in Fig 3. In most cases, we see that increasing the district coverage results in clear improvement of projections ($f_n$ constant at 10%), while increasing people coverage has no such effect (district coverage held constant at 50%). This is consistent with our previous analysis, as in all cases (except Gujarat) the seat shares are not very far from the vote shares. The optimal SPM concentration parameter in all these cases, using which the seat share can be obtained most accurately given the vote shares, is found to be around 0.9. So this observation is consistent with our previous result (Fig 2).

Figure 3: Comparison of Accurate Seat Projection Rate for surveying different fractions of the population (fig 3a-left), and districts (fig 3b-right) of Indian state elections. Maximum error: 3%

6 Predicting Results from Surveys

We now set out to evaluate the constructed posterior $p(X|Y)$, i.e. given the survey projections, how well can we identify which outcomes are most likely, and which are not? For this, we carry out three experiments.

In the first experiment, we consider the true result $X^0=\{X^0_1,X^0_2\}$ and generate complete results from the election model. $X^0_1$ is obtained from the Dirichlet prior $r$. The survey model is run on it to generate a projection $\{Y_1,Y_2\}$, considering $N=1000000$, $S=100$. We then construct the posterior by Monte Carlo sampling and maximum likelihood estimation of the synthetic Dirichlet parameters, as discussed in Sec 3.1. We now calculate the posterior density at a number of candidate results, including $X^0$.
This is repeated for three sets of results: i) $X^0_1=(0.55,0.23,0.22)$, $X^0_2=(0.72,0.15,0.13)$, ii) $X^0_1=(0.35,0.33,0.32)$, $X^0_2=(0.36,0.36,0.28)$, iii) $X^0_1=(0.35,0.33,0.32)$, $X^0_2=(0.71,0.27,0.02)$. Note that ii) and iii) have identical vote shares but very different seat shares (due to different values of SPM concentration). Among the candidate solutions in each case, $X^0$ and the results closest to it are the ones with highest posterior density. Fig. 4 shows how the posterior density at different results decreases as their distance (K-L divergence) from the original result $X^0$ increases. Table 2 shows the true results $X^0$, the projected results $Y$ and the candidate solution with highest posterior density.

| Share | Actual Results | Projections | Posterior Mode |
|---|---|---|---|
| Vote | (0.55, 0.23, 0.22) | (0.51, 0.26, 0.23) | (0.55, 0.23, 0.22) |
| Vote | (0.35, 0.33, 0.32) | (0.34, 0.33, 0.33) | (0.34, 0.32, 0.33) |
| Vote | (0.35, 0.33, 0.32) | (0.36, 0.34, 0.30) | (0.35, 0.33, 0.32) |
| Seat | (0.72, 0.15, 0.13) | (0.76, 0.14, 0.10) | (0.72, 0.15, 0.13) |
| Seat | (0.36, 0.36, 0.28) | (0.36, 0.34, 0.30) | (0.37, 0.30, 0.33) |
| Seat | (0.71, 0.27, 0.02) | (0.54, 0.34, 0.12) | (0.71, 0.27, 0.02) |

Table 2: Original, projected, posterior mode vote shares (above) and seat shares (below) for three candidate settings
Figure 4: Relation between posterior density of candidate solutions and their distance (K-L divergence) from the actual result $X^0$. 4a (left): $X^0=(0.55,0.23,0.22)/(0.72,0.15,0.13)$, 4b (right): $X^0=(0.35,0.33,0.32)/(0.36,0.36,0.28)$

Note that for case ii) the highest posterior density value is achieved at $X_1=(0.34,0.33,0.33)$, $X_2=(0.37,0.33,0.30)$, which is different from, but very close to, $X^0$. In cases i) and iii) $X^0$ has the best posterior density.

How does the posterior’s performance change with the number and scale of the surveys? This is the question we study in the third experiment. We repeat the second experiment by varying the number of surveys, as well as $f_n$ and $f_s$ in each survey. We see that as we increase the number of surveys, the posterior density of the actual result increases with respect to other candidates. For example, in case of setting ii) above, $X^0$ has the highest posterior density when we consider 5 surveys (which was not the case when we considered 1 survey). The results are shown in the Appendix.

In the second experiment, we consider the election results from the four state elections summarized in Table 1. We consider 5 surveys in each case, by using our survey model on the complete election data. Next, the posterior density is computed for several candidate solutions including $X^0$. Once again, Fig. 5 shows how the posterior density at different candidate results varies with their distances from the actual results.

Figure 5: Posterior Likelihood of candidate results versus their distance (K-L divergence) from the actual results in case of the 4 state elections. Note the anomalous nature of the plot for Tripura, which had a multi-modal posterior

In case of Tripura, the SPM model fails to produce the true results under any parameter settings. So we consider the PCM model. Even then, the few candidate results with highest likelihood were quite varied, including $(0.39,0.36,0.25)/(0.55,0.4,0.05)$ and $(0.41,0.37,0.22)/(0.9,0.1,0)$. In both cases, either the vote share or the seat share is reasonably close to the actual, but not both. This is a special case of a multi-modal posterior, where varied results seem to be equally likely. This is reflected in the nature of the plot in Fig 5. The reason is that P3’s vote share was extremely skewed across districts. In case of Himachal Pradesh, the most likely result according to the SPM model, based on 5 surveys, is $(0.46,0.43,0.11)/(0.6,0.4,0)$. This result has a slightly higher posterior likelihood than the actual result. The SPM model was generally unable to produce results that allocate $0.14$ seat share to P3. In case of Gujarat and Karnataka, the actual result itself had the best likelihood among the candidate results which we considered. The comparisons of the actual results, projected results (median from 5 surveys) and posterior mode results are provided in Table 3, except for Tripura where there is no clear posterior mode. The conclusion is that the constructed likelihood is consistent, i.e. it is able to recover the true result from the surveys in most cases. In the Appendix, we show how these results change with the number and scale of surveys, and the prior distribution $g(X_1)$.

| Share | State | Actual Results | Projections | Posterior Mode |
|---|---|---|---|---|
| Vote | Himachal | (0.45, 0.43, 0.12) | (0.44, 0.45, 0.11) | (0.46, 0.43, 0.11) |
| Vote | Gujarat | (0.56, 0.30, 0.14) | (0.57, 0.27, 0.13) | (0.56, 0.30, 0.14) |
| Vote | Karnataka | (0.46, 0.39, 0.15) | (0.47, 0.38, 0.15) | (0.46, 0.39, 0.15) |
| Seat | Himachal | (0.59, 0.37, 0.04) | (0.53, 0.40, 0.07) | (0.54, 0.44, 0.02) |
| Seat | Gujarat | (0.88, 0.09, 0.03) | (0.87, 0.09, 0.03) | (0.88, 0.09, 0.03) |
| Seat | Karnataka | (0.61, 0.30, 0.09) | (0.62, 0.28, 0.10) | (0.61, 0.30, 0.09) |

Table 3: Original, projected, posterior mode vote shares (above) and seat shares (below) for each party in the 3 state elections except Tripura. The projected results mentioned are based on the median of 5 surveys, with an error range of $\pm 0.05$ around the median.

6.1 Party-specific Performance Distribution

A related question that arises is: given survey $Y$, what can we say about the probable performance of a particular party? Our approach to this question has already been discussed in Section 4.1. We evaluate it using the same 4 state elections as above, based on 5 surveys. Table 4 shows the modal results vis-a-vis survey results for the 4 states, for both vote share and seat share. We can see that the approximate posterior mode is quite accurate for vote share, but not very accurate in terms of seat share. In Fig 6, we plot the synthetic posterior PDF for the first, second and third parties (both vote share and seat share) conditioned on the 5 survey results for the elections. We find that in each case, the modes for the parties' curves are in the correct order of their actual performance, though there are significant variances, which means there is some probability that the results may have been different. For Tripura, the variances are very small and modes very close, while for Gujarat and Karnataka the seat share variance is quite large for $P1$.

Figure 6: Synthetic Posterior distributions for the performance of each party individually, in terms of vote share and seat share, conditioned on 5 surveys for the 4 state elections.

7 Impact of Survey Settings on Posterior

The two key contributions of this paper are the survey model and the synthetic posterior likelihood. In Sections 5.1 and 5.2 we have seen some experimental validation of these, for both synthetic and real election data. The main parameters of the survey model are $n$, the fraction of the population that is covered, and $s$, the fraction of the districts that is covered. The synthetic likelihood construction process utilizes both of these parameters in simulating the surveys. In addition, it involves two more factors: i) the number of surveys available, and ii) the choice of the prior distribution $g(X_1)$. Here, we discuss the roles of these parameters and factors in greater detail. Figures 2 and 3 of the main paper show the impact of $n$ and $s$ on the performance of the survey model. Here we examine how they impact the posterior distribution.

7.1 Prior distribution

First of all, we consider the prior distribution $g(X_1)$. This is a Dirichlet distribution with hyperparameters $(\gamma_1,\dots,\gamma_K)$. These hyperparameters may indicate our prior belief on the relative performance of the different parties, possibly based on past election performance. The prior mode is then $\left\{\frac{\gamma_1-1}{\sum_k \gamma_k - K},\dots,\frac{\gamma_K-1}{\sum_k \gamma_k - K}\right\}$. But if a survey $Y$ provides results which are significantly different from the prior mode, then there is a contradiction. This can easily happen if an election has significantly different, even reverse, results compared to the previous one. In such a case, for any candidate outcome $X$, the prior $r(X)$ and the likelihood $q(Y|X)$ can be contradictory (one high, the other low). To understand what happens to the posterior in that case, we carried out simulation studies for the Himachal Pradesh state elections (details in Table 1, main paper). The survey settings were kept constant at $(n=0.01, s=0.25)$ and 5 surveys were considered. Five different values of the prior hyperparameters $\gamma$ were considered, and the synthetic posterior density was estimated for 100 candidate solutions (including the actual one) in each case.
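As a small worked illustration of the prior mode formula above, the following sketch evaluates it for the five hyperparameter settings that appear in the table on prior hyperparameters. The function name `dirichlet_mode` is introduced here for illustration only and is not part of the authors' implementation.

```python
def dirichlet_mode(gamma):
    """Mode of a Dirichlet(gamma) distribution.

    Uses (gamma_k - 1) / (sum_k gamma_k - K), which is valid only
    when every hyperparameter is strictly greater than 1.
    """
    if any(g <= 1 for g in gamma):
        raise ValueError("mode formula requires all hyperparameters > 1")
    K = len(gamma)
    denom = sum(gamma) - K
    return [(g - 1) / denom for g in gamma]


# The five prior settings considered for Himachal Pradesh
settings = [(6, 12, 2), (8, 10, 2), (9, 9, 2), (10, 8, 2), (12, 6, 2)]
for gamma in settings:
    mode = dirichlet_mode(gamma)
    print(gamma, [round(m, 3) for m in mode])
```

For example, the setting (8,10,2) places the prior mode at (7/17, 9/17, 1/17), so the prior favours P2, whereas the surveys and the true result favour P1. This is the kind of prior-likelihood conflict discussed above.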

Figure 1 shows the relation between the posterior density of various candidate results and their K-L Divergence with the true result (Table 1, main paper). This is done using all 5 prior hyperparameter settings. In general, we desire that the candidate outcomes more similar to the actual outcome should have higher posterior density. Figure 1 shows that this property generally holds for all prior settings. However, if we consider only 1 survey (instead of 5), then the results become more sensitive to the prior hyper-parameters. In fact, we find that many candidate outcomes with varying similarities with the actual results have comparable posterior density, indicating a multi-modal posterior. This is demonstrated in Table 2, which shows the top candidate outcomes in each prior hyperparameter setting based on 1 survey. We find that this top candidate result is somewhat different from the actual result in each case. However, we also see in Table 2 that the density at the actual result also changes with prior hyperparameters, and it is maximum when the hyperparameters for top-2 parties are equal.
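The distance used on the horizontal axis of this comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the direction KL(actual || candidate), the small smoothing constant `eps` for zero shares, and the use of vote shares only are choices made here, since the text does not specify them.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two share vectors.

    eps-smoothing (an assumption of this sketch) keeps the value finite
    when a party has zero share in either vector.
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Actual Himachal Pradesh vote shares and a few candidate vote shares
actual_votes = (0.45, 0.43, 0.12)
candidates = [(0.46, 0.43, 0.11), (0.49, 0.41, 0.10), (0.51, 0.40, 0.09)]
for c in candidates:
    print(c, round(kl_divergence(actual_votes, c), 5))
```

A well-behaved posterior should assign higher density to candidates with smaller divergence from the actual result.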

Figure 7: Posterior Likelihood of candidate results versus their distance (K-L divergence) from the actual results in case of the 5 prior hyperparameter settings.

7.2 Number of Surveys

Figure 8: Impact of increasing the number of surveys on the posterior likelihood (left). Impact of increasing the fraction $s$ of districts covered by surveys on the posterior likelihood (right).

The number of surveys on which the posterior inference is based is another important factor determining the quality of the inference. This is particularly true for very close elections, where different surveys can yield different or even opposite results even if they cover a reasonable number of districts and persons. We tested the case of the Himachal Pradesh assembly election, where P1 had a small advantage over P2 in terms of vote share, but a big advantage in terms of seat share (Table 1 of main paper). We estimated the posterior density using 2, 5, 10 and 20 surveys, and the results are shown in Figure 2. With a smaller number of surveys, many candidate outcomes have high posterior density though they are quite far (in terms of KL-divergence) from the actual outcome. This is confirmed by the top-10 candidate outcomes shown in Table 3 (top), which include many cases where P2 outperforms P1. However, this situation improves when we consider more surveys, as the candidate outcomes that are far from the actual results have progressively lower posterior density (Fig 8). The top-10 candidate outcomes for 20 surveys, shown in Table 3 (bottom), also include many cases which are close to the actual results.

Hyperparameters   Survey Result       Most likely outcome   Actual Result Density
(6,12,2)          (0.48,0.42,0.1)     (0.46,0.43,0.11)      6.41
                  (0.76,0.16,0.08)    (0.59,0.4,0.01)
(8,10,2)          (0.48,0.42,0.1)     (0.49,0.41,0.1)       7.29
                  (0.76,0.16,0.08)    (0.72,0.28,0)
(9,9,2)           (0.48,0.42,0.1)     (0.48,0.42,0.1)       7.5
                  (0.76,0.16,0.08)    (0.63,0.37,0)
(10,8,2)          (0.48,0.42,0.1)     (0.51,0.4,0.09)       7.48
                  (0.76,0.16,0.08)    (0.65,0.31,0.04)
(12,6,2)          (0.48,0.42,0.1)     (0.49,0.42,0.09)      6.5
                  (0.76,0.16,0.08)    (0.78,0.22,0)
Table 4: Impact of prior hyperparameters on posterior density of vote share (above) and seat share (below). True result: $(0.45,0.43,0.12)/(0.59,0.37,0.04)$
2 surveys                              20 surveys
Vote Share        Seat Share           Vote Share        Seat Share
P1   P2   P3      P1   P2   P3         P1   P2   P3      P1   P2   P3
0.44 0.47 0.09    0.38 0.54 0.07       0.44 0.45 0.11    0.46 0.44 0.1
0.45 0.46 0.09    0.37 0.56 0.07       0.46 0.44 0.1     0.57 0.43 0
0.41 0.47 0.12    0.4  0.49 0.12       0.4  0.45 0.15    0.44 0.47 0.09
0.4  0.51 0.08    0.37 0.57 0.06       0.47 0.39 0.13    0.65 0.28 0.07
0.42 0.43 0.15    0.44 0.47 0.09       0.41 0.49 0.1     0.38 0.56 0.06
0.41 0.44 0.14    0.46 0.49 0.06       0.47 0.46 0.08    0.56 0.44 0
0.44 0.41 0.14    0.51 0.44 0.04       0.53 0.37 0.1     0.68 0.28 0.04
0.48 0.44 0.07    0.57 0.4  0.03       0.47 0.45 0.08    0.57 0.43 0
0.48 0.45 0.06    0.59 0.38 0.03       0.42 0.46 0.11    0.37 0.62 0.01
0.41 0.53 0.06    0.41 0.59 0          0.48 0.43 0.08    0.65 0.35 0
Table 5: Top-10 most likely candidate outcomes using 2 (left) or 20 (right) surveys for Himachal Pradesh Assembly Election, with $n=0.01, s=0.25$

7.3 District and Person Coverage

Next, we consider the number of districts covered in the survey. We vary this fraction $s$ as $\{0.1, 0.25, 0.5, 0.75\}$, keeping the person coverage constant at $1\%$, with an equal number of persons being queried in each of the chosen districts. The prior hyperparameters are kept at $(8,10,2)$ and the number of surveys is 1. Our aim is to see how the posterior distribution of the candidate outcomes varies as the $s$ parameter is changed. We study this for the case of Himachal Pradesh, which had a very close election with P1 and P2 having very close vote shares. The results are shown in Figure 8, where we see that when the district coverage is low, many candidate outcomes which are far from the actual outcome also have quite high posterior likelihood. However, such cases become rarer as we increase the district coverage. Finally, we consider the number of persons covered in the survey. We vary this fraction $n$ as $\{0.001, 0.01, 0.05, 0.1\}$, keeping the district coverage constant at $0.25$, with an equal number of persons being queried in each of the chosen districts. In Fig 9, we find that increasing $n$ does not have a very clear impact on the posterior density in case of Himachal Pradesh, where the election was very close. But for Karnataka, where the winning margin for P1 was larger, increasing the person coverage has a clearer impact, as candidate outcomes further away from the actual results have lower posterior density.
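To make the roles of $n$ and $s$ concrete, the sketch below simulates one survey on a toy election. It is an illustration under assumptions, not the authors' survey model: the election is represented as a hypothetical table `district_votes` of per-district vote counts, respondents are drawn with replacement in proportion to the true district votes, and a surveyed district is credited to the party with the most responses there.

```python
import random


def simulate_survey(district_votes, n, s, rng):
    """Simulate one survey covering a fraction s of districts and n of electors.

    district_votes[d][k] is the number of votes for party k in district d.
    Returns (vote_share, seat_share) as estimated from the survey responses.
    """
    S = len(district_votes)
    K = len(district_votes[0])
    N = sum(sum(row) for row in district_votes)
    n_districts = max(1, round(s * S))
    # Equal number of respondents in each chosen district
    per_district = max(1, round(n * N / n_districts))
    chosen = rng.sample(range(S), n_districts)
    responses = [0] * K
    seats = [0] * K
    for d in chosen:
        # Assumption: sampling with replacement, proportional to true votes
        picks = rng.choices(range(K), weights=district_votes[d], k=per_district)
        counts = [picks.count(k) for k in range(K)]
        for k in range(K):
            responses[k] += counts[k]
        # The district is credited to the party leading among respondents
        seats[counts.index(max(counts))] += 1
    total = sum(responses)
    vote_share = [r / total for r in responses]
    seat_share = [x / n_districts for x in seats]
    return vote_share, seat_share


# Toy election with 8 districts and 3 parties (hypothetical numbers)
toy = [[520, 400, 80], [450, 470, 80], [600, 300, 100], [430, 440, 130],
       [480, 460, 60], [390, 500, 110], [510, 420, 70], [470, 450, 80]]
print(simulate_survey(toy, n=0.01, s=0.25, rng=random.Random(0)))
```

With low district coverage, the seat share estimate rests on very few districts, which is one way to see why many distant candidate outcomes remain plausible when $s$ is small.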

8 Baseline Methods

We have not provided any comparisons of our proposed approach in the main paper, because there are no known works with the same aim as ours. However, if our aim is restricted to obtaining a distribution over the possible outcomes (vote share and seat share), one possibility is to use Bayesian prior-posterior analysis using Dirichlet distributions. We may consider two separate Dirichlet priors for $X_1$ and $X_2$, i.e. $X_1 \sim Dir(\alpha_0), X_2 \sim Dir(\beta_0)$. The survey results, i.e. the total number of responses $Y_1 = (\{Y_{11},\dots,Y_{1K}\}) * N * n$ obtained for the different parties, and the number of seats dominated by each party based on the responses, $Y_2 = (\{Y_{21},\dots,Y_{2K}\}) * S * s$, are both considered to follow Multinomial distributions. It is well known that the Dirichlet distribution is a conjugate prior for the Multinomial, and so $X_1|Y_1 \sim Dir(\alpha_0+Y_1), X_2|Y_2 \sim Dir(\beta_0+Y_2)$.
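The conjugate update used by this baseline can be sketched in a few lines. The prior parameters and survey counts below are hypothetical values chosen only for illustration; the helper names are likewise introduced here and do not come from the authors' code.

```python
import math


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the simplex (all x_k > 0)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * math.log(xk) for a, xk in zip(alpha, x))


def posterior_params(prior, counts):
    """Dirichlet-Multinomial conjugacy: posterior parameters are prior + counts."""
    return [a + c for a, c in zip(prior, counts)]


# Hypothetical priors and survey counts, for illustration only
alpha0 = [8, 10, 2]      # prior on vote share X1
beta0 = [1, 1, 1]        # prior on seat share X2
Y1 = [480, 420, 100]     # responses per party
Y2 = [11, 3, 1]          # surveyed districts led by each party

alpha_post = posterior_params(alpha0, Y1)
beta_post = posterior_params(beta0, Y2)

# The baseline treats X1 and X2 as independent, so their log densities add
candidate_X1 = (0.45, 0.43, 0.12)
candidate_X2 = (0.59, 0.37, 0.04)
log_post = (dirichlet_log_pdf(candidate_X1, alpha_post)
            + dirichlet_log_pdf(candidate_X2, beta_post))
print(alpha_post, beta_post, log_post)
```

Because the vote share posterior is built from far more observations than the seat share posterior, its log density varies over a much wider range, so it tends to dominate the sum.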

Figure 9: Impact of increasing the fraction $n$ of population covered by surveys on the posterior likelihood in case of Himachal Pradesh (left) and Karnataka (right).

The problem with this approach is that the vote share and seat share are considered independent of each other, which they clearly are not. Consequently, a candidate result $X$ whose $X_1$ and $X_2$ components are not compatible with each other, i.e. the seat share $X_2$ is impossible given the vote share $X_1$, can still have a high posterior density, as the densities of $X_1|Y_1$ and $X_2|Y_2$ can each be reasonably high individually. Furthermore, the density of $X_1$ tends to be numerically much larger than that of $X_2$, and hence a candidate result can have high posterior density based only on $X_1$, even if the density of $X_2|Y_2$ is low. We observe this effect in case of the Tripura Assembly Election (Table 1), where the top-10 (highest posterior density) candidate results obtained by the proposed Synthetic Likelihood approach are compared with the top-10 candidate results obtained from this Dirichlet-Multinomial baseline. We see that in the latter case, almost all the results have 0 seats for P3, whereas in reality it had 13 out of 60 seats.

One way to offset this situation is to apply scaling factors to the Multinomial likelihood model, so that the posterior density values of $X_1$ and $X_2$ are comparable. This, however, introduces the reverse problem: there are many candidate solutions where P2 or P3 have a very large number of seats or vote share.

Similar results were obtained for the other states (Himachal Pradesh, Gujarat, Karnataka) as well. These results emphasize that it is extremely important to include $X_1$ as a condition for the distribution of $X_2|Y_2$. But the relation between $X_1$ and $X_2$ cannot be expressed using any single distribution, which is why we must use a simulation-based approach as done in this work.

Proposed model
Vote Share          Seats
P1    P2    P3      P1   P2   P3    Log-Density
0.41  0.36  0.22    55   4    1     25.44
0.41  0.39  0.20    26   29   5     25.34
0.38  0.36  0.25    33   24   3     24.86
0.48  0.34  0.18    38   21   1     23.12
0.48  0.30  0.22    42   13   5     22.29
0.43  0.36  0.21    27   30   3     21.62
0.49  0.33  0.18    33   23   4     20.7
0.38  0.37  0.25    19   32   9     19.02
0.43  0.33  0.24    48   11   1     18.78
0.43  0.32  0.25    35   13   12    18.47

Dirichlet-Multinomial baseline
Vote Share          Seats
P1    P2    P3      P1   P2   P3    Log-Density
0.43  0.38  0.19    45   15   0     -103.35
0.41  0.39  0.20    27   33   0     -111.61
0.42  0.40  0.18    43   17   0     -203.19
0.40  0.41  0.192   42   18   0     -226.23
0.44  0.36  0.20    36   24   0     -231.01
0.42  0.40  0.18    43   17   0     -278.79
0.42  0.40  0.18    30   30   0     -309.60
0.45  0.35  0.20    49   10   1     -316.03
0.44  0.35  0.21    57   1    2     -316.24
0.45  0.36  0.20    60   0    0     -325.54
Table 6: Top-10 candidate outcomes (with posterior density) obtained from the proposed model (above) and Dirichlet-Multinomial baseline (below) in case of Tripura Assembly Election, based on 5 surveys covering $1\%$ of people and $25\%$ of the districts

9 Post-facto Survey Evaluation

We finally validate the analysis of Section 4.2, which examines the validity of surveys once the actual result of the election is known. We compare three kinds of surveys: genuine, fake and malicious. Genuine surveys $Y_{gen}$ are generated by running the survey model on the actual complete election data $Z$. For fake surveys, a fake election $Z_{fake}$ is first generated by sampling a vote share $X_{fake}$ from the prior distribution and then applying an election model to it. The survey model is then applied to obtain $Y_{fake}$. In case of malicious surveys, the true result is intentionally skewed towards one party: $Y_{mal}$ is obtained by linearly combining $Y_{gen}$ with $Y_k$, where the entire vote is in favour of party $k$ (chosen randomly).
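The construction of a malicious survey can be illustrated as follows. This is a sketch under an assumption: the text says only that $Y_{gen}$ and $Y_k$ are linearly combined, so the convex mixing weight `lam` (which keeps the result a valid share vector) is a hypothetical parameter introduced here.

```python
def malicious_survey(y_gen, k, lam):
    """Skew a genuine survey result towards party k.

    Returns (1 - lam) * y_gen + lam * y_k, where y_k gives the entire
    vote to party k. lam in [0, 1] is a hypothetical mixing weight:
    0 leaves the survey genuine, 1 gives everything to party k.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    y_k = [1.0 if j == k else 0.0 for j in range(len(y_gen))]
    return [(1 - lam) * g + lam * e for g, e in zip(y_gen, y_k)]


# Genuine vote share estimate skewed towards the first party.
# In the text the favoured party is chosen at random; it is fixed here.
y_gen = [0.44, 0.45, 0.11]
y_mal = malicious_survey(y_gen, k=0, lam=0.2)
print([round(v, 3) for v in y_mal])
```

With these illustrative numbers, the skew reverses the order of the top two parties, which is the kind of distortion the likelihood ratios are meant to flag.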

Actual (vote share / seat share)     Genuine   Fake   Malicious
Nonparametric Likelihood Ratio
(0.4,0.35,0.25) / (0.4,0.4,0.2)      5.0       0.08   3.5
(0.4,0.35,0.25) / (0.8,0.2,0)        11.7      2.23   1.29
Likelihood Mode Ratio
(0.4,0.35,0.25) / (0.4,0.4,0.2)      0.23      0.11   0
(0.4,0.35,0.25) / (0.8,0.2,0)        1.0       0      0
Table 7: Survey Evaluation on synthetic election results simulated by SPM Election Model. Top: Nonparametric Likelihood Ratio, Bottom: Likelihood Mode Ratio (average of 100 surveys in each category)

We first consider the synthetic election with $N=10000$ voters, $S=5$ districts and $K=3$ parties. We consider two cases: one where $X_1=(0.4,0.35,0.25), X_2=(0.4,0.4,0.2)$ and another where $X_1=(0.4,0.35,0.25), X_2=(0.8,0,0.2)$. For the three categories of surveys (genuine, fake, malicious), we compare both the nonparametric posterior likelihood and the posterior modal likelihood. The results are shown in Table 4. We clearly see that in every case, the genuine surveys have a significantly higher likelihood ratio or posterior modal ratio compared to the other surveys.

Next, we move to the real data. We sample 100 surveys from each of the above 3 categories, for each of the 4 states. In each case, we calculate both the nonparametric likelihood ratio and the likelihood mode ratio as discussed in Section 4.2. The mean results are reported in Table 5. Once again, we find that the genuine surveys have a markedly higher likelihood ratio than the fake or malicious ones. In case of Tripura, the modal ratio is quite low even for the genuine surveys, because the actual results could not be simulated accurately by any of the election models.

State       Genuine   Fake      Malicious
Nonparametric Likelihood Ratio
Tripura     10.9      2.5       1.6
Himachal    17.1      0.72      3.11
Gujarat     4.62      1.82      3.54
Karnataka   30.3      1.42      6.66
Likelihood Mode Ratio
Tripura     0.07      0.02      0
Himachal    0.41      0.00004   0.0012
Gujarat     0.13      0         0.0004
Karnataka   0.05      0.002     0
Table 8: Top: Nonparametric Likelihood Ratio, Bottom: Likelihood Mode Ratio (average of 100 surveys in each of the 3 categories)

10 Computational Complexity

One of the concerns with the proposed approach is its computational complexity, since it is based on Monte Carlo simulations. In particular, we need a large number of samples of elections and surveys to compute the Synthetic Dirichlet parameters. To calculate $p(Y|X)$ we use $M$ samples of complete elections $\{Z_1,\dots,Z_M\}$ (Eq 2 of main paper), which are obtained using the Election Model. From each of these samples, we again draw $L$ sample surveys to calculate the Dirichlet parameters $\alpha(Z_i),\beta(Z_i)$ (Eq 4 of main paper). If we consider $t_1$ as the time to generate one election $Z_i$, $t_2(L)$ as the time to generate $L$ sample surveys from $Z_i$, and $t_3$ as the time to estimate $\alpha(Z_i),\beta(Z_i)$ from the sample surveys and calculate the Dirichlet density at $Y$ using them, then the total time to compute $p(Y|X)$ for a given $(X,Y)$ is $M*t_1 + M*(t_2(L)+t_3)$.
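The step that turns $L$ simulated surveys into Dirichlet parameters can be sketched with a simple moment-matching estimator. This is an assumption made for illustration: the estimator actually used is defined in Eq 4 of the main paper, which is not reproduced here, and may differ (for example, a maximum-likelihood fit). The helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def dirichlet_moment_fit(samples):
    """Moment-matching estimate of Dirichlet parameters from share vectors.

    Uses E[x_k] = alpha_k / alpha_0 and
    Var[x_k] = m_k (1 - m_k) / (alpha_0 + 1), averaging the implied
    alpha_0 over the components with non-zero sample variance.
    """
    L = len(samples)
    K = len(samples[0])
    mean = [sum(x[k] for x in samples) / L for k in range(K)]
    var = [sum((x[k] - mean[k]) ** 2 for x in samples) / (L - 1)
           for k in range(K)]
    precisions = [mean[k] * (1 - mean[k]) / var[k] - 1
                  for k in range(K) if var[k] > 0]
    if not precisions:
        raise ValueError("samples have no variance; cannot fit")
    alpha0 = sum(precisions) / len(precisions)
    return [m * alpha0 for m in mean]


# Stand-in for L = 100 simulated survey vote shares from one election Z_i
rng = random.Random(0)
surveys = [sample_dirichlet([40.0, 35.0, 25.0], rng) for _ in range(100)]
print([round(a, 1) for a in dirichlet_moment_fit(surveys)])
```

The fitted parameters then define the synthetic Dirichlet density that is evaluated at the observed survey $Y$.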

In our implementation, we consider $M=200, L=100$. The election model complexity $t_1$ is linear in $N$ (total number of electors), but a lot of speedup is achievable by considering the electors in batches. This partly compromises the SPM/ECM model, which assigns the district $Z_i$ to each elector $i$ by sampling from the distribution $p(Z_i|Z_1,\dots,Z_{i-1})$. The distribution changes for each elector $i$, based on the $(i-1)$ electors before them. But here we consider batches of size $b$, where we sample the districts for $b$ electors simultaneously using a Multinomial distribution based on the assignments of the electors before them. This parallelization significantly reduces $t_1$. For $N = 10$ million, $S=100$, $b=100$, we have $t_1 = 1.5$ secs, while $t_2 = 0.25$ secs for $n=0.01$ on an Intel i5 core (10th generation) processor using 8GB RAM. $t_2$ scales linearly with $N*n$. $t_3$ turns out to be just $0.05$ seconds. So for the given configuration, calculating $p(Y|X)$ takes about 5 minutes for a given $(X,Y)$.
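The cost formula can be checked against the reported timings with a short calculation. The sketch assumes that the quoted $t_2 = 0.25$ secs is the time $t_2(L)$ for all $L$ surveys drawn from one election, which the text does not state explicitly; the function name is introduced here for illustration.

```python
def likelihood_time(M, t1, t2_L, t3):
    """Total seconds to compute p(Y|X) once: M * t1 + M * (t2(L) + t3)."""
    return M * t1 + M * (t2_L + t3)


# Reported configuration: M = 200 elections, t1 = 1.5 s, t3 = 0.05 s.
# Assumption: t2_L = 0.25 s covers all L = 100 surveys of one election.
total_seconds = likelihood_time(M=200, t1=1.5, t2_L=0.25, t3=0.05)
print(total_seconds, "seconds, i.e.", total_seconds / 60, "minutes")
```

Under this reading the total is roughly 360 seconds, the same order as the figure of a few minutes quoted above, with election generation ($M*t_1$) accounting for most of the time.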

11 Discussions and Conclusion

While much of the past work on election prediction from surveys focuses on prediction of the winner, there have been relatively few works on predicting the number of seats or votes won by different parties in a multi-party, multi-district setting. This work provides a probability distribution on these, and also on the possible performance of individual parties. Furthermore, we provide a way to evaluate the feasibility of survey results once the actual results are known. This approach can be very useful in bringing scientific accuracy to the process of large-scale opinion polling and in identifying fraudulent or dubious surveys. The unique feature of this work is that it performs extensive simulations based on actual elections involving millions of people. While much of the work presented here is based on Monte Carlo simulations and Approximate Bayesian Computation, our next aim is to provide some theoretical guarantees regarding the actual results on the basis of surveys. We have not provided any comparison of our proposed method, since there is no known approach to achieve the same target. However, in the Appendix, we discuss possible alternatives and their shortcomings.

References

  • [2] Basu, S., and Chib, S. Marginal likelihood and bayes factors for dirichlet process mixture models. Journal of the American Statistical Association 98, 461 (2003), 224–235.
  • [3] Berg, S., and Lepelley, D. On probability models in voting theory. Statistica Neerlandica 48, 2 (1994), 133–146.
  • [4] Bhattacharyya, A., and Dey, P. Predicting winner and estimating margin of victory in elections using sampling. Artificial Intelligence 296 (2021), 103476.
  • [5] Braha, D., and De Aguiar, M. A. Voting contagion: Modeling and analysis of a century of us presidential elections. PloS one 12, 5 (2017), e0177970.
  • [6] Brooks, C., Nieuwbeerta, P., and Manza, J. Cleavage-based voting behavior in cross-national perspective: Evidence from six postwar democracies. Social Science Research 35, 1 (2006), 88–128.
  • [7] Dawkins, C. J. Measuring the spatial pattern of residential segregation. Urban Studies 41, 4 (2004), 833–851.
  • [8] Dawkins, C. J. Space and the measurement of income segregation. Journal of Regional Science 47, 2 (2007), 255–272.
  • [9] Dwi Prasetyo, N., and Hauff, C. Twitter-based election prediction in the developing world. In Proceedings of the 26th ACM Conference on Hypertext & Social Media (2015), pp. 149–158.
  • [10] Fearnhead, P., and Prangle, D. Constructing summary statistics for approximate bayesian computation: semi-automatic approximate bayesian computation. Journal of the Royal Statistical Society Series B: Statistical Methodology 74, 3 (2012), 419–474.
  • [11] Grazian, C., and Fan, Y. A review of approximate bayesian computation methods via density estimation: Inference for simulator-models. Wiley Interdisciplinary Reviews: Computational Statistics 12, 4 (2020), e1486.
  • [12] Kar, D., Dey, P., and Sanyal, S. Sampling-based winner prediction in district-based elections. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (2023), pp. 2661–2663.
  • [13] Kennedy, R., Wojcik, S., and Lazer, D. Improving election prediction internationally. Science 355, 6324 (2017), 515–520.
  • [14] Leigh, A., and Wolfers, J. Competing approaches to forecasting elections: Economic models, opinion polling and prediction markets. Economic Record 82, 258 (2006), 325–340.
  • [15] Minka, T. Estimating a dirichlet distribution, 2000.
  • [16] Mitra, A. Electoral david-vs-goliath: probabilistic models of spatial distribution of electors to simulate district-based election outcomes. In 2021 Winter Simulation Conference (WSC) (2021), IEEE, pp. 1–12.
  • [17] Mitra, A. Agent-based simulation of district-based elections with heterogeneous populations. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (2023), pp. 2730–2732.
  • [18] Perse, E. M., and Lambe, J. Media effects and society. Routledge, 2016.
  • [19] Pitman, J. Exchangeable and partially exchangeable random partitions. Probability theory and related fields 102, 2 (1995), 145–158.
  • [20] Pritchard, G., and Wilson, M. C. Multi-district preference modelling. Quality & Quantity 57, 1 (2023), 587–613.
  • [21] Thomas, O., Dutta, R., Corander, J., Kaski, S., and Gutmann, M. U. Likelihood-free inference by ratio estimation. Bayesian Analysis 17, 1 (2022), 1–31.