EnSolver: Uncertainty-Aware Ensemble CAPTCHA Solvers with Theoretical Guarantees

Duc C. Hoang
Department of Computer Science
National University of Singapore, Singapore

Behzad Ousat
Knight Foundation School of Computing and Information Sciences
Florida International University, USA

Amin Kharraz
Knight Foundation School of Computing and Information Sciences
Florida International University, USA

Cuong V. Nguyen
Department of Mathematical Sciences
Durham University, UK

Abstract

The popularity of text-based CAPTCHAs as a security mechanism to protect websites from automated bots has prompted research on CAPTCHA solvers, with the aim of understanding their failure cases and subsequently making CAPTCHAs more secure. Recently proposed solvers, built on advances in deep learning, are able to crack even very challenging CAPTCHAs with high accuracy. However, these solvers often perform poorly on out-of-distribution samples that contain visual features different from those in the training set. Furthermore, they lack the ability to detect and avoid such samples, making them susceptible to being locked out by defense systems after a certain number of failed attempts. In this paper, we propose EnSolver, a family of CAPTCHA solvers that use deep ensemble uncertainty to detect and skip out-of-distribution CAPTCHAs, making them harder to detect. We prove novel theoretical bounds on the effectiveness of our solvers and demonstrate their use with state-of-the-art CAPTCHA solvers. Our experiments show that the proposed approaches perform well when cracking CAPTCHA datasets that contain both in-distribution and out-of-distribution samples. The source code for this paper is available at: https://github.com/HoangCongDuc/ensolver.git.

Keywords: captcha solver, ensemble method, uncertainty estimation, theoretical bounds

1 Introduction

Automated web bots are becoming increasingly sophisticated at imitating human behavior and evading detection. In the most consequential cases, these adversarial operations can result in credential stuffing, account hijacking, data breaches, and vulnerability scanning and exploitation. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) (Von Ahn et al., 2003) has been a common security mechanism to defend against malicious automated bot activities. That is, the website generates a CAPTCHA challenge and requests the remote agent to solve it. The core insight is that solving a CAPTCHA is straightforward for real users but difficult for automated bots. Among different types of CAPTCHAs, text-based CAPTCHAs (Gao et al., 2016) present users with a noisy image of a short text string and ask them to enter the correct string. They are among the most popular types of CAPTCHAs due to their simple implementation and user-friendliness (Deng et al., 2022).

In this adversarial landscape, as text-based CAPTCHAs became more popular, many automatic solving techniques have also been developed to evade detection. These techniques are interesting from the security viewpoint as they help researchers understand the weaknesses of these CAPTCHAs and subsequently make them more secure (Tang et al., 2018). With recent advances in deep learning and computer vision, state-of-the-art text-based CAPTCHA solvers often employ an end-to-end approach, where a sophisticated deep learning model will predict the output text string directly from the raw pixels of the input image (Deng et al., 2022; Ousat et al., 2024). Although these solvers can crack very challenging CAPTCHAs, they are unable to make correct predictions on out-of-distribution samples (i.e., images that are visually different from those in their training set). By exploiting this limitation, defense systems can detect these automatic CAPTCHA solvers and lock their access if a solver fails a certain number of attempts. Thus, from the attacker’s perspective, it is desirable to equip the solvers with the ability to recognize and skip these out-of-distribution samples and request a new CAPTCHA to increase their success rate.

In this paper, we propose EnSolver (Ensemble Solver), a simple end-to-end text-based CAPTCHA solver that is capable of detecting and skipping out-of-distribution CAPTCHAs. EnSolver uses a deep ensemble (Lakshminarayanan et al., 2017) to quantify its predictive uncertainty, subsequently deciding whether to answer a given CAPTCHA or to skip and request a new one. More specifically, EnSolver computes a distribution over predicted text strings and uses this distribution to predict the final text string along with its uncertainty estimate. The solver then decides whether to skip the input image based on this quantity. To use EnSolver in a realistic scenario, we extend it to LEnSolver (Limited-skip EnSolver), where we allow the solver to skip a maximum number of times before forcing it to make a prediction. Our approach is general and can be used with any type of base model in the deep ensemble.

Besides the general algorithmic framework, we also develop new theoretical properties for our EnSolver and LEnSolver approaches. In particular, we prove a lower bound on the rate at which the EnSolver models make the right decisions and a lower bound on the success rate of the LEnSolver models. Both of these bounds depend on a novel quantity, called the out-of-distribution error bound, that is defined in terms of the ensemble size, the output domain size, and the uncertainty threshold of the solver. This out-of-distribution error bound can be shown to upper bound the EnSolver’s error rate on out-of-distribution data as well as the EnSolver’s wrong prediction rate on in-distribution data. This quantity is usually very small in real CAPTCHA solving problems, thus helping us control the error rates of our solvers.

We apply our approaches to two state-of-the-art types of text-based CAPTCHA solvers: the 3E-Solver (Deng et al., 2022) and the object detection-based solver (Ousat et al., 2024), both of which employ a complex deep learning architecture without any uncertainty estimation mechanism. To train our solvers, we also construct a new CAPTCHA dataset that contains bounding box labels, which are required to train object detection models. We evaluate the good decision and success rates of our solvers on datasets containing both in-distribution and out-of-distribution data, where the latter are collected from eight different public CAPTCHA datasets (Deng et al., 2022). The experiment results show that our solvers are consistently better than the baselines, and the theoretical lower bounds can give us a good idea of the actual empirical performance in practice.

Contributions. Our paper makes the following contributions:

  1. We propose EnSolver and LEnSolver, novel uncertainty-aware CAPTCHA solvers that use deep ensemble uncertainty to avoid making wrong predictions on out-of-distribution inputs (Section 3).

  2. We prove new theoretical lower bounds on the effectiveness of our solvers (Section 4) and show through experiments that these bounds can be good indicators of our solvers’ empirical performance in practice (Section 5).

  3. We demonstrate the use of our approaches on two state-of-the-art CAPTCHA solvers and show empirically that our solvers can achieve high success rates compared to the two original baselines (Section 5).

Figure 1: The main component of EnSolver that predicts an output string together with an associated uncertainty level given an input CAPTCHA image. The input image is first fed into each base model, each of which produces a string as output. The output strings form a distribution, which is used to compute the final prediction and the uncertainty level.

2 Related Work

CAPTCHA Solvers. Early solvers often use a segmentation-based approach consisting of two main stages: character segmentation and character recognition (Chellapilla and Simard, 2004; Yan and El Ahmad, 2007, 2008). Yan and El Ahmad (2007) used the color difference to localize characters in simple 2-color CAPTCHAs. Their subsequent work (Yan and El Ahmad, 2008) used a histogram of foreground pixels to vertically segment the characters into chunks and then find the connected components to get individual characters. Once the characters are segmented, it is easy to recognize them individually using a neural network (Chellapilla and Simard, 2004; Kopp et al., 2017). Since segmentation is a crucial stage in many solvers, a number of anti-segmentation features (e.g., character overlapping, hollow scheme, noise arcs, and complicated background) have been employed to make CAPTCHAs more resistant against such solvers (Tang et al., 2018). To bypass these anti-segmentation features, various preprocessing techniques were then developed. For example, Gao et al. (2016) proposed using image binarization, noise arc removal, and image rectification before segmentation, while Ye et al. (2018) used an image-to-image translation model (Isola et al., 2017) to turn a noisy input CAPTCHA into an easier image for the downstream solver. Tian and Xiong (2020) used three generator networks to decompose an input CAPTCHA into a background and a character layer.

Recently, end-to-end solvers have been proposed that use a single deep learning model to solve a CAPTCHA directly from the input image without any segmentation. Noury and Rezaei (2020) proposed a CNN-based solver that has multiple character classification heads, each of which is responsible for predicting one character in the CAPTCHA image. However, this model can only predict CAPTCHAs of fixed length. This problem was overcome by Tian and Xiong (2020) using a null character class. Li et al. (2021) used a convolutional recurrent neural network to train a solver on cycle-GAN generated data and then employed active transfer learning to optimize it on real schemes using a small labeled dataset. Deng et al. (2022) proposed a semi-supervised solver based on the encoder-decoder architecture and attention mechanism, which requires only a small portion of the training dataset to be labeled. Another recent work by Ousat et al. (2024) used an object detection model to localize and predict each character in a given CAPTCHA simultaneously.

In this paper, we not only propose a new method for solving text-based CAPTCHAs but also develop a rigorous mathematical theory for this important problem. To the best of our knowledge, the only previous work that provided some theoretical analyses for CAPTCHA systems is Li and Liao (2018), where a game-theoretical approach was used to analyze the interactions between the defender and the attacker as well as the benefits of human solvers alongside machine solvers. Our work significantly differs from theirs since we only consider machine solvers and prove the effectiveness of these solvers using probability theory. Our theoretical results are the first of this kind for CAPTCHA solvers.

Uncertainty Estimation and Out-of-distribution Detection. Our paper is related to uncertainty estimation and out-of-distribution detection (Abdar et al., 2021). Here we review only the latest related work for deep learning models. Previous works (Guo et al., 2017; Pereyra et al., 2017) have shown that modern deep learning models often do not provide well-calibrated uncertainty, in the sense that they tend to make incorrect predictions on out-of-distribution data with high confidence. Several lines of work have been proposed to improve uncertainty estimation for deep learning. For example, Bayesian neural networks (Neal, 1995) provide a principled framework for studying uncertainty in deep learning. However, exact Bayesian inference is intractable for modern deep learning architectures and approximate inference is usually required (Chen et al., 2014; Blundell et al., 2015; Gal and Ghahramani, 2016; Ritter et al., 2018; Zhang et al., 2020; Rudner et al., 2021). Popular approximate inference methods include variational inference (Blundell et al., 2015; Rudner et al., 2021), Markov chain Monte Carlo methods (Chen et al., 2014; Zhang et al., 2020), and Laplace approximation (Ritter et al., 2018). Besides Bayesian approaches, Monte Carlo dropout (Gal and Ghahramani, 2016) and deep ensembles (Lakshminarayanan et al., 2017) are also commonly used for deep learning uncertainty quantification. Once a good uncertainty estimation is obtained, it can be straightforwardly used for out-of-distribution detection (Lakshminarayanan et al., 2017; Hendrycks and Gimpel, 2017).

Among these methods, deep ensembles (Lakshminarayanan et al., 2017) provide a simple way for uncertainty estimation that can be considered an approximation of Bayesian model averaging (Wilson and Izmailov, 2020). Its main idea is to train multiple neural networks (called base models) and use them for inference. Training a base model is the same as training an ordinary neural network and the model diversity is ensured by random parameter initialization and shuffling of the training dataset for each base model. This phenomenon was explained by Fort et al. (2019) using a loss landscape perspective. D’Angelo and Fortuin (2021) later proposed an improvement to deep ensembles by introducing a kernelized repulsive term in the update rule to ensure model diversity when the number of parameters is large and the effect of random initialization reduces.

Our work develops a novel theoretical analysis for ensemble methods. Previous work on ensemble theory mainly focused on general ensembles and derived risk bounds for the weighted majority vote classification method (Germain et al., 2015; Laviolette et al., 2017; Masegosa et al., 2020) and the weighted average regression method (Cuong et al., 2013; Ho et al., 2020). More recent theoretical results also proved the relationship between generalization and diversity of deep ensembles (Ortega et al., 2022) and the improvement rates of ensembles over a single model (Theisen et al., 2024). In contrast, our analysis focuses more on the CAPTCHA solving problem and exploits its structure to derive bounds on the success rates of the solver.

3 Uncertainty-Aware CAPTCHA Solver Using Deep Ensembles

In this section, we describe our proposed uncertainty-aware CAPTCHA solvers in detail. Our first solver, named EnSolver, uses an ensemble of deep learning-based CAPTCHA solvers to make a prediction along with a corresponding uncertainty estimate on a given input text-based CAPTCHA image. The uncertainty estimates allow EnSolver to detect inputs dissimilar to the training data (i.e., out-of-distribution inputs) and thus can “skip” inputs that are hard to crack. Detecting out-of-distribution CAPTCHAs is an important feature in the solving process because web applications often define policies on the number of failed attempts before locking an account or injecting a long delay before showing the next CAPTCHA. Consequently, if a trained solver is equipped with a pre-filtering mechanism that can effectively detect and skip CAPTCHAs that are not likely to be solved correctly, it can effectively bypass these account lockout policies and avoid triggering subsequent access failures used to lock out web sessions. We also extend EnSolver to LEnSolver, a new solver for the more practical setting where only a limited number of skips are allowed. LEnSolver also employs the same mechanism as EnSolver to skip uncertain inputs, but it will be forced to make a prediction once the maximum allowable number of skips is reached.

3.1 Uncertainty-Aware CAPTCHA Solver

Algorithm 1 Generic Uncertainty-Aware CAPTCHA Solver
Require: Trained model $m$, input image $x$, threshold $\tau$
$(y, u) \leftarrow \textsc{predict\_with\_uncertainty}(m, x)$
if $u < \tau$ then
     Predict with $y$
else
     Skip $x$
end if

Algorithm 2 Training a Deep Ensemble
Require: Labeled training dataset $\mathcal{D}$, number of base models $M$
for all $i \in \{1, 2, \ldots, M\}$ do
     Initialize base model $m_i$ randomly
     Train $m_i$ using $\mathcal{D}$
end for
$m \leftarrow (m_1, m_2, \ldots, m_M)$
return $m$

Algorithm 3 Uncertainty Estimation with Deep Ensemble
Require: Ensemble model $m = (m_1, m_2, \ldots, m_M)$, input image $x$
function predict_with_uncertainty($m, x$)
     for all $i \in \{1, 2, \ldots, M\}$ do
          $y_i \leftarrow \text{predict}(m_i, x)$
     end for
     $(s_1, s_2, \ldots, s_N) \leftarrow \text{unique}(y_1, y_2, \ldots, y_M)$
     for all $j \in \{1, 2, \ldots, N\}$ do
          $p_j \leftarrow \frac{|\{i \in \{1, 2, \ldots, M\} \mid y_i = s_j\}|}{M}$
     end for
     $p_{\textrm{max}} \leftarrow \max_{j=1,\ldots,N} \{p_j\}$
     $j_{\textrm{max}} \leftarrow \operatorname*{argmax}_{j=1,\ldots,N} \{p_j\}$
     $u \leftarrow 1 - p_{\textrm{max}}$
     $y \leftarrow s_{j_{\textrm{max}}}$
     return $(y, u)$
end function

For a conventional deep learning-based solver, a deep learning model $m$ is trained and then used to make a prediction $y = m(x)$ when given an input CAPTCHA image $x$. In this case, $y$ is the string that the model $m$ returns when the input image $x$ is fed into the model. Several types of deep learning models have been proposed for this problem, including generative adversarial networks (Ye et al., 2018), convolutional neural networks (Noury and Rezaei, 2020), attention-based encoder-decoder networks (Deng et al., 2022), and object detection-based models (Ousat et al., 2024).

Although experimentally accurate, these deep learning-based solvers falter on images with visual characteristics (e.g., text font or character size) unseen in the training data. When encountering these out-of-distribution samples, it is better not to answer than to give wrong answers and get locked out by the defense system. Additionally, most CAPTCHA systems allow users to request a new image, so skipping an answer lets the solver exploit this feature to switch to an easier image.

To allow this new capability in a CAPTCHA solver, we propose a simple generic uncertainty-aware CAPTCHA solver in Algorithm 1. Our solver is equipped with a trained uncertainty-aware model $m$ and a real-valued threshold $\tau \in (0, 1]$. When given an input CAPTCHA image $x$, the model $m$ first predicts the string $y$ together with an uncertainty level $u \in [0, 1]$ via the function call $\textsc{predict\_with\_uncertainty}(m, x)$. The uncertainty level $u$ is a real number indicating the extent to which the model $m$ is unconfident (or uncertain) that $y$ is the text string shown in the input image $x$. Note that the higher the uncertainty level $u$, the less likely that the prediction $y$ is correct. With the uncertainty level $u$, the next step is to compare it with the threshold $\tau$. If $u$ is below this threshold, the solver will return the text string $y$ as the answer for the CAPTCHA. Otherwise, it will skip $x$ and request a new CAPTCHA image if possible.
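The decision rule of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the name `solve_or_skip`, the use of `None` to signal a skip, and the toy stand-in model are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical type: any callable mapping an image to (predicted string, uncertainty).
UncertainPredictor = Callable[[object], Tuple[str, float]]


def solve_or_skip(predict_with_uncertainty: UncertainPredictor,
                  x: object,
                  tau: float) -> Optional[str]:
    """Return the predicted string if its uncertainty is below tau, else None (skip)."""
    y, u = predict_with_uncertainty(x)
    if u < tau:
        return y      # answer the CAPTCHA with y
    return None       # skip x and request a new CAPTCHA


# Toy stand-in for an uncertainty-aware model (illustration only).
def toy_model(x):
    return ("ab12", 0.1) if x == "easy" else ("zzzz", 0.8)


confident = solve_or_skip(toy_model, "easy", tau=0.5)   # answered
skipped = solve_or_skip(toy_model, "hard", tau=0.5)     # skipped
```

Passing the model as a callable keeps the rule independent of how the uncertainty level is produced, which matches the generic nature of the solver described above.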

Note that for our uncertainty-aware solver above, the range of $u$ and the choice of $\tau$ depend on the specific uncertainty estimation method. Since our method relies on the uncertainty level $u$ predicted by the uncertainty-aware model $m$, a model with a good uncertainty estimation capability is essential for our method to work well. In the next section, we shall describe EnSolver, an instance of the above generic solver that uses a deep ensemble (Lakshminarayanan et al., 2017; D’Angelo and Fortuin, 2021) to obtain the uncertainty level $u$.

3.2 EnSolver: The Deep Ensemble Solver

EnSolver is an uncertainty-aware CAPTCHA solver that employs a deep ensemble of base CAPTCHA solvers for uncertainty estimation. The deep ensemble (Lakshminarayanan et al., 2017; D’Angelo and Fortuin, 2021) is a popular uncertainty estimation approach that requires significantly fewer modifications to the original model architecture and to the training and inference pipelines than other uncertainty estimation approaches such as Bayesian methods (Chen et al., 2014; Blundell et al., 2015; Gal and Ghahramani, 2016; Zhang et al., 2020; Rudner et al., 2021).

We choose deep ensemble for uncertainty estimation since it allows our approach to have a greater applicability, especially to solvers with a very complex model architecture, while not compromising the quality of the uncertainty estimates. In our experiments, we will demonstrate our approach for two such complex models, one that uses the encoder-decoder architecture combined with the attention mechanism and another that uses object detectors. For these complex models, using a Bayesian method for uncertainty estimation is unnecessarily hard and inefficient since it requires a major modification to the model architecture and training procedure (Blundell et al., 2015; Gal and Ghahramani, 2016; Ritter et al., 2018; Zhang et al., 2020). Another reason that we choose deep ensemble for uncertainty estimation is that it allows us to develop theoretical guarantees for our solvers (see Section 4). We must emphasize that despite the simplicity of our approach, we can achieve very high accuracy, as we will show in our experiments in Section 5. This is consistent with several previous works (Ovadia et al., 2019; Wilson and Izmailov, 2020) that showed the competitiveness of deep ensembles compared to Bayesian methods such as variational inference or Laplace approximation.

For EnSolver, the model $m$ in Algorithm 1 is an ensemble of $M$ base models $(m_1, m_2, \ldots, m_M)$, each of which is a conventional CAPTCHA solver. The uncertainty level $u$ for this solver will be estimated from the agreement among the base models on a given input image $x$. Throughout this paper, we will use $m$ to denote both the ensemble and the uncertainty-aware solver itself, while we use $m_i$ to denote each base solver in the ensemble. We now describe how to train this ensemble of models and how to use it for uncertainty estimation.

Training. Given the number of base models $M$ and a labeled training set $\mathcal{D}$, we follow Lakshminarayanan et al. (2017) and build the deep ensemble by training $M$ base models separately with different random initializations. This training process is illustrated in Algorithm 2. In general, each base model $m_i$ may have its own architecture and can be trained with its own training process. However, in most cases in practice (Blundell et al., 2015; D’Angelo and Fortuin, 2021), we only need to use one architecture and one training process (e.g., stochastic gradient descent with cross entropy loss). To speed up the training process, we can also train the base models in parallel using different GPUs (Lakshminarayanan et al., 2017). Our final ensemble is the set of $M$ well-trained base models $m = (m_1, m_2, \ldots, m_M)$.
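As a rough sketch of Algorithm 2, the loop below trains each base model with its own seed. The callable `train_one`, the seeding scheme, and the toy trainer are hypothetical placeholders; a real base solver would substitute its own training routine.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

Model = TypeVar("Model")


def train_ensemble(train_one: Callable[[Sequence, int], Model],
                   dataset: Sequence,
                   num_models: int,
                   base_seed: int = 0) -> Tuple[Model, ...]:
    """Train num_models base models independently, one distinct seed per model."""
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    # Distinct seeds give each base model a different random initialization
    # (and a different data order, if train_one uses the seed for shuffling).
    return tuple(train_one(dataset, base_seed + i) for i in range(num_models))


# Toy trainer (illustration only): the "model" is just a randomly drawn weight.
def toy_train_one(dataset, seed):
    rng = random.Random(seed)
    return {"seed": seed, "init_weight": rng.random()}


ensemble = train_ensemble(toy_train_one, dataset=[("img", "ab12")], num_models=3)
```

Because the calls to `train_one` do not depend on each other, they could equally be dispatched to separate workers or GPUs, as the paragraph above notes.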

Uncertainty Estimation. Given a trained ensemble model $m = (m_1, m_2, \ldots, m_M)$ and any input image $x$, we compute the predicted output string $y$ and the uncertainty estimate $u$ using the $\textsc{predict\_with\_uncertainty}(m, x)$ function in Algorithm 3. More specifically, we first use each base model $m_i$ to predict a string $y_i$ on the input image $x$. Since some models may predict the same string, especially when the string is the correct prediction, we will obtain a set $(s_1, s_2, \ldots, s_N)$ of $N$ distinct strings among the $M$ predictions. Note that the set of predictions $(y_1, y_2, \ldots, y_M)$ induces a predictive distribution on $(s_1, s_2, \ldots, s_N)$ where the probability $p_j$ of $s_j$ is:

$$p_j = \frac{|\{i \in \{1, 2, \ldots, M\} \mid y_i = s_j\}|}{M},$$

which is the proportion of base models that predict $s_j$. In this distribution, $p_{\textrm{max}} = \max_j p_j$ measures the agreement among the predictions of our base models. It has a high value when a large proportion of the base models give the same prediction, i.e., when the uncertainty is low. Thus, we can use $u = 1 - p_{\textrm{max}}$ as the quantification of the predictive uncertainty. Note that the value of $u$ ranges from 0 to $1 - \frac{1}{M}$. The predicted string $y$ returned from our procedure is the string that has the maximum number of predictions, i.e., the one corresponding to $p_{\textrm{max}}$. An illustration of this uncertainty estimation method is depicted in Figure 1.
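The vote-based estimate described above can be illustrated with a short Python sketch operating on the base models' output strings. The function name `vote_with_uncertainty` and the tie-breaking rule (the earliest-predicted string wins) are assumptions for this example; the paper does not specify how ties are resolved.

```python
from collections import Counter
from typing import Sequence, Tuple


def vote_with_uncertainty(predictions: Sequence[str]) -> Tuple[str, float]:
    """Return (y, u): the most frequent string and u = 1 - p_max."""
    if not predictions:
        raise ValueError("need at least one base-model prediction")
    counts = Counter(predictions)          # distinct strings s_j with their vote counts
    # max() returns the first maximal key in insertion order, so ties go to
    # the string that was predicted earliest (an assumed tie-breaking rule).
    y = max(counts, key=counts.get)
    p_max = counts[y] / len(predictions)   # agreement among the base models
    return y, 1.0 - p_max


# M = 4 base models, three of which agree: p_max = 0.75, so u = 0.25.
y, u = vote_with_uncertainty(["ab12", "ab12", "ab1z", "ab12"])

# Complete disagreement gives the largest possible value, u = 1 - 1/M = 0.75.
y_split, u_split = vote_with_uncertainty(["a", "b", "c", "d"])
```

With a threshold of, say, 0.5, the first input would be answered with "ab12" and the second would be skipped.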

3.3 LEnSolver: The Limited-Skip EnSolver

Although the EnSolver in Section 3.2 can skip predictions on uncertain inputs, it would be impractical if the solver kept skipping and did not make any prediction within a reasonable time limit. Since the purpose of a CAPTCHA solver is to gain access to a website or a system, we expect the solver to eventually make a prediction when the time limit is reached, regardless of the uncertainty level. For this purpose, we extend EnSolver to LEnSolver, which can be applied to the setting where only a maximum number of skips is allowed.

The LEnSolver is described in Algorithm 4. This algorithm requires the original EnSolver $m$, which is obtained from Algorithm 1 with the uncertainty estimate in Algorithm 3, together with the maximum number of skips $T$. For the first $T$ iterations of the algorithm, we use $m$ to make a prediction on the current input image $x_t$ and allow the solver to skip to the next input if the uncertainty level is high. However, if the solver skips all of the first $T$ inputs, then it is forced to make a prediction on the $(T+1)$-th input, $x_{T+1}$, regardless of the uncertainty level. This prediction also uses the majority prediction from the base models, which can be obtained by calling $\textsc{predict\_with\_uncertainty}(m, x_{T+1})$. We will write $m^T$ to denote this LEnSolver. In the next section, we will prove theoretical guarantees for both EnSolver and LEnSolver.

Algorithm 4 LEnSolver: Limited-Skip EnSolver
Require: Original EnSolver $m$, maximum number of skips $T$
for all $t \in \{1, 2, \ldots, T\}$ do
     Receive a random input image $x_t$
     $y_t \leftarrow \text{predict}(m, x_t)$
     if $y_t \neq \text{skip}$ then
         Return $y_t$
     end if
end for
Receive a random input $x_{T+1}$
$(y, u) \leftarrow \textsc{predict\_with\_uncertainty}(m, x_{T+1})$
Return $y$
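For readers who prefer executable code, Algorithm 4 can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' implementation: base models are assumed to be callables from image to string, `None` is a hypothetical stand-in for the skip symbol, the stream of inputs is assumed to be a Python iterator, and the EnSolver is assumed to skip whenever $u \geq \tau$, which is our reading of the threshold $k = M(1-\tau)+1$ used in Section 4 (Algorithm 1 itself is not reproduced here).

```python
from collections import Counter
from typing import Any, Callable, Iterator, Optional, Sequence, Tuple

Model = Callable[[Any], str]  # assumed interface of a base model


def predict_with_uncertainty(models: Sequence[Model], x: Any) -> Tuple[str, float]:
    """Majority string and uncertainty u = 1 - p_max."""
    counts = Counter(model(x) for model in models)
    y, top = counts.most_common(1)[0]
    return y, 1.0 - top / len(models)


def predict(models: Sequence[Model], x: Any, tau: float) -> Optional[str]:
    """EnSolver step: return None (skip) if u >= tau (assumed skip rule)."""
    y, u = predict_with_uncertainty(models, x)
    return None if u >= tau else y


def lensolver(
    models: Sequence[Model], images: Iterator[Any], tau: float, max_skips: int
) -> str:
    """Limited-skip solver: at most max_skips skips, then a forced answer."""
    for _ in range(max_skips):
        y = predict(models, next(images), tau)
        if y is not None:
            return y
    # All max_skips inputs were skipped: answer input T+1 regardless of u.
    y, _ = predict_with_uncertainty(models, next(images))
    return y
```

Note that the inputs are consumed lazily: if the solver answers before exhausting its skips, the remaining images are never requested.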

4 Theoretical Guarantees for EnSolver and LEnSolver

The use of ensembles for uncertainty estimation in EnSolver and LEnSolver allows us to derive some novel theoretical properties for these solvers. In particular, we will establish in this section theoretical lower bounds for the right decision rate of EnSolver and the success rate of LEnSolver that can be used to explain the effectiveness of these solvers in practice. To the best of our knowledge, our paper is the first to provide such theoretical analyses for CAPTCHA solvers. In Section 4.1, we will describe the mathematical settings for our theory. Sections 4.2 and 4.3 then show the theoretical guarantees for EnSolver and LEnSolver respectively. Empirical validations of our theory are provided in Section 5.

4.1 Mathematical Settings

Let $\mathcal{X}$ be the input domain, which is the set of all CAPTCHA images containing at most $\ell$ characters, for some integer $\ell > 0$. Let $\mathcal{S}$ be the output domain, which is the set of all possible output strings containing at most $\ell$ characters from a finite alphabet. Note that the cardinality of $\mathcal{X}$ could be infinite since $\mathcal{X}$ contains images; however, $\mathcal{S}$ is a finite set given a finite alphabet and a fixed $\ell$. A data point is a pair $(x, s_x) \in \mathcal{X} \times \mathcal{S}$, where $x \in \mathcal{X}$ is an input CAPTCHA image and $s_x \in \mathcal{S}$ is its corresponding output string. We assume the input images have a probability distribution $p(x)$ on $\mathcal{X}$ and each input $x$ has a single correct output $s_x$.
We will consider the case where $\mathcal{X} = \mathcal{X}^{\text{in}} \cup \mathcal{X}^{\text{out}}$, with $\mathcal{X}^{\text{in}}$ containing in-distribution images, $\mathcal{X}^{\text{out}}$ containing out-of-distribution images, and $\mathcal{X}^{\text{in}} \cap \mathcal{X}^{\text{out}} = \emptyset$. Let $\alpha$ be the proportion of the in-distribution data in the whole input domain. We have:

$$\alpha = p(x \in \mathcal{X}^{\text{in}}) = \int_{x \in \mathcal{X}^{\text{in}}} p(x)\,dx.$$

Consider the EnSolver in Section 3.2. In this solver, we have an ensemble model $m = (m_1, m_2, \ldots, m_M)$ that contains $M$ base models. Each base model $m_i$ is a function mapping $x \in \mathcal{X}$ to $m_i(x) \in \mathcal{S}$. Let $\beta_i$ denote the in-distribution accuracy of $m_i$. We have:

$$\beta_i = \mathbb{E}_{x \sim p(\,\cdot\,\mid\, x \in \mathcal{X}^{\text{in}})}\left[\mathbf{1}(m_i(x) = s_x)\right] = p(m_i(x) = s_x \mid x \in \mathcal{X}^{\text{in}}),$$

where $\mathbf{1}(\cdot)$ is the indicator function. We also let $\beta_{\text{min}}$ and $\beta_{\text{max}}$ be respectively the minimum and maximum in-distribution accuracies among the base models. That is:

\begin{align}
\beta_{\text{min}} &= \min_{i \in \{1, \ldots, M\}} \beta_i, \text{ and} \notag \\
\beta_{\text{max}} &= \max_{i \in \{1, \ldots, M\}} \beta_i. \notag
\end{align}

Since the behavior of EnSolver does not change when we adjust $\tau$ to $\tau'$ such that $M(1-\tau') = \lfloor M(1-\tau) \rfloor$, we can assume $M(1-\tau)$ is a positive integer without loss of generality. Following convention, we assume that, given an input, the base models make their predictions independently of one another. To derive our theoretical results, we also make the following assumptions.

  (A1) $\beta_{\text{min}} > 1/N_{\mathcal{S}}$, where $N_{\mathcal{S}}$ is the size of the output domain $\mathcal{S}$.

  (A2) For any base model $m_i$ and $x \in \mathcal{X}^{\text{in}}$, the model $m_i$ predicts any incorrect string with the same probability. This assumption implies $p(m_i(x) = s \mid x \in \mathcal{X}^{\text{in}}) = (1-\beta_i)/(N_{\mathcal{S}}-1)$ for all $s \neq s_x$.

  (A3) For any base model $m_i$ and $x \in \mathcal{X}^{\text{out}}$, the base model $m_i$ predicts any string in $\mathcal{S}$ with the same probability. This assumption implies $p(m_i(x) = s \mid x \in \mathcal{X}^{\text{out}}) = 1/N_{\mathcal{S}}$ for all $s \in \mathcal{S}$.

We note that $N_{\mathcal{S}}$ is usually very large in practice, e.g., $N_{\mathcal{S}} = 26^5$ for a typical CAPTCHA system with CAPTCHA strings of length $5$ from an alphabet of size $26$. Thus, (A1) is a very mild assumption if all the base models are trained reasonably well. Furthermore, due to the large $N_{\mathcal{S}}$, the uniform assumptions in (A2) and (A3) are also reasonable for deriving useful bounds, although they may not hold in practice. In Section 5, we will conduct experiments to show the usefulness of our theory even with these assumptions. Now we are ready to discuss our main theoretical results.

4.2 Theoretical Guarantee for EnSolver

We first focus on the EnSolver described in Section 3.2, which is an instance of Algorithm 1 with the uncertainty estimation method in Algorithm 3. Recall that this solver can either predict an output string or skip the prediction given an input image. To measure the performance of this solver, we will analyze its right decision rate, which is defined below.

Definition 1

For any EnSolver $m$ and any input $x \in \mathcal{X}$, let $m(x) \in \mathcal{S} \cup \{s_{\text{skip}}\}$ be the output of this EnSolver, where $s_{\text{skip}}$ is a special symbol indicating that the solver will skip predicting $x$. The right decision rate $R_{\text{rd}}(m)$ of the EnSolver $m$ is defined as:

$$R_{\text{rd}}(m) = p\left(m(x) = s_x,\, x \in \mathcal{X}^{\text{in}}\right) + p\left(m(x) \in \{s_x, s_{\text{skip}}\},\, x \in \mathcal{X}^{\text{out}}\right),$$

which is the total probability that this solver would make a correct decision on the whole input domain.
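On a finite labelled sample, the right decision rate can be estimated by simple counting. The Python sketch below is only an illustration: the function name, the use of `None` as a stand-in for $s_{\text{skip}}$, and the availability of a per-sample in-distribution flag are all assumptions made for this example.

```python
from typing import Optional, Sequence

SKIP = None  # hypothetical stand-in for the skip symbol s_skip


def right_decision_rate(
    outputs: Sequence[Optional[str]],
    labels: Sequence[str],
    in_distribution: Sequence[bool],
) -> float:
    """Empirical estimate of R_rd from solver outputs on labelled samples."""
    right = 0
    for out, label, is_in in zip(outputs, labels, in_distribution):
        if is_in:
            # In-distribution: only a correct string counts.
            right += out == label
        else:
            # Out-of-distribution: a correct string or a skip counts.
            right += out == label or out is SKIP
    return right / len(labels)
```

A skip on an in-distribution sample therefore lowers the estimate, while a skip on an out-of-distribution sample does not.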

Note that this definition captures the desired behavior of an EnSolver: on an in-distribution input, we want the model to make the correct prediction, while it can either skip or make the correct prediction on an out-of-distribution input. This definition is also equivalent to the probability that the EnSolver $m$ will make a correct decision on a random input $x \sim p(x)$. In this section, we will give a lower bound for the right decision rate $R_{\text{rd}}(m)$ of any EnSolver $m$. Our bound will depend on a novel quantity called the out-of-distribution error bound, which we define as follows.

Definition 2

The out-of-distribution error bound (OEB) of an EnSolver with ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$, is defined as:

$$\mathcal{E}(M, N_{\mathcal{S}}, \tau) = \binom{M}{\lfloor M/2 \rfloor} \frac{1}{(N_{\mathcal{S}})^{M(1-\tau)}},$$

where $\binom{M}{\lfloor M/2 \rfloor}$ is the binomial coefficient "$M$ choose $\lfloor M/2 \rfloor$".

In practice, the OEB is usually very small. For instance, if we consider the CAPTCHA system above with output strings of length 5 and an alphabet of size 26, an EnSolver with ensemble size $M = 10$ and uncertainty threshold $\tau = 0.5$ will have the OEB value $\mathcal{E}(10, 26^5, 0.5) = 252/26^{25} \approx 10^{-33}$. Furthermore, as its name suggests, the OEB is an upper bound of the EnSolver's error rate on out-of-distribution data. This result is stated in Lemma 1 below, with the proof given in Appendix A.1.
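The OEB value in this example can be checked with exact integer arithmetic. The short Python sketch below uses a hypothetical helper named `oeb` and assumes, as in Section 4.1, that $M(1-\tau)$ is an integer.

```python
from fractions import Fraction
from math import comb, log10


def oeb(M: int, N_S: int, tau: float) -> Fraction:
    """Out-of-distribution error bound of Definition 2 as an exact fraction.

    Assumes M * (1 - tau) is an integer; round() only removes float noise.
    """
    exponent = round(M * (1 - tau))
    return Fraction(comb(M, M // 2), N_S ** exponent)


value = oeb(10, 26**5, 0.5)   # equals 252 / 26**25
order = log10(value)          # base-10 order of magnitude of the bound
```

Here `order` comes out close to $-33$, in line with the approximation quoted above.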

Lemma 1

For any EnSolver $m$ with ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$, its OEB satisfies:

$$\mathcal{E}(M, N_{\mathcal{S}}, \tau) > p\left(m(x) \notin \{s_x, s_{\text{skip}}\} \mid x \in \mathcal{X}^{\text{out}}\right).$$

Given our observation that $\mathcal{E}(M, N_{\mathcal{S}}, \tau)$ is usually very small, this lemma tells us that the error rate of the EnSolver on out-of-distribution data is also very small. Interestingly, $\mathcal{E}(M, N_{\mathcal{S}}, \tau)$ also gives us an upper bound for the probability that several base models in the ensemble make the same but incorrect prediction on in-distribution data. This result is stated in the following lemma, where the proof is given in Appendix A.2.

Lemma 2

Consider any EnSolver $m$ with ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$. For any $x \in \mathcal{X}$ and $s \in \mathcal{S}$, let $n(x, s)$ be the number of base models predicting the output $s$ for $x$. The OEB of $m$ satisfies:

$$\mathcal{E}(M, N_{\mathcal{S}}, \tau) > p\left(\exists s \neq s_x,\ n(x, s) \geq M(1-\tau) + 1 \mid x \in \mathcal{X}^{\text{in}}\right).$$

Since $p\left(\exists s \neq s_x,\ n(x, s) \geq M(1-\tau) + 1 \mid x \in \mathcal{X}^{\text{in}}\right) \geq p\left(\exists s \neq s_x,\ m(x) = s \mid x \in \mathcal{X}^{\text{in}}\right)$, Lemma 2 implies that $\mathcal{E}(M, N_{\mathcal{S}}, \tau)$ is also an upper bound of the EnSolver's wrong prediction rate on in-distribution data.² It also tells us that this wrong prediction rate on in-distribution data is usually small. Using both Lemma 1 and Lemma 2, we can now prove a lower bound for the right decision rate $R_{\text{rd}}(m)$ in Theorem 1 below. The proof for this theorem is given in Appendix A.3.

² Here we differentiate between the wrong prediction rate, which is $p\left(\exists s \neq s_x,\ m(x) = s \mid x \in \mathcal{X}^{\text{in}}\right)$, and the error rate, which is $p\left(m(x) \neq s_x \mid x \in \mathcal{X}^{\text{in}}\right) = p\left(m(x) = s_{\text{skip}} \vee \exists s \neq s_x,\ m(x) = s \mid x \in \mathcal{X}^{\text{in}}\right)$, given an in-distribution input.

Theorem 1

Consider any EnSolver $m$ with ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$. Let $\beta_{\text{min}}$ and $\beta_{\text{max}}$ be respectively the minimum and maximum in-distribution accuracies among the $M$ base models. The right decision rate of $m$ satisfies:

\begin{equation}
R_{\text{rd}}(m) > \alpha \sum_{i=k}^{M} \binom{M}{i} \beta_{\text{min}}^{i}\, (1-\beta_{\text{max}})^{M-i} + (1-\alpha) - \mathcal{E}(M, N_{\mathcal{S}}, \tau), \tag{1}
\end{equation}

where $k = M(1-\tau) + 1$ and $\alpha$ is the proportion of in-distribution data in the input domain.

This theorem gives us a theoretical guarantee for the right decision rate of an EnSolver. From the theorem, we see that if $\alpha$ is small (i.e., there are more out-of-distribution data), the lower bound will be dominated by $(1-\alpha)$, which will be large. On the other hand, if there are more in-distribution data (i.e., $\alpha$ is large), then the lower bound will be dominated by the sum on the right-hand side of (1), which depends on the in-distribution accuracies of the base models in the ensemble.

As an example, let us consider again the CAPTCHA system described previously. Recall that for this system, $M = 10$, $\tau = 0.5$, and $\mathcal{E}(M, N_{\mathcal{S}}, \tau) \approx 10^{-33}$. If the base models were trained so that $\beta_{\text{min}} \approx \beta_{\text{max}} \approx 0.9$, Theorem 1 would give us the lower bounds $R_{\text{rd}}(m) \gtrapprox 0.9998$ for $\alpha = 0.1$, $R_{\text{rd}}(m) \gtrapprox 0.9991$ for $\alpha = 0.5$, and $R_{\text{rd}}(m) \gtrapprox 0.9985$ for $\alpha = 0.9$. These bounds indicate that the EnSolver will have very high right decision rates in these settings. In Section 5, we will conduct experiments to validate our theoretical bounds on real data.
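The three numbers in this example can be reproduced by evaluating the right-hand side of (1) directly. The Python sketch below is an illustrative check; the function names are hypothetical, and it assumes $M(1-\tau)$ is an integer and takes $\beta_{\text{min}} = \beta_{\text{max}} = 0.9$ exactly.

```python
from math import comb


def oeb(M: int, N_S: int, tau: float) -> float:
    """Out-of-distribution error bound of Definition 2."""
    return comb(M, M // 2) / N_S ** round(M * (1 - tau))


def right_decision_lower_bound(
    alpha: float, beta_min: float, beta_max: float, M: int, N_S: int, tau: float
) -> float:
    """Right-hand side of inequality (1) in Theorem 1."""
    k = round(M * (1 - tau)) + 1  # assumes M * (1 - tau) is an integer
    tail = sum(
        comb(M, i) * beta_min**i * (1 - beta_max) ** (M - i)
        for i in range(k, M + 1)
    )
    return alpha * tail + (1 - alpha) - oeb(M, N_S, tau)


# Lower bounds for alpha = 0.1, 0.5, 0.9 in the running example.
bounds = [
    right_decision_lower_bound(a, 0.9, 0.9, 10, 26**5, 0.5)
    for a in (0.1, 0.5, 0.9)
]
```

The bound decreases as $\alpha$ grows because the binomial tail is slightly below one, so in-distribution inputs contribute marginally less than out-of-distribution inputs in this setting.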

4.3 Theoretical Guarantee for LEnSolver

We now focus on the LEnSolver described in Section 3.3 and Algorithm 4. This solver always makes a prediction for an input image; thus, we will measure its performance through its success rate, which is defined below.

Definition 3

Consider an LEnSolver $m^T$ constructed from an original EnSolver $m$ with the maximum number of skips $T$. The success rate $R_{\text{s}}(m^T)$ of the LEnSolver $m^T$ is defined as the probability that $m^T$ successfully cracks the CAPTCHA system.

From this definition, we can write $R_{\text{s}}(m^T)$ formally by considering $(T+1)$ input images $x_1, x_2, \ldots, x_{T+1}$ that are sampled i.i.d. from $p(x)$. In this case,

\begin{align}
R_{\text{s}}(m^T) &= p\left(m(x_1) = s_{x_1}\right) + p\left(m(x_1) = s_{\text{skip}},\, m(x_2) = s_{x_2}\right) + \ldots \notag \\
&\quad + p\left(m(x_1) = s_{\text{skip}},\, m(x_2) = s_{\text{skip}}, \ldots,\, m(x_T) = s_{\text{skip}},\, m(x_{T+1}) = s_{x_{T+1}}\right) \notag \\
&= \sum_{t=1}^{T+1} \left( p\left(m(x_t) = s_{x_t}\right) \prod_{i=1}^{t-1} p\left(m(x_i) = s_{\text{skip}}\right) \right). \tag{2}
\end{align}

Note that when actually running Algorithm 4, we do not need to sample all these $(T+1)$ input images if the solver decides to make a prediction before exhausting all the $T$ allowable skips. However, we need to consider all these inputs to give a formal definition for $R_{\text{s}}(m^T)$. Since $x_1, x_2, \ldots, x_{T+1}$ are i.i.d., Equation (2) can be further simplified to:

\begin{align}
R_{\text{s}}(m^T) &= p\left(m(x) = s_x\right) \sum_{t=0}^{T} p\left(m(x) = s_{\text{skip}}\right)^{t} \tag{3} \\
&= p\left(m(x) = s_x\right) \frac{1 - p\left(m(x) = s_{\text{skip}}\right)^{T+1}}{1 - p\left(m(x) = s_{\text{skip}}\right)}, \tag{4}
\end{align}

where $x$ is a random input sample from $p(x)$. From Equation (4), we see that the success rate $R_{\text{s}}(m^T)$ depends on the correct prediction rate $p\left(m(x) = s_x\right)$ and the skip rate $p\left(m(x) = s_{\text{skip}}\right)$ of any random input $x$. We also observe that $R_{\text{s}}(m^T)$ is a monotonically increasing function of $T$ if $p\left(m(x) = s_{\text{skip}}\right) > 0$. Furthermore, when $T \rightarrow \infty$ (i.e., we are allowed to skip infinitely), we have:

$$R_{\text{s}}(m^T) \rightarrow \frac{p(m(x) = s_x)}{1 - p(m(x) = s_{\text{skip}})} = p(m(x) = s_x \mid m(x) \neq s_{\text{skip}}).$$

Note that Rs(mT)subscript𝑅ssuperscript𝑚𝑇R_{\text{s}}(m^{T})italic_R start_POSTSUBSCRIPT s end_POSTSUBSCRIPT ( italic_m start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) does not tend to 1111 since we can only make one single prediction. We also observe from Equation (3) that Rs(mT)subscript𝑅ssuperscript𝑚𝑇R_{\text{s}}(m^{T})italic_R start_POSTSUBSCRIPT s end_POSTSUBSCRIPT ( italic_m start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) is high when both p(m(x)=sx)𝑝𝑚𝑥subscript𝑠𝑥p(m(x)=s_{x})italic_p ( italic_m ( italic_x ) = italic_s start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) and p(m(x)=sskip)𝑝𝑚𝑥subscript𝑠skipp(m(x)=s_{\text{skip}})italic_p ( italic_m ( italic_x ) = italic_s start_POSTSUBSCRIPT skip end_POSTSUBSCRIPT ) are high. However, since p(m(x)=sskip)1p(m(x)=sx)𝑝𝑚𝑥subscript𝑠skip1𝑝𝑚𝑥subscript𝑠𝑥p(m(x)=s_{\text{skip}})\leq 1-p(m(x)=s_{x})italic_p ( italic_m ( italic_x ) = italic_s start_POSTSUBSCRIPT skip end_POSTSUBSCRIPT ) ≤ 1 - italic_p ( italic_m ( italic_x ) = italic_s start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ), it is usually the case that when p(m(x)=sx)𝑝𝑚𝑥subscript𝑠𝑥p(m(x)=s_{x})italic_p ( italic_m ( italic_x ) = italic_s start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) is high, p(m(x)=sskip)𝑝𝑚𝑥subscript𝑠skipp(m(x)=s_{\text{skip}})italic_p ( italic_m ( italic_x ) = italic_s start_POSTSUBSCRIPT skip end_POSTSUBSCRIPT ) would be small and vice versa.
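As a quick numerical illustration of Equation (4) and its limit, the following minimal sketch evaluates the success rate for a fixed correct prediction rate and skip rate. The function names and the example rates (0.4 and 0.5) are hypothetical and are not taken from our experiments.

```python
def success_rate(p_correct: float, p_skip: float, max_skips: int) -> float:
    """Equation (4): p_correct * (1 - p_skip^(T+1)) / (1 - p_skip)."""
    if p_correct < 0.0 or p_skip < 0.0 or p_correct + p_skip > 1.0:
        raise ValueError("need p_correct, p_skip >= 0 and p_correct + p_skip <= 1")
    if p_skip == 1.0:
        # The solver always skips, so it never outputs a correct answer.
        return 0.0
    return p_correct * (1.0 - p_skip ** (max_skips + 1)) / (1.0 - p_skip)


def limiting_success_rate(p_correct: float, p_skip: float) -> float:
    """Limit of Equation (4) as T grows: p_correct / (1 - p_skip)."""
    if p_skip == 1.0:
        return 0.0
    return p_correct / (1.0 - p_skip)


# Hypothetical solver: 40% correct, 50% skipped, 10% wrong.
rates = [success_rate(0.4, 0.5, T) for T in range(6)]
limit = limiting_success_rate(0.4, 0.5)
```

With these hypothetical values, the rate rises from 0.4 at $T=0$ toward the limit 0.8 but never reaches 1, because 20% of the non-skipped predictions are wrong.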

In this section, we will give a lower bound for the success rate $R_{\text{s}}(m^{T})$ of any LEnSolver $m^{T}$. From Equations (3) and (4), it is sufficient to derive lower bounds for $p(m(x)=s_{x})$ and $p(m(x)=s_{\text{skip}})$ and then combine them into a lower bound for $R_{\text{s}}(m^{T})$. In Lemmas 3 and 4 below, we give the lower bounds for these probabilities, with their proofs given in Appendices A.4 and A.5 respectively. Note that the results in these lemmas are for the original EnSolver $m$ of the LEnSolver $m^{T}$.

Lemma 3

Consider any EnSolver $m$ with ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$. Let $\beta_{\text{min}}$ and $\beta_{\text{max}}$ be respectively the minimum and maximum in-distribution accuracies among the $M$ base models. The correct prediction rate of $m$ satisfies:

$$p\left(m(x)=s_{x}\right)>\alpha\sum_{i=k}^{M}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}-\alpha\,\mathcal{E}(M,N_{\mathcal{S}},\tau),$$

where $k=M(1-\tau)+1$ and $\alpha$ is the proportion of in-distribution data in the input domain.

Lemma 4

Consider any EnSolver $m$ with ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$. Let $\beta_{\text{min}}$ and $\beta_{\text{max}}$ be respectively the minimum and maximum in-distribution accuracies among the $M$ base models. The skip rate of $m$ satisfies:

$$p\left(m(x)=s_{\text{skip}}\right)>\alpha\sum_{i=0}^{k-1}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}+(1-\alpha)-\frac{N_{\mathcal{S}}-\alpha}{N_{\mathcal{S}}-1}\,\mathcal{E}(M,N_{\mathcal{S}},\tau),$$

where $k=M(1-\tau)+1$ and $\alpha$ is the proportion of in-distribution data in the input domain.

Combining Lemmas 3 and 4 with Equations (3) and (4), it is straightforward to show Theorem 2 below that gives a lower bound for the success rate $R_{\text{s}}(m^{T})$ of an LEnSolver.

Theorem 2

Consider any LEnSolver $m^{T}$ with an original EnSolver $m$, the maximum number of skips $T$, ensemble size $M$, output domain size $N_{\mathcal{S}}$, and uncertainty threshold $\tau$. Let $\beta_{\text{min}}$ and $\beta_{\text{max}}$ be respectively the minimum and maximum in-distribution accuracies among the $M$ base models. The success rate of $m^{T}$ satisfies:

$$R_{\text{s}}(m^{T})>\gamma\,\frac{1-\rho^{T+1}}{1-\rho},$$

where

$$\gamma=\alpha\sum_{i=k}^{M}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}-\alpha\,\mathcal{E}(M,N_{\mathcal{S}},\tau),$$

$$\rho=\alpha\sum_{i=0}^{k-1}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}+(1-\alpha)-\frac{N_{\mathcal{S}}-\alpha}{N_{\mathcal{S}}-1}\,\mathcal{E}(M,N_{\mathcal{S}},\tau),$$

$k=M(1-\tau)+1$, and $\alpha$ is the proportion of in-distribution data in the input domain.

To give a simple illustration for this theorem, we can consider the example in Section 4.2. If we let $\alpha=0.5$ (i.e., half of the data are out-of-distribution), the success rate of LEnSolver is at least 0.75 if we allow only one skip. This lower bound will increase to 0.93 if we allow three skips and to 0.98 if we allow five skips. In Section 5, we will conduct experiments on real data to validate this theorem.
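The bound in Theorem 2 is simple to evaluate numerically. The sketch below is one possible implementation under two assumptions that are not stated in this section: the threshold $k=M(1-\tau)+1$ is floored to an integer when $M(1-\tau)$ is not a whole number, and the term $\mathcal{E}(M,N_{\mathcal{S}},\tau)$, whose definition is not repeated here, is passed in as a plain number `err` that defaults to 0. The function and parameter names are hypothetical.

```python
from math import comb, floor


def agreement_threshold(M: int, tau: float) -> int:
    """k = M(1 - tau) + 1, floored to an integer (flooring is an assumption)."""
    # The small offset guards against floating-point results such as 2.9999999.
    return floor(M * (1.0 - tau) + 1e-9) + 1


def theorem2_lower_bound(alpha, M, tau, T, beta_min, beta_max,
                         err=0.0, n_s=2.9e12):
    """Evaluate gamma * (1 - rho^(T+1)) / (1 - rho) from Theorem 2.

    `err` stands in for E(M, N_S, tau); err = 0.0 is an illustrative
    assumption, not a value derived from the paper's definition.
    """
    k = agreement_threshold(M, tau)

    def term(i):
        return comb(M, i) * beta_min ** i * (1.0 - beta_max) ** (M - i)

    gamma = alpha * sum(term(i) for i in range(k, M + 1)) - alpha * err
    rho = (alpha * sum(term(i) for i in range(0, k))
           + (1.0 - alpha)
           - (n_s - alpha) / (n_s - 1.0) * err)
    # Summing the geometric series directly avoids dividing by 1 - rho = 0.
    return gamma * sum(rho ** t for t in range(T + 1))


# 3E-Solver values from Table 2 for M = 2, with tau = 0.5, T = 3, alpha = 1.
bound = theorem2_lower_bound(1.0, 2, 0.5, 3, beta_min=0.734, beta_max=0.739)
```

Under these assumptions, the call above evaluates to roughly 0.941, which agrees with the corresponding theoretical entry in Table 3.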

5 Experiments

In this section, we conduct experiments to evaluate the performance of our approaches as well as the usefulness of our theoretical results in practice. We first describe our experiment settings in Section 5.1. We then evaluate the right decision rates of EnSolver models in Section 5.2 and the success rates of LEnSolver models in Section 5.3. We investigate the effects of the maximum number of skips on the success rates of LEnSolvers in Section 5.4. Finally, we discuss some limitations and potential improvements for our theoretical results in Section 5.5.

Figure 2: A sample CAPTCHA image in our new dataset. The ground truth label consists of the bounding boxes of each character, the correct letter for each character, and the correct output string NP5tZ.

5.1 Experiment Settings

Base Models and Baselines. We consider two types of base models for our experiments. The first type is the 3E-Solver proposed by Deng et al. (2022), which is a semi-supervised end-to-end solver that uses the encoder-decoder architecture and attention mechanism. The second type of base models used in our experiments is the end-to-end object detection-based solver (named OD-Solver in our experiments) proposed by Ousat et al. (2024) that uses the YOLOX model (Ge et al., 2021) to locate and classify each character in a given input image. These two model types are chosen due to their flexibility and state-of-the-art performance for solving text-based CAPTCHAs. In our experiments, we also use these two models (without ensembling) as our baselines.

New In-distribution Dataset. To train the above object detection models for solving CAPTCHAs, we follow Ousat et al. (2024) and generate a new dataset that contains both the bounding box location and the label for each character in the input images. We ensure our dataset is challenging for automatic solvers but still reasonable for human users by carefully generating the background colors, character locations, and noise patterns in the images. The process is as follows.

To generate the background for each image, we first randomize two different RGB triplets for its leftmost and rightmost columns. Then we interpolate these two RGB triplets horizontally to obtain the colors of the internal columns while keeping the color constant within each vertical column. After generating a background image, we randomize a ground truth string of alphanumeric characters with a random length from 5 to 9. Each of these characters is then converted into an image with a random face color, font, and size, together with a perspective transform to introduce character distortion. Next, the distorted characters are pasted onto the background following the ground truth string order with random spacing between them. The position of each character determines its ground truth bounding box, which is combined with the ground truth string and then stored as the ground truth label.
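The background step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the generator used to build our dataset; the function name, the image size, and the nested-list image representation are all hypothetical choices.

```python
import random


def gradient_background(height: int, width: int, rng: random.Random):
    """Return a height x width grid of RGB tuples with a horizontal gradient."""
    # Draw two different RGB triplets for the leftmost and rightmost columns.
    left = tuple(rng.randrange(256) for _ in range(3))
    right = tuple(rng.randrange(256) for _ in range(3))
    while right == left:
        right = tuple(rng.randrange(256) for _ in range(3))

    # Linearly interpolate the two triplets across the columns.
    row = []
    for col in range(width):
        t = col / (width - 1) if width > 1 else 0.0
        row.append(tuple(round((1 - t) * a + t * b) for a, b in zip(left, right)))

    # Every row is identical, so the color is constant within each column.
    return [list(row) for _ in range(height)]


background = gradient_background(60, 200, random.Random(0))
```

The characters, curves, and dots described in the text would then be drawn on top of this grid with an imaging library of choice.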

The image obtained from the previous process can already be used as a CAPTCHA, but we can still enhance its resistance against automatic solvers by adding random noise to the image. Specifically, we draw several random curves and dots of various colors onto the image, where the curves can strike through the characters. The number of curves and dots as well as their thickness are chosen so that they significantly increase the CAPTCHA complexity while keeping the characters in the image recognizable by humans. An example of our generated CAPTCHAs together with its ground truth label is shown in Figure 2. For our experiments, we generate 10,000 such CAPTCHA images to use as in-distribution data for model training and another 1,000 images for testing.

Out-of-distribution Data. For out-of-distribution data, we follow Deng et al. (2022) and use publicly available CAPTCHAs from 8 different popular websites (see Figure 3(b-i) for their names and examples). From these examples, we can see that they have different visual features, background, and noise patterns compared to those of our generated dataset. Thus, it is suitable to treat them as out-of-distribution samples for our experiments. Since the label strings of these public datasets are case-insensitive, we convert all predictions and ground truth label strings to all-lowercase strings before computing the results. For the Microsoft scheme, we concatenate the strings on two lines into a single output string for both prediction and scoring purposes. For each out-of-distribution scheme, we randomly select 1,000 labeled samples that can be used for testing.
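The scoring convention for the out-of-distribution schemes can be sketched as below. The helper names are hypothetical, and the sketch assumes that a two-line label (as in the Microsoft scheme) is stored as a single newline-separated string.

```python
def normalize_label(text: str) -> str:
    """Concatenate the lines of a label and convert the result to lowercase."""
    return "".join(part.strip() for part in text.splitlines()).lower()


def is_correct(prediction: str, ground_truth: str) -> bool:
    """Case-insensitive exact match after joining multi-line strings."""
    return normalize_label(prediction) == normalize_label(ground_truth)
```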

Figure 3: Our generated dataset (a) and public datasets (b)-(i) used in our experiments. Each dataset has unique visual features. Panels: (a) Our dataset (in-distribution); (b) Apple; (c) Ganji; (d) Google; (e) Microsoft; (f) Sina; (g) Weibo; (h) Wikipedia; (i) Yandex.

Base Models Training. Since the 3E-Solver was originally developed for semi-supervised learning (Deng et al., 2022), we train each base model of this type using our in-distribution training set and 10 additional unlabeled samples from each out-of-distribution scheme. Thus, each 3E-Solver base model has access to 80 additional unlabeled out-of-distribution samples. We conduct the experiment with the default settings used in the original 3E-Solver paper and train each model for 400 epochs with a learning rate of 0.01. To make each base model applicable to every scheme, we create a unified vocabulary dictionary that includes lowercase and uppercase characters as well as digits to be used for encoding and decoding the texts. This enables us to use the same architecture for different schemes with any set of characters.

For the object detection base models, we implement YOLOX (Ge et al., 2021) using the mmdetection library (Chen et al., 2019). The detectors are trained to detect characters from 62 different classes, including 52 English alphabet letters (both lowercase and uppercase) and 10 digits. Each base model is trained for 10 epochs using the Nesterov SGD optimizer (Nesterov, 1983) with learning rate 0.01 and momentum 0.9. During training, we save a model checkpoint after every epoch and use the checkpoint with the best mAP on the test set as our base model. Since the out-of-distribution datasets do not have ground truth bounding box information, they cannot be used to train the object detection models. Thus, we only train the object detection base models using the in-distribution training set.

Ensemble Model. We shall evaluate our methods with ensembles having $M\in\{2,6,10\}$ and $\tau\in\{0.3,0.5,0.7\}$. Note that in order to have non-zero uncertainty, we need at least 2 models in our ensemble. Furthermore, previous work on deep ensembles (Lakshminarayanan et al., 2017) showed that 10 base models are often enough to give a decent uncertainty estimation in most cases while not significantly increasing the time complexity to train the ensemble. The uncertainty threshold values $\tau$ are chosen to give a reasonable balance for our uncertainty estimates.
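To make the roles of $M$ and $\tau$ concrete, the following sketch shows a vote-with-skip rule. It assumes the decision rule implied by the threshold $k=M(1-\tau)+1$ in Lemmas 3 and 4, namely that the ensemble outputs the most common string only when at least $k$ base models (with $k$ floored to an integer) agree on it. The sentinel `SKIP` and the function name are hypothetical.

```python
from collections import Counter
from math import floor

SKIP = "<skip>"  # hypothetical sentinel standing in for the skip output


def ensolver_decision(predictions, tau):
    """Return the majority string if enough base models agree, else SKIP."""
    if not predictions:
        raise ValueError("need at least one base-model prediction")
    M = len(predictions)
    # Assumed threshold: k = floor(M * (1 - tau)) + 1.
    k = floor(M * (1.0 - tau) + 1e-9) + 1
    answer, votes = Counter(predictions).most_common(1)[0]
    return answer if votes >= k else SKIP
```

For example, with $M=2$ and $\tau=0.5$ the threshold is $k=2$, so the ensemble answers only when both base models output the same string.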

5.2 Right Decision Rates of EnSolvers

| Base type | $M$ | $\tau$ | $\alpha=0$ | $\alpha=0.2$ | $\alpha=0.4$ | $\alpha=0.6$ | $\alpha=0.8$ | $\alpha=1$ |
|---|---|---|---|---|---|---|---|---|
| 3E-Solver (baseline) | 1 | | 0.000 (na) | 0.147 (na) | 0.294 (na) | 0.440 (na) | 0.587 (na) | 0.734 (na) |
| 3E-Solver | 2 | 0.5 | 0.999 (1.000) | 0.930 (0.908) | 0.861 (0.816) | 0.793 (0.723) | 0.724 (0.631) | 0.655 (0.539) |
| 3E-Solver | 6 | 0.3 | 1.000 (1.000) | 0.934 (0.883) | 0.868 (0.766) | 0.802 (0.650) | 0.736 (0.533) | 0.670 (0.416) |
| 3E-Solver | 6 | 0.5 | 1.000 (1.000) | 0.945 (0.933) | 0.890 (0.866) | 0.836 (0.799) | 0.781 (0.732) | 0.726 (0.665) |
| 3E-Solver | 6 | 0.7 | 0.993 (1.000) | 0.949 (0.963) | 0.906 (0.925) | 0.863 (0.888) | 0.819 (0.850) | 0.776 (0.813) |
| 3E-Solver | 10 | 0.3 | 1.000 (1.000) | 0.931 (0.858) | 0.861 (0.715) | 0.792 (0.573) | 0.722 (0.430) | 0.653 (0.288) |
| 3E-Solver | 10 | 0.5 | 1.000 (1.000) | 0.949 (0.908) | 0.898 (0.816) | 0.848 (0.724) | 0.797 (0.632) | 0.746 (0.541) |
| 3E-Solver | 10 | 0.7 | 1.000 (1.000) | 0.958 (0.919) | 0.915 (0.838) | 0.873 (0.757) | 0.830 (0.677) | 0.788 (0.596) |
| OD-Solver (baseline) | 1 | | 0.001 (na) | 0.161 (na) | 0.321 (na) | 0.481 (na) | 0.641 (na) | 0.801 (na) |
| OD-Solver | 2 | 0.5 | 0.988 (1.000) | 0.942 (0.928) | 0.897 (0.857) | 0.851 (0.785) | 0.805 (0.713) | 0.759 (0.641) |
| OD-Solver | 6 | 0.3 | 0.999 (1.000) | 0.952 (0.914) | 0.904 (0.828) | 0.857 (0.741) | 0.809 (0.656) | 0.762 (0.569) |
| OD-Solver | 6 | 0.5 | 0.998 (1.000) | 0.960 (0.952) | 0.923 (0.904) | 0.885 (0.855) | 0.847 (0.807) | 0.809 (0.759) |
| OD-Solver | 6 | 0.7 | 0.965 (1.000) | 0.936 (0.965) | 0.908 (0.931) | 0.880 (0.896) | 0.852 (0.862) | 0.824 (0.827) |
| OD-Solver | 10 | 0.3 | 0.999 (1.000) | 0.953 (0.894) | 0.906 (0.789) | 0.859 (0.683) | 0.813 (0.578) | 0.766 (0.472) |
| OD-Solver | 10 | 0.5 | 0.999 (1.000) | 0.962 (0.928) | 0.924 (0.857) | 0.887 (0.785) | 0.850 (0.713) | 0.813 (0.642) |
| OD-Solver | 10 | 0.7 | 0.994 (1.000) | 0.962 (0.932) | 0.930 (0.863) | 0.898 (0.795) | 0.866 (0.726) | 0.834 (0.658) |

Table 1: Right decision rates of EnSolver with different model and data settings. Columns give the proportion of in-distribution data $\alpha$. Bold numbers indicate the best values in each column. Asterisks (*) indicate the best values among those with the same base model in each column. Numbers in parentheses are the theoretical rates obtained from Theorem 1, with “na” indicating values that do not exist. The theoretical rates are computed using the hyper-parameter values in Table 2.

| Hyper-parameter | 3E-Solver, $M=2$ | 3E-Solver, $M=6$ | 3E-Solver, $M=10$ | OD-Solver, $M=2$ | OD-Solver, $M=6$ | OD-Solver, $M=10$ |
|---|---|---|---|---|---|---|
| $N_{\mathcal{S}}$ (approx.) | $2.9\times 10^{12}$ | $2.9\times 10^{12}$ | $2.9\times 10^{12}$ | $2.9\times 10^{12}$ | $2.9\times 10^{12}$ | $2.9\times 10^{12}$ |
| $\beta_{\text{min}}$ | 0.734 | 0.715 | 0.698 | 0.801 | 0.788 | 0.780 |
| $\beta_{\text{max}}$ | 0.739 | 0.748 | 0.748 | 0.819 | 0.819 | 0.821 |

Table 2: Values of the hyper-parameters for computing the theoretical bounds in Table 1 and Table 3.

In this experiment, we evaluate the right decision rates (defined in Definition 1) of the EnSolver models. We measure the right decision rate by computing the proportion of right decisions a model makes on a test dataset containing both in-distribution and out-of-distribution samples. To give insights into the behavior of EnSolver, we vary the proportion of in-distribution data $\alpha\in\{0,0.2,0.4,0.6,0.8,1\}$ during testing. We compute both the empirical right decision rates as well as the theoretical rates obtained from Theorem 1 for comparison. The results of this experiment are reported in Table 1.

From this table, we observe that EnSolver models are consistently better than the baselines when there are out-of-distribution data ($\alpha<1$). Generally, $M=10$ and $\tau=0.7$ yield the best ensembles for both model types, with the OD-Solver ensembles achieving better rates than the 3E-Solver counterparts. In contrast to the baselines, the rates of the EnSolvers decay when $\alpha$ increases. This behavior is expected since the ensembles may skip some in-distribution data, which would be counted as incorrect decisions. We can also observe this effect when $\alpha$ is high and $\tau$ is decreased. In this case, the EnSolvers skip more in-distribution samples due to the low uncertainty threshold. When comparing the empirical rates with their theoretical values, the latter are generally good indicators of the former. This confirms the usefulness of Theorem 1 in predicting the actual right decision rates of EnSolvers.

5.3 Success Rates of LEnSolvers

| Base type | $M$ | $\tau$ | $\alpha=0$ | $\alpha=0.2$ | $\alpha=0.4$ | $\alpha=0.6$ | $\alpha=0.8$ | $\alpha=1$ |
|---|---|---|---|---|---|---|---|---|
| 3E-Solver (baseline) | 1 | | 0.000 (na) | 0.147 (na) | 0.294 (na) | 0.440 (na) | 0.587 (na) | 0.734 (na) |
| 3E-Solver | 2 | 0.5 | 0.000 (0.000) | 0.427 (0.365) | 0.671 (0.617) | 0.797 (0.783) | 0.850 (0.885) | 0.871 (0.941) |
| 3E-Solver | 6 | 0.3 | 0.000 (0.000) | 0.443 (0.278) | 0.695 (0.463) | 0.824 (0.578) | 0.875 (0.644) | 0.896 (0.677) |
| 3E-Solver | 6 | 0.5 | 0.000 (0.000) | 0.459 (0.411) | 0.706 (0.634) | 0.821 (0.739) | 0.861 (0.776) | 0.875 (0.784) |
| 3E-Solver | 6 | 0.7 | 0.000 (0.000) | 0.465 (0.481) | 0.698 (0.709) | 0.790 (0.795) | 0.818 (0.815) | 0.827 (0.817) |
| 3E-Solver | 10 | 0.3 | 0.000 (0.000) | 0.438 (0.187) | 0.691 (0.302) | 0.820 (0.368) | 0.875 (0.401) | 0.897 (0.414) |
| 3E-Solver | 10 | 0.5 | 0.000 (0.000) | 0.469 (0.325) | 0.715 (0.487) | 0.825 (0.553) | 0.862 (0.572) | 0.869 (0.574) |
| 3E-Solver | 10 | 0.7 | 0.000 (0.000) | 0.477 (0.352) | 0.715 (0.519) | 0.811 (0.582) | 0.838 (0.597) | 0.839 (0.598) |
| OD-Solver (baseline) | 1 | | 0.001 (na) | 0.161 (na) | 0.321 (na) | 0.481 (na) | 0.641 (na) | 0.801 (na) |
| OD-Solver | 2 | 0.5 | 0.002 (0.000) | 0.469 (0.418) | 0.720 (0.680) | 0.836 (0.830) | 0.879 (0.906) | 0.891 (0.937) |
| OD-Solver | 6 | 0.3 | 0.001 (0.000) | 0.480 (0.364) | 0.739 (0.580) | 0.857 (0.695) | 0.898 (0.747) | 0.911 (0.764) |
| OD-Solver | 6 | 0.5 | 0.002 (0.000) | 0.493 (0.458) | 0.742 (0.689) | 0.847 (0.784) | 0.877 (0.812) | 0.884 (0.815) |
| OD-Solver | 6 | 0.7 | 0.004 (0.000) | 0.473 (0.488) | 0.711 (0.720) | 0.813 (0.806) | 0.847 (0.826) | 0.857 (0.828) |
| OD-Solver | 10 | 0.3 | 0.002 (0.000) | 0.484 (0.295) | 0.744 (0.460) | 0.861 (0.540) | 0.901 (0.571) | 0.913 (0.579) |
| OD-Solver | 10 | 0.5 | 0.003 (0.000) | 0.497 (0.381) | 0.747 (0.564) | 0.851 (0.634) | 0.881 (0.651) | 0.890 (0.652) |
| OD-Solver | 10 | 0.7 | 0.004 (0.000) | 0.495 (0.388) | 0.736 (0.572) | 0.830 (0.641) | 0.855 (0.657) | 0.860 (0.658) |

Table 3: Success rates of LEnSolver with $T=3$ for different model and data settings. Columns give the proportion of in-distribution data $\alpha$. Bold numbers indicate the best values in each column. Asterisks (*) indicate the best values among those with the same base model in each column. Numbers in parentheses are the theoretical rates obtained from Theorem 2, with “na” indicating values that do not exist. The theoretical rates are computed using the hyper-parameter values in Table 2.

We now focus on the LEnSolver models and evaluate their success rates (defined in Definition 3) by simulating 400,000 actual CAPTCHA cracking scenarios, where the CAPTCHAs are sampled from both in-distribution and out-of-distribution test sets with different $\alpha$ values. In this experiment, we consider LEnSolvers with the maximum number of skips $T=3$. We report both the empirical success rates and the theoretical rates obtained from Theorem 2 in Table 3. We note that the baselines do not have the skip mechanism and will make a prediction immediately on any given CAPTCHA. Thus, their success rates are exactly the same as their right decision rates in Section 5.2.
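The structure of such a simulation can be sketched as follows. This is a deliberately simplified stand-in, not our experimental code: instead of running trained models on sampled CAPTCHAs, each attempt is drawn from hypothetical probabilities of a correct prediction and of a skip, and a scenario in which the solver still wants to skip after exhausting its $T$ skips is counted as a failure, which matches the sum behind Equation (4).

```python
import random


def simulate_success_rate(p_correct, p_skip, max_skips, n_trials, seed=0):
    """Monte Carlo estimate of the success rate of a skip-limited solver."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        # At most max_skips + 1 CAPTCHAs are seen in one cracking scenario.
        for _attempt in range(max_skips + 1):
            u = rng.random()
            if u < p_correct:
                successes += 1  # correct prediction ends the scenario
                break
            if u >= p_correct + p_skip:
                break  # wrong prediction ends the scenario
            # Otherwise the solver skips and requests a new CAPTCHA.
    return successes / n_trials


# Hypothetical rates: 40% correct, 50% skipped, 10% wrong, with T = 3.
estimate = simulate_success_rate(0.4, 0.5, max_skips=3, n_trials=100_000, seed=1)
```

For these hypothetical rates, Equation (4) gives $0.4\,(1-0.5^{4})/(1-0.5)=0.75$, and the estimate should be close to this value up to sampling noise.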

From Table 3, we can observe that the LEnSolvers are consistently better than the baselines in terms of success rates whenever in-distribution data are present ($\alpha>0$). The OD-Solver ensembles are generally better than the 3E-Solver counterparts, with ensembles of size $M=10$ achieving the best rates among ensembles with the same base model type. When comparing ensembles of the same size, the optimal value of $\tau$ depends on the ensemble size and the value of $\alpha$, with $\tau=0.5$ giving a reasonable performance in all cases. We also observe that the success rates of LEnSolvers increase with $\alpha$. This behavior is expected since lower values of $\alpha$ mean a model is more likely to encounter out-of-distribution CAPTCHAs and will be forced to make a prediction after reaching the maximum number of skips. However, even with $\alpha$ as low as $0.4$, LEnSolvers can still achieve decent success rates relative to $\alpha=1$. For the LEnSolvers, the theoretical rates are also reasonable indicators of the empirical rates, which confirms the usefulness of our Theorem 2.

5.4 Effects of Maximum Number of Skips

This experiment investigates the effects of the maximum number of skips $T$ on the success rates of LEnSolvers. For this purpose, we fix $M=10$, $\tau=0.5$ and plot the empirical and theoretical success rates of LEnSolver with $T\in\{0,1,\ldots,5\}$. Figure 4 shows the plots for the two base model types and $\alpha\in\{0.2,0.5,0.8\}$. From the figure, we can observe that the empirical success rates increase with $T$, and our theoretical rates correctly reflect this trend. Furthermore, the theoretical lines also correctly capture the dynamics of the corresponding empirical lines (e.g., both lines plateau when $T\geq 2$ and $\alpha=0.8$), although the theoretical lines do not follow the empirical lines tightly, especially for larger $T$. This is because we derive our bounds using $\beta_{\text{min}}$ and $\beta_{\text{max}}$ instead of all $\beta_{i}$’s, leading to relatively loose bounds for the skip rate $p\left(m(x)=s_{\text{skip}}\right)$ in Lemma 4 and the final success rate in Theorem 2. Nevertheless, the theoretical rates are still reasonable lower bounds of the actual success rates, which are much higher in practice.

Figure 4: Actual and theoretical success rates of LEnSolver with respect to the maximum number of skips ($T$) for different values of $\alpha$ and base model types. Panels: (a) $\alpha=0.2$; (b) $\alpha=0.5$; (c) $\alpha=0.8$.

5.5 Discussions

In our experiment results in Tables 1 and 3, there are some cases where the theoretical rates are higher than the empirical rates (e.g., when $M=6$, $\tau=0.7$ and $\alpha=0.8$ in Table 1 or when $M=2$ and $\alpha=1$ in Table 3), although our theorems prove lower bound results for these theoretical rates. This is due to the uniform assumptions (A2) and (A3) in Section 4.1 being violated in practice. Thus, a potential direction for future work is to relax these assumptions to develop an improved theory for our methods. Nevertheless, even in these cases, our experiment results still show that the theoretical rates are close to the empirical rates.

Another issue with our theoretical rates is the relatively loose bound compared to the empirical rates for large $M$. This is due to the fact that we derive our bounds using $\beta_{\text{min}}$ and $\beta_{\text{max}}$. In principle, we can make the bounds tighter by using all the $\beta_{i}$’s of the base models. But this would make the bounds more complicated to derive and compute for large ensembles. A potential future work would be to derive tighter bounds that are both simple and easy to compute. It is also worth emphasizing that even though our bounds are loose, they are non-trivial and can be used as pessimistic indicators of the solvers’ effectiveness, and the actual success rates would be much higher.

6 Conclusion

We proposed EnSolver and LEnSolver, novel end-to-end uncertainty-aware CAPTCHA solvers that can detect and skip out-of-distribution inputs. Our solvers use a deep ensemble of base CAPTCHA solvers to obtain uncertainty estimates and use them to make decisions. We prove theoretical guarantees on the effectiveness of the approaches and show empirically that our solvers can achieve good success rates in the presence of out-of-distribution data. Our work is potentially helpful for security experts to better understand the capability of automatic CAPTCHA solvers and improve the defense against these attacks.


Acknowledgments and Disclosure of Funding

This work was partially supported by the U.S. National Science Foundation (Award: 2331908), US National Security Agency (Award: H982302110324) and Microsoft Security AI. The views expressed are those of the authors only and not of the funding agencies.

Appendix A Proofs of Theoretical Results

Additional notations. Recall that the EnSolver $m$ consists of $M$ base models $m_1, m_2, \ldots, m_M$. For any $x \in \mathcal{X}$ and $s \in \mathcal{S}$, let $n(x,s)$ be the number of base models predicting the output $s$ for $x$. That is,

\[
n(x,s) = \left| \{ i : m_i(x) = s,\ i \in \{1,2,\ldots,M\} \} \right|.
\]

For any $i \in \{1,2,\ldots,M\}$, let $\mathcal{K}_i$ be the set of all subsets of size $i$ of $\{1,2,\ldots,M\}$:

\[
\mathcal{K}_i = \{ \kappa \subseteq \{1,2,\ldots,M\} : |\kappa| = i \}.
\]

For any $\kappa \in \mathcal{K}_i$, we denote $\overline{\kappa} = \{1,2,\ldots,M\} \setminus \kappa$. We also let $\overline{\mathcal{K}}_i$ be the set of all $\overline{\kappa}$'s:

\[
\overline{\mathcal{K}}_i = \{ \overline{\kappa} = \{1,2,\ldots,M\} \setminus \kappa : \kappa \in \mathcal{K}_i \}.
\]
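The notation above can be made concrete with a minimal Python sketch. The function names and the example predictions are hypothetical and only illustrate the definitions of $n(x,s)$, $\mathcal{K}_i$ and the complement of $\kappa$; they are not taken from the released source code.

```python
from collections import Counter
from itertools import combinations


def vote_counts(predictions):
    """Return n(x, s) for every output s predicted by at least one base model.

    `predictions` is the list [m_1(x), ..., m_M(x)] for a single input x.
    Outputs that no model predicted have count 0 (Counter's default).
    """
    return Counter(predictions)


def subsets_of_size(M, i):
    """Return K_i: all subsets of {1, ..., M} with exactly i elements."""
    return [frozenset(c) for c in combinations(range(1, M + 1), i)]


def complement(kappa, M):
    """Return the complement of kappa within {1, ..., M}."""
    return frozenset(range(1, M + 1)) - kappa


# Hypothetical outputs of M = 5 base models on one CAPTCHA image.
predictions = ["ab3k", "ab3k", "ab8k", "ab3k", "xb3k"]
n = vote_counts(predictions)
K2 = subsets_of_size(5, 2)
```

Here `n["ab3k"]` plays the role of $n(x,s)$ for $s = \text{ab3k}$, and `K2` enumerates $\mathcal{K}_2$ for $M = 5$.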

We now prove the following lemma that will be used in our proofs.

Lemma 5

The following inequality holds:

\[
\sum_{i=M(1-\tau)+1}^{M} \binom{M}{i} \frac{1}{(N_{\mathcal{S}})^{i}} < \frac{\mathcal{E}(M, N_{\mathcal{S}}, \tau)}{N_{\mathcal{S}} - 1}.
\]

Proof  For brevity, we denote $k = M(1-\tau)+1$. For any positive integer $M$ and any integer $i$ with $0 \leq i \leq M$, we have:

\[
\binom{M}{i} \leq \binom{M}{\lfloor M/2 \rfloor}.
\]

Therefore,

\begin{align*}
\sum_{i=k}^{M} \binom{M}{i} \frac{1}{(N_{\mathcal{S}})^{i}}
&\leq \sum_{i=k}^{M} \binom{M}{\lfloor M/2 \rfloor} \frac{1}{(N_{\mathcal{S}})^{i}} \\
&= \binom{M}{\lfloor M/2 \rfloor} \frac{1}{(N_{\mathcal{S}})^{k}} \sum_{i=0}^{M-k} \frac{1}{(N_{\mathcal{S}})^{i}} \\
&= \binom{M}{\lfloor M/2 \rfloor} \frac{1}{(N_{\mathcal{S}})^{k}} \cdot \frac{1 - \frac{1}{(N_{\mathcal{S}})^{M+1-k}}}{1 - \frac{1}{N_{\mathcal{S}}}} \\
&< \binom{M}{\lfloor M/2 \rfloor} \frac{1}{(N_{\mathcal{S}})^{k}} \cdot \frac{1}{1 - \frac{1}{N_{\mathcal{S}}}} \\
&= \binom{M}{\lfloor M/2 \rfloor} \frac{1}{(N_{\mathcal{S}})^{k-1} (N_{\mathcal{S}} - 1)} = \frac{\mathcal{E}(M, N_{\mathcal{S}}, \tau)}{N_{\mathcal{S}} - 1}.
\end{align*}
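As a numerical sanity check of Lemma 5, the following Python sketch evaluates both sides of the inequality exactly with rational arithmetic. It assumes $\mathcal{E}(M, N_{\mathcal{S}}, \tau) = \binom{M}{\lfloor M/2 \rfloor} / (N_{\mathcal{S}})^{k-1}$ with $k = M(1-\tau)+1$, which is how the last line of the proof reads. The function names are hypothetical, and $k$ is passed directly as an integer instead of $\tau$ to avoid rounding issues.

```python
from fractions import Fraction
from math import comb


def epsilon_bound(M, N_S, k):
    """E(M, N_S, tau), read off the last line of the proof, with k = M(1 - tau) + 1."""
    return Fraction(comb(M, M // 2), N_S ** (k - 1))


def lemma5_lhs(M, N_S, k):
    """Left-hand side: sum over i = k..M of C(M, i) / N_S^i."""
    return sum(Fraction(comb(M, i), N_S ** i) for i in range(k, M + 1))


def lemma5_holds(M, N_S, k):
    """Check the strict inequality of Lemma 5 for one parameter setting."""
    return lemma5_lhs(M, N_S, k) < epsilon_bound(M, N_S, k) / (N_S - 1)


# Small grid of illustrative settings (N_S >= 2, 1 <= k <= M).
all_ok = all(
    lemma5_holds(M, N_S, k)
    for M in range(1, 9)
    for N_S in range(2, 7)
    for k in range(1, M + 1)
)
```

Such a check does not replace the proof; it only guards against transcription errors in the bound.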

 

A.1 Proof of Lemma 1

Let $k = M(1-\tau)+1$. We have:

\begin{align*}
p&\left( m(x) \notin \{s_x, s_{\text{skip}}\} \,|\, x \in \mathcal{X}^{\text{out}} \right) \\
&= p\left( \exists s \in \mathcal{S} \setminus \{s_x\},\ m(x) = s \,|\, x \in \mathcal{X}^{\text{out}} \right) \\
&\leq p\left( \exists s \in \mathcal{S} \setminus \{s_x\},\ n(x,s) \geq k \,|\, x \in \mathcal{X}^{\text{out}} \right) \\
&\leq \sum_{s \in \mathcal{S} \setminus \{s_x\}} p\left( n(x,s) \geq k \,|\, x \in \mathcal{X}^{\text{out}} \right) \\
&= \sum_{s \in \mathcal{S} \setminus \{s_x\}} \, \sum_{i=k}^{M} \binom{M}{i} \frac{1}{(N_{\mathcal{S}})^{i}} \left( 1 - \frac{1}{N_{\mathcal{S}}} \right)^{M-i} && \text{(assumption A3)} \\
&< \sum_{s \in \mathcal{S} \setminus \{s_x\}} \, \sum_{i=k}^{M} \binom{M}{i} \frac{1}{(N_{\mathcal{S}})^{i}} = (N_{\mathcal{S}} - 1) \sum_{i=k}^{M} \binom{M}{i} \frac{1}{(N_{\mathcal{S}})^{i}} < \mathcal{E}(M, N_{\mathcal{S}}, \tau),
\end{align*}

where the last inequality comes from Lemma 5. \hfill\blacksquare
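For a tiny ensemble, the quantity bounded in this proof can be computed exactly by brute force. The sketch below assumes, following the binomial expression attributed to assumption A3, that on an out-of-distribution input each base model independently outputs one of the $N_{\mathcal{S}}$ labels uniformly at random. It also assumes $\mathcal{E}(M, N_{\mathcal{S}}, \tau) = \binom{M}{\lfloor M/2 \rfloor} / (N_{\mathcal{S}})^{k-1}$, as read off the proof of Lemma 5. All names and parameter values are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb


def epsilon_bound(M, N_S, k):
    """E(M, N_S, tau) with k = M(1 - tau) + 1, as read off the proof of Lemma 5."""
    return Fraction(comb(M, M // 2), N_S ** (k - 1))


def wrong_label_reaches_k(M, N_S, k, true_label=0):
    """Exact P[exists s != true_label with n(x, s) >= k].

    Enumerates all N_S^M equally likely vote tuples, so it is only
    practical for very small M and N_S.
    """
    hits = 0
    for votes in product(range(N_S), repeat=M):
        counts = Counter(votes)
        if any(c >= k for s, c in counts.items() if s != true_label):
            hits += 1
    return Fraction(hits, N_S ** M)


# Toy setting: 4 base models, 3 possible labels, threshold k = 3.
prob = wrong_label_reaches_k(M=4, N_S=3, k=3)
bound = epsilon_bound(M=4, N_S=3, k=3)
```

In this toy setting `prob` is well below `bound`, in line with the lemma. The event enumerated here is the one in the third line of the derivation, which upper-bounds the probability of a wrong non-skip answer.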

A.2 Proof of Lemma 2

Let $k = M(1-\tau)+1$. For any $s \in \mathcal{S}$ such that $s \neq s_x$, we have:

\begin{align*}
p&\left( n(x,s) \geq k \,|\, x \in \mathcal{X}^{\text{in}} \right) \\
&= \sum_{i=k}^{M} \, \sum_{\kappa \in \mathcal{K}_i} \left( \, \prod_{j \in \kappa} \left( \frac{1-\beta_j}{N_{\mathcal{S}}-1} \right) \prod_{t \in \overline{\kappa}} \left( 1 - \frac{1-\beta_t}{N_{\mathcal{S}}-1} \right) \right) && \text{(assumption A2)} \\
&\leq \sum_{i=k}^{M} \, \sum_{\kappa \in \mathcal{K}_i} \left( \frac{1-\beta_{\text{min}}}{N_{\mathcal{S}}-1} \right)^{i} \left( 1 - \frac{1-\beta_{\text{max}}}{N_{\mathcal{S}}-1} \right)^{M-i} \\
&= \sum_{i=k}^{M} \binom{M}{i} \left( \frac{1-\beta_{\text{min}}}{N_{\mathcal{S}}-1} \right)^{i} \left( 1 - \frac{1-\beta_{\text{max}}}{N_{\mathcal{S}}-1} \right)^{M-i} \\
&< \sum_{i=k}^{M} \binom{M}{i} \frac{1}{(N_{\mathcal{S}})^{i}} && \text{(assumption A1 and $1 - \frac{1-\beta_{\text{max}}}{N_{\mathcal{S}}-1} < 1$)} \\
&< \frac{\mathcal{E}(M, N_{\mathcal{S}}, \tau)}{N_{\mathcal{S}} - 1}. && \text{(Lemma 5)}
\end{align*}

Therefore,

\begin{align*}
p\left( \exists s \neq s_x,\ n(x,s) \geq k \,|\, x \in \mathcal{X}^{\text{in}} \right)
&\leq \sum_{s \in \mathcal{S} \setminus \{s_x\}} p\left( n(x,s) \geq k \,|\, x \in \mathcal{X}^{\text{in}} \right) \\
&< \sum_{s \in \mathcal{S} \setminus \{s_x\}} \frac{\mathcal{E}(M, N_{\mathcal{S}}, \tau)}{N_{\mathcal{S}} - 1} = \mathcal{E}(M, N_{\mathcal{S}}, \tau).
\end{align*}

\hfill\blacksquare
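The first expression in the proof of Lemma 2 sums over the subsets $\kappa \in \mathcal{K}_i$ of base models that vote for a fixed wrong label $s$. A minimal Python sketch of that sum is given below. It assumes, as the expression attributed to assumption A2 suggests, that base model $j$ outputs any particular wrong label with probability $(1-\beta_j)/(N_{\mathcal{S}}-1)$, independently of the other models. It also assumes $\mathcal{E}(M, N_{\mathcal{S}}, \tau) = \binom{M}{\lfloor M/2 \rfloor} / (N_{\mathcal{S}})^{k-1}$, as read off the proof of Lemma 5. The accuracies below are hypothetical and chosen well above $1/N_{\mathcal{S}}$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, prod


def epsilon_bound(M, N_S, k):
    """E(M, N_S, tau) with k = M(1 - tau) + 1, as read off the proof of Lemma 5."""
    return Fraction(comb(M, M // 2), N_S ** (k - 1))


def wrong_label_tail(betas, N_S, k):
    """Exact P[n(x, s) >= k] for one fixed wrong label s.

    betas[j] is the accuracy of base model j on in-distribution inputs.
    The double loop mirrors the sums over i and over kappa in K_i.
    """
    M = len(betas)
    # Probability that model j outputs this particular wrong label.
    q = [(1 - b) / (N_S - 1) for b in betas]
    total = Fraction(0)
    for i in range(k, M + 1):
        for kappa in combinations(range(M), i):
            voted = prod(q[j] for j in kappa)
            not_voted = prod(1 - q[t] for t in range(M) if t not in kappa)
            total += voted * not_voted
    return total


# Hypothetical accuracies of M = 3 base models, with N_S = 4 labels.
betas = [Fraction(9, 10), Fraction(4, 5), Fraction(17, 20)]
tail = wrong_label_tail(betas, N_S=4, k=2)
per_label_bound = epsilon_bound(M=3, N_S=4, k=2) / (4 - 1)
```

Using `Fraction` inputs keeps the computation exact, so the comparison `tail < per_label_bound` is not affected by floating-point error.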

A.3 Proof of Theorem 1

From Definition 1, we have:

\begin{align*}
R_{\text{rd}}(m)
&= p(m(x) = s_x,\ x \in \mathcal{X}^{\text{in}}) + p(m(x) \in \{s_x, s_{\text{skip}}\},\ x \in \mathcal{X}^{\text{out}}) \\
&= p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{in}}) \, p(x \in \mathcal{X}^{\text{in}}) + p(m(x) \in \{s_x, s_{\text{skip}}\} \,|\, x \in \mathcal{X}^{\text{out}}) \, p(x \in \mathcal{X}^{\text{out}}) \\
&= \alpha \, p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{in}}) + (1-\alpha) \, p(m(x) \in \{s_x, s_{\text{skip}}\} \,|\, x \in \mathcal{X}^{\text{out}}). \tag{5}
\end{align*}

From Lemma 1,

\begin{align*}
p(m(x) \in \{s_x, s_{\text{skip}}\} \,|\, x \in \mathcal{X}^{\text{out}})
&= 1 - p(m(x) \notin \{s_x, s_{\text{skip}}\} \,|\, x \in \mathcal{X}^{\text{out}}) \\
&\geq 1 - \mathcal{E}(M, N_{\mathcal{S}}, \tau). \tag{6}
\end{align*}

Thus, we only need to lower bound $p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{in}})$. Let $k = M(1-\tau)+1$ and consider the following two events:

\begin{align*}
(A) &: n(x, s_x) \geq k, \\
(B) &: n(x, s) < n(x, s_x),\ \forall s \neq s_x.
\end{align*}

Note that:

\begin{align*}
p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{in}})
&= p(A \wedge B \,|\, x \in \mathcal{X}^{\text{in}}) \\
&= p(A \,|\, x \in \mathcal{X}^{\text{in}}) - p(A \wedge \neg B \,|\, x \in \mathcal{X}^{\text{in}}). \tag{7}
\end{align*}

We have:

\begin{align*}
p(A \,|\, x \in \mathcal{X}^{\text{in}})
&= \sum_{i=k}^{M} \; \sum_{\kappa \in \mathcal{K}_i} \left( \prod_{j \in \kappa} \beta_j \right) \left( \prod_{t \in \overline{\kappa}} (1 - \beta_t) \right) \\
&\geq \sum_{i=k}^{M} \; \sum_{\kappa \in \mathcal{K}_i} \beta_{\text{min}}^{i} \, (1 - \beta_{\text{max}})^{M-i} \\
&= \sum_{i=k}^{M} \binom{M}{i} \beta_{\text{min}}^{i} \, (1 - \beta_{\text{max}})^{M-i}. \tag{8}
\end{align*}

We also have:

\begin{align*}
p(A \wedge \neg B \,|\, x \in \mathcal{X}^{\text{in}})
&= p\left( \left\{ \begin{array}{l} n(x, s_x) \geq k \\ \exists s \neq s_x,\ n(x,s) \geq n(x, s_x) \end{array} \right. \Big|\ x \in \mathcal{X}^{\text{in}} \right) \tag{11} \\
&\leq p\left( \left\{ \begin{array}{l} n(x, s_x) \geq k \\ \exists s \neq s_x,\ n(x,s) \geq k \end{array} \right. \Big|\ x \in \mathcal{X}^{\text{in}} \right) \tag{14} \\
&\leq p\left( \exists s \neq s_x,\ n(x,s) \geq k \ |\ x \in \mathcal{X}^{\text{in}} \right) \\
&< \mathcal{E}(M, N_{\mathcal{S}}, \tau), \tag{15}
\end{align*}

where the last inequality is from Lemma 2. Combining (7), (8) and (15), we have:

\begin{align*}
p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{in}}) &> \sum_{i=k}^{M} \binom{M}{i} \beta_{\text{min}}^{i} \, (1 - \beta_{\text{max}})^{M-i} - \mathcal{E}(M, N_{\mathcal{S}}, \tau). \tag{16}
\end{align*}

From (5), (6) and (16), we have:

\begin{align*}
R_{\text{rd}}(m)
&> \alpha \left( \, \sum_{i=k}^{M} \binom{M}{i} \beta_{\text{min}}^{i} \, (1 - \beta_{\text{max}})^{M-i} - \mathcal{E}(M, N_{\mathcal{S}}, \tau) \right) + (1-\alpha) \left( 1 - \mathcal{E}(M, N_{\mathcal{S}}, \tau) \right) \\
&= \alpha \sum_{i=k}^{M} \binom{M}{i} \beta_{\text{min}}^{i} \, (1 - \beta_{\text{max}})^{M-i} + (1-\alpha) - \mathcal{E}(M, N_{\mathcal{S}}, \tau).
\end{align*}

\hfill\blacksquare
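To get a feel for the magnitude of the bound just derived, the following Python sketch evaluates its right-hand side in floating point. It assumes $\mathcal{E}(M, N_{\mathcal{S}}, \tau) = \binom{M}{\lfloor M/2 \rfloor} / (N_{\mathcal{S}})^{k-1}$ with $k = M(1-\tau)+1$, as read off the proof of Lemma 5. The function name and all parameter values are hypothetical, including the output space of four alphanumeric characters.

```python
from math import comb


def theorem1_lower_bound(M, N_S, k, alpha, beta_min, beta_max):
    """Right-hand side of the final inequality in the proof of Theorem 1.

    alpha is the fraction of in-distribution inputs, and beta_min / beta_max
    are the smallest and largest base-model accuracies.
    """
    eps = comb(M, M // 2) / N_S ** (k - 1)
    tail = sum(
        comb(M, i) * beta_min ** i * (1 - beta_max) ** (M - i)
        for i in range(k, M + 1)
    )
    return alpha * tail + (1 - alpha) - eps


# Hypothetical setting: 5 base models, 36^4 possible strings, threshold k = 3,
# half of the inputs in-distribution, base accuracies between 0.90 and 0.95.
bound = theorem1_lower_bound(
    M=5, N_S=36 ** 4, k=3, alpha=0.5, beta_min=0.9, beta_max=0.95
)
```

With a large output space the $\mathcal{E}$ term is negligible in this setting, so the bound is driven by the binomial-style sum and by $\alpha$.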

A.4 Proof of Lemma 3

Note that:

\[
p\left( m(x) = s_x \right) = \alpha \, p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{in}}) + (1-\alpha) \, p(m(x) = s_x \,|\, x \in \mathcal{X}^{\text{out}}).
\]

From (16) in the proof of Theorem 1 above, we have:

\[
p(m(x)=s_{x}\,|\,x\in\mathcal{X}^{\text{in}})>\sum_{i=k}^{M}\binom{M}{i}\beta_{\text{min}}^{i}\,(1-\beta_{\text{max}})^{M-i}-\mathcal{E}(M,N_{\mathcal{S}},\tau).
\]

In addition, $p(m(x)=s_{x}\,|\,x\in\mathcal{X}^{\text{out}})>0$. Therefore,

\[
p\left(m(x)=s_{x}\right)>\alpha\sum_{i=k}^{M}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}-\alpha\,\mathcal{E}(M,N_{\mathcal{S}},\tau).
\]

\hfill $\blacksquare$
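As a rough numerical illustration of the bound derived in this proof, the sketch below evaluates its right-hand side, $\alpha\sum_{i=k}^{M}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}-\alpha\,\mathcal{E}(M,N_{\mathcal{S}},\tau)$. The function names, argument names, and example values are assumptions made for illustration only. In particular, the value of $\mathcal{E}(M,N_{\mathcal{S}},\tau)$ is passed in as a plain number (`err`) rather than computed from its definition.

```python
from math import comb


def binomial_tail_term(M: int, k: int, beta_min: float, beta_max: float) -> float:
    """Sum over i = k..M of C(M, i) * beta_min**i * (1 - beta_max)**(M - i)."""
    return sum(
        comb(M, i) * beta_min**i * (1.0 - beta_max) ** (M - i)
        for i in range(k, M + 1)
    )


def lemma3_lower_bound(alpha, M, k, beta_min, beta_max, err):
    """Right-hand side of the bound; `err` stands in for E(M, N_S, tau)."""
    return alpha * (binomial_tail_term(M, k, beta_min, beta_max) - err)


# Hypothetical values, chosen only to exercise the formula.
bound = lemma3_lower_bound(alpha=0.8, M=5, k=3, beta_min=0.9, beta_max=0.95, err=0.01)
print(f"lower bound on p(m(x) = s_x): {bound:.4f}")
```

Note that the bound can be small, or even negative and hence vacuous, when $\beta_{\text{min}}$ and $\beta_{\text{max}}$ are far apart, since the summand is not a proper binomial probability in that case.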

A.5 Proof of Lemma 4

Similar to the proof above, we decompose the probability $p\left(m(x)=s_{\text{skip}}\right)$ as:

\[
p\left(m(x)=s_{\text{skip}}\right)=\alpha\,p(m(x)=s_{\text{skip}}\,|\,x\in\mathcal{X}^{\text{in}})+(1-\alpha)\,p(m(x)=s_{\text{skip}}\,|\,x\in\mathcal{X}^{\text{out}}). \tag{17}
\]

We first compute a lower bound for the probability on $\mathcal{X}^{\text{out}}$. Let $k=M(1-\tau)+1$. Since $m(x)=s_{\text{skip}}$ if and only if $n(x,s)<k$ for all $s\in\mathcal{S}$, we have:

\[
\begin{aligned}
p(m(x)=s_{\text{skip}}\,|\,x\in\mathcal{X}^{\text{out}}) &=1-p\left(\exists s\in\mathcal{S},\ n(x,s)\geq k\,|\,x\in\mathcal{X}^{\text{out}}\right)\\
&\geq 1-\sum_{s\in\mathcal{S}}p\left(n(x,s)\geq k\,|\,x\in\mathcal{X}^{\text{out}}\right).
\end{aligned}
\]

Using the same argument as in the proof of Lemma 1, for each $s\in\mathcal{S}$, we have:

\[
p\left(n(x,s)\geq k\,|\,x\in\mathcal{X}^{\text{out}}\right)<\frac{\mathcal{E}(M,N_{\mathcal{S}},\tau)}{N_{\mathcal{S}}-1}.
\]

This implies:

\[
p(m(x)=s_{\text{skip}}\,|\,x\in\mathcal{X}^{\text{out}})>1-\sum_{s\in\mathcal{S}}\frac{\mathcal{E}(M,N_{\mathcal{S}},\tau)}{N_{\mathcal{S}}-1}=1-\frac{N_{\mathcal{S}}}{N_{\mathcal{S}}-1}\,\mathcal{E}(M,N_{\mathcal{S}},\tau). \tag{18}
\]
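The decision rule that this part of the proof relies on, namely that the ensemble outputs $s_{\text{skip}}$ exactly when no string receives at least $k$ votes, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `ensemble_decide`, the `SKIP` sentinel, and the example predictions are hypothetical, the threshold `k` is taken as a given integer, and returning the most-voted string when it reaches `k` votes is an assumption about how the non-skip case is resolved.

```python
from collections import Counter

SKIP = None  # hypothetical sentinel standing in for s_skip


def ensemble_decide(predictions, k):
    """Return the most-voted string if it has at least k votes, else SKIP.

    `predictions` holds one predicted string per ensemble member, so the
    vote count of a string s plays the role of n(x, s).
    """
    if not predictions:
        return SKIP
    top_string, top_votes = Counter(predictions).most_common(1)[0]
    return top_string if top_votes >= k else SKIP


# Five hypothetical base solvers with threshold k = 4.
print(ensemble_decide(["a7kq", "a7kq", "a7kq", "a7kq", "a7kp"], k=4))  # a7kq
print(ensemble_decide(["a7kq", "a7kq", "a7kp", "a1kq", "x7kq"], k=4))  # None
```

In the second call the most frequent string has only two votes, so the condition $n(x,s)<k$ holds for every $s$ and the sample is skipped.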

Next, we compute a lower bound for the probability on $\mathcal{X}^{\text{in}}$. We have:

\[
\begin{aligned}
p(m(x)=s_{\text{skip}}\,|\,x\in\mathcal{X}^{\text{in}})
&=p(\forall s\in\mathcal{S},\ n(x,s)<k\,|\,x\in\mathcal{X}^{\text{in}})\\
&=p((n(x,s_{x})<k)\wedge(\forall s\neq s_{x},\ n(x,s)<k)\,|\,x\in\mathcal{X}^{\text{in}})\\
&=p((n(x,s_{x})<k)\wedge\neg(\exists s\neq s_{x},\ n(x,s)\geq k)\,|\,x\in\mathcal{X}^{\text{in}})\\
&\geq p(n(x,s_{x})<k\,|\,x\in\mathcal{X}^{\text{in}})-p(\exists s\neq s_{x},\ n(x,s)\geq k\,|\,x\in\mathcal{X}^{\text{in}}),
\end{aligned}
\]

where we use the fact that $p(A\wedge\neg B)\geq p(A)-p(B)$. From Lemma 2, we know that $p(\exists s\neq s_{x},\ n(x,s)\geq k\,|\,x\in\mathcal{X}^{\text{in}})<\mathcal{E}(M,N_{\mathcal{S}},\tau)$. Moreover, using an argument similar to the one used to derive (8) in the proof of Theorem 1, we have:

\[
\begin{aligned}
p(n(x,s_{x})<k\,|\,x\in\mathcal{X}^{\text{in}}) &=\sum_{i=0}^{k-1}\,\sum_{\kappa\in\mathcal{K}_{i}}\left(\prod_{j\in\kappa}\beta_{j}\right)\left(\prod_{t\in\overline{\kappa}}(1-\beta_{t})\right)\\
&\geq\sum_{i=0}^{k-1}\,\sum_{\kappa\in\mathcal{K}_{i}}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}\\
&=\sum_{i=0}^{k-1}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}.
\end{aligned}
\]

Thus,

\[
p(m(x)=s_{\text{skip}}\,|\,x\in\mathcal{X}^{\text{in}})>\sum_{i=0}^{k-1}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}-\mathcal{E}(M,N_{\mathcal{S}},\tau). \tag{19}
\]
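For completeness, the inequality $p(A\wedge\neg B)\geq p(A)-p(B)$ used in the derivation of (19) can be verified directly by splitting the event $A$ into its parts outside and inside $B$:

```latex
\begin{aligned}
p(A) &= p(A\wedge\neg B) + p(A\wedge B) \\
     &\leq p(A\wedge\neg B) + p(B),
\end{aligned}
```

where the second line uses $p(A\wedge B)\leq p(B)$. Rearranging gives $p(A\wedge\neg B)\geq p(A)-p(B)$.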

From (17), (18) and (19), we have:

\[
\begin{aligned}
p\left(m(x)=s_{\text{skip}}\right) &>\alpha\left(\sum_{i=0}^{k-1}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}-\mathcal{E}(M,N_{\mathcal{S}},\tau)\right)\\
&\qquad+(1-\alpha)\left(1-\frac{N_{\mathcal{S}}}{N_{\mathcal{S}}-1}\,\mathcal{E}(M,N_{\mathcal{S}},\tau)\right)\\
&=\alpha\sum_{i=0}^{k-1}\binom{M}{i}\beta_{\text{min}}^{i}(1-\beta_{\text{max}})^{M-i}+(1-\alpha)-\frac{N_{\mathcal{S}}-\alpha}{N_{\mathcal{S}}-1}\,\mathcal{E}(M,N_{\mathcal{S}},\tau).
\end{aligned}
\]

\hfill $\blacksquare$
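The last simplification in the proof above can be sanity-checked numerically. The sketch below evaluates the bound both in its unsimplified form (the combination of (17), (18) and (19)) and in its simplified form, using exact rational arithmetic so that the two can be compared for equality. All function names and input values are hypothetical, and $\mathcal{E}(M,N_{\mathcal{S}},\tau)$ is supplied as a number (`err`) rather than computed from its definition.

```python
from fractions import Fraction
from math import comb


def head_term(M, k, beta_min, beta_max):
    """Sum over i = 0..k-1 of C(M, i) * beta_min**i * (1 - beta_max)**(M - i)."""
    return sum(
        comb(M, i) * beta_min**i * (1 - beta_max) ** (M - i) for i in range(k)
    )


def skip_bound_unsimplified(alpha, M, k, beta_min, beta_max, n_s, err):
    """alpha * (head - err) + (1 - alpha) * (1 - n_s / (n_s - 1) * err)."""
    head = head_term(M, k, beta_min, beta_max)
    return alpha * (head - err) + (1 - alpha) * (1 - Fraction(n_s, n_s - 1) * err)


def skip_bound_simplified(alpha, M, k, beta_min, beta_max, n_s, err):
    """alpha * head + (1 - alpha) - (n_s - alpha) / (n_s - 1) * err."""
    head = head_term(M, k, beta_min, beta_max)
    return alpha * head + (1 - alpha) - (n_s - alpha) / (n_s - 1) * err


# Hypothetical inputs; Fractions keep the comparison exact.
args = dict(
    alpha=Fraction(4, 5), M=5, k=3,
    beta_min=Fraction(9, 10), beta_max=Fraction(19, 20),
    n_s=1000, err=Fraction(1, 100),
)
assert skip_bound_unsimplified(**args) == skip_bound_simplified(**args)
print(float(skip_bound_simplified(**args)))
```

The agreement of the two forms reflects the identity $\alpha+(1-\alpha)\frac{N_{\mathcal{S}}}{N_{\mathcal{S}}-1}=\frac{N_{\mathcal{S}}-\alpha}{N_{\mathcal{S}}-1}$ for the coefficient of $\mathcal{E}(M,N_{\mathcal{S}},\tau)$.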

