Active Sequential Two-Sample Testing

Weizhi Li
Arizona State University
Los Alamos National Laboratory
Prad Kadambi
Arizona State University
Pouria Saidi
Arizona State University
Karthikeyan Natesan Ramamurthy
IBM Research
Gautam Dasarathy
Arizona State University
Visar Berisha
Arizona State University
Work done when the author was at Arizona State University.
Abstract

A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first active sequential two-sample testing framework, which queries labels not only sequentially but also actively. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the “high-dependency” features leads to increased power of the proposed testing framework. In theory, we prove that our framework produces an anytime-valid $p$-value. In addition, we characterize the proposed framework’s gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.

1 Introduction

The two-sample test is a statistical hypothesis test applied to data samples (or measurements) from two distributions. The goal is to test if the data supports the hypothesis that the distributions are different. If we consider each data point as a feature and label (which tells us which distribution the data is from) pair, then the two-sample test is equivalent to the problem of testing the dependence between the features and the labels. Viewed with this lens, the null hypothesis for the two-sample test states that the feature and label variables are independent, and the alternate hypothesis states the opposite. The analyst performing the two-sample test needs to decide between the null and the alternative hypotheses with data from the two distributions.

The analyst typically knows little about the difficulty of a two-sample testing problem before running the test. Fixing the sample size a priori may result in a test that needs to collect additional evidence to arrive at a final decision (if the problem is hard) or in an inefficient test with over-collected data (if the problem is simple). To address this dichotomy, the research community has proposed sequential two-sample tests (Wald, 1992; Lhéritier & Cazals, 2018; Hajnal, 1961; Shekhar & Ramdas, 2021; Balsubramani & Ramdas, 2015) that allow the analyst to sequentially collect data and monitor statistical evidence, i.e., a statistic is computed from the data. The test can stop anytime when sufficient evidence has been accumulated to make a decision.

Existing sequential two-sample tests (Wald, 1992; Lhéritier & Cazals, 2018; Hajnal, 1961; Shekhar & Ramdas, 2021; Balsubramani & Ramdas, 2015) are devised to collect both sample features and sample labels simultaneously. In this paper, we consider the problem of sequential two-sample testing in a novel and practical setting where the cost of obtaining sample labels is high, but accessing sample features is inexpensive. As a result, the analyst can obtain a large collection of sample features without labels; she will need to sequentially query the labels of the sample features in the collection to perform the two-sample testing while ensuring the query complexity (i.e., the number of queried labels) does not exceed a label budget. A motivation for this formulation comes from the field of digital health: Physicians seek inexpensive digital measurements (e.g., gait, speech, typing speed measured using a patient’s smartphone) to replace traditional biomarkers (e.g., the amyloid buildup that indicates Alzheimer’s progression) which are often costly to access; hence they need to validate the dependency between the digital measurements (feature variables) and traditional biomarkers (label variables). While validation studies can access large registries to collect digital measurements remotely at scale, there is a fixed label budget for the expensive biomarker measures. An efficient sequential design would reveal the dependency between the features and the labels using only a reasonable label budget.

In this paper, we propose the active sequential testing framework shown in Figure 1. The framework initializes a classifier to model probabilities of sample labels given features using an initial random sample; next, depending on the classifier’s outputs, the framework queries the labels of features predicted to have a high dependency with the labels and constructs a test statistic $w$. The framework rejects the null if $w$ is smaller than a pre-defined significance level $\alpha$; otherwise, it either stops and retains the null if the label budget has run out, or re-enters the label query and decision-making step, which enables a sequential testing process.

Figure 1: The active sequential two-sample testing framework.

The test statistic $w$ in the framework is based on the likelihood ratio between the likelihood constructed under the null that feature and label variables are independent and the likelihood constructed under the alternative that the dependency between the feature and label variables exists. Such a likelihood ratio two-sample test statistic was first proposed in (Lhéritier & Cazals, 2018) to develop a non-active sequential two-sample test capable of controlling the Type I error (i.e., the probability of deciding for the alternative when the null is true). We adapt the original test statistic by replacing the pre-defined label probability prior with a maximum likelihood estimate, to fit our setting in which the label prior is unknown. More importantly, our framework actively labels the features that are predicted to have a high dependency on labels. We will characterize the benefits of the active query over the random query by the change of mutual information between feature and label variables in the asymptotic and finite-sample scenarios. In practice, we suggest using an active query scheme called bimodal query, proposed in (Li et al., 2022), which labels samples with the highest class one or zero probabilities.

We summarize the main contributions of our work as follows:

  • We introduce the first active sequential two-sample testing framework. We prove that the proposed framework produces an anytime-valid $p$-value to achieve Type I error control. Furthermore, we provide an information-theoretic interpretation of the proposed framework. We prove that, asymptotically, the framework is capable of generating the largest mutual information (MI) between feature and label variables under standard conditions (Györfi et al., 2002); and we also analyze the gain of the testing power for the proposed framework over its passive query parallel in the finite-sample scenario through MI.

  • We instantiate the framework using the bimodal query (Li et al., 2022) (i.e., queries the labels of the samples that have the highest class one or zero probabilities) as the label query scheme. We perform extensive experiments on synthetic data, MNIST, and an application-specific Alzheimer’s disease dataset to demonstrate the effectiveness of the instantiated framework. Our proposed test exhibits a significant reduction of the Type II error using fewer labeled samples compared with a non-active sequential testing baseline.

2 Related Works

The author of (Student, 1908) developed the $t$-test, probably the simplest form of a two-sample test, which compares the mean difference of two samples of univariate data. Since then, the research community has expanded the two-sample test to many other forms, e.g., the Hotelling test (Hotelling, 1992), the Friedman-Rafsky test (Friedman & Rafsky, 1979), the kernel two-sample test (Gretton et al., 2012) and the classifier two-sample test (Lopez-Paz & Oquab, 2016) for the multivariate case. These tests are constructed with various statistics, including the Mahalanobis distance, the measurement over a graph, a kernel embedding, or classifier accuracy, all in service of increasing testing power while controlling the Type I error. In particular, (Friedman & Rafsky, 1979; Gretton et al., 2012; Lopez-Paz & Oquab, 2016) test if the data from two samples is distributionally different, which is a generalization of the Hotelling test and the $t$-test (Student, 1908; Hotelling, 1992), which only detect the mean difference of two samples. These two-sample tests are batch tests that have been extensively used subject to a fixed sample size: when the collection of experimental data ends, an analyst performs the two-sample test on the data and makes a decision; she is not allowed to continue to collect and incorporate more data into the testing after a decision is made, as that would inflate the Type I error.

In contrast to the batch two-sample tests, the research community has developed a class of sequential two-sample tests (Lhéritier & Cazals, 2018; Shekhar & Ramdas, 2021; Pandeva et al., 2022) that allow the analyst to sequentially collect data and perform the two-sample test, enabling sequential decision-making. These sequential tests correct the Type I error inflation that would occur if a batch test were applied repeatedly, using statistical techniques such as Bonferroni correction (Dunn, 1961) and Ville’s maximal inequality (Doob, 1939).

There are also several works that consider the active setting in two-sample testing. The authors of (Li et al., 2022) proposed a batch two-sample test combined with active learning when curated labeled data is unavailable and querying the data labels is expensive. Several studies have also considered sequential testing for developing active sequential hypothesis tests (Naghshvar & Javidi, 2013; Chernoff, 1959; Bessler, 1960; Blot & Meeter, 1973; Keener, 1984; Kiefer & Sacks, 1963). However, these tests require a clear parametric description of the statistical models of the hypotheses. The authors of (Duan et al., 2022) developed an interactive rank test, which is distribution-free and can similarly perform the sequential two-sample testing in the active learning setting.

The work proposed herein uses the label query scheme in (Li et al., 2022) to develop the first multivariate non-parametric sequential test for the active learning setting with a novel test statistic and theoretical results. We demonstrate that the test controls the Type I error via Ville’s maximal inequality (see Theorem 5.1). Ville’s maximal inequality results in higher testing power than the Bonferroni correction for sequential testing (Shekhar & Ramdas, 2021; Ramdas et al., 2022).

While our framework in Figure 1 employs the label query scheme introduced in (Li et al., 2022), it offers distinct advantages over (Li et al., 2022):

  • Our proposed framework follows a sequential design. Upon accumulating sufficient evidence to reject the null hypothesis, our design automatically stops label collection before exhausting the label budget. In contrast, the batch design in (Li et al., 2022) invariably exhausts the label budget.

  • Utilizing a different test statistic, our framework enables finite-sample analysis, which is not provided in (Li et al., 2022).

3 Problem Statement and Preliminaries

3.1 Notations

We use a pair of random variables $(\mathbf{S}, Z)$ to denote a feature and its label variables whose realization is $(\mathbf{s}, z) \in \mathbb{R}^{d} \times \{0, 1\}$. The variable pair $(\mathbf{S}, Z)$ admits a joint distribution $p_{\mathbf{S}Z}(\mathbf{s}, z)$. Furthermore, we write $\mathcal{S}$ to denote the support of $p_{\mathbf{S}}(\mathbf{s})$. Formally, a two-sample testing problem consists of a null hypothesis $H_0$ that states $p_{\mathbf{S} \mid Z=0}(\mathbf{s}) = p_{\mathbf{S} \mid Z=1}(\mathbf{s})$ and an alternative hypothesis $H_1$ that states $p_{\mathbf{S} \mid Z=0}(\mathbf{s}) \neq p_{\mathbf{S} \mid Z=1}(\mathbf{s})$.
An analyst collects a sequence $\left((\mathbf{s}, z)_i\right)_{i=1}^{N}$ of $N$ realizations of $(\mathbf{S}, Z)$ to test $H_0$ against $H_1$. The problem is equivalent to testing the independence between $\mathbf{S}$ and $Z$. Therefore, we equivalently restate the hypothesis test as follows:

$$H_0: p_{\mathbf{S}Z}(\mathbf{s}, z) = p_{\mathbf{S}}(\mathbf{s}) P_Z(z), \quad \forall \mathbf{s} \in \mathcal{S}$$
$$H_1: p_{\mathbf{S}Z}(\mathbf{s}, z) \neq p_{\mathbf{S}}(\mathbf{s}) P_Z(z), \quad \exists \mathbf{s} \in \mathcal{S} \tag{1}$$
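
The equivalence between the two-sample formulation and the independence formulation follows from the law of total probability and the definition of conditional density. A brief sketch, assuming both labels have non-zero prior probability, $0 < P_Z(1) < 1$ (an assumption made here only so that the conditional densities are well defined):

```latex
\begin{align*}
p_{\mathbf{S}}(\mathbf{s})
  &= P_Z(0)\, p_{\mathbf{S}\mid Z=0}(\mathbf{s}) + P_Z(1)\, p_{\mathbf{S}\mid Z=1}(\mathbf{s}). \\[4pt]
\text{If } p_{\mathbf{S}\mid Z=0} = p_{\mathbf{S}\mid Z=1}: \quad
p_{\mathbf{S}}(\mathbf{s}) &= p_{\mathbf{S}\mid Z=z}(\mathbf{s})
  \;\Longrightarrow\;
  p_{\mathbf{S}Z}(\mathbf{s}, z) = p_{\mathbf{S}\mid Z=z}(\mathbf{s})\, P_Z(z)
  = p_{\mathbf{S}}(\mathbf{s})\, P_Z(z). \\[4pt]
\text{If } p_{\mathbf{S}Z}(\mathbf{s}, z) = p_{\mathbf{S}}(\mathbf{s})\, P_Z(z): \quad
p_{\mathbf{S}\mid Z=z}(\mathbf{s})
  &= \frac{p_{\mathbf{S}Z}(\mathbf{s}, z)}{P_Z(z)} = p_{\mathbf{S}}(\mathbf{s})
  \quad \text{for } z \in \{0, 1\}.
\end{align*}
```

In the second case both class-conditional densities equal the same marginal density, so they equal each other.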

Moving forward, we omit the subscripts in $p_{\mathbf{S}Z}(\mathbf{s}, z)$, $P_Z(z)$ and $p_{\mathbf{S}}(\mathbf{s})$ and write them as $p(\mathbf{s}, z)$, $P(z)$ and $p(\mathbf{s})$. In addition, we use $\mathbf{s}^{N}$, $z^{N}$ and $(\mathbf{s}, z)^{N}$ to denote sequences of samples $(\mathbf{s}_i)_{i=1}^{N}$, $(z_i)_{i=1}^{N}$ and $((\mathbf{s}, z)_i)_{i=1}^{N}$ respectively. We use similar notation throughout the paper.

3.2 The problem

In the typical setting of a sequential two-sample test, an analyst does not have prior knowledge of sample features. The analyst sequentially collects both sample features and their labels simultaneously with the corresponding random variable pair $(\mathbf{S}, Z)$ i.i.d. generated from a data-generating process, i.e., $p(\mathbf{s}, z)$. We consider a variant of the setting in which accessing sample features is free/inexpensive. Consequently, the analyst collects a large set $\mathcal{S}_u$ of sample features before performing a sequential test. However, accessing the label of a feature in $\mathcal{S}_u$ is costly. We assume the following fact throughout the paper: The already-collected $\mathcal{S}_u$ is the result of a sample feature collection process where all $\mathbf{s}_i \in \mathcal{S}_u$ are realizations of random variables $\mathbf{S}_i$ i.i.d. generated from $p(\mathbf{s})$.
There exists an oracle to return a label $z_i$ of $\mathbf{s}_i \in \mathcal{S}_u$ with the corresponding random variables $Z_i$ and $\mathbf{S}_i$ admitting the posterior probability $P(z_i \mid \mathbf{s}_i)$. We consider the following new sequential two-sample testing problem:

An active sequential two-sample testing problem: Suppose $\mathcal{S}_u$ is an unlabeled feature set, there exists an oracle that returns a label $z$ of $\mathbf{s} \in \mathcal{S}_u$, and $N_q$ is a limit on the number of times the oracle can be queried (i.e., the label budget). An analyst sequentially queries the oracle for the $z$ of $\mathbf{s} \in \mathcal{S}_u$. After querying a new $z_n$ of $\mathbf{s}_n$, for $1 \leq n \leq N_q$, the analyst needs to decide whether to terminate the label querying process and make a decision (i.e., whether to reject $H_0$) or continue with the querying process if $n < N_q$.
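
The query-then-decide protocol in this problem statement can be sketched as a generic loop. This is a minimal illustration only: the function name and the callables `oracle`, `choose`, and `should_reject` are hypothetical placeholders introduced here, and the toy components in the usage example do not constitute a valid statistical test.

```python
from typing import Callable, List, Sequence, Tuple

Feature = float  # toy feature type, assumed for illustration


def sequential_label_querying(
    pool: Sequence[Feature],
    oracle: Callable[[Feature], int],
    choose: Callable[[List[Feature], List[Tuple[Feature, int]]], Feature],
    should_reject: Callable[[List[Tuple[Feature, int]]], bool],
    budget: int,
) -> Tuple[bool, int]:
    """Query labels one at a time; stop on rejection or when the budget is spent.

    Returns (rejected_h0, number_of_labels_used).
    """
    unlabeled = list(pool)
    labeled: List[Tuple[Feature, int]] = []
    while unlabeled and len(labeled) < budget:
        s = choose(unlabeled, labeled)   # pick the next feature to label
        unlabeled.remove(s)              # each feature is labeled at most once
        labeled.append((s, oracle(s)))   # one oracle call uses one unit of budget
        if should_reject(labeled):       # decision step after every new label
            return True, len(labeled)
    return False, len(labeled)           # budget exhausted: H0 is retained


# Toy run with placeholder components (none of them is a real test).
decision = sequential_label_querying(
    pool=[1.0, 2.0, 3.0, 4.0, 5.0],
    oracle=lambda s: int(s) % 2,
    choose=lambda unlabeled, labeled: unlabeled[0],
    should_reject=lambda labeled: len(labeled) >= 3,
    budget=4,
)
print(decision)
```

Separating `choose` from `should_reject` mirrors the two design questions of the problem: which feature to label next, and when the accumulated evidence is sufficient to stop.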

An analyst actively labeling $\mathbf{s}_n \in \mathcal{S}_u$ may result in non-i.i.d. pairs of $(\mathbf{S}, Z)$; hence the distribution of $(\mathbf{S}, Z)$ is shifted away from $p(\mathbf{s}, z)$. In contrast, an analyst passively (or randomly) labeling $\mathbf{s}_n \in \mathcal{S}_u$ maintains $(\mathbf{S}, Z) \sim p(\mathbf{s}, z)$.
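
The distribution shift caused by selective labeling can be seen in a toy simulation. The sketch below is illustrative only: the one-dimensional Gaussian pool and the selection rule (label the features with the largest magnitude) are assumptions made for this example, not the query scheme used in this paper.

```python
import random

rng = random.Random(0)
pool = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # unlabeled features drawn i.i.d.
budget = 50

# Passive: label a uniformly random subset, so labeled features still follow p(s).
passive = rng.sample(pool, budget)

# Active (hypothetical rule): label the features with the largest |s|.
active = sorted(pool, key=abs, reverse=True)[:budget]


def mean_abs(xs):
    """Average magnitude of a list of features."""
    return sum(abs(x) for x in xs) / len(xs)


print(mean_abs(pool), mean_abs(passive), mean_abs(active))
```

By construction, the actively selected subset has a mean magnitude at least as large as that of any other subset of the same size, so its empirical feature distribution no longer matches $p(\mathbf{s})$.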

3.3 Evaluation metrics for the problem

In the following, we introduce the evaluation metrics used throughout the paper.

  • Type I error $P_0$: The probability of rejecting $H_0$ when $H_0$ is true.

  • Type II error $P_1$: The probability of failing to reject $H_0$ when $H_1$ is true.

  • Testing power: The probability of rejecting $H_0$ when $H_1$ is true. In other words, $\text{Testing power} = 1 - P_1$.

Testing power and Type II error are used interchangeably in the methodology and experiment sections (Sections 5 and 6).
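
In simulation studies these metrics are typically estimated as rejection frequencies over repeated runs. A minimal sketch, assuming the reject/retain decisions of a test have already been recorded as booleans (the function names and the example counts are hypothetical, not results from this paper):

```python
from typing import Dict, Sequence


def rejection_rate(decisions: Sequence[bool]) -> float:
    """Fraction of runs in which H0 was rejected."""
    if len(decisions) == 0:
        raise ValueError("need at least one run")
    return sum(bool(d) for d in decisions) / len(decisions)


def summarize(rejections_under_h0: Sequence[bool],
              rejections_under_h1: Sequence[bool]) -> Dict[str, float]:
    """Empirical Type I error, testing power, and Type II error."""
    power = rejection_rate(rejections_under_h1)
    return {
        "type_i_error": rejection_rate(rejections_under_h0),  # rejections when H0 holds
        "power": power,                                       # rejections when H1 holds
        "type_ii_error": 1.0 - power,                         # complement of the power
    }


# Made-up decision records: 100 runs simulated under each hypothesis.
summary = summarize([True] * 5 + [False] * 95, [True] * 80 + [False] * 20)
print(summary)
```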

3.4 Attributes of an active two-sample test

As established in the sequential two-sample testing literature (Johari et al., 2022; Wald, 1992; Lhéritier & Cazals, 2018; Shekhar & Ramdas, 2021; Welch, 1990), a conventional procedure for sequential two-sample testing is to compute a $p$-value from sequentially observed samples and compare it to a pre-defined significance level $\alpha \in [0, 1]$ at any time. The analyst rejects $H_0$ and stops the testing if $p \leq \alpha$. For more details, see (Wasserstein & Lazar, 2016). In addition, as the test proposed in what follows is endowed with active querying to reduce the number of label queries, the active sequential test is anticipated to spend fewer labels than a passive (random-query) test to reject $H_0$ when $H_1$ is true. In summary, an active sequential two-sample test has the following four attributes:

  • The test generates an anytime-valid $p$-value such that $P_0(p \leq \alpha) \leq \alpha$ holds at any time of the sequential testing process. Here $P_0(p \leq \alpha)$ is exactly the Type I error, so the Type I error is upper-bounded by $\alpha$.

  • The test has a high testing power $P_1(p \leq \alpha)$.

  • The test is consistent such that $P_1(p \leq \alpha) = 1$ under $H_1$ when the test sample size goes to infinity.

  • The test has higher testing power $P_1(p \leq \alpha)$ than the passive test given the same label budget.

4 A Sequential Two-Sample Testing Statistic

We follow the well-known likelihood ratio test (Wilks, 1938) to construct a sequential testing statistic. We use the statistical models that characterize the label generation processes conditional on the observed sample features under $H_0$ and $H_1$. More precisely, under $H_0$, we have $P(z \mid \mathbf{s}) = P(z), \forall \mathbf{s} \in \mathcal{S}$; that is, when $\mathbf{S}$ and $Z$ are independent, the posterior probability $P(z \mid \mathbf{s})$ is the same for any $\mathbf{s}$ in the support $\mathcal{S}$ of $p(\mathbf{s})$. In contrast, under $H_1$, we have the following statistical model: $\exists \mathbf{s} \in \mathcal{S}, P(z \mid \mathbf{s}) \neq P(z)$.
We sequentially collect sample data $(\mathbf{s}, z)$, and when a new observation $(\mathbf{s}_n, z_n)$ arrives, we construct a likelihood ratio $w_n$: With $w_0 = 1$, $w_n = w_{n-1} \frac{P(z_n)}{P(z_n \mid \mathbf{s}_n)} = \prod_{i=1}^{n} \frac{P(z_i)}{P(z_i \mid \mathbf{s}_i)}, n \geq 1$, to assess $H_0$ against $H_1$.
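
The recursion for $w_n$ can be illustrated with a small simulation in which the models $P(z)$ and $P(z \mid \mathbf{s})$ are known. This is a sketch under assumptions chosen only for illustration: a one-dimensional standard Gaussian feature, a logistic posterior with a made-up slope, and a class prior of 0.5 (which holds for this toy model by symmetry). The computation is carried out on the log scale to avoid numerical underflow.

```python
import math
import random


def oracle_log_ratio(features, labels, prior_one, posterior_one):
    """Return [log w_1, ..., log w_N] with w_n = prod_i P(z_i) / P(z_i | s_i).

    prior_one is P(Z=1); posterior_one(s) returns P(Z=1 | s).
    """
    log_w = 0.0  # log w_0 = log 1
    path = []
    for s, z in zip(features, labels):
        p_prior = prior_one if z == 1 else 1.0 - prior_one
        p_one = posterior_one(s)
        p_post = p_one if z == 1 else 1.0 - p_one
        log_w += math.log(p_prior) - math.log(p_post)  # multiply w_{n-1} by the new ratio
        path.append(log_w)
    return path


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy data generated under H1: the label depends on the feature.
rng = random.Random(0)
slope = 2.0  # hypothetical dependency strength
features = [rng.gauss(0.0, 1.0) for _ in range(200)]
labels = [1 if rng.random() < sigmoid(slope * s) else 0 for s in features]

path = oracle_log_ratio(features, labels, 0.5, lambda s: sigmoid(slope * s))
print(path[-1])
```

When the posterior equals the prior for every feature, each factor is 1 and $w_n$ stays at 1; when labels depend on features, the factors are smaller than 1 on average and $w_n$ tends to shrink.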

The statistical models $P(z)$ and $P(z \mid \mathbf{s})$ are unknown. To formulate our two-sample test, we will use a likelihood estimate $\hat{P}(z^{n})$ that is maximized over all the class priors to replace $P(z^{n})$, the product of the class priors. In addition, we build a class-probability predictor $Q_n(z \mid \mathbf{s})$ with the past observed sample sequence $(\mathbf{s}, z)^{n-1}$ to model $P(z_n \mid \mathbf{s}_n)$, the posterior probability of $z_n$ given the newly observed $\mathbf{s}_n$; any probabilistic classifier, such as a neural network or a logistic function, can be used to build $Q_n(z \mid \mathbf{s})$.
Additionally, $Q_1(z \mid \mathbf{s})$ indicates an initialized class-probability predictor. (Footnote 1: It is possible to set $Q_1(z \mid \mathbf{s})$ as a random-guess class-probability predictor, and then sequentially gather $(\mathbf{s}, z)$ for training; however, this would hurt the testing power. As suggested by Duan et al. (2022); Lhéritier & Cazals (2018), we initialize $Q_1(z \mid \mathbf{s})$ with a small set of randomly labeled samples and start the sequential testing after that.) We formally present our sequential testing statistic in the following:

A sequential two-sample testing statistic: Suppose $(\mathbf{s}, z)$ is sequentially observed. As a new $(\mathbf{s}_n, z_n)$ arrives, for $n = 1, 2, \cdots$, an analyst constructs
$$w_n = \frac{\hat{P}(z^{n})}{Q(z^{n} \mid \mathbf{s}^{n})} = \prod_{i=1}^{n} \frac{\hat{P}(z_i)}{Q_i(z_i \mid \mathbf{s}_i)} \tag{2}$$
where $\hat{P}(Z=1) = \frac{\sum_{i=1}^{n} z_i}{n}$ is a class prior chosen to maximize $\hat{P}(z^{n})$ and $Q_i(z_i \mid \mathbf{s}_i)$ is the output of a class-probability predictor built from the past observed sequence $(\mathbf{s}, z)^{i-1}$.

We accordingly use $W_n$ to denote the random variable of which $w_n$ is a realization. Our test statistic in equation 2 generalizes the test statistic proposed in (Lhéritier & Cazals, 2018). In contrast to that work, our test statistic does not require the class prior to be known. The analyst compares $w_n$ with $\alpha$ at every step $n$, starting from $n = 1$, and stops the test at the first step with $w_n \leq \alpha$. As a result, a small $w_n$ is favorable under $H_1$, since it leads to rejecting $H_0$ and thus to higher testing power.
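As a minimal sketch of how the statistic in equation 2 could be evaluated numerically, the Python snippet below works in log space to avoid underflow in the product. The names `log_statistic` and `reject`, and the input layout, are illustrative assumptions rather than part of the original method; `q_probs[i]` stands for $Q_i(z_i \mid \mathbf{s}_i)$, the probability the predictor assigned to the label that was actually observed, and is assumed to be strictly positive.

```python
import math

def log_statistic(labels, q_probs):
    """Return log(w_n) for binary labels and predictor probabilities.

    labels  : list of 0/1 labels z_1..z_n
    q_probs : list where q_probs[i] = Q_i(z_i | s_i), each in (0, 1]
    """
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    # Maximized log-likelihood under the plug-in prior P_hat(Z=1) = n1 / n.
    # Skipping empty classes implements the convention 0 * log(0) = 0.
    log_null = 0.0
    for count in (n0, n1):
        if count > 0:
            log_null += count * math.log(count / n)
    # Log-likelihood assigned by the sequentially built predictor.
    log_alt = sum(math.log(q) for q in q_probs)
    return log_null - log_alt

def reject(labels, q_probs, alpha=0.05):
    # Reject H0 once w_n <= alpha, i.e. log(w_n) <= log(alpha).
    return log_statistic(labels, q_probs) <= math.log(alpha)
```

Because $\hat{P}$ is re-estimated from all $n$ labels at every step, the sketch recomputes the numerator from the label counts rather than updating a running product.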

Algorithm 1 Bimodal Query Based Active Sequential Two-Sample Testing (BQ-AST)
Input: $\mathcal{S}_u, \mathcal{A}, N_0, N_q, \alpha$
Output: Reject or fail to reject $H_0$
Initialization: Initialize $Q_1(z \mid \mathbf{s})$ using $\mathcal{A}$ with $N_0$ features uniformly sampled from $\mathcal{S}_u$ without replacement and then labeled.
Active sequential testing:
for $n = 1$ to $N_q - N_0$ do
    Sample a feature $\mathbf{s}_n = \mathbf{s}_{q_0}$ or $\mathbf{s}_{q_1}$ with fair chance, where $\mathbf{s}_{q_0} = \arg\max_{\mathbf{s} \in \mathcal{S}_u} Q_n(Z=0 \mid \mathbf{s})$ and $\mathbf{s}_{q_1} = \arg\max_{\mathbf{s} \in \mathcal{S}_u} Q_n(Z=1 \mid \mathbf{s})$
    Query the label $z_n$ of $\mathbf{s}_n$
    Update $w_n$ in equation 3 with $(\mathbf{s}_n, z_n)$ and $Q_n(z_n \mid \mathbf{s}_n)$
    if $w_n \leq \alpha$ then
        Return Reject $H_0$
    else
        Update $Q_n(z \mid \mathbf{s})$ with the newly queried $(\mathbf{s}_n, z_n)$ and past training examples.
    end if
end for
Return Retain $H_0$

5 Active Sequential Two-Sample Testing

This section introduces the active sequential two-sample testing framework and its instantiation. We show that the framework produces an anytime-valid $p$-value regardless of the selected query scheme. We also characterize the asymptotic and finite-sample performance of the framework, measuring the gain in testing power by the change in the mutual information between the feature and label variables.

5.1 An active sequential two-sample testing framework

A flow chart of the proposed framework is shown in Figure 1. Our framework starts by initializing the class-probability predictor $Q_n(z \mid \mathbf{s})$ at $n = 1$ with a small set of sample features randomly selected from $\mathcal{S}_u$ and then labeled. Then, the framework enters the sequential testing stage, which iteratively performs the following: selects features in $\mathcal{S}_u$ that $Q_n$ predicts to have a high dependency on their labels, updates the statistic $w_n$, decides whether $H_0$ can be rejected, and updates $Q_n$ if the test has not stopped. We formally introduce our active sequential two-sample testing framework as follows.

An active sequential two-sample testing framework: Suppose $N_q$ is a label budget and $\alpha$ is a significance level. An analyst uses the proposed framework to sequentially and actively query the label $z_n$ of $\mathbf{s}_n$ from an unlabelled feature set $\mathcal{S}_u$ based on the predictions of $Q_n(z \mid \mathbf{s})$. Each time a new label $z_n$ of $\mathbf{s}_n$ is queried, the analyst constructs the following statistic

$$w_n = \prod_{i=1}^{n} \frac{\hat{P}(z_i)}{Q_i(z_i \mid \mathbf{s}_i)}. \tag{3}$$

The analyst evaluates $w_n$ and makes one of the following decisions: (1) rejects $H_0$ if $w_n \leq \alpha$; (2) retains $H_0$ if the label budget $N_q$ is exhausted and (1) is not satisfied; and (3) continues the test and updates $Q_n$ to $Q_{n+1}$ if neither (1) nor (2) is satisfied.

Framework instantiation: We provide a framework instantiation called bimodal query based active sequential two-sample testing (BQ-AST), described in Algorithm 1. The algorithm takes the following inputs: an unlabelled feature set $\mathcal{S}_u$, a probabilistic classification algorithm $\mathcal{A}$, the size $N_0$ of the initialization set used for $\mathcal{A}$, a label budget $N_q$, and a significance level $\alpha$. The algorithm first initializes a class-probability predictor $Q$ using $\mathcal{A}$ with a small set of randomly selected samples whose labels are then queried. In the sequential testing stage, the algorithm uses the bimodal query from Li et al. (2022) to select $\mathbf{s}_n$ from $\mathcal{S}_u$ as a sample with the highest posterior for either class (e.g., with a fair chance of selecting the sample with the highest $Q_n(Z=0 \mid \mathbf{s})$ or the highest $Q_n(Z=1 \mid \mathbf{s})$), queries its label $z_n$, and updates the statistic $w_n$.
Next, the algorithm compares $w_n$ with $\alpha$; if $H_0$ is not rejected, it updates $Q_n$ with $(\mathbf{s}_n, z_n)$ and returns to the label-query step. The algorithm rejects $H_0$ if $w_n \leq \alpha$, or fails to reject $H_0$ if the label budget is exhausted.

The label budget $N_q$ in Algorithm 1 covers the labels used both for initializing $Q_1(z \mid \mathbf{s})$ and for constructing the statistic $w_n$. In the remainder of this section, we simply use $N_q$ to denote the label budget available after the initialization.
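The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under several assumptions that are not taken from the original work: features are one-dimensional floats, the classification algorithm $\mathcal{A}$ is replaced by a Laplace-smoothed $k$-nearest-neighbor estimate (`knn_prob_one`), the label oracle is a callable indexed by pool position, and a queried feature is removed from the unlabeled pool. The plug-in prior $\hat{P}$ is computed from the labels queried in the testing stage only.

```python
import math
import random

def knn_prob_one(s, labeled, k=5):
    """Smoothed k-NN estimate of Q(Z=1 | s) from (feature, label) pairs."""
    nearest = sorted(labeled, key=lambda pair: abs(pair[0] - s))[:k]
    ones = sum(z for _, z in nearest)
    # Laplace smoothing keeps the estimate strictly inside (0, 1),
    # so the statistic never divides by zero.
    return (ones + 1.0) / (len(nearest) + 2.0)

def log_w(labels, q_probs):
    """log of the statistic in equation 3 with the plug-in class prior."""
    n, n1 = len(labels), sum(labels)
    log_null = sum(c * math.log(c / n) for c in (n1, n - n1) if c > 0)
    return log_null - sum(math.log(q) for q in q_probs)

def bq_ast(pool, oracle, n_init, n_budget, alpha, rng):
    """Return True if H0 is rejected within the label budget, else False."""
    init = set(rng.sample(range(len(pool)), n_init))  # uniform, no replacement
    labeled = [(pool[i], oracle(i)) for i in init]
    unlabeled = [i for i in range(len(pool)) if i not in init]
    labels, q_probs = [], []
    for _ in range(n_budget - n_init):
        if not unlabeled:
            break
        probs = {i: knn_prob_one(pool[i], labeled) for i in unlabeled}
        if rng.random() < 0.5:                    # fair coin between the modes
            pick = min(unlabeled, key=probs.get)  # highest Q_n(Z=0 | s)
        else:
            pick = max(unlabeled, key=probs.get)  # highest Q_n(Z=1 | s)
        z = oracle(pick)                          # query the label
        labels.append(z)
        q_probs.append(probs[pick] if z == 1 else 1.0 - probs[pick])
        if log_w(labels, q_probs) <= math.log(alpha):
            return True                           # reject H0
        labeled.append((pool[pick], z))           # update the predictor
        unlabeled.remove(pick)
    return False                                  # retain H0
```

A call might look like `bq_ast(pool, lambda i: truth[i], 10, 60, 0.05, random.Random(0))`, where `truth` is a hypothetical list holding the hidden labels. Note that the probability recorded in `q_probs` is computed before the queried pair is added to the training set, which matches the requirement that $Q_n$ depends only on past observations.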

5.2 The proposed framework results in an anytime-valid $p$-value

Our framework rejects $H_0$ if the statistic $w_n \leq \alpha$. The following theorem states that, under $H_0$, $w_n$ is an anytime-valid $p$-value.

Theorem 5.1.

If an analyst uses the proposed framework to sequentially query the oracle for $Z$ with $\mathbf{S} \in \mathcal{S}_u$, resulting in $(\mathbf{S}, Z)^n$, then the following holds under $H_0$:

$$P_0\left(\exists n \in [N_q], \; W_n = \prod_{i=1}^{n} \frac{\hat{P}(Z_i)}{Q_i(Z_i \mid \mathbf{S}_i)} \leq \alpha\right) \leq \alpha \tag{4}$$

where $N_q$ is a label budget and $\alpha$ is the pre-specified significance level.

Theorem 5.1 implies that the probability $P_0$ that our framework mistakenly rejects $H_0$ (the Type I error) is upper-bounded by $\alpha$. Briefly, we prove this by observing that the sequence $\left(\frac{1}{W_1}, \cdots, \frac{1}{W_n}\right)$ is upper-bounded by a martingale, which allows us to apply Ville's maximal inequality (Durrett, 2019; Doob, 1939) to establish Theorem 5.1. See the Appendix for the complete proof.
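The guarantee in Theorem 5.1 can be checked empirically with a small simulation. The sketch below is only an illustration under assumed settings that do not come from the original experiments: features are standard Gaussian draws observed passively, labels are fair coin flips independent of the features (so $H_0$ holds), and the predictor is a smoothed per-region label frequency that uses past pairs only. The function names are hypothetical.

```python
import math
import random

def crosses_threshold(n_budget, alpha, rng):
    """Simulate one run under H0; report whether w_n <= alpha at any step."""
    n1 = 0
    log_alt = 0.0                          # running sum of log Q_i(z_i | s_i)
    ones = {True: 1.0, False: 1.0}         # smoothed count of label 1 per region
    total = {True: 2.0, False: 2.0}        # smoothed count of labels per region
    for n in range(1, n_budget + 1):
        s = rng.gauss(0.0, 1.0)
        z = 1 if rng.random() < 0.5 else 0  # H0: label independent of feature
        region = s > 0.0
        q1 = ones[region] / total[region]   # built from past pairs only
        log_alt += math.log(q1 if z == 1 else 1.0 - q1)
        n1 += z
        log_null = sum(c * math.log(c / n) for c in (n1, n - n1) if c > 0)
        if log_null - log_alt <= math.log(alpha):
            return True
        ones[region] += z                   # update the predictor afterwards
        total[region] += 1.0
    return False

def empirical_type_one_error(trials=2000, n_budget=100, alpha=0.05, seed=0):
    """Fraction of simulated H0 runs that falsely reject within the budget."""
    rng = random.Random(seed)
    hits = sum(crosses_threshold(n_budget, alpha, rng) for _ in range(trials))
    return hits / trials
```

Up to Monte Carlo noise, the returned fraction should not exceed $\alpha$, even though the statistic is inspected after every label rather than once at the end.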

5.3 Asymptotic properties of the proposed framework

This section provides the theoretical conditions under which the proposed framework asymptotically generates the smallest normalized statistic (a normalization of the statistic in equation 3) or, equivalently, maximally increases the mutual information between $\mathbf{S}$ and $Z$. We first define the consistent bimodal query as follows.

Definition 5.2.

(Consistent bimodal query) Let $\mathcal{S}$ be the support of $p(\mathbf{s})$, from which sample features are collected and added to an unlabeled set $\mathcal{S}_u$, and let $P(z \mid \mathbf{s})$ denote the posterior probability of $z$ given $\mathbf{s} \in \mathcal{S}$. An analyst adopts a label query scheme that, for every $n \in [N_q]$, queries the label $Z_n$ of $\mathbf{S}_n \in \mathcal{S}_u$ such that $\mathbf{S}_n$ admits a probability density function (PDF) $p_n(\mathbf{s})$. The label query scheme is a consistent bimodal query if $\lim_{n \to \infty} p_n(\mathbf{s}) = p^*(\mathbf{s})$ where

$$p^*(\mathbf{s}) = 0, \; \forall \mathbf{s} \in \mathcal{S} \setminus \left(\mathcal{S}_{q_0} \cup \mathcal{S}_{q_1}\right), \text{ and } p^*(\mathbf{s}) > 0, \; \forall \mathbf{s} \in \mathcal{S}_{q_0} \cup \mathcal{S}_{q_1}, \tag{5}$$

$$\mathcal{S}_{q_0} = \left\{ \mathbf{s}_{q_0} \,\middle|\, P\left(Z=0 \mid \mathbf{s}_{q_0}\right) = \max_{\mathbf{s} \in \mathcal{S}} P\left(Z=0 \mid \mathbf{s}\right) \right\}, \tag{6}$$

$$\mathcal{S}_{q_1} = \left\{ \mathbf{s}_{q_1} \,\middle|\, P\left(Z=1 \mid \mathbf{s}_{q_1}\right) = \max_{\mathbf{s} \in \mathcal{S}} P\left(Z=1 \mid \mathbf{s}\right) \right\}. \tag{7}$$

Remark 5.3.

Definition 5.2 considers a label query scheme that, as $n$ goes to infinity, only queries the labels of $\mathbf{s}$ with the highest $P(Z=0 \mid \mathbf{s})$ and $P(Z=1 \mid \mathbf{s})$. As $P(z \mid \mathbf{s})$ is not directly available, to construct a consistent bimodal query one can use nonparametric regressors to build a class-probability predictor $Q(z \mid \mathbf{s})$ as a nonparametric estimate of $P(z \mid \mathbf{s}), \forall \mathbf{s} \in \mathcal{S}$, and, after $Q(z \mid \mathbf{s})$ converges to $P(z \mid \mathbf{s})$, implement the bimodal query to label the $\mathbf{s}$ with the highest $Q(Z=0 \mid \mathbf{s})$ or the highest $Q(Z=1 \mid \mathbf{s})$. Györfi et al. (2002) prove that when $Q(z \mid \mathbf{s})$ is a kernel, KNN, or partition estimate with proper smoothing parameters (e.g., the bandwidth for the kernel) and labels are sufficiently revealed in the proximity of every $\mathbf{s} \in \mathcal{S}$, then $Q(z \mid \mathbf{s})$ converges to $P(z \mid \mathbf{s})$.
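To make the remark concrete, the following sketch shows one way a kernel estimate of $P(Z=1 \mid \mathbf{s})$ could drive the bimodal query. It assumes one-dimensional features, a Gaussian kernel, and a fixed bandwidth; the function names and the fallback value of 0.5 for regions with no nearby labels are illustrative choices, not details from Györfi et al. (2002) or from the proposed framework.

```python
import math

def kernel_posterior(s, features, labels, bandwidth=0.5):
    """Nadaraya-Watson estimate of P(Z=1 | s) with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for x in features]
    total = sum(weights)
    if total == 0.0:
        # No labeled feature close enough to carry weight:
        # fall back to an uninformative estimate.
        return 0.5
    return sum(w * z for w, z in zip(weights, labels)) / total

def bimodal_candidates(pool, features, labels, bandwidth=0.5):
    """Return the pool points with the highest estimated Q(Z=0|s) and Q(Z=1|s)."""
    scored = [(kernel_posterior(s, features, labels, bandwidth), s) for s in pool]
    s_q0 = min(scored)[1]   # lowest Q(Z=1 | s) is the highest Q(Z=0 | s)
    s_q1 = max(scored)[1]   # highest Q(Z=1 | s)
    return s_q0, s_q1
```

The bandwidth plays the role of the smoothing parameter mentioned above: if it is too large the estimate flattens toward the class prior, and if it is too small the estimate is driven by very few labels.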

We now introduce the asymptotic property of our framework. We consider normalizing the test statistic in equation 3 as follows,

$$\overline{W}_n = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{P}(Z_i)}{Q_i\left(Z_i \mid \mathbf{S}_i\right)}, \quad \left(\mathbf{S}_i, Z_i\right) \sim p_i(\mathbf{s}, z) = P(z \mid \mathbf{s}) \, p_i(\mathbf{s}) \tag{8}$$

where $(\mathbf{S}_i, Z_i)$ denotes a feature-label pair returned by a label query scheme when querying the $i$-th label. Next, we state the following theorem.

Theorem 5.4.

Let $\mathcal{S}$ be the support of $p(\mathbf{s})$, from which sample features are collected and added to an unlabeled set $\mathcal{S}_u$, and let $P(z \mid \mathbf{s})$ denote the posterior probability of $z$ given $\mathbf{s} \in \mathcal{S}$. There exists a consistent bimodal query scheme such that, when an analyst uses it in the proposed active sequential framework, under $H_1$, $\overline{W}_n$ converges to the negation of mutual information (MI), and the converged negated MI lower-bounds the negated MI generated by any $p(\mathbf{s})$ subject to $P(z \mid \mathbf{s}), \forall \mathbf{s} \in \mathcal{S}$. Precisely, there exists a consistent bimodal query leading to the following

$$\lim_{n \to \infty} \overline{W}_n = -\left(H^*(Z) - H^*\left(Z \mid \mathbf{S}\right)\right) = -I^*\left(\mathbf{S}; Z\right) \leq -I\left(\mathbf{S}; Z\right). \tag{9}$$

$I^*(\mathbf{S}; Z)$ is the MI constructed with $(\mathbf{S}, Z) \sim p^*(\mathbf{s}, z) = P(z \mid \mathbf{s}) \, p^*(\mathbf{s})$ (see equation 5 for $p^*(\mathbf{s})$); $I(\mathbf{S}; Z)$ is the MI constructed with $(\mathbf{S}, Z) \sim p(\mathbf{s}, z) = P(z \mid \mathbf{s}) \, p(\mathbf{s})$.

Recall that the null $H_0$ is rejected when the test statistic $w_n$ in equation 3 is no larger than $\alpha$. Hence, the proposed framework, when used with a consistent bimodal query that asymptotically minimizes the normalized $w_n$ in equation 3, favorably increases the testing power when $|\mathcal{S}_u|$ is large and $Q(z \mid \mathbf{s})$ is close to $P(z \mid \mathbf{s})$. In Section 5.4, we analyze the finite-sample performance of the proposed framework, taking into account the approximation error of $Q(z \mid \mathbf{s})$. Additionally, by characterizing the difficulty of a two-sample testing problem with MI, Theorem 5.4 suggests that the proposed framework asymptotically turns the original hard two-sample testing problem, with low dependency between $\mathbf{S}$ and $Z$ (low MI), into a simple one by increasing the dependency between $\mathbf{S}$ and $Z$ (high MI).
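A small numerical example may help illustrate the inequality $I^*(\mathbf{S}; Z) \geq I(\mathbf{S}; Z)$ in equation 9. The sketch below assumes a hypothetical discrete feature with four support points and made-up posteriors; it compares the MI under a uniform $p(\mathbf{s})$ with the MI under a bimodal $p^*(\mathbf{s})$ that puts equal mass on the two points with the most extreme posteriors. None of these numbers come from the original experiments.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, with 0 * log(0) = 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def mutual_information(feature_probs, posteriors):
    """I(S; Z) = H(Z) - H(Z | S) for a discrete feature S and binary Z.

    feature_probs[j] = p(s_j), posteriors[j] = P(Z=1 | s_j).
    """
    prior_one = sum(p * post for p, post in zip(feature_probs, posteriors))
    cond_entropy = sum(p * binary_entropy(post)
                       for p, post in zip(feature_probs, posteriors))
    return binary_entropy(prior_one) - cond_entropy

posteriors = [0.1, 0.4, 0.6, 0.9]      # hypothetical P(Z=1 | s_j)
uniform = [0.25, 0.25, 0.25, 0.25]     # original feature distribution p(s)
bimodal = [0.5, 0.0, 0.0, 0.5]         # p*(s): mass only on the two extremes

mi_uniform = mutual_information(uniform, posteriors)   # about 0.19 nats
mi_bimodal = mutual_information(bimodal, posteriors)   # about 0.37 nats
```

In this toy case the class entropy $H(Z)$ is the same under both feature distributions, so the gain in MI comes entirely from the lower conditional entropy $H(Z \mid \mathbf{S})$ at the two extreme points.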

Remark 5.5.

Our testing framework is also consistent under $H_1$ and the same conditions as Theorem 5.4, since $\lim_{n \to \infty} P_1\left(\prod_{i=1}^{n} \frac{\hat{P}(Z_i)}{Q_i(Z_i \mid \mathbf{S}_i)} \leq \alpha\right) = \lim_{n \to \infty} P_1\left(\overline{W}_n \leq \frac{1}{n} \log(\alpha)\right) = P_1\left(-I^*\left(\mathbf{S}; Z\right) \leq 0\right) = 1$. The last equality holds because $I^*(\mathbf{S}; Z) > 0$ under $H_1$.

5.4 Finite-sample analysis for the proposed framework

This section analyzes the testing power of the proposed framework in the finite-sample case. Section 5.4.1 and Section 5.4.2 offer metrics that assess the approximation error of $Q(z \mid \mathbf{s})$ and an irreducible Type II error. These metrics together determine the finite-sample testing power. Furthermore, Section 5.4.3 presents an illustrative example of using our framework. In Section 5.4.4, we conduct a finite-sample analysis for the example, incorporating both the metrics that characterize the approximation error and the irreducible Type II error.

5.4.1 Characterizing the approximation error of $Q(z \mid \mathbf{s})$

Because our framework constructs the test statistic in equation 2 with the approximation $Q(z \mid \mathbf{s})$, our finite-sample analysis needs a metric for assessing the approximation error of $Q(z \mid \mathbf{s})$. To this end, we introduce the $\text{KL}^2$-divergence.

Definition 5.6.

(KL2superscriptKL2\text{KL}^{2}KL start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT-divergence) Let p0subscript𝑝0p_{0}italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and q0subscript𝑞0q_{0}italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT be two probability density functions on the same support 𝒳𝒳\mathcal{X}caligraphic_X. Let f(t)=log2(t)𝑓𝑡superscript2𝑡f(t)=\log^{2}(t)italic_f ( italic_t ) = roman_log start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_t ). Then, the KL2superscriptKL2\text{KL}^{2}KL start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT-divergence between p0subscript𝑝0p_{0}italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and q0subscript𝑞0q_{0}italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is

DKL2(q0p0)=𝔼𝐗p0(𝐱)[f(q0(𝐗)p0(𝐗))]=𝔼𝐗p0(𝐱)[log2(q0(𝐗)p0(𝐗))].subscript𝐷superscriptKL2conditionalsubscript𝑞0subscript𝑝0subscript𝔼similar-to𝐗subscript𝑝0𝐱delimited-[]𝑓subscript𝑞0𝐗subscript𝑝0𝐗subscript𝔼similar-to𝐗subscript𝑝0𝐱delimited-[]superscript2subscript𝑞0𝐗subscript𝑝0𝐗\displaystyle D_{\text{KL}^{2}}\left(q_{0}\|p_{0}\right)=\mathbb{E}_{\mathbf{X% }\sim p_{0}(\mathbf{x})}\left[f\left(\frac{q_{0}(\mathbf{X})}{p_{0}(\mathbf{X}% )}\right)\right]=\mathbb{E}_{\mathbf{X}\sim p_{0}(\mathbf{x})}\left[\log^{2}{% \left(\frac{q_{0}(\mathbf{X})}{p_{0}(\mathbf{X})}\right)}\right].italic_D start_POSTSUBSCRIPT KL start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = blackboard_E start_POSTSUBSCRIPT bold_X ∼ italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_x ) end_POSTSUBSCRIPT [ italic_f ( divide start_ARG italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_X ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_X ) end_ARG ) ] = blackboard_E start_POSTSUBSCRIPT bold_X ∼ italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_x ) end_POSTSUBSCRIPT [ roman_log start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( divide start_ARG italic_q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_X ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_X ) end_ARG ) ] . (10)

$D_{\text{KL}^{2}}\left(q_{0}\|p_{0}\right)$ is the second moment of the log-likelihood ratio and has been used (see, e.g., (3.1.14) in (Koga et al., 2002)) to understand the behavior of the distribution of $\log\left(\frac{q_{0}(\mathbf{x})}{p_{0}(\mathbf{x})}\right)$. We use $D_{\text{KL}^{2}}\left(q_{0}\|p_{0}\right)$ to evaluate the distance between $p(\mathbf{s},z)=P(z\mid\mathbf{s})\,p(\mathbf{s})$ and $q(\mathbf{s},z)=Q(z\mid\mathbf{s})\,p(\mathbf{s})$, which yields the following:

$$D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)=\mathbb{E}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\left[\log^{2}\left(\frac{q(\mathbf{S},Z)}{p(\mathbf{S},Z)}\right)\right]=\mathbb{E}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\left[\log^{2}\left(\frac{Q(Z\mid\mathbf{S})}{P(Z\mid\mathbf{S})}\right)\right]. \tag{11}$$

Notably, $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ in equation 11 also characterizes the discrepancy between $P(z\mid\mathbf{s})$ and $Q(z\mid\mathbf{s})$ by averaging their squared log-ratio over $\mathcal{S}$; in our main result, we will see that the testing power of the proposed framework depends on $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$. Additionally, $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ is closely related to the standard KL divergence $D_{\text{KL}}\left(P(z\mid\mathbf{s})\|Q(z\mid\mathbf{s})\right)=\mathbb{E}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\left[\log\frac{P(Z\mid\mathbf{S})}{Q(Z\mid\mathbf{S})}\right]$. This can be seen by expanding equation 11 using the formula $\text{Var}[X]=\mathbb{E}[X^{2}]-\mathbb{E}^{2}[X]$, resulting in

$$D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)=\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\left[\log\left(\frac{P(Z\mid\mathbf{S})}{Q(Z\mid\mathbf{S})}\right)\right]+\left[D_{\text{KL}}\left(P(z\mid\mathbf{s})\|Q(z\mid\mathbf{s})\right)\right]^{2}. \tag{12}$$

Equation 12 implies that $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ not only measures the expected distance between $P(z\mid\mathbf{s})$ and $Q(z\mid\mathbf{s})$ over $\mathcal{S}$ but also the variance of that distance. Similarly, we write

$$D_{\text{KL}^{2}}\left(p(\mathbf{s},z)\|q(\mathbf{s},z)\right)=\mathbb{E}_{(\mathbf{S},Z)\sim q(\mathbf{s},z)}\left[\log^{2}\left(\frac{P(Z\mid\mathbf{S})}{Q(Z\mid\mathbf{S})}\right)\right] \tag{13}$$

to measure the discrepancy between $p(\mathbf{s},z)$ and $q(\mathbf{s},z)$, but in the direction opposite to $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$.

$D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ and $D_{\text{KL}^{2}}\left(p(\mathbf{s},z)\|q(\mathbf{s},z)\right)$ both characterize the approximation error of $Q(z\mid\mathbf{s})$, and we will see in Section 5.4.4 that they jointly determine the testing power of the proposed framework.
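As a numerical sanity check, the following minimal sketch evaluates both directions of the $\text{KL}^{2}$-divergence on a toy discrete feature space and compares the second moment of the log-ratio with its variance plus squared mean, as implied by $\text{Var}[X]=\mathbb{E}[X^{2}]-\mathbb{E}^{2}[X]$. The three-point feature space and all probability tables are hypothetical values chosen for illustration; they are not taken from our experiments.

```python
import numpy as np

# Hypothetical toy setup: 3 feature values s, binary label z.
p_s = np.array([0.5, 0.3, 0.2])                      # p(s)
P = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])   # true P(z | s)
Q = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])   # model Q(z | s)

p_joint = p_s[:, None] * P   # p(s, z) = P(z | s) p(s)
q_joint = p_s[:, None] * Q   # q(s, z) = Q(z | s) p(s)

# log P(z|s) / Q(z|s); the shared marginal p(s) cancels in the joint ratio.
log_ratio = np.log(P / Q)

# Second moments of the log-likelihood ratio under p and under q.
kl2_q_p = float(np.sum(p_joint * log_ratio**2))   # D_KL2(q || p), expectation under p
kl2_p_q = float(np.sum(q_joint * log_ratio**2))   # D_KL2(p || q), expectation under q

# Mean (the usual KL divergence) and variance of the log-ratio under p(s, z).
kl_p_q = float(np.sum(p_joint * log_ratio))
var_p = float(np.sum(p_joint * (log_ratio - kl_p_q) ** 2))

# The two printed numbers agree up to floating-point error.
print(kl2_q_p, var_p + kl_p_q**2)
print(kl2_p_q)
```

Because the log-ratio is squared, swapping numerator and denominator does not change the integrand; the two directions differ only in the distribution under which the expectation is taken.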

5.4.2 Characterizing the factor that leads to the irreducible Type II error in finite-sample case

We also introduce another factor influencing testing power, which persists even in the absence of approximation error, i.e., when $Q(z\mid\mathbf{s})=P(z\mid\mathbf{s})$. To see this, we recall the information spectrum introduced in (Han & Verdú, 1993).

Definition 5.7.

(Information spectrum (Han & Verdú, 1993)) Let $(\mathbf{X},\mathbf{Y})$ be a pair of random variables over the support $\mathcal{X}\times\mathcal{Y}$. Let $p_{\mathbf{XY}}$ denote the joint distribution of $(\mathbf{X},\mathbf{Y})$, and let $p_{\mathbf{X}}$ and $p_{\mathbf{Y}}$ denote the marginal distributions of $\mathbf{X}$ and $\mathbf{Y}$. Suppose $\{(\mathbf{X}_{i},\mathbf{Y}_{i})\}_{i=1}^{n}$ is a sequence of i.i.d. random variables with $(\mathbf{X},\mathbf{Y})\sim p_{\mathbf{XY}}(\mathbf{x},\mathbf{y})$. Then, the information spectrum is the probability distribution of the following random variable:

$$\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})=\frac{1}{n}\sum_{i=1}^{n}\log\frac{p_{\mathbf{XY}}(\mathbf{X}_{i},\mathbf{Y}_{i})}{p_{\mathbf{X}}(\mathbf{X}_{i})\,p_{\mathbf{Y}}(\mathbf{Y}_{i})},\quad(\mathbf{X},\mathbf{Y})\sim p_{\mathbf{XY}}(\mathbf{x},\mathbf{y}). \tag{14}$$

It is easy to see that the expectation of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$ is the mutual information $I(\mathbf{X};\mathbf{Y})$ for $(\mathbf{X},\mathbf{Y})\sim p_{\mathbf{XY}}(\mathbf{x},\mathbf{y})$. Substituting the feature-label variable pair $(\mathbf{S},Z)\sim p(\mathbf{s},z)$ of our two-sample testing problem for $(\mathbf{X},\mathbf{Y})\sim p_{\mathbf{XY}}(\mathbf{x},\mathbf{y})$ in equation 14 recovers the (negated) normalized test statistic in equation 8 with $P(z)$ and $P(z\mid\mathbf{s})$ inserted, i.e., in the absence of approximation error.

(Han, 2000) leverages the dispersion of the information spectrum (the distribution of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$) for $\{(\mathbf{X}_{i},\mathbf{Y}_{i})\}_{i=1}^{n}$ to quantify the rate at which the Type II error goes to zero with increasing $n$. The underlying rationale is that, for a larger variance of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$, the probability of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$ falling outside the acceptance region for the alternative hypothesis also increases, thereby resulting in a slower convergence rate for the Type II error. In our work, we will make use of the variance of the log-likelihood ratio between $p(\mathbf{s},z)$ and $p(\mathbf{s})p(z)$:

$$\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)=n\,\text{Var}_{(\mathbf{S},Z)^{n}\sim p\left((\mathbf{s},z)^{n}\right)}\bar{I}(\mathbf{S}^{n};Z^{n})=\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\left[-\log\frac{P(Z)}{P(Z\mid\mathbf{S})}\right]. \tag{15}$$

Scaling $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ down by $n$ gives the variance of $\bar{I}(\mathbf{S}^{n};Z^{n})$, which characterizes the dispersion of the information spectrum for $\{(\mathbf{S}_{i},Z_{i})\}_{i=1}^{n}$ given $n$. $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ is also known as the relative entropy variance (see, e.g., (2.29) in (Tan et al., 2014)). It remains present even in the absence of approximation error (i.e., $Q(z\mid\mathbf{s})=P(z\mid\mathbf{s})$). As we will see in Section 5.4.4, the persistent $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ leads to a non-zero Type II error in the finite-sample case.
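To make the role of the dispersion concrete, the sketch below computes the mutual information and the relative entropy variance exactly for a small discrete joint distribution, and then simulates $\bar{I}(\mathbf{S}^{n};Z^{n})$ to check that its mean and variance are close to $I(\mathbf{S};Z)$ and to the relative entropy variance divided by $n$. The joint table `p_joint`, the sample size, and the helper `sample_spectrum` are hypothetical choices made only for this illustration.

```python
import numpy as np

# Hypothetical toy joint pmf p(s, z): 3 feature values (rows) x 2 labels (columns).
p_joint = np.array([[0.45, 0.05],
                    [0.15, 0.15],
                    [0.04, 0.16]])
p_s = p_joint.sum(axis=1, keepdims=True)   # marginal p(s)
p_z = p_joint.sum(axis=0, keepdims=True)   # marginal P(z)

# Per-sample information density log p(s, z) / (p(s) P(z)).
info_density = np.log(p_joint / (p_s * p_z))

# Exact mean (mutual information) and exact variance (relative entropy variance).
mutual_info = float(np.sum(p_joint * info_density))
rel_entropy_var = float(np.sum(p_joint * (info_density - mutual_info) ** 2))

def sample_spectrum(n, trials, rng):
    """Draw `trials` independent realizations of the n-sample average I_bar."""
    flat_p = p_joint.ravel()
    flat_i = info_density.ravel()
    idx = rng.choice(flat_p.size, size=(trials, n), p=flat_p)
    return flat_i[idx].mean(axis=1)

rng = np.random.default_rng(0)
n = 200
samples = sample_spectrum(n, trials=20000, rng=rng)

print("I(S;Z):", mutual_info, "empirical mean of I_bar:", samples.mean())
print("Var / n:", rel_entropy_var / n, "empirical variance of I_bar:", samples.var())
```

The spread of `samples` around `mutual_info` is what keeps the Type II error away from zero at finite $n$, even though the exact conditional $P(z\mid\mathbf{s})$ is used here.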

5.4.3 An example of using the proposed framework

We first introduce the notation that will be used in the ensuing sections. We write $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$ to denote a partition of the support $\mathcal{S}$ of $p(\mathbf{s})$ from which unlabeled sample features in $\mathcal{S}_{u}$ are generated; in other words, $\bigcup_{i=1}^{m}A_{i}=\mathcal{S}$. We compare an example of our proposed framework with the baseline, where features are randomly sampled from $\mathcal{S}_{u}$ and labeled. We quantitatively analyze the testing power of both cases. Both the example and the baseline are detailed as follows:

(An example of using the proposed framework) Given a label budget $N_{q}$, $\alpha$, an unlabeled set $\mathcal{S}_{u}$, a partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$, and class priors $\{P(Z=0\mid A_{1}),\cdots,P(Z=0\mid A_{m})\}$, an analyst initializes $Q(z\mid\mathbf{s})$ with a set of labeled features randomly sampled from $\mathcal{S}_{u}$. She then estimates $I(\mathbf{S};Z\mid A_{i})$ by

$$\begin{aligned}\hat{I}(\mathbf{S};Z\mid A_{i})&=H(Z\mid A_{i})-\hat{H}(Z\mid\mathbf{S},A_{i})\\&=-\sum_{z=0}^{1}P(Z=z\mid A_{i})\log P(Z=z\mid A_{i})+\frac{\sum_{\mathbf{s}\in A_{i}\cap\mathcal{S}_{u}}\sum_{z=0}^{1}Q(Z=z\mid\mathbf{s})\log Q(Z=z\mid\mathbf{s})}{\left|A_{i}\cap\mathcal{S}_{u}\right|},\end{aligned} \tag{16}$$

selects $A^{*}=\arg\max_{A\in\mathcal{P}}\hat{I}(\mathbf{S};Z\mid A)$, and sequentially constructs the statistic $w_{n}=\prod_{i=1}^{n}\frac{P(z_{i})}{Q(z_{i}\mid\mathbf{s}_{i})}$ by labeling features randomly sampled from $A^{*}\cap\mathcal{S}_{u}$. The analyst rejects $H_{0}$ whenever $w_{n}\leq\alpha$ or retains $H_{0}$ if the label budget runs out.

(Baseline test) Given a label budget $N_{q}$, $\alpha$, an unlabeled set $\mathcal{S}_{u}$, and the class prior $P(Z=0)$, an analyst initializes $Q(z\mid\mathbf{s})$ with a set of labeled features randomly sampled from $\mathcal{S}_{u}$. She then sequentially constructs the statistic $w_{n}=\prod_{i=1}^{n}\frac{P(z_{i})}{Q(z_{i}\mid\mathbf{s}_{i})}$ by labeling features randomly sampled from $\mathcal{S}_{u}$. The analyst rejects $H_{0}$ whenever $w_{n}\leq\alpha$ or retains $H_{0}$ if the label budget runs out.

In the example of using the proposed framework, the class priors $\{P(z\mid A_{i})\}$ are given to simplify our analytical results; however, one can estimate these priors with labels in each $A_{i}$ and use the prior estimates to replace $\{P(z\mid A_{i})\}$, and that will not change the main argument of our theorem. In addition, the analyst chooses the partition cell $A^{*}$ that $Q(z\mid\mathbf{s})$ predicts to have the highest dependency between $\mathbf{S}$ and $Z$, and only conducts sequential testing with the labeled points in $A^{*}$. In contrast, the baseline conducts the sequential test in exactly the same way, except that the analyst queries the labels of features that are randomly sampled from $\mathcal{S}_{u}$. Both the proposed framework and the baseline assume a fixed $Q(z\mid\mathbf{s})$ that is not updated during sequential testing; that is sufficient for our analysis, as we will see that the testing power in both cases depends on $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ in equation 11, $D_{\text{KL}^{2}}\left(p(\mathbf{s},z)\|q(\mathbf{s},z)\right)$ in equation 13, and $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ in equation 15.
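The example procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the two-cell partition, the arrays of model outputs $Q(Z=0\mid\mathbf{s})$, the helper names (`binary_entropy`, `estimate_mi`, `sequential_test`), and the choice to simulate labels from $Q$ itself (i.e., no approximation error) are all hypothetical. We also assume that, inside the selected cell $A^{*}$, the prior in $w_{n}$ is the given $P(z\mid A^{*})$.

```python
import numpy as np

def binary_entropy(p0):
    """Entropy (in nats) of a binary variable with P(Z=0) = p0; works elementwise."""
    p = np.clip(np.stack([p0, 1 - p0], axis=-1), 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def estimate_mi(prior_z0, q_z0):
    """I_hat(S;Z|A) = H(Z|A) minus the average entropy of Q(z|s) over unlabeled s in A."""
    return float(binary_entropy(np.asarray(prior_z0)) - binary_entropy(np.asarray(q_z0)).mean())

def sequential_test(prior_z0, q_z0, labels, alpha):
    """Return the first n with w_n <= alpha, or None if the label budget runs out."""
    log_w = 0.0
    for n, (q0, z) in enumerate(zip(q_z0, labels), start=1):
        p_z = prior_z0 if z == 0 else 1 - prior_z0   # P(z_i)
        q_z = q0 if z == 0 else 1 - q0               # Q(z_i | s_i)
        log_w += np.log(p_z) - np.log(q_z)           # accumulate log w_n
        if log_w <= np.log(alpha):
            return n
    return None

rng = np.random.default_rng(0)

# Hypothetical partition with two cells; each array holds Q(Z=0 | s) for unlabeled s.
priors_z0 = [0.5, 0.5]                              # given P(Z=0 | A_i)
q_by_cell = [
    np.full(500, 0.55),                             # cell 0: labels barely depend on s
    np.where(rng.random(500) < 0.5, 0.9, 0.1),      # cell 1: labels depend strongly on s
]

# Estimate the mutual information per cell and pick the most informative cell A*.
mi_hat = [estimate_mi(pr, q) for pr, q in zip(priors_z0, q_by_cell)]
best = int(np.argmax(mi_hat))

# Spend the label budget on random points from A*; labels are simulated from Q (assumption).
N_q, alpha = 100, 0.05
picked = rng.choice(q_by_cell[best], size=N_q, replace=False)
labels = (rng.random(N_q) >= picked).astype(int)    # z = 0 with probability Q(Z=0 | s)
stop = sequential_test(priors_z0[best], picked, labels, alpha)

print("estimated MI per cell:", mi_hat, "selected cell:", best, "stopped at n =", stop)
```

The baseline differs only in the sampling step: it would draw `picked` from all unlabeled points rather than from the selected cell, and use the overall prior $P(Z=0)$.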

5.4.4 Finite-sample analysis for the example

We use $\epsilon_{1}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\mid A\right)$ and $\epsilon_{2}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(p(\mathbf{s},z)\|q(\mathbf{s},z)\mid A\right)$ to capture the maximum approximation error of $Q(z\mid\mathbf{s})$ over the partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$, and use $\sigma^{2}=\max\left\{\max_{A\in\mathcal{P}}\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z\mid A)}\bar{I}(\mathbf{S};Z),\;\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)\right\}$ to capture the maximum irreducible Type II error over the same partition $\mathcal{P}$.

We will need to make the following assumptions before presenting our results.

Assumption 5.8.

(Maximum mutual information gain) $\max_{A\in\mathcal{P}}I(\mathbf{S};Z\mid A)-I(\mathbf{S};Z)=\Delta\geq 0$.

Assumption 5.8 characterizes the largest MI gain of the proposed framework in the example over the baseline; that is the direct reason for the increased testing power of the proposed framework.

Assumption 5.9.

(Sufficient number of unlabeled samples) $\frac{\sum_{\mathbf{s}\in A\cap\mathcal{S}_{u}}\mathbb{E}_{Z\sim Q(z\mid\mathbf{s})}\left[\log\left(\frac{Q(Z\mid\mathbf{s})}{P(Z\mid\mathbf{s})}\right)\right]}{\left|A\cap\mathcal{S}_{u}\right|}\approx D_{\text{KL}}\left(Q(z\mid\mathbf{s})\|P(z\mid\mathbf{s})\mid A\right),\ \forall A\in\mathcal{P}$.

Even though we typically have access to only a finite number of unlabeled samples in real-world scenarios, this number is usually quite large and affordable for many applications. Hence, similar to (Hanneke & Yang, 2015), Assumption 5.9 assumes a sufficient supply of unlabeled samples to simplify the analysis and concentrate solely on the number of labels needed for the proposed framework in the example.

Now, we present our theorem to address the testing power of the framework in the example and the baseline test in the finite-sample case.

Theorem 5.10.

Under Assumptions 5.8 and 5.9, the proposed framework in the example with a label budget $N_q$ and significance level $\alpha$ has a testing power of approximately at least

$$\Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_q}}+\sqrt{N_q}\left(I(\mathbf{S};Z)+\Delta-2\sqrt{\epsilon_1}-\sqrt{\epsilon_2}\right)}{\left(\epsilon_1+\sigma^2+2\sigma\sqrt{\epsilon_1}\right)^{1/2}}\right); \tag{17}$$

and the baseline test with features randomly sampled from $\mathcal{S}_u$ and labeled has a testing power of approximately at least

$$\Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_q}}+\sqrt{N_q}\left(I(\mathbf{S};Z)-\sqrt{\epsilon_1}\right)}{\left(\epsilon_1+\sigma^2+2\sigma\sqrt{\epsilon_1}\right)^{1/2}}\right). \tag{18}$$

Equation 17 and equation 18 give approximate lower bounds on the testing power of the proposed framework in the example and of the baseline test. We observe the following:

  • For a given $\alpha$, a larger label budget $N_q$ and a smaller approximation error $\epsilon_1$ raise both lower bounds on the testing power, since equation 17 and equation 18 share the same structure.

  • Comparing equation 17 for the proposed framework with equation 18 for the baseline, the extra $\Delta$ is ascribed to the maximum power gain, and $\sqrt{\epsilon_1}+\sqrt{\epsilon_2}$ accounts for the reduction of that gain when the selected $A^{*}\in\mathcal{P}$ does not have the highest MI over $A\in\mathcal{P}$.

  • When the approximation errors $\epsilon_1=0$ and/or $\epsilon_2=0$, both lower bounds on the testing power are decreased by a factor of $\sigma$, resulting in an irreducible Type II error.

  • When the maximum MI gain $\Delta$ is larger than $\sqrt{\epsilon_1}+\sqrt{\epsilon_2}$, and so compensates for the approximation error of $Q(z\mid\mathbf{s})$, the lower bound on the testing power of our framework in the example is higher than that of the baseline test for the same label budget $N_q$ and $\alpha$.
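The comparison in the observations above can be checked numerically. The following is a minimal sketch that evaluates the two approximate lower bounds of equation 17 and equation 18; the function name `power_lower_bounds` and all numeric inputs are hypothetical and chosen only for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def power_lower_bounds(mi, delta, eps1, eps2, sigma, n_q, alpha=0.05):
    """Return (active, baseline): the bounds of equation 17 and equation 18.

    mi    : I(S; Z), the mutual information between features and labels
    delta : the maximum MI gain of Assumption 5.8
    eps1, eps2 : approximation errors; sigma : the sigma in both bounds
    """
    phi = NormalDist().cdf  # standard normal CDF
    # Shared denominator (eps1 + sigma^2 + 2*sigma*sqrt(eps1))^(1/2).
    denom = sqrt(eps1 + sigma ** 2 + 2.0 * sigma * sqrt(eps1))
    offset = log(alpha) / sqrt(n_q)
    active_rate = mi + delta - 2.0 * sqrt(eps1) - sqrt(eps2)
    baseline_rate = mi - sqrt(eps1)
    active = phi((offset + sqrt(n_q) * active_rate) / denom)
    baseline = phi((offset + sqrt(n_q) * baseline_rate) / denom)
    return active, baseline


# Hypothetical inputs: here delta exceeds sqrt(eps1) + sqrt(eps2).
active, baseline = power_lower_bounds(
    mi=0.02, delta=0.05, eps1=1e-4, eps2=1e-4, sigma=0.5, n_q=400
)
```

With these inputs the bound for the active test is the larger of the two; setting `delta=0.0` reverses the ordering, which matches the last observation above.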

6 Experimental Results

We have proposed a practical instantiation of the framework, and its algorithmic description, BQ-AST, is presented in Algorithm 1. In this section, we compare BQ-AST with a sequential testing baseline (Lhéritier & Cazals, 2018) that uses the same statistic in equation 2 but labels features randomly sampled from the unlabeled set $\mathcal{S}_u$. In addition, we build $Q(z\mid\mathbf{s})$ for the test statistic in equation 2 using logistic regression, SVM, or KNN classifiers; we set $N_0=10$ for the number of label queries used to initialize $Q(z\mid\mathbf{s})$, and set the significance level $\alpha=0.05$.
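To make the shared statistic concrete, the sketch below computes a likelihood ratio of the kind described in the abstract: the numerator is the label likelihood maximized over class priors, and the denominator is the likelihood assigned by the classifier $Q(z\mid\mathbf{s})$. The exact form of equation 2 is not reproduced in this section, so the rejection rule (reject once the ratio is at most $\alpha$), the function names, and the probability clipping are assumptions made for illustration.

```python
from math import log


def log_likelihood_ratio(labels, q_probs):
    """Log of (max over class priors of label likelihood) / (classifier likelihood).

    labels  : list of 0/1 labels queried so far
    q_probs : q_probs[i] is the predicted probability Q(Z=1 | s_i), assumed to be
              produced before label i was revealed
    """
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    # The maximizing class prior is the empirical label frequency.
    log_num = 0.0
    if n1 > 0:
        log_num += n1 * log(n1 / n)
    if n0 > 0:
        log_num += n0 * log(n0 / n)
    # Likelihood that the classifier assigns to the observed labels.
    log_den = 0.0
    for z, q1 in zip(labels, q_probs):
        q = q1 if z == 1 else 1.0 - q1
        log_den += log(min(max(q, 1e-12), 1.0))  # clip to avoid log(0)
    return log_num - log_den


def reject_null(labels, q_probs, alpha=0.05):
    # Assumed rule: reject H0 once the likelihood ratio is at most alpha.
    return log_likelihood_ratio(labels, q_probs) <= log(alpha)
```

Under this sketch, a classifier that predicts labels no better than the class prior leaves the ratio near one, while an informative classifier drives it toward zero as labels accumulate.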

6.1 Experiments on Synthetic Datasets

Our first suite of experimental results is generated from synthetic data. We create synthetic datasets that comprise two samples of data to simulate cases under the null hypothesis $H_0$ and the alternative hypothesis $H_1$; the data for the first sample ($Z=0$) is generated from $p(\mathbf{s}\mid Z=0)\equiv\mathcal{N}((-\delta,0),I)$ and the data for the second sample ($Z=1$) is generated from $p(\mathbf{s}\mid Z=1)\equiv\mathcal{N}((\delta,0),I)$. In addition, we vary $P(Z=0)$ from $0.5$ to $0.8$ to change the ratio of the data sizes of the two samples. For the simulations under $H_0$, we set $\delta=0$, implying there is no difference between the distributions that generate the two samples; for the simulations under $H_1$, we vary $\delta$ from $0.2$ to $0.5$ to simulate two samples with small to large discrepancy.
Having constructed the data-generating process, we simulate 200 cases of data for each pair of $P(Z=0)$ and $\delta$ under $H_1$, and simulate 500 cases of data for each $P(Z=0)$ with $\delta=0$ under $H_0$. Each case is of size 2000 with labels masked, resulting in an unlabeled set $\mathcal{S}_u$ with $|\mathcal{S}_u|=2000$. The proposed test actively and sequentially labels features in $\mathcal{S}_u$ to test the difference between the two samples.
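The data-generating process described above can be sketched as follows. This is an illustration only: the function name `make_synthetic_case` is hypothetical, and drawing each label independently from the class prior (rather than fixing the two sample sizes) is an assumption.

```python
import random


def make_synthetic_case(n=2000, prior_z0=0.5, delta=0.2, seed=0):
    """Draw one case: n two-dimensional features with (masked) labels.

    Z=0 features follow N((-delta, 0), I); Z=1 features follow N((delta, 0), I).
    Setting delta=0 simulates H0; delta>0 simulates H1.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        # Assumption: labels are drawn i.i.d. with P(Z=0) = prior_z0.
        z = 0 if rng.random() < prior_z0 else 1
        mean_x = -delta if z == 0 else delta
        # Identity covariance: independent unit-variance coordinates.
        s = (rng.gauss(mean_x, 1.0), rng.gauss(0.0, 1.0))
        features.append(s)
        labels.append(z)
    return features, labels


# One H1 case with an unbalanced class prior; the labels stay hidden
# from the test until they are queried.
features, hidden_labels = make_synthetic_case(prior_z0=0.8, delta=0.5, seed=1)
```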

Figure 2: Under $H_0$, in which $\delta=0$, empirical Type I errors of the proposed test for different $P(Z=0)$ when using logistic regression to build $Q(z\mid\mathbf{s})$. All Type I errors are smaller than $\alpha=0.05$, which agrees with Theorem 5.1.

Figure 2 presents the empirical Type I errors, that is, the probability that the proposed test mistakenly concludes that the two samples are generated under $H_1$ when $H_0$ is true. As observed, the empirical Type I errors are all smaller than $\alpha=0.05$ for the various classifiers and label budgets used in the experiments; this provides empirical evidence for Theorem 5.1, which states that the Type I error is controlled to be smaller than the significance level $\alpha$.

Table 1 presents the empirical Type II errors, that is, the probability that the proposed test and the baseline test mistakenly conclude that the two samples are generated under $H_0$ when $H_1$ is true. Table 2 presents the average number of label queries spent to reject $H_0$ when $H_1$ is true. We can observe from Table 1 that the proposed test produces lower Type II errors than the baseline under different classifiers and label budgets; furthermore, in Table 2, we observe that the proposed test spends fewer label queries than the baseline test. Additionally, we run a two-sample t-test to assess the mean difference of the label query numbers generated by 200 runs of both methods. The resulting $p$-values, truncated to six decimal places, all equal zero, indicating that the number of labels spent by our framework is statistically significantly smaller than that of the baseline test. All these observations demonstrate that, under $H_1$, the proposed test labels the features that have a high dependency on labels to effectively decrease the Type II error and reduce the number of label queries needed to reject $H_0$.
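As a rough check of the reported significance, the sketch below computes a two-sample t statistic from summary statistics. Several points are assumptions rather than statements from the experiments: the Welch (unequal-variance) form is used, the ± values in Table 2 are read as standard deviations over 200 runs, and the $p$-value uses a normal tail approximation, which is reasonable only because the number of runs is large. The inputs are the Logistic entries of Table 2 for $P(Z=0)=0.5$ and a label budget of 1000.

```python
from math import erfc, sqrt


def welch_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic and an approximate two-sided p-value."""
    # Standard error of the difference of the two sample means.
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t = (mean_a - mean_b) / se
    # Normal approximation to the t distribution (large samples):
    # 2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2)).
    p_two_sided = erfc(abs(t) / sqrt(2.0))
    return t, p_two_sided


# Baseline 451.4 ± 257 versus proposed 108.6 ± 93, 200 runs each (Table 2).
t_stat, p_value = welch_t_from_summary(451.4, 257.0, 200, 108.6, 93.0, 200)
```

Under these assumptions the t statistic is roughly 17.7 and the approximate $p$-value is far below $10^{-6}$, consistent with a value that reads as zero after truncation to six decimal places.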

Table 1: Under $H_1$, Type II errors of the proposed and baseline tests with various classifiers and label budgets for the synthetic data generated by setting $\delta=0.2$ and different class priors $P(Z=0)$. Due to the active query, our test produces lower Type II errors than the baseline for various label budgets.
| $P(Z=0)$ | Test | Logistic | Logistic | Logistic | Logistic | Logistic | KNN | KNN | KNN | KNN | KNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 |
| 0.5 | Baseline | 0.82 | 0.53 | 0.29 | 0.11 | 0.04 | 0.95 | 0.77 | 0.50 | 0.28 | 0.14 |
| 0.5 | Proposed | 0.16 | 0.02 | 0.00 | 0.00 | 0.00 | 0.49 | 0.17 | 0.06 | 0.03 | 0.01 |
| 0.6 | Baseline | 0.80 | 0.50 | 0.23 | 0.12 | 0.06 | 0.95 | 0.77 | 0.48 | 0.29 | 0.14 |
| 0.6 | Proposed | 0.26 | 0.06 | 0.01 | 0.01 | 0.01 | 0.59 | 0.26 | 0.09 | 0.03 | 0.01 |
| 0.7 | Baseline | 0.81 | 0.56 | 0.34 | 0.22 | 0.10 | 0.96 | 0.81 | 0.58 | 0.36 | 0.28 |
| 0.7 | Proposed | 0.26 | 0.04 | 0.01 | 0.01 | 0.01 | 0.71 | 0.33 | 0.14 | 0.04 | 0.02 |
| 0.8 | Baseline | 0.88 | 0.73 | 0.56 | 0.35 | 0.21 | 0.98 | 0.90 | 0.77 | 0.59 | 0.48 |
| 0.8 | Proposed | 0.38 | 0.10 | 0.04 | 0.03 | 0.02 | 0.80 | 0.50 | 0.28 | 0.16 | 0.10 |

Table 2: Under $H_1$, average number of label queries needed to reject $H_0$ for the proposed and baseline tests using various classifiers and label budgets on the synthetic data generated by setting $\delta=0.2$ and different class priors $P(Z=0)$. Due to the active query, our test spends fewer label queries to reject $H_0$ than the baseline for various label budgets.
| $P(Z=0)$ | Test | Logistic | Logistic | Logistic | Logistic | Logistic | KNN | KNN | KNN | KNN | KNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 |
| 0.5 | Baseline | 183.5±41 | 319.7±113 | 399.1±183 | 438.1±233 | 451.4±257 | 198.1±10 | 374.4±61 | 500.4±132 | 578.1±201 | 619.5±254 |
| 0.5 | Proposed | 95.3±64 | 108.1±92 | 108.6±93 | 108.6±93 | 108.6±93 | 162.1±50 | 223.1±116 | 240.8±149 | 249.7±173 | 252.8±184 |
| 0.6 | Baseline | 182.3±41 | 312.4±116 | 386.0±184 | 419.7±231 | 439.0±266 | 196.7±16 | 373.7±66 | 499.3±134 | 578.9±206 | 619.7±256 |
| 0.6 | Proposed | 107.9±70 | 134.2±114 | 142.3±136 | 143.7±142 | 144.7±147 | 166.3±48 | 246.8±123 | 282.2±175 | 294.3±200 | 296.6±207 |
| 0.7 | Baseline | 184.0±41 | 323.3±113 | 415.5±188 | 472.2±252 | 505.0±299 | 198.3±11 | 378.5±58 | 520.0±127 | 613.4±199 | 678.1±268 |
| 0.7 | Proposed | 120.4±67 | 143.4±104 | 147.6±117 | 149.0±122 | 150.0±128 | 178.0±43 | 282.2±117 | 327.4±173 | 345.9±207 | 351.7±222 |
| 0.8 | Baseline | 190.8±31 | 351.7±96 | 479.6±172 | 571.1±245 | 628.0±306 | 199.0±8 | 386.6±47 | 555.0±106 | 689.5±175 | 798.4±253 |
| 0.8 | Proposed | 134.7±64 | 174.8±118 | 189.5±151 | 195.6±170 | 199.7±186 | 184.4±36 | 310.2±111 | 387.7±186 | 434.7±247 | 462.6±293 |

We present the average number of label queries spent for two samples with small to large discrepancies under $H_1$ in Table 3. A small discrepancy between two samples makes for a more difficult two-sample testing problem than a large discrepancy, as a two-sample test requires more data to detect a small discrepancy. Table 3 shows that the proposed active sequential test spends fewer labels to reject $H_0$ as the mean discrepancy $\delta$ between the two samples increases, which demonstrates that the proposed sequential test automatically adapts the number of label queries to the problem's complexity.

Table 3: Under $H_1$ and label budget $N_q=1000$, the average number of label queries needed to reject $H_0$ for different $\delta$. When the mean difference $\delta$ between the two samples increases, both our active sequential test and the baseline test reject $H_0$ with a reduced number of label queries, exhibiting the benefit of sequential testing: the tests adapt the number of label queries to the problem's complexity. Due to the active query, our test spends fewer label queries to reject $H_0$ than the baseline for various $\delta$.
| $P(Z=0)$ | Test | Logistic | Logistic | Logistic | Logistic | KNN | KNN | KNN | KNN |
|---|---|---|---|---|---|---|---|---|---|
| | $\delta$ | 0.2 | 0.3 | 0.4 | 0.5 | 0.2 | 0.3 | 0.4 | 0.5 |
| 0.5 | Baseline | 451.4±257 | 178.3±105 | 101.0±58 | 63.9±32 | 619.5±254 | 287.8±129 | 167.4±70 | 116.8±43 |
| 0.5 | Proposed | 108.6±93 | 37.3±22 | 24.3±10 | 19.7±5 | 252.8±184 | 109.5±64 | 72.2±33 | 54.9±20 |
| 0.6 | Baseline | 439.0±266 | 175.3±118 | 96.9±65 | 65.5±40 | 619.7±256 | 289.8±130 | 170.2±72 | 116.2±47 |
| 0.6 | Proposed | 144.7±147 | 40.5±30 | 24.9±11 | 20.1±7 | 296.6±207 | 134.3±88 | 84.3±43 | 58.3±25 |
| 0.7 | Baseline | 505.0±299 | 223.6±145 | 115.7±70 | 75.7±47 | 678.1±268 | 349.3±178 | 198.2±93 | 133.3±56 |
| 0.7 | Proposed | 150.0±128 | 57.1±42 | 32.3±21 | 22.2±8 | 351.7±222 | 160.2±107 | 94.0±54 | 67.0±30 |
| 0.8 | Baseline | 628.0±306 | 278.1±177 | 149.3±95 | 94.8±56 | 798.4±253 | 470.3±223 | 268.7±126 | 176.3±81 |
| 0.8 | Proposed | 199.7±186 | 66.7±41 | 40.0±22 | 29.4±15 | 462.6±293 | 198.8±143 | 115.7±65 | 83.8±46 |

6.2 Experiments on MNIST

Figure 3: Empirical Type I errors of the proposed test for different $P(Z=0)$ in the MNIST experiment. SVM is used to build $Q(z\mid\mathbf{s})$. All Type I errors are smaller than $\alpha=0.05$, which agrees with Theorem 5.1.

In addition to the synthetic datasets, we simulate the cases of $H_0$ and $H_1$ with MNIST (LeCun, 1998). To create a case for $H_0$, we randomly pick one digit category from 0-9, then randomly sample images from the selected digit category, and lastly divide the images into sample zero ($Z=0$) and sample one ($Z=1$) based on a pre-defined class prior $P(Z=0)$; for each case, the two samples contain data from the same digit, but the digit categories can differ across cases. To create a case for $H_1$, we randomly pick two different digit categories from 0-9, then sample images from one digit category and place the images in sample zero ($Z=0$); to create sample one ($Z=1$), we sample images from the two digits, mix the sampled images, and place them in sample one. We set the mixture ratio to $0.7$, meaning that roughly $30\%$ of the data in sample one is generated from a distribution different from that of sample zero. We also adjust $P(Z=0)$ to create cases with different ratios of the size of sample zero to that of sample one for $H_1$. We produce 500 cases for $H_0$ and 200 cases for $H_1$ with the stated procedure for each $P(Z=0)$ that ranges from $0.5$ to $0.8$; each case comprises an unlabeled set $\mathcal{S}_u$ with a size of 2000 and its corresponding labels, which are unknown to the analyst.
Instead of using the raw data in the created cases, we project the MNIST data to a 28-dimensional space with a convolutional autoencoder before conducting the two-sample testing.
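The construction of an $H_1$ case can be sketched as below. This is an illustration under stated assumptions: `pool_by_digit` is a hypothetical mapping from each digit to its (already projected) feature vectors, items are drawn with replacement, and each label is drawn independently from the class prior; none of these details is specified in the text.

```python
import random


def make_h1_case(pool_by_digit, n=2000, prior_z0=0.5, mixture_ratio=0.7, seed=0):
    """Build one H1 case from per-digit feature pools (hypothetical input)."""
    rng = random.Random(seed)
    # Pick two different digit categories.
    digit_a, digit_b = rng.sample(sorted(pool_by_digit), 2)
    features, labels = [], []
    for _ in range(n):
        if rng.random() < prior_z0:
            # Sample zero: always drawn from digit_a.
            source, z = digit_a, 0
        else:
            # Sample one: digit_a with probability mixture_ratio, else digit_b,
            # so roughly 30% of sample one differs from sample zero at 0.7.
            source = digit_a if rng.random() < mixture_ratio else digit_b
            z = 1
        features.append(rng.choice(pool_by_digit[source]))
        labels.append(z)
    return features, labels, (digit_a, digit_b)
```

An $H_0$ case follows from the same sketch by setting `mixture_ratio=1.0`, so that both samples come from a single digit.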

We first present the empirical Type I errors in Figure 3. We use the support vector machine (SVM) to build $Q(z\mid\mathbf{s})$ to generate the results. As observed, all the Type I errors are smaller than $\alpha=0.05$, which agrees with Theorem 5.1. In addition, we present the Type II errors in Table 4. The proposed test generates smaller Type II errors than the baseline sequential test for various classifiers, label budgets, and $P(Z=0)$, implying that the proposed sequential testing combined with the active query is effective. This is further corroborated by Table 5, which reports the average number of label queries needed to reject $H_0$; the proposed test spends fewer label queries than the baseline test to reject $H_0$. We additionally run a two-sample t-test to compare the mean label query numbers generated by both methods. The resulting $p$-values, truncated to six decimal places, all equal zero, indicating that the number of labels spent by our framework is statistically significantly smaller than that of the baseline test in the MNIST experiment.

Table 4: Under $H_1$, Type II errors of the proposed and baseline tests with various classifiers and label budgets for MNIST and different class priors $P(Z=0)$. Due to the active query, our test produces lower Type II errors than the baseline for various label budgets.
| $P(Z=0)$ | Test | Logistic | Logistic | Logistic | Logistic | Logistic | SVM | SVM | SVM | SVM | SVM | KNN | KNN | KNN | KNN | KNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 |
| 0.5 | Baseline | 0.65 | 0.21 | 0.02 | 0.01 | 0.01 | 0.59 | 0.07 | 0.00 | 0.00 | 0.00 | 0.84 | 0.43 | 0.15 | 0.07 | 0.03 |
| 0.5 | Proposed | 0.12 | 0.01 | 0.01 | 0.00 | 0.00 | 0.12 | 0.03 | 0.01 | 0.01 | 0.00 | 0.10 | 0.01 | 0.01 | 0.00 | 0.00 |
| 0.6 | Baseline | 0.59 | 0.16 | 0.02 | 0.01 | 0.01 | 0.55 | 0.04 | 0.00 | 0.00 | 0.00 | 0.89 | 0.43 | 0.15 | 0.06 | 0.03 |
| 0.6 | Proposed | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.01 | 0.00 | 0.00 | 0.00 | 0.06 | 0.02 | 0.01 | 0.01 | 0.00 |
| 0.7 | Baseline | 0.58 | 0.21 | 0.04 | 0.01 | 0.00 | 0.67 | 0.15 | 0.01 | 0.00 | 0.00 | 0.91 | 0.58 | 0.29 | 0.10 | 0.04 |
| 0.7 | Proposed | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.01 | 0.00 | 0.00 | 0.00 | 0.12 | 0.03 | 0.01 | 0.00 | 0.00 |
| 0.8 | Baseline | 0.66 | 0.24 | 0.04 | 0.01 | 0.01 | 0.77 | 0.32 | 0.10 | 0.01 | 0.01 | 0.95 | 0.71 | 0.47 | 0.27 | 0.12 |
| 0.8 | Proposed | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.01 | 0.00 | 0.00 | 0.00 | 0.14 | 0.03 | 0.01 | 0.01 | 0.00 |

Table 5: Under $H_1$, average number of label queries needed to reject $H_0$ for the proposed and baseline tests using various classifiers and label budgets for MNIST and different class priors $P(Z=0)$. Due to the active query, our test spends fewer label queries to reject $H_0$ than the baseline for various label budgets.
| $P(Z=0)$ | Test | Logistic | Logistic | Logistic | Logistic | Logistic | SVM | SVM | SVM | SVM | SVM | KNN | KNN | KNN | KNN | KNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 |
| 0.5 | Baseline | 165.3±56 | 251.7±126 | 267.9±150 | 270.7±158 | 271.7±162 | 175.4±39 | 229.9±93 | 233.0±99 | 233.0±99 | 233.0±99 | 187.0±30 | 311.3±93 | 359.1±141 | 376.8±167 | 384.6±185 |
| 0.5 | Proposed | 90.4±62 | 99.8±84 | 101.0±89 | 101.5±92 | 101.5±92 | 93.5±55 | 106.5±87 | 109.4±98 | 110.5±105 | 110.7±106 | 89.4±51 | 97.9±75 | 99.9±84 | 100.1±86 | 100.1±86 |
| 0.6 | Baseline | 160.8±59 | 233.2±125 | 247.5±148 | 249.3±154 | 250.3±158 | 173.5±39 | 226.8±95 | 229.8±101 | 229.8±101 | 229.8±101 | 187.5±31 | 315.1±93 | 363.9±142 | 379.8±166 | 385.4±178 |
| 0.6 | Proposed | 61.7±43 | 61.7±43 | 61.7±43 | 61.7±43 | 61.7±43 | 79.4±48 | 83.2±60 | 83.4±62 | 83.4±62 | 83.4±62 | 85.0±49 | 90.9±68 | 94.0±85 | 95.6±95 | 96.6±103 |
| 0.7 | Baseline | 160.3±59 | 234.8±128 | 255.2±161 | 257.8±167 | 258.0±168 | 174.6±45 | 252.1±109 | 264.6±130 | 265.4±133 | 265.4±133 | 188.4±31 | 330.7±94 | 415.7±162 | 451.8±206 | 463.2±225 |
| 0.7 | Proposed | 46.2±28 | 46.2±28 | 46.2±28 | 46.2±28 | 46.2±28 | 74.7±56 | 82.0±76 | 83.0±81 | 83.0±81 | 83.0±81 | 89.3±54 | 101.5±85 | 104.6±98 | 105.1±101 | 105.1±101 |
| 0.8 | Baseline | 163.9±58 | 243.3±126 | 268.1±163 | 273.1±175 | 275.3±183 | 92.6±16 | 148.5±52 | 167.3±76 | 171.7±85 | 172.6±88 | 192.8±25 | 357.3±76 | 471.8±146 | 540.7±210 | 575.4±255 |
| 0.8 | Proposed | 34.8±17 | 34.8±17 | 34.8±17 | 34.8±17 | 34.8±17 | 77.2±55 | 81.6±68 | 82.3±72 | 82.3±72 | 82.3±72 | 104.8±54 | 116.9±81 | 119.1±90 | 120.1±96 | 120.3±98 |

6.3 Experiments on An Alzheimer’s Disease Dataset

Table 6: Under $H_1$, Type II errors of the proposed and baseline tests with various classifiers and label budgets for ADNI and different class priors $P(Z=0)$. Due to the active query, our test produces lower Type II errors than the baseline for various label budgets.
| $P(Z=0)$ | Test | Logistic | Logistic | Logistic | Logistic | Logistic | SVM | SVM | SVM | SVM | SVM | KNN | KNN | KNN | KNN | KNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Label budget | 100 | 200 | 300 | 400 | 500 | 100 | 200 | 300 | 400 | 500 | 100 | 200 | 300 | 400 | 500 |
| 0.5 | Baseline | 0.32 | 0.06 | 0.01 | 0.00 | 0.00 | 0.67 | 0.17 | 0.02 | 0.00 | 0.00 | 0.72 | 0.49 | 0.25 | 0.13 | 0.04 |
| 0.5 | Proposed | 0.10 | 0.01 | 0.00 | 0.00 | 0.00 | 0.24 | 0.03 | 0.01 | 0.00 | 0.00 | 0.21 | 0.04 | 0.00 | 0.00 | 0.00 |
| 0.6 | Baseline | 0.35 | 0.04 | 0.00 | 0.00 | 0.00 | 0.62 | 0.15 | 0.01 | 0.00 | 0.00 | 0.73 | 0.25 | 0.06 | 0.01 | 0.00 |
| 0.6 | Proposed | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.18 | 0.04 | 0.03 | 0.00 | 0.00 | 0.10 | 0.01 | 0.00 | 0.00 | 0.00 |
| 0.7 | Baseline | 0.40 | 0.10 | 0.01 | 0.00 | 0.00 | 0.65 | 0.21 | 0.06 | 0.00 | 0.00 | 0.81 | 0.36 | 0.12 | 0.04 | 0.02 |
| 0.7 | Proposed | 0.11 | 0.03 | 0.00 | 0.00 | 0.00 | 0.32 | 0.07 | 0.02 | 0.01 | 0.01 | 0.25 | 0.04 | 0.01 | 0.00 | 0.00 |
| 0.8 | Baseline | 0.52 | 0.23 | 0.07 | 0.01 | 0.00 | 0.89 | 0.53 | 0.27 | 0.07 | 0.02 | 0.90 | 0.59 | 0.28 | 0.16 | 0.07 |
| 0.8 | Proposed | 0.28 | 0.01 | 0.00 | 0.00 | 0.00 | 0.49 | 0.15 | 0.06 | 0.03 | 0.01 | 0.38 | 0.10 | 0.03 | 0.01 | 0.01 |

We demonstrate the utility of the proposed test in a clinical application using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (Jack Jr et al., 2008). The ADNI study protocol was approved by local institutional review boards (IRB). All the personal information in the data provided to researchers has been removed. The motivation for applying the proposed test to Alzheimer's disease research is as follows. Amyloid has been linked to the development of Alzheimer's disease; identifying the amount of amyloid in the human brain is an important step in predicting the progression of Alzheimer's disease. To measure the amyloid level, an expensive CT scan is required to assess the amyloid deposition in the brain. A useful alternative would be an easy-to-measure and inexpensive replacement for the amyloid level that indicates the progression of Alzheimer's disease. In the following experiments, we consider using digital test results that include five cognition measurement scores of participants as a replacement. To verify whether the digital test results are suitable replacements, clinicians seek an approach to test the independence between the digital test results and the amyloid amount with a limited number of expensive CT scans to measure the amyloid levels. We use a binary version of the amyloid level where $Z=0$ and $Z=1$ indicate low and high amyloid deposition in the brain, respectively; we can now formulate a two-sample test and use the proposed scheme. As the results show, our proposed test combines sequential decision-making with active label queries, and therefore needs fewer CT scans than the conventional sequential test.

Table 7: Under $H_1$, average number of label queries needed to reject $H_0$ for the proposed and baseline tests, using various classifiers and label budgets on ADNI with different class priors $P(Z=0)$. Owing to the active query, our test needs fewer label queries to reject $H_0$ than the baseline across label budgets.

| $P(Z=0)$ | Classifier | Test | Budget 100 | Budget 200 | Budget 300 | Budget 400 | Budget 500 |
|---|---|---|---|---|---|---|---|
| 0.5 | Logistic | Baseline | 68.1±29 | 83.7±52 | 85.5±57 | 85.6±57 | 85.6±57 |
| 0.5 | Logistic | Proposed | 43.9±29 | 47.1±36 | 47.1±37 | 47.1±37 | 47.1±37 |
| 0.5 | SVM | Baseline | 87.0±22 | 127.2±55 | 135.4±69 | 136.1±70 | 136.1±70 |
| 0.5 | SVM | Proposed | 64.0±28 | 75.1±47 | 76.5±51 | 76.6±52 | 76.6±52 |
| 0.5 | KNN | Baseline | 86.2±26 | 145.8±66 | 181.8±100 | 199.5±124 | 207.3±138 |
| 0.5 | KNN | Proposed | 69.3±22 | 76.6±37 | 77.8±41 | 77.8±41 | 77.8±41 |
| 0.6 | Logistic | Baseline | 68.4±29 | 84.0±51 | 86.1±57 | 86.1±57 | 86.1±57 |
| 0.6 | Logistic | Proposed | 43.9±29 | 45.3±32 | 45.3±32 | 45.3±32 | 45.3±32 |
| 0.6 | SVM | Baseline | 85.0±23 | 121.0±55 | 127.5±67 | 127.5±67 | 127.5±67 |
| 0.6 | SVM | Proposed | 61.3±26 | 70.0±44 | 72.9±54 | 74.5±61 | 74.5±61 |
| 0.6 | KNN | Baseline | 92.7±15 | 140.3±51 | 153.0±69 | 156.0±77 | 156.3±78 |
| 0.6 | KNN | Proposed | 60.8±20 | 64.4±30 | 64.4±30 | 64.4±30 | 64.4±30 |
| 0.7 | Logistic | Baseline | 72.3±29 | 95.6±58 | 100.9±70 | 101.1±70 | 101.1±70 |
| 0.7 | Logistic | Proposed | 50.6±29 | 56.6±43 | 57.1±45 | 57.1±45 | 57.1±45 |
| 0.7 | SVM | Baseline | 86.5±23 | 126.7±57 | 139.0±76 | 141.3±82 | 141.3±82 |
| 0.7 | SVM | Proposed | 68.8±29 | 85.1±53 | 89.0±63 | 90.0±67 | 90.5±69 |
| 0.7 | KNN | Baseline | 94.7±13 | 153.5±49 | 176.5±77 | 183.7±90 | 186.1±97 |
| 0.7 | KNN | Proposed | 68.6±24 | 78.5±42 | 79.9±47 | 79.9±47 | 79.9±47 |
| 0.8 | Logistic | Baseline | 78.1±28 | 115.4±65 | 128.5±86 | 132.5±95 | 132.9±96 |
| 0.8 | Logistic | Proposed | 63.6±32 | 72.0±44 | 72.2±45 | 72.2±45 | 72.2±45 |
| 0.8 | SVM | Baseline | 95.9±13 | 166.9±47 | 204.8±81 | 219.6±101 | 222.8±108 |
| 0.8 | SVM | Proposed | 80.1±26 | 108.6±57 | 118.1±75 | 121.9±86 | 124.0±93 |
| 0.8 | KNN | Baseline | 97.6±8 | 171.0±43 | 215.3±79 | 235.8±106 | 247.6±126 |
| 0.8 | KNN | Proposed | 80.7±21 | 98.3±46 | 102.4±57 | 103.9±63 | 104.9±68 |

The obtained ADNI data contain both the digital test results and the amyloid amounts of participants. We binarize the amyloid amount using the cut-off value suggested by ADNI to create two-sample cases, where $\mathbf{s}$ denotes a vector of cognition measurement scores and $z$ denotes a low or high amyloid amount for a participant. We create 200 data cases for each $P(Z=0)$ in the range from $0.5$ to $0.8$; these cases simulate $H_1$, and each case comprises an unlabeled set $\mathcal{S}_u$ of size 1000 together with its corresponding labels, which are unknown to the analyst.
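The case construction described above can be illustrated with a minimal sketch. It is one plausible reading, not the procedure actually used for the ADNI experiments: the cohort is synthetic, the cut-off value `CUTOFF`, the helper names `binarize` and `make_case`, the convention that values below the cut-off map to $Z=0$, and the choice of stratified sampling without replacement are all assumptions made for illustration.

```python
import random

CUTOFF = 1.0  # hypothetical cut-off; the value suggested by ADNI is not given here


def binarize(amyloid, cutoff):
    """Map amyloid amounts to labels: 0 = low (below cutoff), 1 = high."""
    return [0 if a < cutoff else 1 for a in amyloid]


def make_case(features, labels, p_z0, size, rng):
    """Draw one data case of `size` points in which a fraction p_z0 has label 0.

    Sampling is stratified and without replacement (an assumption).
    Returns the feature set S_u and the hidden labels Z_u in matching order.
    """
    idx0 = [i for i, z in enumerate(labels) if z == 0]
    idx1 = [i for i, z in enumerate(labels) if z == 1]
    n0 = round(p_z0 * size)
    chosen = rng.sample(idx0, n0) + rng.sample(idx1, size - n0)
    rng.shuffle(chosen)
    s_u = [features[i] for i in chosen]  # visible to the analyst
    z_u = [labels[i] for i in chosen]    # revealed only when queried
    return s_u, z_u


# Synthetic stand-in cohort: five scores per participant, shifted by amyloid amount.
rng = random.Random(0)
amyloid = [rng.gauss(1.0, 0.2) for _ in range(4000)]
scores = [[rng.gauss(a, 1.0) for _ in range(5)] for a in amyloid]
labels = binarize(amyloid, CUTOFF)

for p_z0 in (0.5, 0.6, 0.7, 0.8):
    cases = [make_case(scores, labels, p_z0, 1000, rng) for _ in range(200)]
    frac0 = cases[0][1].count(0) / 1000
    print(f"P(Z=0)={p_z0}: {len(cases)} cases, class-0 fraction in first case {frac0}")
```

Fixing the class-0 count per case, rather than drawing it at random, keeps the empirical prior of every case equal to the nominal $P(Z=0)$; a random draw would be an equally reasonable design.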

Table 6 and Table 7 report the empirical Type II errors and the average number of label queries needed to reject $H_0$, respectively. Compared with the baseline test under the same label budgets, our proposed test reduces the Type II error by up to 58% and the number of label queries by up to 62%. Additionally, we run a two-sample t-test to compare the mean numbers of label queries used by the two methods. The resulting $p$-values, truncated to six decimal places, are all zero, indicating that the label savings are statistically significant.
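As a rough sanity check of these figures, the sketch below recomputes a relative saving and a two-sample test statistic from entries of Table 7. It rests on assumptions that the text does not state: that each "±" value is a standard deviation over the 200 cases, that both tests contribute 200 observations, and that a normal approximation to the t-distribution is adequate at this sample size. The function names are hypothetical.

```python
import math


def relative_saving(baseline_mean, proposed_mean):
    """Fraction of label queries saved by the proposed test."""
    return 1.0 - proposed_mean / baseline_mean


def welch_t(mean_a, sd_a, mean_b, sd_b, n_a, n_b):
    """Welch two-sample t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def normal_two_sided_p(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# KNN, P(Z=0) = 0.5, label budget 500: baseline 207.3 vs. proposed 77.8.
saving = relative_saving(207.3, 77.8)

# Logistic, P(Z=0) = 0.5, label budget 100: 68.1±29 vs. 43.9±29, n = 200 each (assumed).
t_stat = welch_t(68.1, 29.0, 43.9, 29.0, 200, 200)
p_value = normal_two_sided_p(t_stat)

print(f"relative saving: {saving:.3f}")
print(f"t statistic: {t_stat:.2f}, p below 1e-6: {p_value < 1e-6}")
```

Under these assumptions the KNN entry gives a saving of about 62%, consistent with the reported maximum, and the logistic entry gives a t statistic of about 8.3, whose p-value is far below the six-decimal truncation threshold.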

7 Conclusion

We propose an active sequential two-sample testing framework that sequentially and actively labels data to increase the testing power and to adapt the number of label queries to the problem’s complexity. We provide both finite-sample and asymptotic analyses of the proposed framework; in both cases, the framework’s benefit is characterized by the change in the mutual information between the feature and label variables relative to a random labeling scheme. Moreover, we suggest an instantiation of the framework that adopts the bimodal query, which labels the features that a classifier predicts to have the highest class-one or class-zero probabilities. Our experiments on synthetic data, MNIST, and an Alzheimer’s disease dataset demonstrate the effectiveness of this instantiation.

Acknowledgement

This work was funded in part by the Office of Naval Research under grant N00014-21-1-2615 and by the National Science Foundation (NSF) under grants CNS-2003111 and CCF-2048223.

References

  • Aaditya Ramdas (2018) Aaditya Ramdas. Martingales, Ville and Doob, 2018. https://www.stat.cmu.edu/~aramdas/martingales18/L2-martingales.pdf.
  • Balsubramani & Ramdas (2015) Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. arXiv preprint arXiv:1506.03486, 2015.
  • Bessler (1960) Stuart A Bessler. Theory and applications of the sequential design of experiments, k-actions and infinitely many experiments. part i. theory. Technical report, Stanford Univ CA Applied Mathematics and Statistics Labs, 1960.
  • Blot & Meeter (1973) William J Blot and Duane A Meeter. Sequential experimental design procedures. Journal of the American Statistical Association, 68(343):586–593, 1973.
  • Chernoff (1959) Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
  • Doob (1939) JL Doob. Jean Ville, Étude critique de la notion de collectif. Bulletin of the American Mathematical Society, 45(11):824–824, 1939.
  • Duan et al. (2022) Boyan Duan, Aaditya Ramdas, and Larry Wasserman. Interactive rank testing by betting. In Conference on Causal Learning and Reasoning, pp. 201–235. PMLR, 2022.
  • Dunn (1961) Olive Jean Dunn. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64, 1961.
  • Durrett (2019) Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
  • Friedman & Rafsky (1979) Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pp.  697–717, 1979.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Györfi et al. (2002) László Györfi, Michael Köhler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
  • Hajnal (1961) J Hajnal. A two-sample sequential t-test. Biometrika, 48(1/2):65–75, 1961.
  • Han (2000) Te Sun Han. Hypothesis testing with the general source. arXiv preprint math/0004121, 2000.
  • Han & Verdú (1993) Te Sun Han and Sergio Verdú. Approximation theory of output statistics. IEEE Transactions on Information Theory, 39(3):752–772, 1993.
  • Hanneke & Yang (2015) Steve Hanneke and Liu Yang. Minimax analysis of active learning. J. Mach. Learn. Res., 16(1):3487–3602, 2015.
  • Hotelling (1992) Harold Hotelling. The generalization of student’s ratio. In Breakthroughs in statistics, pp.  54–65. Springer, 1992.
  • Jack Jr et al. (2008) Clifford R Jack Jr, Matt A Bernstein, Nick C Fox, Paul Thompson, Gene Alexander, Danielle Harvey, Bret Borowski, Paula J Britson, Jennifer L. Whitwell, Chadwick Ward, et al. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 27(4):685–691, 2008.
  • Johari et al. (2022) Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Always valid inference: Continuous monitoring of a/b tests. Operations Research, 70(3):1806–1821, 2022.
  • Keener (1984) Robert Keener. Second order efficiency in the sequential design of experiments. The Annals of Statistics, pp.  510–532, 1984.
  • Kiefer & Sacks (1963) J Kiefer and J Sacks. Asymptotically optimum sequential inference and design. The Annals of Mathematical Statistics, pp.  705–750, 1963.
  • Koga et al. (2002) H Koga et al. Information-spectrum methods in information theory, volume 50. Springer Science & Business Media, 2002.
  • LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Lhéritier & Cazals (2018) Alix Lhéritier and Frédéric Cazals. A sequential non-parametric multivariate two-sample test. IEEE Transactions on Information Theory, 64(5):3361–3370, 2018.
  • Li et al. (2022) Weizhi Li, Gautam Dasarathy, Karthikeyan Natesan Ramamurthy, and Visar Berisha. A label efficient two-sample test. In Uncertainty in Artificial Intelligence, pp.  1168–1177. PMLR, 2022.
  • Lopez-Paz & Oquab (2016) David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
  • Miller (2007) Steven J Miller. An introduction to linear programming. lecture notes, 2007.
  • Naghshvar & Javidi (2013) Mohammad Naghshvar and Tara Javidi. Active sequential hypothesis testing. The Annals of Statistics, 41(6):2703–2738, 2013.
  • Pandeva et al. (2022) Teodora Pandeva, Tim Bakker, Christian A Naesseth, and Patrick Forré. E-valuating classifier two-sample tests. arXiv preprint arXiv:2210.13027, 2022.
  • Ramdas et al. (2022) Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, and Glenn Shafer. Game-theoretic statistics and safe anytime-valid inference. arXiv preprint arXiv:2210.01948, 2022.
  • Shekhar & Ramdas (2021) Shubhanshu Shekhar and Aaditya Ramdas. Game-theoretic formulations of sequential nonparametric one-and two-sample tests. arXiv preprint arXiv:2112.09162, 2021.
  • Student (1908) Student. The probable error of a mean. Biometrika, 6(1):1–25, 1908.
  • Tan et al. (2014) Vincent YF Tan et al. Asymptotic estimates in information theory with non-vanishing error probabilities. Foundations and Trends® in Communications and Information Theory, 11(1-2):1–184, 2014.
  • Ville (1939) Jean Ville. Etude critique de la notion de collectif. Bull. Amer. Math. Soc, 45(11):824, 1939.
  • Wald (1992) Abraham Wald. Sequential tests of statistical hypotheses. In Breakthroughs in Statistics, pp.  256–298. Springer, 1992.
  • Wasserstein & Lazar (2016) Ronald L Wasserstein and Nicole A Lazar. The ASA statement on p-values: context, process, and purpose, 2016.
  • Welch (1990) William J Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.
  • Wilks (1938) Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The annals of mathematical statistics, 9(1):60–62, 1938.

Appendix A Proof of Theorem 5.1 and Its Preliminaries

A.1 Some statistical preliminaries

In probability theory, a sequence $\{X_0, \cdots, X_n\}$ of random variables is called a martingale if, at any given time, the conditional expectation of the next random variable given the observations so far equals the present observation; this is formally defined as follows.

Definition A.1.

(Martingale) A sequence of random variables $\{X_0, \cdots, X_n\}$ is a martingale if, for any $n \geq 0$,

$$\mathbb{E}\left[|X_n|\right] < \infty, \tag{19}$$
$$\mathbb{E}\left[X_{n+1} \mid X_0, \cdots, X_n\right] = X_n. \tag{20}$$

We refer interested readers to (Aaditya Ramdas, 2018) for a complete introduction to the martingale and its related properties.

Next, we state Ville’s maximal inequality (Ville, 1939), which will be applied to prove Theorem 5.1.

Theorem A.2.

(Ville’s Maximal Inequality (Ville, 1939)): If $\{X_n\}$ is a nonnegative martingale, then for any $c > 0$, we have

$$P\left(\sup_{n \geq 0} X_n > c\right) \leq \frac{\mathbb{E}\left[X_0\right]}{c}. \tag{21}$$

Ville’s maximal inequality gives an upper bound on the probability that the martingale ever crosses a threshold $c$; it is a sequential extension of Markov’s inequality.
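The bound can be checked numerically with a small Monte Carlo sketch. The martingale used here is an illustrative choice, not one from the paper: $X_0 = 1$ and $X_n = X_{n-1} U_n$ with independent $U_n \in \{0.5, 1.5\}$ equally likely, so that $\mathbb{E}[U_n] = 1$ and $X_n \geq 0$. The function names, horizon, and trial count are likewise assumptions.

```python
import random


def crosses(c, horizon, rng):
    """Simulate one path of the product martingale; report whether it ever exceeds c."""
    x = 1.0  # X_0 = 1, so E[X_0] / c = 1 / c
    for _ in range(horizon):
        x *= 1.5 if rng.random() < 0.5 else 0.5  # mean-one multiplicative step
        if x > c:
            return True
    return False


def crossing_frequency(c, horizon=100, trials=5000, seed=0):
    """Empirical frequency of paths whose running maximum exceeds c."""
    rng = random.Random(seed)
    hits = sum(crosses(c, horizon, rng) for _ in range(trials))
    return hits / trials


for c in (2.0, 5.0, 20.0):
    freq = crossing_frequency(c)
    print(f"c={c}: empirical crossing frequency {freq:.4f}, Ville bound {1.0 / c:.4f}")
```

Because the simulation stops at a finite horizon, it can only underestimate the probability in equation 21, so the empirical frequency should stay at or below $1/c$ up to Monte Carlo noise. Markov’s inequality alone would bound the crossing probability at a single fixed $n$, whereas Ville’s inequality covers all $n$ simultaneously.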

A.2 Proof of Theorem 5.1

Proof.

Our proof comprises proving the following two ordered parts:
(1) The first part demonstrates that, under the null hypothesis $H_0$, the independence between the unqueried label random variables and the corresponding feature random variables still holds after the adaptive label queries. In particular, under $H_0$, the feature and label variables $\mathbf{S}_i$ and $Z_i$ used to construct the test statistic in equation 3 of the proposed framework are independent $\forall i \in \left[N_q\right]$.
(2) In the second part, we consider $\tilde{W}_n = \prod_{i=1}^{n} \frac{P(Z_i)}{Q_i(Z_i \mid \mathbf{S}_i)}$, which is the test statistic in equation 2 with the true class prior $P(z)$ plugged in. The second part demonstrates the following inequalities under $H_0$:

$$P_0\left(\exists n \in \left[N_q\right],\ W_n = \prod_{i=1}^{n} \frac{\hat{P}(Z_i)}{Q_i\left(Z_i \mid \mathbf{S}_i\right)} \leq \alpha\right) \leq P_0\left(\exists n \in \left[N_q\right],\ \tilde{W}_n = \prod_{i=1}^{n} \frac{P(Z_i)}{Q_i(Z_i \mid \mathbf{S}_i)} \leq \alpha\right) \leq \alpha. \tag{22}$$

Equation 22 immediately implies that the Type I error of our proposed framework is upper-bounded by $\alpha$.
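For orientation, the last inequality in equation 22 can be sketched as a direct application of Theorem A.2. The sketch below is the standard argument and not necessarily the exact route taken in the remainder of this proof. It assumes that $Q_i$ is fixed by the data observed before $Z_i$ is revealed and that $P(z) > 0$ for both classes. The symbols $M_n$ and $\mathcal{F}_{i-1}$ (the information available before the $i$-th query) are introduced here for illustration and are not part of the original notation.

```latex
\begin{align*}
M_n &:= \frac{1}{\tilde{W}_n}
     = \prod_{i=1}^{n} \frac{Q_i(Z_i \mid \mathbf{S}_i)}{P(Z_i)},
     \qquad M_0 = 1, \\
\mathbb{E}_{P_0}\!\left[ \frac{Q_i(Z_i \mid \mathbf{S}_i)}{P(Z_i)}
     \,\middle|\, \mathcal{F}_{i-1}, \mathbf{S}_i \right]
    &= \sum_{z \in \{0,1\}} P(z)\, \frac{Q_i(z \mid \mathbf{S}_i)}{P(z)}
     = \sum_{z \in \{0,1\}} Q_i(z \mid \mathbf{S}_i) = 1, \\
P_0\!\left( \exists n \in [N_q],\ \tilde{W}_n \leq \alpha \right)
    &= P_0\!\left( \max_{n \in [N_q]} M_n \geq \frac{1}{\alpha} \right)
     \leq \alpha\, \mathbb{E}\left[ M_0 \right] = \alpha .
\end{align*}
```

The second line uses the independence claimed in the first part of the proof: under $H_0$, $Z_i$ is distributed according to $P$ regardless of $\mathbf{S}_i$ and of the past, so $\{M_n\}$ is a nonnegative martingale. The third line uses Ville’s inequality in its non-strict form, $P\left(\sup_n M_n \geq c\right) \leq \mathbb{E}[M_0]/c$, with $c = 1/\alpha$.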

  • Proof for the first part
    We write $\mathcal{S}_u$ and $\mathcal{Z}_u$ to denote the set of original unlabeled feature variables held by the analyst and the set of unrevealed label variables provided by an oracle. We write $\mathcal{S}^l_i$ and $\mathcal{Z}^l_i$ to denote the sets of labeled feature variables and the corresponding label variables after including the $i$-th pair $(\mathbf{S}_i, Z_i)$ to construct the statistic in equation 3. We use $\mathcal{S}^u_i = \mathcal{S}_u \setminus \mathcal{S}^l_i$ and $\mathcal{Z}^u_i = \mathcal{Z}_u \setminus \mathcal{Z}^l_i$ to denote their complements, which comprise the unlabeled feature variables and the unrevealed label variables. In particular, we use $\mathcal{S}^l_0$ and $\mathcal{Z}^l_0$ to denote the feature and label variable sets used to initialize $Q_1\left(z \mid \mathbf{s}\right)$; $\mathcal{S}^u_0 = \mathcal{S}_u \setminus \mathcal{S}^l_0$ and $\mathcal{Z}^u_0 = \mathcal{Z}_u \setminus \mathcal{Z}^l_0$ are their complements, which comprise the unlabeled feature variables and the unrevealed label variables. $H_0$ being true implies $\mathcal{S}_u \perp\!\!\!\!\perp \mathcal{Z}_u$. In our setting, the analyst randomly samples features and labels them to build $\mathcal{S}^l_0$ and $\mathcal{Z}^l_0$, implying $\mathcal{S}^l_0 \perp\!\!\!\!\perp \mathcal{Z}^l_0$ and $\mathcal{S}^u_0 \perp\!\!\!\!\perp \mathcal{Z}^u_0$ when $H_0$ is true. In the following, we use induction to prove that $\mathbf{S}_i$ and $Z_i$ are independent $\forall i \in \left[N_q\right]$.

    Base case ($i = 1$): Under $H_0$, we have $\mathcal{S}^l_0 \perp\!\!\!\!\perp \mathcal{Z}^l_0$ and $\mathcal{S}^u_0 \perp\!\!\!\!\perp \mathcal{Z}^u_0$. The analyst first initializes $Q_1(z \mid \mathbf{s})$ with the independent sets $\mathcal{S}^l_0$ and $\mathcal{Z}^l_0$ before starting the sequential testing. Subsequently, the analyst queries a label based on the prediction of $Q_1(z \mid \mathbf{s})$ and includes the first variable pair $(\mathbf{S}_1, Z_1)$ to construct the test statistic. Because the choice of $\mathbf{S}_1$ depends only on $Q_1$ and the unlabeled features, and not on the unrevealed labels, this immediately implies $\mathbf{S}_1 \perp\!\!\!\!\perp Z_1$, $\mathcal{S}^l_1 \perp\!\!\!\!\perp \mathcal{Z}^l_1$, and $\mathcal{S}^u_1 \perp\!\!\!\!\perp \mathcal{Z}^u_1$.

    Induction step: Suppose $\mathcal{S}^u_i \perp\!\!\!\!\perp \mathcal{Z}^u_i$ and $\mathcal{S}^l_i \perp\!\!\!\!\perp \mathcal{Z}^l_i$. Given these independent sets, the analyst updates $Q_i\left(z \mid \mathbf{s}\right)$ to $Q_{i+1}\left(z \mid \mathbf{s}\right)$, queries a label based on the prediction of $Q_{i+1}(z \mid \mathbf{s})$, and includes the $(i+1)$-th variable pair $\left(\mathbf{S}_{i+1}, Z_{i+1}\right)$ to update the statistic. As in the base case, this immediately implies $\mathbf{S}_{i+1} \perp\!\!\!\!\perp Z_{i+1}$, $\mathcal{S}^u_{i+1} \perp\!\!\!\!\perp \mathcal{Z}^u_{i+1}$, and $\mathcal{S}^l_{i+1} \perp\!\!\!\!\perp \mathcal{Z}^l_{i+1}$.

    Combining the base case and the induction step leads to $\mathbf{S}_i \perp\!\!\!\!\perp Z_i, \forall i \in \left[N_q\right]$ under $H_0$.

  • Proof for the second part
    Suppose $\left(\left(\mathbf{s}, z\right)_i\right)_{i=1}^{n}$ is a sequence of realizations of $\left(\left(\mathbf{S}, Z\right)_i\right)_{i=1}^{n}$ collected under $H_0$ and the proposed framework. We use $\phi$ to denote a class-one prior probability parameter, so that $P\left(z_1, \cdots, z_n \mid \phi\right)$ is a likelihood function of $\phi$. Maximizing $P\left(z_1, \cdots, z_n \mid \phi\right)$ over the prior parameter $\phi$ yields the solution $\phi^* = \frac{\sum_{i=1}^{n} z_i}{n}$. In other words, $P\left(z_1, \cdots, z_n \mid \phi^*\right) = \prod_{i=1}^{n} \hat{P}(z_i)$ is the maximized likelihood obtained from $(z_i)_{i=1}^{n}$, where $\phi^* = \hat{P}(Z=1)$. We use $P(Z=1)$ to denote the true class-one prior probability under $H_0$; substituting $P(Z=1)$ for $\phi$ gives the true likelihood $\prod_{i=1}^{n} P(z_i)$ of $(z_i)_{i=1}^{n}$ under $H_0$. Since $\phi^*$ maximizes the likelihood, $\prod_{i=1}^{n} \hat{P}(z_i) \geq \prod_{i=1}^{n} P(z_i)$, and thus $\prod_{i=1}^{n} \frac{\hat{P}(z_i)}{Q_i(z_i \mid \mathbf{s}_i)} \geq \prod_{i=1}^{n} \frac{P(z_i)}{Q_i(z_i \mid \mathbf{s}_i)}$ for any realization $(z_i)_{i=1}^{n}$ of $(Z_i)_{i=1}^{n}$ under $H_0$. As a result, we have $P_0\left(\exists n \in \left[N_q\right],\ W_n = \prod_{i=1}^{n} \frac{\hat{P}(Z_i)}{Q_i(Z_i \mid \mathbf{S}_i)} \leq \alpha\right) \leq P_0\left(\exists n \in \left[N_q\right],\ \tilde{W}_n = \prod_{i=1}^{n} \frac{P(Z_i)}{Q_i\left(Z_i \mid \mathbf{S}_i\right)} \leq \alpha\right)$
end_ARG ≤ italic_α ).
    Lastly, we prove $P_{0}\left(\exists n\in\left[N_{q}\right],\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i-1}\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\leq\alpha$. We let $\tilde{W}^{\prime}_{n}\equiv\frac{1}{\tilde{W}_{n}}$. Therefore, $\tilde{W}^{\prime}_{n}\equiv\tilde{W}^{\prime}_{n-1}\frac{Q_{n-1}(Z_{n}\mid\mathbf{S}_{n})}{P(Z_{n})}$ with $\tilde{W}^{\prime}_{0}\equiv 1$ for $n\in\left[N_{q}\right]$. The sequence $(\tilde{W}^{\prime}_{i})_{i=1}^{n}$ is a non-negative martingale under $H_{0}$, since

    \begin{align}
    \mathbb{E}\left[\tilde{W}^{\prime}_{n}\,\middle|\,\tilde{W}^{\prime}_{1},\cdots,\tilde{W}^{\prime}_{n-1}\right] &\equiv\mathbb{E}\left[\tilde{W}^{\prime}_{n-1}\frac{Q_{n-1}(Z_{n}\mid\mathbf{S}_{n})}{P(Z_{n})}\,\middle|\,\tilde{W}^{\prime}_{1},\cdots,\tilde{W}^{\prime}_{n-1}\right] \tag{23}\\
    &\equiv\tilde{W}^{\prime}_{n-1}\mathbb{E}\left[\frac{Q_{n-1}(Z_{n}\mid\mathbf{S}_{n})}{P(Z_{n})}\,\middle|\,\tilde{W}^{\prime}_{1},\cdots,\tilde{W}^{\prime}_{n-1}\right] \tag{24}\\
    &=\tilde{W}^{\prime}_{n-1}\mathbb{E}\left[\sum_{z=0}^{1}P(Z_{n}=z)\frac{Q_{n-1}(Z_{n}=z\mid\mathbf{S}_{n})}{P(Z_{n}=z)}\right] \tag{25}\\
    &=\tilde{W}^{\prime}_{n-1} \tag{26}
    \end{align}

    Using Ville's maximal inequality in Theorem A.2 leads to the following: for any $\alpha>0$, we have

    \begin{align}
    & P\left(\sup_{n\in\left[N_{q}\right]}\tilde{W}^{\prime}_{n}\geq\frac{1}{\alpha}\right)\leq\alpha\,\mathbb{E}\left[\tilde{W}^{\prime}_{0}\right]=\alpha \tag{27}\\
    \iff{} & P\left(\sup_{n\in\left[N_{q}\right]}\frac{1}{\tilde{W}_{n}}\geq\frac{1}{\alpha}\right)\leq\alpha\,\mathbb{E}\left[\frac{1}{\tilde{W}_{0}}\right]=\alpha \tag{28}\\
    \iff{} & P\left(\inf_{n\in\left[N_{q}\right]}\tilde{W}_{n}\leq\alpha\right)\leq\alpha \tag{29}
    \end{align}

    Therefore, we have $P_{0}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{\hat{P}(Z_{i})}{Q_{i-1}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)\leq P_{0}\left(\exists n\in\left[N_{q}\right],\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i-1}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)\leq\alpha$.
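
As an informal sanity check of the bound above, the following Python sketch simulates label sequences under $H_{0}$ and counts how often the statistics ever drop to $\alpha$ or below. It is a minimal illustration under stated assumptions, not the instantiation used in the paper: the Bernoulli prior `prior_one`, the ad hoc predictor `q_one` (a smoothed label frequency mixed with the feature), and all function names are hypothetical. The only property the predictor needs here is that it is built from the first $n-1$ labels and the current feature.

```python
import math
import random


def log_lik(k, n, p):
    """Log-likelihood of k ones among n Bernoulli(p) labels, with 0 * log 0 := 0."""
    out = 0.0
    if k > 0:
        out += k * math.log(p)
    if n - k > 0:
        out += (n - k) * math.log(1.0 - p)
    return out


def min_statistics(n_queries, prior_one, rng):
    """Simulate one label sequence under H0; return (min_n W_n, min_n W~_n)."""
    ones, log_q = 0, 0.0
    min_log_w = min_log_w_tilde = float("inf")
    for n in range(1, n_queries + 1):
        s = rng.random()  # feature, independent of the label under H0
        # Hypothetical predictor Q_{n-1}(Z_n = 1 | s): uses labels 1..n-1 and s only.
        q_one = 0.5 * (ones + 1) / (n + 1) + 0.5 * s
        z = 1 if rng.random() < prior_one else 0
        log_q += math.log(q_one if z == 1 else 1.0 - q_one)
        ones += z
        log_w = log_lik(ones, n, ones / n) - log_q          # maximized prior in the numerator
        log_w_tilde = log_lik(ones, n, prior_one) - log_q   # true prior in the numerator
        min_log_w = min(min_log_w, log_w)
        min_log_w_tilde = min(min_log_w_tilde, log_w_tilde)
    return math.exp(min_log_w), math.exp(min_log_w_tilde)


def false_alarm_rates(alpha=0.05, runs=2000, n_queries=200, prior_one=0.3, seed=0):
    """Fraction of H0 runs in which W_n (resp. W~_n) ever drops to alpha or below."""
    rng = random.Random(seed)
    hits_w = hits_w_tilde = 0
    for _ in range(runs):
        min_w, min_w_tilde = min_statistics(n_queries, prior_one, rng)
        hits_w += min_w <= alpha
        hits_w_tilde += min_w_tilde <= alpha
    return hits_w / runs, hits_w_tilde / runs


rate_w, rate_w_tilde = false_alarm_rates()
print(f"empirical false-alarm rates: W_n {rate_w:.4f}, W~_n {rate_w_tilde:.4f}")
```

Up to Monte Carlo error, both empirical rates should stay at or below $\alpha$, and the rate for $W_{n}$ should not exceed the rate for $\tilde{W}_{n}$, mirroring the two inequalities in the proof.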

Appendix B Proof of Theorem 5.4

Proof.

In the following, we formulate an optimization problem that seeks a marginal distribution $g(\mathbf{s})$ maximizing the mutual information (MI) between $\mathbf{S}$ and $Z$, where $\left(\mathbf{S},Z\right)\sim g(\mathbf{s})P\left(z\mid\mathbf{s}\right)$. Solving this optimization problem leads to a consistent bimodal query (see Definition 5.2), asymptotically minimizing the test statistic in equation 2.

  • Constructing an optimization problem that maximizes MI
    We write $g\left(\mathbf{s}\right)$ to denote an arbitrary probability distribution of $\mathbf{s}$. Recall that $P(z\mid\mathbf{s})$ and $p(\mathbf{s})$ denote the class probability given $\mathbf{s}$ and the marginal probability distribution of $\mathbf{s}$ for the two-sample testing problem at hand; we write $g\left(\mathbf{s},z\right)=g(\mathbf{s})P(z\mid\mathbf{s})$ and $G(z)=\int g\left(\mathbf{s},z\right)d\mathbf{s}$ to denote the joint probability distribution and the class prior for a new two-sample testing problem with the original $p(\mathbf{s})$ replaced by $g(\mathbf{s})$. The mutual information (MI) that characterizes the new two-sample testing problem is as follows

    $$\text{MI}=-\sum_{z=0}^{1}G(z)\log\left(G(z)\right)+\int\left(\sum_{z=0}^{1}P(z\mid\mathbf{s})\log\left(P\left(z\mid\mathbf{s}\right)\right)\right)g(\mathbf{s})d\mathbf{s} \tag{30}$$

    We expand equation 30 and consider the following optimization problem,

    $$\max_{g(\mathbf{s})}\;-\sum_{z=0}^{1}\left(\int P(z\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}\right)\log\left(\int P(z\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}\right)+\int\left(\sum_{z=0}^{1}P(z\mid\mathbf{s})\log\left(P(z\mid\mathbf{s})\right)\right)g(\mathbf{s})d\mathbf{s} \tag{31}$$

    In other words, equation 31 seeks a $g(\mathbf{s})$ that maximizes the MI of a new two-sample testing problem with $P(z\mid\mathbf{s})$ provided by the original two-sample testing problem. In what follows, we will see that solving equation 31 leads to a probability distribution from which a consistent bimodal query (see Definition 5.2) results, proving the asymptotic property in Theorem 5.4. Instead of directly solving equation 31, we fix $G(Z=0)=\int P(Z=0\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}=u$, so that the first term of equation 31 is constant, and resort to finding the solution of the following,

    \begin{align}
    \min_{g(\mathbf{s})}\quad & -\int\left(\sum_{z=0}^{1}P(z\mid\mathbf{s})\log\left(P\left(z\mid\mathbf{s}\right)\right)\right)g(\mathbf{s})d\mathbf{s} \tag{32}\\
    \text{s.t.}\quad & \int P(Z=0\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}=u, \tag{33}\\
    & \int g(\mathbf{s})d\mathbf{s}=1, \tag{34}\\
    & g(\mathbf{s})\geq 0,\ \forall\mathbf{s}\in\mathcal{S}. \tag{35}
    \end{align}

    Then, we approximate equation 32 with a discrete version of the same problem by partitioning the sample space $\mathcal{S}$ into $L$ balls $\{B\left(\mathbf{s}_{i},r\right)\}_{i=1}^{L}$ with $L>2$. Each $B\left(\mathbf{s},r\right)\in\{B\left(\mathbf{s}_{i},r\right)\}_{i=1}^{L}$ has radius $r$ and is centered at $\mathbf{s}$, leading to the approximation $\hat{P}(Z=0\mid\mathbf{s})=\int P(Z=0\mid\mathbf{s}^{\prime})p\left(\mathbf{s}^{\prime}\mid B(\mathbf{s},r)\right)d\mathbf{s}^{\prime}$ and the probability mass function $G(\mathbf{s})=\int_{\mathbf{s}^{\prime}\in B\left(\mathbf{s},r\right)}g(\mathbf{s}^{\prime})d\mathbf{s}^{\prime}$. Hence, we approximate equation 32 by the following linear program (LP):

    \begin{align}
    \min_{G(\mathbf{s})}\quad & \sum_{i=1}^{L}H_{i}(Z)G(\mathbf{s}_{i}) \tag{36}\\
    \text{s.t.}\quad & \sum_{i=1}^{L}\hat{P}(Z=0\mid\mathbf{s}_{i})G(\mathbf{s}_{i})=u, \tag{37}\\
    & \sum_{i=1}^{L}G(\mathbf{s}_{i})=1, \tag{38}\\
    & G(\mathbf{s}_{i})\geq 0,\ \forall i\in\left[L\right]. \tag{39}
    \end{align}

    where $H_{i}(Z)=-\sum_{z=0}^{1}\hat{P}(z\mid\mathbf{s}_{i})\log\left(\hat{P}(z\mid\mathbf{s}_{i})\right)$, $\forall i\in\left[L\right]$, are the constant coefficients of the LP in equation 36.
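
The discretized quantities above can be evaluated directly on a small grid. The Python sketch below is an illustration under assumptions: the five cell-wise class probabilities in `p0_cells` and the two candidate mass functions are hypothetical values chosen so that both candidates share the same class prior $u=0.5$. It computes the MI in the entropy form of equation 30, cross-checks it against the definition of MI, and exposes the conditional-entropy term that the LP in equation 36 minimizes.

```python
import math


def binary_entropy(p0):
    """Entropy (in nats) of a binary label with P(Z=0) = p0, using 0 * log 0 := 0."""
    return -sum(p * math.log(p) for p in (p0, 1.0 - p0) if p > 0.0)


def mi_entropy_form(p0_given_s, g):
    """Discrete analogue of equation 30: H(G(z)) - sum_i H_i(Z) G(s_i)."""
    u = sum(p * w for p, w in zip(p0_given_s, g))  # class prior G(Z=0), as in equation 37
    cond_entropy = sum(binary_entropy(p) * w for p, w in zip(p0_given_s, g))  # objective of equation 36
    return binary_entropy(u) - cond_entropy, u, cond_entropy


def mi_direct(p0_given_s, g):
    """MI from its definition: sum over (s, z) of g(s) P(z|s) log(P(z|s) / G(z))."""
    u = sum(p * w for p, w in zip(p0_given_s, g))
    total = 0.0
    for p0, w in zip(p0_given_s, g):
        for p_z, g_z in ((p0, u), (1.0 - p0, 1.0 - u)):
            if w > 0.0 and p_z > 0.0:
                total += w * p_z * math.log(p_z / g_z)
    return total


# Hypothetical cell-wise class probabilities P_hat(Z=0 | s_i) for L = 5 cells.
p0_cells = [0.05, 0.3, 0.5, 0.7, 0.95]
uniform = [0.2] * 5                   # spreads the query mass over all cells
bimodal = [0.5, 0.0, 0.0, 0.0, 0.5]   # puts all mass on the two most certain cells

mi_uniform, u_uniform, h_uniform = mi_entropy_form(p0_cells, uniform)
mi_bimodal, u_bimodal, h_bimodal = mi_entropy_form(p0_cells, bimodal)
print(f"uniform: u={u_uniform:.3f}, objective={h_uniform:.4f}, MI={mi_uniform:.4f}")
print(f"bimodal: u={u_bimodal:.3f}, objective={h_bimodal:.4f}, MI={mi_bimodal:.4f}")
```

With the class prior held fixed, the candidate with the smaller LP objective is the one with the larger MI, which is why the constrained minimization can stand in for the original maximization.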

  • Solving the optimization problem
    The constraints in equation 37 and equation 38, together with the non-negativity constraint in equation 39, define the region of feasible solutions to the LP in equation 36; we write this region as $U=\{\mathbf{x}\mid\mathbf{x}\text{ is non-negative and }\mathbf{x}\text{ satisfies equation 37 and equation 38}\}$, where $\mathbf{x}=\left(G\left(\mathbf{s}_{1}\right),\cdots,G\left(\mathbf{s}_{L}\right)\right)$. In addition, we need one more definition, that of a basic solution to a system of linear equations, which is well known in linear algebra.

    Definition B.1.

    (Basic solutions) Let $A\mathbf{x}=b$ be a system of linear equations. Let $\{\mathbf{x}_{j_{1}},\cdots,\mathbf{x}_{j_{k}}\}$ be positive and the other entries of $\mathbf{x}$ be zero. If the corresponding columns $A_{j_{1}},\cdots,A_{j_{k}}$ are linearly independent, then $\mathbf{x}$ is a basic solution to the system.

    Moreover, we will need the following theorems to derive the optimal feasible solution for the LP.

    Theorem B.2.

    If the feasible region of an LP is bounded, then at least one optimal solution occurs at a vertex of the corresponding polytope (or the feasible region).

    Theorem B.3.

    Let $U$ be the feasible region of a linear program. Then, $\mathbf{x}\in U$ is a basic feasible solution if and only if $\mathbf{x}$ is a vertex of $U$.

    Theorem B.2 and Theorem B.3 are well known in linear programming; we refer interested readers to (Miller, 2007) for their proofs. Because the equality constraints in equation 37 and equation 38 form a linear system with only two rows, at most two of its columns can be linearly independent; Theorem B.2 and Theorem B.3 therefore suggest that one optimal solution of equation 36 is a vector $\left(G\left(\mathbf{s}_{1}\right),\cdots,G\left(\mathbf{s}_{L}\right)\right)$ with at most two non-zero entries. Herein, we write $G(\mathbf{s}_{q_{0}})$ and $G(\mathbf{s}_{q_{1}})$ to denote the two non-zero entries. That reduces the LP in equation 36 to the following:

    \begin{align}
    \max_{q_{0},q_{1}}\quad & \left(\sum_{z=0}^{1}\hat{P}\left(z\mid\mathbf{s}_{q_{0}}\right)\log\hat{P}\left(z\mid\mathbf{s}_{q_{0}}\right)\right)G\left(\mathbf{s}_{q_{0}}\right)+\left(\sum_{z=0}^{1}\hat{P}\left(z\mid\mathbf{s}_{q_{1}}\right)\log\hat{P}\left(z\mid\mathbf{s}_{q_{1}}\right)\right)G\left(\mathbf{s}_{q_{1}}\right) \tag{40}\\
    \text{s.t.}\quad & \hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)G\left(\mathbf{s}_{q_{0}}\right)+\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)G\left(\mathbf{s}_{q_{1}}\right)=u, \tag{41}\\
    & G(\mathbf{s}_{q_{0}})+G(\mathbf{s}_{q_{1}})=1, \tag{42}\\
    & G(\mathbf{s}_{q_{0}})\geq 0,\ G(\mathbf{s}_{q_{1}})\geq 0. \tag{43}
    \end{align}
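
The reduced problem in equations 40 to 43 is small enough to solve by enumeration, which may help make the vertex argument concrete. The Python sketch below is a brute-force illustration under assumptions, not the derivation pursued in this proof: the values in `p0_hat`, the choice `u=0.5`, and the function names are hypothetical. For each pair $(q_{0},q_{1})$ it solves the two equality constraints for the two masses, discards pairs that violate non-negativity, and keeps the pair with the largest objective.

```python
import math
from itertools import combinations


def neg_entropy(p0):
    """sum_z P(z|s) log P(z|s) for a binary label with P(Z=0|s) = p0 (0 * log 0 := 0)."""
    return sum(p * math.log(p) for p in (p0, 1.0 - p0) if p > 0.0)


def best_two_point_query(p0_hat, u):
    """Enumerate pairs (q0, q1); return (value, q0, q1, g0, g1) or None if infeasible."""
    best = None
    for q0, q1 in combinations(range(len(p0_hat)), 2):
        a0, a1 = p0_hat[q0], p0_hat[q1]
        if a0 == a1:
            continue  # columns are linearly dependent, so this is not a basic solution
        # Solve a0*g0 + a1*g1 = u (equation 41) together with g0 + g1 = 1 (equation 42).
        g0 = (u - a1) / (a0 - a1)
        g1 = 1.0 - g0
        if g0 < 0.0 or g1 < 0.0:
            continue  # violates the non-negativity constraint in equation 43
        value = neg_entropy(a0) * g0 + neg_entropy(a1) * g1  # objective of equation 40
        if best is None or value > best[0]:
            best = (value, q0, q1, g0, g1)
    return best


# Hypothetical cell-wise class probabilities P_hat(Z=0 | s_i) and class prior u.
p0_hat = [0.05, 0.3, 0.5, 0.7, 0.95]
value, q0, q1, g0, g1 = best_two_point_query(p0_hat, u=0.5)
print(f"best pair: cells {q0} and {q1}, masses {g0:.3f} and {g1:.3f}, objective {value:.4f}")
```

In this toy instance the enumeration selects the two cells whose class probabilities are closest to 0 and 1, which matches the intuition behind a bimodal query.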

    To simplify the expressions in what follows, we write

    \begin{align}
    T_{0} &=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right), \tag{44}\\
    T_{1} &=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right), \tag{45}\\
    T_{2} &=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right), \tag{46}\\
    T_{3} &=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right). \tag{47}
    \end{align}

    Then, equation 40 can be re-expressed as follows,

    \begin{align}
    \max_{q_0, q_1} \quad & \frac{T_0 \left(u - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})\right)}{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})} + \frac{T_1 \left(\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - u\right)}{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})} \tag{48} \\
    \text{s.t.} \quad & \hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1}) > 0, \tag{49} \\
    & \hat{P}(Z=0 \mid \mathbf{s}_{q_1}) \le u, \tag{50} \\
    & \hat{P}(Z=0 \mid \mathbf{s}_{q_0}) \ge u. \tag{51}
    \end{align}

    Equation 48 is an optimization problem that finds $\left\{\hat{P}(z \mid \mathbf{s}_{q_0}), \hat{P}(z \mid \mathbf{s}_{q_1})\right\} \subset \left\{\hat{P}(z \mid \mathbf{s}_i)\right\}_{i=1}^{L}$ to maximize the objective function. Herein, we write

    \begin{align}
    A &= \frac{T_0}{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})}, \tag{52} \\
    B &= u - \hat{P}(Z=0 \mid \mathbf{s}_{q_1}), \tag{53} \\
    C &= \frac{T_1}{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})}, \tag{54} \\
    D &= \hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - u. \tag{55}
    \end{align}

    Now, we analyze the derivatives of equation 48 by checking the partial derivatives of $A$, $B$, $C$ and $D$ with respect to $\hat{P}(Z=0 \mid \mathbf{s}_{q_0})$ and $\hat{P}(Z=0 \mid \mathbf{s}_{q_1})$:

    \begin{align}
    \frac{\partial A}{\partial \hat{P}(Z=0 \mid \mathbf{s}_{q_0})} &= \frac{-T_2}{\left(\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})\right)^2} > 0, \tag{56} \\
    \frac{\partial A}{\partial \hat{P}(Z=0 \mid \mathbf{s}_{q_1})} &= \frac{T_0}{\left(\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})\right)^2} < 0, \tag{57} \\
    \frac{\partial B}{\partial \hat{P}(Z=0 \mid \mathbf{s}_{q_1})} &= -1, \tag{58} \\
    \frac{\partial C}{\partial \hat{P}(Z=0 \mid \mathbf{s}_{q_0})} &= \frac{-T_1}{\left(\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})\right)^2} > 0, \tag{59} \\
    \frac{\partial C}{\partial \hat{P}(Z=0 \mid \mathbf{s}_{q_1})} &= \frac{T_3}{\left(\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})\right)^2} < 0, \tag{60} \\
    \frac{\partial D}{\partial \hat{P}(Z=0 \mid \mathbf{s}_{q_0})} &= 1. \tag{61}
    \end{align}
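
The signs in equations 56 to 61 do not by themselves determine the sign of the derivative of the objective $AB + CD$, since $A$ and $C$ are negative. The following worked step combines them with the product rule. It is a sketch under two assumptions: that $T_0$ and $T_1$ are the negative binary entropies of $\hat{P}(Z=0 \mid \mathbf{s}_{q_0})$ and $\hat{P}(Z=0 \mid \mathbf{s}_{q_1})$ (which is consistent with equations 56 and 60), and that the constraints in equations 49 to 51 hold. The shorthand $p_0 = \hat{P}(Z=0 \mid \mathbf{s}_{q_0})$, $p_1 = \hat{P}(Z=0 \mid \mathbf{s}_{q_1})$ is introduced here only for brevity.

```latex
\begin{align*}
\frac{\partial (AB + CD)}{\partial p_0}
  &= \frac{\partial A}{\partial p_0} B + \frac{\partial C}{\partial p_0} D + C \frac{\partial D}{\partial p_0}
   = \frac{(u - p_1)\,(T_1 - T_2)}{(p_0 - p_1)^2} \ge 0, \\
\frac{\partial (AB + CD)}{\partial p_1}
  &= \frac{\partial A}{\partial p_1} B + A \frac{\partial B}{\partial p_1} + \frac{\partial C}{\partial p_1} D
   = \frac{(p_0 - u)\,(T_3 - T_0)}{(p_0 - p_1)^2} \le 0, \\
T_1 - T_2 &= p_1 \log\frac{p_1}{p_0} + (1 - p_1) \log\frac{1 - p_1}{1 - p_0}
   = D_{\text{KL}}\left(\mathrm{Bern}(p_1) \,\|\, \mathrm{Bern}(p_0)\right) \ge 0, \\
T_0 - T_3 &= p_0 \log\frac{p_0}{p_1} + (1 - p_0) \log\frac{1 - p_0}{1 - p_1}
   = D_{\text{KL}}\left(\mathrm{Bern}(p_0) \,\|\, \mathrm{Bern}(p_1)\right) \ge 0.
\end{align*}
```

The two inequalities follow from $u - p_1 \ge 0$ and $p_0 - u \ge 0$ together with the non-negativity of the KL divergence.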

    Therefore, the objective in equation 48 monotonically increases as $\hat{P}(Z=0 \mid \mathbf{s}_{q_0})$ increases and as $\hat{P}(Z=0 \mid \mathbf{s}_{q_1})$ decreases, implying that the optimal solution to equation 36 has the following probability mass function $G^*$,

    \begin{align}
    G^*(\mathbf{s}_{q_0}) &= \frac{u - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})}{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})}, \quad \mathbf{s}_{q_0} = \arg\max_{\mathbf{s}} \hat{P}(Z=0 \mid \mathbf{s}), \tag{62} \\
    G^*(\mathbf{s}_{q_1}) &= \frac{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - u}{\hat{P}(Z=0 \mid \mathbf{s}_{q_0}) - \hat{P}(Z=0 \mid \mathbf{s}_{q_1})}, \quad \mathbf{s}_{q_1} = \arg\max_{\mathbf{s}} \hat{P}(Z=1 \mid \mathbf{s}), \tag{63} \\
    G^*(\mathbf{s}) &= 0, \quad \forall \mathbf{s} \in \{\mathbf{s}_i\}_{i=1}^{L} \setminus \{\mathbf{s}_{q_0}, \mathbf{s}_{q_1}\}. \tag{64}
    \end{align}
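
As a numerical sanity check of equations 62 to 64, the following minimal sketch compares the closed-form two-point solution with a brute-force search over all feasible pairs in equation 48. It assumes that $T_0$ and $T_1$ are the negative binary entropies of $\hat{P}(Z=0 \mid \mathbf{s}_{q_0})$ and $\hat{P}(Z=0 \mid \mathbf{s}_{q_1})$, as their definitions precede this passage. The probability values, the prior `u`, and all function names are hypothetical and serve only as illustration.

```python
import itertools
import math


def neg_entropy(p):
    """Negative binary entropy p*log(p) + (1-p)*log(1-p), with 0*log(0) = 0."""
    return sum(x * math.log(x) for x in (p, 1.0 - p) if x > 0.0)


def two_point_solution(probs, u):
    """Return (q0, q1, G(s_q0), G(s_q1)) in the form of equations 62 and 63."""
    q0 = max(range(len(probs)), key=lambda i: probs[i])  # argmax of P_hat(Z=0 | s)
    q1 = min(range(len(probs)), key=lambda i: probs[i])  # argmax of P_hat(Z=1 | s)
    gap = probs[q0] - probs[q1]
    return q0, q1, (u - probs[q1]) / gap, (probs[q0] - u) / gap


def objective(probs, u, q0, q1):
    """Objective of equation 48 for a candidate pair (q0, q1)."""
    gap = probs[q0] - probs[q1]
    t0, t1 = neg_entropy(probs[q0]), neg_entropy(probs[q1])
    return (t0 * (u - probs[q1]) + t1 * (probs[q0] - u)) / gap


probs = [0.15, 0.40, 0.55, 0.70, 0.90]  # hypothetical P_hat(Z=0 | s_i), i = 1..L
u = 0.5                                 # hypothetical class prior P(Z=0)

q0, q1, g0, g1 = two_point_solution(probs, u)

# Brute force: enumerate every ordered pair satisfying constraints 49 to 51.
feasible = [
    (a, b)
    for a, b in itertools.permutations(range(len(probs)), 2)
    if probs[a] > probs[b] and probs[b] <= u <= probs[a]
]
best = max(feasible, key=lambda pair: objective(probs, u, *pair))

print("closed form pair:", (q0, q1), "brute force pair:", best)
print("G(s_q0), G(s_q1):", g0, g1)
```

In this toy setting the brute-force maximizer coincides with the pair of extreme probabilities, and the two masses sum to one while reproducing the prior `u` as the mixture of the two selected probabilities.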

    Recall that the LP in equation 36 approximates the continuous optimization problem in equation 32 by partitioning the sample space $\mathcal{S}$ into $\{B(\mathbf{s}_i, r)\}_{i=1}^{L}$. Hence, by letting the radius $r$ shrink to zero, we obtain the optimal solution $p^*(\mathbf{s})$ of equation 32 as follows,

    \begin{align}
    \frac{p^*(\mathbf{s}_{q_0})}{p^*(\mathbf{s}_{q_1})} &= \frac{u - P(Z=0 \mid \mathbf{s}_{q_1})}{P(Z=0 \mid \mathbf{s}_{q_0}) - u}, \quad \mathbf{s}_{q_0} = \arg\max_{\mathbf{s}} \hat{P}(Z=0 \mid \mathbf{s}), \quad \mathbf{s}_{q_1} = \arg\max_{\mathbf{s}} \hat{P}(Z=1 \mid \mathbf{s}), \tag{65} \\
    p^*(\mathbf{s}) &= 0, \quad \forall \mathbf{s} \in \mathcal{S} \setminus \{\mathbf{s}_{q_0}, \mathbf{s}_{q_1}\}. \tag{66}
    \end{align}

    Varying $u$ leads to an optimal solution of the same form, with $p^*(\mathbf{s}) = 0, \forall \mathbf{s} \in \mathcal{S} \setminus \{\mathbf{s}_{q_0}, \mathbf{s}_{q_1}\}$ and $p^*(\mathbf{s}_{q_0}) > 0$, $p^*(\mathbf{s}_{q_1}) > 0$, but with a different ratio $\frac{p^*(\mathbf{s}_{q_0})}{p^*(\mathbf{s}_{q_1})}$. Furthermore, the maximizers need not be unique: there could exist a set $\mathcal{S}_{q_0} = \left\{\mathbf{s}_{q_0} \mid P(Z=0 \mid \mathbf{s}_{q_0}) = \max_{\mathbf{s} \in \mathcal{S}} P(Z=0 \mid \mathbf{s})\right\}$ whose elements share an identical $P(z \mid \mathbf{s}_{q_0})$, and likewise a set $\mathcal{S}_{q_1} = \left\{\mathbf{s}_{q_1} \mid P(Z=1 \mid \mathbf{s}_{q_1}) = \max_{\mathbf{s} \in \mathcal{S}} P(Z=1 \mid \mathbf{s})\right\}$ for $\mathbf{s}_{q_1}$. Hence, the optimal solution to the original optimization problem in equation 31 has the following form

    \begin{align}
    p^*(\mathbf{s}) &= 0, \ \forall \mathbf{s} \in \mathcal{S} \setminus \left(\mathcal{S}_{q_0} \cup \mathcal{S}_{q_1}\right), \text{ and } p^*(\mathbf{s}) > 0, \ \forall \mathbf{s} \in \mathcal{S}_{q_0} \cup \mathcal{S}_{q_1}, \tag{67} \\
    \mathcal{S}_{q_0} &= \left\{\mathbf{s}_{q_0} \;\middle|\; P(Z=0 \mid \mathbf{s}_{q_0}) = \max_{\mathbf{s} \in \mathcal{S}} P(Z=0 \mid \mathbf{s})\right\}, \tag{68} \\
    \mathcal{S}_{q_1} &= \left\{\mathbf{s}_{q_1} \;\middle|\; P(Z=1 \mid \mathbf{s}_{q_1}) = \max_{\mathbf{s} \in \mathcal{S}} P(Z=1 \mid \mathbf{s})\right\}. \tag{69}
    \end{align}

    Therefore, there exists a consistent bimodal query for which the labeled feature variables asymptotically follow a distribution admitting $p^*(\mathbf{s})$ (equation 67 to equation 69). This distribution maximizes the MI, and hence minimizes the negated MI, with $P(z \mid \mathbf{s})$ provided by the original two-sample testing problem.

Appendix C Proof of Theorem 5.10

Proof.

Testing power of the baseline case: Since the baseline case randomly samples features from $\mathcal{S}_u$ and queries their labels, the resulting variable pair $(\mathbf{S}_n, Z_n)$ collected by the analyst admits $p(\mathbf{s}, z), \forall n \in [N_q]$, in which $p(\mathbf{s}, z)$ is the joint distribution that characterizes the original two-sample testing problem. In addition, $Q(z \mid \mathbf{s})$ is initialized and stable, and the class prior $P(Z=0)$ is provided in the case study. Given the label budget $N_q$ and the significance level $\alpha$, we have the following inequalities for the testing power in the case study:

\begin{equation}
P_1\left(\exists n \in [N_q], W_n = \prod_{i=1}^{n} \frac{P(Z_i)}{Q(Z_i \mid \mathbf{S}_i)} \le \alpha\right) \ge P_1\left(W_{N_q} = \prod_{i=1}^{N_q} \frac{P(Z_i)}{Q(Z_i \mid \mathbf{S}_i)} \le \alpha\right), \quad (\mathbf{S}_n, Z_n) \sim p(\mathbf{s}, z) \tag{70}
\end{equation}

The inequality in equation 70 holds because sequentially comparing $w_n$ with $\alpha$ for all $n \in [N_q]$ yields a testing power at least as high as comparing $w_n$ with $\alpha$ only at $n = N_q$: the event on the right-hand side is contained in the event on the left-hand side. We subsequently rewrite the RHS of equation 70 as follows,

\begin{equation}
P_1\left(W_{N_q} = \prod_{i=1}^{N_q} \frac{P(Z_i)}{Q(Z_i \mid \mathbf{S}_i)} \le \alpha\right) = P_1\left(\frac{\log\left(W_{N_q}\right)}{N_q} = \frac{\sum_{i=1}^{N_q} \log\left(\frac{P(Z_i)}{Q(Z_i \mid \mathbf{S}_i)}\right)}{N_q} \le \frac{\log(\alpha)}{N_q}\right) \tag{71}
\end{equation}
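
The relation between the sequential and the fixed-budget rejection events in equations 70 and 71 can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the alternative is a two-component Gaussian model with equal class priors, $Q(z \mid \mathbf{s})$ is set to the exact posterior of that model, and the function names, the budget `n_q`, and the number of trials are hypothetical.

```python
import math
import random


def sample_pair(rng, prior0=0.5):
    """Draw (s, z) from a hypothetical alternative: S | Z=0 ~ N(1, 1), S | Z=1 ~ N(-1, 1)."""
    z = 0 if rng.random() < prior0 else 1
    s = rng.gauss(1.0 if z == 0 else -1.0, 1.0)
    return s, z


def log_q(z, s):
    """log Q(z | s) for the model above with equal priors: Q(Z=0 | s) = 1 / (1 + exp(-2s))."""
    t = -2.0 * s if z == 0 else 2.0 * s
    return -math.log1p(math.exp(t))


def run_trial(rng, n_q, alpha, prior0=0.5):
    """Return (rejected at some n <= n_q, rejected at n = n_q) for one label sequence."""
    log_alpha = math.log(alpha)
    log_w = 0.0  # log W_n, accumulated as a sum of log ratios (cf. equation 71)
    rejected_any = False
    for _ in range(n_q):
        s, z = sample_pair(rng, prior0)
        log_prior = math.log(prior0 if z == 0 else 1.0 - prior0)
        log_w += log_prior - log_q(z, s)
        if log_w <= log_alpha:  # W_n <= alpha at the current n
            rejected_any = True
    rejected_final = log_w <= log_alpha  # W_{N_q} <= alpha
    return rejected_any, rejected_final


rng = random.Random(0)
n_trials, n_q, alpha = 2000, 20, 0.05
results = [run_trial(rng, n_q, alpha) for _ in range(n_trials)]

power_sequential = sum(any_n for any_n, _ in results) / n_trials
power_fixed = sum(final for _, final in results) / n_trials
print("sequential power estimate:", power_sequential)
print("fixed-budget power estimate:", power_fixed)
```

Because a rejection at $n = N_q$ is also a rejection at some $n \in [N_q]$, the sequential estimate can never fall below the fixed-budget estimate in such a simulation, which mirrors equation 70.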

Since $\left\{(\mathbf{S}_i, Z_i)\right\}_{i=1}^{N_q}$ is an i.i.d. sequence, we drop the index $i$ in $(\mathbf{S}_i, Z_i)$ and analyze $\mathbb{E}\left[\log \frac{P(Z)}{Q(Z \mid \mathbf{S})}\right]$ and $\mathrm{Var}\left[\log \frac{P(Z)}{Q(Z \mid \mathbf{S})}\right]$ for $(\mathbf{S}, Z) \sim p(\mathbf{s}, z)$ in the following,

\begin{align}
\mathbb{E}\left[\log\frac{P(Z)}{Q\left(Z\mid\mathbf{S}\right)}\right] &= \mathbb{E}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}+\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right] \tag{72} \\
&= -I\left(\mathbf{S};Z\right)+D_{\text{KL}}\left(P\left(z\mid\mathbf{s}\right)\,\|\,Q\left(z\mid\mathbf{s}\right)\right) \tag{73} \\
&\leq -I\left(\mathbf{S};Z\right)+\sqrt{\epsilon_{1}}; \tag{74} \\
\mathrm{Var}\left[\log\frac{P(Z)}{Q\left(Z\mid\mathbf{S}\right)}\right] &= \mathrm{Var}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}+\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right] \tag{75} \\
&\leq \mathrm{Var}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}\right]+\mathrm{Var}\left[\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right]+2\sqrt{\mathrm{Var}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}\right]\mathrm{Var}\left[\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right]} \tag{76} \\
&\leq \sigma^{2}+\epsilon_{1}+2\sigma\sqrt{\epsilon_{1}}. \tag{77}
\end{align}

The inequalities in equation 74 and equation 77 are results of the following facts: $\epsilon_{1}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(q\left(\mathbf{s},z\right)\,\|\,p\left(\mathbf{s},z\right)\mid A\right)$ and $\sigma^{2}=\max\left\{\max_{A\in\mathcal{P}}\mathrm{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\mid A\right)}\bar{I}(\mathbf{S};Z),\ \mathrm{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)}\bar{I}(\mathbf{S};Z)\right\}$ over the partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$.
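The decomposition behind equation 72 and equation 73 can be checked numerically on a small discrete example. The following Python sketch is only an illustration: the joint distribution `p_joint`, the classifier table `q_cond`, and the function name `decomposition` are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy joint p(s, z): 3 feature values (rows) and 2 labels (columns).
p_joint = [[0.30, 0.05],
           [0.10, 0.20],
           [0.05, 0.30]]

# Hypothetical imperfect classifier Q(z | s); each row sums to one.
q_cond = [[0.80, 0.20],
          [0.40, 0.60],
          [0.20, 0.80]]


def decomposition(p_joint, q_cond):
    """Return (E[log P(Z)/Q(Z|S)], I(S;Z), KL(P(z|s) || Q(z|s))) under p(s, z)."""
    n_s, n_z = len(p_joint), len(p_joint[0])
    p_s = [sum(row) for row in p_joint]                                 # marginal of S
    p_z = [sum(p_joint[s][z] for s in range(n_s)) for z in range(n_z)]  # class prior P(z)
    lhs = mi = kl = 0.0
    for s in range(n_s):
        for z in range(n_z):
            w = p_joint[s][z]
            p_cond = w / p_s[s]                        # true posterior P(z | s)
            lhs += w * math.log(p_z[z] / q_cond[s][z])
            mi += w * math.log(p_cond / p_z[z])
            kl += w * math.log(p_cond / q_cond[s][z])
    return lhs, mi, kl


lhs, mi, kl = decomposition(p_joint, q_cond)
print(f"E[log P(Z)/Q(Z|S)] = {lhs:.6f}")
print(f"-I(S;Z) + KL       = {-mi + kl:.6f}")
```

Both printed values should agree up to floating-point error, since the log-ratio splits exactly into the two terms of equation 72. A better classifier (smaller KL term) pushes the expectation closer to $-I\left(\mathbf{S};Z\right)$.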

It is observed that, in equation 71, $\frac{\log\left(W_{N_{q}}\right)}{N_{q}}=\frac{\sum_{i=1}^{N_{q}}\log\left(\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\right)}{N_{q}}$ is a sample mean of $\left\{\log\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\right\}_{i=1}^{N_{q}}$, hence we use the central limit theorem to approximate the distribution of $\frac{\log\left(W_{N_{q}}\right)}{N_{q}}$ leading to the following,

\begin{align}
P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right) &\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right) \tag{78} \\
&\eqsim \Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_{q}}}+\sqrt{N_{q}}\left(I\left(\mathbf{S};Z\right)-\sqrt{\epsilon_{1}}\right)}{\left(\sigma^{2}+\sqrt{\epsilon_{1}}+2\sqrt{\epsilon_{1}}\sigma\right)^{\frac{1}{2}}}\right). \tag{79}
\end{align}
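As a rough numerical illustration of equation 79, the following Python sketch evaluates the approximate lower bound for user-supplied values. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper's experiments; the denominator follows equation 79 as written and is assumed to be strictly positive.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, Phi(x), computed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def baseline_power_bound(n_q, alpha, mi, sigma, eps1):
    """Approximate lower bound on testing power from equation 79.

    n_q   : label budget N_q
    alpha : significance level
    mi    : mutual information I(S; Z), in nats
    sigma : standard deviation term (square root of sigma^2)
    eps1  : classifier discrepancy term epsilon_1
    """
    numer = math.log(alpha) / math.sqrt(n_q) + math.sqrt(n_q) * (mi - math.sqrt(eps1))
    denom = math.sqrt(sigma ** 2 + math.sqrt(eps1) + 2.0 * math.sqrt(eps1) * sigma)
    return std_normal_cdf(numer / denom)


# Hypothetical values chosen only to show how the bound moves with the label budget.
for n_q in (50, 100, 200):
    print(n_q, round(baseline_power_bound(n_q, alpha=0.05, mi=0.10, sigma=0.5, eps1=0.0004), 3))
```

When $I\left(\mathbf{S};Z\right)>\sqrt{\epsilon_{1}}$, both terms in the numerator increase with $N_{q}$, so the bound grows toward one as more labels are queried.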

Testing power of the proposed framework in the case study: The analyst selects a region $A^{*}$ from a partition $\mathcal{P}=\left\{A_{i}\right\}_{i=1}^{m}$, in which $A^{*}$ is predicted to have the highest $I\left(\mathbf{S};Z\mid A^{*}\right)$; then the analyst conducts the sequential testing with $\left(\mathbf{S}_{n},Z_{n}\right)$ generated i.i.d. from $p\left(\mathbf{s},z\mid A^{*}\right)$. We first quantify $I\left(\mathbf{S};Z\mid A^{*}\right)$. Recall that the approximated MI $\left\{\hat{I}\left(\mathbf{S};Z\mid A_{i}\right)\right\}_{i=1}^{m}$ used to find $A^{*}\in\mathcal{P}$ is provided in equation 16 in the case study; given Assumption 5.9, the discrepancy between the true and approximate MI for any $A\in\mathcal{P}$ is as follows

\begin{equation}
I\left(\mathbf{S};Z\mid A\right)-\hat{I}\left(\mathbf{S};Z\mid A\right)=\mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right] \tag{80}
\end{equation}

Furthermore, given $\epsilon_{2}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(p\left(\mathbf{s},z\right)\,\|\,q\left(\mathbf{s},z\right)\mid A\right)$ over the partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$, we evaluate the upper bound of equation 80 for any $A\in\mathcal{P}$ in the following,

\begin{align}
& \mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right] \tag{81} \\
\leq{} & \mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right] \tag{82} \\
={} & D_{\text{KL}}\left(Q\left(z\mid\mathbf{s}\right)\,\|\,P\left(z\mid\mathbf{s}\right)\mid A\right) \tag{83} \\
\leq{} & \sqrt{\epsilon_{2}}. \tag{84}
\end{align}

Similarly, we evaluate the lower bound of equation 80 for any $A\in\mathcal{P}$ in the following,

\begin{align}
& \mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right] \tag{85} \\
\geq{} & \mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right] \tag{86} \\
={} & -D_{\text{KL}}\left(P\left(z\mid\mathbf{s}\right)\,\|\,Q\left(z\mid\mathbf{s}\right)\mid A\right) \tag{87} \\
\geq{} & -\sqrt{\epsilon_{1}}. \tag{88}
\end{align}

Assumption 5.8 suggests that the maximum MI over $\mathcal{P}$ is $I\left(\mathbf{S};Z\right)+\Delta$. Combining equation 84 and equation 88, we get the lower bound of $I\left(\mathbf{S};Z\mid A^{*}\right)$ as follows,

\begin{equation}
I\left(\mathbf{S};Z\mid A^{*}\right)\geq I\left(\mathbf{S};Z\right)+\Delta-\left(\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}\right). \tag{89}
\end{equation}
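The combination of equation 84 and equation 88 can be spelled out as a short chain of inequalities. The sketch below is one reading of that step, under two assumptions that are labeled here rather than stated in the text: $A^{*}$ is taken to maximize the approximate MI $\hat{I}$ over $\mathcal{P}$, and $A^{\dagger}$ is a symbol introduced only for this illustration to denote the region attaining the maximum true MI, $I\left(\mathbf{S};Z\right)+\Delta$.

```latex
% Assumptions (illustrative): A^{*} = \arg\max_{A \in \mathcal{P}} \hat{I}(\mathbf{S};Z \mid A),
% and A^{\dagger} attains the maximum true MI, I(\mathbf{S};Z) + \Delta.
\begin{align*}
I\left(\mathbf{S};Z\mid A^{*}\right)
  &\geq \hat{I}\left(\mathbf{S};Z\mid A^{*}\right)-\sqrt{\epsilon_{1}}
  && \text{(equation 88 applied to } A^{*}\text{)} \\
  &\geq \hat{I}\left(\mathbf{S};Z\mid A^{\dagger}\right)-\sqrt{\epsilon_{1}}
  && \text{(}A^{*}\text{ maximizes } \hat{I} \text{ over } \mathcal{P}\text{)} \\
  &\geq I\left(\mathbf{S};Z\mid A^{\dagger}\right)-\sqrt{\epsilon_{2}}-\sqrt{\epsilon_{1}}
  && \text{(equation 84 applied to } A^{\dagger}\text{)} \\
  &= I\left(\mathbf{S};Z\right)+\Delta-\left(\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}\right).
\end{align*}
```

Read this way, the two error terms enter once each: $\sqrt{\epsilon_{1}}$ when passing from the true to the approximate MI on the selected region, and $\sqrt{\epsilon_{2}}$ when passing back on the best region.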

The analyst conducts the sequential testing in the selected $A^{*}$ with sample features randomly sampled from $A^{*}\bigcap\mathcal{S}_{u}$ and labeled, leading to the following testing power lower bound

\begin{equation}
P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right),\quad\left(\mathbf{S}_{n},Z_{n}\right)\sim p\left(\mathbf{s},z\mid A^{*}\right). \tag{90}
\end{equation}

The quantification of the RHS in equation 90 is identical to the one in the baseline case, except the sample space is constrained to $A^{*}$. Hence, we skip the derivation process and obtain the following result,

\begin{align}
P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right) &\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right) \tag{91} \\
&\eqsim \Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_{q}}}+\sqrt{N_{q}}\left(I\left(\mathbf{S};Z\right)+\Delta-2\sqrt{\epsilon_{1}}-\sqrt{\epsilon_{2}}\right)}{\left(\sigma^{2}+\sqrt{\epsilon_{1}}+2\sqrt{\epsilon_{1}}\sigma\right)^{\frac{1}{2}}}\right). \tag{92}
\end{align}
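To make the comparison between the baseline bound (equation 79) and the active bound (equation 92) concrete, the following Python sketch searches for the smallest label budget $N_{q}$ at which each approximate bound reaches a target power. All numerical values and function names are hypothetical and serve only as an illustration; the two bounds differ only in the effective MI term in the numerator.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, Phi(x), computed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def power_bound(n_q, alpha, effective_mi, sigma, eps1):
    """Common form of equations 79 and 92; only effective_mi differs between them."""
    numer = math.log(alpha) / math.sqrt(n_q) + math.sqrt(n_q) * effective_mi
    denom = math.sqrt(sigma ** 2 + math.sqrt(eps1) + 2.0 * math.sqrt(eps1) * sigma)
    return std_normal_cdf(numer / denom)


def min_queries(target, alpha, effective_mi, sigma, eps1, n_max=100_000):
    """Smallest N_q whose approximate bound reaches the target, or None if not found."""
    for n_q in range(1, n_max + 1):
        if power_bound(n_q, alpha, effective_mi, sigma, eps1) >= target:
            return n_q
    return None


# Hypothetical problem parameters (assumptions for illustration only).
alpha, mi, delta, sigma = 0.05, 0.10, 0.15, 0.5
eps1, eps2 = 0.0004, 0.0004

baseline_mi = mi - math.sqrt(eps1)                                  # numerator term in equation 79
active_mi = mi + delta - 2.0 * math.sqrt(eps1) - math.sqrt(eps2)    # numerator term in equation 92

n_baseline = min_queries(0.8, alpha, baseline_mi, sigma, eps1)
n_active = min_queries(0.8, alpha, active_mi, sigma, eps1)
print("baseline budget:", n_baseline)
print("active budget:  ", n_active)
```

The active bound needs fewer labels only when $\Delta>\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}$, that is, when the MI gain from selecting $A^{*}$ outweighs the classifier's approximation error; with a poor classifier the ordering can reverse.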