Active Sequential Two-Sample Testing

Weizhi Li, Arizona State University; Los Alamos National Laboratory
Prad Kadambi, Arizona State University
Pouria Saidi, Arizona State University
Karthikeyan Natesan Ramamurthy, IBM Research
Gautam Dasarathy, Arizona State University
Visar Berisha, Arizona State University

Work done when the author was at Arizona State University.

Abstract

A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first active sequential two-sample testing framework that not only sequentially but also actively queries. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the “high-dependency” features leads to the increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an anytime-valid $p$ -value. In addition, we characterize the proposed framework’s gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.

1 Introduction

The two-sample test is a statistical hypothesis test applied to data samples (or measurements) from two distributions. The goal is to test if the data supports the hypothesis that the distributions are different. If we consider each data point as a feature and label (which tells us which distribution the data is from) pair, then the two-sample test is equivalent to the problem of testing the dependence between the features and the labels. Viewed with this lens, the null hypothesis for the two-sample test states that the feature and label variables are independent, and the alternate hypothesis states the opposite. The analyst performing the two-sample test needs to decide between the null and the alternative hypotheses with data from the two distributions.

The analyst typically knows little about the difficulty of a two-sample testing problem before running the test. Fixing the sample size a priori may result in a test that needs to collect additional evidence to arrive at a final decision (if the problem is hard) or in an inefficient test with over-collected data (if the problem is simple). To address this dichotomy, the research community has proposed sequential two-sample tests (Wald, 1992; Lhéritier & Cazals, 2018; Hajnal, 1961; Shekhar & Ramdas, 2021; Balsubramani & Ramdas, 2015) that allow the analyst to sequentially collect data and monitor statistical evidence, i.e., a statistic is computed from the data. The test can stop anytime when sufficient evidence has been accumulated to make a decision.

Existing sequential two-sample tests (Wald, 1992; Lhéritier & Cazals, 2018; Hajnal, 1961; Shekhar & Ramdas, 2021; Balsubramani & Ramdas, 2015) are devised to collect both sample features and sample labels simultaneously. In this paper, we consider the problem of sequential two-sample testing in a novel and practical setting where the cost of obtaining sample labels is high, but accessing sample features is inexpensive. As a result, the analyst can obtain a large collection of sample features without labels; she will need to sequentially query the labels of the sample features in the collection to perform the two-sample testing while ensuring the query complexity (i.e., the number of queried labels) does not exceed a label budget. A motivation for this formulation comes from the field of digital health: physicians seek inexpensive digital measurements (e.g., gait, speech, typing speed measured using a patient’s smartphone) to replace traditional biomarkers (e.g., the amyloid buildup that indicates Alzheimer’s progression), which are often costly to access; hence they need to validate the dependency between the digital measurements (feature variables) and traditional biomarkers (label variables). While validation studies can access large registries to collect digital measurements remotely at scale, there is a fixed label budget for the expensive biomarker measures. An efficient sequential design would reveal the dependency between the features and the labels using only a reasonable label budget.

In this paper, we propose the active sequential testing framework shown in Figure 1. The framework initializes a classifier to model probabilities of sample labels given features using an initial random sample; next, depending on the classifier’s outputs, the framework queries the labels of features predicted to have a high dependency with the labels and constructs a test statistic $w$. The framework rejects the null if $w$ is smaller than a pre-defined significance level $\alpha$; otherwise, it either stops and retains the null if the label budget has run out, or re-enters the label query and decision-making steps, which yields a sequential testing process.

Figure 1: The active sequential two-sample testing framework.

The test statistic $w$ in the framework is based on the likelihood ratio between the likelihood constructed under the null that feature and label variables are independent and the likelihood constructed under the alternative that the dependency between the feature and label variables exists. Such a likelihood ratio two-sample test statistic was first proposed in (Lhéritier & Cazals, 2018) to develop a non-active sequential two-sample test capable of controlling the Type I error (i.e., the probability of deciding for the alternative when the null is true). We adapt the original test statistic by replacing the pre-defined label probability prior with a maximum likelihood estimate to suit our setting, in which the label prior is unknown. More importantly, our framework actively labels the features that are predicted to have a high dependency on labels. We will characterize the benefits of the active query over the random query by the change of mutual information between feature and label variables in the asymptotic and finite-sample scenarios. In practice, we suggest using an active query scheme called bimodal query proposed in (Li et al., 2022), in which the scheme labels samples with the highest class one or zero probabilities.

We summarize the main contributions of our work as follows:

• We introduce the first active sequential two-sample testing framework. We prove that the proposed framework produces an anytime-valid $p$-value to achieve Type I error control. Furthermore, we provide an information-theoretic interpretation of the proposed framework. We prove that, asymptotically, the framework is capable of generating the largest mutual information (MI) between feature and label variables under standard conditions (Györfi et al., 2002); and we also analyze the gain of the testing power for the proposed framework over its passive query parallel in the finite-sample scenario through MI.

• We instantiate the framework using the bimodal query (Li et al., 2022) (i.e., queries the labels of the samples that have the highest class one or zero probabilities) as the label query scheme. We perform extensive experiments on synthetic data, MNIST, and an application-specific Alzheimer’s disease dataset to demonstrate the effectiveness of the instantiated framework. Our proposed test exhibits a significant reduction of the Type II error using fewer labeled samples compared with a non-active sequential testing baseline.

2 Related Works

The author of (Student, 1908) developed the $t$-test, probably the simplest form of a two-sample test that compares the mean difference of two samples of univariate data. Since then, the research community has expanded the two-sample test to many other forms, e.g., the Hotelling test (Hotelling, 1992), the Friedman-Rafsky test (Friedman & Rafsky, 1979), the kernel two-sample test (Gretton et al., 2012) and the classifier two-sample test (Lopez-Paz & Oquab, 2016) for the multivariate case. These tests are constructed with various statistics, including the Mahalanobis distance, the measurement over a graph, a kernel embedding, or classifier accuracy, all in service of increasing testing power while controlling the Type I error. In particular, (Friedman & Rafsky, 1979; Gretton et al., 2012; Lopez-Paz & Oquab, 2016) test if the data from two samples is distributionally different, which is a generalization of the Hotelling test and the $t$-test (Student, 1908; Hotelling, 1992) that only detect the mean difference of two samples. These two-sample tests are batch tests that have been extensively used subject to a fixed sample size: when the collection of experimental data ends, an analyst performs the two-sample test on the data and makes a decision; she is not allowed to continue to collect and incorporate more data into the testing after a decision is made, as that will inflate the Type I error.

In contrast to the batch two-sample tests, the research community has developed a class of sequential two-sample tests (Lhéritier & Cazals, 2018; Shekhar & Ramdas, 2021; Pandeva et al., 2022) that allow the analyst to sequentially collect data and perform the two-sample test, enabling sequential decision-making. These sequential tests rectify the inflated Type I error that would arise in the batch test by using different statistical techniques such as Bonferroni correction (Dunn, 1961) and Ville’s maximal inequality (Doob, 1939).

There are also several works that consider the active setting in two-sample testing. The authors of (Li et al., 2022) proposed a batch two-sample test combined with active learning when curated labeled data is unavailable and querying the data labels is expensive. Several studies have also considered sequential testing in developing active sequential hypothesis tests (Naghshvar & Javidi, 2013; Chernoff, 1959; Bessler, 1960; Blot & Meeter, 1973; Keener, 1984; Kiefer & Sacks, 1963). However, these tests require a clear parametric description of the statistical models of the hypotheses. The authors of (Duan et al., 2022) developed an interactive rank test, which is distribution-free and can similarly perform the sequential two-sample testing in the active learning setting.

The work proposed herein uses the label query scheme in (Li et al., 2022) to develop the first multivariate non-parametric sequential test for the active learning setting with a novel test statistic and theoretical results. We demonstrate that the test controls the Type I error via Ville’s maximal inequality (See Theorem 5.1). Ville’s maximal inequality results in higher testing power than the Bonferroni correction for sequential testing (Shekhar & Ramdas, 2021; Ramdas et al., 2022).

While our framework in Figure 1 employs the label query scheme introduced in (Li et al., 2022), it offers distinct advantages over (Li et al., 2022):

• Our proposed framework follows a sequential design. Upon accumulating sufficient evidence to reject the null hypothesis, our design automatically stops label collection before exhausting the label budget. In contrast, the batch design in (Li et al., 2022) invariably exhausts the label budget.

• Utilizing a different test statistic, our framework enables finite-sample analysis, which is not provided in (Li et al., 2022).

3 Problem Statement and Preliminaries

3.1 Notations

We use a pair of random variables $(\mathbf{S},Z)$ to denote a feature and its label variables whose realization is $(\mathbf{s},z)\in\mathbb{R}^{d}\times\{0,1\}$. The variable pair $(\mathbf{S},Z)$ admits a joint distribution $p_{\mathbf{S}Z}(\mathbf{s},z)$. Furthermore, we write $\mathcal{S}$ to denote the support of $p_{\mathbf{S}}(\mathbf{s})$. Formally, a two-sample testing problem consists of a null hypothesis $H_{0}$ that states $p_{\mathbf{S}\mid Z=0}(\mathbf{s})=p_{\mathbf{S}\mid Z=1}(\mathbf{s})$ and an alternative hypothesis $H_{1}$ that states $p_{\mathbf{S}\mid Z=0}(\mathbf{s})\neq p_{\mathbf{S}\mid Z=1}(\mathbf{s})$. An analyst collects a sequence $\left((\mathbf{s},z)_{i}\right)_{i=1}^{N}$ of $N$ realizations of $(\mathbf{S},Z)$ to test $H_{0}$ against $H_{1}$. The problem is equivalent to testing the independence between $\mathbf{S}$ and $Z$. Therefore, we equivalently restate the hypothesis test as follows:

	$\displaystyle H_{0}:p_{\mathbf{S}Z}(\mathbf{s},z)=p_{\mathbf{S}}(\mathbf{s})P_{Z}(z),\forall\mathbf{s}\in\mathcal{S}$
	$\displaystyle H_{1}:p_{\mathbf{S}Z}(\mathbf{s},z)\neq p_{\mathbf{S}}(\mathbf{s})P_{Z}(z),\exists\mathbf{s}\in\mathcal{S}$		(1)

Moving forward, we omit the subscripts in $p_{\mathbf{S}Z}(\mathbf{s},z)$ , $P_{Z}(z)$ and $p_{\mathbf{S}}(\mathbf{s})$ and write them as $p(\mathbf{s},z)$ , $P(z)$ and $p(\mathbf{s})$ . In addition, we use $\mathbf{s}^{N}$ , $z^{N}$ and $(\mathbf{s},z)^{N}$ to denote sequences of samples $(\mathbf{s}_{i})_{i=1}^{N}$ , $(z_{i})_{i=1}^{N}$ and $((\mathbf{s},z)_{i})^{N}_{i=1}$ respectively. We use similar notation throughout the paper.

3.2 The problem

In the typical setting of a sequential two-sample test, an analyst does not have prior knowledge of sample features. The analyst sequentially collects both sample features and their labels simultaneously with the corresponding random variable pair $(\mathbf{S},Z)$ i.i.d. generated from a data-generating process, i.e., $p(\mathbf{s},z)$ . We consider a variant of the setting in which accessing sample features is free/inexpensive. Consequently, the analyst collects a large set $\mathcal{S}_{u}$ of sample features before performing a sequential test. However, accessing the label of a feature in $\mathcal{S}_{u}$ is costly. We assume the following fact throughout the paper: The already-collected $\mathcal{S}_{u}$ is the result of a sample feature collection process where all $\mathbf{s}_{i}\in\mathcal{S}_{u}$ are realizations of random variables $\mathbf{S}_{i}$ i.i.d. generated from $p(\mathbf{s})$ . There exists an oracle to return a label $z_{i}$ of $\mathbf{s}_{i}\in\mathcal{S}_{u}$ with the corresponding random variable $Z_{i}$ and $\mathbf{S}_{i}$ admitting the posterior probability $P(z_{i}|\mathbf{s}_{i})$ . We consider the following new sequential two-sample testing problem:

An analyst actively labeling $\mathbf{s}_{n}\in\mathcal{S}_{u}$ may result in non-i.i.d pairs of $(\mathbf{S},Z)$ ; hence the distribution of $(\mathbf{S},Z)$ is shifted away from $p(\mathbf{s},z)$ . In contrast, an analyst passively (or randomly) labeling $\mathbf{s}_{n}\in\mathcal{S}_{u}$ maintains $(\mathbf{S},Z)\sim p(\mathbf{s},z)$ .

3.3 Evaluation metrics for the problem

In the following, we introduce the evaluation metrics used throughout the paper.

• Type I error $P_{0}$: The probability of rejecting $H_{0}$ when $H_{0}$ is true.

• Type II error $P_{1}$: The probability of failing to reject $H_{0}$ when $H_{1}$ is true.

• Testing power: The probability of rejecting $H_{0}$ when $H_{1}$ is true. In other words, $\text{Testing power}=1-P_{1}$.

Testing power and Type II error are used interchangeably in the methodology and experiment sections (Sections 5 and 6).

3.4 Attributes of an active two-sample test

As established in much of the two-sample testing literature (Johari et al., 2022; Wald, 1992; Lhéritier & Cazals, 2018; Shekhar & Ramdas, 2021; Welch, 1990), a conventional procedure for sequential two-sample testing is to compute a $p$-value from sequentially observed samples and compare it to a pre-defined significance level $\alpha\in[0,1]$ at any time. The analyst rejects $H_{0}$ and stops the testing if $p\leq\alpha$. For more details, see (Wasserstein & Lazar, 2016). In addition, as the test proposed in what follows is endowed with active querying to reduce the number of label queries, the active sequential test is anticipated to spend fewer labels than a passive (random-query) test to reject $H_{0}$ when $H_{1}$ is true. In summary, an active sequential two-sample test has the following four attributes:

• The test generates an anytime-valid $p$-value such that $P_{0}(p\leq\alpha)\leq\alpha$ holds at any time of the sequential testing process. $P_{0}(p\leq\alpha)$ is exactly the Type I error, which implies that the Type I error is upper-bounded by $\alpha$.

• The test has a high testing power $P_{1}(p\leq\alpha)$.

• The test is consistent such that $P_{1}(p\leq\alpha)=1$ under $H_{1}$ when the test sample size goes to infinity.

• The test has higher testing power $P_{1}(p\leq\alpha)$ than the passive test given the same label budget.

4 A Sequential Two-Sample Testing Statistic

We follow the well-known likelihood ratio test (Wilks, 1938) to construct a sequential testing statistic. We use the statistical models that characterize the label generation processes conditional on the observed sample features under $H_{0}$ and $H_{1}$. More precisely, under $H_{0}$, we have $P(z|\mathbf{s})=P(z),\forall\mathbf{s}\in\mathcal{S}$; that is, when $\mathbf{S}$ and $Z$ are independent, the posterior probability $P\left(z|\mathbf{s}\right)$ is the same for any $\mathbf{s}$ in the support $\mathcal{S}$ of $p(\mathbf{s})$. In contrast, under $H_{1}$, we have the following statistical model: $\exists\mathbf{s}\in\mathcal{S},P(z|\mathbf{s})\neq P(z)$. We sequentially collect sample data $(\mathbf{s},z)$, and when a new observation $(\mathbf{s}_{n},z_{n})$ arrives, we construct a likelihood ratio $w_{n}$: with $w_{0}=1$, $w_{n}=w_{n-1}\frac{P(z_{n})}{P(z_{n}|\mathbf{s}_{n})}=\prod_{i=1}^{n}\frac{P(z_{i})}{P(z_{i}\mid\mathbf{s}_{i})},n\geq 1$ to assess $H_{0}$ against $H_{1}$.

The statistical models $P(z)$ and $P(z|\mathbf{s})$ are unknown. To formulate our two-sample test, we will use a likelihood estimate $\hat{P}(z^{n})$ that is maximized over all the class priors to replace $P(z^{n})$, the product of the class priors. In addition, we build a class-probability predictor $Q_{n}(z\mid\mathbf{s})$ with the past observed sample sequence $(\mathbf{s},z)^{n-1}$ to model $P(z_{n}\mid\mathbf{s}_{n})$, the posterior probability of $z_{n}$ given the newly observed $\mathbf{s}_{n}$; any probabilistic classifier, such as a neural network or a logistic regression model, can be used to build $Q_{n}(z\mid\mathbf{s})$. Additionally, $Q_{1}(z\mid\mathbf{s})$ indicates an initialized class-probability predictor.¹

¹ It is possible to set $Q_{1}(z|\mathbf{s})$ as a random guess class-probability predictor, and then sequentially gather $(\mathbf{s},z)$ for training; however, this would hurt the testing power. As suggested by Duan et al. (2022); Lhéritier & Cazals (2018), we initialize $Q_{1}(z|\mathbf{s})$ with a small set of samples randomly labeled and start the sequential testing after that.

We formally present our sequential testing statistic in the following:

We accordingly use $W_{n}$ to indicate a random variable of which $w_{n}$ is a realization. Our test statistic in equation 2 is a generalization of the test statistic proposed in (Lhéritier & Cazals, 2018). In contrast to that work, our test statistic does not require the class prior to be known. The analyst compares $w_{n}$ with $\alpha$ at every step $n$ starting from $n=1$, stopping the test once encountering a step with $w_{n}\leq\alpha$. As a result, a small $w_{n}$ is favored under $H_{1}$ to reject $H_{0}$ for increasing testing power.
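As a minimal sketch of how $w_{n}$ can be computed, the Python below assumes the statistic takes the form that appears in Theorem 5.1, namely $w_{n}=\hat{P}(z^{n})/\prod_{i=1}^{n}Q_{i}(z_{i}\mid\mathbf{s}_{i})$, with $\hat{P}(z^{n})=\max_{\theta}\theta^{k}(1-\theta)^{n-k}$ for $k$ labels equal to one. The function names and the toy inputs are illustrative assumptions, not part of the original method description.

```python
import math


def max_prior_log_likelihood(labels):
    """log of max over theta of theta^k (1 - theta)^(n - k), attained at theta = k / n."""
    n = len(labels)
    k = sum(labels)
    log_lik = 0.0
    if k > 0:
        log_lik += k * math.log(k / n)
    if k < n:
        log_lik += (n - k) * math.log((n - k) / n)
    return log_lik


def log_statistic(labels, predicted_probs):
    """log w_n, where predicted_probs[i] is Q_i(z_i | s_i), i.e. the probability
    the predictor assigned to the label that was actually observed."""
    log_q = sum(math.log(q) for q in predicted_probs)
    return max_prior_log_likelihood(labels) - log_q


# Toy inputs (assumed): eight balanced labels and two hypothetical predictors.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
alpha = 0.05
w_confident = math.exp(log_statistic(labels, [0.9] * 8))      # predictor is usually right
w_uninformative = math.exp(log_statistic(labels, [0.5] * 8))  # predictor is a coin flip

print(w_confident, w_confident <= alpha)
print(w_uninformative, w_uninformative <= alpha)
```

In this toy case the confident predictor drives $w_{n}$ below $\alpha$, while the uninformative predictor leaves it at one. A real run would check $w_{n}\leq\alpha$ after every new label rather than only at the end.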

Algorithm 1 Bimodal Query Based Active Sequential Two-Sample Testing (BQ-AST)

1: Input: $\mathcal{S}_{u},\mathcal{A},N_{0},N_{q},\alpha$
2: Output: Reject or fail to reject $H_{0}$
3: Initialization: Initialize $Q_{1}(z\mid\mathbf{s})$ using $\mathcal{A}$ with $N_{0}$ features uniformly sampled from $\mathcal{S}_{u}$ without replacement and then labeled.
4: Active sequential testing:
5: for $n=1$ to $N_{q}-N_{0}$ do
6:     Sample a feature $\mathbf{s}_{n}=\mathbf{s}_{q_{0}}$ or $\mathbf{s}_{q_{1}}$ with fair chance, where $\mathbf{s}_{q_{0}}=\arg\max_{\mathbf{s}}\left[Q_{n}(Z=0|\mathbf{s})\right],\forall\mathbf{s}\in\mathcal{S}_{u}$ and $\mathbf{s}_{q_{1}}=\arg\max_{\mathbf{s}}\left[Q_{n}(Z=1|\mathbf{s})\right],\forall\mathbf{s}\in\mathcal{S}_{u}$
7:     Query the label $z_{n}$ of $\mathbf{s}_{n}$
8:     Update $w_{n}$ in equation 3 with $(\mathbf{s}_{n},z_{n})$ and $Q_{n}(z_{n}\mid\mathbf{s}_{n})$
9:     if $w_{n}\leq\alpha$ then
10:         Return Reject $H_{0}$
11:     else
12:         Update $Q_{n}(z\mid\mathbf{s})$ with newly queried $(\mathbf{s}_{n},z_{n})$ and past training examples.
13:     end if
14: end for
15: Return Retain $H_{0}$

5 Active Sequential Two-Sample Testing

This section introduces the active sequential two-sample testing framework and its instantiation. We demonstrate that the framework produces an anytime-valid $p$ -value regardless of the selected query scheme. We also provide the asymptotic and finite-sample performance of the framework with the testing power gain measured by the change of the mutual information between feature and label variables.

5.1 An active sequential two-sample testing framework

A flow chart of the proposed framework is shown in Figure 1. Our framework starts by initializing the class-probability predictor $Q_{n}(z\mid\mathbf{s})$ at $n=1$ with a small set of sample features randomly selected from $\mathcal{S}_{u}$ and then labeled. Then, the framework enters the sequential testing stage, which iterates the following steps: select features in $\mathcal{S}_{u}$ predicted by $Q_{n}$ to have a high dependency on their labels, update the statistic $w_{n}$, decide whether $H_{0}$ can be rejected, and update $Q_{n}$ if the test has not stopped. We formally introduce our active sequential two-sample testing framework as follows,

Framework instantiation: We provide a framework instantiation called bimodal query based active sequential two-sample testing (BQ-AST) described in Algorithm 1. The algorithm takes the following input: an unlabelled feature set $\mathcal{S}_{u}$, a probabilistic classification algorithm $\mathcal{A}$, the size $N_{0}$ of an initialization set used for $\mathcal{A}$, a label budget $N_{q}$ and a significance level $\alpha$. Then, the algorithm initializes a class-probability predictor $Q$ using $\mathcal{A}$ with a small set of randomly labeled samples. In the sequential testing stage, the algorithm uses the bimodal query from Li et al. (2022) to sample $\mathbf{s}_{n}$ from $\mathcal{S}_{u}$ among the samples having the highest posteriors from either class (e.g., with a fair chance of selecting the sample with the highest $Q_{n}\left(Z=0\mid\mathbf{s}\right)$ or the highest $Q_{n}\left(Z=1\mid\mathbf{s}\right)$), queries its label $z_{n}$ and updates the statistic $w_{n}$. Next, the algorithm compares $w_{n}$ with $\alpha$, and if $H_{0}$ is not rejected, it updates $Q_{n}$ with $(\mathbf{s}_{n},z_{n})$ and then re-enters the label query step. The algorithm rejects $H_{0}$ if $w_{n}\leq\alpha$ or fails to reject $H_{0}$ if the label budget is exhausted.
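The loop described above can be sketched in Python as follows. This is an illustrative toy, not the authors' implementation: it assumes one-dimensional features, uses a Laplace-smoothed k-nearest-neighbour estimate as a stand-in for the classification algorithm $\mathcal{A}$, and assumes the statistic has the form given in Theorem 5.1. The function names (`knn_prob`, `max_prior_loglik`, `bq_ast`) and the synthetic two-cluster data are hypothetical.

```python
import math
import random


def knn_prob(train, s, k=5):
    """Smoothed k-NN estimate of Q(Z=1 | s) from (feature, label) pairs."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - s))[:k]
    ones = sum(z for _, z in nearest)
    return (ones + 1) / (len(nearest) + 2)  # Laplace smoothing keeps Q inside (0, 1)


def max_prior_loglik(labels):
    """log of max over theta of theta^k (1 - theta)^(n - k), attained at theta = k / n."""
    n, k = len(labels), sum(labels)
    out = 0.0
    if k > 0:
        out += k * math.log(k / n)
    if k < n:
        out += (n - k) * math.log((n - k) / n)
    return out


def bq_ast(features, oracle, n_init, budget, alpha, rng):
    """Return (decision, number of labels queried after initialization)."""
    indices = list(range(len(features)))
    init = set(rng.sample(indices, n_init))          # random initial labels
    train = [(features[i], oracle(i)) for i in sorted(init)]
    unlabeled = [i for i in indices if i not in init]
    labels, log_q = [], 0.0
    for n in range(1, budget - n_init + 1):
        probs = {i: knn_prob(train, features[i]) for i in unlabeled}
        if rng.random() < 0.5:
            idx = max(unlabeled, key=probs.get)      # highest Q(Z=1 | s)
        else:
            idx = min(unlabeled, key=probs.get)      # highest Q(Z=0 | s)
        z = oracle(idx)                              # costly label query
        unlabeled.remove(idx)
        log_q += math.log(probs[idx] if z == 1 else 1.0 - probs[idx])
        labels.append(z)
        if max_prior_loglik(labels) - log_q <= math.log(alpha):
            return "reject", n
        train.append((features[idx], z))             # refit on all labels so far
    return "retain", budget - n_init


# Synthetic example (assumed): two well-separated clusters, one per label.
features = [-1.0 - 0.01 * j for j in range(50)] + [1.0 + 0.01 * j for j in range(50)]
true_labels = [0] * 50 + [1] * 50
decision, n_used = bq_ast(features, true_labels.__getitem__, 10, 40, 0.05, random.Random(0))
print(decision, n_used)
```

The predictor is evaluated on the queried feature before its label is added to the training set, so $Q_{n}$ depends only on past observations, which is the property the anytime-valid guarantee relies on.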

The label budget $N_{q}$ in Algorithm 1 contains the labels for both initializing $Q_{1}(z\mid\mathbf{s})$ and constructing the statistic $w_{n}$ . In what follows in this section, we simply use $N_{q}$ to denote the “label budget” allowed to be used after the initialization.

5.2 The proposed framework results in an anytime-valid $p$ -value

Our framework rejects $H_{0}$ if the statistic $w_{n}\leq\alpha$. The following theorem states that under $H_{0}$, $w_{n}$ is an anytime-valid $p$-value.

Theorem 5.1.

If an analyst uses the proposed framework to sequentially query the oracle for $Z$ with $\mathbf{S}\in\mathcal{S}_{u}$ resulting in $(\mathbf{S},Z)^{n}$ , then we have the following under $H_{0}$ ,

	$\displaystyle P_{0}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{\hat{P}(Z_{i})}{Q_{i}\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\leq\alpha$		(4)

where $N_{q}$ is a label budget and $\alpha$ is the pre-specified significance level.

Theorem 5.1 implies that the probability $P_{0}$ (or Type I error) that our framework mistakenly rejects $H_{0}$ is upper-bounded by $\alpha$. Briefly, we prove this by observing that the sequence $\left(\frac{1}{W_{1}},\cdots,\frac{1}{W_{n}}\right)$ is upper-bounded by a martingale, and hence we use Ville’s maximal inequality (Durrett, 2019; Doob, 1939) to develop Theorem 5.1. See the Appendix for the complete proof.
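The guarantee can be checked empirically with a small Monte Carlo experiment. The Python sketch below is a sanity check under stated assumptions, not a substitute for the proof: labels are drawn as fair coin flips independent of the features (so $H_{0}$ holds), and the predictor output $Q_{n}(Z=1\mid\mathbf{s}_{n})$ is replaced by an arbitrary number in $[0.2,0.8]$ chosen before the label is seen. The function names, the trial counts, and this stand-in predictor are all hypothetical.

```python
import math
import random


def run_null_trial(n_steps, alpha, rng):
    """Run one sequential test under H0 and report whether it ever rejects."""
    k, log_q = 0, 0.0
    for n in range(1, n_steps + 1):
        q1 = rng.uniform(0.2, 0.8)            # stand-in for Q_n(Z=1 | s_n), fixed before z_n
        z = 1 if rng.random() < 0.5 else 0    # label independent of the feature
        log_q += math.log(q1 if z == 1 else 1.0 - q1)
        k += z
        # Maximum-likelihood numerator: max over theta of theta^k (1 - theta)^(n - k).
        log_p_hat = 0.0
        if k > 0:
            log_p_hat += k * math.log(k / n)
        if k < n:
            log_p_hat += (n - k) * math.log((n - k) / n)
        if log_p_hat - log_q <= math.log(alpha):
            return True                       # false rejection of a true null
    return False


def type_one_error(n_trials=2000, n_steps=100, alpha=0.05, seed=0):
    """Fraction of null trials in which w_n <= alpha at any step."""
    rng = random.Random(seed)
    rejections = sum(run_null_trial(n_steps, alpha, rng) for _ in range(n_trials))
    return rejections / n_trials


rate = type_one_error()
print(rate)
```

Because the test checks $w_{n}\leq\alpha$ at every step, the reported rate is the probability of rejecting at any time within the budget, which is the quantity bounded in equation 4.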

5.3 Asymptotic properties of the proposed framework

This section provides the theoretical conditions under which the proposed framework asymptotically generates the smallest normalized statistic (normalization of the statistic in equation 3), or equivalently, maximally increases the mutual information between $\mathbf{S}$ and $Z$ . Before that, we first define the consistent bimodal query as follows,

Definition 5.2.

(Consistent bimodal query) Let $\mathcal{S}$ be the support of $p(\mathbf{s})$ that sample features are collected from and added to an unlabeled set $\mathcal{S}_{u}$ , and let $P(z\mid\mathbf{s})$ denote the posterior probability of $z$ given $\mathbf{s}\in\mathcal{S}$ . An analyst adopts a label query scheme, for every $n\in\left[N_{q}\right]$ , to query the label $Z_{n}$ of $\mathbf{S}_{n}\in\mathcal{S}_{u}$ such that $\mathbf{S}_{n}$ admits a probability density function (PDF) $p_{n}(\mathbf{s})$ . The label query scheme is a consistent bimodal query if $\lim_{n\to\infty}p_{n}(\mathbf{s})=p^{*}(\mathbf{s})$ where

$\displaystyle p^{*}(\mathbf{s})$	$\displaystyle=0,\forall\mathbf{s}\in\mathcal{S}\setminus\left(\mathcal{S}_{q_{0}}\cup\mathcal{S}_{q_{1}}\right),\text{ and }p^{*}(\mathbf{s})>0,\forall\mathbf{s}\in\mathcal{S}_{q_{0}}\cup\mathcal{S}_{q_{1}},$	(5)
$\displaystyle\mathcal{S}_{q_{0}}$	$\displaystyle=\left\{\mathbf{s}_{q_{0}}\,\middle|\,P\left(Z=0\mid\mathbf{s}_{q_{0}}\right)=\max_{\mathbf{s}\in\mathcal{S}}P\left(Z=0\mid\mathbf{s}\right)\right\},$	(6)
$\displaystyle\mathcal{S}_{q_{1}}$	$\displaystyle=\left\{\mathbf{s}_{q_{1}}\,\middle|\,P\left(Z=1\mid\mathbf{s}_{q_{1}}\right)=\max_{\mathbf{s}\in\mathcal{S}}P\left(Z=1\mid\mathbf{s}\right)\right\}.$	(7)

Remark 5.3.

Definition 5.2 considers a label query scheme that only queries the labels of $\mathbf{s}$ with the highest $P\left(Z=0\mid\mathbf{s}\right)$ and $P\left(Z=1\mid\mathbf{s}\right)$ when $n$ goes to infinity. As $P\left(z\mid\mathbf{s}\right)$ is not directly available, to construct the consistent bimodal query, one can use nonparametric regressors to construct a class-probability predictor $Q\left(z\mid\mathbf{s}\right)$ as a nonparametric estimate of $P\left(z\mid\mathbf{s}\right),\forall\mathbf{s}\in\mathcal{S}$, and implement the bimodal query to label $\mathbf{s}$ with the highest $Q(Z=0\mid\mathbf{s})$ or the highest $Q(Z=1\mid\mathbf{s})$ after $Q(z\mid\mathbf{s})$ converges to $P\left(z\mid\mathbf{s}\right)$. The authors of (Györfi et al., 2002) prove that when $Q\left(z\mid\mathbf{s}\right)$ is a kernel, KNN or partition estimate with proper smoothing parameters (e.g., bandwidth for the kernel) and labels are sufficiently revealed in the proximity of $\mathbf{s},\forall\mathbf{s}\in\mathcal{S}$, then $Q\left(z\mid\mathbf{s}\right)$ converges to $P\left(z\mid\mathbf{s}\right)$.

To this end, we introduce the asymptotic property of our framework. We consider normalizing the test statistic in equation 3 as follows,

	$\displaystyle\overline{W}_{n}=\frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat{P}(Z_{i})}{Q_{i}\left(Z_{i}\mid\mathbf{S}_{i}\right)},\quad\left(\mathbf{S}_{i},Z_{i}\right)\sim p_{i}\left(\mathbf{s},z\right)=P\left(z\mid\mathbf{s}\right)p_{i}\left(\mathbf{s}\right)$		(8)

where $(\mathbf{S}_{i},Z_{i})$ denotes a feature-label pair returned by a label query scheme when querying the $i$ -th label. Next, we state the following theorem.

Theorem 5.4.

Let $\mathcal{S}$ be the support of $p\left(\mathbf{s}\right)$ that sample features are collected from and added to an unlabeled set $\mathcal{S}_{u}$ , and let $P\left(z\mid\mathbf{s}\right)$ denote the posterior probability of $z$ given $\mathbf{s}\in\mathcal{S}$ . There exists a consistent bimodal query scheme; when an analyst uses such a scheme in the proposed active sequential framework, then, under $H_{1}$ , $\overline{W}_{n}$ converges to the negation of mutual information (MI), and the converged negated MI lower-bounds the negated MI generated by any $p\left(\mathbf{s}\right)$ subject to $P\left(z\mid\mathbf{s}\right),\forall\mathbf{s}\in\mathcal{S}$ . Precisely, there exists a consistent bimodal query leading to the following

	$\displaystyle\lim_{n\to\infty}\overline{W}_{n}=-\left(H^{*}(Z)-H^{*}\left(Z\mid\mathbf{S}\right)\right)=-I^{*}\left(\mathbf{S};Z\right)\leq-I\left(\mathbf{S};Z\right).$		(9)

$I^{*}\left(\mathbf{S};Z\right)$ is the MI constructed with $\left(\mathbf{S},Z\right)\sim p^{*}\left(\mathbf{s},z\right)=P\left(z\mid\mathbf{s}\right)p^{*}\left(\mathbf{s}\right)$ (see equation 5 for $p^{*}\left(\mathbf{s}\right)$); $I\left(\mathbf{S};Z\right)$ is the MI constructed with $\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)=P\left(z\mid\mathbf{s}\right)p\left(\mathbf{s}\right)$.

Recall that the null $H_{0}$ is rejected when the test statistic $w_{n}$ in equation 3 is smaller than $\alpha$. Hence, the proposed framework, when used with a consistent bimodal query to asymptotically minimize the normalized $w_{n}$ in equation 3, favorably increases the testing power when $|\mathcal{S}_{u}|$ is large and $Q(z\mid\mathbf{s})$ is close to $P(z\mid\mathbf{s})$. In Section 5.4, we will analyze the finite-sample performance of the proposed framework considering the approximation error of $Q(z\mid\mathbf{s})$. Additionally, by characterizing the difficulty of a two-sample testing problem with MI, Theorem 5.4 suggests that the proposed framework asymptotically turns the original hard two-sample testing problem with low dependency between $\mathbf{S}$ and $Z$ (low MI) into a simple one by increasing the dependency between $\mathbf{S}$ and $Z$ (high MI).
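To make the inequality $I^{*}\left(\mathbf{S};Z\right)\geq I\left(\mathbf{S};Z\right)$ in equation 9 concrete, the Python sketch below compares the two quantities on a small discrete example. The five-point support, the posterior values, and the equal split of the bimodal mass between the two extreme points are all assumptions chosen for illustration; the theorem itself is stated for general supports.

```python
import math


def entropy(dist):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in dist if x > 0)


def mutual_information(p_s, p_z1_given_s):
    """I(S; Z) = H(Z) - H(Z | S) for a discrete feature S and a binary label Z."""
    p_z1 = sum(ps * q for ps, q in zip(p_s, p_z1_given_s))
    h_z = entropy([p_z1, 1.0 - p_z1])
    h_z_given_s = sum(ps * entropy([q, 1.0 - q]) for ps, q in zip(p_s, p_z1_given_s))
    return h_z - h_z_given_s


# Assumed posterior P(Z=1 | s) on five support points.
posterior = [0.1, 0.3, 0.5, 0.7, 0.9]

# Passive (random) query: features are labeled in proportion to p(s), here uniform.
passive = [0.2, 0.2, 0.2, 0.2, 0.2]

# Bimodal limit p*(s): mass only where P(Z=0 | s) or P(Z=1 | s) is largest.
bimodal = [0.5, 0.0, 0.0, 0.0, 0.5]

mi_passive = mutual_information(passive, posterior)
mi_bimodal = mutual_information(bimodal, posterior)
print(mi_passive, mi_bimodal)
```

In this example the bimodal distribution roughly doubles the mutual information (about 0.37 nats versus about 0.18 nats), which corresponds to a faster decay of the normalized statistic $\overline{W}_{n}$.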

Remark 5.5.

Our testing framework is also consistent under $H_{1}$ and the same conditions as Theorem 5.4, since $\lim_{n\to\infty}P_{1}\left(\prod_{i=1}^{n}\frac{\hat{P}(Z_{i})}{Q_{i}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)=\lim_{n\to\infty}P_{1}\left(\overline{W}_{n}\leq\frac{1}{n}\log(\alpha)\right)=P_{1}\left(-I^{*}\left(\mathbf{S};Z\right)\leq 0\right)=1$. The last equality holds because $I^{*}\left(\mathbf{S};Z\right)>0$ under $H_{1}$.

5.4 Finite-sample analysis for the proposed framework

This section analyzes the testing power of the proposed framework in the finite-sample case. Section 5.4.1 and Section 5.4.2 offer metrics that assess the approximation error of $Q(z\mid\mathbf{s})$ and an irreducible Type II error. These metrics together determine the finite-sample testing power. Furthermore, Section 5.4.3 presents an illustrative example of using our framework. In Section 5.4.4, we conduct a finite-sample analysis for the example, incorporating both the metrics that characterize the approximation error and the irreducible Type II error.

5.4.1 Characterizing the approximation error of $Q(z\mid\mathbf{s})$

As our framework constructs the test statistic in equation 2 with the approximation $Q(z\mid\mathbf{s})$, we need a metric for assessing the approximation error of $Q(z\mid\mathbf{s})$ in our finite-sample analysis. To this end, we introduce the $\text{KL}^{2}$-divergence.

Definition 5.6.

( $\text{KL}^{2}$ -divergence) Let $p_{0}$ and $q_{0}$ be two probability density functions on the same support $\mathcal{X}$ . Let $f(t)=\log^{2}(t)$ . Then, the $\text{KL}^{2}$ -divergence between $p_{0}$ and $q_{0}$ is

\displaystyle D_{\text{KL}^{2}}\left(q_{0}\|p_{0}\right)=\mathbb{E}_{\mathbf{X}\sim p_{0}(\mathbf{x})}\left[f\left(\frac{q_{0}(\mathbf{X})}{p_{0}(\mathbf{X})}\right)\right]=\mathbb{E}_{\mathbf{X}\sim p_{0}(\mathbf{x})}\left[\log^{2}{\left(\frac{q_{0}(\mathbf{X})}{p_{0}(\mathbf{X})}\right)}\right].

(10)

$D_{\text{KL}^{2}}\left(q_{0}\|p_{0}\right)$ is the second moment of the log-likelihood ratio and has been used (see, e.g., (3.1.14) in (Koga et al., 2002)) to understand the behavior of the distribution of $\log{\left(\frac{q_{0}(\mathbf{x})}{p_{0}(\mathbf{x})}\right)}$ . We use $D_{\text{KL}^{2}}\left(q_{0}||p_{0}\right)$ to evaluate the distance between $p\left(\mathbf{s},z\right)=P\left(z\mid\mathbf{s}\right)p\left(\mathbf{s}\right)$ and $q(\mathbf{s},z)=Q\left(z\mid\mathbf{s}\right)p\left(\mathbf{s}\right)$ , which yields the following

\displaystyle D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)=\mathbb{E}_{\left(\mathbf{S},Z\right)\sim p(\mathbf{s},z)}\left[\log^{2}\left(\frac{q\left(\mathbf{S},Z\right)}{p\left(\mathbf{S},Z\right)}\right)\right]=\mathbb{E}_{\left(\mathbf{S},Z\right)\sim p(\mathbf{s},z)}\left[\log^{2}\left(\frac{Q\left(Z\mid\mathbf{S}\right)}{P\left(Z\mid\mathbf{S}\right)}\right)\right].

(11)

Notably, $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ in equation 11 also characterizes the discrepancy between $P\left(z\mid\mathbf{s}\right)$ and $Q\left(z\mid\mathbf{s}\right)$ by averaging their squared log-ratio over $\mathcal{S}$ ; in our main result, we will see that the testing power of the proposed framework depends on $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ . Additionally, $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)$ is closely related to the standard KL divergence $D_{\text{KL}}\left(P\left(z\mid\mathbf{s}\right)\|Q\left(z\mid\mathbf{s}\right)\right)=\mathbb{E}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)}\left[\log{\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}}\right]$ . This can be seen by expanding equation 11 using the formula $\text{Var}\left[X\right]=\mathbb{E}\left[X^{2}\right]-\mathbb{E}^{2}\left[X\right]$ with $X=\log\left(\frac{P(Z\mid\mathbf{S})}{Q(Z\mid\mathbf{S})}\right)$ and noting that $\log^{2}(Q/P)=\log^{2}(P/Q)$ , which gives

\displaystyle D_{\text{KL}^{2}}\left(q(\mathbf{s},z)\|p(\mathbf{s},z)\right)=\text{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)}\left[\log\left(\frac{P(Z\mid\mathbf{S})}{Q(Z\mid\mathbf{S})}\right)\right]+\left[D_{\text{KL}}\left(P\left(z\mid\mathbf{s}\right)\|Q\left(z\mid\mathbf{s}\right)\right)\right]^{2}.

(12)

Equation 12 implies that $D_{\text{KL}^{2}}\left(q\left(\mathbf{s},z\right)||p\left(\mathbf{s},z\right)\right)$ reflects not only the expected log-ratio between $P\left(z\mid\mathbf{s}\right)$ and $Q\left(z\mid\mathbf{s}\right)$ over $\mathcal{S}$ (through the KL term) but also the variance of that log-ratio. Similarly, we write

\displaystyle D_{\text{KL}^{2}}\left(p\left(\mathbf{s},z\right)\|q\left(\mathbf{s},z\right)\right)=\mathbb{E}_{\left(\mathbf{S},Z\right)\sim q(\mathbf{s},z)}\left[\log^{2}\left(\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right)\right]

(13)

to measure the discrepancy between $p(\mathbf{s},z)$ and $q(\mathbf{s},z)$ in the direction opposite to $D_{\text{KL}^{2}}\left(q\left(\mathbf{s},z\right)\|p\left(\mathbf{s},z\right)\right)$ .

$D_{\text{KL}^{2}}\left(q\left(\mathbf{s},z\right)\|p\left(\mathbf{s},z\right)\right)$ and $D_{\text{KL}^{2}}\left(p\left(\mathbf{s},z\right)\|q\left(\mathbf{s},z\right)\right)$ both characterize the approximation error of $Q(z\mid\mathbf{s})$ , and we will also see they jointly determine the testing power of the proposed framework in Section 5.4.4.
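As a numerical sanity check of the relation between the $\text{KL}^{2}$ -divergence, the variance of the log-ratio, and the KL divergence, the following sketch evaluates all three on a small discrete example. The feature is assumed to take finitely many values with binary labels; the function name `kl2_terms` and all numbers are hypothetical illustrations, not values from the paper.

```python
import math

def kl2_terms(p_s, P1, Q1):
    """Return (KL^2-divergence, variance of log(P/Q), KL(P || Q)) under p(s, z).

    p_s[k] : marginal probability of feature value s_k
    P1[k]  : true P(Z = 1 | s_k), strictly between 0 and 1
    Q1[k]  : approximation Q(Z = 1 | s_k), strictly between 0 and 1
    """
    atoms = []  # (probability under p(s, z), log-ratio log(P/Q)) for each (s, z)
    for ps, p1, q1 in zip(p_s, P1, Q1):
        atoms.append((ps * p1, math.log(p1 / q1)))
        atoms.append((ps * (1.0 - p1), math.log((1.0 - p1) / (1.0 - q1))))
    kl = sum(w * r for w, r in atoms)               # first moment
    kl2 = sum(w * r * r for w, r in atoms)          # second moment
    var = sum(w * (r - kl) ** 2 for w, r in atoms)  # variance, computed directly
    return kl2, var, kl

# Hypothetical three-point feature distribution.
kl2, var, kl = kl2_terms([0.5, 0.3, 0.2], [0.9, 0.5, 0.2], [0.8, 0.6, 0.3])
gap = kl2 - (var + kl ** 2)  # should vanish up to floating-point error
```

The variance is computed directly from its definition rather than from the identity, so the vanishing `gap` is a genuine check of the decomposition.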

5.4.2 Characterizing the factor that leads to the irreducible Type II error in finite-sample case

We also introduce another factor influencing testing power, which persists even in the absence of approximation error, i.e., $Q(z\mid\mathbf{s})=P(z\mid\mathbf{s})$ . To see this, we recall the information spectrum introduced in (Han & Verdú, 1993),

Definition 5.7.

(Information spectrum (Han & Verdú, 1993)) Let $\left(\mathbf{X},\mathbf{Y}\right)$ be a pair of random variables over the support $\mathcal{X}\times\mathcal{Y}$ . Let $p_{\mathbf{X}\mathbf{Y}}$ denote the joint distribution of $\left(\mathbf{X},\mathbf{Y}\right)$ , and let $p_{\mathbf{X}}$ and $p_{\mathbf{Y}}$ denote the marginal distributions of $\mathbf{X}$ and $\mathbf{Y}$ . Suppose $\{(\mathbf{X}_{i},\mathbf{Y}_{i})\}_{i=1}^{n}$ is a sequence of i.i.d. random variables with $(\mathbf{X}_{i},\mathbf{Y}_{i})\sim p_{\mathbf{X}\mathbf{Y}}(\mathbf{x},\mathbf{y})$ . Then, the information spectrum is the probability distribution of the following random variable,

\displaystyle\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})=\frac{1}{n}\sum_{i=1}^{n}\log\frac{p_{\mathbf{X}\mathbf{Y}}\left(\mathbf{X}_{i},\mathbf{Y}_{i}\right)}{p_{\mathbf{X}}(\mathbf{X}_{i})p_{\mathbf{Y}}(\mathbf{Y}_{i})},\quad\left(\mathbf{X},\mathbf{Y}\right)\sim p_{\mathbf{X}\mathbf{Y}}(\mathbf{x},\mathbf{y})

(14)

It is easy to see that the expectation of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$ is the mutual information $I(\mathbf{X};\mathbf{Y})$ for $(\mathbf{X},\mathbf{Y})\sim p_{\mathbf{X}\mathbf{Y}}\left(\mathbf{x},\mathbf{y}\right)$ . Substituting $(\mathbf{X},\mathbf{Y})\sim p_{\mathbf{X}\mathbf{Y}}\left(\mathbf{x},\mathbf{y}\right)$ in equation 14 with the feature-label variable pair $(\mathbf{S},Z)\sim p\left(\mathbf{s},z\right)$ in our two-sample testing problem recovers the (negated) normalized test statistic in equation 8 with $P(z)$ and $P(z\mid\mathbf{s})$ inserted, i.e., in the absence of approximation error.

(Han, 2000) leverages the dispersion of the information spectrum (the distribution of $\bar{I}\left(\mathbf{X}^{n};\mathbf{Y}^{n}\right)$ ) for $\{\left(\mathbf{X}_{i},\mathbf{Y}_{i}\right)\}_{i=1}^{n}$ to quantify the rate at which the Type II error goes to zero with increasing $n$ . The underlying rationale is that, for a larger variance of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$ , the probability of $\bar{I}(\mathbf{X}^{n};\mathbf{Y}^{n})$ falling outside the acceptance region for the alternative hypothesis also increases, resulting in a slower convergence rate for the Type II error. In our work, we make use of the variance of the log-likelihood ratio between $p(\mathbf{s},z)$ and $p(\mathbf{s})p(z)$ ,

\displaystyle\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)=n\text{Var}_{(\mathbf{S},Z)^{n}\sim p\left(\left(\mathbf{s},z\right)^{n}\right)}\bar{I}(\mathbf{S}^{n};Z^{n})=\text{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)}\left[-\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}\right].

(15)

Scaling $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ down by $n$ gives the variance of $\bar{I}\left(\mathbf{S}^{n};Z^{n}\right)$ , which characterizes the dispersion of the information spectrum for $\{(\mathbf{S}_{i},Z_{i})\}_{i=1}^{n}$ given $n$ . $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ is also known as the relative entropy variance (see, e.g., (2.29) in (Tan et al., 2014)). It remains present even in the absence of approximation error (i.e., $Q(z\mid\mathbf{s})=P(z\mid\mathbf{s})$ ). As we will see in Section 5.4.4, the persistent $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ leads to a non-zero Type II error in the finite-sample case.
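To make the relative entropy variance concrete, the sketch below computes the mutual information and the variance of the log-likelihood ratio for a small discrete joint distribution. The discrete feature, the function name `mi_and_relative_entropy_variance`, and the example tables are hypothetical and only meant to illustrate the definitions above.

```python
import math

def mi_and_relative_entropy_variance(joint):
    """joint[k][z] = p(s_k, z) for a discrete feature value s_k and label z in {0, 1}.

    Returns (I(S;Z), Var[log p(S,Z) / (p(S) p(Z))]) under the joint distribution.
    """
    p_s = [sum(row) for row in joint]
    p_z = [sum(row[z] for row in joint) for z in range(2)]
    atoms = [
        (joint[k][z], math.log(joint[k][z] / (p_s[k] * p_z[z])))
        for k in range(len(joint))
        for z in range(2)
        if joint[k][z] > 0.0  # zero-probability atoms do not contribute
    ]
    mi = sum(w * r for w, r in atoms)
    var = sum(w * (r - mi) ** 2 for w, r in atoms)
    return mi, var

# Hypothetical dependent and independent feature-label distributions.
mi_dep, var_dep = mi_and_relative_entropy_variance([[0.4, 0.1], [0.1, 0.4]])
mi_ind, var_ind = mi_and_relative_entropy_variance([[0.25, 0.25], [0.25, 0.25]])

# Standard deviation of the n-sample average (the information spectrum) for n = 100.
spread_at_100 = math.sqrt(var_dep / 100)
```

Dividing the variance by $n$ , as in the last line, gives the dispersion of the information spectrum for a given number of labeled points.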

5.4.3 An example of using the proposed framework

We first introduce the notation that will be used in the ensuing sections. We write $\mathcal{P}=\left\{A_{1},\cdots,A_{m}\right\}$ to denote a partition of the support $\mathcal{S}$ of $p(\mathbf{s})$ from which unlabeled sample features in $\mathcal{S}_{u}$ are generated; in other words, $\bigcup_{i=1}^{m}A_{i}=\mathcal{S}$ . We compare an example of our proposed framework with the baseline, where features are randomly sampled from $\mathcal{S}_{u}$ and labeled, and quantitatively analyze the testing power of both. The example and the baseline are described as follows:

In the example of using the proposed framework, the class priors $\left\{P\left(z\mid A_{i}\right)\right\}$ are given to simplify our analytical results; however, one can estimate these priors with labels in each $A_{i}$ and use the prior estimates in place of $\left\{P(z\mid A_{i})\right\}$ without changing the main argument of our theorem. In addition, the analyst chooses the partition element $A^{*}$ that $Q\left(z\mid\mathbf{s}\right)$ predicts to have the highest dependency between $\mathbf{S}$ and $Z$ , and conducts sequential testing only with the labeled points in $A^{*}$ . In contrast, the baseline conducts the same sequential test, except that the analyst queries the labels of features that are randomly sampled from $\mathcal{S}_{u}$ . Both the proposed framework and the baseline assume a fixed $Q\left(z\mid\mathbf{s}\right)$ that is not updated during the sequential testing; this is sufficient for our analysis, as we will see that the testing power in both cases depends on $D_{\text{KL}^{2}}\left(q(\mathbf{s},z)||p(\mathbf{s},z)\right)$ in equation 11, $D_{\text{KL}^{2}}\left(p(\mathbf{s},z)||q(\mathbf{s},z)\right)$ in equation 13, and $\text{Var}_{(\mathbf{S},Z)\sim p(\mathbf{s},z)}\bar{I}(\mathbf{S};Z)$ in equation 15.
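The selection of $A^{*}$ in the example can be sketched as follows. This is a minimal illustration under an assumption that is not stated in the paper: the dependency predicted by $Q$ within a partition element is estimated by a plug-in mutual information, namely the entropy of the average prediction minus the average entropy of the predictions. The helper names and the toy predictions are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) label; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def predicted_mi(q_probs):
    """Plug-in MI estimate from predictions Q(Z = 1 | s) on one partition element."""
    mean_q = sum(q_probs) / len(q_probs)
    mean_entropy = sum(binary_entropy(q) for q in q_probs) / len(q_probs)
    return binary_entropy(mean_q) - mean_entropy

def select_cell(cells):
    """cells maps a partition-element name to the predictions on its unlabeled features."""
    return max(cells, key=lambda name: predicted_mi(cells[name]))

# Hypothetical predictions: A1 is uninformative, A2 separates the labels well.
toy_cells = {"A1": [0.5, 0.5, 0.5, 0.5], "A2": [0.05, 0.95, 0.1, 0.9]}
chosen = select_cell(toy_cells)
```

Other dependency estimates could be substituted for `predicted_mi` without changing the selection logic.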

5.4.4 Finite-sample analysis for the example

We use $\epsilon_{1}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(q\left(\mathbf{s},z\right)\|p\left(\mathbf{s},z\right)\mid A\right)$ and $\epsilon_{2}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(p\left(\mathbf{s},z\right)\|q\left(\mathbf{s},z\right)\mid A\right)$ to capture the maximum approximation error of $Q(z\mid\mathbf{s})$ over the partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$ , and use $\sigma^{2}=\max\left\{\max_{A\in\mathcal{P}}\text{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\mid A\right)}\bar{I}(\mathbf{S};Z),\text{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)}\bar{I}(\mathbf{S};Z)\right\}$ to capture the maximum irreducible Type II error over the same partition $\mathcal{P}$ .

We will need to make the following assumptions before presenting our results.

Assumption 5.8.

(Maximum mutual information gain) $\max_{A\in\mathcal{P}}I\left(\mathbf{S};Z\mid A\right)-I(\mathbf{S};Z)=\Delta\geq 0$ .

Assumption 5.8 characterizes the largest MI gain of the proposed framework in the example over the baseline; this gain is the direct reason for the increased testing power of the proposed framework.

Assumption 5.9.

(Sufficient number of unlabeled samples) $\frac{\sum_{\mathbf{s}\in A\cap\mathcal{S}_{u}}\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{s}\right)}\left[\log\left(\frac{Q\left(Z\mid\mathbf{s}\right)}{P\left(Z\mid\mathbf{s}\right)}\right)\right]}{\left|A\cap\mathcal{S}_{u}\right|}\approx D_{\text{KL}}\left(Q\left(z\mid\mathbf{s}\right)\|P\left(z\mid\mathbf{s}\right)\mid A\right),\forall A\in\mathcal{P}$ .

Even though we typically have access to only a finite number of unlabeled samples in real-world scenarios, this number is usually quite large and affordable for many applications. Hence, similar to (Hanneke & Yang, 2015), Assumption 5.9 assumes a sufficient supply of unlabeled samples to simplify the analysis and concentrate solely on the number of labels needed for the proposed framework in the example.

Now, we present our theorem to address the testing power of the framework in the example and the baseline test in the finite-sample case.

Theorem 5.10.

Under Assumptions 5.8 and 5.9, the proposed framework in the example with a label budget $N_{q}$ and significance level $\alpha$ has a testing power of approximately at least

\displaystyle\Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_{q}}}+\sqrt{N_{q}}\left(I\left(\mathbf{S};Z\right)+\Delta-2\sqrt{\epsilon_{1}}-\sqrt{\epsilon_{2}}\right)}{\left(\epsilon_{1}+\sigma^{2}+2\sigma\sqrt{\epsilon_{1}}\right)^{1/2}}\right);

(17)

and the baseline test with features randomly sampled from $\mathcal{S}_{u}$ and labeled has a testing power of approximately at least

\displaystyle\Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_{q}}}+\sqrt{N_{q}}\left(I\left(\mathbf{S};Z\right)-\sqrt{\epsilon_{1}}\right)}{\left(\epsilon_{1}+\sigma^{2}+2\sigma\sqrt{\epsilon_{1}}\right)^{1/2}}\right).

(18)

Equation 17 and equation 18 give approximate lower bounds on the testing power of the proposed framework in the example and of the baseline test, respectively. We can observe that

• Given $\alpha$ , a larger budget $N_{q}$ and smaller approximation errors, characterized by $\epsilon_{1}$ , increase both lower bounds on the testing power, since equation 17 and equation 18 share the same structure.

• Comparing equation 17 for the proposed framework with equation 18 for the baseline, the extra $\Delta$ corresponds to the maximum power gain, and $\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}$ accounts for the reduction of this gain when the selected $A^{*}\in\mathcal{P}$ does not have the highest MI over $A\in\mathcal{P}$ .

• Even when the approximation errors vanish ( $\epsilon_{1}=0$ and/or $\epsilon_{2}=0$ ), both lower bounds on the testing power are still scaled down by $\sigma$ , which gives rise to the irreducible Type II error.

• When the maximum MI gain $\Delta$ exceeds $\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}$ and thus compensates for the approximation error of $Q\left(z\mid\mathbf{s}\right)$ , our framework in the example has a higher lower bound on the testing power than the baseline test given the same label budget $N_{q}$ and $\alpha$ .
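The two lower bounds in equation 17 and equation 18 are straightforward to evaluate numerically. The sketch below does so with the standard normal CDF written via `math.erf`; the function names and every numerical input are hypothetical values chosen only to illustrate the observations above, not quantities reported in the paper.

```python
import math

def normal_cdf(x):
    """Standard normal CDF Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_lower_bounds(mi, delta, eps1, eps2, sigma, n_q, alpha):
    """Return (bound for the proposed framework, bound for the baseline)."""
    # Shared denominator; note eps1 + sigma^2 + 2*sigma*sqrt(eps1) = (sigma + sqrt(eps1))^2.
    denom = math.sqrt(eps1 + sigma ** 2 + 2.0 * sigma * math.sqrt(eps1))
    shift = math.log(alpha) / math.sqrt(n_q)
    active_drift = mi + delta - 2.0 * math.sqrt(eps1) - math.sqrt(eps2)
    baseline_drift = mi - math.sqrt(eps1)
    active = normal_cdf((shift + math.sqrt(n_q) * active_drift) / denom)
    baseline = normal_cdf((shift + math.sqrt(n_q) * baseline_drift) / denom)
    return active, baseline

# Hypothetical setting in which the MI gain exceeds sqrt(eps1) + sqrt(eps2).
active_bound, baseline_bound = power_lower_bounds(
    mi=0.05, delta=0.1, eps1=0.0004, eps2=0.0004, sigma=0.5, n_q=200, alpha=0.05
)
```

Setting `delta=0` in the same call makes the bound for the proposed framework fall below the baseline bound, which mirrors the last observation in the list.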

6 Experimental Results

We have proposed a practical instantiation of the framework, and its algorithmic description, BQ-AST, is presented in Algorithm 1. In this section, we compare BQ-AST with a sequential testing baseline (Lhéritier & Cazals, 2018) that uses the same statistic in equation 2 but labels features randomly sampled from the unlabeled set $\mathcal{S}_{u}$ . In addition, we build $Q\left(z\mid\mathbf{s}\right)$ for the test statistic in equation 2 using logistic regression, SVM, or KNN classifiers; we set $N_{0}=10$ for the number of label queries used to initialize $Q\left(z\mid\mathbf{s}\right)$ , and set the significance level to $\alpha=0.05$ .

6.1 Experiments on Synthetic Datasets

Our first suite of experimental results is generated from synthetic data. We create synthetic datasets that comprise two samples of data to simulate cases under the null hypothesis $H_{0}$ and the alternative hypothesis $H_{1}$ ; the data for the first sample ( $Z=0$ ) is generated from $p\left(\mathbf{s}\mid Z=0\right)\equiv\mathcal{N}\left(\left(-\delta,0\right),I\right)$ and the data for the second sample ( $Z=1$ ) is generated from $p\left(\mathbf{s}\mid Z=1\right)\equiv\mathcal{N}\left(\left(\delta,0\right),I\right)$ . In addition, we vary $P(Z=0)$ from $0.5$ to $0.8$ to change the ratio of the data sizes of the two samples. For the simulations of the data under $H_{0}$ , we set $\delta=0$ , implying there is no difference between the distributions that generate the two samples; for the simulations of the data under $H_{1}$ , we vary $\delta$ from $0.2$ to $0.5$ to simulate two samples with small to large discrepancy. Having constructed the data-generating process, we simulate 200 cases of data for each pair of $P(Z=0)$ and $\delta$ under $H_{1}$ , and simulate 500 cases of data for each pair of $P(Z=0)$ and $\delta=0$ under $H_{0}$ . Each case of data is of size 2000 with labels masked, resulting in an unlabeled set $\mathcal{S}_{u}$ with $|\mathcal{S}_{u}|=2000$ . The proposed test actively and sequentially labels features in $\mathcal{S}_{u}$ to test the difference between the two samples.
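The data-generating process described above can be sketched with the Python standard library as follows. The function name `make_case` and the seed argument are hypothetical conveniences; the sketch only reproduces the stated distributions and is not the authors' simulation code.

```python
import random

def make_case(n, delta, prior_z0, seed=None):
    """Draw n feature-label pairs: Z = 0 with probability prior_z0, and
    S | Z = 0 ~ N((-delta, 0), I), S | Z = 1 ~ N((delta, 0), I).

    Setting delta = 0 simulates H0; delta > 0 simulates H1.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        z = 0 if rng.random() < prior_z0 else 1
        mean_x = -delta if z == 0 else delta
        # Identity covariance: the two coordinates are independent unit-variance Gaussians.
        features.append((rng.gauss(mean_x, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(z)
    return features, labels

# One case under H1; the labels would be masked from the analyst and revealed per query.
unlabeled_set, hidden_labels = make_case(n=2000, delta=0.2, prior_z0=0.5, seed=0)
```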

Figure 2 presents the empirical Type I errors, i.e., the probability, when $H_{0}$ is true, that the proposed test mistakenly concludes that the two samples are generated under $H_{1}$ . As observed, the empirical Type I errors are all smaller than $\alpha=0.05$ for the various classifiers and label budgets in the experiments; this provides empirical evidence for Theorem 5.1, which states that the Type I error is controlled to be smaller than the significance level $\alpha$ .

Table 1 presents the empirical Type II errors, i.e., the probability, when $H_{1}$ is true, that the proposed test and the baseline test mistakenly conclude that the two samples are generated under $H_{0}$ . Table 2 presents the average number of label queries spent to reject $H_{0}$ when $H_{1}$ is true. We can observe from Table 1 that the proposed test produces lower Type II errors than the baseline under different classifiers and label budgets; furthermore, in Table 2, we observe that the proposed test spends fewer label queries than the baseline test. Additionally, we run a two-sample t-test to assess the mean difference of the label query numbers generated by 200 runs of both methods. The resultant $p$ -values, truncated to six decimal places, are all zero, indicating that the number of labels spent by our framework is statistically significantly smaller than that of the baseline test. All these observations demonstrate that, under $H_{1}$ , the proposed test labels the features that have a high dependency on labels to effectively decrease the Type II error and reduce the number of label queries needed to reject $H_{0}$ .
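The kind of comparison behind these $p$ -values can be sketched from summary statistics alone. The sketch below uses Welch's unequal-variance form of the two-sample t-statistic with a normal approximation for the $p$ -value; the paper does not state which variant was used, so this choice is an assumption, as is reading the $\pm$ entries of Table 2 as standard deviations over 200 runs. The function name is hypothetical.

```python
import math

def welch_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t-statistic and a two-sided p-value from a normal approximation."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    t_stat = (mean_a - mean_b) / standard_error
    # Two-sided tail probability 2 * (1 - Phi(|t|)), written with erfc for accuracy.
    p_two_sided = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return t_stat, p_two_sided

# Logistic classifier, P(Z=0) = 0.5, label budget 1000 in Table 2:
# baseline 451.4 +/- 257 versus proposed 108.6 +/- 93, assuming 200 runs each.
t_stat, p_value = welch_t_from_summary(451.4, 257.0, 200, 108.6, 93.0, 200)
```

With 200 runs per method the normal approximation to the t distribution is close, which is why no t-distribution routine is needed in this sketch.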

Table 1: Under $H_{1}$ , Type II errors of conducting the proposed/baseline test with various classifiers and label budgets for the synthetic data generated by setting $\delta=0.2$ and different class priors $P(Z=0)$ . Due to the active query, our test produces lower Type II errors than the baseline for various label budgets.

$P(Z=0)$	Test	Logistic					KNN
	Label budget	200	400	600	800	1000	200	400	600	800	1000
$0.5$	Baseline	0.82	0.53	0.29	0.11	0.04	0.95	0.77	0.50	0.28	0.14
$0.5$	Proposed	0.16	0.02	0.00	0.00	0.00	0.49	0.17	0.06	0.03	0.01
$0.6$	Baseline	0.80	0.50	0.23	0.12	0.06	0.95	0.77	0.48	0.29	0.14
$0.6$	Proposed	0.26	0.06	0.01	0.01	0.01	0.59	0.26	0.09	0.03	0.01
$0.7$	Baseline	0.81	0.56	0.34	0.22	0.10	0.96	0.81	0.58	0.36	0.28
$0.7$	Proposed	0.26	0.04	0.01	0.01	0.01	0.71	0.33	0.14	0.04	0.02
$0.8$	Baseline	0.88	0.73	0.56	0.35	0.21	0.98	0.90	0.77	0.59	0.48
$0.8$	Proposed	0.38	0.10	0.04	0.03	0.02	0.80	0.50	0.28	0.16	0.10

Table 2: Under $H_{1}$ , average number of label queries needed to reject $H_{0}$ for the proposed/baseline test using various classifiers and label budgets in the synthetic data generated by setting $\delta=0.2$ and different class priors $P(Z=0)$ . Due to the active query, our test spends fewer label queries to reject $H_{0}$ than the baseline for various label budgets.

$P(Z=0)$	Test	Logistic					KNN
	Label budget	200	400	600	800	1000	200	400	600	800	1000
$0.5$	Baseline	183.5 $\pm$ 41	319.7 $\pm$ 113	399.1 $\pm$ 183	438.1 $\pm$ 233	451.4 $\pm$ 257	198.1 $\pm$ 10	374.4 $\pm$ 61	500.4 $\pm$ 132	578.1 $\pm$ 201	619.5 $\pm$ 254
$0.5$	Proposed	95.3 $\pm$ 64	108.1 $\pm$ 92	108.6 $\pm$ 93	108.6 $\pm$ 93	108.6 $\pm$ 93	162.1 $\pm$ 50	223.1 $\pm$ 116	240.8 $\pm$ 149	249.7 $\pm$ 173	252.8 $\pm$ 184
$0.6$	Baseline	182.3 $\pm$ 41	312.4 $\pm$ 116	386.0 $\pm$ 184	419.7 $\pm$ 231	439.0 $\pm$ 266	196.7 $\pm$ 16	373.7 $\pm$ 66	499.3 $\pm$ 134	578.9 $\pm$ 206	619.7 $\pm$ 256
$0.6$	Proposed	107.9 $\pm$ 70	134.2 $\pm$ 114	142.3 $\pm$ 136	143.7 $\pm$ 142	144.7 $\pm$ 147	166.3 $\pm$ 48	246.8 $\pm$ 123	282.2 $\pm$ 175	294.3 $\pm$ 200	296.6 $\pm$ 207
$0.7$	Baseline	184.0 $\pm$ 41	323.3 $\pm$ 113	415.5 $\pm$ 188	472.2 $\pm$ 252	505.0 $\pm$ 299	198.3 $\pm$ 11	378.5 $\pm$ 58	520.0 $\pm$ 127	613.4 $\pm$ 199	678.1 $\pm$ 268
$0.7$	Proposed	120.4 $\pm$ 67	143.4 $\pm$ 104	147.6 $\pm$ 117	149.0 $\pm$ 122	150.0 $\pm$ 128	178.0 $\pm$ 43	282.2 $\pm$ 117	327.4 $\pm$ 173	345.9 $\pm$ 207	351.7 $\pm$ 222
$0.8$	Baseline	190.8 $\pm$ 31	351.7 $\pm$ 96	479.6 $\pm$ 172	571.1 $\pm$ 245	628.0 $\pm$ 306	199.0 $\pm$ 8	386.6 $\pm$ 47	555.0 $\pm$ 106	689.5 $\pm$ 175	798.4 $\pm$ 253
$0.8$	Proposed	134.7 $\pm$ 64	174.8 $\pm$ 118	189.5 $\pm$ 151	195.6 $\pm$ 170	199.7 $\pm$ 186	184.4 $\pm$ 36	310.2 $\pm$ 111	387.7 $\pm$ 186	434.7 $\pm$ 247	462.6 $\pm$ 293

We present the average number of label queries spent for two samples with small to large discrepancies under $H_{1}$ in Table 3. A small discrepancy between two samples indicates a more difficult two-sample testing problem than one with a large discrepancy, as a two-sample test requires more data to detect a small discrepancy. Table 3 shows that the proposed active sequential test spends fewer labels to reject $H_{0}$ as the mean discrepancy $\delta$ between the two samples increases, which demonstrates that the proposed sequential test automatically adapts the number of label queries to the problem’s complexity.

Table 3: Under $H_{1}$ and label budget $N_{q}=1000$ , the average number of label queries needed to reject $H_{0}$ for different $\delta$ . When the mean difference $\delta$ increases between two samples, both our active sequential test and the baseline test reject $H_{0}$ with a reduced number of label queries spent, exhibiting the sequential test’s benefit that the tests adapt the label queries to the problem’s complexity. Due to the active query, our test spends fewer label queries to reject $H_{0}$ than the baseline for various $\delta$ .

$P(Z=0)$	Test	Logistic				KNN
	$\delta$	0.2	0.3	0.4	0.5	0.2	0.3	0.4	0.5
$0.5$	Baseline	451.4 $\pm$ 257	178.3 $\pm$ 105	101.0 $\pm$ 58	63.9 $\pm$ 32	619.5 $\pm$ 254	287.8 $\pm$ 129	167.4 $\pm$ 70	116.8 $\pm$ 43
$0.5$	Proposed	108.6 $\pm$ 93	37.3 $\pm$ 22	24.3 $\pm$ 10	19.7 $\pm$ 5	252.8 $\pm$ 184	109.5 $\pm$ 64	72.2 $\pm$ 33	54.9 $\pm$ 20
$0.6$	Baseline	439.0 $\pm$ 266	175.3 $\pm$ 118	96.9 $\pm$ 65	65.5 $\pm$ 40	619.7 $\pm$ 256	289.8 $\pm$ 130	170.2 $\pm$ 72	116.2 $\pm$ 47
$0.6$	Proposed	144.7 $\pm$ 147	40.5 $\pm$ 30	24.9 $\pm$ 11	20.1 $\pm$ 7	296.6 $\pm$ 207	134.3 $\pm$ 88	84.3 $\pm$ 43	58.3 $\pm$ 25
$0.7$	Baseline	505.0 $\pm$ 299	223.6 $\pm$ 145	115.7 $\pm$ 70	75.7 $\pm$ 47	678.1 $\pm$ 268	349.3 $\pm$ 178	198.2 $\pm$ 93	133.3 $\pm$ 56
$0.7$	Proposed	150.0 $\pm$ 128	57.1 $\pm$ 42	32.3 $\pm$ 21	22.2 $\pm$ 8	351.7 $\pm$ 222	160.2 $\pm$ 107	94.0 $\pm$ 54	67.0 $\pm$ 30
$0.8$	Baseline	628.0 $\pm$ 306	278.1 $\pm$ 177	149.3 $\pm$ 95	94.8 $\pm$ 56	798.4 $\pm$ 253	470.3 $\pm$ 223	268.7 $\pm$ 126	176.3 $\pm$ 81
$0.8$	Proposed	199.7 $\pm$ 186	66.7 $\pm$ 41	40.0 $\pm$ 22	29.4 $\pm$ 15	462.6 $\pm$ 293	198.8 $\pm$ 143	115.7 $\pm$ 65	83.8 $\pm$ 46

6.2 Experiments on MNIST

In addition to the synthetic datasets, we simulate the cases of $H_{0}$ and $H_{1}$ with MNIST (LeCun, 1998). To create a case for $H_{0}$ , we randomly pick one digit category from 0-9, then randomly sample images from the selected digit category, and lastly divide the images into sample zero ( $Z=0$ ) and sample one ( $Z=1$ ) based on a pre-defined class prior $P(Z=0)$ ; for each case, the two samples contain data from the same digit, but the digit category may differ across cases. To create a case for $H_{1}$ , we randomly pick two different digit categories from 0-9, then sample images from one digit category and place them in sample zero ( $Z=0$ ); to create sample one ( $Z=1$ ), we sample images from both digits, mix the sampled images, and place them in sample one. We set the mixture ratio to $0.7$ , meaning that roughly $30\%$ of the data in sample one is generated from a distribution different from that of sample zero. We also adjust $P(Z=0)$ to create cases with different ratios of the size of sample zero to that of sample one for $H_{1}$ . We produce 500 cases for $H_{0}$ and 200 cases for $H_{1}$ with the stated procedure for each $P(Z=0)$ that ranges from $0.5$ to $0.8$ ; each case comprises an unlabeled set $\mathcal{S}_{u}$ of size 2000 and its corresponding labels, which are unknown to the analyst. Instead of using the raw data in the created cases, we project the MNIST data to a 28-dimensional space with a convolutional autoencoder before conducting the two-sample testing.

We first present the empirical Type I errors in Figure 3. We use the support vector machine (SVM) to build $Q(z\mid\mathbf{s})$ to generate the results. As observed, all the Type I errors are smaller than $\alpha=0.05$ , which agrees with Theorem 5.1. In addition, we present the Type II errors in Table 4. The proposed test generates smaller Type II errors than the baseline sequential test for various classifiers, label budgets, and $P(Z=0)$ , implying that the proposed sequential testing combined with the active query is effective. This is further corroborated by Table 5, which reports the average number of label queries needed to reject $H_{0}$ ; the proposed test spends fewer label queries than the baseline test to reject $H_{0}$ . We additionally run a two-sample t-test to statistically compare the mean difference between the label query numbers generated by both methods. The resultant $p$ -values, truncated to six decimal places, are all zero, indicating that the number of labels spent by our framework is statistically significantly smaller than that of the baseline test in the MNIST experiment.

Table 4: Under $H_{1}$ , Type II errors of conducting the proposed/baseline test with various classifiers and label budgets for MNIST and different class priors $P(Z=0)$ . Due to the active query, our test produces lower Type II errors than the baseline for various label budgets.

$P(Z=0)$	Test	Logistic					SVM					KNN
	Label budget	200	400	600	800	1000	200	400	600	800	1000	200	400	600	800	1000
$0.5$	Baseline	0.65	0.21	0.02	0.01	0.01	0.59	0.07	0.00	0.00	0.00	0.84	0.43	0.15	0.07	0.03
$0.5$	Proposed	0.12	0.01	0.01	0.00	0.00	0.12	0.03	0.01	0.01	0.00	0.10	0.01	0.01	0.00	0.00
$0.6$	Baseline	0.59	0.16	0.02	0.01	0.01	0.55	0.04	0.00	0.00	0.00	0.89	0.43	0.15	0.06	0.03
$0.6$	Proposed	0.01	0.00	0.00	0.00	0.00	0.06	0.01	0.00	0.00	0.00	0.06	0.02	0.01	0.01	0.00
$0.7$	Baseline	0.58	0.21	0.04	0.01	0.00	0.67	0.15	0.01	0.00	0.00	0.91	0.58	0.29	0.10	0.04
$0.7$	Proposed	0.00	0.00	0.00	0.00	0.00	0.10	0.01	0.00	0.00	0.00	0.12	0.03	0.01	0.00	0.00
$0.8$	Baseline	0.66	0.24	0.04	0.01	0.01	0.77	0.32	0.10	0.01	0.01	0.95	0.71	0.47	0.27	0.12
$0.8$	Proposed	0.00	0.00	0.00	0.00	0.00	0.06	0.01	0.00	0.00	0.00	0.14	0.03	0.01	0.01	0.00

Table 5: Under $H_{1}$ , average number of label queries needed to reject $H_{0}$ for the proposed/baseline test using various classifiers and label budgets for MNIST and different class priors $P(Z=0)$ . Due to the active query, our test spends fewer label queries to reject $H_{0}$ than the baseline for various label budgets.

$P(Z=0)$	Test	Logistic					SVM					KNN
	Label budget	200	400	600	800	1000	200	400	600	800	1000	200	400	600	800	1000
$0.5$	Baseline	165.3 $\pm$	251.7 $\pm$ 126	267.9 $\pm$ 150	270.7 $\pm$ 158	271.7 $\pm$ 162	175.4 $\pm$	229.9 $\pm$	233.0 $\pm$	233.0 $\pm$	233.0 $\pm$	187.0 $\pm$	311.3 $\pm$	359.1 $\pm$ 141	376.8 $\pm$ 167	384.6 $\pm$ 185
$0.5$	Proposed	90.4 $\pm$	99.8 $\pm$	101.0 $\pm$	101.5 $\pm$	101.5 $\pm$	93.5 $\pm$	106.5 $\pm$	109.4 $\pm$	110.5 $\pm$ 105	110.7 $\pm$ 106	89.4 $\pm$	97.9 $\pm$	99.9 $\pm$	100.1 $\pm$	100.1 $\pm$
$0.6$	Baseline	160.8 $\pm$	233.2 $\pm$ 125	247.5 $\pm$ 148	249.3 $\pm$ 154	250.3 $\pm$ 158	173.5 $\pm$	226.8 $\pm$	229.8 $\pm$ 101	229.8 $\pm$ 101	229.8 $\pm$ 101	187.5 $\pm$	315.1 $\pm$	363.9 $\pm$ 142	379.8 $\pm$ 166	385.4 $\pm$ 178
$0.6$	Proposed	61.7 $\pm$	61.7 $\pm$	61.7 $\pm$	61.7 $\pm$	61.7 $\pm$	79.4 $\pm$	83.2 $\pm$	83.4 $\pm$	83.4 $\pm$	83.4 $\pm$	85.0 $\pm$	90.9 $\pm$	94.0 $\pm$	95.6 $\pm$	96.6 $\pm$ 103
$0.7$	Baseline	160.3 $\pm$	234.8 $\pm$ 128	255.2 $\pm$ 161	257.8 $\pm$ 167	258.0 $\pm$ 168	174.6 $\pm$	252.1 $\pm$ 109	264.6 $\pm$ 130	265.4 $\pm$ 133	265.4 $\pm$ 133	188.4 $\pm$	330.7 $\pm$	415.7 $\pm$ 162	451.8 $\pm$ 206	463.2 $\pm$ 225
$0.7$	Proposed	46.2 $\pm$	46.2 $\pm$	46.2 $\pm$	46.2 $\pm$	46.2 $\pm$	74.7 $\pm$	82.0 $\pm$	83.0 $\pm$	83.0 $\pm$	83.0 $\pm$	89.3 $\pm$	101.5 $\pm$	104.6 $\pm$	105.1 $\pm$ 101	105.1 $\pm$ 101
$0.8$	Baseline	163.9 $\pm$	243.3 $\pm$ 126	268.1 $\pm$ 163	273.1 $\pm$ 175	275.3 $\pm$ 183	92.6 $\pm$	148.5 $\pm$	167.3 $\pm$	171.7 $\pm$	172.6 $\pm$	192.8 $\pm$	357.3 $\pm$	471.8 $\pm$ 146	540.7 $\pm$ 210	575.4 $\pm$ 255
$0.8$	Proposed	34.8 $\pm$	34.8 $\pm$	34.8 $\pm$	34.8 $\pm$	34.8 $\pm$	77.2 $\pm$	81.6 $\pm$	82.3 $\pm$	82.3 $\pm$	82.3 $\pm$	104.8 $\pm$	116.9 $\pm$	119.1 $\pm$	120.1 $\pm$	120.3 $\pm$

6.3 Experiments on An Alzheimer’s Disease Dataset

Table 6: Under $H_{1}$ , Type II errors of conducting the proposed/baseline test with various classifiers and label budgets for ADNI and different class priors $P(Z=0)$ . Due to the active query, our test produces lower Type II errors than the baseline for various label budgets.

$P(Z=0)$	Test	Logistic					SVM					KNN
	Label budget	100	200	300	400	500	100	200	300	400	500	100	200	300	400	500
$0.5$	Baseline	0.32	0.06	0.01	0.00	0.00	0.67	0.17	0.02	0.00	0.00	0.72	0.49	0.25	0.13	0.04
$0.5$	Proposed	0.10	0.01	0.00	0.00	0.00	0.24	0.03	0.01	0.00	0.00	0.21	0.04	0.00	0.00	0.00
$0.6$	Baseline	0.35	0.04	0.00	0.00	0.00	0.62	0.15	0.01	0.00	0.00	0.73	0.25	0.06	0.01	0.00
$0.6$	Proposed	0.07	0.00	0.00	0.00	0.00	0.18	0.04	0.03	0.00	0.00	0.10	0.01	0.00	0.00	0.00
$0.7$	Baseline	0.40	0.10	0.01	0.00	0.00	0.65	0.21	0.06	0.00	0.00	0.81	0.36	0.12	0.04	0.02
$0.7$	Proposed	0.11	0.03	0.00	0.00	0.00	0.32	0.07	0.02	0.01	0.01	0.25	0.04	0.01	0.00	0.00
$0.8$	Baseline	0.52	0.23	0.07	0.01	0.00	0.89	0.53	0.27	0.07	0.02	0.90	0.59	0.28	0.16	0.07
$0.8$	Proposed	0.28	0.01	0.00	0.00	0.00	0.49	0.15	0.06	0.03	0.01	0.38	0.10	0.03	0.01	0.01

We demonstrate the utility of the proposed test in a clinical application using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (Jack Jr et al., 2008). The ADNI study protocol was approved by local institutional review boards (IRB). All personal information has been removed from the data provided to researchers. The motivation for applying the proposed test to Alzheimer’s disease research is as follows. Amyloid has been linked to the development of Alzheimer’s disease; identifying the amount of amyloid in the human brain is an important step in predicting the progression of Alzheimer’s disease. To measure the amyloid level, an expensive CT scan is required to assess the amyloid deposition in the brain. A useful alternative would be an easy-to-measure and inexpensive surrogate for amyloid that indicates the progression of Alzheimer’s disease. In the following experiments, we consider using digital test results that include five cognition measurement scores of participants as such a surrogate. To verify whether the digital test results are suitable, clinicians seek an approach to test the independence between the digital test results and the amyloid amount with a limited number of expensive CT scans to measure the amyloid levels. We use a binary version of the amyloid level where $Z=0$ and $Z=1$ indicate low and high amyloid deposition in the brain, respectively; we can then formulate a two-sample test and use the proposed scheme. As the results show, our proposed test combines sequential decision-making with active label queries, resulting in fewer CT scans needed compared with the conventional sequential test.

Table 7: Under $H_{1}$ , average number of label queries needed to reject $H_{0}$ for the proposed/baseline test using various classifiers and label budgets for ADNI and different class priors $P(Z=0)$ . Due to the active query, our test spends fewer label queries to reject $H_{0}$ than the baseline for various label budgets.

$P(Z=0)$	Test	Logistic					SVM					KNN
	Label budget	100	200	300	400	500	100	200	300	400	500	100	200	300	400	500
$0.5$	Baseline	68.1 $\pm$	83.7 $\pm$	85.5 $\pm$	85.6 $\pm$	85.6 $\pm$	87.0 $\pm$	127.2 $\pm$	135.4 $\pm$	136.1 $\pm$	136.1 $\pm$	86.2 $\pm$	145.8 $\pm$	181.8 $\pm$ 100	199.5 $\pm$ 124	207.3 $\pm$ 138
$0.5$	Proposed	43.9 $\pm$	47.1 $\pm$	47.1 $\pm$	47.1 $\pm$	47.1 $\pm$	64.0 $\pm$	75.1 $\pm$	76.5 $\pm$	76.6 $\pm$	76.6 $\pm$	69.3 $\pm$	76.6 $\pm$	77.8 $\pm$	77.8 $\pm$	77.8 $\pm$
$0.6$	Baseline	68.4 $\pm$	84.0 $\pm$	86.1 $\pm$	86.1 $\pm$	86.1 $\pm$	85.0 $\pm$	121.0 $\pm$	127.5 $\pm$	127.5 $\pm$	127.5 $\pm$	92.7 $\pm$	140.3 $\pm$	153.0 $\pm$	156.0 $\pm$	156.3 $\pm$
$0.6$	Proposed	43.9 $\pm$	45.3 $\pm$	45.3 $\pm$	45.3 $\pm$	45.3 $\pm$	61.3 $\pm$	70.0 $\pm$	72.9 $\pm$	74.5 $\pm$	74.5 $\pm$	60.8 $\pm$	64.4 $\pm$	64.4 $\pm$	64.4 $\pm$	64.4 $\pm$
$0.7$	Baseline	72.3 $\pm$	95.6 $\pm$	100.9 $\pm$	101.1 $\pm$	101.1 $\pm$	86.5 $\pm$	126.7 $\pm$	139.0 $\pm$	141.3 $\pm$	141.3 $\pm$	94.7 $\pm$	153.5 $\pm$	176.5 $\pm$	183.7 $\pm$	186.1 $\pm$
$0.7$	Proposed	50.6 $\pm$	56.6 $\pm$	57.1 $\pm$	57.1 $\pm$	57.1 $\pm$	68.8 $\pm$	85.1 $\pm$	89.0 $\pm$	90.0 $\pm$	90.5 $\pm$	68.6 $\pm$	78.5 $\pm$	79.9 $\pm$	79.9 $\pm$	79.9 $\pm$
$0.8$	Baseline	78.1 $\pm$	115.4 $\pm$	128.5 $\pm$	132.5 $\pm$	132.9 $\pm$	95.9 $\pm$	166.9 $\pm$	204.8 $\pm$	219.6 $\pm$ 101	222.8 $\pm$ 108	97.6 $\pm$	171.0 $\pm$	215.3 $\pm$	235.8 $\pm$ 106	247.6 $\pm$ 126
$0.8$	Proposed	63.6 $\pm$	72.0 $\pm$	72.2 $\pm$	72.2 $\pm$	72.2 $\pm$	80.1 $\pm$	108.6 $\pm$	118.1 $\pm$	121.9 $\pm$	124.0 $\pm$	80.7 $\pm$	98.3 $\pm$	102.4 $\pm$	103.9 $\pm$	104.9 $\pm$

The obtained ADNI data contain both the digital test results and the amyloid amount of the participants. We binarize the amyloid amount using the cut-off value suggested by ADNI to create two-sample cases, where $\mathbf{s}$ denotes a vector of cognition measurement scores and $z$ denotes a low or high amyloid amount for a participant. We create 200 data cases for each $P(Z=0)$ ranging from $0.5$ to $0.8$; these cases simulate $H_{1}$, and each case comprises an unlabeled set $\mathcal{S}_{u}$ of size 1000 and its corresponding labels, which are unknown to the analyst.
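The case construction described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins rather than ADNI measurements, `HYPOTHETICAL_CUTOFF` is a placeholder for the cut-off suggested by ADNI, and sampling without replacement from each class is one plausible way to reach a target $P(Z=0)$ (the text does not specify the sampling procedure). The helper names `binarize` and `make_case` are hypothetical.

```python
import random

def binarize(amyloid_values, cutoff):
    """Map each amyloid value to z = 1 (high, above the cut-off) or z = 0 (low)."""
    return [1 if value > cutoff else 0 for value in amyloid_values]

def make_case(features, labels, prior_z0, size, rng):
    """Draw `size` participants without replacement so that a fraction `prior_z0` has z = 0.

    Returns the unlabeled pool (features only) and the hidden labels an oracle would reveal.
    """
    idx0 = [i for i, z in enumerate(labels) if z == 0]
    idx1 = [i for i, z in enumerate(labels) if z == 1]
    n0 = round(prior_z0 * size)
    chosen = rng.sample(idx0, n0) + rng.sample(idx1, size - n0)
    rng.shuffle(chosen)
    pool = [features[i] for i in chosen]
    hidden_labels = [labels[i] for i in chosen]
    return pool, hidden_labels

# Synthetic stand-in for the ADNI data (hypothetical values, not real measurements).
rng = random.Random(0)
scores = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(6000)]
amyloid = [rng.uniform(0.0, 2.0) for _ in range(6000)]
HYPOTHETICAL_CUTOFF = 1.0  # placeholder; the paper uses the cut-off suggested by ADNI
labels = binarize(amyloid, HYPOTHETICAL_CUTOFF)

# One case of size 1000 per class prior P(Z=0).
cases = {prior: make_case(scores, labels, prior, 1000, rng) for prior in (0.5, 0.6, 0.7, 0.8)}
```

In this sketch the analyst would only see `pool`; entries of `hidden_labels` are revealed one at a time as labels are queried.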

Table 6 and Table 7 present the empirical Type II errors and the average number of label queries needed to reject $H_{0}$. Compared with the baseline test under the same label budgets, our proposed test reduces the Type II error by up to 58% and the number of label queries by up to 62%. Additionally, we run a two-sample t-test to compare the mean numbers of label queries used by the two methods. The resulting $p$-values, truncated to six decimal places, are all zero; this indicates that the label savings are statistically significant.
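As a quick arithmetic check of the label savings, the relative saving $(\text{baseline}-\text{proposed})/\text{baseline}$ can be computed from the mean query counts in Table 7. The sketch below uses only the label-budget-500 means copied from that table; the dictionary layout and the helper name `relative_saving` are illustrative choices, not part of the paper.

```python
# Mean label queries at label budget 500, copied from Table 7.
# Keys are (P(Z=0), classifier); values are (baseline mean, proposed mean).
table7_budget500 = {
    (0.5, "Logistic"): (85.6, 47.1),
    (0.5, "SVM"): (136.1, 76.6),
    (0.5, "KNN"): (207.3, 77.8),
    (0.6, "Logistic"): (86.1, 45.3),
    (0.6, "SVM"): (127.5, 74.5),
    (0.6, "KNN"): (156.3, 64.4),
    (0.7, "Logistic"): (101.1, 57.1),
    (0.7, "SVM"): (141.3, 90.5),
    (0.7, "KNN"): (186.1, 79.9),
    (0.8, "Logistic"): (132.9, 72.2),
    (0.8, "SVM"): (222.8, 124.0),
    (0.8, "KNN"): (247.6, 104.9),
}

def relative_saving(baseline, proposed):
    """Fraction of label queries saved by the proposed test relative to the baseline."""
    return (baseline - proposed) / baseline

savings = {key: relative_saving(b, p) for key, (b, p) in table7_budget500.items()}
best_setting = max(savings, key=savings.get)
print(best_setting, round(100 * savings[best_setting], 1))
```

Among these entries, the largest saving occurs for KNN with $P(Z=0)=0.5$ and is a little above 62%, in line with the figure quoted above.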

7 Conclusion

We propose an active sequential two-sample testing framework that sequentially and actively labels the data to increase the testing power and adapt the number of label queries to the problem’s complexity. We provide both finite-sample and asymptotic analysis of the proposed framework; the framework’s benefit is characterized by the change of the mutual information between feature and label variables over a random labeling scheme in both finite-sample and asymptotic cases. Moreover, we suggest an instantiation of the framework, in which we adopt the bimodal query that labels the features predicted by a classifier to have the highest class one or zero probabilities. Our experiments on synthetic data, MNIST, and an Alzheimer’s Disease dataset demonstrate the effectiveness of the suggested instantiation of the proposed framework.

Acknowledgement

This work was funded in part by the Office of Naval Research under grant N00014-21-1-2615 and by the National Science Foundation (NSF) under grants CNS-2003111 and CCF-2048223.

References

Aaditya Ramdas (2018) Aaditya Ramdas. Martingales, Ville and Doob, 2018. https://www.stat.cmu.edu/~aramdas/martingales18/L2-martingales.pdf.
Balsubramani & Ramdas (2015) Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. arXiv preprint arXiv:1506.03486, 2015.
Bessler (1960) Stuart A Bessler. Theory and applications of the sequential design of experiments, k-actions and infinitely many experiments. part i. theory. Technical report, Stanford Univ CA Applied Mathematics and Statistics Labs, 1960.
Blot & Meeter (1973) William J Blot and Duane A Meeter. Sequential experimental design procedures. Journal of the American Statistical Association, 68(343):586–593, 1973.
Chernoff (1959) Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
Doob (1939) JL Doob. Jean Ville, étude critique de la notion de collectif. Bulletin of the American Mathematical Society, 45(11):824–824, 1939.
Duan et al. (2022) Boyan Duan, Aaditya Ramdas, and Larry Wasserman. Interactive rank testing by betting. In Conference on Causal Learning and Reasoning, pp. 201–235. PMLR, 2022.
Dunn (1961) Olive Jean Dunn. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64, 1961.
Durrett (2019) Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
Friedman & Rafsky (1979) Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pp. 697–717, 1979.
Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
Györfi et al. (2002) László Györfi, Michael Köhler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
Hajnal (1961) J Hajnal. A two-sample sequential t-test. Biometrika, 48(1/2):65–75, 1961.
Han (2000) Te Sun Han. Hypothesis testing with the general source. arXiv preprint math/0004121, 2000.
Han & Verdú (1993) Te Sun Han and Sergio Verdú. Approximation theory of output statistics. IEEE Transactions on Information Theory, 39(3):752–772, 1993.
Hanneke & Yang (2015) Steve Hanneke and Liu Yang. Minimax analysis of active learning. J. Mach. Learn. Res., 16(1):3487–3602, 2015.
Hotelling (1992) Harold Hotelling. The generalization of Student’s ratio. In Breakthroughs in Statistics, pp. 54–65. Springer, 1992.
Jack Jr et al. (2008) Clifford R Jack Jr, Matt A Bernstein, Nick C Fox, Paul Thompson, Gene Alexander, Danielle Harvey, Bret Borowski, Paula J Britson, Jennifer L. Whitwell, Chadwick Ward, et al. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 27(4):685–691, 2008.
Johari et al. (2022) Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Always valid inference: Continuous monitoring of A/B tests. Operations Research, 70(3):1806–1821, 2022.
Keener (1984) Robert Keener. Second order efficiency in the sequential design of experiments. The Annals of Statistics, pp. 510–532, 1984.
Kiefer & Sacks (1963) J Kiefer and J Sacks. Asymptotically optimum sequential inference and design. The Annals of Mathematical Statistics, pp. 705–750, 1963.
Koga et al. (2002) H Koga et al. Information-spectrum methods in information theory, volume 50. Springer Science & Business Media, 2002.
LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Lhéritier & Cazals (2018) Alix Lhéritier and Frédéric Cazals. A sequential non-parametric multivariate two-sample test. IEEE Transactions on Information Theory, 64(5):3361–3370, 2018.
Li et al. (2022) Weizhi Li, Gautam Dasarathy, Karthikeyan Natesan Ramamurthy, and Visar Berisha. A label efficient two-sample test. In Uncertainty in Artificial Intelligence, pp. 1168–1177. PMLR, 2022.
Lopez-Paz & Oquab (2016) David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
Miller (2007) Steven J Miller. An introduction to linear programming. lecture notes, 2007.
Naghshvar & Javidi (2013) Mohammad Naghshvar and Tara Javidi. Active sequential hypothesis testing. The Annals of Statistics, 41(6):2703–2738, 2013.
Pandeva et al. (2022) Teodora Pandeva, Tim Bakker, Christian A Naesseth, and Patrick Forré. E-valuating classifier two-sample tests. arXiv preprint arXiv:2210.13027, 2022.
Ramdas et al. (2022) Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, and Glenn Shafer. Game-theoretic statistics and safe anytime-valid inference. arXiv preprint arXiv:2210.01948, 2022.
Shekhar & Ramdas (2021) Shubhanshu Shekhar and Aaditya Ramdas. Game-theoretic formulations of sequential nonparametric one-and two-sample tests. arXiv preprint arXiv:2112.09162, 2021.
Student (1908) Student. The probable error of a mean. Biometrika, 6(1):1–25, 1908.
Tan et al. (2014) Vincent YF Tan et al. Asymptotic estimates in information theory with non-vanishing error probabilities. Foundations and Trends® in Communications and Information Theory, 11(1-2):1–184, 2014.
Ville (1939) Jean Ville. Etude critique de la notion de collectif. Bull. Amer. Math. Soc, 45(11):824, 1939.
Wald (1992) Abraham Wald. Sequential tests of statistical hypotheses. In Breakthroughs in Statistics, pp. 256–298. Springer, 1992.
Wasserstein & Lazar (2016) Ronald L Wasserstein and Nicole A Lazar. The ASA statement on p-values: context, process, and purpose, 2016.
Welch (1990) William J Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.
Wilks (1938) Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The annals of mathematical statistics, 9(1):60–62, 1938.

Appendix A Proof of Theorem 5.1 and Its Preliminaries

A.1 Some statistical preliminaries

In probability theory, a sequence $\left\{X_{0},\cdots,X_{n}\right\}$ of random variables is called a martingale if, at any time, the conditional expectation of the next random variable given the observations so far equals the present observation; this is formally defined as follows.

Definition A.1.

(Martingale) A sequence of random variables $\{X_{0},\cdots,X_{n}\}$ is a martingale if, for any $n\geq 0$ ,

	$\displaystyle\mathop{\mathbb{E}}\left[|X_{n}|\right]$	$\displaystyle<\infty,$		(19)
	$\displaystyle\mathop{\mathbb{E}}\left[X_{n+1}\mid X_{0},\cdots,X_{n}\right]$	$\displaystyle=X_{n}.$		(20)

We refer interested readers to (Aaditya Ramdas, 2018) for a complete introduction to the martingale and its related properties.

Next, we state Ville’s maximal inequality (Ville, 1939), which will be applied to prove Theorem 5.1.

Theorem A.2.

(Ville’s maximal inequality (Ville, 1939)) If $\{X_{n}\}$ is a nonnegative martingale, then for any $c>0$, we have

	$\displaystyle P\left(\sup_{n\geq 0}X_{n}>c\right)\leq\frac{\mathop{\mathbb{E}}\left[X_{0}\right]}{c}.$		(21)

Ville’s maximal inequality gives a probability upper bound for the event that the martingale crosses a threshold $c$ ; it is a sequential extension of Markov’s inequality.
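The bound can be illustrated with a small Monte Carlo experiment. The sketch below is an illustration under assumptions that are not part of the paper: it uses a toy nonnegative martingale $X_{n}=X_{n-1}Y_{n}$ with $X_{0}=1$, where $Y_{n}$ equals $0.5$ or $1.5$ with equal probability (so $\mathbb{E}[Y_{n}]=1$), and it truncates the supremum to a finite horizon.

```python
import random

def crossing_frequency(c, horizon=100, trials=5000, seed=0):
    """Estimate P(max_{n <= horizon} X_n > c) for X_n = X_{n-1} * Y_n with X_0 = 1,
    where Y_n is 0.5 or 1.5 with equal probability, so that E[Y_n] = 1."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(trials):
        x = 1.0  # X_0 = 1, hence E[X_0] = 1
        for _ in range(horizon):
            x *= 1.5 if rng.random() < 0.5 else 0.5
            if x > c:  # the path has crossed the threshold c
                crossings += 1
                break
    return crossings / trials

for c in (2, 5, 10):
    # Ville's inequality predicts a crossing probability of at most E[X_0] / c = 1 / c.
    print(c, crossing_frequency(c), "bound:", 1 / c)
```

Up to Monte Carlo error, the estimated crossing frequencies should stay below $1/c$, even though the path is monitored at every step rather than at a single fixed time.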

A.2 Proof of Theorem 5.1

Proof.

Our proof comprises proving the following two ordered parts:
(1) The first part demonstrates that, under the null hypothesis $H_{0}$, the independence between the unqueried label random variables and the corresponding feature random variables still holds after the adaptive label queries. In particular, under $H_{0}$, the feature and label variables $\mathbf{S}_{i}$ and $Z_{i}$ used to construct the test statistic in equation 3 in the proposed framework are independent $\forall i\in\left[N_{q}\right]$.
(2) In the second part, we consider $\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i}(Z_{i}\mid\mathbf{S}_{i})}$, which is the test statistic in equation 2 with the true class prior $P(z)$ plugged in. The second part demonstrates the following inequalities under $H_{0}$:

	$\displaystyle P_{0}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{\hat{P}(Z_{i})}{Q_{i}\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\leq P_{0}\left(\exists n\in\left[N_{q}\right],\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)\leq\alpha.$		(22)

Equation 22 immediately implies that the Type I error of our proposed framework is upper-bounded by $\alpha$.

• Proof for the first part
We write $\mathcal{S}_{u}$ and $\mathcal{Z}_{u}$ to denote the set of original unlabeled feature variables available to the analyst and the set of unrevealed label variables held by an oracle. We write $\mathcal{S}^{l}_{i}$ and $\mathcal{Z}_{i}^{l}$ to denote the sets of labeled feature variables and the corresponding label variables after including the $i$-th pair $(\mathbf{S}_{i},Z_{i})$ to construct the statistic in equation 3. We use $\mathcal{S}_{i}^{u}=\mathcal{S}_{u}\setminus\mathcal{S}^{l}_{i}$ and $\mathcal{Z}_{i}^{u}=\mathcal{Z}_{u}\setminus\mathcal{Z}^{l}_{i}$ to denote their complements, which comprise the unlabeled feature and unrevealed label variables. In particular, we use $\mathcal{S}_{0}^{l}$ and $\mathcal{Z}_{0}^{l}$ to denote the feature and label variable sets used to initialize $Q_{1}\left(z\mid\mathbf{s}\right)$; $\mathcal{S}_{0}^{u}=\mathcal{S}_{u}\setminus\mathcal{S}_{0}^{l}$ and $\mathcal{Z}_{0}^{u}=\mathcal{Z}_{u}\setminus\mathcal{Z}_{0}^{l}$ are their complements. $H_{0}$ being true implies $\mathcal{S}_{u}\perp\!\!\!\!\perp\mathcal{Z}_{u}$. In our setting, the analyst randomly samples features and labels them to build $\mathcal{S}_{0}^{l}$ and $\mathcal{Z}_{0}^{l}$, implying $\mathcal{S}_{0}^{l}\perp\!\!\!\!\perp\mathcal{Z}_{0}^{l}$ and $\mathcal{S}_{0}^{u}\perp\!\!\!\!\perp\mathcal{Z}_{0}^{u}$ when $H_{0}$ is true. In the following, we use induction to prove that $\mathbf{S}_{i}$ and $Z_{i}$ are independent $\forall i\in\left[N_{q}\right]$.

Base case ( $i=1$ ): Under $H_{0}$ , we have $\mathcal{S}^{l}_{0}\perp\!\!\!\!\perp\mathcal{Z}^{l}_{0}$ and $\mathcal{S}^{u}_{0}\perp\!\!\!\!\perp\mathcal{Z}^{u}_{0}$ . The analyst first initializes $Q_{1}(z\mid\mathbf{s})$ with $\mathcal{S}^{l}_{0}\perp\!\!\!\!\perp\mathcal{Z}^{l}_{0}$ before starting the sequential testing. Subsequently, the analyst makes a query on a label based on the prediction of $Q_{1}(z\mid\mathbf{s})$ and includes the first variable pair $(\mathbf{S}_{1},Z_{1})$ to construct the test statistic. That immediately implies $\mathbf{S}_{1}\perp\!\!\!\!\perp Z_{1}$ , $\mathcal{S}^{l}_{1}\perp\!\!\!\!\perp\mathcal{Z}^{l}_{1}$ and $\mathcal{S}^{u}_{1}\perp\!\!\!\!\perp\mathcal{Z}^{u}_{1}$ .

Induction step: Suppose $\mathcal{S}^{u}_{i}\perp\!\!\!\!\perp\mathcal{Z}^{u}_{i}$ and $\mathcal{S}^{l}_{i}\perp\!\!\!\!\perp\mathcal{Z}^{l}_{i}$. With these independence relations holding, the analyst updates $Q_{i-1}\left(z\mid\mathbf{s}\right)$ to $Q_{i}\left(z\mid\mathbf{s}\right)$, makes a query on a label based on the prediction of $Q_{i}(z\mid\mathbf{s})$, and includes the $(i+1)$-th variable pair $\left(\mathbf{S}_{i+1},Z_{i+1}\right)$ to update the statistic. That immediately implies $\mathbf{S}_{i+1}\perp\!\!\!\!\perp Z_{i+1}$, $\mathcal{S}^{u}_{i+1}\perp\!\!\!\!\perp\mathcal{Z}^{u}_{i+1}$, and $\mathcal{S}^{l}_{i+1}\perp\!\!\!\!\perp\mathcal{Z}^{l}_{i+1}$.

Combining the base case and the induction step leads to $\mathbf{S}_{i}\perp\!\!\!\!\perp Z_{i},\forall i\in\left[N_{q}\right]$ under $H_{0}$.

• Proof for the second part
Suppose $\left(\left(\mathbf{s},z\right)_{i}\right)_{i=1}^{n}$ is a sequence of realizations of $\left(\left(\mathbf{S},Z\right)_{i}\right)_{i=1}^{n}$ collected under $H_{0}$ with the proposed framework. We use $\phi$ to denote the class-one prior probability parameter, so that $P\left(z_{1},\cdots,z_{n}\mid\phi\right)$ is the likelihood function of $\phi$. Maximizing $P\left(z_{1},\cdots,z_{n}\mid\phi\right)$ over the prior parameter $\phi$ leads to the solution $\phi^{*}=\frac{\sum_{i=1}^{n}z_{i}}{n}$. In other words, $P\left(z_{1},\cdots,z_{n}\mid\phi^{*}\right)=\prod_{i=1}^{n}\hat{P}(z_{i})$ is the maximized likelihood obtained from $(z_{i})_{i=1}^{n}$, where $\phi^{*}=\hat{P}(Z=1)$. We use $P(Z=1)$ to denote the true class-one prior probability under $H_{0}$; plugging $P(Z=1)$ into $\phi$ leads to the true likelihood $\prod_{i=1}^{n}P(z_{i})$ for $(z_{i})_{i=1}^{n}$ under $H_{0}$. Because $\phi^{*}$ maximizes the likelihood, $\prod_{i=1}^{n}\hat{P}(z_{i})\geq\prod_{i=1}^{n}P(z_{i})$, and thus $\prod_{i=1}^{n}\frac{\hat{P}(z_{i})}{Q_{i}(z_{i}\mid\mathbf{s}_{i})}\geq\prod_{i=1}^{n}\frac{P(z_{i})}{Q_{i}(z_{i}\mid\mathbf{s}_{i})}$ for any realization $(z_{i})_{i=1}^{n}$ of $(Z_{i})_{i=1}^{n}$ under $H_{0}$. As a result, we have $P_{0}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{\hat{P}(Z_{i})}{Q_{i}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)\leq P_{0}\left(\exists n\in\left[N_{q}\right],\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i}\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)$.
Lastly, we prove $P_{0}\left(\exists n\in\left[N_{q}\right],\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i}\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\leq\alpha$. We let $\tilde{W}^{\prime}_{n}\equiv\frac{1}{\tilde{W}_{n}}$. Therefore, $\tilde{W}^{\prime}_{n}=\tilde{W}^{\prime}_{n-1}\frac{Q_{n}(Z_{n}\mid\mathbf{S}_{n})}{P(Z_{n})}$ with $\tilde{W}^{\prime}_{0}\equiv 1$ for $n\in\left[N_{q}\right]$. The sequence $(\tilde{W}^{\prime}_{i})_{i=1}^{n}$ is a non-negative martingale under $H_{0}$, given

$\displaystyle\mathop{\mathbb{E}}\left[\tilde{W}^{\prime}_{n}\mid\tilde{W}^{\prime}_{1},\cdots,\tilde{W}^{\prime}_{n-1}\right]$	$\displaystyle=\mathop{\mathbb{E}}\left[\tilde{W}^{\prime}_{n-1}\frac{Q_{n}(Z_{n}\mid\mathbf{S}_{n})}{P(Z_{n})}\,\middle|\,\tilde{W}^{\prime}_{1},\cdots,\tilde{W}^{\prime}_{n-1}\right]$	(23)
	$\displaystyle=\tilde{W}^{\prime}_{n-1}\mathop{\mathbb{E}}\left[\frac{Q_{n}(Z_{n}\mid\mathbf{S}_{n})}{P(Z_{n})}\,\middle|\,\tilde{W}^{\prime}_{1},\cdots,\tilde{W}^{\prime}_{n-1}\right]$	(24)
	$\displaystyle=\tilde{W}^{\prime}_{n-1}\mathop{\mathbb{E}}\left[\sum_{z=0}^{1}P(Z_{n}=z)\frac{Q_{n}(Z_{n}=z\mid\mathbf{S}_{n})}{P(Z_{n}=z)}\right]$	(25)
	$\displaystyle=\tilde{W}^{\prime}_{n-1}.$	(26)

Equation 25 uses the independence of $Z_{n}$ and $\mathbf{S}_{n}$ under $H_{0}$ established in the first part, and equation 26 follows because $\sum_{z=0}^{1}Q_{n}(Z_{n}=z\mid\mathbf{S}_{n})=1$.

Using Ville’s maximal inequality in Theorem A.2 leads to the following: For any $\alpha>0$ , we have

	$\displaystyle P\left(\sup_{n\in\left[N_{q}\right]}\tilde{W}^{\prime}_{n}>\frac{1}{\alpha}\right)\leq\alpha\mathop{\mathbb{E}}\left[\tilde{W}^{\prime}_{0}\right]=\alpha$		(27)
	$\displaystyle\equiv P\left(\sup_{n\in\left[N_{q}\right]}\frac{1}{\tilde{W}_{n}}>\frac{1}{\alpha}\right)\leq\alpha\mathop{\mathbb{E}}\left[\frac{1}{\tilde{W}_{0}}\right]=\alpha$		(28)
	$\displaystyle\equiv P\left(\inf_{n\in\left[N_{q}\right]}\tilde{W}_{n}\leq\alpha\right)\leq\alpha$		(29)

Therefore, we have $P_{0}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{\hat{P}(Z_{i})}{Q_{i}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)\leq P_{0}\left(\exists n\in\left[N_{q}\right],\tilde{W}_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q_{i}(Z_{i}\mid\mathbf{S}_{i})}\leq\alpha\right)\leq\alpha$.

∎
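The two parts of the proof can be illustrated by simulation under $H_{0}$. The sketch below is a toy setup and not the paper's instantiation: features are uniform on $[0,1]$, the classifier is a hypothetical two-bin frequency estimate with add-one smoothing, the query rule picks the bin with the most confident prediction, and labels are drawn independently of the features with a known $P(Z=1)$. It tracks both the plug-in statistic $W_{n}$ and the true-prior statistic $\tilde{W}_{n}$ on the log scale.

```python
import math
import random

def xlogy(x, y):
    """Return x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def run_once(rng, n_pool=400, n_queries=100, p1=0.5, alpha=0.05):
    """One sequential test under H0; returns (W_n rejected, W-tilde_n rejected)."""
    # Unlabeled pool, split into two feature bins (s < 0.5 and s >= 0.5).
    pool = {0: [], 1: []}
    for _ in range(n_pool):
        s = rng.random()
        pool[int(s >= 0.5)].append(s)
    counts = {0: [1, 1], 1: [1, 1]}  # smoothed counts of [z = 0, z = 1] per bin
    log_den, n1 = 0.0, 0
    reject_plugin = reject_true = False
    for n in range(1, n_queries + 1):
        # Q_n(Z = 1 | s) per bin, fitted only on labels revealed before step n.
        q1 = {b: counts[b][1] / sum(counts[b]) for b in (0, 1)}
        # Active query: take a feature from the non-empty bin with the most confident prediction.
        b = max((k for k in (0, 1) if pool[k]), key=lambda k: abs(q1[k] - 0.5))
        pool[b].pop()
        z = 1 if rng.random() < p1 else 0  # under H0 the label ignores the feature
        log_den += math.log(q1[b] if z == 1 else 1.0 - q1[b])
        counts[b][z] += 1
        n1 += z
        n0 = n - n1
        # log of W-tilde_n (true prior) and of W_n (maximum-likelihood prior n1 / n).
        log_w_true = xlogy(n1, p1) + xlogy(n0, 1.0 - p1) - log_den
        log_w_plugin = xlogy(n1, n1 / n) + xlogy(n0, n0 / n) - log_den
        reject_true = reject_true or log_w_true <= math.log(alpha)
        reject_plugin = reject_plugin or log_w_plugin <= math.log(alpha)
    return reject_plugin, reject_true

rng = random.Random(0)
results = [run_once(rng) for _ in range(2000)]
type1_plugin = sum(r[0] for r in results) / len(results)
type1_true = sum(r[1] for r in results) / len(results)
print(type1_plugin, type1_true)
```

Because $W_{n}\geq\tilde{W}_{n}$ on every path, the plug-in statistic can only reject when the true-prior statistic does, and both empirical Type I error rates should stay at or below $\alpha$ up to Monte Carlo error.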

Appendix B Proof of Theorem 5.4

Proof.

In the following, we formulate an optimization problem that seeks a marginal distribution $g(\mathbf{s})$ maximizing the mutual information (MI) between $\mathbf{S}$ and $Z$, where $\left(\mathbf{S},Z\right)\sim g(\mathbf{s})P\left(z\mid\mathbf{s}\right)$. Solving this optimization problem leads to a consistent bimodal query (see Definition 5.2), which asymptotically minimizes the test statistic in equation 2.

• Constructing an optimization problem that maximizes MI
We write $g\left(\mathbf{s}\right)$ to denote an arbitrary probability distribution of ${\mathbf{s}}$. Recall that $P(z\mid\mathbf{s})$ and $p(\mathbf{s})$ denote the class probability given $\mathbf{s}$ and the marginal probability distribution of $\mathbf{s}$ for the two-sample testing problem at hand; we write $g\left(\mathbf{s},z\right)=g(\mathbf{s})P(z\mid\mathbf{s})$ and $G(z)=\int g\left(\mathbf{s},z\right)d\mathbf{s}$ to denote the joint probability distribution and the class prior of a new two-sample testing problem in which the original $p(\mathbf{s})$ is replaced by $g(\mathbf{s})$. The mutual information (MI) that characterizes the new two-sample testing problem is as follows:

	$\displaystyle\text{MI}=-\sum_{z=0}^{1}G(z)\log\left(G(z)\right)+\int\left(\sum_{z=0}^{1}P(z\mid\mathbf{s})\log\left(P\left(z\mid\mathbf{s}\right)\right)\right)g(\mathbf{s})d\mathbf{s}$		(30)
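For a feature that takes finitely many values, equation 30 is the entropy of the class prior minus the average conditional entropy, and it can be evaluated directly. The sketch below is illustrative only: the five values of $P(Z=0\mid\mathbf{s})$ and the two candidate distributions $g$ are made-up numbers, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution, with 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(g, p0_given_s):
    """Discrete version of equation 30: H(Z) - H(Z | S) when S ~ g and P(Z=0 | s) is given."""
    G0 = sum(gi * p for gi, p in zip(g, p0_given_s))  # class prior G(Z = 0)
    marginal_entropy = entropy([G0, 1.0 - G0])
    conditional_entropy = sum(gi * entropy([p, 1.0 - p]) for gi, p in zip(g, p0_given_s))
    return marginal_entropy - conditional_entropy

# Hypothetical class probabilities P(Z = 0 | s) at five feature values.
p0_given_s = [0.9, 0.7, 0.5, 0.3, 0.1]
uniform = [0.2, 0.2, 0.2, 0.2, 0.2]   # label features uniformly at random
bimodal = [0.5, 0.0, 0.0, 0.0, 0.5]   # label only the two most class-informative features

print(mutual_information(uniform, p0_given_s), mutual_information(bimodal, p0_given_s))
```

In this toy example, concentrating $g$ on the two feature values with the most extreme class probabilities yields a larger MI than the uniform choice, which is the effect the optimization problem below formalizes.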

We expand equation 30 and consider the following optimization problem,

	$\displaystyle\max_{g(\mathbf{s})}-\sum_{z=0}^{1}\left(\int P(z\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}\right)\log\left(\int P(z\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}\right)+\int\left(\sum_{z=0}^{1}P(z\mid\mathbf{s})\log\left(P(z\mid\mathbf{s})\right)\right)g(\mathbf{s})d\mathbf{s}$		(31)

In other words, equation 31 seeks a $g(\mathbf{s})$ that maximizes the MI of a new two-sample testing problem with $P(z\mid\mathbf{s})$ provided by the original two-sample testing problem. In what follows, we will see that solving equation 31 leads to a probability distribution that results from a consistent bimodal query (see Definition 5.2), proving the asymptotic property in Theorem 5.4. Instead of directly solving equation 31, we fix $G(Z=0)=\int P(Z=0\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}=u$, which makes the first term of equation 31 constant, and resort to finding the solution of the following:

$\displaystyle\min_{g(\mathbf{s})}\quad$	$\displaystyle-\int\left(\sum_{z=0}^{1}P(z\mid\mathbf{s})\log\left(P\left(z\mid\mathbf{s}\right)\right)\right)g(\mathbf{s})d\mathbf{s}$	(32)
s.t.	$\displaystyle\int P(Z=0\mid\mathbf{s})g(\mathbf{s})d\mathbf{s}=u,$	(33)
	$\displaystyle\int g(\mathbf{s})d\mathbf{s}=1,$	(34)
	$\displaystyle g(\mathbf{s})\geq 0,\forall\mathbf{s}\in\mathcal{S}.$	(35)

Then, we approximate equation 32 with a discrete version by partitioning the sample space $\mathcal{S}$ into $L>2$ balls $\{B\left(\mathbf{s}_{i},r\right)\}_{i=1}^{L}$. Each ball $B\left(\mathbf{s},r\right)\in\{B\left(\mathbf{s}_{i},r\right)\}_{i=1}^{L}$ has radius $r$ and is centered at $\mathbf{s}$, leading to the approximation $\hat{P}(Z=0\mid\mathbf{s})=\int P(Z=0\mid\mathbf{s}^{\prime})p\left(\mathbf{s}^{\prime}\mid B(\mathbf{s},r)\right)d\mathbf{s}^{\prime}$ and the probability mass function $G(\mathbf{s})=\int_{\mathbf{s}^{\prime}\in B\left(\mathbf{s},r\right)}g(\mathbf{s}^{\prime})d\mathbf{s}^{\prime}$. Hence, we approximate equation 32 by the following linear program (LP):

$\displaystyle\min_{G(\mathbf{s})}\quad$	$\displaystyle\sum_{i=1}^{L}H_{i}(Z)G(\mathbf{s}_{i})$	(36)
s.t.	$\displaystyle\sum_{i=1}^{L}\hat{P}(Z=0\mid\mathbf{s}_{i})G(\mathbf{s}_{i})=u,$	(37)
	$\displaystyle\sum_{i=1}^{L}G(\mathbf{s}_{i})=1,$	(38)
	$\displaystyle G(\mathbf{s}_{i})\geq 0,\forall i\in\left[L\right].$	(39)

where $H_{i}(Z)=-\sum_{z=0}^{1}\hat{P}(z\mid\mathbf{s}_{i})\log\left(\hat{P}(z\mid\mathbf{s}_{i})\right),\forall i\in\left[L\right]$ are the constant coefficients of the LP in equation 36.

• Solving the optimization problem
The constraints in equation 37 and equation 38, together with non-negativity, define the region of feasible solutions to the LP in equation 36; we write this region as $U=\{\mathbf{x}\mid\mathbf{x}\text{ is non-negative and satisfies equation 37 and equation 38}\}$, where $\mathbf{x}=\left(G\left(\mathbf{s}_{1}\right),\cdots,G\left(\mathbf{s}_{L}\right)\right)$. In addition, we need the definition of a basic solution to a system of linear equations, which is well known in linear algebra.

Definition B.1.

(Basic solutions) Let $A\mathbf{x}=b$ be a system of linear equations. Let $x_{j_{1}},\cdots,x_{j_{k}}$ be the positive entries of $\mathbf{x}$ and let all other entries be zero. If the corresponding columns $A_{j_{1}},\cdots,A_{j_{k}}$ of $A$ are linearly independent, then $\mathbf{x}$ is a basic solution to the system.

Moreover, we will apply the following theorems to derive the optimal feasible solution of the LP.

Theorem B.2.

If the feasible region of an LP is bounded, then at least one optimal solution occurs at a vertex of the corresponding polytope (or the feasible region).

Theorem B.3.

Let $U$ be the feasible region of a linear program. Then, $\mathbf{x}\in U$ is a basic feasible solution if and only if $\mathbf{x}$ is a vertex of $U$.

Theorem B.2 and Theorem B.3 are well known in LP; we refer interested readers to (Miller, 2007) for their proofs. Since the LP in equation 36 has only two equality constraints (equation 37 and equation 38), at most two columns of its constraint matrix can be linearly independent, so a basic feasible solution has at most two non-zero entries. Theorem B.2 and Theorem B.3 therefore suggest that one optimal solution of equation 36 is a vector $\left(G\left(\mathbf{s}_{1}\right),\cdots,G\left(\mathbf{s}_{L}\right)\right)$ with at most two non-zero entries. Herein, we write $G(\mathbf{s}_{q_{0}})$ and $G(\mathbf{s}_{q_{1}})$ to denote the two non-zero entries. That reduces the LP in equation 36 to the following:

$\displaystyle\max_{q_{0},q_{1}}$	$\displaystyle\left(\sum_{z=0}^{1}\hat{P}\left(z\mid\mathbf{s}_{q_{0}}\right)\log\hat{P}\left(z\mid\mathbf{s}_{q_{0}}\right)\right)G\left(\mathbf{s}_{q_{0}}\right)+\left(\sum_{z=0}^{1}\hat{P}\left(z\mid\mathbf{s}_{q_{1}}\right)\log\hat{P}\left(z\mid\mathbf{s}_{q_{1}}\right)\right)G\left(\mathbf{s}_{q_{1}}\right)$	(40)
s.t.	$\displaystyle\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)G\left(\mathbf{s}_{q_{0}}\right)+\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)G\left(\mathbf{s}_{q_{1}}\right)=u,$	(41)
	$\displaystyle G(\mathbf{s}_{q_{0}})+G(\mathbf{s}_{q_{1}})=1,$	(42)
	$\displaystyle G(\mathbf{s}_{q_{0}})\geq 0,G(\mathbf{s}_{q_{1}})\geq 0.$	(43)
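The reduction to two non-zero entries can be checked numerically by enumerating the basic feasible solutions of the LP in equation 36. The sketch below is an illustration with made-up values of $\hat{P}(Z=0\mid\mathbf{s}_{i})$ and $u$; the function `solve_lp_by_vertices` is a hypothetical helper that enumerates pairs of indices rather than calling an LP solver.

```python
import math
from itertools import combinations

def binary_entropy(p):
    """H_i(Z) for a ball with P_hat(Z=0 | s_i) = p, with 0 * log(0) = 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def solve_lp_by_vertices(p_hat, u):
    """Minimize sum_i H_i G_i subject to sum_i p_hat_i G_i = u, sum_i G_i = 1, G >= 0,
    by enumerating basic feasible solutions, which have at most two non-zero entries."""
    best = None
    for i, j in combinations(range(len(p_hat)), 2):
        if p_hat[i] == p_hat[j]:
            continue  # the two constraint columns are linearly dependent
        # Solve the two equality constraints for the pair (G_i, G_j).
        gi = (u - p_hat[j]) / (p_hat[i] - p_hat[j])
        gj = 1.0 - gi
        if gi < 0 or gj < 0:
            continue  # violates non-negativity, so not feasible
        value = gi * binary_entropy(p_hat[i]) + gj * binary_entropy(p_hat[j])
        if best is None or value < best[0]:
            best = (value, {i: gi, j: gj})
    return best

# Hypothetical discretized class probabilities P_hat(Z = 0 | s_i) and target prior u.
p_hat = [0.15, 0.35, 0.5, 0.65, 0.9]
u = 0.5
value, weights = solve_lp_by_vertices(p_hat, u)
print(value, weights)
```

For these values, the best vertex places all mass on the balls with the smallest and the largest $\hat{P}(Z=0\mid\mathbf{s}_{i})$, which matches the bimodal structure derived in the remainder of the proof.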

For the sake of simplifying the expressions in what follows, we write

$\displaystyle T_{0}$	$\displaystyle=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right),$	(44)
$\displaystyle T_{1}$	$\displaystyle=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right),$	(45)
$\displaystyle T_{2}$	$\displaystyle=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right),$	(46)
$\displaystyle T_{3}$	$\displaystyle=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\log\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)+\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\right)\log\left(1-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right).$	(47)

Then, equation 40 is re-expressed by the following,

$\displaystyle\max_{q_{0},q_{1}}$	$\displaystyle\frac{T_{0}\left(u-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)}{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}+\frac{T_{1}\left(\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-u\right)}{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}$	(48)
s.t.	$\displaystyle\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)>0,$	(49)
	$\displaystyle\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\leq u,$	(50)
	$\displaystyle\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)\geq u.$	(51)

Equation 48 is an optimization problem that finds $\left\{\hat{P}\left(z\mid\mathbf{s}_{q_{0}}\right),\hat{P}\left(z\mid\mathbf{s}_{q_{1}}\right)\right\}\subset\{\hat{P}\left(z\mid\mathbf{s}_{i}\right)\}_{i=1}^{L}$ to maximize the objective function. Herein, we write

$\displaystyle A$	$\displaystyle=\frac{T_{0}}{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)},$	(52)
$\displaystyle B$	$\displaystyle=u-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right),$	(53)
$\displaystyle C$	$\displaystyle=\frac{T_{1}}{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)},$	(54)
$\displaystyle D$	$\displaystyle=\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-u.$	(55)

Now, we analyze the derivatives of equation 48 by checking the partial derivatives of $A$ , $B$ , $C$ and $D$ with respect to $\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)$ and $\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)$ :

$\displaystyle\frac{\partial A}{\partial\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)}$	$\displaystyle=\frac{-T_{2}}{\left(\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)^{2}}>0,$	(56)
$\displaystyle\frac{\partial A}{\partial\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}$	$\displaystyle=\frac{T_{0}}{\left(\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)^{2}}<0,$	(57)
$\displaystyle\frac{\partial B}{\partial\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}$	$\displaystyle=-1,$	(58)
$\displaystyle\frac{\partial C}{\partial\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)}$	$\displaystyle=\frac{-T_{1}}{\left(\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)^{2}}>0,$	(59)
$\displaystyle\frac{\partial C}{\partial\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}$	$\displaystyle=\frac{T_{3}}{\left(\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)\right)^{2}}<0,$	(60)
$\displaystyle\frac{\partial D}{\partial\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)}$	$\displaystyle=1.$	(61)

Therefore, equation 48 is a function that monotonically increases with increasing $\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)$ and decreasing $\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)$ , implying that the optimal solution to equation 36 has the following probability mass function $G^{*}$ ,

$\displaystyle G^{*}\left(\mathbf{s}_{q_{0}}\right)$	$\displaystyle=\frac{u-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)},\quad\mathbf{s}_{q_{0}}=\arg\max_{\mathbf{s}}\hat{P}\left(Z=0\mid\mathbf{s}\right),$	(62)
$\displaystyle G^{*}\left(\mathbf{s}_{q_{1}}\right)$	$\displaystyle=\frac{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-u}{\hat{P}\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-\hat{P}\left(Z=0\mid\mathbf{s}_{q_{1}}\right)},\quad\mathbf{s}_{q_{1}}=\arg\max_{\mathbf{s}}\hat{P}\left(Z=1\mid\mathbf{s}\right),$	(63)
$\displaystyle G^{*}(\mathbf{s})$	$\displaystyle=0,\forall\mathbf{s}\in\{\mathbf{s}_{i}\}_{i=1}^{L}\setminus\{\mathbf{s}_{q_{0}},\mathbf{s}_{q_{1}}\}.$	(64)
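The monotonicity of the objective in equation 48 and the resulting weights in equations 62 and 63 can be checked numerically. The sketch below is an illustration on a hypothetical grid of candidate probabilities with $u=0.5$; `t_term`, `objective_48`, and `optimal_weights` are names introduced here for convenience.

```python
import math

def t_term(p):
    """p log p + (1 - p) log(1 - p), the form of T_0 and T_1 (with 0 * log(0) = 0)."""
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def objective_48(p0, p1, u):
    """Objective of equation 48, with p0 = P_hat(Z=0 | s_q0) and p1 = P_hat(Z=0 | s_q1)."""
    return (t_term(p0) * (u - p1) + t_term(p1) * (p0 - u)) / (p0 - p1)

def optimal_weights(p0, p1, u):
    """G*(s_q0) and G*(s_q1) from equations 62 and 63."""
    return (u - p1) / (p0 - p1), (p0 - u) / (p0 - p1)

u = 0.5
grid_hi = [k / 100 for k in range(55, 100, 5)]  # candidate values of p0 (at least u)
grid_lo = [k / 100 for k in range(5, 50, 5)]    # candidate values of p1 (at most u)

# Exhaustive search over the grid for the maximizer of equation 48.
best = max(((p0, p1) for p0 in grid_hi for p1 in grid_lo),
           key=lambda pair: objective_48(pair[0], pair[1], u))
print(best, optimal_weights(best[0], best[1], u))
```

On this grid the maximizer pairs the largest candidate $p_0$ with the smallest candidate $p_1$, and the two weights sum to one while matching the prior constraint in equation 41.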

Recall that the LP in equation 36 approximates the continuous optimization problem in equation 32 by partitioning the sample space $\mathcal{S}$ into $\{B(\mathbf{s}_{i},r)\}_{i=1}^{L}$. Hence, by shrinking the radius $r$ towards zero, we obtain the optimal solution $p^{*}(\mathbf{s})$ of equation 32 as follows:

	$\displaystyle\frac{p^{*}\left(\mathbf{s}_{q_{0}}\right)}{p^{*}\left(\mathbf{s}_{q_{1}}\right)}$	$\displaystyle=\frac{u-P\left(Z=0\mid\mathbf{s}_{q_{1}}\right)}{P\left(Z=0\mid\mathbf{s}_{q_{0}}\right)-u},\quad\mathbf{s}_{q_{0}}=\arg\max_{\mathbf{s}}P\left(Z=0\mid\mathbf{s}\right),\quad\mathbf{s}_{q_{1}}=\arg\max_{\mathbf{s}}P\left(Z=1\mid\mathbf{s}\right),$		(65)
	$\displaystyle p^{*}(\mathbf{s})$	$\displaystyle=0,\forall\mathbf{s}\in\mathcal{S}\setminus\{\mathbf{s}_{q_{0}},\mathbf{s}_{q_{1}}\}.$		(66)

Varying $u$ leads to optimal solutions of the same form, with $p^{*}(\mathbf{s})=0,\forall\mathbf{s}\in\mathcal{S}\setminus\{\mathbf{s}_{q_{0}},\mathbf{s}_{q_{1}}\}$ and $p^{*}\left(\mathbf{s}_{q_{0}}\right)>0,p^{*}\left(\mathbf{s}_{q_{1}}\right)>0$, but a different ratio $\frac{p^{*}\left(\mathbf{s}_{q_{0}}\right)}{p^{*}\left(\mathbf{s}_{q_{1}}\right)}$. Furthermore, there could exist a set $\mathcal{S}_{q_{0}}=\left\{\mathbf{s}_{q_{0}}\mid P\left(Z=0\mid\mathbf{s}_{q_{0}}\right)=\max_{\mathbf{s}\in\mathcal{S}}P\left(Z=0\mid\mathbf{s}\right)\right\}$ of features with identical $P\left(z\mid\mathbf{s}_{q_{0}}\right)$, and likewise a set $\mathcal{S}_{q_{1}}=\left\{\mathbf{s}_{q_{1}}\mid P\left(Z=1\mid\mathbf{s}_{q_{1}}\right)=\max_{\mathbf{s}\in\mathcal{S}}P\left(Z=1\mid\mathbf{s}\right)\right\}$ for $\mathbf{s}_{q_{1}}$. Hence, the optimal solution to the original optimization problem in equation 31 has the following form:

$\displaystyle p^{*}\left(\mathbf{s}\right)$	$\displaystyle=0,\forall\mathbf{s}\in\mathcal{S}\setminus\left(\mathcal{S}_{q_{0}}\bigcup\mathcal{S}_{q_{1}}\right),\text{ and }p^{*}\left(\mathbf{s}\right)>0,\forall\mathbf{s}\in\mathcal{S}_{q_{0}}\bigcup\mathcal{S}_{q_{1}},$	(67)
$\displaystyle\mathcal{S}_{q_{0}}$	$\displaystyle=\left\{\mathbf{s}_{q_{0}}\mid P\left(Z=0\mid\mathbf{s}_{q_{0}}\right)=\max_{\mathbf{s}\in\mathcal{S}}P\left(Z=0\mid\mathbf{s}\right)\right\},$	(68)
$\displaystyle\mathcal{S}_{q_{1}}$	$\displaystyle=\left\{\mathbf{s}_{q_{1}}\mid P\left(Z=1\mid\mathbf{s}_{q_{1}}\right)=\max_{\mathbf{s}\in\mathcal{S}}P\left(Z=1\mid\mathbf{s}\right)\right\}.$	(69)

Therefore, there exists a consistent bimodal query under which the labeled feature variables asymptotically follow $p^{*}\left(\mathbf{s}\right)$ (equation 67 to equation 69). This distribution maximizes the MI, and hence minimizes the negated MI, given the $P\left(z\mid\mathbf{s}\right)$ of the original two-sample testing problem.
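The two-point structure in equations 65 and 66 can be illustrated numerically. The following is a minimal sketch on a finite feature set, assuming the posteriors $P(Z=0\mid\mathbf{s})$ are known exactly; the function names and the toy posterior values are hypothetical and not part of the paper.

```python
import math

def binary_entropy(a):
    """Entropy in nats of a Bernoulli(a) label."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log(a) - (1.0 - a) * math.log(1.0 - a)

def mutual_information(weights, post0):
    """I(S;Z) for a discrete feature distribution and posteriors P(Z=0|s)."""
    u = sum(w * a for w, a in zip(weights, post0))  # induced class prior P(Z=0)
    cond = sum(w * binary_entropy(a) for w, a in zip(weights, post0))
    return binary_entropy(u) - cond

def bimodal_weights(post0, u):
    """Two-point distribution of equations 65-66 with marginal P(Z=0) = u."""
    q0 = max(range(len(post0)), key=lambda i: post0[i])  # argmax of P(Z=0|s)
    q1 = min(range(len(post0)), key=lambda i: post0[i])  # argmax of P(Z=1|s)
    a0, a1 = post0[q0], post0[q1]
    if not a1 < u < a0:
        raise ValueError("u must lie strictly between the extreme posteriors")
    weights = [0.0] * len(post0)
    # The ratio weights[q0] / weights[q1] equals (u - a1) / (a0 - u).
    weights[q0] = (u - a1) / (a0 - a1)
    weights[q1] = (a0 - u) / (a0 - a1)
    return weights

# Toy posteriors (illustrative values only); uniform weights give u = 0.5.
post0 = [0.9, 0.7, 0.5, 0.3, 0.1]
uniform = [0.2] * 5
bimodal = bimodal_weights(post0, u=0.5)
print(mutual_information(uniform, post0), mutual_information(bimodal, post0))
```

In this toy case both distributions induce the same class prior, so the comparison isolates the effect of concentrating mass on the two extreme posteriors.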

∎

Appendix C Proof of Theorem 5.10

Proof.

Testing power of the baseline case: The baseline case randomly samples features from $\mathcal{S}_{u}$ and queries their labels, so each variable pair $\left(\mathbf{S}_{n},Z_{n}\right)$ collected by the analyst follows $p\left(\mathbf{s},z\right),\forall n\in\left[N_{q}\right]$ , in which $p\left(\mathbf{s},z\right)$ is the joint distribution that characterizes the original two-sample testing problem. In addition, $Q\left(z\mid\mathbf{s}\right)$ is initialized and remains stable, and the class prior $P(Z=0)$ is provided in the case study. Given the label budget $N_{q}$ and the significance level $\alpha$ , we have the following inequality for the testing power in the case study:

$\displaystyle P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right),\quad\left(\mathbf{S}_{n},Z_{n}\right)\sim p\left(\mathbf{s},z\right).$	(70)

The inequality in equation 70 holds because sequentially comparing $W_{n}$ with $\alpha,\forall n\in\left[N_{q}\right]$ yields a testing power at least as high as comparing $W_{n}$ with $\alpha$ only at $n=N_{q}$ . We subsequently convert the RHS of equation 70 as follows,

$\displaystyle P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)=P_{1}\left(\frac{\log\left(W_{N_{q}}\right)}{N_{q}}=\frac{\sum_{i=1}^{N_{q}}\log\left(\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\right)}{N_{q}}\leq\frac{\log\left(\alpha\right)}{N_{q}}\right).$	(71)
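The relation between the sequential event and the fixed-horizon event in equation 70 can be checked by simulation. The sketch below assumes a toy setting that is not from the paper: features take finitely many values with equal probability, and the classifier $Q$ equals the true posterior. The function name and parameters are hypothetical.

```python
import math
import random

def simulate_path(n_q, alpha, prior0, post0_by_value, rng):
    """Run one baseline test; return (sequential_reject, final_reject).

    Toy assumptions: S is uniform over len(post0_by_value) values,
    Z ~ P(z|s), and Q(z|s) = P(z|s). prior0 should equal the mean of
    post0_by_value so that it is the true class prior P(Z=0).
    """
    log_alpha = math.log(alpha)
    log_w = 0.0  # running log W_n
    sequential_reject = False
    for _ in range(n_q):
        s = rng.randrange(len(post0_by_value))
        q0 = post0_by_value[s]
        z = 0 if rng.random() < q0 else 1
        p_z = prior0 if z == 0 else 1.0 - prior0
        q_z = q0 if z == 0 else 1.0 - q0
        log_w += math.log(p_z / q_z)
        if log_w <= log_alpha:  # W_n <= alpha at some n <= N_q
            sequential_reject = True
    final_reject = log_w <= log_alpha  # W_{N_q} <= alpha
    return sequential_reject, final_reject

rng = random.Random(0)
runs = [simulate_path(30, 0.05, 0.5, [0.8, 0.2], rng) for _ in range(2000)]
seq_power = sum(s for s, _ in runs) / len(runs)
final_power = sum(f for _, f in runs) / len(runs)
print(seq_power, final_power)
```

Every path that rejects at $n=N_{q}$ also rejects sequentially, so the empirical sequential power can never fall below the fixed-horizon power.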

Since $\left\{\left(\mathbf{S}_{i},Z_{i}\right)\right\}_{i=1}^{N_{q}}$ is an i.i.d. sequence, we drop the index $i$ in $\left(\mathbf{S}_{i},Z_{i}\right)$ and analyze $\mathbb{E}\left[\log\frac{P(Z)}{Q\left(Z\mid\mathbf{S}\right)}\right]$ and $\mathrm{Var}\left[\log\frac{P(Z)}{Q\left(Z\mid\mathbf{S}\right)}\right]$ for $\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)$ in the following,

$\displaystyle\mathbb{E}\left[\log\frac{P(Z)}{Q\left(Z\mid\mathbf{S}\right)}\right]$	$\displaystyle=\mathbb{E}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}+\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right]$	(72)
	$\displaystyle=-I\left(\mathbf{S};Z\right)+D_{\text{KL}}\left(P\left(z\mid\mathbf{s}\right)\|Q\left(z\mid\mathbf{s}\right)\right)$	(73)
	$\displaystyle\leq-I\left(\mathbf{S};Z\right)+\sqrt{\epsilon_{1}};$	(74)

$\displaystyle\mathrm{Var}\left[\log\frac{P(Z)}{Q\left(Z\mid\mathbf{S}\right)}\right]$	$\displaystyle=\mathrm{Var}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}+\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right]$	(75)
	$\displaystyle\leq\mathrm{Var}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}\right]+\mathrm{Var}\left[\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right]+2\sqrt{\mathrm{Var}\left[\log\frac{P(Z)}{P\left(Z\mid\mathbf{S}\right)}\right]\mathrm{Var}\left[\log\frac{P\left(Z\mid\mathbf{S}\right)}{Q\left(Z\mid\mathbf{S}\right)}\right]}$	(76)
	$\displaystyle\leq\sigma^{2}+\epsilon_{1}+2\sigma\sqrt{\epsilon_{1}}.$	(77)

The inequalities in equation 74 and equation 77 are results of the following facts: $\epsilon_{1}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(q\left(\mathbf{s},z\right)\|p\left(\mathbf{s},z\right)\mid A\right)$ and $\sigma^{2}=\max\left\{\max_{A\in\mathcal{P}}\text{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\mid A\right)}\bar{I}(\mathbf{S};Z),\text{Var}_{\left(\mathbf{S},Z\right)\sim p\left(\mathbf{s},z\right)}\bar{I}(\mathbf{S};Z)\right\}$ over the partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$ .
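The decomposition of the expected log-ratio into the negated MI plus a KL term (equations 72 and 73) can be verified exactly on a small discrete example. This is a minimal sketch assuming binary labels and a finite feature set; the function names and all numerical values are hypothetical.

```python
import math

def _terms(p_s, true_post0, model_post0):
    """Yield (joint prob, P(z|s), Q(z|s), P(z)) over all (s, z) pairs."""
    prior0 = sum(w * a for w, a in zip(p_s, true_post0))  # P(Z=0)
    for w, a, b in zip(p_s, true_post0, model_post0):
        yield w * a, a, b, prior0                            # z = 0
        yield w * (1.0 - a), 1.0 - a, 1.0 - b, 1.0 - prior0  # z = 1

def expected_log_ratio(p_s, true_post0, model_post0):
    """E[log P(Z)/Q(Z|S)] under p(s, z) = p(s) P(z|s)."""
    return sum(j * math.log(pz / q)
               for j, p, q, pz in _terms(p_s, true_post0, model_post0))

def mi_and_kl(p_s, true_post0, model_post0):
    """Return I(S;Z) and the conditional KL divergence D_KL(P(z|s) || Q(z|s))."""
    mi = sum(j * math.log(p / pz)
             for j, p, q, pz in _terms(p_s, true_post0, model_post0))
    kl = sum(j * math.log(p / q)
             for j, p, q, pz in _terms(p_s, true_post0, model_post0))
    return mi, kl

# Illustrative values only; all probabilities must lie strictly in (0, 1).
p_s = [0.5, 0.3, 0.2]
P0 = [0.9, 0.4, 0.2]   # true P(Z=0|s)
Q0 = [0.8, 0.5, 0.3]   # classifier Q(Z=0|s)
lhs = expected_log_ratio(p_s, P0, Q0)
mi, kl = mi_and_kl(p_s, P0, Q0)
print(lhs, -mi + kl)
```

The two printed numbers agree up to floating-point error, and the KL term vanishes when $Q$ matches $P$, leaving exactly the negated MI.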

In equation 71, $\frac{\log\left(W_{N_{q}}\right)}{N_{q}}=\frac{\sum_{i=1}^{N_{q}}\log\left(\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\right)}{N_{q}}$ is the sample mean of $\left\{\log\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\right\}_{i=1}^{N_{q}}$ ; hence we use the central limit theorem to approximate the distribution of $\frac{\log\left(W_{N_{q}}\right)}{N_{q}}$ , leading to the following,

	$\displaystyle P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)$	$\displaystyle\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)$		(78)
		$\displaystyle\eqsim\Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_{q}}}+\sqrt{N_{q}}\left(I\left(\mathbf{S};Z\right)-\sqrt{\epsilon_{1}}\right)}{\left(\sigma^{2}+\sqrt{\epsilon_{1}}+2\sqrt{\epsilon_{1}}\sigma\right)^{\frac{1}{2}}}\right).$		(79)
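To get a feel for how the approximation in equation 79 behaves, it can be evaluated directly. The sketch below transcribes the expression as written in equation 79, using only the standard library; the function names and the parameter values are hypothetical and chosen purely for illustration.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def baseline_power_approx(alpha, n_q, mi, eps1, sigma):
    """Evaluate the right-hand side of equation 79.

    alpha: significance level; n_q: label budget N_q;
    mi: I(S;Z); eps1, sigma: the quantities epsilon_1 and sigma.
    The denominator must be positive (sigma > 0 or eps1 > 0).
    """
    root_eps1 = math.sqrt(eps1)
    numerator = (math.log(alpha) / math.sqrt(n_q)
                 + math.sqrt(n_q) * (mi - root_eps1))
    denominator = math.sqrt(sigma ** 2 + root_eps1 + 2.0 * root_eps1 * sigma)
    return std_normal_cdf(numerator / denominator)

# Illustrative values only.
for n_q in (50, 100, 400):
    print(n_q, baseline_power_approx(0.05, n_q, mi=0.1, eps1=1e-4, sigma=0.5))
```

When $I(\mathbf{S};Z)>\sqrt{\epsilon_{1}}$, the argument of $\Phi$ grows with $N_{q}$, so the approximate lower bound increases with the label budget.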

Testing power of the proposed framework in the case study: The analyst selects a region $A^{*}$ from a partition $\mathcal{P}=\left\{A_{i}\right\}_{i=1}^{m}$ , in which $A^{*}$ is predicted to have the highest $I\left(\mathbf{S};Z\mid A^{*}\right)$ ; then the analyst conducts the sequential testing with $\left(\mathbf{S}_{n},Z_{n}\right)$ generated i.i.d. from $p\left(\mathbf{s},z\mid A^{*}\right)$ . We first quantify $I\left(\mathbf{S};Z\mid A^{*}\right)$ . Recall that the approximated MI $\left\{\hat{I}\left(\mathbf{S};Z\mid A_{i}\right)\right\}_{i=1}^{m}$ used to find $A^{*}\in\mathcal{P}$ is provided in equation 16 in the case study; given Assumption 5.9, the discrepancy between the true and approximate MI for any $A\in\mathcal{P}$ is as follows

$\displaystyle I\left(\mathbf{S};Z\mid A\right)-\hat{I}\left(\mathbf{S};Z\mid A\right)=\mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right].$	(80)

Furthermore, given $\epsilon_{2}=\max_{A\in\mathcal{P}}D_{\text{KL}^{2}}\left(p\left(\mathbf{s},z\right)\|q\left(\mathbf{s},z\right)\mid A\right)$ over the partition $\mathcal{P}=\{A_{1},\cdots,A_{m}\}$ , we evaluate the upper bound of equation 80 for any $A\in\mathcal{P}$ in the following,

	$\displaystyle\mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right]$	(81)
$\displaystyle\leq$	$\displaystyle\mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right]$	(82)
$\displaystyle=$	$\displaystyle D_{\text{KL}}\left(Q\left(z\mid\mathbf{s}\right)\|P\left(z\mid\mathbf{s}\right)\mid A\right)$	(83)
$\displaystyle\leq$	$\displaystyle\sqrt{\epsilon_{2}}.$	(84)

Similarly, we evaluate the lower bound of equation 80 for any $A\in\mathcal{P}$ in the following,

	$\displaystyle\mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim Q\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right]$	(85)
$\displaystyle\geq$	$\displaystyle\mathbb{E}_{\mathbf{S}\sim p\left(\mathbf{s}\mid A\right)}\left[\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log Q\left(Z\mid\mathbf{S}\right)\right]-\mathbb{E}_{Z\sim P\left(z\mid\mathbf{S}\right)}\left[\log P\left(Z\mid\mathbf{S}\right)\right]\right]$	(86)
$\displaystyle=$	$\displaystyle-D_{\text{KL}}\left(P\left(z\mid\mathbf{s}\right)\|Q\left(z\mid\mathbf{s}\right)\mid A\right)$	(87)
$\displaystyle\geq$	$\displaystyle-\sqrt{\epsilon_{1}}.$	(88)

Assumption 5.8 suggests that the maximum MI over $\mathcal{P}$ is $I\left(\mathbf{S};Z\right)+\Delta$ . Combining equation 84 and equation 88, we get the lower bound of $I\left(\mathbf{S};Z\mid A^{*}\right)$ as follows,

$\displaystyle I\left(\mathbf{S};Z\mid A^{*}\right)\geq I\left(\mathbf{S};Z\right)+\Delta-\left(\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}\right).$	(89)

The analyst conducts the sequential testing in the selected $A^{*}$ with sample features randomly sampled from $A^{*}\bigcap\mathcal{S}_{u}$ and labeled, leading to the following testing power lower bound

$\displaystyle P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right),\quad\left(\mathbf{S}_{n},Z_{n}\right)\sim p\left(\mathbf{s},z\mid A^{*}\right).$	(90)

The quantification of the RHS in equation 90 is identical to the one in the baseline case, except the sample space is constrained to $A^{*}$ . Hence, we skip the derivation process and obtain the following result,

	$\displaystyle P_{1}\left(\exists n\in\left[N_{q}\right],W_{n}=\prod_{i=1}^{n}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)$	$\displaystyle\geq P_{1}\left(W_{N_{q}}=\prod_{i=1}^{N_{q}}\frac{P(Z_{i})}{Q\left(Z_{i}\mid\mathbf{S}_{i}\right)}\leq\alpha\right)$		(91)
		$\displaystyle\eqsim\Phi\left(\frac{\frac{\log\alpha}{\sqrt{N_{q}}}+\sqrt{N_{q}}\left(I\left(\mathbf{S};Z\right)+\Delta-2\sqrt{\epsilon_{1}}-\sqrt{\epsilon_{2}}\right)}{\left(\sigma^{2}+\sqrt{\epsilon_{1}}+2\sqrt{\epsilon_{1}}\sigma\right)^{\frac{1}{2}}}\right).$		(92)
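Because the expressions in equation 79 and equation 92 share the same denominator, the active bound exceeds the baseline bound exactly when $\Delta>\sqrt{\epsilon_{1}}+\sqrt{\epsilon_{2}}$. The sketch below evaluates both expressions as written to illustrate this threshold; the function names and parameter values are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_approx(alpha, n_q, drift, eps1, sigma):
    """Phi((log(alpha)/sqrt(N_q) + sqrt(N_q) * drift) / denominator).

    `drift` is the bracketed MI term in the numerator of equation 79 or 92;
    the denominator is the one shared by both equations.
    """
    root_eps1 = math.sqrt(eps1)
    denominator = math.sqrt(sigma ** 2 + root_eps1 + 2.0 * root_eps1 * sigma)
    numerator = math.log(alpha) / math.sqrt(n_q) + math.sqrt(n_q) * drift
    return std_normal_cdf(numerator / denominator)

def baseline_and_active(alpha, n_q, mi, delta, eps1, eps2, sigma):
    """Return the approximations of equation 79 (baseline) and 92 (active)."""
    baseline = power_approx(alpha, n_q, mi - math.sqrt(eps1), eps1, sigma)
    active_drift = mi + delta - 2.0 * math.sqrt(eps1) - math.sqrt(eps2)
    active = power_approx(alpha, n_q, active_drift, eps1, sigma)
    return baseline, active

# Illustrative values only: sqrt(eps1) + sqrt(eps2) = 0.02 is the threshold.
for delta in (0.01, 0.05):
    print(delta, baseline_and_active(0.05, 100, 0.1, delta, 1e-4, 1e-4, 0.5))
```

A gain $\Delta$ below the combined approximation error can leave the active bound under the baseline bound, which is why the quality of $Q$ matters for the guarantee.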

∎
