Active Sequential Two-Sample Testing
Abstract
A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first active sequential two-sample testing framework that queries labels not only sequentially but also actively. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabeled) features have a high dependency on labels; labeling the “high-dependency” features increases the power of the proposed testing framework. In theory, we prove that our framework produces an anytime-valid p-value. In addition, we characterize the proposed framework’s gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.
1 Introduction
The two-sample test is a statistical hypothesis test applied to data samples (or measurements) from two distributions. The goal is to test whether the data supports the hypothesis that the distributions are different. If we consider each data point as a pair of a feature and a label (where the label indicates which distribution the data point comes from), then the two-sample test is equivalent to the problem of testing the dependence between the features and the labels. Viewed through this lens, the null hypothesis for the two-sample test states that the feature and label variables are independent, and the alternative hypothesis states the opposite. The analyst performing the two-sample test needs to decide between the null and the alternative hypotheses with data from the two distributions.
The analyst typically knows little about the difficulty of a two-sample testing problem before running the test. Fixing the sample size a priori may result in a test that needs to collect additional evidence to arrive at a final decision (if the problem is hard) or in an inefficient test with over-collected data (if the problem is simple). To address this dichotomy, the research community has proposed sequential two-sample tests (Wald, 1992; Lhéritier & Cazals, 2018; Hajnal, 1961; Shekhar & Ramdas, 2021; Balsubramani & Ramdas, 2015) that allow the analyst to sequentially collect data and monitor statistical evidence, i.e., a statistic is computed from the data. The test can stop anytime when sufficient evidence has been accumulated to make a decision.
Existing sequential two-sample tests (Wald, 1992; Lhéritier & Cazals, 2018; Hajnal, 1961; Shekhar & Ramdas, 2021; Balsubramani & Ramdas, 2015) are devised to collect both sample features and sample labels simultaneously. In this paper, we consider the problem of sequential two-sample testing in a novel and practical setting where the cost of obtaining sample labels is high, but accessing sample features is inexpensive. As a result, the analyst can obtain a large collection of sample features without labels; she will need to sequentially query the labels of the sample features in the collection to perform the two-sample test while ensuring that the query complexity (i.e., the number of queried labels) does not exceed a label budget. A motivation for this formulation comes from the field of digital health: physicians seek inexpensive digital measurements (e.g., gait, speech, typing speed measured using a patient’s smartphone) to replace traditional biomarkers (e.g., the amyloid buildup that indicates Alzheimer’s progression), which are often costly to access; hence they need to validate the dependency between the digital measurements (feature variables) and traditional biomarkers (label variables). While validation studies can access large registries to collect digital measurements remotely at scale, there is a fixed label budget for the expensive biomarker measures. An efficient sequential design would reveal the dependency between the features and the labels using only a reasonable label budget.
In this paper, we propose the active sequential testing framework shown in Figure 1. The framework initializes a classifier to model the probabilities of sample labels given features using an initial random sample; next, depending on the classifier’s outputs, the framework queries the labels of features predicted to have a high dependency on the labels and constructs a test statistic. The framework rejects the null if the statistic is smaller than a pre-defined significance level; otherwise, the framework either stops and retains the null if the label budget has run out, or re-enters the label query and decision-making steps, enabling a sequential testing process.
![Figure 1: The proposed active sequential two-sample testing framework](x1.png)
The test statistic in the framework is the ratio between the likelihood constructed under the null, in which the feature and label variables are independent, and the likelihood constructed under the alternative, in which a dependency between the feature and label variables exists. Such a likelihood ratio two-sample test statistic was first proposed in (Lhéritier & Cazals, 2018) to develop a non-active sequential two-sample test capable of controlling the Type I error (i.e., the probability of deciding for the alternative when the null is true). We adapt the original test statistic by replacing the pre-defined label prior with a maximum likelihood estimate, to suit our setting in which the label prior is unknown. More importantly, our framework actively labels the features that are predicted to have a high dependency on labels. We characterize the benefits of the active query over the random query by the change of mutual information between feature and label variables in the asymptotic and finite-sample scenarios. In practice, we suggest using an active query scheme called the bimodal query, proposed in (Li et al., 2022), which labels the samples with the highest class-one or class-zero probabilities.
We summarize the main contributions of our work as follows:
- We introduce the first active sequential two-sample testing framework. We prove that the proposed framework produces an anytime-valid p-value to achieve Type I error control. Furthermore, we provide an information-theoretic interpretation of the proposed framework. We prove that, asymptotically, the framework is capable of generating the largest mutual information (MI) between feature and label variables under standard conditions (Györfi et al., 2002); we also analyze, through MI, the gain in testing power of the proposed framework over its passive-query counterpart in the finite-sample scenario.
- We instantiate the framework using the bimodal query (Li et al., 2022) (i.e., querying the labels of the samples that have the highest class-one or class-zero probabilities) as the label query scheme. We perform extensive experiments on synthetic data, MNIST, and an application-specific Alzheimer’s disease dataset to demonstrate the effectiveness of the instantiated framework. Our proposed test exhibits a significant reduction of the Type II error using fewer labeled samples compared with a non-active sequential testing baseline.
2 Related Works
The author of (Student, 1908) developed the t-test, probably the simplest form of a two-sample test, which compares the mean difference of two samples of univariate data. Since then, the research community has expanded the two-sample test to many other forms, e.g., the Hotelling test (Hotelling, 1992), the Friedman-Rafsky test (Friedman & Rafsky, 1979), the kernel two-sample test (Gretton et al., 2012), and the classifier two-sample test (Lopez-Paz & Oquab, 2016) for the multivariate case. These tests are constructed with various statistics, including the Mahalanobis distance, a measurement over a graph, a kernel embedding, or classifier accuracy, all in service of increasing testing power while controlling the Type I error. In particular, (Friedman & Rafsky, 1979; Gretton et al., 2012; Lopez-Paz & Oquab, 2016) test whether the data from two samples are distributionally different, which generalizes the Hotelling test and the t-test (Student, 1908; Hotelling, 1992) that only detect the mean difference of two samples. These two-sample tests are batch tests that are used with a fixed sample size: when the collection of experimental data ends, an analyst performs the two-sample test on the data and makes a decision; she is not allowed to continue to collect and incorporate more data into the test after a decision is made, as that would inflate the Type I error.
In contrast to the batch two-sample tests, the research community has developed a class of sequential two-sample tests (Lhéritier & Cazals, 2018; Shekhar & Ramdas, 2021; Pandeva et al., 2022) that allow the analyst to sequentially collect data and perform the two-sample test, enabling sequential decision-making. These sequential tests correct the inflated Type I error that would occur in the batch test, using statistical techniques such as the Bonferroni correction (Dunn, 1961) and Ville’s maximal inequality (Doob, 1939).
Several works also consider the active setting in two-sample testing. The authors of (Li et al., 2022) proposed a batch two-sample test combined with active learning for the case where curated labeled data is unavailable and querying the data labels is expensive. Several studies have also developed active sequential hypothesis tests (Naghshvar & Javidi, 2013; Chernoff, 1959; Bessler, 1960; Blot & Meeter, 1973; Keener, 1984; Kiefer & Sacks, 1963). However, these tests require a clear parametric description of the statistical models of the hypotheses. The authors of (Duan et al., 2022) developed an interactive rank test, which is distribution-free and can similarly perform sequential two-sample testing in the active learning setting.
The work proposed herein uses the label query scheme in (Li et al., 2022) to develop the first multivariate non-parametric sequential test for the active learning setting, with a novel test statistic and theoretical results. We demonstrate that the test controls the Type I error via Ville’s maximal inequality (see Theorem 5.1). Ville’s maximal inequality results in higher testing power than the Bonferroni correction for sequential testing (Shekhar & Ramdas, 2021; Ramdas et al., 2022).
While our framework in Figure 1 employs the label query scheme introduced in (Li et al., 2022), it offers distinct advantages over (Li et al., 2022):
- Our proposed framework follows a sequential design. Upon accumulating sufficient evidence to reject the null hypothesis, our design automatically stops label collection before exhausting the label budget. In contrast, the batch design in (Li et al., 2022) invariably exhausts the label budget.
- Utilizing a different test statistic, our framework enables finite-sample analysis, which is not provided in (Li et al., 2022).
3 Problem Statement and Preliminaries
3.1 Notations
We use a pair of random variables to denote a feature and its label variables whose realization is . The variable pair admits a joint distribution . Furthermore, we write to denote the support of . Formally, a two-sample testing problem consists of a null hypothesis that states and an alternative hypothesis that states . An analyst collects a sequence of realizations of to test against . The problem is equivalent to testing the independence between the feature and label variables. Therefore, we equivalently restate the hypothesis test as follows:
(1) |
Moving forward, we omit the subscripts in , and and write them as , and . In addition, we use , and to denote sequences of samples , and respectively. We use similar notation throughout the paper.
3.2 The problem
In the typical setting of a sequential two-sample test, an analyst does not have prior knowledge of sample features. The analyst sequentially collects both sample features and their labels simultaneously with the corresponding random variable pair i.i.d. generated from a data-generating process, i.e., . We consider a variant of the setting in which accessing sample features is free/inexpensive. Consequently, the analyst collects a large set of sample features before performing a sequential test. However, accessing the label of a feature in is costly. We assume the following fact throughout the paper: The already-collected is the result of a sample feature collection process where all are realizations of random variables i.i.d. generated from . There exists an oracle to return a label of with the corresponding random variable and admitting the posterior probability . We consider the following new sequential two-sample testing problem:
An analyst actively labeling may result in non-i.i.d pairs of ; hence the distribution of is shifted away from . In contrast, an analyst passively (or randomly) labeling maintains .
3.3 Evaluation metrics for the problem
In the following, we introduce the evaluation metrics used throughout the paper.
- Type I error: the probability of rejecting the null when the null is true.
- Type II error: the probability of failing to reject the null when the alternative is true.
- Testing power: the probability of rejecting the null when the alternative is true; in other words, one minus the Type II error.
Testing power and Type II error are used interchangeably in the methodology and experiment sections (Sections 5 and 6).
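As a minimal sketch of how these metrics can be estimated empirically, assume that each simulated run of a test yields a Boolean decision (reject or not). The function name `empirical_error_rates` and the toy decision arrays below are hypothetical and serve only as an illustration.

```python
import numpy as np

def empirical_error_rates(rejected_under_null, rejected_under_alt):
    """Estimate Type I error, Type II error and power from Boolean reject decisions."""
    rejected_under_null = np.asarray(rejected_under_null, dtype=bool)
    rejected_under_alt = np.asarray(rejected_under_alt, dtype=bool)
    # Type I error: fraction of runs simulated under the null that were rejected.
    type_one = rejected_under_null.mean()
    # Power: fraction of runs simulated under the alternative that were rejected.
    power = rejected_under_alt.mean()
    # Type II error: fraction of runs under the alternative that were NOT rejected.
    type_two = 1.0 - power
    return type_one, type_two, power

# Toy decisions (hypothetical): four runs under the null, four under the alternative.
type_one, type_two, power = empirical_error_rates(
    [False, False, True, False], [True, True, False, True]
)
```

Because power and Type II error sum to one, reporting either of them is sufficient.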
3.4 Attributes of an active two-sample test
As is standard in the two-sample testing literature (Johari et al., 2022; Wald, 1992; Lhéritier & Cazals, 2018; Shekhar & Ramdas, 2021; Welch, 1990), a conventional procedure for sequential two-sample testing is to compute a p-value from the sequentially observed samples and compare it to a pre-defined significance level at any time. The analyst rejects the null and stops the test as soon as the p-value falls below the significance level. For more details, see (Wasserstein & Lazar, 2016). In addition, because the test proposed in what follows uses active querying to reduce the number of label queries, the active sequential test is expected to spend fewer labels than a passive (random-query) test to reject the null when the alternative is true. In summary, an active sequential two-sample test has the following four attributes:
- The test generates an anytime-valid p-value, so that the Type I error is upper-bounded by the significance level at any time during the sequential testing process.
- The test has a high testing power.
- The test is consistent: under the alternative, the testing power goes to one as the test sample size goes to infinity.
- The test has higher testing power than the passive test given the same label budget.
4 A Sequential Two-Sample Testing Statistic
We follow the well-known likelihood ratio test (Wilks, 1938) to construct a sequential testing statistic. We use the statistical models that characterize the label generation processes conditional on the observed sample features under and . More precisely, under , we have ; that is, when and are independent, the posterior probability is the same for any in the support of . In contrast, under , we have the following statistical model: . We sequentially collect sample data , and when a new observation arrives, we construct a likelihood ratio : With , to assess against .
The statistical models under the null and the alternative are unknown. To formulate our two-sample test, we replace the null model, i.e., the product of the class priors, with a likelihood estimate that is maximized over all class priors. In addition, we build a class-probability predictor from the previously observed sample sequence to model the posterior probability of the label given a newly observed feature; any probabilistic classifier, such as a neural network or logistic regression, can be used to build the predictor. The predictor at the first step is an initialized class-probability predictor. (It is possible to initialize it as a random-guess class-probability predictor and then sequentially gather labeled samples for training; however, this would hurt the testing power. As suggested by Duan et al. (2022) and Lhéritier & Cazals (2018), we initialize the predictor with a small set of randomly labeled samples and start the sequential testing after that.) We formally present our sequential testing statistic in the following:
We accordingly use to indicate a random variable of which is a realization. Our test statistic in equation 2 is a generalization of the test statistic proposed in (Lhéritier & Cazals, 2018). In contrast to that work, our test statistic does not require the class prior to be known. The analyst compares the statistic with the significance level at every step, stopping the test at the first step where the statistic falls below the significance level. As a result, a small statistic is favored under the alternative, since it leads to rejecting the null and thus to higher testing power.
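The following is a minimal sketch of how such a statistic could be computed. Since equation 2 is not reproduced here, the exact form is an assumption based on the description above: the numerator is the label likelihood maximized over the class prior, and the denominator is the product of the probabilities that the class-probability predictor assigned to each observed label before seeing it. The name `log_statistic` and the toy inputs are hypothetical.

```python
import numpy as np

def log_statistic(labels, predicted_prob_one):
    """Log of (null likelihood maximized over the class prior) / (predictor likelihood).

    predicted_prob_one[i] is assumed to be the predicted probability of label 1 for
    sample i, produced by a predictor trained only on samples observed before i.
    """
    z = np.asarray(labels, dtype=float)
    q1 = np.clip(np.asarray(predicted_prob_one, dtype=float), 1e-12, 1 - 1e-12)
    n = z.size
    n1 = z.sum()
    n0 = n - n1
    # Maximum-likelihood class prior is n1 / n; terms with zero count contribute 0.
    log_null = 0.0
    if n1 > 0:
        log_null += n1 * np.log(n1 / n)
    if n0 > 0:
        log_null += n0 * np.log(n0 / n)
    # Log-likelihood of the observed labels under the predictor.
    log_alt = np.sum(z * np.log(q1) + (1 - z) * np.log(1 - q1))
    return float(log_null - log_alt)

# An informative predictor drives the statistic down; a coin-flip predictor does not.
informative = np.exp(log_statistic([1, 0, 1, 0], [0.9, 0.1, 0.9, 0.1]))
uninformative = np.exp(log_statistic([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

Working in the log domain avoids numerical underflow when many labels have been queried.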
5 Active Sequential Two-Sample Testing
This section introduces the active sequential two-sample testing framework and its instantiation. We demonstrate that the framework produces an anytime-valid p-value regardless of the selected query scheme. We also analyze the asymptotic and finite-sample performance of the framework, with the gain in testing power measured by the change of the mutual information between feature and label variables.
5.1 An active sequential two-sample testing framework
A flow chart of the proposed framework is shown in Figure 1. Our framework starts by initializing the class-probability predictor with a small set of sample features that are randomly selected from the unlabeled set and then labeled. Then, the framework enters the sequential testing stage, which iteratively performs the following steps: select the features in the unlabeled set that the predictor expects to have a high dependency on their labels, update the statistic, decide whether the null can be rejected, and update the predictor if the test has not stopped. We formally introduce our active sequential two-sample testing framework as follows.
Framework instantiation: We provide a framework instantiation called bimodal-query-based active sequential two-sample testing (BQ-AST), described in Algorithm 1. The algorithm takes the following inputs: an unlabeled feature set, a probabilistic classification algorithm, the size of the initialization set used to train the initial class-probability predictor, a label budget, and a significance level. The algorithm first initializes a class-probability predictor using a small set of randomly labeled samples. In the sequential testing stage, the algorithm uses the bimodal query from Li et al. (2022) to select, from the unlabeled set, the feature with the highest predicted posterior for either class (e.g., with a fair chance of selecting the highest class-one or the highest class-zero posterior), queries its label, and updates the statistic. Next, the algorithm compares the statistic with the significance level; if the null is not rejected, it updates the predictor with the newly labeled sample and re-enters the label query. The algorithm rejects the null if the statistic is smaller than the significance level, or fails to reject the null if the label budget is exhausted.
The label budget in Algorithm 1 contains the labels for both initializing and constructing the statistic . In what follows in this section, we simply use to denote the “label budget” allowed to be used after the initialization.
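The procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: a Laplace-smoothed k-nearest-neighbor estimate stands in for the probabilistic classifier, the statistic is assumed to be the prior-maximized label likelihood divided by the predictor likelihood, and the names `knn_prob_one`, `bq_ast`, `oracle`, and the toy Gaussian data are all hypothetical.

```python
import numpy as np

def knn_prob_one(train_x, train_z, query_x, k=5):
    """Laplace-smoothed k-NN estimate of P(label = 1 | feature)."""
    dist = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    k = min(k, train_x.shape[0])
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = train_z[nearest].sum(axis=1)
    return (votes + 1.0) / (k + 2.0)  # smoothing keeps probabilities inside (0, 1)

def bq_ast(features, oracle, n_init=20, budget=200, alpha=0.05, k=5, seed=0):
    """Return (rejected, labels_used); the budget includes the initialization labels."""
    rng = np.random.default_rng(seed)
    init = [int(i) for i in rng.choice(len(features), size=n_init, replace=False)]
    init_set = set(init)
    pool = [i for i in range(len(features)) if i not in init_set]
    labeled_idx = list(init)
    labels = [oracle(i) for i in init]  # random labels used only to train the predictor
    n0 = n1 = 0
    log_alt = 0.0
    used = n_init
    while used < budget and pool:
        q1 = knn_prob_one(features[labeled_idx], np.array(labels), features[pool], k)
        # Bimodal query: fair coin between the most confident class-one
        # and the most confident class-zero candidate.
        j = int(np.argmax(q1)) if rng.random() < 0.5 else int(np.argmin(q1))
        i = pool.pop(j)
        z = oracle(i)
        used += 1
        # The prediction was made before the label was revealed.
        log_alt += np.log(q1[j] if z == 1 else 1.0 - q1[j])
        n1 += z
        n0 += 1 - z
        n = n0 + n1
        # Null log-likelihood maximized over the class prior.
        log_null = sum(c * np.log(c / n) for c in (n0, n1) if c > 0)
        if log_null - log_alt < np.log(alpha):
            return True, used
        labeled_idx.append(i)
        labels.append(z)
    return False, used

# Toy usage (hypothetical data): two well-separated 1-D Gaussians with hidden labels.
rng = np.random.default_rng(1)
z_true = rng.integers(0, 2, size=400)
features = (rng.normal(size=400) + 4.0 * z_true).reshape(-1, 1)
rejected, labels_used = bq_ast(features, lambda i: int(z_true[i]))
```

Replacing the bimodal selection of `j` with a uniformly random index from the pool gives a passive (random-query) counterpart for comparison.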
5.2 The proposed framework results in an anytime-valid p-value
Our framework rejects the null once the statistic falls below the significance level. The following theorem states that, under the null, the statistic is an anytime-valid p-value.
Theorem 5.1.
If an analyst uses the proposed framework to sequentially query the oracle for with resulting in , then we have the following under ,
(4) |
where is a label budget and is the pre-specified significance level.
Theorem 5.1 implies that the probability (or Type I error) that our framework mistakenly rejects the null is upper-bounded by the significance level. Briefly, we prove this by observing that the sequence is upper-bounded by a martingale, and hence we use Ville’s maximal inequality (Durrett, 2019; Doob, 1939) to develop Theorem 5.1. See the Appendix for the complete proof.
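Ville’s maximal inequality states that a nonnegative supermartingale starting at one ever reaches the level 1/α with probability at most α. The simulation below is a simplified analogue and not the proof in the Appendix: it assumes labels drawn from a fair coin under the null and a fixed alternative model `q_one`, so that the likelihood ratio is a martingale, and it estimates how often the ratio ever crosses 1/α. The function name and all parameter values are hypothetical.

```python
import numpy as np

def ville_crossing_rate(alpha=0.05, horizon=200, runs=2000, q_one=0.7, seed=0):
    """Fraction of null runs in which the likelihood-ratio martingale reaches 1/alpha."""
    rng = np.random.default_rng(seed)
    # Labels drawn under the null with class prior 0.5.
    z = rng.random((runs, horizon)) < 0.5
    # Log increments of the likelihood ratio q(z) / p(z), whose null expectation is 1.
    log_inc = np.where(z, np.log(q_one / 0.5), np.log((1.0 - q_one) / 0.5))
    log_m = np.cumsum(log_inc, axis=1)
    # A run "rejects" if the martingale reaches 1/alpha at any step within the horizon.
    crossed = log_m.max(axis=1) >= np.log(1.0 / alpha)
    return float(crossed.mean())

rate = ville_crossing_rate()
```

Up to simulation noise, the estimated crossing rate should not exceed α, even though the process is monitored at every step rather than at a single fixed sample size.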
5.3 Asymptotic properties of the proposed framework
This section provides the theoretical conditions under which the proposed framework asymptotically generates the smallest normalized statistic (normalization of the statistic in equation 3), or equivalently, maximally increases the mutual information between the feature and label variables. Before that, we first define the consistent bimodal query as follows.
Definition 5.2.
(Consistent bimodal query) Let be the support of that sample features are collected from and added to an unlabeled set , and let denote the posterior probability of given . An analyst adopts a label query scheme, for every , to query the label of such that admits a probability density function (PDF) . The label query scheme is a consistent bimodal query if where
(5) | ||||
(6) | ||||
(7) |
Remark 5.3.
Definition 5.2 considers a label query scheme that, in the limit, only queries the labels of the features with the highest class-one or class-zero posterior probabilities. As the true posterior is not directly available, to construct the consistent bimodal query, one can use nonparametric regressors to construct a class-probability predictor as a nonparametric estimate of the posterior, and implement the bimodal query to label the features with the highest predicted class-one or class-zero probability after the predictor converges to the true posterior. The authors of (Györfi et al., 2002) prove that when the predictor is a kernel, KNN, or partition estimate with proper smoothing parameters (e.g., the bandwidth for the kernel) and labels are sufficiently revealed in the proximity of the queried feature, the predictor converges to the true posterior.
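As an illustration of the kind of nonparametric regressor mentioned in this remark, the sketch below implements a Gaussian-kernel (Nadaraya-Watson) estimate of the class-one posterior for one-dimensional features. It is a minimal sketch under assumptions: the function name `kernel_posterior`, the bandwidth value, and the toy data with a deterministic label rule are hypothetical.

```python
import numpy as np

def kernel_posterior(train_x, train_z, query_x, bandwidth=0.3):
    """Nadaraya-Watson estimate of P(label = 1 | feature) with a Gaussian kernel."""
    train_x = np.asarray(train_x, dtype=float)
    train_z = np.asarray(train_z, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    # Pairwise differences between query points and labeled points (1-D features).
    diff = query_x[:, None] - train_x[None, :]
    weights = np.exp(-0.5 * (diff / bandwidth) ** 2)
    # Kernel-weighted average of the observed labels.
    return weights @ train_z / weights.sum(axis=1)

# Toy data: the label is 1 exactly when the feature is positive.
x = np.linspace(-3.0, 3.0, 61)
z = (x > 0).astype(float)
estimates = kernel_posterior(x, z, [-2.0, 2.0])
```

A bimodal query would then label the unlabeled features with the largest and smallest estimated posteriors; the bandwidth plays the role of the smoothing parameter discussed above.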
To this end, we introduce the asymptotic property of our framework. We consider normalizing the test statistic in equation 3 as follows,
(8) |
where denotes a feature-label pair returned by a label query scheme when querying the -th label. Next, we state the following theorem.
Theorem 5.4.
Let be the support of that sample features are collected from and added to an unlabeled set , and let denote the posterior probability of given . There exists a consistent bimodal query scheme; when an analyst uses such a scheme in the proposed active sequential framework, then, under , converges to the negation of mutual information (MI), and the converged negated MI lower-bounds the negated MI generated by any subject to . Precisely, there exists a consistent bimodal query leading to the following
(9) |
is the MI constructed with (See equation 5 for ); is MI constructed with .
Recall that the null is rejected when the test statistic in equation 3 is smaller than the significance level; hence, the proposed framework, when used with a consistent bimodal query to asymptotically minimize the normalized statistic, favorably increases the testing power when the number of queried labels is large and the class-probability predictor is close to the true posterior. In Section 5.4, we will analyze the finite-sample performance of the proposed framework, taking the approximation error of the predictor into account. Additionally, by characterizing the difficulty of a two-sample testing problem with MI, Theorem 5.4 suggests that the proposed framework asymptotically turns the original hard two-sample testing problem, with low dependency between the feature and label variables (low MI), into a simple one by increasing the dependency between them (high MI).
Remark 5.5.
Our testing framework is also consistent under and the same conditions of Theorem 5.4 as . The last equality holds due to under .
5.4 Finite-sample analysis for the proposed framework
This section analyzes the testing power of the proposed framework in the finite-sample case. Section 5.4.1 and Section 5.4.2 offer metrics that assess the approximation error of and an irreducible Type II error. These metrics together determine the finite-sample testing power. Furthermore, Section 5.4.3 presents an illustrative example of using our framework. In Section 5.4.4, we conduct a finite-sample analysis for the example, incorporating both the metrics that characterize the approximation error and the irreducible Type II error.
5.4.1 Characterizing the approximation error of
As our framework constructs the test statistic in equation 2 with the approximation , there arises a need to establish a metric for assessing the approximation error of for our finite-sample analysis. To this end, we introduce -divergence,
Definition 5.6.
(-divergence) Let and be two probability density functions on the same support . Let . Then, the -divergence between and is
(10) |
is the second moment of the log-likelihood ratio and has been used (see, e.g., (3.1.14) in (Koga et al., 2002)) to understand the behavior of the distribution of . We use to evaluate the distance between and , which yields the following
(11) |
Remarkably, in equation 11 also characterizes the discrepancy between and by averaging their log square distance over ; in our main result, we will see that the testing power of the proposed framework depends on . Additionally, is closely related to the typical KL divergence . This can be seen by expanding equation 11 using the formula resulting in,
(12) |
Equation 12 implies that not only measures the expected distance between and over but also the variance of that distance. Similarly, we write
(13) |
to measure the discrepancy between and but with a reverse direction opposed to .
and both characterize the approximation error of , and we will also see they jointly determine the testing power of the proposed framework in Section 5.4.4.
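The decomposition behind equation 12 is the identity that the second moment of the log-likelihood ratio equals its variance plus its squared mean (the KL divergence). The sketch below checks this numerically for a single pair of discrete distributions; it is a simplified illustration that omits the averaging over features used in the definitions above, and the name `log_ratio_moments` and the example distributions are hypothetical.

```python
import numpy as np

def log_ratio_moments(p, q):
    """Return (KL divergence, second moment, variance) of log(p/q) under p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    llr = np.log(p / q)  # log-likelihood ratio; assumes strictly positive entries
    kl = float(np.sum(p * llr))                     # mean of the log ratio
    second_moment = float(np.sum(p * llr ** 2))     # second moment of the log ratio
    variance = float(np.sum(p * (llr - kl) ** 2))   # variance of the log ratio
    return kl, second_moment, variance

# A true label posterior (0.8, 0.2) approximated by a predictor output (0.6, 0.4).
kl, second_moment, variance = log_ratio_moments([0.8, 0.2], [0.6, 0.4])
```

Swapping the two arguments gives the reverse-direction quantity; both vanish when the predictor matches the true posterior exactly.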
5.4.2 Characterizing the factor that leads to the irreducible Type II error in finite-sample case
We also introduce another factor influencing testing power, which persists even in the absence of approximation error, i.e., . To see this, we recall the information spectrum introduced in (Han & Verdú, 1993),
Definition 5.7.
(Information spectrum (Han & Verdú, 1993)) Let be a pair of random variables over the support . Let denote the joint distribution of , and let and denote the marginal distributions of and . Suppose is a sequence of i.i.d random variables for . Then, the information spectrum is the probability distribution of the following random variable,
(14) |
It is easy to see the expectation of is the mutual information for . Substituting in equation 14 with the feature-label variable pair in our two-sample testing problem recovers the (negated) normalizing test statistic in equation 8 with and inserted, i.e., in the absence of approximation error.
(Han, 2000) leverages the dispersion of the information spectrum (the distribution of ) for to quantify the rate that Type II error goes to zero with increasing . Their underlying rationale is that, for a larger variance of , the probability of falling outside the acceptance region for an alternative hypothesis also increases, thereby resulting in a slower convergence rate for the Type II error. In our work, we will make use of the variance of the log-likelihood ratio between and
(15) |
Scaling down by is the variance of , characterizing the dispersion of the information spectrum for given . is also known as the relative entropy variance (see, e.g., (2.29) in (Tan et al., 2014)). It remains present even in the absence of approximation error (i.e., ). As we will see in Section 5.4.4, the persistent leads to a non-zero Type II error in the finite-sample case.
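To make the information spectrum concrete, the sketch below computes the mutual information and the variance of the log-likelihood ratio for a small discrete joint distribution, and then averages the log ratio over i.i.d. draws. It is a minimal sketch under assumptions: the joint table is a hypothetical example with strictly positive entries, and `mutual_information` is an illustrative name.

```python
import numpy as np

def mutual_information(joint):
    """Return (MI, variance of the log ratio, log-ratio table) for a joint pmf."""
    joint = np.asarray(joint, dtype=float)
    p_s = joint.sum(axis=1, keepdims=True)  # marginal of the row variable
    p_z = joint.sum(axis=0, keepdims=True)  # marginal of the column variable
    log_ratio = np.log(joint / (p_s * p_z))
    mi = float(np.sum(joint * log_ratio))                    # mean of the log ratio
    variance = float(np.sum(joint * (log_ratio - mi) ** 2))  # its variance
    return mi, variance, log_ratio

joint = np.array([[0.4, 0.1], [0.1, 0.4]])
mi, variance, log_ratio = mutual_information(joint)

# Sample mean of the log ratio over i.i.d. draws from the joint distribution.
rng = np.random.default_rng(0)
cells = rng.choice(4, size=20000, p=joint.ravel())
sample_mean = float(log_ratio.ravel()[cells].mean())
```

The sample mean concentrates around the mutual information, and its spread shrinks as the per-sample variance divided by the number of draws, which is why a larger variance slows the decay of the Type II error.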
5.4.3 An example of using the proposed framework
We first introduce the notation that will be used in the ensuing sections. We write to denote a partition of the support of from which unlabeled sample features in are generated; in other words, . We compare an example of our proposed framework with the baseline, where features are randomly sampled from and labeled. We quantitatively analyze the testing power of both cases. Both the example and the baseline are detailed as follows:
In the example of using the proposed framework, the class priors are given to simplify our analytical results; however, one can estimate these priors with labels in each and use the prior estimates to replace , and that will not change the main argument of our theorem. In addition, the analyst chooses the partition predicted by to have the highest dependency between and and only conducts sequential testing with the labeled points in . In contrast, the baseline conducts the sequential test in exactly the same way, except that the analyst queries the labels of features that are randomly generated from . Both the proposed framework and the baseline assume a fixed class-probability predictor with no updates during the sequential testing; that is sufficient for our analysis, as we will see that the testing power for the above cases depends on in equation 11, in equation 12 and in equation 15.
5.4.4 Finite-sample analysis for the example
We use and to capture the maximum approximation error of over the partition , and use to capture the maximum irreducible Type II error over the same partition .
We will need to make the following assumptions before presenting our results.
Assumption 5.8.
(Maximum mutual information gain) .
Assumption 5.8 characterizes the largest MI gain of the proposed framework in the example over the baseline; that is the direct reason for the increased testing power of the proposed framework.
Assumption 5.9.
(Sufficient number of unlabeled samples) .
Even though we typically have access to only a finite number of unlabeled samples in real-world scenarios, this number is usually quite large and affordable for many applications. Hence, similar to (Hanneke & Yang, 2015), Assumption 5.9 assumes a sufficient supply of unlabeled samples to simplify the analysis and concentrate solely on the number of labels needed for the proposed framework in the example.
Now, we present our theorem to address the testing power of the framework in the example and the baseline test in the finite-sample case.
Theorem 5.10.
Equations 17 and 18 state approximate lower bounds on the testing power of the proposed framework in the example and of the baseline test. We can observe that
- •
- •
- When the approximation errors and/or , both testing power’s lower bounds are decreased by a factor of , resulting in the irreducible Type II error.
- When the maximum MI gain can compensate for the approximation error of being larger than , our framework in the example has a higher lower bound on the testing power than the baseline test given the same label budget and .
6 Experimental Results
We have proposed a practical instantiation of the framework, and its algorithmic description BQ-AST is presented in Algorithm 1. In this section, we compare the BQ-AST with a sequential testing baseline (Lhéritier & Cazals, 2018) that uses the same statistic in equation 2, but the baseline labels features randomly sampled from the unlabeled set . In addition, we build for the test statistic in equation 2 using logistic regression, SVM, or KNN classifiers; we set for the number of label queries used to initialize , and set significance level .
6.1 Experiments on Synthetic Datasets
Our first suite of experimental results is generated from synthetic data. We create synthetic datasets that comprise two samples of data to simulate cases under the null hypothesis and the alternative hypothesis ; the data for the first sample () is generated from and the data for the second sample () is generated from . In addition, we set from to to vary the ratio of the data sizes for two samples. For the simulations of the data under , we set , implying there is no difference between the distributions that generate the two samples; for the simulations of the data under , we vary from to to simulate two samples from small to high discrepancy under . Having constructed the data-generating process, we simulate 200 cases of data for each pair of and under , and simulate 500 cases of data for each pair of and under . Each case of data is of size 2000 with labels masked, resulting in an unlabeled set with . The proposed test actively and sequentially labels features in to test the difference between the two samples.
![Figure 2: Empirical Type I errors](extracted/5697244/TypeIError/SynSeqTestingSep0.00AllPriorsTypeI.png)
Figure 2 presents the empirical Type I errors: when is true, the probability of the proposed test mistakenly predicting the two samples is generated under . As observed, the empirical Type I errors are all smaller than for using various classifiers and label budgets in the experiments; this provides empirical evidence for Theorem 5.1, which states that the Type I error is controlled to be smaller than the significance level .
Table 1 presents the empirical Type II errors: when is true, the probability of the proposed test and the baseline test mistakenly predicting that the two samples are generated under . Table 2 presents the average number of label queries spent to reject when is true. We can observe from Table 1 that the proposed test produces lower Type II errors than the baseline under different classifiers and label budgets; furthermore, in Table 2, we observe that the proposed test spends fewer label queries than the baseline test. Additionally, we run a two-sample t-test to assess the mean difference of the label query numbers generated by 200 runs using both methods. The resultant -values, truncated to 6 decimal places, all equal zero, indicating that the number of labels spent by our framework is statistically significantly smaller than that of the baseline test. All these observations demonstrate that, under , the proposed test labels the features that have a high dependency on labels to effectively decrease the Type II error and reduce the number of label queries needed to reject .
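The comparison of label-query counts described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the text does not say which t-test variant was used, so Welch's unequal-variance statistic is an assumption, the p-value comes from a normal approximation to the t distribution (reasonable for a few hundred runs per method), and the function name `welch_t_test` and the toy query counts are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t statistic, approximate two-sided p-value) for mean(a) - mean(b).

    Assumptions: Welch's unequal-variance statistic, with the p-value taken
    from a standard normal approximation (adequate for large samples).
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Toy query counts for 200 runs per method (made-up numbers, not paper data).
rng = random.Random(0)
baseline_queries = [rng.gauss(450, 100) for _ in range(200)]
proposed_queries = [rng.gauss(110, 60) for _ in range(200)]

t_stat, p_value = welch_t_test(proposed_queries, baseline_queries)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```

A strongly negative statistic with a p-value that prints as zero at six decimals would correspond to the situation the paragraph describes, where the proposed test uses fewer labels on average.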
Logistic | KNN | ||||||||||
Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | |
Baseline | 0.82 | 0.53 | 0.29 | 0.11 | 0.04 | 0.95 | 0.77 | 0.50 | 0.28 | 0.14 | |
Proposed | 0.16 | 0.02 | 0.00 | 0.00 | 0.00 | 0.49 | 0.17 | 0.06 | 0.03 | 0.01 | |
Baseline | 0.80 | 0.50 | 0.23 | 0.12 | 0.06 | 0.95 | 0.77 | 0.48 | 0.29 | 0.14 | |
Proposed | 0.26 | 0.06 | 0.01 | 0.01 | 0.01 | 0.59 | 0.26 | 0.09 | 0.03 | 0.01 | |
Baseline | 0.81 | 0.56 | 0.34 | 0.22 | 0.10 | 0.96 | 0.81 | 0.58 | 0.36 | 0.28 | |
Proposed | 0.26 | 0.04 | 0.01 | 0.01 | 0.01 | 0.71 | 0.33 | 0.14 | 0.04 | 0.02 | |
Baseline | 0.88 | 0.73 | 0.56 | 0.35 | 0.21 | 0.98 | 0.90 | 0.77 | 0.59 | 0.48 | |
Proposed | 0.38 | 0.10 | 0.04 | 0.03 | 0.02 | 0.80 | 0.50 | 0.28 | 0.16 | 0.10 |
Logistic | KNN | ||||||||||
Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | |
Baseline | 183.5 ± 41 | 319.7 ± 113 | 399.1 ± 183 | 438.1 ± 233 | 451.4 ± 257 | 198.1 ± 10 | 374.4 ± 61 | 500.4 ± 132 | 578.1 ± 201 | 619.5 ± 254 | |
Proposed | 95.3 ± 64 | 108.1 ± 92 | 108.6 ± 93 | 108.6 ± 93 | 108.6 ± 93 | 162.1 ± 50 | 223.1 ± 116 | 240.8 ± 149 | 249.7 ± 173 | 252.8 ± 184 | |
Baseline | 182.3 ± 41 | 312.4 ± 116 | 386.0 ± 184 | 419.7 ± 231 | 439.0 ± 266 | 196.7 ± 16 | 373.7 ± 66 | 499.3 ± 134 | 578.9 ± 206 | 619.7 ± 256 | |
Proposed | 107.9 ± 70 | 134.2 ± 114 | 142.3 ± 136 | 143.7 ± 142 | 144.7 ± 147 | 166.3 ± 48 | 246.8 ± 123 | 282.2 ± 175 | 294.3 ± 200 | 296.6 ± 207 | |
Baseline | 184.0 ± 41 | 323.3 ± 113 | 415.5 ± 188 | 472.2 ± 252 | 505.0 ± 299 | 198.3 ± 11 | 378.5 ± 58 | 520.0 ± 127 | 613.4 ± 199 | 678.1 ± 268 | |
Proposed | 120.4 ± 67 | 143.4 ± 104 | 147.6 ± 117 | 149.0 ± 122 | 150.0 ± 128 | 178.0 ± 43 | 282.2 ± 117 | 327.4 ± 173 | 345.9 ± 207 | 351.7 ± 222 | |
Baseline | 190.8 ± 31 | 351.7 ± 96 | 479.6 ± 172 | 571.1 ± 245 | 628.0 ± 306 | 199.0 ± 8 | 386.6 ± 47 | 555.0 ± 106 | 689.5 ± 175 | 798.4 ± 253 | |
Proposed | 134.7 ± 64 | 174.8 ± 118 | 189.5 ± 151 | 195.6 ± 170 | 199.7 ± 186 | 184.4 ± 36 | 310.2 ± 111 | 387.7 ± 186 | 434.7 ± 247 | 462.6 ± 293 |
We present the average number of label queries spent for two samples with small to large discrepancies under in Table 3. A small discrepancy between two samples makes for a more difficult two-sample testing problem than a large one, as a two-sample test requires more data to detect a small discrepancy. Table 3 shows that the proposed active sequential test spends fewer labels to reject as the mean discrepancy between the two samples increases, which demonstrates that the proposed sequential test automatically adapts the number of label queries to the problem’s complexity.
Logistic | KNN | ||||||||
0.2 | 0.3 | 0.4 | 0.5 | 0.2 | 0.3 | 0.4 | 0.5 | ||
Baseline | 451.4 ± 257 | 178.3 ± 105 | 101.0 ± 58 | 63.9 ± 32 | 619.5 ± 254 | 287.8 ± 129 | 167.4 ± 70 | 116.8 ± 43 | |
Proposed | 108.6 ± 93 | 37.3 ± 22 | 24.3 ± 10 | 19.7 ± 5 | 252.8 ± 184 | 109.5 ± 64 | 72.2 ± 33 | 54.9 ± 20 | |
Baseline | 439.0 ± 266 | 175.3 ± 118 | 96.9 ± 65 | 65.5 ± 40 | 619.7 ± 256 | 289.8 ± 130 | 170.2 ± 72 | 116.2 ± 47 | |
Proposed | 144.7 ± 147 | 40.5 ± 30 | 24.9 ± 11 | 20.1 ± 7 | 296.6 ± 207 | 134.3 ± 88 | 84.3 ± 43 | 58.3 ± 25 | |
Baseline | 505.0 ± 299 | 223.6 ± 145 | 115.7 ± 70 | 75.7 ± 47 | 678.1 ± 268 | 349.3 ± 178 | 198.2 ± 93 | 133.3 ± 56 | |
Proposed | 150.0 ± 128 | 57.1 ± 42 | 32.3 ± 21 | 22.2 ± 8 | 351.7 ± 222 | 160.2 ± 107 | 94.0 ± 54 | 67.0 ± 30 | |
Baseline | 628.0 ± 306 | 278.1 ± 177 | 149.3 ± 95 | 94.8 ± 56 | 798.4 ± 253 | 470.3 ± 223 | 268.7 ± 126 | 176.3 ± 81 | |
Proposed | 199.7 ± 186 | 66.7 ± 41 | 40.0 ± 22 | 29.4 ± 15 | 462.6 ± 293 | 198.8 ± 143 | 115.7 ± 65 | 83.8 ± 46 |
6.2 Experiments on MNIST
![Figure 3: Empirical Type I errors on MNIST](extracted/5697244/TypeIError/MNISTSeqTestingSep0.00AllPriorsTypeI.png)
In addition to the synthetic datasets, we simulate the cases of and with MNIST (LeCun, 1998). To create a case for , we randomly pick one digit category from 0-9, then randomly sample images from the selected digit category, and lastly divide the images into sample zero () and sample one () based on a pre-defined class prior ; for each case, the two samples contain data from the same digit, but the digit categories may differ across cases. To create a case for , we randomly pick two different digit categories from 0-9, then sample images from one digit category and place the images in sample zero (); to create sample one (), we sample images from the two digits, mix the sampled images, and place them in sample one. We set the mixture ratio , meaning that roughly of the data in sample one is generated from a distribution different from that of sample zero. We also adjust to create cases with different ratios of the size of sample zero to that of sample one for . We produce 500 cases for and 200 cases for with the stated procedure for each that ranges from to ; each case comprises an unlabeled set with a size of 2000 and its corresponding labels, which are unknown to the analyst. Instead of using the raw data in the created cases, we project the MNIST data to a 28-dimensional space with a convolutional autoencoder before conducting the two-sample testing.
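The case-construction procedure above can be sketched in a few lines. This is an illustrative reading of the paragraph, not the authors' code: the function names, the `prior_one` parameter (taken here to be the fraction of the case placed in sample one), the `mix_ratio` parameter (the fraction of sample one drawn from the second digit), and the toy pool of string identifiers standing in for images are all assumptions.

```python
import random


def make_null_case(images_by_digit, n, prior_one, rng):
    """Null case: every image comes from one digit, split by the class prior."""
    digit = rng.choice(sorted(images_by_digit))
    features = rng.sample(images_by_digit[digit], n)
    n1 = round(prior_one * n)  # size of sample one
    labels = [1] * n1 + [0] * (n - n1)
    rng.shuffle(labels)  # labels carry no information about the features
    return features, labels


def make_alt_case(images_by_digit, n, prior_one, mix_ratio, rng):
    """Alternative case: sample zero holds one digit, sample one mixes two."""
    d0, d1 = rng.sample(sorted(images_by_digit), 2)
    n1 = round(prior_one * n)
    n0 = n - n1
    n1_other = round(mix_ratio * n1)  # part of sample one from the second digit
    from_d0 = rng.sample(images_by_digit[d0], n0 + (n1 - n1_other))
    from_d1 = rng.sample(images_by_digit[d1], n1_other)
    pairs = [(x, 0) for x in from_d0[:n0]]
    pairs += [(x, 1) for x in from_d0[n0:] + from_d1]
    rng.shuffle(pairs)
    features = [x for x, _ in pairs]
    labels = [y for _, y in pairs]
    return features, labels


# Toy pool: string identifiers stand in for (encoded) images.
pool = {d: [f"d{d}_{i}" for i in range(300)] for d in range(10)}
features, labels = make_alt_case(pool, n=200, prior_one=0.5, mix_ratio=0.4,
                                 rng=random.Random(0))
print(len(features), sum(labels))
```

In an experiment the labels would then be hidden from the analyst and revealed only when queried.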
We first present the empirical Type I errors in Figure 3. We use the support vector machine (SVM) to build to generate the results. As observed, all the Type I errors are smaller than , which agrees with Theorem 5.1. In addition, we present the Type II errors in Table 4. The proposed test generates smaller Type II errors than the baseline sequential test for various classifiers, label budgets, and , implying that sequential testing combined with active querying is effective. This is further corroborated by Table 5, which reports the average number of label queries needed to reject ; the proposed test spends fewer label queries than the baseline test to reject . We additionally run a two-sample t-test to statistically compare the mean label query numbers generated by both methods. The resultant -values, truncated to 6 decimal places, all equal zero, indicating that the number of labels spent by our framework is statistically significantly smaller than that of the baseline test in the MNIST experiment.
Logistic | SVM | KNN | ||||||||||||||
Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | |
Baseline | 0.65 | 0.21 | 0.02 | 0.01 | 0.01 | 0.59 | 0.07 | 0.00 | 0.00 | 0.00 | 0.84 | 0.43 | 0.15 | 0.07 | 0.03 | |
Proposed | 0.12 | 0.01 | 0.01 | 0.00 | 0.00 | 0.12 | 0.03 | 0.01 | 0.01 | 0.00 | 0.10 | 0.01 | 0.01 | 0.00 | 0.00 | |
Baseline | 0.59 | 0.16 | 0.02 | 0.01 | 0.01 | 0.55 | 0.04 | 0.00 | 0.00 | 0.00 | 0.89 | 0.43 | 0.15 | 0.06 | 0.03 | |
Proposed | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.01 | 0.00 | 0.00 | 0.00 | 0.06 | 0.02 | 0.01 | 0.01 | 0.00 | |
Baseline | 0.58 | 0.21 | 0.04 | 0.01 | 0.00 | 0.67 | 0.15 | 0.01 | 0.00 | 0.00 | 0.91 | 0.58 | 0.29 | 0.10 | 0.04 | |
Proposed | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.01 | 0.00 | 0.00 | 0.00 | 0.12 | 0.03 | 0.01 | 0.00 | 0.00 | |
Baseline | 0.66 | 0.24 | 0.04 | 0.01 | 0.01 | 0.77 | 0.32 | 0.10 | 0.01 | 0.01 | 0.95 | 0.71 | 0.47 | 0.27 | 0.12 | |
Proposed | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.01 | 0.00 | 0.00 | 0.00 | 0.14 | 0.03 | 0.01 | 0.01 | 0.00 |
Logistic | SVM | KNN | ||||||||||||||
Label budget | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | 200 | 400 | 600 | 800 | 1000 | |
Baseline | 165.3 ± 56 | 251.7 ± 126 | 267.9 ± 150 | 270.7 ± 158 | 271.7 ± 162 | 175.4 ± 39 | 229.9 ± 93 | 233.0 ± 99 | 233.0 ± 99 | 233.0 ± 99 | 187.0 ± 30 | 311.3 ± 93 | 359.1 ± 141 | 376.8 ± 167 | 384.6 ± 185 | |
Proposed | 90.4 ± 62 | 99.8 ± 84 | 101.0 ± 89 | 101.5 ± 92 | 101.5 ± 92 | 93.5 ± 55 | 106.5 ± 87 | 109.4 ± 98 | 110.5 ± 105 | 110.7 ± 106 | 89.4 ± 51 | 97.9 ± 75 | 99.9 ± 84 | 100.1 ± 86 | 100.1 ± 86 | |
Baseline | 160.8 ± 59 | 233.2 ± 125 | 247.5 ± 148 | 249.3 ± 154 | 250.3 ± 158 | 173.5 ± 39 | 226.8 ± 95 | 229.8 ± 101 | 229.8 ± 101 | 229.8 ± 101 | 187.5 ± 31 | 315.1 ± 93 | 363.9 ± 142 | 379.8 ± 166 | 385.4 ± 178 | |
Proposed | 61.7 ± 43 | 61.7 ± 43 | 61.7 ± 43 | 61.7 ± 43 | 61.7 ± 43 | 79.4 ± 48 | 83.2 ± 60 | 83.4 ± 62 | 83.4 ± 62 | 83.4 ± 62 | 85.0 ± 49 | 90.9 ± 68 | 94.0 ± 85 | 95.6 ± 95 | 96.6 ± 103 | |
Baseline | 160.3 ± 59 | 234.8 ± 128 | 255.2 ± 161 | 257.8 ± 167 | 258.0 ± 168 | 174.6 ± 45 | 252.1 ± 109 | 264.6 ± 130 | 265.4 ± 133 | 265.4 ± 133 | 188.4 ± 31 | 330.7 ± 94 | 415.7 ± 162 | 451.8 ± 206 | 463.2 ± 225 | |
Proposed | 46.2 ± 28 | 46.2 ± 28 | 46.2 ± 28 | 46.2 ± 28 | 46.2 ± 28 | 74.7 ± 56 | 82.0 ± 76 | 83.0 ± 81 | 83.0 ± 81 | 83.0 ± 81 | 89.3 ± 54 | 101.5 ± 85 | 104.6 ± 98 | 105.1 ± 101 | 105.1 ± 101 | |
Baseline | 163.9 ± 58 | 243.3 ± 126 | 268.1 ± 163 | 273.1 ± 175 | 275.3 ± 183 | 92.6 ± 16 | 148.5 ± 52 | 167.3 ± 76 | 171.7 ± 85 | 172.6 ± 88 | 192.8 ± 25 | 357.3 ± 76 | 471.8 ± 146 | 540.7 ± 210 | 575.4 ± 255 | |
Proposed | 34.8 ± 17 | 34.8 ± 17 | 34.8 ± 17 | 34.8 ± 17 | 34.8 ± 17 | 77.2 ± 55 | 81.6 ± 68 | 82.3 ± 72 | 82.3 ± 72 | 82.3 ± 72 | 104.8 ± 54 | 116.9 ± 81 | 119.1 ± 90 | 120.1 ± 96 | 120.3 ± 98 |
6.3 Experiments on An Alzheimer’s Disease Dataset
Logistic | SVM | KNN | ||||||||||||||
Label budget | 100 | 200 | 300 | 400 | 500 | 100 | 200 | 300 | 400 | 500 | 100 | 200 | 300 | 400 | 500 | |
Baseline | 0.32 | 0.06 | 0.01 | 0.00 | 0.00 | 0.67 | 0.17 | 0.02 | 0.00 | 0.00 | 0.72 | 0.49 | 0.25 | 0.13 | 0.04 | |
Proposed | 0.10 | 0.01 | 0.00 | 0.00 | 0.00 | 0.24 | 0.03 | 0.01 | 0.00 | 0.00 | 0.21 | 0.04 | 0.00 | 0.00 | 0.00 | |
Baseline | 0.35 | 0.04 | 0.00 | 0.00 | 0.00 | 0.62 | 0.15 | 0.01 | 0.00 | 0.00 | 0.73 | 0.25 | 0.06 | 0.01 | 0.00 | |
Proposed | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.18 | 0.04 | 0.03 | 0.00 | 0.00 | 0.10 | 0.01 | 0.00 | 0.00 | 0.00 | |
Baseline | 0.40 | 0.10 | 0.01 | 0.00 | 0.00 | 0.65 | 0.21 | 0.06 | 0.00 | 0.00 | 0.81 | 0.36 | 0.12 | 0.04 | 0.02 | |
Proposed | 0.11 | 0.03 | 0.00 | 0.00 | 0.00 | 0.32 | 0.07 | 0.02 | 0.01 | 0.01 | 0.25 | 0.04 | 0.01 | 0.00 | 0.00 | |
Baseline | 0.52 | 0.23 | 0.07 | 0.01 | 0.00 | 0.89 | 0.53 | 0.27 | 0.07 | 0.02 | 0.90 | 0.59 | 0.28 | 0.16 | 0.07 | |
Proposed | 0.28 | 0.01 | 0.00 | 0.00 | 0.00 | 0.49 | 0.15 | 0.06 | 0.03 | 0.01 | 0.38 | 0.10 | 0.03 | 0.01 | 0.01 |
We demonstrate the utility of the proposed test in a clinical application using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (Jack Jr et al., 2008). The ADNI study protocol was approved by local institutional review boards (IRB). All personal information has been removed from the data provided to researchers. The motivation for applying the proposed test to Alzheimer’s disease research is as follows. Amyloid has been linked to the development of Alzheimer’s disease; identifying the amount of amyloid in the human brain is an important step in predicting the progression of Alzheimer’s disease. To measure the amyloid level, an expensive CT scan is required to assess the amyloid deposition in the brain. A useful alternative would be an easy-to-measure and inexpensive surrogate for the amyloid level that indicates the progression of Alzheimer’s disease. In the following experiments, we consider using digital test results that include five cognition measurement scores of participants as such a surrogate. To verify whether the digital test results are suitable, clinicians seek an approach to test the independence between the digital test results and the amyloid amount with a limited number of expensive CT scans to measure the amyloid levels. We use a binary version of the amyloid level where and suggest low and high amyloid depositions in the brain respectively; we can now formulate a two-sample test and use the proposed scheme. As the results show, our proposed test is endowed with sequential decision-making and active label query, resulting in fewer CT scans needed compared with the conventional sequential test.
Logistic | SVM | KNN | ||||||||||||||
Label budget | 100 | 200 | 300 | 400 | 500 | 100 | 200 | 300 | 400 | 500 | 100 | 200 | 300 | 400 | 500 | |
Baseline | 68.1 ± 29 | 83.7 ± 52 | 85.5 ± 57 | 85.6 ± 57 | 85.6 ± 57 | 87.0 ± 22 | 127.2 ± 55 | 135.4 ± 69 | 136.1 ± 70 | 136.1 ± 70 | 86.2 ± 26 | 145.8 ± 66 | 181.8 ± 100 | 199.5 ± 124 | 207.3 ± 138 | |
Proposed | 43.9 ± 29 | 47.1 ± 36 | 47.1 ± 37 | 47.1 ± 37 | 47.1 ± 37 | 64.0 ± 28 | 75.1 ± 47 | 76.5 ± 51 | 76.6 ± 52 | 76.6 ± 52 | 69.3 ± 22 | 76.6 ± 37 | 77.8 ± 41 | 77.8 ± 41 | 77.8 ± 41 | |
Baseline | 68.4 ± 29 | 84.0 ± 51 | 86.1 ± 57 | 86.1 ± 57 | 86.1 ± 57 | 85.0 ± 23 | 121.0 ± 55 | 127.5 ± 67 | 127.5 ± 67 | 127.5 ± 67 | 92.7 ± 15 | 140.3 ± 51 | 153.0 ± 69 | 156.0 ± 77 | 156.3 ± 78 | |
Proposed | 43.9 ± 29 | 45.3 ± 32 | 45.3 ± 32 | 45.3 ± 32 | 45.3 ± 32 | 61.3 ± 26 | 70.0 ± 44 | 72.9 ± 54 | 74.5 ± 61 | 74.5 ± 61 | 60.8 ± 20 | 64.4 ± 30 | 64.4 ± 30 | 64.4 ± 30 | 64.4 ± 30 | |
Baseline | 72.3 ± 29 | 95.6 ± 58 | 100.9 ± 70 | 101.1 ± 70 | 101.1 ± 70 | 86.5 ± 23 | 126.7 ± 57 | 139.0 ± 76 | 141.3 ± 82 | 141.3 ± 82 | 94.7 ± 13 | 153.5 ± 49 | 176.5 ± 77 | 183.7 ± 90 | 186.1 ± 97 | |
Proposed | 50.6 ± 29 | 56.6 ± 43 | 57.1 ± 45 | 57.1 ± 45 | 57.1 ± 45 | 68.8 ± 29 | 85.1 ± 53 | 89.0 ± 63 | 90.0 ± 67 | 90.5 ± 69 | 68.6 ± 24 | 78.5 ± 42 | 79.9 ± 47 | 79.9 ± 47 | 79.9 ± 47 | |
Baseline | 78.1 ± 28 | 115.4 ± 65 | 128.5 ± 86 | 132.5 ± 95 | 132.9 ± 96 | 95.9 ± 13 | 166.9 ± 47 | 204.8 ± 81 | 219.6 ± 101 | 222.8 ± 108 | 97.6 ± 8 | 171.0 ± 43 | 215.3 ± 79 | 235.8 ± 106 | 247.6 ± 126 | |
Proposed | 63.6 ± 32 | 72.0 ± 44 | 72.2 ± 45 | 72.2 ± 45 | 72.2 ± 45 | 80.1 ± 26 | 108.6 ± 57 | 118.1 ± 75 | 121.9 ± 86 | 124.0 ± 93 | 80.7 ± 21 | 98.3 ± 46 | 102.4 ± 57 | 103.9 ± 63 | 104.9 ± 68 |
The obtained ADNI data contains both the digital test results and the amyloid amount of participants. We use the cut-off value suggested by ADNI to binarize the amyloid amount and create two-sample cases where denotes a vector of cognition measurement scores and denotes a low or high amyloid amount for a participant. We create 200 data cases for each that ranges from to ; these cases are simulations for , and each case comprises an unlabeled set with a size of 1000 and its corresponding labels, which are unknown to the analyst.
Table 6 and Table 7 present the empirical Type II errors and the average number of label queries needed to reject . Compared with the baseline test under the same label budgets, our proposed test decreases Type II errors by 58% and label queries by 62% at most. Additionally, we run a two-sample t-test to statistically compare the mean label query numbers generated by both methods. The resultant -values, truncated to 6 decimal places, all equal zero; this indicates that the label savings are statistically significant.
7 Conclusion
We propose an active sequential two-sample testing framework that sequentially and actively labels the data to increase the testing power and adapt the number of label queries to the problem’s complexity. We provide both finite-sample and asymptotic analysis of the proposed framework; the framework’s benefit is characterized by the change of the mutual information between feature and label variables over a random labeling scheme in both finite-sample and asymptotic cases. Moreover, we suggest an instantiation of the framework, in which we adopt the bimodal query that labels the features predicted by a classifier to have the highest class one or zero probabilities. Our experiments on synthetic data, MNIST, and an Alzheimer’s Disease dataset demonstrate the effectiveness of the suggested instantiation of the proposed framework.
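As a small illustration of the bimodal query summarized above, the sketch below picks, from a pool of classifier outputs, the feature most confidently predicted as class one and the feature most confidently predicted as class zero. How the instantiation chooses between these two modes at each step is not restated in this section, so the sketch returns both candidates; the function name `bimodal_candidates` and the example probabilities are hypothetical.

```python
def bimodal_candidates(prob_one):
    """Return (index most confidently class one, index most confidently class zero).

    prob_one[i] is a classifier's estimate of P(label = 1 | feature i) for the
    i-th feature still in the unlabeled pool.
    """
    if not prob_one:
        raise ValueError("unlabeled pool is empty")
    indices = range(len(prob_one))
    idx_one = max(indices, key=lambda i: prob_one[i])   # highest class-one probability
    idx_zero = min(indices, key=lambda i: prob_one[i])  # highest class-zero probability
    return idx_one, idx_zero


# Hypothetical classifier outputs for five unlabeled features.
probs = [0.55, 0.92, 0.08, 0.40, 0.71]
print(bimodal_candidates(probs))  # (1, 2)
```

A random-labeling baseline would instead draw an index uniformly from the pool, ignoring the classifier's confidence.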
Acknowledgement
This work was funded in part by the Office of Naval Research under grant N00014-21-1-2615 and by the National Science Foundation (NSF) under grants CNS-2003111 and CCF-2048223.
References
- Aaditya Ramdas (2018) Aaditya Ramdas. Martingales, Ville and Doob, 2018. https://www.stat.cmu.edu/~aramdas/martingales18/L2-martingales.pdf.
- Balsubramani & Ramdas (2015) Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. arXiv preprint arXiv:1506.03486, 2015.
- Bessler (1960) Stuart A Bessler. Theory and applications of the sequential design of experiments, k-actions and infinitely many experiments. part i. theory. Technical report, Stanford Univ CA Applied Mathematics and Statistics Labs, 1960.
- Blot & Meeter (1973) William J Blot and Duane A Meeter. Sequential experimental design procedures. Journal of the American Statistical Association, 68(343):586–593, 1973.
- Chernoff (1959) Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
- Doob (1939) JL Doob. Jean Ville, étude critique de la notion de collectif. Bulletin of the American Mathematical Society, 45(11):824–824, 1939.
- Duan et al. (2022) Boyan Duan, Aaditya Ramdas, and Larry Wasserman. Interactive rank testing by betting. In Conference on Causal Learning and Reasoning, pp. 201–235. PMLR, 2022.
- Dunn (1961) Olive Jean Dunn. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64, 1961.
- Durrett (2019) Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
- Friedman & Rafsky (1979) Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pp. 697–717, 1979.
- Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
- Györfi et al. (2002) László Györfi, Michael Köhler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
- Hajnal (1961) J Hajnal. A two-sample sequential t-test. Biometrika, 48(1/2):65–75, 1961.
- Han (2000) Te Sun Han. Hypothesis testing with the general source. arXiv preprint math/0004121, 2000.
- Han & Verdú (1993) Te Sun Han and Sergio Verdú. Approximation theory of output statistics. IEEE Transactions on Information Theory, 39(3):752–772, 1993.
- Hanneke & Yang (2015) Steve Hanneke and Liu Yang. Minimax analysis of active learning. J. Mach. Learn. Res., 16(1):3487–3602, 2015.
- Hotelling (1992) Harold Hotelling. The generalization of student’s ratio. In Breakthroughs in statistics, pp. 54–65. Springer, 1992.
- Jack Jr et al. (2008) Clifford R Jack Jr, Matt A Bernstein, Nick C Fox, Paul Thompson, Gene Alexander, Danielle Harvey, Bret Borowski, Paula J Britson, Jennifer L. Whitwell, Chadwick Ward, et al. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 27(4):685–691, 2008.
- Johari et al. (2022) Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Always valid inference: Continuous monitoring of a/b tests. Operations Research, 70(3):1806–1821, 2022.
- Keener (1984) Robert Keener. Second order efficiency in the sequential design of experiments. The Annals of Statistics, pp. 510–532, 1984.
- Kiefer & Sacks (1963) J Kiefer and J Sacks. Asymptotically optimum sequential inference and design. The Annals of Mathematical Statistics, pp. 705–750, 1963.
- Koga et al. (2002) H Koga et al. Information-spectrum methods in information theory, volume 50. Springer Science & Business Media, 2002.
- LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Lhéritier & Cazals (2018) Alix Lhéritier and Frédéric Cazals. A sequential non-parametric multivariate two-sample test. IEEE Transactions on Information Theory, 64(5):3361–3370, 2018.
- Li et al. (2022) Weizhi Li, Gautam Dasarathy, Karthikeyan Natesan Ramamurthy, and Visar Berisha. A label efficient two-sample test. In Uncertainty in Artificial Intelligence, pp. 1168–1177. PMLR, 2022.
- Lopez-Paz & Oquab (2016) David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
- Miller (2007) Steven J Miller. An introduction to linear programming. lecture notes, 2007.
- Naghshvar & Javidi (2013) Mohammad Naghshvar and Tara Javidi. Active sequential hypothesis testing. The Annals of Statistics, 41(6):2703–2738, 2013.
- Pandeva et al. (2022) Teodora Pandeva, Tim Bakker, Christian A Naesseth, and Patrick Forré. E-valuating classifier two-sample tests. arXiv preprint arXiv:2210.13027, 2022.
- Ramdas et al. (2022) Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, and Glenn Shafer. Game-theoretic statistics and safe anytime-valid inference. arXiv preprint arXiv:2210.01948, 2022.
- Shekhar & Ramdas (2021) Shubhanshu Shekhar and Aaditya Ramdas. Game-theoretic formulations of sequential nonparametric one-and two-sample tests. arXiv preprint arXiv:2112.09162, 2021.
- Student (1908) Student. The probable error of a mean. Biometrika, 6(1):1–25, 1908.
- Tan et al. (2014) Vincent YF Tan et al. Asymptotic estimates in information theory with non-vanishing error probabilities. Foundations and Trends® in Communications and Information Theory, 11(1-2):1–184, 2014.
- Ville (1939) Jean Ville. Etude critique de la notion de collectif. Bull. Amer. Math. Soc, 45(11):824, 1939.
- Wald (1992) Abraham Wald. Sequential tests of statistical hypotheses. In Breakthroughs in Statistics, pp. 256–298. Springer, 1992.
- Wasserstein & Lazar (2016) Ronald L Wasserstein and Nicole A Lazar. The ASA statement on p-values: context, process, and purpose, 2016.
- Welch (1990) William J Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.
- Wilks (1938) Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The annals of mathematical statistics, 9(1):60–62, 1938.
Appendix A Proof of Theorem 5.1 and Its Preliminaries
A.1 Some statistical preliminaries
In probability theory, a sequence of random variables is called a martingale if, at any given time, the conditional expectation of the next random variable equals the present observation; this is formally defined as follows.
Definition A.1.
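(Martingale) A sequence of random variables $(M_n)_{n \geq 0}$ is a martingale if, for any $n \geq 0$,
$$\mathbb{E}\left[|M_n|\right] < \infty, \tag{19}$$
$$\mathbb{E}\left[M_{n+1} \mid M_0, \ldots, M_n\right] = M_n. \tag{20}$$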
We refer interested readers to (Aaditya Ramdas, 2018) for a complete introduction to the martingale and its related properties.
Theorem A.2.
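(Ville’s Maximal Inequality, Ville (1939)): If $(M_n)_{n \geq 0}$ is a nonnegative martingale, then for any $a > 0$, we have
$$\mathbb{P}\left(\exists\, n \geq 0 : M_n \geq a\right) \leq \frac{\mathbb{E}[M_0]}{a}. \tag{21}$$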
Ville’s maximal inequality gives a probability upper bound for the event that the martingale crosses a threshold ; it is a sequential extension of Markov’s inequality.
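Ville's maximal inequality can be checked with a short simulation. The sketch below is an illustration under stated assumptions, not the paper's experiment: it builds a nonnegative martingale as a running product of mean-one likelihood-ratio factors, where a hypothetical classifier outputs an arbitrary prediction `q_one` before each label is revealed and the label is drawn independently of it (as under the null hypothesis). The function name, the uniform range for `q_one`, the finite horizon, and the number of paths are all illustrative choices.

```python
import random


def crossing_frequency(alpha, n_paths=5000, horizon=100, prior_one=0.5, seed=0):
    """Fraction of simulated null paths whose martingale ever reaches 1/alpha."""
    rng = random.Random(seed)
    threshold = 1.0 / alpha
    crossings = 0
    for _ in range(n_paths):
        m = 1.0  # the martingale starts at one
        for _ in range(horizon):
            # Prediction fixed before the label is seen (hypothetical classifier).
            q_one = rng.uniform(0.2, 0.8)
            # Under the null, the label is independent of the prediction.
            y = 1 if rng.random() < prior_one else 0
            q_y = q_one if y == 1 else 1.0 - q_one
            p_y = prior_one if y == 1 else 1.0 - prior_one
            m *= q_y / p_y  # this factor has conditional mean one
            if m >= threshold:
                crossings += 1
                break
    return crossings / n_paths


alpha = 0.05
print(f"empirical crossing frequency: {crossing_frequency(alpha):.4f} (bound: {alpha})")
```

Because the simulation stops at a finite horizon and the product usually overshoots the threshold when it crosses, the empirical frequency is expected to sit somewhat below the bound, up to Monte Carlo noise.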
A.2 Proof of Theorem 5.1
Proof.
Our proof comprises proving the following two ordered parts:
(1) The first part is to demonstrate that, under the null hypothesis , the independence between unqueried label random variables and the corresponding feature random variables still holds following the adaptive label query. In particular, under , the feature and label variables and used to construct the test statistic in equation 3 in the proposed framework are independent .
(2) In the second part, we consider , which is the test statistic in equation 2 with true class prior plugged in. Moving forward, the second part is to demonstrate the following inequalities under
(22) |
Equation 22 immediately implies that the Type I error of our proposed framework is upper-bounded by .
• Proof for the first part
We write and to denote the sets of original unlabeled feature variables at the analyst’s disposal and unrevealed label variables provided by an oracle. We write and to denote the sets of the labeled feature and the corresponding label variables after including the i-th to construct the statistic in equation 3. We use and to denote their complements that comprise unlabeled feature and unrevealed label variables. In particular, we use and to denote the feature and label variable sets used to initialize in the first place; and are their complements that comprise unlabeled feature and unrevealed label variables. being true implies . In our setting, an analyst randomly samples features and labels them to build and , implying and when is true. In the following, we use induction to prove that and are independent .
Base case (): Under , we have and . The analyst first initializes with before starting the sequential testing. Subsequently, the analyst makes a query on a label based on the prediction of and includes the first variable pair to construct the test statistic. That immediately implies , and .
Induction step: Suppose and , the analyst updates to with and , makes a query on a label based on the prediction of and includes the (i+1)-th variable pair to update the statistic. That immediately implies , , and .
Combining the base step and the induction step leads to under .
• Proof for the second part
Suppose is a sequence of realizations of collected under and the proposed framework. We use to denote a class-one prior probability parameter, and hence is a likelihood function of . Maximizing over the prior parameter leads to the solution . In other words, is a maximized likelihood obtained from , where . We use to denote the true prior-one probability under , and plugging to leads to the true likelihood for under . It is easy to see thus for any realization of under . As a result, we have .
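The key fact used in this step, that the likelihood maximized over the class prior is never smaller than the likelihood under the true prior, can be illustrated numerically. The sketch assumes labels are modeled as i.i.d. Bernoulli draws, for which the maximizing prior is the empirical fraction of ones; the function names and the value `true_prior = 0.3` are hypothetical.

```python
import math
import random


def log_likelihood(labels, prior_one):
    """Log-likelihood of binary labels under an i.i.d. Bernoulli(prior_one) model."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    total = 0.0
    for count, prob in ((n1, prior_one), (n0, 1.0 - prior_one)):
        if count:  # skip empty classes so that 0 * log(0) is treated as 0
            total += count * math.log(prob) if prob > 0 else -math.inf
    return total


def max_log_likelihood(labels):
    """Maximize over the prior; the maximizer is the empirical fraction of ones."""
    p_hat = sum(labels) / len(labels)
    return p_hat, log_likelihood(labels, p_hat)


rng = random.Random(0)
true_prior = 0.3  # hypothetical class-one prior
labels = [1 if rng.random() < true_prior else 0 for _ in range(50)]
p_hat, ll_max = max_log_likelihood(labels)
ll_true = log_likelihood(labels, true_prior)
print(f"p_hat = {p_hat:.2f}, maximized = {ll_max:.3f}, true prior = {ll_true:.3f}")
```

Whatever labels are drawn, the maximized value is at least the value under the true prior, so a statistic that uses the maximized likelihood can be compared against one that uses the true prior, which is how the argument proceeds.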
Lastly, we prove . We let . Therefore, with for . The sequence is a non-negative martingale under given
(23) (24) (25) (26)
Using Ville’s maximal inequality in Theorem A.2 leads to the following: for any , we have
(27) (28) (29)
Therefore, we have .
∎
Appendix B Proof of Theorem 5.4
Proof.
In the following, we formulate an optimization problem that seeks an arbitrary marginal distribution to maximize the mutual information (MI) between and , where . Solving this optimization problem leads to a consistent bimodal query (see Definition 5.2), asymptotically minimizing the test statistic in equation 2.
• Constructing an optimization problem that maximizes MI
We write to denote an arbitrary probability distribution of . Recall and that indicate the class probability given and a marginal probability distribution of for the two-sample testing problem at hand; we write and to denote the joint probability distribution and the class prior for a new two-sample testing problem with the original replaced by . The mutual information (MI) that characterizes the new two-sample testing problem is as follows:
(30)
We expand equation 30 and consider the following optimization problem,
(31)
In other words, equation 31 seeks an to maximize the MI of a new two-sample testing problem with provided by the original two-sample testing problem. In what follows, we will see that solving equation 31 leads to a probability distribution from which a consistent bimodal query (see Definition 5.2) results, proving the asymptotic property in Theorem 5.4. Instead of directly solving equation 31, we fix , and resort to finding the solution of the following,
(32) s.t. (33) (34) (35)
Then, we approximate equation 32 with a discrete version of the same by partitioning the sample space into balls ; in addition, . Each has a radius centering at leading to an approximation , and a probability mass function . Hence, we approximate equation 32 by the following linear programming (LP):
(36) s.t. (37) (38) (39)
where indicates constant coefficients in the LP in equation 36.
• Solving the optimization problem
The constraints in equation 37 and equation 38 construct a region of feasible solutions to the considered LP in equation 36; we write this region . In addition, we need one more definition, of a kind of solution to a system of linear equations that is well-known in linear algebra.
Definition B.1.
(Basic solutions) Let be a system of linear equations. Let be positive and other entries be zero in . Then, if the corresponding columns are linearly independent, then is a basic solution to the system.
Moreover, we will need to apply the following Theorems to derive the optimal feasible solution for the LP.
Theorem B.2.
If the feasible region of an LP is bounded, then at least one optimal solution occurs at a vertex of the corresponding polytope (or the feasible region).
Theorem B.3.
Let be the feasible region of a linear program. Then, is a basic feasible solution if and only if is a vertex of .
Theorem B.2 and Theorem B.3 are well-known in LP; we refer interested readers to (Miller, 2007) for their proofs. Theorems B.2 and B.3 suggest that one optimal solution of equation 36 is a vector with at most two non-zero entries. Herein, we write and to denote the two non-zero entries. That reduces the LP in equation 36 to the following:
(40) s.t. (41) (42) (43)
For the sake of simplifying the expressions in what follows, we write
(44) (45) (46) (47)
Then, equation 40 is re-expressed by the following,
(48) s.t. (49) (50) (51)
Equation 48 is an optimization problem that finds to maximize the objective function. Herein, we write
(52) (53) (54) (55)
Now, we analyze the derivatives of equation 48 by checking the partial derivatives of , , and with respect to and :
(56) (57) (58) (59) (60) (61)
Therefore, equation 48 is a function that monotonically increases with increasing and decreasing , implying that the optimal solution to equation 36 has the following probability mass function ,
(62) (63) (64)
Recall that the LP in equation 36 approximates the continuous optimization problem in equation 32 by partitioning the sample space to . Hence, by shrinking the radius infinitely close to zero, we get the optimal solution of equation 32 as follows,
(65) (66)
Varying leads to the optimal solution with the same form that and , but different ratio . Furthermore, there could exist a set with identical , and so does for the case of . Hence, the optimal solution to the original optimization problem in equation 31 has the following form
(67) (68) (69)
∎
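The vertex argument used in this proof can be illustrated on a toy discretization. The sketch rests on assumptions that should be read as such: the LP is taken to have two equality constraints (the masses sum to one and the class prior is fixed), and its coefficients are taken to be negative binary entropies of the class probabilities, which is what maximizing mutual information at a fixed class prior reduces to; the exact coefficients of equation 36 are not legible in this text. The names `best_vertex` and `eta` and the five example class probabilities are hypothetical. Instead of calling an LP solver, the code enumerates the basic feasible solutions, each supported on at most two cells.

```python
import itertools
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def best_vertex(eta, prior_one):
    """Maximize sum_j -H(eta[j]) * q[j] subject to
         sum_j q[j] = 1,  sum_j eta[j] * q[j] = prior_one,  q >= 0,
    by enumerating basic feasible solutions (at most two non-zero entries).
    Returns (objective value, {cell index: mass}) or None if infeasible."""
    best = None
    for j, k in itertools.combinations(range(len(eta)), 2):
        if eta[j] == eta[k]:
            continue  # the two columns are linearly dependent: not a basis
        # Solve q_j + q_k = 1 and eta_j * q_j + eta_k * q_k = prior_one.
        q_k = (prior_one - eta[j]) / (eta[k] - eta[j])
        q_j = 1.0 - q_k
        if q_j < 0 or q_k < 0:
            continue  # violates non-negativity
        value = -(binary_entropy(eta[j]) * q_j + binary_entropy(eta[k]) * q_k)
        if best is None or value > best[0]:
            best = (value, {j: q_j, k: q_k})
    return best


# Hypothetical class-one probabilities at the centers of five cells.
eta = [0.1, 0.3, 0.5, 0.7, 0.9]
value, mass = best_vertex(eta, prior_one=0.5)
print(sorted(mass), round(value, 4))
```

In this toy instance the best vertex places its mass on the cells with the lowest and highest class-one probability, which mirrors the bimodal form of the optimal solution described in the proof.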
Appendix C Proof of Theorem 5.10
Proof.
Testing power of the baseline case: Since the baseline case randomly samples features from and queries their labels, the resulting variable pair collected by the analyst admits , in which is the joint distribution that characterizes the original two-sample testing problem. In addition, is initialized and stable, and the class-prior is provided in the case study. Given the label budget and the significance level , we have the following inequalities for the testing power in the case study:
(70) |
The inequality in equation 70 is derived from sequentially comparing with leading to a higher testing power than only comparing with at . We subsequently convert RHS of equation 70 as follows,
(71) |
Since is an i.i.d. sequence, we skip in and analyze and for in the following,
(72) | ||||
(73) | ||||
(74) |
(75) | ||||
(76) | ||||
(77) |
The inequalities in equation 74 and equation 77 are results of the following facts: and over the partition .
It is observed that, in equation 71, is a sample mean of , hence we use the central limit theorem to approximate the distribution of leading to the following,
(78) | ||||
(79) |
Testing power of the proposed framework in the case study: The analyst selects a region from a partition , in which is predicted to have highest ; then the analyst conducts the sequential testing with i.i.d. generated from . We first quantify . Recall that the approximated MI used to find is provided in equation 16 in the case study; given Assumption 5.9, the discrepancy between true and approximate MI for any is as follows
(80) |
Furthermore, given over the partition , we evaluate the upper bound of equation 80 for any in the following,
(81) | ||||
(82) | ||||
(83) | ||||
(84) |
Similarly, we evaluate the lower bound of equation 80 for any in the following,
(85) | ||||
(86) | ||||
(87) | ||||
(88) |
Assumption 5.8 suggests that the maximum MI over is . Combining equation 84 and equation 88, we get the lower bound of as follows,
(89) |
The analyst conducts the sequential testing in the selected with sample features randomly sampled from and labeled, leading to the following testing power lower bound
(90) |
The quantification of the RHS in equation 90 is identical to the one in the baseline case, except the sample space is constrained to . Hence, we skip the derivation process and obtain the following result,
(91) | ||||
(92) |
∎