License: arXiv.org perpetual non-exclusive license
arXiv:2302.02237v2 [cs.LG] 29 Feb 2024

 

Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection


 


Yu** Han                        Mingwenchan Xu                        Leying Guan

Department of Computer Science, The University of Hong Kong Hong Kong, China                        Department of IEMS Northwestern University Illinois, USA                        Department of Biostatistics Yale University New Haven, USA

Abstract

The Random Forests classifier, a widely utilized off-the-shelf classification tool, assumes, like other standard classifiers, that training and test samples come from the same distribution. However, in safety-critical scenarios like medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples not appearing during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and flag unseen outliers by generating an empty set. Theoretically, we establish that CSForest covers true labels for previously observed inlier classes under arbitrary label shift in the test data. We compare CSForest with state-of-the-art methods using synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest’s effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Code for CSForest is available at https://github.com/yu**han98/CSForest.

†Equal contribution. This work was done at Yale University. Correspondence to: [email protected].

1 INTRODUCTION

A classifier typically generates predictions for a test sample by choosing the class label associated with the highest predicted probability. This approach proves inadequate for addressing the increasing demand for assessing prediction reliability in practical scenarios, such as medical diagnosis (Esteva et al., 2017; Kompa et al., 2021) and autonomous vehicles (Kalra and Paddock, 2016; Qayyum et al., 2020). One approach for addressing this challenge involves minimizing a combined cost of misclassification and rejection, which allows the classifier to abstain on test samples exhibiting high uncertainty. For example, if the maximum estimated probability $\max_{k\in\{0,1\}}\hat{p}(k|x)$ for a binary response is low, where $\hat{p}(k|x)$ is the probability of being in class $k$ estimated from the training data, we might choose not to predict a test observation $x$ (Chow, 1970; Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008). This idea has been implemented across various learning algorithms and extended to multi-class classification problems (Cortes et al., 2016; Ni et al., 2019; Charoenphakdee et al., 2021).
Set-valued prediction via conformal prediction provides an alternative framework (Vovk et al., 2005; Papadopoulos et al., 2002; Lei and Wasserman, 2015; Gammerman et al., 2013), in which the classifier generates a set covering the plausible labels for a given observation $x$ based on a conformal score function $s(x,k)$ that measures the plausibility of the label of $x$ being $k$, e.g., $s(x,k)\leftarrow\hat{p}(k|x)$. For instance, one can form the calibrated prediction set $C(x)=\{k:s(x,k)\geq\tau_{k}\}$, with $\tau_{k}$ being class-dependent and calculated to ensure a desired coverage of the true label (Vovk et al., 2005).

Traditionally, classification uncertainty quantification assumes that the training and test samples are independently and identically distributed (i.i.d.) from the same distribution. In reality, this assumption does not always hold. For instance, in medical applications, the test cohort may include samples representing novel pathologies that bear little similarity to the labeled training set (Lin et al., 2005). Similarly, network attackers may create new intrusions to evade existing detection systems (Marchette and Marchette, 2001). Therefore, it becomes essential to assess uncertainty under distributional changes and flag test samples where predictions should not rely solely on the model trained with the training data.

To address this challenge, we introduce CSForest (Conformalized Semi-Supervised Random Forest), an ensemble tree classifier that leverages both labeled training data and unlabeled test data to form calibrated set-valued predictions and flag outliers. The term “semi-supervision” here refers to the utilization of unlabeled test data. CSForest builds upon recent work on a test-data optimized calibrated classification framework (Guan and Tibshirani, 2022). Guan and Tibshirani (2022) construct a calibrated semi-supervised set-valued prediction via sample-splitting, where one subset of samples is used for training the model while the remaining part is used for calibration. In contrast, CSForest avoids the sample-splitting scheme, constructing the random forest tree ensembles and calibrating the prediction using all samples. We summarize our contributions into three main aspects:

  1. We present a novel classifier, CSForest, designed for classification with calibrated uncertainty quantification in the presence of distributional shifts between training and test datasets. It employs a novel semi-supervised random forest structure that differentiates between observed training classes and unlabeled test data, and adapts the conformalization technique Jackknife+aB (Kim et al., 2020) to handle the joint and asymmetric utilization of both training and test samples.

  2. We provide a theoretical guarantee for true label coverage using $C(x)$ constructed by CSForest, under arbitrarily shifted test distributions. This theoretically ensures the effectiveness of CSForest under varying degrees of data drift.

  3. We conduct extensive experiments on simulated and publicly available datasets under various label shift settings to demonstrate CSForest’s gain over existing state-of-the-art methods.

2 RELATED WORK

Distribution Shift. Regarding distributional changes, both covariate shift and label shift are commonly studied (Schölkopf et al., 2012). The former assumes the conditional density $p(y|x)$ to be fixed while $f(x)$, the marginal density of $x$, may change (Shimodaira, 2000; Bickel et al., 2009; Gretton et al., 2009; Csurka, 2017); the latter treats $f_{k}(x)$, the conditional density of $x$ given the label $y=k$, as fixed, while the prevalence of different labels can vary among the observed training classes (Storkey, 2009; Lipton et al., 2018).

Recently, Guan and Tibshirani (2022) proposed BCOPS, a test-data optimized calibrated classifier, and the Generalized Label Shift (GLS) model defined in eq. (1), which extends the label shift model to include unseen classes to handle outliers. Suppose that the training data is a mixture of $K$ different classes. For class $k$, its mixture proportion is $\pi_{k}$ and its feature density is $f_{k}(x)$, with $\pi_{k}$ satisfying $\sum_{k=1}^{K}\pi_{k}=1$. The generalized label shift model assumes a target distribution that allows both label shift among training classes and the appearance of outlier component(s), and requires only $f_{k}(x)$ to remain the same for each observed class:

$$\mu(x)=\sum_{k=1}^{K}\tilde{\pi}_{k}f_{k}(x)+\delta\cdot f_{R}(x), \tag{1}$$

where $\delta+\sum_{k=1}^{K}\tilde{\pi}_{k}=1$, $\tilde{\pi}_{k}\geq 0$ represents the proportion of samples from class $k$ in the target distribution, $\delta\geq 0$ represents the proportion of outlier samples not from the observed classes, and $f_{R}(x)$ represents the density for the outlier component. Under the GLS model, BCOPS utilizes both labeled training samples and unlabeled test samples to construct calibrated set-valued predictions. The crucial calibration step of BCOPS relies on sample-splitting, which results in low data-utilization efficiency, especially when training or test samples are limited.
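As a small illustration of eq. (1), the sketch below draws a test cohort from a GLS mixture. The one-dimensional unit-variance Gaussian choices for $f_{k}$ and $f_{R}$, the function name `sample_gls`, and all numeric values are assumptions made for illustration only; they are not taken from the experiments in this paper.

```python
import random

def sample_gls(n, class_means, pi_tilde, delta, outlier_mean, seed=0):
    """Draw n points from mu(x) = sum_k pi_tilde[k] * f_k(x) + delta * f_R(x).

    Assumption for illustration: f_k = N(class_means[k], 1), f_R = N(outlier_mean, 1).
    Returns (x, label) pairs; classes are labeled 1..K and outliers "R".
    """
    assert abs(sum(pi_tilde) + delta - 1.0) < 1e-9, "proportions must sum to 1"
    rng = random.Random(seed)
    labels = list(range(1, len(class_means) + 1)) + ["R"]
    means = list(class_means) + [outlier_mean]
    weights = list(pi_tilde) + [delta]
    draws = []
    for _ in range(n):
        # Pick a mixture component, then draw a feature from its density.
        j = rng.choices(range(len(labels)), weights=weights)[0]
        draws.append((rng.gauss(means[j], 1.0), labels[j]))
    return draws

# Hypothetical test cohort: two observed classes plus 30% outliers.
test_draws = sample_gls(2000, class_means=[0.0, 3.0], pi_tilde=[0.2, 0.5],
                        delta=0.3, outlier_mean=8.0)
outlier_share = sum(1 for _, lab in test_draws if lab == "R") / len(test_draws)
```

In practice the labels of the test cohort are unobserved; they are returned here only so that the realized outlier share can be compared with $\delta$.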

Conformal Prediction. Conformal prediction (also known as conformal inference) (Vovk et al., 2005; Papadopoulos et al., 2002; Lei and Wasserman, 2015) creates statistically rigorous uncertainty sets/intervals for the predictions of classical machine learning models, aiming to cover the true label with a desired probability in the non-asymptotic regime, without model assumptions on how $y$ depends on $x$. Consider $(x_{i},y_{i})_{i=1}^{n}$ as $n$ (feature, label) pairs, and a new sample $(x_{n+1},y_{n+1})$ where $y_{n+1}$ is unobserved. Based on the previous $n$ observations, conformal prediction creates a prediction set $\hat{C}_{n}(x_{n+1})$ for the new instance $x_{n+1}$ and ensures that $\mathbb{P}(y_{n+1}\in\hat{C}_{n}(x_{n+1}))\geq 1-\alpha$, where $\alpha\in(0,1)$ is the allowed miscoverage level.
For example, in the classification setting, if $\alpha=0.1$, then the probability that $\hat{C}_{n}(x_{n+1})$ contains the true label $y_{n+1}$ is no smaller than 90%.

A key step in forming the conformal prediction is the choice of the conformal score function $s(x,y)$, which evaluates how plausible it is to observe a given $(x_{i},y_{i})$, followed by a calibration of the probability of observing $(x_{n+1},y_{n+1})$ by comparing its score to those of the labeled training samples. As some examples for classification problems, Romano et al. (2020) consider $s(x,k)$ as the (estimated) probability of observing labels with estimated conditional probability $p(\cdot|x)$ no worse than that of class $k$, and $\hat{C}_{n}(x_{n+1})=\{k:s(x_{n+1},k)\geq\tau\}$, where $\tau$ is a threshold determined by the empirical distribution of $\{s(x_{i},y_{i})\}_{i=1}^{n}$.
Vovk et al. (2005) and Lei (2014) consider the conformal score $s(x,k)$ as the conditional probability of having label $k$ or the density of $x$ in class $k$, and construct $\hat{C}_{n}(x_{n+1})=\{k:s(x_{n+1},k)\geq\tau_{k}\}$ using a class-specific threshold $\tau_{k}$, where $\tau_{k}$ is determined by the empirical distribution of $\{s(x_{i},k)\}_{i:y_{i}=k}$.
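The class-specific thresholding described above can be sketched as follows. This is a minimal illustration under stated assumptions: scores are "larger is more plausible", the calibration scores come from held-out labeled samples, and $\tau_{k}$ is taken as the $\lfloor\alpha(n_{k}+1)\rfloor$-th smallest calibration score of class $k$, a common split-conformal choice. The function names and the toy numbers are hypothetical.

```python
import math

def class_thresholds(cal_scores, cal_labels, alpha):
    """tau_k = the floor(alpha * (n_k + 1))-th smallest score s(x_i, k) among
    calibration samples with y_i = k (minus infinity if that rank is 0)."""
    by_class = {}
    for s, y in zip(cal_scores, cal_labels):
        by_class.setdefault(y, []).append(s)
    taus = {}
    for k, scores in by_class.items():
        scores.sort()
        rank = math.floor(alpha * (len(scores) + 1))
        taus[k] = scores[rank - 1] if rank >= 1 else -math.inf
    return taus

def prediction_set(score_by_class, taus):
    """C(x) = {k : s(x, k) >= tau_k}; the set may be empty."""
    return {k for k, s in score_by_class.items() if s >= taus[k]}

# Toy calibration data: nine scores per class for classes "a" and "b".
cal_scores = [i / 10 for i in range(1, 10)] * 2
cal_labels = ["a"] * 9 + ["b"] * 9
taus = class_thresholds(cal_scores, cal_labels, alpha=0.2)

covered = prediction_set({"a": 0.25, "b": 0.15}, taus)
rejected = prediction_set({"a": 0.05, "b": 0.10}, taus)
```

Because each class has its own threshold, a point whose scores fall below every $\tau_{k}$ receives the empty set, which is the mechanism used later to flag outliers.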

Different conformal prediction schemes have been developed in the literature. Early work typically adopted split conformal prediction, in which the conformal score function $s(\cdot)$ is trained on one fold of the data and the threshold $\tau$ is calibrated on the remainder (Vovk et al., 2005). In recent years, significant progress has been made in cross-conformal prediction to improve data utilization efficiency. Vovk et al. (2018) proposed splitting data into multiple folds, calculating scores for each fold using score functions learned from the remaining data, and aggregating all scores for calibration. Barber et al. (2021) developed Jackknife+ for regression problems, which combines Jackknife with conformal prediction and constructs the prediction interval as

$$\hat{C}(x)=\left\{y:\frac{1}{n+1}\left(1+\sum_{i=1}^{n}\mathbbm{1}_{\hat{s}^{i}(x,y)\geq\hat{s}^{i}(x_{i},y_{i})}\right)\geq\alpha\right\},$$

where $\hat{s}^{i}(x,y)=|\hat{m}^{i}(x)-y|$ is the conformal score function using the mean-prediction function $\hat{m}^{i}(x)$ learned from training samples excluding $(x_{i},y_{i})$. Although Jackknife+ can only provide a worst-case coverage guarantee at level $(1-2\alpha)$, the achieved empirical coverage is often well-calibrated. Kim et al. (2020) described Jackknife+aB to mitigate the computational burden of Jackknife+, which ensembles and calibrates predictions using repeated Bootstraps rather than retraining the model after excluding each training sample.
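To make the Bootstrap-based leave-one-out idea concrete, the sketch below fits one simple model per Bootstrap resample and, for each training sample, aggregates only the models whose resample did not contain it. It is a minimal illustration, not the procedure of Kim et al. (2020) in full: the base learner (a least-squares line), the mean aggregation, the toy data, and the use of the usual quantile form of the Jackknife+ interval endpoints are all assumptions made here.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares line on one resample; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def jackknife_plus_ab(xs, ys, x_new, alpha=0.1, n_boot=100, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    models, bags = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one Bootstrap resample
        models.append(fit_line([xs[j] for j in idx], [ys[j] for j in idx]))
        bags.append(set(idx))
    lower, upper = [], []
    for i in range(n):
        # Ensemble only the models that never saw training sample i.
        oob = [m for m, bag in zip(models, bags) if i not in bag]
        if not oob:
            raise ValueError("sample %d is in every resample; increase n_boot" % i)
        pred_i = sum(m(xs[i]) for m in oob) / len(oob)
        pred_new = sum(m(x_new) for m in oob) / len(oob)
        r_i = abs(ys[i] - pred_i)  # leave-one-out absolute residual
        lower.append(pred_new - r_i)
        upper.append(pred_new + r_i)
    lower.sort()
    upper.sort()
    lo_rank = math.floor(alpha * (n + 1))
    hi_rank = math.ceil((1 - alpha) * (n + 1))
    lo = lower[lo_rank - 1] if lo_rank >= 1 else -math.inf
    hi = upper[hi_rank - 1] if hi_rank <= n else math.inf
    return lo, hi

# Toy regression data: y = 2x + Gaussian noise.
data_rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + data_rng.gauss(0, 0.5) for x in xs]
lo, hi = jackknife_plus_ab(xs, ys, x_new=2.5)
```

The key saving is that only `n_boot` models are fitted, rather than one model per left-out sample; each left-out predictor is assembled from models that happen to exclude that sample.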

Figure 1: Overview of CSForest. For class $k$, let $\mathcal{I}^{b}_{k}$, $\mathcal{I}^{b}_{te}$ and $\tilde{\mathcal{I}}_{other}$ be Bootstrap samples from training class $k$, the test samples, and the training samples other than class $k$, respectively. We train a multi-class tree classifier with random feature selection, as in the random forest, using the Bootstrapped samples, where we maintain the labels of all training samples and treat the test set as its own class. The resulting $B$ random forest tree classifiers, $\{\hat{G}^{1}(x),\ldots,\hat{G}^{B}(x)\}$, are used to separate different labeled classes and the test samples.
For the sample pair $x_{i}\in\mathcal{I}_{te}$ and $x_{i'}\in\mathcal{I}_{k}$, we aggregate trees that do not use $x_{i}$ and $x_{i'}$ (i.e., the trees indexed by $\mathcal{B}_{ii'}=\{b:i\notin\mathcal{I}_{te}^{b},i'\notin\mathcal{I}^{b}_{k}\}$) to form an ensemble classifier, and subsequently, an ensemble conformal score function $\hat{s}^{ii'}(x,k;\mu)$.
Finally, we use the score function $\hat{s}^{ii'}(x,k;\mu)$ and compare $\hat{s}^{ii'}(x_{i},k;\mu)$ to $\hat{s}^{ii'}(x_{i'},k;\mu)$ for all $i'\in\mathcal{I}_{k}$ to form the calibrated evaluation $\hat{s}_{ik}$ for test sample $x_{i}$ being in class $k$, and include $k$ in the prediction set $\hat{C}(x_{i})$ if $\hat{s}_{ik}$ is no smaller than $\alpha$.

3 CONFORMALIZED SEMI-SUPERVISED RANDOM FOREST

Despite the popularity of random forest and its variants, existing work implicitly assumes that training and test samples originate from the same distribution. This reliance makes them unreliable in the presence of distributional changes, which can be particularly problematic in safety-critical applications. Classification uncertainty quantification in this setting is also challenging. To address this issue, we introduce CSForest (Conformalized Semi-Supervised Random Forest), a tree-ensemble classifier that produces set-valued predictions designed to include true labels while minimizing the inclusion of false labels, customized to match a target distribution $\mu(x)$:

$$\begin{aligned}
&\min\int_{x}|C(x)|\,\mu(x)\,dx,\\
&\text{s.t.}\;\mathbb{P}[k\in C(X)|Y=k]\geq 1-\alpha,\ \text{for all } k=1,\ldots,K.
\end{aligned} \tag{2}$$

Specifically, CSForest optimizes for a target distribution that is a mixture of the training density $f_{tr}(x)$ and the test feature density $f_{te}(x)$, setting $\mu(x)=f_{te}(x)+wf_{tr}(x)$, where $w\geq 0$. If $w=0$, $\mu(x)=f_{te}(x)$ and the objective of CSForest coincides with the objective of BCOPS, which optimizes for the expected test cohort classification accuracy. On the other hand, when $w$ is large, it has a similar objective to the CRF model and optimizes classification performance on samples generated the same way as the training cohort. When $w$ is not excessively large, so that $f_{te}(x)$ is a significant contributor to $\mu(x)$, the constrained optimization objective in eq. (2) encourages $C(x)=\emptyset$ for unseen outliers even though we do not explicitly model outliers. In other words, if $x$ is unlikely to belong to any of the observed training classes, we prefer to classify it as an outlier.

The following Proposition 3.1 further provides the oracle solution to eq. (2) under the GLS model:

Proposition 3.1.

Set the conformal score function as $s(x,k;\mu)=[f_{k}(x)/\mu(x)]$. Under the GLS model, the solution to eq. (2) is $C(x)=\{k:\mathbb{E}_{X}[\mathbbm{1}\{s(x,k;\mu)\geq s(X,k;\mu)\}|Y=k]\geq\alpha,\ k=1,\ldots,K\}$.
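As an illustration of the oracle set in Proposition 3.1, the sketch below evaluates it in a toy one-dimensional GLS setting where all densities are known. Every concrete choice here is an assumption for illustration: unit-variance Gaussian densities, the class and outlier means, the training and test proportions, $w=1$, and the Monte Carlo approximation of the conditional expectation.

```python
import math
import random

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

# Hypothetical toy setting: two observed classes and one outlier component.
MEANS = {1: 0.0, 2: 3.0}        # f_k = N(MEANS[k], 1)
OUTLIER_MEAN = 8.0              # f_R = N(8, 1)
PI_TRAIN = {1: 0.5, 2: 0.5}     # training proportions pi_k
PI_TEST = {1: 0.2, 2: 0.5}      # test proportions pi~_k
DELTA = 0.3                     # outlier proportion in the test cohort
W = 1.0                         # weight on the training density

def mu(x):
    """Target density mu(x) = f_te(x) + w * f_tr(x)."""
    f_tr = sum(PI_TRAIN[k] * normal_pdf(x, MEANS[k]) for k in MEANS)
    f_te = sum(PI_TEST[k] * normal_pdf(x, MEANS[k]) for k in MEANS)
    f_te += DELTA * normal_pdf(x, OUTLIER_MEAN)
    return f_te + W * f_tr

def score(x, k):
    """Oracle conformal score s(x, k; mu) = f_k(x) / mu(x)."""
    return normal_pdf(x, MEANS[k]) / mu(x)

def oracle_set(x, alpha=0.05, n_draws=20000, seed=0):
    """Keep k when P_{X ~ f_k}( s(x, k) >= s(X, k) ) >= alpha (Monte Carlo)."""
    rng = random.Random(seed)
    chosen = set()
    for k in MEANS:
        s_x = score(x, k)
        ref = [score(rng.gauss(MEANS[k], 1.0), k) for _ in range(n_draws)]
        frac = sum(s_x >= s for s in ref) / n_draws
        if frac >= alpha:
            chosen.add(k)
    return chosen

sets = {x: oracle_set(x) for x in (0.0, 1.5, 8.0)}
```

In this toy setting, a point near the outlier mean has a very small score for both observed classes, so neither class passes the $\alpha$ cutoff and the point is flagged through an empty set, while a point between the two class means can retain both labels.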

Proposition 3.1 rephrases Proposition 2 from Guan and Tibshirani (2022), and its proof is provided in Appendix A for completeness. CSForest estimates $C(x)$ in Proposition 3.1 using a semi-supervised random forest that utilizes labeled training samples and the unlabeled test cohort, coupled with the Jackknife+aB strategy for correct calibration to ensure the coverage guarantee. Specifically, for a test sample $x_{i}$ and a training sample $x_{i'}$ from class $k$ with size $n_{k}$, CSForest constructs a conformal score function $\hat{s}^{ii'}(x,k;\mu)$ trained without $x_{i}$ and $(x_{i'},y_{i'})$ to measure how likely a sample is from class $k$. More specifically, when comparing $x_{i}$ and $x_{i'}$, CSForest replaces the oracle score $s(x,k;\mu)$ in Proposition 3.1 with $\hat{s}^{ii'}(x,k;\mu)$, and replaces the expectation $\mathbb{E}_{X}[\cdot]$ with an empirical version using the corresponding $\hat{s}^{ii'}(x,k;\mu)$. In other words, letting $n_{k}$ be the training sample size for class $k$, the inclusion criterion in Proposition 3.1 is replaced by its empirical version below:

$$\hat{s}_{ik}=\frac{1+\sum_{i':\,y_{i'}=k}\mathbbm{1}\{\hat{s}^{ii'}(x_{i},k;\mu)\geq\hat{s}^{ii'}(x_{i'},k;\mu)\}}{n_{k}+1}. \tag{3}$$

The estimated prediction set $\hat{C}(x_{i})$ is:

$$\hat{C}(x_{i})=\{k:\hat{s}_{ik}\geq\alpha\}. \tag{4}$$
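The calibration in eqs. (3) and (4) amounts to a simple counting step once the pairwise scores are available. The sketch below assumes that, for a test sample $x_{i}$ and each training sample $x_{i'}$ of class $k$, the two scores $\hat{s}^{ii'}(x_{i},k;\mu)$ and $\hat{s}^{ii'}(x_{i'},k;\mu)$ have already been computed; the function names and the numbers are hypothetical.

```python
def calibrated_score(pair_scores):
    """Eq. (3): pair_scores holds one (score of x_i, score of x_i') tuple per
    training sample i' in class k, both evaluated with the same s^{ii'}."""
    n_k = len(pair_scores)
    wins = sum(1 for s_test, s_train in pair_scores if s_test >= s_train)
    return (1 + wins) / (n_k + 1)

def empirical_prediction_set(pairs_by_class, alpha):
    """Eq. (4): keep class k when its calibrated score is at least alpha."""
    return {k for k, pairs in pairs_by_class.items()
            if calibrated_score(pairs) >= alpha}

# Hypothetical scores for one test sample.
pairs_by_class = {
    # Class 1: the test score 0.6 beats two of the four training scores.
    1: [(0.6, 0.2), (0.6, 0.5), (0.6, 0.7), (0.6, 0.9)],
    # Class 2: the test score 0.01 is below all nineteen training scores.
    2: [(0.01, 0.3 + 0.01 * j) for j in range(19)],
}
s_i1 = calibrated_score(pairs_by_class[1])
s_i2 = calibrated_score(pairs_by_class[2])
c_hat = empirical_prediction_set(pairs_by_class, alpha=0.1)
```

Note that each pair may use a different fitted score function, since $\hat{s}^{ii'}$ depends on which trees exclude both $x_{i}$ and $x_{i'}$; the counting step itself is unchanged.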

Given a user-specified weight $w$, Figure 1 presents a graphical illustration of the model structure and Algorithm 1 delineates the ensemble tree construction and prediction calibration for CSForest. In Algorithm 1, we use $\mathcal{I}_{tr}$ and $\mathcal{I}_{te}$ to denote the training and test sets, respectively, and $\mathcal{I}_{k}$ to denote samples from training class $k$. For each class $k$, lines 2-5 construct $B$ Bootstrapped random forest tree classifiers to separate the training classes and the test samples. A random forest tree refers to a tree in which each split is chosen as the best split among $L$ randomly selected candidate features, as in the random forest, with $L=\lfloor\sqrt{p}\rfloor$, where $p$ is the number of features, being the default number of candidate features in the R ranger package. Line 6 constructs the conformal score function $\hat{s}^{ii'}(x,k;\mu)$, using only the trees that exclude the test sample $x_{i}$ and the training sample $x_{i'}$. In short, Algorithm 1 can be seen as estimating the oracle conformal score $s(x,k;\mu)$ under the target distribution, with $w$ being not excessively large, by utilizing trees from a weighted random forest classifier.

Figure 2: Panel A shows the first two dimensions of samples generated from the three classes, with green/blue/red points representing samples from class 1/2/R. Panel B shows the coverage rate, defined as the proportion of samples whose true labels are included in their prediction sets. The horizontal dashed line marks the 95% coverage level. Panel B is grouped by the actual labels in the test data and colored according to whether a prediction set contains only the correct label (blue) or more than the correct label (gray).

The estimated conformal score functions are then used to construct the calibrated score, $\hat{s}_{ik}$, and the prediction set, $\hat{C}(x)$, in lines 8-13. Note that, in line 3, to prevent redundant resampling in the Bootstrap, we constrain $\tilde{\mathcal{I}}_{other}$ to be a Bootstrap sample of size $\min(\lceil n_{te}w\rceil, n-n_k)$ drawn from the training samples excluding class $k$. In addition, the probability $\Pr(\mathcal{B}_{ii'}=\emptyset)$ decreases rapidly as $B$ increases.

Input: Training data $\{(x_i,y_i), i\in\mathcal{I}_{tr}\}$, test data $\{x_i, i\in\mathcal{I}_{te}\}$, $\tilde{B}$, and $w$ (1 by default).
Output: Prediction sets $\hat{C}_i(x_i)$ for $i\in\mathcal{I}_{te}$.
1 for $k=1,\ldots,K$ do
2       Sample $B$ from $\mathrm{Binomial}\left(\tilde{B};\left(1-\frac{1}{n_k+1}\right)^{n_k}\right)$.
3       for $b=1,\ldots,B$ do
4             Let $\mathcal{I}_k^b$, $\mathcal{I}_{te}^b$ be the Bootstraps of $\mathcal{I}_k$ (index of training class $k$) and $\mathcal{I}_{te}$. Let $\tilde{\mathcal{I}}_{other}$ be the Bootstrap of size $\min(\lceil n_{te}w\rceil, n-n_k)$ from the remaining training sample indices $\mathcal{I}\setminus\mathcal{I}_k$. Grow a single random forest tree classifier $\hat{G}^b(x)$ separating different labeled classes and the test samples using $\mathcal{I}_k^b\cup\mathcal{I}_{te}^b\cup\tilde{\mathcal{I}}_{other}$.
5       end for
6       For sample pair $i\in\mathcal{I}_{te}$, $i'\in\mathcal{I}_k$, set $\mathcal{B}_{ii'}=\{b: i\notin\mathcal{I}_{te}^b,\ i'\notin\mathcal{I}_k^b\}$ and construct the conformal score function $\hat{s}^{ii'}(x,k;\mu)=\left(\sum_{b\in\mathcal{B}_{ii'}}\hat{G}^b_k(x)\right)/|\mathcal{B}_{ii'}|$.
7 end for
8 for $i\in\mathcal{I}_{te}$ do
9       for $k=1,\ldots,K$ do
10            Construct $\hat{s}_{ik}$ for sample $i$ and class $k$ via eq. (3).
11      end for
12      Construct $\hat{C}(x_i)=\{k:\hat{s}_{ik}\geq\alpha\}$.
13 end for
Algorithm 1 CSForest
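The resampling and out-of-bag bookkeeping of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`draw_bootstraps`, `out_of_bag_trees`, `conformal_score`) are hypothetical, the number of trees is fixed rather than drawn from the Binomial distribution, and no trees are actually grown, so the per-tree outputs $\hat{G}^b_k(x)$ are replaced by placeholder random numbers.

```python
import math
import random


def draw_bootstraps(idx_k, idx_te, idx_other, w, n_trees, rng):
    """Draw, for each tree, the Bootstrap index samples for class k, the test set, and the other classes."""
    # Size of the "other classes" sample: min(ceil(n_te * w), n - n_k).
    size_other = min(math.ceil(len(idx_te) * w), len(idx_other))
    draws = []
    for _ in range(n_trees):
        boot_k = set(rng.choices(idx_k, k=len(idx_k)))
        boot_te = set(rng.choices(idx_te, k=len(idx_te)))
        boot_other = rng.choices(idx_other, k=size_other)
        draws.append((boot_k, boot_te, boot_other))
    return draws


def out_of_bag_trees(draws, i, i_prime):
    """Indices b of trees whose Bootstraps contain neither test sample i nor training sample i'."""
    return [b for b, (boot_k, boot_te, _) in enumerate(draws)
            if i not in boot_te and i_prime not in boot_k]


def conformal_score(tree_scores, oob):
    """Average the per-tree class-k scores over the out-of-bag trees."""
    if not oob:
        return None  # undefined when no tree excludes both samples (see Remark 3.2)
    return sum(tree_scores[b] for b in oob) / len(oob)


rng = random.Random(0)
idx_k = [0, 1, 2, 3, 4]              # training samples of class k
idx_other = [5, 6, 7, 8, 9]          # remaining training samples
idx_te = [100, 101, 102, 103, 104]   # unlabeled test samples
draws = draw_bootstraps(idx_k, idx_te, idx_other, w=1.0, n_trees=200, rng=rng)
tree_scores = [rng.random() for _ in draws]  # placeholders for the tree outputs
oob = out_of_bag_trees(draws, i=100, i_prime=0)
score = conformal_score(tree_scores, oob)
```

In a real implementation, each entry of `tree_scores` would be the class-$k$ output of a tree grown on the corresponding Bootstrap sample, evaluated at the point of interest.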
Remark 3.2.

When $\mathcal{B}_{ii'}$ is empty, $\hat{s}^{ii'}(x,k;\mu)$ is not defined. In this case, we exclude the training sample $x_{i'}\in\mathcal{I}_k$ when calibrating for test sample $x_i$. Let $n_{te}$ be the size of $\mathcal{I}_{te}$. The probability that $\mathcal{B}_{ii'}=\emptyset$ is bounded by

$$\mathbb{P}[i'\in\mathcal{I}_k^b,\ \forall b\leq B]+\mathbb{P}[i\in\mathcal{I}_{te}^b,\ \forall b\leq B]=\left[1-\left(1-\frac{1}{n_k}\right)^{n_k}\right]^B+\left[1-\left(1-\frac{1}{n_{te}}\right)^{n_{te}}\right]^B\approx 2\left(1-\frac{1}{e}\right)^B$$

for decently large $n_{te}$ and $n_k$, which decreases fast with $B$.
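To give a feel for how quickly the bound in Remark 3.2 shrinks, the following sketch evaluates both the finite-sample expression and its limiting form $2(1-1/e)^B$. The function names and the sample sizes of 200 are illustrative assumptions.

```python
import math


def empty_oob_bound(n_k, n_te, n_trees):
    """Bound from Remark 3.2 on the probability that no tree excludes both samples."""
    p_train_in_bag = 1.0 - (1.0 - 1.0 / n_k) ** n_k
    p_test_in_bag = 1.0 - (1.0 - 1.0 / n_te) ** n_te
    return p_train_in_bag ** n_trees + p_test_in_bag ** n_trees


def limiting_bound(n_trees):
    """Large-sample approximation 2 * (1 - 1/e)^B of the same bound."""
    return 2.0 * (1.0 - 1.0 / math.e) ** n_trees


for n_trees in (10, 50, 100):
    print(n_trees, empty_oob_bound(200, 200, n_trees), limiting_bound(n_trees))
```

With 200 samples per set, the bound is already below $10^{-9}$ at $B=50$, so empty out-of-bag sets are a practical concern only for very small ensembles.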

Theorem 3.3 states that CSForest provides a worst-case coverage guarantee for the true response at the level $(1-2\alpha)$.

Theorem 3.3.

Suppose the generalized label shift model holds, where features from class $k$ are generated i.i.d. from a distribution $\mathcal{P}_k$. For any fixed integer $\tilde{B}\geq 1$, the prediction set $\hat{C}_i(x)$ constructed by CSForest satisfies:

$$\mathbb{P}\left[k\in\hat{C}(x_i)\mid y_i=k\right]\geq 1-2\alpha,\quad \text{for all } i\in\mathcal{I}_{te} \text{ and } k=1,\ldots,K. \tag{5}$$

The proof relies on the arguments used in Jackknife+aB for the supervised regression problem (Kim et al., 2020). However, new conditioning arguments are needed to establish exchangeability due to the paired sampling of both training and test samples. Please find the proof of Theorem 3.3 in Appendix A.

Theorem 3.3 ensures per-class coverage for observed training classes, which means we guarantee true label coverage for inlier classes even in arbitrarily shifted test distributions, e.g.,

$$\mathbb{P}\left[y\in C(x)\mid y\in\{1,\ldots,K\},\ (x,y)\sim P_{te}\right]\geq 1-2\alpha,$$

for any test distribution $P_{te}$ satisfying the generalized label shift model assumption. Although the theoretical guarantee for the worst-case coverage is at the level $(1-2\alpha)$, the empirical coverage using CSForest is usually close to or above the targeted level $(1-\alpha)$.
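The step from per-class coverage to coverage under an arbitrary inlier label distribution can be spelled out with the law of total probability. In the sketch below, $\pi_k$ is notation introduced here (not in the original text) for the proportion of class $k$ among test inliers, i.e. $\pi_k=\mathbb{P}_{te}[y=k\mid y\in\{1,\ldots,K\}]$ with $\sum_{k}\pi_k=1$.

```latex
\begin{aligned}
\mathbb{P}\left[y \in C(x) \mid y \in \{1,\ldots,K\}\right]
&= \sum_{k=1}^{K} \pi_k \, \mathbb{P}\left[k \in C(x) \mid y = k\right] \\
&\geq \sum_{k=1}^{K} \pi_k \,(1 - 2\alpha) \;=\; 1 - 2\alpha .
\end{aligned}
```

Because the bound holds for each class separately, it does not depend on the values of $\pi_k$, which is why the guarantee is insensitive to how the class proportions shift at test time.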

(A) Per-class quality evaluation with outliers but no additional label shift among inlier digits, where the outliers are defined as $R=\{6,7,8,9\}$.
(B) Per-class (classes 0-5) quality evaluation with additional label shift among inlier digits but no outliers.
Figure 3: Per-class quality evaluation on MNIST. Panels A and B are grouped by the true labels in the test data and colored according to whether a prediction set contains only the correct label (blue) or more than the correct label (gray). The horizontal dashed line marks the 95% coverage level.

4 EXPERIMENTS

4.1 Synthetic Data

We begin with a simple illustrative 2D synthetic dataset and compare the performance of CSForest with three closely related set-valued conformal classifiers: BCOPS, a conformalized semi-supervised classifier that uses half of the samples for training and the other half for calibration (see Section 2); DC (density set classifier) (Cadre, 2006; Lei, 2014; Hechtlinger et al., 2018; Sadinle et al., 2019); and CRF (conformalized random forest). DC and CRF follow the sample-splitting conformal prediction scheme, using as conformal score functions the kernel estimate of the per-class density of $x|y$ and the conditional probability of $y|x$ estimated via a random forest on the training samples, respectively.

Example 1.

Let $X\in\mathbb{R}^{10}$ be the feature vector. We observe two classes $Y\in\{1,2\}$ in the training data, but the test data contains outliers labeled with $Y=R$. We generate $X_j\sim N(0,1)$ $(j=3,\ldots,10)$ as noise, with different classes separated by the first two dimensions:

$$\left\{\begin{aligned} &X_1\sim N(0,1),\; X_2\sim N(0,1),\quad Y=1,\\ &X_1\sim N(3,0.5),\; X_2\sim N(0,1),\quad Y=2,\\ &X_1\sim N(0,1),\; X_2\sim N(3,1),\quad Y=R. \end{aligned}\right.$$

Figure 2(A) shows the first two dimensions of samples generated from the three classes $Y\in\{1,2,R\}$. We generated 200 samples from classes 1 and 2 to form the training set and 200 samples from each of the three classes to form the test set. In Figure 2(B), we evaluated the quality of the set-valued prediction $\hat{C}(x)$ using DC, CRF, BCOPS, and CSForest across 20 independent runs with a targeted miscoverage rate of $\alpha=0.05$. All four methods achieve the desired 95% $(1-\alpha)$ coverage on true labels. However, both CSForest and BCOPS adapted to the test cohort and outperformed CRF and DC significantly in outlier detection. Additionally, compared to BCOPS, CSForest had fewer samples with multiple labels from classes 1 and 2 and exhibited a higher rejection rate for outliers.
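The data-generating process of Example 1 can be reproduced with the standard library alone, as sketched below. Two readings are assumptions made here rather than facts stated in the text: the second argument of $N(3,0.5)$ is treated as a standard deviation (it could equally denote a variance), and "200 samples from classes 1 and 2" is read as 200 samples per class. The function name `sample_example1` is hypothetical.

```python
import random


def sample_example1(label, n, rng, sd_class2=0.5):
    """Draw n feature vectors in R^10 for label 1, 2, or "R" following Example 1."""
    means = {1: (0.0, 0.0), 2: (3.0, 0.0), "R": (0.0, 3.0)}
    m1, m2 = means[label]
    # Assumption: 0.5 is the standard deviation of X_1 for class 2.
    sd1 = sd_class2 if label == 2 else 1.0
    rows = []
    for _ in range(n):
        x = [rng.gauss(m1, sd1), rng.gauss(m2, 1.0)]
        x += [rng.gauss(0.0, 1.0) for _ in range(8)]  # noise features X_3, ..., X_10
        rows.append(x)
    return rows


rng = random.Random(0)
# Training set: inlier classes only (assumed 200 samples per class).
train = [(x, y) for y in (1, 2) for x in sample_example1(y, 200, rng)]
# Test set: both inlier classes plus the outlier class R.
test = [(x, y) for y in (1, 2, "R") for x in sample_example1(y, 200, rng)]
```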

4.2 Real-World Data

In this section, our primary objective is to evaluate the effectiveness of CSForest on various real-world datasets, focusing on addressing the following three questions:

  Q1. Can CSForest detect outliers efficiently while making accurate predictions for inliers in the presence of outliers but no additional label shift among inliers? (Denoted as outliers w/o shift.)

  Q2. Under the traditional label shift model without outliers, can CSForest achieve competitive performance compared to alternative classifiers? (Denoted as shift w/o outliers.)

  Q3. Does CSForest demonstrate stable performance as the training and test sample sizes vary?

Q1 and Q2 capture two extreme settings of the GLS model. We set $w$ at its default value, $w=1$, for all experiments in the main paper. In Appendix E, we include a sensitivity analysis of the crucial parameter $w$ and show that CSForest performs well for different values of $w$, with $w=1$ being a reasonable choice to balance performance for both inliers and outliers.

Table 1: Achieved type I and type II errors at $\alpha=0.05$ under different distributional shift settings. While most methods achieved desirable type I errors and true label coverage rates ($1-\alpha$), only CSForest consistently achieved lower type II errors in both settings.

| Dataset | Method | Type I Error (outliers w/o shift) | Type II Error (outliers w/o shift) | Type I Error (shift w/o outliers) | Type II Error (shift w/o outliers) |
|---|---|---|---|---|---|
| MNIST | CSForest | 0.049±0.006 | 0.091±0.008 | 0.048±0.016 | 0.291±0.038 |
| MNIST | BCOPS | 0.048±0.004 | 0.237±0.019 | 0.042±0.007 | 0.556±0.040 |
| MNIST | DC | 0.049±0.008 | 0.890±0.021 | 0.046±0.016 | 0.968±0.022 |
| MNIST | CRF | 0.048±0.007 | 0.338±0.035 | 0.046±0.018 | 0.428±0.082 |
| MNIST | ACRF | 0.046±0.006 | 0.430±0.003 | 0.025±0.011 | 0.884±0.012 |
| MNIST | ACRFshift | 0.046±0.006 | 0.432±0.009 | 0.055±0.013 | 0.828±0.015 |
| CIFAR-10 | CSForest | 0.051±0.008 | 0.000±0.000 | 0.049±0.013 | 0.009±0.035 |
| CIFAR-10 | BCOPS | 0.049±0.006 | 0.001±0.000 | 0.042±0.009 | 0.029±0.006 |
| CIFAR-10 | DC | 0.046±0.007 | 0.048±0.091 | 0.039±0.010 | 0.071±0.115 |
| CIFAR-10 | CRF | 0.049±0.008 | 0.003±0.000 | 0.047±0.015 | 0.000±0.000 |
| CIFAR-10 | ACRF | 0.003±0.001 | 0.402±0.001 | 0.040±0.009 | 0.221±0.023 |
| CIFAR-10 | ACRFshift | 0.003±0.001 | 0.069±0.003 | 0.046±0.007 | 0.230±0.035 |
| FashionMNIST | CSForest | 0.050±0.005 | 0.266±0.018 | 0.038±0.009 | 0.311±0.040 |
| FashionMNIST | BCOPS | 0.050±0.007 | 0.381±0.020 | 0.038±0.009 | 0.311±0.040 |
| FashionMNIST | DC | 0.051±0.007 | 0.666±0.033 | 0.038±0.013 | 0.584±0.066 |
| FashionMNIST | CRF | 0.051±0.006 | 0.514±0.021 | 0.038±0.014 | 0.804±0.080 |
| FashionMNIST | ACRF | 0.051±0.006 | 0.537±0.013 | 0.054±0.009 | 0.835±0.020 |
| FashionMNIST | ACRFshift | 0.046±0.005 | 0.481±0.019 | 0.072±0.021 | 0.814±0.039 |

Datasets and Baselines. Our evaluation is conducted on three well-established image benchmarks: MNIST (LeCun and Cortes, 2010), FashionMNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009). Additionally, we include tabular data from the network intrusion domain and chest X-ray data from the medical domain in Appendix D.1 to demonstrate the effectiveness of CSForest in handling diverse datasets. To evaluate CSForest’s performance, we compared it with BCOPS, CRF, and DC, along with two other approaches based on adaptive classification (Romano et al., 2020) and covariate shift conformal prediction (Tibshirani et al., 2019): ACRF (adaptive classifier via random forest) and ACRFshift (adaptive classifier via random forest under covariate shift). ACRF is a derandomized version of the existing conformalized adaptive random forest classifier (Romano et al., 2020), denoted as ACRFrandom, which aims for adaptive coverage across different feature regions. We show results using ACRF instead of ACRFrandom in the main paper because the latter tends to produce overly wide prediction sets in its attempt to achieve conditional coverage, as indicated in the original paper (Romano et al., 2020). ACRFshift combines ACRF with covariate shift conformal prediction. BCOPS, ACRF, and ACRFshift all use random forests to construct set-valued predictions, for a fair comparison with CSForest. More details of these baselines are provided in Appendix B.

Training Details and Evaluations. We set the number of trees $B=3000$ for CSForest and repeat all experiments ten times for performance evaluation. We evaluate the effectiveness of all methods using the type I error, the type II error, and the average set length of $\hat{C}(x)$ at $\alpha=0.05$. Type I error is the percentage of samples from observed classes whose true label is excluded from the associated set-valued prediction $\hat{C}(x)$. This error measure is directly linked to the coverage guarantee in Theorem 3.3. Type II error is the percentage of samples whose $\hat{C}(x)$ contains labels other than the true label. The average length of $\hat{C}(x)$ under the mixed distribution $\mu(x)$ is the optimization objective under the formulation described by eq. (2).
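The three evaluation metrics can be computed directly from the prediction sets, as in the sketch below. The function name `evaluate` and the toy inputs are hypothetical, and pooling inliers and outliers into a single type II error is an assumption of this sketch; the paper also reports type II errors separately for inliers and outliers.

```python
def evaluate(pred_sets, true_labels, inlier_classes):
    """Compute type I error, type II error, and average set length of prediction sets."""
    n_inlier = 0
    type1 = 0
    type2 = 0
    total_len = 0
    for c, y in zip(pred_sets, true_labels):
        total_len += len(c)
        if y in inlier_classes:
            n_inlier += 1
            if y not in c:
                type1 += 1  # true label of an observed class is excluded
        if c - {y}:
            type2 += 1      # set contains a label other than the true one
    n = len(pred_sets)
    return {
        "type1": type1 / n_inlier if n_inlier else float("nan"),
        "type2": type2 / n,
        "avg_len": total_len / n,
    }


# Toy example: inlier classes {0, 1, 2}; label 7 is an outlier class.
pred_sets = [{0}, {0, 1}, set(), {2}, set()]
true_labels = [0, 1, 1, 7, 7]
result = evaluate(pred_sets, true_labels, inlier_classes={0, 1, 2})
```

For an outlier, the true label is never among the training classes, so any non-empty prediction set counts as a type II error, and the empty set is the only error-free prediction.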

4.2.1 The Outliers w/o Shift Setting

In this section, the test set contains outlier labels relative to the training set but without any additional label shift. For each data set, we constructed the training set from a subset of the class labels and the test set from all labels; e.g., for the MNIST data, the training set had digit labels 0-5 of equal size, and the test set included both digit labels 0-5 of equal size and digit labels 6-9. Details of the data subsampling schemes for Q1 can be found in Appendix C.

Table 1 displays the average type I and type II errors of all methods on the test data at $\alpha=0.05$. All methods achieved the targeted coverage rate of $95\%$ when averaging over inlier data. Figure 3(A) presents detailed classification results with different methods on the MNIST dataset. Although ACRF, ACRFshift, and CRF achieved slightly higher accuracy compared to CSForest among inlier digits under no additional label shift, CSForest demonstrated the strongest capability to detect outlier digits 6-9, even compared to BCOPS, with an outlier detection accuracy of approximately 90%. Similar results are observed on other datasets, as detailed in Appendix D.2.

Figure 4: The type II error for inliers and outliers across different sample sizes on MNIST. Figure 4 demonstrates that CSForest outperforms the baselines by efficiently detecting outliers while maintaining lower inlier type II errors across various sample sizes. Note that error bars here are calculated based on repeated sample-splitting and can be smaller than the standard deviation due to sample dependence from different runs.

4.2.2 The Shift w/o Outliers Setting

In the previous simulations, although we had outliers, the class ratios among inlier classes were balanced and remained the same for training and test data. To verify whether methods like CSForest, designed to achieve per-class coverage rather than marginal coverage, still maintain robustness when handling label shifts among inlier classes, we examine the predictive performance of all methods under the traditional label shift setting in the absence of outliers. For example, for the MNIST data set, the training and test sets contain digit labels 0-5 but with different class proportions. Details of the data subsampling and label shift schemes for Q2 can be found in Appendix C.

The achieved type I and type II errors of the different methods in this standard label shift simulation can be found in Table 1. We observed that all methods achieved the desired coverage ($1-\alpha$), with ACRFshift exhibiting high variability. However, CSForest is the only method that achieved consistently low type II errors: CRF has a type II error more than 10 and roughly 50 percentage points higher than CSForest on MNIST and FashionMNIST, respectively; BCOPS has a type II error roughly 25 percentage points higher than CSForest on MNIST; and ACRF and ACRFshift both have type II errors more than twice those of CSForest on all three data sets. Figure 3(B) presents a detailed view of the prediction results for each class on the MNIST dataset. CSForest yields a higher proportion of samples with only the correct label in almost every class compared to the baseline models, which underscores the high-quality prediction sets $\hat{C}(x)$ generated by CSForest under Q2, consistent with the lower type II error reported in Table 1. Detailed results on other datasets can be found in Appendix D.2.

4.2.3 Comparisons with Varying Sample Sizes

We compared the methods under varying sample sizes. Specifically, we varied the number of training and test samples per class from 50 to 200. Figure 4 presents the type II errors for inliers and outliers across all models on the MNIST dataset. As the training sample size increases, the type II error for inliers decreases for all methods, while BCOPS and CSForest also benefit from increased test sample sizes. CSForest closely matches CRF, the best-performing classifier for inlier classification, in predicting inlier labels as we vary the training/test sample sizes from 50 to 200. For outlier detection, CSForest and BCOPS outperformed the other methods by a large margin across sample sizes, with CSForest significantly improving over BCOPS due to its enhanced data utilization efficiency. Of note, the outlier detection ability of ACRFshift deteriorated (higher type II error for outliers) as the sample size increased. This surprising phenomenon is attributed to ACRFshift’s decision rule for outliers, which strongly depends on the sample weights under the covariate shift model, denoted as $\gamma_{x_0}(x)=\frac{r(x)}{r(x_0)+\sum_{z_i\in\mathcal{I}_{cal}}r(x_i)}$. A sample is claimed to be an outlier if $\gamma_{x_0}(x)$ is very large, and $\gamma_{x_0}(x)$ tends to increase with increased training sample sizes (Appendix B). Results on other datasets are consistent with those on MNIST (Appendix D.2).

5 DISCUSSION

We propose CSForest, which aims to construct a calibrated and narrow set-valued prediction under distributional changes, as a powerful ensemble classifier for robust inlier classification and outlier detection. We theoretically justified its robustness for covering the true class label and confirmed its ability to construct high-quality prediction sets compared to alternative methods via extensive experiments.

Future Work. An interesting question is how much guidance from test samples is needed for effective outlier detection. Can CSForest still utilize limited test samples for efficient outlier detection? As an exploratory experiment, we consider a challenging MNIST example, where we have 200 samples per class for labels 0-5 in the training set but only five samples per class for labels 0-9 in the test set. Figure 5 shows that CSForest achieves an average type II error of approximately 42% for inliers and 60% for outliers, whereas DC exhibits an average type II error as high as 95%. This highlights the benefit of utilizing even a small test set in CSForest and suggests the potential of extending the CSForest framework to settings with extremely limited or even single test samples, which we leave for future exploration.

Figure 5: Achieved type II errors for inliers and outliers across 100 repetitions at $\alpha=0.05$ with merely 5 samples per class in the test cohort.

Additionally, the GLS model assumes that the distribution of $x|y$ remains unchanged, which could be violated in practice. When both the distribution of $y$ and the distribution of $x|y$ are allowed to change, the problem becomes much more challenging and less well-defined. One interesting future direction is to relax the GLS model and assume bounded small changes in $x|y$, leading to the investigation of CSForest under an adversarial setting that allows adversarial yet small perturbations in $x|y$ during test time.

Acknowledgements

This work was supported by NSF award DMS2310836.

References

  • Barber et al., (2021) Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2021). Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507.
  • Bartlett and Wegkamp, (2008) Bartlett, P. L. and Wegkamp, M. H. (2008). Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8).
  • Bickel et al., (2009) Bickel, S., Brückner, M., and Scheffer, T. (2009). Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9).
  • Cadre, (2006) Cadre, B. (2006). Kernel estimation of density level sets. Journal of multivariate analysis, 97(4):999–1023.
  • Charoenphakdee et al., (2021) Charoenphakdee, N., Cui, Z., Zhang, Y., and Sugiyama, M. (2021). Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, pages 1507–1517. PMLR.
  • Chow, (1970) Chow, C. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41–46.
  • Cortes et al., (2016) Cortes, C., DeSalvo, G., and Mohri, M. (2016). Boosting with abstention. Advances in Neural Information Processing Systems, 29.
  • Csurka, (2017) Csurka, G. (2017). Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374.
  • Esteva et al., (2017) Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118.
  • Gammerman et al., (2013) Gammerman, A., Vovk, V., and Vapnik, V. (2013). Learning by transduction. arXiv preprint arXiv:1301.7375.
  • Gretton et al., (2009) Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. (2009). Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5.
  • Guan and Tibshirani, (2022) Guan, L. and Tibshirani, R. (2022). Prediction and outlier detection in classification problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(2):524–546.
  • Hechtlinger et al., (2018) Hechtlinger, Y., Póczos, B., and Wasserman, L. (2018). Cautious deep learning. arXiv preprint arXiv:1805.09460.
  • Herbei and Wegkamp, (2006) Herbei, R. and Wegkamp, M. H. (2006). Classification with reject option. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pages 709–721.
  • Kalra and Paddock, (2016) Kalra, N. and Paddock, S. M. (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193.
  • Kim et al., (2020) Kim, B., Xu, C., and Barber, R. (2020). Predictive inference is free with the jackknife+-after-bootstrap. Advances in Neural Information Processing Systems, 33:4138–4149.
  • Kompa et al., (2021) Kompa, B., Snoek, J., and Beam, A. L. (2021). Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine, 4(1):1–6.
  • Krizhevsky et al., (2009) Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
  • LeCun and Cortes, (2010) LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database.
  • Lei, (2014) Lei, J. (2014). Classification with confidence. Biometrika, 101(4):755–769.
  • Lei and Wasserman, (2015) Lei, J. and Wasserman, L. (2015). Distribution-free prediction bands for nonparametric regression. Quality control and applied statistics, 60(1):109–110.
  • Lin et al., (2005) Lin, J., Keogh, E., Fu, A., and Van Herle, H. (2005). Approximations to magic: Finding unusual medical time series. In 18th IEEE Symposium on Computer-Based Medical Systems (CBMS’05), pages 329–334. IEEE.
  • Lipton et al., (2018) Lipton, Z., Wang, Y.-X., and Smola, A. (2018). Detecting and correcting for label shift with black box predictors. In International conference on machine learning, pages 3122–3130. PMLR.
  • Marchette and Marchette, (2001) Marchette, D. J. and Marchette, D. (2001). Computer intrusion detection and network monitoring: a statistical viewpoint. Springer.
  • Neyman and Pearson, (1933) Neyman, J. and Pearson, E. S. (1933). IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337.
  • Ni et al., (2019) Ni, C., Charoenphakdee, N., Honda, J., and Sugiyama, M. (2019). On the calibration of multiclass classification with rejection. Advances in Neural Information Processing Systems, 32.
  • Papadopoulos et al., (2002) Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive confidence machines for regression. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13, pages 345–356. Springer.
  • Qayyum et al., (2020) Qayyum, A., Usama, M., Qadir, J., and Al-Fuqaha, A. (2020). Securing connected & autonomous vehicles: Challenges posed by adversarial machine learning and the way forward. IEEE Communications Surveys & Tutorials, 22(2):998–1026.
  • Romano et al., (2020) Romano, Y., Sesia, M., and Candes, E. (2020). Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581–3591.
  • Sadinle et al., (2019) Sadinle, M., Lei, J., and Wasserman, L. (2019). Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223–234.
  • Schölkopf et al., (2012) Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. arXiv preprint arXiv:1206.6471.
  • Shimodaira, (2000) Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244.
  • Storkey, (2009) Storkey, A. (2009). When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 30:3–28.
  • Tibshirani et al., (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under covariate shift. Advances in neural information processing systems, 32.
  • Vovk et al., (2005) Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world. Springer Science & Business Media.
  • Vovk et al., (2018) Vovk, V., Nouretdinov, I., Manokhin, V., and Gammerman, A. (2018). Cross-conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 37–51. PMLR.
  • Xiao et al., (2017) Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Checklist

  1. For all models and algorithms presented, check if you include:

    (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes/No/Not Applicable] Yes

    (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes/No/Not Applicable] Yes

    (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes/No/Not Applicable] Not Applicable

  2. For any theoretical claim, check if you include:

    (a) Statements of the full set of assumptions of all theoretical results. [Yes/No/Not Applicable] Yes

    (b) Complete proofs of all theoretical results. [Yes/No/Not Applicable] Yes

    (c) Clear explanations of any assumptions. [Yes/No/Not Applicable] Yes

  3. For all figures and tables that present empirical results, check if you include:

    (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes/No/Not Applicable] Yes

    (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes/No/Not Applicable] Yes

    (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes/No/Not Applicable] Yes

    (d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Yes/No/Not Applicable] Not Applicable

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:

    (a) Citations of the creator If your work uses existing assets. [Yes/No/Not Applicable] Yes

    (b) The license information of the assets, if applicable. [Yes/No/Not Applicable] Not Applicable

    (c) New assets either in the supplemental material or as a URL, if applicable. [Yes/No/Not Applicable] Not Applicable

    (d) Information about consent from data providers/curators. [Yes/No/Not Applicable] Not Applicable

    (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Yes/No/Not Applicable] Not Applicable

  5. If you used crowdsourcing or conducted research with human subjects, check if you include:

    (a) The full text of instructions given to participants and screenshots. [Yes/No/Not Applicable] Not Applicable

    (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Yes/No/Not Applicable] Not Applicable

    (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Yes/No/Not Applicable] Not Applicable

Appendix A PROOFS

A.1 Proof of Proposition 3.1

For completeness, we reproduce the proof presented in Guan and Tibshirani, (2022).

Proof.

We first decompose the problem in eq. (2) into $K$ independent problems for different classes, referred to as the problem $P_k$:

$$\min \int_{x}\mathbbm{1}_{x\in\mathcal{A}_{k}}\,\mu(x)\,dx, \tag{6}$$
$$\text{s.t.}\;\mathbb{P}[x\in\mathcal{A}_{k}]\geq 1-\alpha. \tag{7}$$

Let $\mathcal{A}_{k}$ be the solution to problem $P_k$; then the solution to problem eq. (2) is $C(x)=\{k:x\in\mathcal{A}_{k}\}$.

We then define $Q(\alpha,g;F)$ as the lower $\alpha$ percentile of a real-valued function $g(x)$ under distribution $F$, i.e.,

$$Q(\alpha,g;F)=\sup\{t:\mathbb{P}_{F}(g(x)\leq t)\leq\alpha\}. \tag{8}$$

Following Guan and Tibshirani, (2022), we regard the problem $P_k$ as a hypothesis testing problem where the null hypothesis is $H_0: x\sim f_k$ and the alternative is $H_1: x\sim\mu_k$. The optimal solution $\mathcal{A}_k$ of $P_k$ is the acceptance region of the most powerful level-$\alpha$ test of these hypotheses. Therefore, by the Neyman–Pearson Lemma (Neyman and Pearson, 1933), we can construct the likelihood ratio statistic $s(x,k;\mu)=f_k(x)/\mu(x)$ and obtain the solution $\mathcal{A}_k=\{x: s(x,k;\mu)\geq Q(\alpha,s_k;F_k)\}$, where $s_k$ is the conformal score function of class $k$ and $F_k$ is the distribution of $x$ from class $k$. Hence, the solution to eq. (2) is $C(x)=\{k:\mathbb{E}_{X}[\mathbbm{1}\{s(x,k;\mu)\geq s(X,k;\mu)\}\mid Y=k]\geq\alpha,\ k=1,\ldots,K\}$.
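The set $C(x)$ at the end of this proof can be illustrated with a plug-in sketch in which the conditional expectation is replaced by an average over held-out class $k$ scores. This is only an illustration of the oracle rule, not the CSForest procedure itself (which relies on Jackknife+aB); the function names and the dictionary layout are assumptions made for the example.

```python
def empirical_rank(score_x, class_scores):
    """Fraction of class-k scores s(X, k) that are <= the score s(x, k).

    Plug-in stand-in for E_X[1{s(x,k) >= s(X,k)} | Y = k].
    """
    return sum(1 for s in class_scores if score_x >= s) / len(class_scores)


def prediction_set(scores_x, class_scores_by_label, alpha):
    """Return {k : empirical rank of s(x, k) among class-k scores >= alpha}.

    scores_x: dict mapping label k -> s(x, k) for the query point x
    class_scores_by_label: dict mapping label k -> list of s(X, k), X from class k
    An empty set means x looks unlike every observed class.
    """
    return {
        k
        for k, s in scores_x.items()
        if empirical_rank(s, class_scores_by_label[k]) >= alpha
    }


if __name__ == "__main__":
    # Hypothetical scores: 20 held-out scores per class.
    held_out = {"a": list(range(1, 21)), "b": list(range(1, 21))}
    print(prediction_set({"a": 0.5, "b": 10}, held_out, alpha=0.05))
    print(prediction_set({"a": 0.5, "b": 0.2}, held_out, alpha=0.05))
```

A query whose score falls below all held-out scores of every class receives the empty set, which is how an outlier would be flagged under this rule.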

A.2 Proof of Theorem 3.3

Proof.

Here, we prove eq. (5) for any given class $k$ and the test sample $x_i$. The original procedure for determining whether we should assign class $k$ to sample $x_i$ can also be described as follows. First, generate two events, $\mathcal{E}_1$ and $\mathcal{E}_2$:

  1. Event $\mathcal{E}_1$: Training samples other than class $k$ and bootstrap copies $\mathcal{I}_{other}^{b}$ for $b=1,\ldots,\tilde{B}$.

  2. Event $\mathcal{E}_2$: Test samples other than $x_i$ and bootstrap copies $\mathcal{I}_{test}^{b}$ for $b=1,\ldots,\tilde{B}$.

Let $\mathcal{I}_{k}^{b}$ for $b=1,\ldots,\tilde{B}$ represent $\tilde{B}$ bootstrap copies of training class $k$ samples. We conduct our conformalized classification using only copies $b$ with $v_b=1$ for $b=1,\ldots,\tilde{B}$, where $v_b\sim{\rm Bernoulli}((1-\frac{1}{n_k+1})^{n_k})$. The comparison between $x_{i'}$ from class $k$ and the test sample $x_i$ is performed by aggregating prediction functions using runs $b$ with $v_b=1$ and excluding both $x_{i'}$ and $x_i$: $\hat{G}^{ii'}(x;\mu)=\varphi(\{\hat{G}_{k}^{b}(x;\mu): b\text{ satisfies }v_b=1,\ i'\notin\mathcal{I}^{b}_{k},\ i\notin\mathcal{I}_{te}^{b}\})$.

We will now condition on $\mathcal{E}_1$ and $\mathcal{E}_2$, and define $\mathcal{B}_i=\{b: i\notin\mathcal{I}_{te}^{b}\}$. The function $\hat{G}^{ii'}(x;\mu)$ can be rewritten as $\hat{G}^{ii'}(x;\mu)=\varphi(\{\hat{G}_{k}^{b}(x;\mu): b\in\mathcal{B}_i\text{ satisfies }v_b=1,\ i'\notin\mathcal{I}^{b}_{k}\})$. A key observation is that this can be equivalently expressed as first sampling $B\sim{\rm Binomial}(|\mathcal{B}_i|,(1-\frac{1}{n_k+1})^{n_k})$ and constructing the ensemble prediction function comparing $x_i$ and $x_{i'}$: $\hat{G}^{i'}(x;\mu)=\varphi(\{\hat{G}_{k}^{b}(x;\mu): b\text{ satisfies }i'\notin\mathcal{I}^{b}_{k},\ 1\leq b\leq B\})$. We have also dropped the superscript $i$ since this dependence disappears after conditioning and restricting ourselves to $\mathcal{B}_i$.

Interestingly, under this new equivalent characterization, sampling $B\sim{\rm Binomial}(|\mathcal{B}_i|,(1-\frac{1}{n_k+1})^{n_k})$ followed by drawing $B$ bootstrap copies of the $n_k$ class $k$ samples is equivalent to drawing $|\mathcal{B}_i|$ bootstrap copies of the $n_k+1$ samples, which include the $n_k$ samples from training class $k$ and the test sample $x_i$, and then keeping only those bootstrap samples where $x_i$ is not included. A similar equivalence was first noted by Kim et al., (2020) and utilized in the Jackknife+aB procedure for traditional supervised regression.
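The thinning probability $(1-\frac{1}{n_k+1})^{n_k}$ can be read as the chance that a bootstrap copy misses the test sample. The sketch below checks this by simulation, under the assumption (inferred from the exponent, not stated in this appendix) that each bootstrap copy consists of $n_k$ draws with replacement from the $n_k+1$ samples. The function names are hypothetical.

```python
import random


def exclusion_probability(n_k: int) -> float:
    """Chance that n_k draws with replacement from n_k + 1 points
    never select one fixed point (here, the test sample)."""
    return (1.0 - 1.0 / (n_k + 1)) ** n_k


def simulate_exclusion_rate(n_k: int, n_copies: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability.

    Indices 0..n_k-1 stand for the class-k training samples and
    index n_k stands for the test sample x_i.
    """
    rng = random.Random(seed)
    test_index = n_k
    excluded = 0
    for _ in range(n_copies):
        # One bootstrap copy: n_k draws with replacement (assumed size).
        draws = [rng.randrange(n_k + 1) for _ in range(n_k)]
        if test_index not in draws:
            excluded += 1
    return excluded / n_copies


if __name__ == "__main__":
    n_k = 20
    print("closed form:", exclusion_probability(n_k))
    print("simulated  :", simulate_exclusion_rate(n_k, n_copies=20000))
```

Conditional on missing the test sample, each draw is uniform over the $n_k$ training samples, which is why the retained copies behave like ordinary bootstrap copies of the class $k$ training data.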

In summary, the decision rule in CSForest for whether to include label $k$ in $\hat{C}(x_i)$ can be equivalently expressed with the following procedure after conditioning on $\mathcal{E}_1$ and $\mathcal{E}_2$:

  1. Conduct $|\mathcal{B}_i|$ bootstrap resamplings of the $n_k+1$ samples. Denote these samples as $x_1,\ldots,x_{n_k}$ (representing the $n_k$ class $k$ training samples) and $x_{n_k+1}\leftarrow x_i$ as the test sample $x_i$. Let $\tilde{I}_b$ be the index set of samples in the $b^{th}$ bootstrap.

  2. For each bootstrap, construct a random forest tree $\tilde{G}^{b}(x)$ that separates $x_{\tilde{I}_b}$ from the other samples (those conditioned on).

  3. For each pair $(l,j)$ with $1\leq l,j\leq n_k+1$, construct $\hat{G}^{lj}(x)=\varphi(\{\tilde{G}^{b}_{k}(x): l,j\notin\tilde{I}_b\})$ and include label $k$ if and only if

    $$1+\sum_{j=1}^{n_k}\mathbbm{1}\{\hat{G}^{n_k+1,j}(x_{n_k+1})\geq\hat{G}^{j,n_k+1}(x_j)\}\geq(n_k+1)\alpha.$$

Note that $\hat{s}^{ii'}(x,k;\mu)$ is the same as $\hat{G}^{n_k+1,i'}(x_{n_k+1})$.

Define $A_{lj}=\mathbbm{1}\{\hat{G}^{l,j}(x_l)\geq\hat{G}^{j,l}(x_j)\}$ for all $1\leq l,j\leq n_k+1$. It is obvious that $A_{ii}=1$ for all $i=1,\ldots,n_k+1$. Define $A_{l\bullet}=\sum_{j=1}^{n_k+1}A_{lj}$ as the sum of the $l^{th}$ row of the comparison matrix $A$. Then,

$$k\notin\hat{C}_{l}(x_{n_k+1})\quad\mbox{if and only if}\quad A_{n_k+1\bullet}\leq(n_k+1)\alpha-1.$$

Hence, eq. (5) from Theorem 3.3 is equivalent to (9) below:

$$\mathbb{P}(A_{n_k+1\bullet}\leq(n_k+1)\alpha-1\mid y_{n_k+1}=k)\leq 2\alpha. \tag{9}$$

We now proceed to prove (9), which consists of two steps: (1) the $A_{j\bullet}$ are exchangeable with each other for $j=1,\ldots,n_k+1$ when $y_{n_k+1}=k$, and (2) the strange set $S(A)=\{j: A_{j\bullet}\leq(n_k+1)\alpha-1\}$ satisfies $|S(A)|\leq 2\alpha(n_k+1)$. Combining these two steps, we immediately have

$$\mathbb{P}\left[A_{n_k+1\bullet}\leq(n_k+1)\alpha-1\mid y_{n_k+1}=k\right]=\mathbb{P}\left[(n_k+1)\in S(A)\mid y_{n_k+1}=k\right]=\frac{|S(A)|}{n_k+1}\leq 2\alpha.$$

At this stage, the proofs of steps (1) and (2) become identical to those used in the proof of Theorem 1 in Barber et al., (2021) or Theorem 1 in Kim et al., (2020).
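Step (2) is a purely combinatorial statement about the comparison matrix $A$, so it can be checked numerically. The sketch below builds $A$ from an arbitrary array of pairwise scores and verifies that the strange set never exceeds $2\alpha m$ rows, where $m$ plays the role of $n_k+1$. The array `scores[l][j]` is a hypothetical stand-in for $\hat{G}^{l,j}(x_l)$, and all function names are assumptions made for illustration.

```python
import random


def comparison_matrix(scores):
    """A[l][j] = 1 if scores[l][j] >= scores[j][l], else 0.

    scores[l][j] stands in for the pairwise score of sample l computed
    without samples l and j. The diagonal is always 1.
    """
    m = len(scores)
    return [
        [1 if scores[l][j] >= scores[j][l] else 0 for j in range(m)]
        for l in range(m)
    ]


def strange_set(A, alpha):
    """Rows whose sum is at most m * alpha - 1, with m = number of rows."""
    m = len(A)
    threshold = m * alpha - 1
    return [l for l in range(m) if sum(A[l]) <= threshold]


def random_scores(m, seed):
    """Arbitrary pairwise scores, used only to exercise the bound."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(m)] for _ in range(m)]


if __name__ == "__main__":
    m, alpha = 100, 0.1
    # A strict ranking: sample l always scores l, so low-ranked rows are strange.
    ranked = [[l] * m for l in range(m)]
    S = strange_set(comparison_matrix(ranked), alpha)
    print(len(S), "strange rows; bound is", 2 * alpha * m)
    assert len(S) <= 2 * alpha * m
```

The bound holds for any score array because $A_{lj}+A_{jl}\geq 1$ for every pair, so the rows inside the strange set cannot all have small sums at once.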

Appendix B MORE DETAILS ON BASELINES

In this section, we provide more details about the baseline models. We first introduce several existing methods for constructing set-valued predictions $C(x)$, including BCOPS, CRF, DC, and ACRFrandom.

B.1 BCOPS, CRF, DC and ACRFrandom

  • BCOPS (Balanced Conformalized Optimal Prediction Sets) (Guan and Tibshirani, 2022) is a semi-supervised classifier that uses half of the training data to train a classifier separating the observed classes from the unlabeled test samples. The remaining half of the training samples is used for calibration through conformal prediction and for constructing the set-valued prediction. BCOPS focuses on optimizing model performance on the test set by setting $\mu(x)=f_{te}(x)$, the marginal density of the test data. It constructs calibrated set-valued predictions by combining the empirically estimated $v_k(x)$ with the sample-splitting conformal prediction method (Vovk et al., 2005). While BCOPS excels in abnormality detection, outperforming non-test-cohort-adaptive methods, it relies on having a large set of test data, and the sample-splitting scheme lowers data utilization efficiency (Guan and Tibshirani, 2022).

  • CRF (Conformalized Random Forest) (Vovk et al., 2005) constructs the set-valued prediction $\{k: \hat{p}_k(x)\geq\tau_k\}$ by including the training labels $k$ that achieve a high estimated probability from the random forest classifier, with the cut-offs $\tau_k$ chosen by sample-splitting conformal prediction.

  • DC (Density-set Classifier) (Hechtlinger et al., 2018) constructs the set-valued prediction in the same way as CRF, except that the estimated probability $\hat{p}_k(x)$ is replaced by an estimate of the density function for class $k$ obtained from the training data.

  • ACRFrandom: We refer to the adaptive-coverage classification approach using random forest proposed in Romano et al. (2020) as ACRFrandom (Adaptive-coverage CRF with randomization), where the randomization is introduced via an additional uniform random variable $U$ for tie-breaking. ACRFrandom constructs the prediction set $\hat{C}(x)$ by including labels with large estimated probabilities such that the total probability is greater than the upper-level quantile of the empirical distribution of $\{E_i\}_{i\in\mathcal{I}_{cal}}\cup\{\infty\}$, where $\mathcal{I}_{cal}$ is the calibration set in sample-splitting conformal prediction and $E_i$ is the sum of estimated probabilities for all labels preceding the true label.
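As an illustration of the CRF-style construction $\{k: \hat{p}_k(x)\geq\tau_k\}$ listed above, the following is a minimal sketch, assuming class-wise sample-splitting calibration and assuming that the estimated class probabilities have already been produced by some fitted classifier. The function names `classwise_thresholds` and `prediction_set` are hypothetical and are not taken from the CSForest code base.

```python
import math

def classwise_thresholds(cal_probs, cal_labels, alpha):
    """Return {k: tau_k} from calibration probabilities and their true labels."""
    by_class = {}
    for probs, y in zip(cal_probs, cal_labels):
        # Conformity score of a calibration sample: estimated probability of its own label.
        by_class.setdefault(y, []).append(probs[y])
    taus = {}
    for k, scores in by_class.items():
        scores.sort()
        # Class-wise rank cut-off: the floor(alpha * (n_k + 1))-th smallest score.
        m = math.floor(alpha * (len(scores) + 1))
        # With too few calibration samples (m == 0) label k can never be excluded.
        taus[k] = scores[m - 1] if m >= 1 else -math.inf
    return taus

def prediction_set(probs, taus):
    """Labels whose estimated probability reaches the class-wise cut-off."""
    return [k for k in sorted(taus) if probs[k] >= taus[k]]

# Toy calibration data: three samples per class, two classes.
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5], [0.1, 0.9]]
cal_labels = [0, 0, 0, 1, 1, 1]
taus = classwise_thresholds(cal_probs, cal_labels, alpha=0.25)
print(prediction_set([0.55, 0.45], taus))  # an empty set flags a point far from both classes
```

Because every label is tested against its own cut-off, the set can be empty, which is the behaviour used to flag outliers.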

B.2 ACRF and ACRFrandom

We further introduce the baseline ACRFrandom and its de-randomized version ACRF.

ACRFrandom

In the original proposal of Romano et al. (2020), the authors assume that the training and test data have the same distribution. ACRFrandom defines a function $\mathcal{S}$ with inputs $x$, $u\in[0,1]$, the conditional probability $\pi_y=P(Y=y|X=x)$, and the threshold $\tau$. Define

$$
\mathcal{S}(x,u;\pi,\tau)=\begin{cases}
y \text{ indices of the } L(x;\pi,\tau)-1 \text{ largest } \pi_y(x), & \text{if } u<V(x;\pi,\tau)\\
y \text{ indices of the } L(x;\pi,\tau) \text{ largest } \pi_y(x), & \text{otherwise}
\end{cases} \tag{10}
$$

where

$$
V(x;\pi,\tau)=\frac{\sum_{c=1}^{L(x;\pi,\tau)}\pi_{(c)}(x)-\tau}{\pi_{(L(x;\pi,\tau))}(x)}, \tag{11}
$$
$$
L(x;\pi,\tau)=\min\{c\in\{1,\cdots,C\}: \pi_{(1)}(x)+\pi_{(2)}(x)+\cdots+\pi_{(c)}(x)\geq\tau\}, \tag{12}
$$

and $\pi_{(i)}(x)$ is the $i$th largest conditional probability.

Further, ACRFrandom defines the generalized inverse quantile conformity score function $E$,

$$
E(x,y,u;\hat{\pi})=\min\{\tau\in[0,1]: y\in\mathcal{S}(x,u;\hat{\pi},\tau)\}. \tag{13}
$$

And the empirical distribution is

$$
V(x,y;E)=\frac{1}{|\mathcal{I}_{cal}|+1}\sum_{i\in\mathcal{I}_{cal}}\delta_{E_i}+\frac{1}{|\mathcal{I}_{cal}|+1}\delta_{\infty} \tag{14}
$$

where $E_i$ is the minimum $\tau$ for the calibration sample $i$ such that $y_i$ is included in $\mathcal{S}(x_i,u_i;\hat{\pi},\tau)$, and $\delta_a$ denotes a point mass at $a$. The final prediction set is constructed as $\hat{C}(x)=\mathcal{S}(x,u;\pi,\hat{\tau}_{\alpha})$, where $\hat{\tau}_{\alpha}$ is the upper level $\alpha$ quantile of the empirical distribution of $\{E_i\}_{i\in\mathcal{I}_{cal}}\cup\{\infty\}$.
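To make the randomized set function of eqs. (10)-(12) concrete, here is a minimal sketch for a single vector of estimated class probabilities. The function name `randomized_set` is hypothetical, and the fallback for a threshold that is never reached (for example $\tau=\infty$) is an assumption of this sketch rather than something stated in the text.

```python
def randomized_set(probs, u, tau):
    """S(x, u; pi, tau) of eq. (10) for one vector of estimated class probabilities."""
    # Rank labels from largest to smallest probability (ties broken by label index).
    order = sorted(range(len(probs)), key=lambda k: -probs[k])
    mass = 0.0
    for L, k in enumerate(order, start=1):
        mass += probs[k]
        if mass >= tau:  # eq. (12): first rank at which the cumulative mass reaches tau
            # eq. (11); the guard against a zero denominator is an assumption of this sketch.
            v = (mass - tau) / probs[k] if probs[k] > 0 else 0.0
            # With u uniform on [0, 1], the L-th label is dropped with probability V.
            size = L - 1 if u < v else L
            return sorted(order[:size])
    # Assumption: if tau is never reached (e.g. tau = inf), return every label.
    return sorted(order)

probs = [0.5, 0.3, 0.2]
print(randomized_set(probs, u=0.2, tau=0.7))  # u below V: keep only the top label
print(randomized_set(probs, u=0.5, tau=0.7))  # u above V: keep the top two labels
```

In this example the top two labels carry mass 0.8, which overshoots $\tau=0.7$ by 0.1, so $V=0.1/0.3\approx 0.33$ and the second label is dropped about one third of the time.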

ACRF

We consider ACRF, a derandomized version of ACRFrandom without the uniform variable $U$, in our experiments. For ACRF, we define

$$
\mathcal{S}(x;\pi,\tau)=\{y \text{ indices of the } L(x;\pi,\tau) \text{ largest } \pi_y(x)\}, \tag{15}
$$

where

$$
L(x;\pi,\tau)=\min\{c\in\{1,\cdots,C\}: \pi_{(1)}(x)+\pi_{(2)}(x)+\cdots+\pi_{(c)}(x)>\tau\}. \tag{16}
$$

We can similarly define a score function $E$,

$$
E(x,y;\hat{\pi})=\min\{\tau\in[0,1]: y\in\mathcal{S}(x;\hat{\pi},\tau)\}, \tag{17}
$$

and construct $E_i$ as the minimum $\tau$ for the calibration sample $i$ such that $y_i$ is included in $\mathcal{S}(x_i;\hat{\pi},\tau)$. As in ACRFrandom, ACRF constructs the prediction set as $\hat{C}(x)=\mathcal{S}(x;\pi,\hat{\tau}_{\alpha})$, where $\hat{\tau}_{\alpha}$ is the upper level $\alpha$ quantile of the empirical distribution of $\{E_i\}_{i\in\mathcal{I}_{cal}}\cup\{\infty\}$. Algorithm 2 shows the details of ACRF.

Input: Training data $\{z_i\coloneqq(x_i,y_i)\}_{i=1}^{n}$, $i\in\mathcal{I}_{tr}$; test data $\{x_i\}_{i=1}^{m}$, $i\in\mathcal{I}_{te}$.
Output: Prediction sets $\hat{C}_i(x_i)$ for $i\in\mathcal{I}_{te}$.
Randomly split the training data into two subsets, the training set $\mathcal{I}^1_{tr}$ and the calibration set $\mathcal{I}^2_{tr}$.
Train the random forest model $\mathcal{B}$ on all samples in $\mathcal{I}^1_{tr}$: $\hat{\pi}_1\leftarrow\mathcal{B}((X_i,Y_i)_{i\in\mathcal{I}^1_{tr}})$.
Predict on $\mathcal{I}^2_{tr}$: $\hat{\pi}_2\leftarrow\mathcal{B}((X_i)_{i\in\mathcal{I}^2_{tr}})$, and on $\mathcal{I}_{te}$: $\hat{\pi}_{te}\leftarrow\mathcal{B}((X_i)_{i\in\mathcal{I}_{te}})$.
Construct $\{E_i\}_{i\in\mathcal{I}^2_{tr}}$, treating $\mathcal{I}^2_{tr}$ as the calibration set.
Compute $\hat{\tau}_{\alpha}$, the level $(1-\alpha)$ quantile of the empirical distribution of $\{E_i\}_{i\in\mathcal{I}^2_{tr}}\cup\{\infty\}$.
Use the function $\mathcal{S}$ defined in eq. (15) to construct the prediction set at $x_i$, $i\in\mathcal{I}_{te}$, as $\hat{C}_i(x_i)=\mathcal{S}(X_i;\hat{\pi}^i_{te},\hat{\tau}_{\alpha})$.
Algorithm 2 Implementation of ACRF
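The calibration and prediction steps of Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming the random forest has already been fitted on $\mathcal{I}^1_{tr}$ and that its estimated class probabilities are passed in as plain lists. The helper names `acrf_score`, `calibrate` and `acrf_set` are hypothetical, and the closed form used for the score (the probability mass ranked strictly ahead of the true label) follows from eqs. (15)-(17).

```python
import math

def ranked(probs):
    """Label indices ordered from largest to smallest estimated probability."""
    return sorted(range(len(probs)), key=lambda k: -probs[k])

def acrf_score(probs, y):
    """E(x, y; pi) of eq. (17): probability mass ranked strictly ahead of label y."""
    total = 0.0
    for k in ranked(probs):
        if k == y:
            return total
        total += probs[k]
    raise ValueError("label y is not a valid class index")

def calibrate(cal_probs, cal_labels, alpha):
    """Level (1 - alpha) quantile of the calibration scores plus a point mass at infinity."""
    scores = sorted(acrf_score(p, y) for p, y in zip(cal_probs, cal_labels))
    scores.append(math.inf)
    k = math.ceil((1 - alpha) * len(scores))  # rank among the n + 1 atoms
    return scores[k - 1]

def acrf_set(probs, tau):
    """S(x; pi, tau) of eq. (15): top labels up to the first rank whose mass exceeds tau."""
    total, chosen = 0.0, []
    for k in ranked(probs):
        chosen.append(k)
        total += probs[k]
        if total > tau:  # strict inequality, as in eq. (16)
            break
    return sorted(chosen)

# Toy calibration split with three samples and three classes.
cal_probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.8, 0.1]]
cal_labels = [0, 1, 1]
tau_hat = calibrate(cal_probs, cal_labels, alpha=0.25)
print(tau_hat, acrf_set([0.5, 0.4, 0.1], tau_hat))
```

When the calibration set is small relative to $1/\alpha$, the quantile falls on the point mass at infinity and `acrf_set` returns every label.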

B.3 ACRFshift

Finally, we introduce ACRFshift, another baseline that explicitly accounts for distributional changes under the covariate shift model. ACRFshift combines covariate shift conformal prediction (Tibshirani et al., 2019) with ACRF. This combination has not been discussed in previous work, so we describe how it is done in our paper. We also split the test samples into two sets, $\mathcal{I}_{te}^1$ and $\mathcal{I}_{te}^2$. Suppose that we now construct the prediction sets for test samples in $\mathcal{I}_{te}^2$.

Instead of finding $\tau$ with eq. (17) and (14), we consider the following weighted calibration. The weight function is $\gamma_x(x)=\frac{r(x)}{r(x)+\sum_{i\in\mathcal{I}_{tr}^2}r(x_i)}$, where $r(x)=\frac{\mathbb{P}[W=1|x]}{\mathbb{P}[W=0|x]}$ is the odds that $x$ was generated from the test data ($W=1$ means from the test data, and $W=0$ means from the training data), learned from a classifier separating the test data $\mathcal{I}_{te}^1$ from the training data $\mathcal{I}^1_{tr}$. The weight $\gamma_x(x_i)$ of a calibration sample is defined analogously, with $r(x_i)$ in the numerator.
Instead of taking $\hat{\tau}_{\alpha}$ as the level $(1-\alpha)$ quantile of the empirical distribution of $\{E_i\}_{i\in\mathcal{I}^2_{tr}}\cup\{\infty\}$, for any $x\in\mathcal{I}_{te}^2$ we take the level $(1-\alpha)$ quantile of the weighted distribution below:

$$
V_w(x,y;E)=\sum_{i\in\mathcal{I}_{tr}^2}\gamma_x(x_i)\delta_{E_i}+\gamma_x(x)\delta_{\infty}.
$$

Similarly, depending on whether the random variable $U$ is included, this yields two versions: ACRFshiftrandom and ACRFshift.
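The weighted calibration step can be sketched as follows. This is a minimal illustration, assuming the odds $r(\cdot)$ have already been estimated by some classifier and are passed in as plain numbers. The function name `weighted_threshold` is hypothetical, and the normalization of the calibration weights follows the weighted conformal prediction construction of Tibshirani et al. (2019).

```python
import math

def weighted_threshold(cal_scores, cal_ratios, test_ratio, alpha):
    """Level (1 - alpha) quantile of the weighted distribution V_w for one test point."""
    total = test_ratio + sum(cal_ratios)
    # Pair each calibration score E_i with its normalized weight gamma_x(x_i).
    atoms = sorted((e, r / total) for e, r in zip(cal_scores, cal_ratios))
    # The test point contributes a point mass at infinity with weight gamma_x(x).
    atoms.append((math.inf, test_ratio / total))
    cumulative = 0.0
    for score, weight in atoms:
        cumulative += weight
        if cumulative >= 1 - alpha:
            return score
    return math.inf  # reached only through floating-point round-off

scores = [0.0, 0.5, 0.0]
print(weighted_threshold(scores, [1, 1, 1], test_ratio=1, alpha=0.25))  # equal weights
print(weighted_threshold(scores, [1, 1, 1], test_ratio=5, alpha=0.25))  # test point dominates
```

With equal odds the weights reduce to $1/(n+1)$ and the result matches the unweighted quantile. When the test point looks much more like test data than any calibration sample, the mass at infinity dominates and the threshold becomes infinite.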

Unlike CSForest, BCOPS, and even CRF, which naturally treat samples with an empty prediction set $\emptyset$ as not close to the inlier classes and thus as outliers, ACRF and ACRFshift both consider the conditional probability of $y|x$ and do not have such a feature encoded in their constructions. Hence, we adopt the rule where we reject a sample when $r_x(x)$ is very large compared to the others, with $\sum_{i\in I_{tr}^2}r_x(x_i)<\tau$. (Recall that $r_x(x_i)=\frac{1}{n+1}$ without covariate shift.)

B.4 Randomized ACRF/ACRFshift vs. derandomized ACRF/ACRFshift

In this section, we further demonstrate the difference between the randomized and derandomized versions to support our claim that removing the random variable $U$ from ACRFrandom and ACRFshiftrandom helps achieve a desirable type II error.

Table 2 shows the type I and type II errors of ACRF/ACRFrandom and ACRFshift/ACRFshiftrandom (referred to as the derandomized and randomized versions of ACRF/ACRFshift in Table 2). We observe that models with and without randomization achieve comparable type I errors when there is no shift in the inlier labels, while models with randomization tend to have worse type II errors, and their final prediction set $\hat{C}$ is more likely to contain multiple labels. One potential explanation is that randomization helps control conditional coverage, and this more ambitious goal may lead to higher type II error. Supporting this, the randomized versions of both ACRF and ACRFshift control the type I error, while the derandomized versions, especially ACRF, show inflated type I error under the label shift model.

Table 2: Randomized version vs. derandomized version: achieved Type I and Type II errors at $\alpha=0.05$ with outlier components and no additional label shift among inlier digits, and achieved Type I and Type II errors at $\alpha=0.05$ with label shift among inlier digits but no outlier digits.

| Version | Method | Type I (no additional label shift) | Type II (no additional label shift) | Type I (additional label shift) | Type II (additional label shift) |
|---|---|---|---|---|---|
| Randomized | ACRF | 0.049 ± 0.007 | 0.702 ± 0.017 | 0.025 ± 0.007 | 0.884 ± 0.014 |
| Randomized | ACRFshift | 0.053 ± 0.007 | 0.681 ± 0.014 | 0.055 ± 0.013 | 0.828 ± 0.015 |
| Derandomized | ACRF | 0.047 ± 0.006 | 0.431 ± 0.003 | 0.171 ± 0.024 | 0.313 ± 0.067 |
| Derandomized | ACRFshift | 0.036 ± 0.009 | 0.439 ± 0.009 | 0.080 ± 0.026 | 0.630 ± 0.127 |

Appendix C TRAINING DETAILS

In this section, we provide a detailed description of how we constructed datasets satisfying GLS and label shift using MNIST, CIFAR-10, and FashionMNIST for experimentation.

Q1 outliers w/o shift. For each of the datasets MNIST, CIFAR-10, and FashionMNIST, which consist of 10 categories, we sampled 500 samples from each inlier class (categories 0-5) to create a training set of 3000 samples. From the remaining samples in categories 0-5, we randomly selected 500 samples from each class, and similarly, we randomly selected 500 samples from each class in categories 6-9. These 5000 samples formed the test set. Notably, in the test set, categories 6-9 represent outliers that never appeared in the training set, while categories 0-5 are inlier samples.
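The Q1 construction can be sketched as follows. This is a minimal illustration operating on a list of integer class labels only. The function name `split_with_outliers` and the seed handling are hypothetical and do not reproduce the authors' data pipeline.

```python
import random

def split_with_outliers(labels, inlier_classes, outlier_classes, per_class, seed=0):
    """Return (train_idx, test_idx) with disjoint inlier samples and unseen outlier classes."""
    rng = random.Random(seed)
    train, test = [], []
    for k in inlier_classes:
        idx = [i for i, y in enumerate(labels) if y == k]
        rng.shuffle(idx)
        train += idx[:per_class]               # labeled training samples
        test += idx[per_class:2 * per_class]   # inlier test samples, disjoint from training
    for k in outlier_classes:
        idx = [i for i, y in enumerate(labels) if y == k]
        test += rng.sample(idx, per_class)     # classes that never appear in training
    return train, test

# Synthetic labels standing in for a 10-class image dataset.
labels = [k for k in range(10) for _ in range(1000)]
train_idx, test_idx = split_with_outliers(labels, range(6), range(6, 10), per_class=500)
print(len(train_idx), len(test_idx))
```

Each inlier class must contain at least `2 * per_class` samples for the training and test draws to be disjoint and of full size.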

Q2 shift w/o outliers. For the Label Shift setup, we sampled 500 samples from each class (categories 0-5) from MNIST, CIFAR-10, and FashionMNIST to create a training set of 3000 samples. From the remaining samples in categories 0-5, we randomly selected 100 samples from each class, and similarly, we randomly selected 500 samples from each class in categories 6-9. These 3000 samples formed the test set.

It is important to emphasize that for MNIST, CIFAR-10, and FashionMNIST, both CSForest and the baseline methods utilized representations extracted by a pre-trained ResNet-18 model as inputs, rather than the original images.

Appendix D ADDITIONAL EXPERIMENTS RESULTS

In the following section, we present additional experimental results to further substantiate the conclusions made in Section 4.

D.1 More Datasets

To further illustrate the effectiveness of CSForest under different tasks, we additionally include experiments on two new datasets: a cyber/network intrusion dataset from the KDD data competition and a chest X-ray dataset from the medical domain. The results in Table 3 show that CSForest achieves low type I errors and the lowest type II errors on both datasets, demonstrating its superior capability for outlier detection compared to the baselines.

Table 3: The achieved type I and type II errors at $\alpha=0.05$ under no additional distribution shift but with outliers in the test set across 10 repetitions. For Network Intrusion, the test set includes 15 additional intrusion types as outliers; for Chest X-ray, the test set includes lung abnormalities caused by viruses as outliers.

| Method | Network Intrusion: Type I Error | Network Intrusion: Type II Error | Chest X-ray: Type I Error | Chest X-ray: Type II Error |
|---|---|---|---|---|
| CSForest | 0.048 ± 0.013 | 9.524e-5 ± 0.000 | 0.056 ± 0.003 | 0.566 ± 0.002 |
| BCOPS | 0.047 ± 0.014 | 5.905e-4 ± 0.000 | 0.059 ± 0.005 | 0.576 ± 0.004 |
| DC | 0.049 ± 0.010 | 0.261 ± 0.049 | 0.062 ± 0.010 | 0.728 ± 0.013 |
| CRF | 0.030 ± 0.004 | 0.467 ± 0.104 | 0.039 ± 0.033 | 0.793 ± 0.178 |
| ACRF | 0.000 ± 0.000 | 0.857 ± 0.000 | 0.067 ± 0.046 | 0.922 ± 0.005 |
| ACRFshift | 0.001 ± 0.001 | 0.019 ± 0.004 | 0.039 ± 0.049 | 0.885 ± 0.057 |

D.2 Per-class Quality Evaluation

Figures 6 and 7 provide a detailed breakdown of the predictions made by all methods for each class in the CIFAR-10 and FashionMNIST datasets. Consistent with the results observed in MNIST, we find that CSForest is the top-performing method for outlier detection, and it avoids over-predicting by generating prediction sets that predominantly contain only the correct labels for each class.

(a) Per-class quality evaluation with outliers but no additional label shift among inlier digits, where the outliers are defined as $R=\{6,7,8,9\}$.
(b) Per-class (class 0-5) quality evaluation with additional label shift among inlier digits but no outliers.
Figure 6: Per-class quality evaluation on CIFAR-10. Figure 6 is grouped by the actual labels in the testing data and colored based on whether a prediction set contains only the correct label (blue) or more than the correct label (gray). The horizontal dashed line refers to the coverage level of 95%.
(a) Per-class quality evaluation with outliers but no additional label shift among inlier digits, where the outliers are defined as $R=\{6,7,8,9\}$.
(b) Per-class (class 0-5) quality evaluation with additional label shift among inlier digits but no outliers.
Figure 7: Per-class quality evaluation on FashionMNIST. Figure 7 is grouped by the actual labels in the testing data and colored based on whether a prediction set contains only the correct label (blue) or more than the correct label (gray). The horizontal dashed line refers to the coverage level of 95%.

D.3 Average Length of the Prediction Set $\hat{C}$

We observe that across all datasets, both in the setting with outliers and no additional label shift and in the setting without outliers and with additional label shift, CSForest consistently achieves the smallest average prediction set length. This indicates that CSForest’s predictions do not contain a significant amount of redundant information, in line with our previous observations that CSForest has more “only correct label” content for each class and exhibits lower type II error.

To further illustrate that CSForest’s prediction set $\hat{C}$ yields more accurate label predictions (i.e., predominantly containing only the correct labels) than other methods, Figures 8, 9(a) and 10(a) visualize the average length of the prediction sets $\hat{C}$ for all methods.
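The quantities discussed here can be computed from a list of prediction sets as in the following minimal sketch. It assumes that the length of a prediction set is the number of labels it contains, and it mirrors the "only the correct label" versus "more than the correct label" grouping used in the per-class figures. The function name `summarize_sets` is hypothetical.

```python
def summarize_sets(pred_sets, labels):
    """Average set length and the shares of sets with only, or more than, the true label."""
    n = len(labels)
    sets = [set(c) for c in pred_sets]
    only_correct = sum(c == {y} for c, y in zip(sets, labels))
    correct_plus_more = sum(y in c and len(c) > 1 for c, y in zip(sets, labels))
    return {
        "avg_length": sum(len(c) for c in sets) / n,
        "only_correct": only_correct / n,
        "correct_plus_more": correct_plus_more / n,
    }

pred_sets = [[0], [0, 1], [], [2, 3]]
labels = [0, 1, 2, 2]
print(summarize_sets(pred_sets, labels))
```

An empty set contributes zero to the average length and counts toward neither share.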

(a) Per-class quality evaluation of all methods with outlier components but no additional label shift among inlier digits.
(b) Average length of the prediction set $\hat{C}$ of all methods with additional label shift among inlier digits but no outlier components.
Figure 8: Average length of the prediction set $\hat{C}$ on MNIST. For MNIST, in both settings, CSForest achieves the smallest average prediction set length, which aligns with the high “only correct labels” content demonstrated in Figure 3 for CSForest across all classes. This conclusion is further supported by the lower type II error exhibited by CSForest in Table 1.
(a) Per-class quality evaluation of all methods with outlier components but no additional label shift among inlier digits.
(b) Average length of the prediction set $\hat{C}$ of all methods with additional label shift among inlier digits but no outlier components.
Figure 9: Average length of the prediction set $\hat{C}$ on CIFAR-10. For CIFAR-10, in both settings, CSForest achieves the smallest average prediction set length, which aligns with the high “only correct labels” content demonstrated in Figure 6 for CSForest across all classes. This conclusion is further supported by the lower type II error exhibited by CSForest in Table 1.
(a) Per-class quality evaluation of all methods with outlier components but no additional label shift among inlier digits.
(b) Average length of the prediction set $\hat{C}$ of all methods with additional label shift among inlier digits but no outlier components.
Figure 10: Average length of the prediction set $\hat{C}$ on FashionMNIST. In both settings, CSForest achieves the smallest average prediction set length, which aligns with the high “only correct labels” content demonstrated in Figure 7 for CSForest across all classes. This conclusion is further supported by the lower type II error exhibited by CSForest in Table 1.

D.4 CSForest’s Performance with Varying Sample Size

In this section, we present the type II (inlier and outlier) error curves for all methods on CIFAR-10 and FashionMNIST as sample sizes vary. Consistent with the experimental results on MNIST, CSForest demonstrates superior outlier detection relative to the baselines as sample sizes change, while also maintaining lower inlier type II errors.

Figure 11: The type II error for inliers and outliers obtained under different sample sizes on CIFAR-10. Figure 11 illustrates that, compared to baselines, CSForest efficiently detects outliers while maintaining lower inlier type II errors across varying sample sizes.
Figure 12: The type II error for inliers and outliers obtained under different sample sizes on FashionMNIST. Figure 12 illustrates that, compared to baselines, CSForest efficiently detects outliers while maintaining lower inlier type II errors across varying sample sizes.

Appendix E DISCUSSION ON $w$

To avoid oversampling, we impose constraints on $\tilde{\mathcal{I}}_{other}$, where the sample size for other classes is constrained to $\min(\lceil n_{te} w \rceil, n - n_k)$. That is, we weigh the training samples through sampling, and we cap the influence of the training samples once even taking all of them becomes insufficient. The threshold of $w$ therefore satisfies $\lceil n_{te} w \rceil = n - n_k$. As a result, CSForest will yield the same Type I and Type II errors once $\omega$ exceeds a certain threshold that depends on the data.
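The cap described above can be sketched in a few lines. This is only an illustration of the formula $\min(\lceil n_{te} w \rceil, n - n_k)$; the function name `capped_other_class_size` and its argument names are hypothetical and are not taken from the CSForest implementation.

```python
import math


def capped_other_class_size(n_te: int, w: float, n: int, n_k: int) -> int:
    """Sample size drawn from the other classes: min(ceil(n_te * w), n - n_k).

    n_te: number of test samples
    w:    weight on the training samples (assumed non-negative)
    n:    total number of training samples
    n_k:  number of training samples in class k
    """
    if w < 0:
        raise ValueError("w is assumed to be non-negative")
    requested = math.ceil(n_te * w)  # size requested by the weight w
    available = n - n_k              # all training samples outside class k
    return min(requested, available)


# Illustrative numbers (hypothetical): 500 test samples, 1200 training
# samples, 200 of which belong to class k.
sizes_by_weight = {w: capped_other_class_size(500, w, 1200, 200)
                   for w in (0, 1, 1.5, 2, 5)}
```

With these illustrative numbers the requested size grows with $w$ until it reaches the 1000 available samples, after which further increases in $w$ change nothing.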

Assume a training dataset with $K$ classes and $T$ samples per class, and a test dataset with $K'$ classes and $T'$ samples per class. We have:

\begin{align*}
n_{te}\omega + 1 &\geq n - n_k \geq n_{te}\omega \tag{18}\\
\implies K'T'\omega + 1 &\geq KT - T \geq K'T'\omega \\
\implies \frac{(K-1)T}{K'T'} &\geq \omega \geq \frac{(K-1)T - 1}{K'T'}
\end{align*}

Based on eq. (18), we have the following conclusions:

  1. If $\omega$ exceeds the threshold $\frac{(K-1)T}{K'T'}$, increasing $\omega$ further will have no effect on CSForest.

  2. If the test sample size $n_{te} = K'T' \gg KT = n_{tr}$, where $n_{tr}$ is the training sample size, $\omega \to 1$. In this case, any $\omega$ greater than 1 will achieve the same errors as $\omega = 1$.
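The bounds in eq. (18) are straightforward to evaluate for a given balanced design. The sketch below assumes exactly the setting stated above ($K$ classes with $T$ training samples each, $K'$ classes with $T'$ test samples each); the helper name `weight_threshold_bounds` is hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction
from typing import Tuple


def weight_threshold_bounds(K: int, T: int,
                            K_prime: int, T_prime: int) -> Tuple[Fraction, Fraction]:
    """Lower and upper bounds on the threshold weight from eq. (18).

    lower = ((K - 1) * T - 1) / (K' * T')
    upper = ((K - 1) * T) / (K' * T')
    """
    n_te = K_prime * T_prime  # test sample size, n_te = K' T'
    if n_te <= 0:
        raise ValueError("the test set is assumed to be non-empty")
    n_other = (K - 1) * T     # training samples outside class k, n - n_k
    lower = Fraction(n_other - 1, n_te)
    upper = Fraction(n_other, n_te)
    return lower, upper
```

The two bounds differ by $1/(K'T')$, so for test sets of a few hundred samples the threshold is pinned down to within a small fraction of a unit.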

In this section, we experiment with different choices of the weight $w$, from small ($w = 0$) to large (exceeding the weight threshold), rather than fixing $w$ at its default value of $w = 1$. Although a small $w$ can sometimes improve outlier detection and a large $w$ can sometimes improve inlier classification, $w = 1$ tends to provide a good tradeoff between these two objectives on the three real data sets considered.

E.1 MNIST

For the MNIST data, we consider $K = 6$, $T = 200$, $K' = 10$, $T' = 50$, which gives the threshold $1.998 < w < 2.000$. We indeed observe in Table 4 that once $w \geq 2$, the type I and type II errors of CSForest no longer change.
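For reference, the two endpoints quoted above follow from substituting these values into the last line of eq. (18). The worked arithmetic below is a plain substitution and introduces no additional assumptions:

```latex
\begin{align*}
\frac{(K-1)T}{K'T'} &= \frac{5 \cdot 200}{10 \cdot 50} = \frac{1000}{500} = 2.000, \\
\frac{(K-1)T - 1}{K'T'} &= \frac{1000 - 1}{500} = \frac{999}{500} = 1.998 .
\end{align*}
```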

Table 4: Achieved Type I and Type II errors at $\alpha = 0.05$ for varying $\omega$ on MNIST. For $\omega \geq 2$, CSForest achieves the same Type I and Type II errors as at $\omega = 2$.

| $\omega$ | Type I Error | Type II Error (inlier) | Type II Error (outlier) |
|---|---|---|---|
| 0 | 0.057±0.018 | 0.251±0.033 | 0.314±0.068 |
| LOG | 0.053±0.016 | 0.224±0.031 | 0.315±0.061 |
| 1 | 0.058±0.014 | 0.119±0.018 | 0.346±0.065 |
| 1.5 | 0.055±0.016 | 0.106±0.014 | 0.349±0.068 |
| 2 | 0.056±0.018 | 0.099±0.016 | 0.373±0.072 |
| 5 | 0.056±0.018 | 0.099±0.016 | 0.373±0.072 |
| 10 | 0.056±0.018 | 0.099±0.016 | 0.373±0.072 |
| 100 | 0.056±0.018 | 0.099±0.016 | 0.373±0.072 |

E.2 CIFAR-10

For CIFAR-10, we consider $K = 6$, $T = 200$, $K' = 10$, $T' = 50$ and get the threshold $1.998 < w < 2.000$. As with MNIST, once $w$ reaches the threshold of 2, the performance of CSForest remains unchanged.

Table 5: Achieved Type I and Type II errors at $\alpha = 0.05$ for varying $\omega$ on CIFAR-10. For $\omega \geq 2$, CSForest achieves the same Type I and Type II errors as at $\omega = 2$.

| $\omega$ | Type I Error | Type II Error (inlier) | Type II Error (outlier) |
|---|---|---|---|
| 0 | 0.045±0.015 | 0.001±0.001 | 0.000±0.000 |
| LOG | 0.043±0.016 | 0.000±0.001 | 0.000±0.000 |
| 1 | 0.043±0.016 | 0.000±0.001 | 0.000±0.000 |
| 1.5 | 0.043±0.014 | 0.000±0.001 | 0.000±0.000 |
| 2 | 0.043±0.016 | 0.000±0.001 | 0.000±0.000 |
| 5 | 0.043±0.016 | 0.000±0.001 | 0.000±0.000 |
| 10 | 0.043±0.016 | 0.000±0.001 | 0.000±0.000 |
| 100 | 0.043±0.016 | 0.000±0.001 | 0.000±0.000 |

E.3 FashionMNIST

For the FashionMNIST data, we consider $K = 6$, $T = 200$, $K' = 10$, $T' = 50$ and get the threshold $1.998 < w < 2.000$. For FashionMNIST as well, the performance of CSForest remains constant once $w$ reaches the threshold of 2.

Table 6: Achieved Type I and Type II errors at $\alpha = 0.05$ for varying $\omega$ on FashionMNIST. For $\omega \geq 2$, CSForest achieves the same Type I and Type II errors as at $\omega = 2$.

| $\omega$ | Type I Error | Type II Error (inlier) | Type II Error (outlier) |
|---|---|---|---|
| 0 | 0.049±0.011 | 0.421±0.001 | 0.406±0.045 |
| LOG | 0.048±0.012 | 0.391±0.047 | 0.399±0.043 |
| 1 | 0.046±0.012 | 0.287±0.028 | 0.403±0.037 |
| 1.5 | 0.047±0.010 | 0.272±0.029 | 0.408±0.038 |
| 2 | 0.048±0.012 | 0.262±0.032 | 0.410±0.034 |
| 5 | 0.048±0.012 | 0.262±0.032 | 0.410±0.034 |
| 10 | 0.048±0.012 | 0.262±0.032 | 0.410±0.034 |
| 100 | 0.048±0.012 | 0.262±0.032 | 0.410±0.034 |