Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Jeremias Traub1,2   Till J. Bungert1,2,6   Carsten T. Lüth1,2   Michael Baumgartner2,3,6
Klaus H. Maier-Hein2,3,5,6,7   Lena Maier-Hein2,4,6,7   Paul F. Jäger1,2
1German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Germany
2Helmholtz Imaging, DKFZ Heidelberg, Germany
3DKFZ Heidelberg, Division of Medical Image Computing (MIC), Germany
4DKFZ Heidelberg, Division of Intelligent Medical Systems (IMSY), Germany
5Pattern Analysis and Learning Group, Department of Radiation Oncology,
Heidelberg University Hospital, 69120 Heidelberg, Germany
6Faculty of Mathematics and Computer Science, University of Heidelberg, Germany
7National Center for Tumor Diseases (NCT) Heidelberg
[email protected]
Abstract

Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.

1 Introduction

Selective Classification (SC) is increasingly recognized as a crucial component for the reliable deployment of machine learning-based classification systems in real-world scenarios, such as clinical diagnostics [Dvijotham et al., 2023, Leibig et al., 2022, Dembrower et al., 2020, Yala et al., 2019]. The core idea of SC is to equip models with the ability to reject low-confidence predictions, thereby enhancing the overall reliability and safety of the system [Geifman and El-Yaniv, 2017, Chow, 1957, El-Yaniv et al., 2010]. This is typically achieved through three key components: a classifier that makes the predictions, a confidence scoring function (CSF) that assesses the reliability of these predictions, and a rejection threshold $\tau$ that determines when to reject a prediction based on its confidence score.

In evaluating SC systems, two primary concepts are essential: risk and coverage. Risk expresses the classifier’s error potential and is typically measured by its misclassification rate. Coverage, on the other hand, indicates the proportion of instances where the model makes a prediction rather than rejecting it. An effective SC system aims to minimize risk while maintaining high coverage, ensuring that the model provides accurate predictions for as many instances as possible.

Current evaluation of SC systems often focuses on fixed working points defined by pre-set rejection thresholds [Geifman and El-Yaniv, 2017, Liu et al., 2019, Geifman and El-Yaniv, 2019]. For instance, a common evaluation metric might be the selective risk at a given coverage of 70% [Geifman and El-Yaniv, 2019], which communicates the potential risk associated with a specific confidence score $c$ and selection threshold $\tau$ to a patient or end-user. While this method is useful for communicating risk in specific instances, it does not provide a comprehensive evaluation of the system’s overall performance. In standard classification, this is analogous to the need for the Area Under the Receiver Operating Characteristic (AUROC) curve rather than evaluating sensitivity at a specific specificity, which can be highly misleading when assessing general classifier capabilities [Maier-Hein et al., 2022]. The AUROC provides a holistic view of the classifier’s performance across all possible thresholds, thereby driving methodological progress. Similarly, SC requires a multi-threshold metric that aggregates performance across all rejection thresholds to fully benchmark the system’s capabilities.

Several current approaches attempt to address this need for multi-threshold metrics in SC. The Area Under the Risk Coverage curve (AURC) is the most prevalent of these [Geifman et al., 2018]. However, we demonstrate in this work that AURC has significant limitations as it fails to adequately translate the risk from specific working points into a meaningful aggregated evaluation score. This shortcoming hampers the ability to benchmark and improve SC methodologies effectively.

In our work, we aim to bridge this gap by providing a robust evaluation framework for SC. Our contributions are as follows:

Refined Task Formulation: We are the first to provide a comprehensive SC evaluation framework, explicitly deriving meaningful task formulations for different evaluation and application scenarios such as working point versus multi-threshold evaluation.

Formulation of Requirements: We define five critical requirements for multi-threshold metrics in SC, focusing on task suitability, interpretability, and flexibility. We assess current multi-threshold metrics against these five requirements and demonstrate their shortcomings.

Proposal of AUGRC: We introduce the Area Under the Generalized Risk Coverage (AUGRC) curve, a new metric designed to overcome the flaws of current multi-threshold metrics for SC. AUGRC meets all five requirements, providing a comprehensive and interpretable measure of SC system performance.

Empirical Validation: We empirically demonstrate the relevance and effectiveness of AUGRC through a comprehensive benchmark spanning 6 datasets and 13 confidence scoring functions.

In summary, our work presents a significant advancement in the evaluation of SC systems, offering a more reliable and interpretable metric that can drive further methodological progress in the field.

Figure 1: The AUGRC metric based on Generalized Risk overcomes common flaws in current evaluation of Selective Classification (SC). a) Refined task definition for SC. Analogously to standard classification, we distinguish between holistic evaluation for method development and benchmarking using multi-threshold metrics versus evaluation of specific application scenarios at pre-determined working points. The current most prevalent multi-threshold metric in SC, AURC, is based on Selective Risk, a concept for working point evaluation that is not suitable for aggregation over rejection thresholds (red arrow). To fill this gap, we formulate the new concept of Generalized Risk and a corresponding metric, AUGRC (green arrow). b) We formalize our perspective on SC evaluation by identifying five key requirements for multi-threshold metrics and analyze how previous metrics fail to fulfill them. Abbreviations, CSF: Confidence Scoring Function.

2 Refined Task Formulation

SC systems consist of a classifier $m:\mathcal{X}\rightarrow\mathcal{Y}$, which outputs a prediction, and a CSF $g:\mathcal{X}\rightarrow\mathbb{R}$, which outputs a confidence score associated with the prediction. Assuming a supervised training setup, let $\{(x_i,y_i)\}_{i=1}^{N}$ be a dataset containing $N$ independent samples from the source distribution $P(X,Y)$ over $\mathcal{X}\times\mathcal{Y}$. Given a rejection threshold $\tau$, the model prediction is accepted only if the corresponding confidence score is at least $\tau$:

$$(m,g)(x)\coloneqq\begin{cases}m(x),&\text{if }g(x)\geq\tau\\ \text{reject},&\text{otherwise}\end{cases}\tag{1}$$

Given an error function $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}^{+}$, the overall risk of the classifier $m$ is given by $R(m)\coloneqq\frac{1}{N}\sum_{i=1}^{N}\ell(m(x_i),y_i)$. The error function thereby contains the measure of classification performance suitable for the task at hand. Commonly, SC literature assumes a 0/1 error corresponding to the failure indicator variable $Y_{\text{f}}$ with $y_{\text{f},i}(x_i,y_i,m)\coloneqq\mathbb{I}[y_i\neq m(x_i)]$. In this case, the overall risk represents the probability of misclassification $P(Y_{\text{f}}=1)$. By rejecting low-confidence predictions, the selective classifier can decrease the risk of misclassifications at the cost of letting the classifier make predictions on only a fraction of the dataset. This fraction is denoted as the $\text{coverage}\coloneqq P(g(x)\geq\tau)$, with the respective empirical estimator $\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}(g(x_i)\geq\tau)$. Evaluation of SC systems often focuses on preventing "silent", i.e. undetected, failures for which both $y_{\text{f},i}=1$ and $g(x_i)\geq\tau$.
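The acceptance rule of Equation 1, the overall risk, and the empirical coverage can be illustrated with a minimal Python sketch. The toy classifier `m`, the toy CSF `g`, and the data are hypothetical and serve only as an example; rejection is represented by `None`.

```python
from typing import Callable, Optional, Sequence


def selective_predict(m: Callable[[float], int],
                      g: Callable[[float], float],
                      x: float,
                      tau: float) -> Optional[int]:
    """Equation 1: return m(x) if g(x) >= tau, otherwise reject (None)."""
    return m(x) if g(x) >= tau else None


def overall_risk(errors: Sequence[float]) -> float:
    """Overall risk R(m): mean error over all N samples, without rejection."""
    return sum(errors) / len(errors)


def coverage(confidences: Sequence[float], tau: float) -> float:
    """Empirical coverage: fraction of samples with g(x_i) >= tau."""
    return sum(c >= tau for c in confidences) / len(confidences)


def m(x: float) -> int:
    """Hypothetical toy classifier: predicts class 1 for positive inputs."""
    return int(x > 0)


def g(x: float) -> float:
    """Hypothetical toy CSF: distance of the input from the decision boundary."""
    return abs(x)


xs = [2.0, -1.5, 0.2, -0.1, 1.0]
ys = [1, 0, 0, 0, 1]
tau = 0.5

preds = [selective_predict(m, g, x, tau) for x in xs]
errors = [int(m(x) != y) for x, y in zip(xs, ys)]  # 0/1 error per sample
confs = [g(x) for x in xs]

print(preds)                 # None marks a rejected prediction
print(overall_risk(errors))  # risk without any rejection
print(coverage(confs, tau))  # fraction of accepted predictions
```

In this toy setting the only misclassified sample (`x = 0.2`) falls below the threshold and is rejected, so it does not count as a silent failure.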

Deploying an SC system in practice requires the selection of a fixed rejection threshold $\tau$. In the following refined evaluation protocol, we distinguish between the well-established application-specific task formulation where SC models are evaluated at individual working points (Section 2.1) and SC method development where evaluation is independent of individual working points (Section 2.2).

2.1 Evaluating SC systems in applied settings

Current SC evaluation commonly reports the selective risk of the system at a given coverage ("risk@coverage"), which implies a pre-determined cutoff on the risk or coverage (e.g. "maximum risk of 5%") leading to a working point of the system defined by a rejection threshold $\tau$. The selective risk is defined as:

$$\mathrm{Selective\ Risk}_{(m,g)}(\tau)\coloneqq\frac{\sum_{i=1}^{N}\ell(m(x_i),y_i)\cdot\mathbb{I}(g(x_i)\geq\tau)}{\sum_{i=1}^{N}\mathbb{I}(g(x_i)\geq\tau)}\tag{2}$$

This formulation only considers accepted predictions, i.e. those with confidence score $c\geq\tau$. For a binary failure error, this risk is an empirical estimator for the conditional probability $P(Y_{\text{f}}=1\mid g(x)\geq\tau)$. Thus, this metric effectively communicates the risk of a "silent" failure for a specific prediction of classifier $m$ that has been accepted ($c\geq\tau$) at a pre-defined threshold $\tau$. This information may be useful in applied medical settings, e.g. to communicate the risk of misclassification for a patient $p$ whose associated prediction has been accepted by a given SC system ($c_p\geq\tau$).
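A minimal Python sketch of Equation 2 is given below. The toy errors and confidence scores are hypothetical, and returning NaN when no prediction is accepted is a convention assumed here, not one taken from the text.

```python
import math
from typing import Sequence


def selective_risk(errors: Sequence[float],
                   confidences: Sequence[float],
                   tau: float) -> float:
    """Equation 2: mean error over accepted predictions (g(x_i) >= tau).

    Returns NaN when nothing is accepted (assumed convention).
    """
    accepted = [e for e, c in zip(errors, confidences) if c >= tau]
    if not accepted:
        return math.nan
    return sum(accepted) / len(accepted)


# Hypothetical toy data: 0/1 errors and confidence scores of 5 predictions.
errors = [0, 0, 1, 0, 1]
confidences = [0.9, 0.8, 0.7, 0.4, 0.2]

risk_at_60 = selective_risk(errors, confidences, tau=0.5)   # coverage 3/5
risk_at_100 = selective_risk(errors, confidences, tau=0.0)  # coverage 5/5
print(risk_at_60, risk_at_100)
```

At 60% coverage, one of the three accepted predictions is a failure, so the selective risk is 1/3; at full coverage it equals the overall risk of 2/5.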

2.2 Evaluating SC systems for method development and benchmarking

Evaluating the general performance of an SC system requires a multi-threshold metric, which gives a more comprehensive view than a working point analysis based on a single fixed rejection threshold. This is analogous to binary classification tasks, where the Area Under the Receiver Operating Characteristic (AUROC) is used to assess the entire system performance across multiple classifier thresholds. However, the working point analysis cannot simply be translated to multi-threshold aggregation, as it is based on Selective Risk, which only considers the risk w.r.t. accepted predictions, assuming that a specific selection has already occurred ($P(Y_{\text{f}}=1\mid g(x)\geq\tau)$). This evaluation ignores rejected cases and thus contradicts the holistic assessment, where we are interested in the general risk of any prediction causing a silent failure, independent of the future selection decision of individual predictions. This holistic assessment is captured by the joint probability $P(Y_{\text{f}}=1,\,g(x)\geq\tau)$, which reflects the risk of silent failure for any prediction processed by the system, including cases that are potentially rejected in the future.

To address the discrepancy between the need for holistic assessment versus the selective risk’s focus on already selected cases, we formulate the generalized risk:

$$\mathrm{Generalized\ Risk}_{(m,g)}(\tau)\coloneqq\frac{1}{N}\sum_{i=1}^{N}\ell(m(x_i),y_i)\cdot\mathbb{I}[g(x_i)\geq\tau]\tag{3}$$

This metric reflects the joint probability of misclassification and acceptance by the confidence score threshold. By aggregating this risk over multiple rejection thresholds, we can evaluate the overall performance of the SC system.
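The difference between the two risk notions lies only in the normalization: Selective Risk divides by the number of accepted predictions, whereas Generalized Risk divides by all $N$ predictions. The following Python sketch, on hypothetical toy data, computes both and checks the resulting identity Generalized Risk = Selective Risk × coverage, which mirrors $P(Y_{\text{f}}=1,\,g(x)\geq\tau)=P(Y_{\text{f}}=1\mid g(x)\geq\tau)\,P(g(x)\geq\tau)$.

```python
from typing import Sequence


def coverage(confidences: Sequence[float], tau: float) -> float:
    """Fraction of predictions with confidence >= tau."""
    return sum(c >= tau for c in confidences) / len(confidences)


def selective_risk(errors: Sequence[float],
                   confidences: Sequence[float],
                   tau: float) -> float:
    """Mean error over accepted predictions (assumes at least one is accepted)."""
    accepted = [e for e, c in zip(errors, confidences) if c >= tau]
    return sum(accepted) / len(accepted)


def generalized_risk(errors: Sequence[float],
                     confidences: Sequence[float],
                     tau: float) -> float:
    """Equation 3: errors of accepted predictions, normalized by ALL N samples."""
    n = len(errors)
    return sum(e for e, c in zip(errors, confidences) if c >= tau) / n


# Hypothetical toy data: 0/1 errors and confidence scores of 5 predictions.
errors = [0, 0, 1, 0, 1]
confidences = [0.9, 0.8, 0.7, 0.4, 0.2]

for tau in (0.5, 0.3, 0.0):
    gen = generalized_risk(errors, confidences, tau)
    sel = selective_risk(errors, confidences, tau)
    cov = coverage(confidences, tau)
    # Joint probability = conditional probability * probability of acceptance.
    assert abs(gen - sel * cov) < 1e-12
    print(f"tau={tau}: coverage={cov:.2f} selective={sel:.3f} generalized={gen:.3f}")
```

At full coverage (`tau=0.0`) both quantities coincide with the overall risk.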

2.3 Requirements for Selective Classification multi-threshold metrics

Based on the refined task definition given above, we can formulate five concrete requirements R1-R5 for multi-threshold metrics in SC:

R1: Task Completeness. The general goal in SC is to prevent silent failures either by preventing failures in the first place (via classifier performance), or by detecting the remaining failures (via CSF ranking quality). As argued in Jäger et al. [2023], it is crucial to jointly evaluate both aspects of an SC system, since the choice of CSF generally affects the underlying classifier.

R2: Monotonicity. The metric should be monotonic w.r.t. both evaluated factors stated in R1, i.e. improving on one of the two factors while keeping the other one fixed results in an improved metric value. Note that R2 does not make assumptions about how the two factors are combined, but it represents a minimum requirement for meaningful comparison and optimization of the metric.

R3: Ranking Interpretability. Interpretability is a crucial component of a metric [Maier-Hein et al., 2022]. AUROC is the de facto standard ranking metric for binary classification tasks, providing an intuitive assessment of ranking quality which is proportional to the number of permutations needed for establishing an optimal ranking. This can also be interpreted as the "probability of a positive sample having a higher score than a negative one". Ideally, the SC metric should follow this intuitive assessment of rankings.

R4: CSF Flexibility. The metric should be applicable to arbitrary choices of CSFs. This includes external CSFs, i.e. scores that are not based directly on the classifier output.

R5: Error Flexibility. Current SC literature largely focuses on the 0/1 error (i.e. $1-\text{accuracy}$) in its risk computation. However, in many real-world scenarios, accuracy is not an adequate classification metric, such as in the presence of class imbalance and for pixel-level classification. Thus, the SC metric should be flexible w.r.t. the choice of error function.

2.4 Current multi-threshold metrics in SC do not fulfill requirements R1-R5

AURC: Geifman et al. [2018] derive the AURC as the Area under the Selective Risk-Coverage curve. This metric is the most prevalent multi-threshold metric in SC [Geifman et al., 2018, Jäger et al., 2023, Bungert et al., 2023, Cheng et al., 2023, Zhu et al., 2023a, Varshney et al., 2020, Naushad and Voiculescu, 2024, Van Landeghem et al., 2024, 2023, Zhu et al., 2022, Xin et al., 2021, Yoshikawa and Okazaki, 2023, Ding et al., 2020, Zhu et al., 2023b, Galil and El-Yaniv, 2021, Franc et al., 2023, Cen et al., 2023, Xia and Bouganis, 2022, Cattelan and Silva, 2023, Tran et al., 2022, Kim et al., 2023, Ashukha et al., 2020, Xia et al., 2024]. For the 0/1 error, AURC can be expressed through the following integral:

$$\mathrm{AURC}=\int_{0}^{1}P(Y_{\text{f}}=1\mid g(x)\geq\tau)\,\mathrm{d}P(g(x)\geq\tau)\tag{4}$$

This integral effectively aggregates the Selective Risk (Equation 2) over all fractions of accepted predictions (i.e. coverage). However, as discussed in Section 2.2, the Selective Risk is not suitable for aggregation over rejection thresholds to holistically assess an SC system. This inadequacy effectively leads to an excessive over-weighting of high-confidence failures in AURC and, in turn, to a breach of the requirements of monotonicity (R2) and ranking interpretability (R3). We provide empirical evidence for these shortcomings on toy data (Section 3) and real-world data (Section 4.2). Further, current SC literature usually employs the 0/1 error function, even though the related Accuracy metric is well known to be an unsuitable classification performance measure for many applications. For example, Balanced Accuracy is used to address class imbalance and metrics such as the Dice Score are employed for segmentation. Moving beyond the 0/1 error yields a higher variance in distinct error values and may thus amplify the shortcoming of over-weighting high-confidence failures. As more sophisticated error functions are an important direction of future SC research, it is crucial to ensure the compatibility of metrics in SC evaluation.
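The over-weighting of high-confidence failures can be made concrete with a small Python sketch. It assumes a 0/1 error and a simple discretization in which the risk is averaged over the top-$k$ most confident predictions for $k=1,\dots,N$; the function names are hypothetical. Under this assumption, a single failure at rank $r$ contributes $\frac{1}{N}\sum_{k=r}^{N}\frac{1}{k}$ when Selective Risk is aggregated, but $\frac{N-r+1}{N^{2}}$ when the Generalized Risk of Equation 3 is aggregated.

```python
def aurc_failure_weight(rank: int, n: int) -> float:
    """Contribution of one 0/1 failure at `rank` (1 = most confident) to
    (1/n) * sum_{k=1..n} SelectiveRisk(top-k predictions).

    The failure is part of every top-k set with k >= rank, where it adds 1/k.
    """
    return sum(1.0 / k for k in range(rank, n + 1)) / n


def generalized_failure_weight(rank: int, n: int) -> float:
    """Same contribution when Generalized Risk is aggregated instead:
    the failure adds 1/n to every top-k set with k >= rank."""
    return (n - rank + 1) / n ** 2


n = 100
for rank in (1, 10, 50, 100):
    print(rank,
          round(aurc_failure_weight(rank, n), 5),
          round(generalized_failure_weight(rank, n), 5))
```

For `n = 100`, the most confident failure receives a weight of about 0.052 under Selective Risk aggregation versus 0.01 under Generalized Risk aggregation, while both weights coincide for the least confident failure. The second weight decreases linearly with the rank; the first follows a harmonic sum that concentrates weight on the top ranks.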

e-AURC: Geifman et al. [2018] further introduce the e-AURC as the difference to the AURC of an optimal CSF (denoted as $\mathrm{AURC}^{*}$):

$$\mathrm{e\text{-}AURC}=\mathrm{AURC}-\mathrm{AURC}^{*}\tag{5}$$

It relies on the intuition that subtracting the optimal AURC eliminates the contribution of the overall classification performance and collapses it to a pure ranking metric [Geifman et al., 2018, Galil et al., 2023, Jäger et al., 2023]. However, several works demonstrated that this intuition does not hold and that the e-AURC is still sensitive to the overall classification performance [Galil et al., 2023, Cattelan and Silva, 2023]. We attribute this deviation from the intended behavior to the shortcomings of the underlying AURC, as we find the desired isolation of classifier performance in our improved metric formulation in Section 3. The e-AURC further inherits the shortcomings of AURC w.r.t. monotonicity (R2), ranking interpretability (R3), and error flexibility (R5).
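As a rough illustration of Equation 5, the Python sketch below computes an empirical AURC and e-AURC for a list of 0/1 errors sorted by descending confidence. It assumes a simple discretization (mean Selective Risk over the top-$k$ predictions) and takes the optimal CSF to be one that ranks all failures last; both choices and the function names are assumptions made for this example.

```python
from typing import Sequence


def aurc(errors_ranked: Sequence[float]) -> float:
    """Mean Selective Risk over the top-k predictions, k = 1..N.

    `errors_ranked` holds the errors ordered by descending confidence.
    """
    n = len(errors_ranked)
    cum = 0.0    # summed error among the top-k predictions
    total = 0.0
    for k, e in enumerate(errors_ranked, start=1):
        cum += e
        total += cum / k
    return total / n


def e_aurc(errors_ranked: Sequence[float]) -> float:
    """Equation 5: AURC minus the AURC of an optimal ranking.

    The optimal ranking is assumed to place the largest errors last.
    """
    optimal = sorted(errors_ranked)
    return aurc(errors_ranked) - aurc(optimal)


# Hypothetical ranking of 5 predictions; failures at ranks 2 and 5.
errors_ranked = [0, 1, 0, 0, 1]
print(aurc(errors_ranked), e_aurc(errors_ranked))
```

By construction, the e-AURC of a ranking that already places all failures last is zero.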

NLL / Brier Score: Proper scoring rules such as the Negative-Log-Likelihood and the Brier Score evaluate a general meaningfulness of scores based on the softmax-output of the classifier [Ovadia et al., 2019]. Thereby, they jointly assess ranking and calibration of scores, which dilutes the focus on ranking quality in the context of SC [Jäger et al., 2023]. In our formulation, the calibration aspect in these metrics breaks the required monotonicity w.r.t. SC evaluation (R2). Further, they are not applicable to CSFs beyond those that are directly based on the classifier output (R4).

$\mathrm{AUROC_f}$: The "failure version" of the standard AUROC assesses the correctness of predictions with a binary failure label. Based on the AUROC, it provides an intuitive ranking quality assessment. However, it ignores the underlying classifier performance (R1, R2), and is restricted to binary error functions (R5).

OC-AUC: Kivlichan et al. [2021] introduce the Oracle-Model Collaborative AUC, where first a fixed threshold is applied on the confidence scores, above which error values are set to zero. Then, the $\mathrm{AUROC_f}$ (or the Average Precision) is evaluated. This metric is also reported in [Dehghani et al., 2023, Tran et al., 2022], with a review fraction of 0.5%. OC-AUC is subject to the same pitfalls as $\mathrm{AUROC_f}$ (R1, R2, R5).

AUROC-AURC: Pugnana and Ruggieri [2023] propose the AUROC-AURC where the Selective Risk is defined as the classification (not failure) AUROC computed over the set of accepted predictions. However, it is only applicable to binary classification models with binary error functions (R5). Further, in employing Selective Risk it inherits the pitfalls described for AURC regarding monotonicity (R2) and ranking interpretability (R3).

NAURC: Cattelan and Silva [2023] propose the NAURC, a min-max scaled version of the e-AURC, and claim that this modification eliminates its lack of interpretability. However, given the linear relationship with the AURC, it does not fulfill the requirements R2 and R3.

F1-AUC: Malinin and Gales [2020], Malinin et al. [2021] introduce the notion of Error-Retention Curves, which corresponds to that of Generalized Risk Coverage Curves. However, in Malinin et al. [2021] the authors propose an F1-AUC metric which is only applicable to binary error functions (R5) and breaks monotonicity (R2), as it decreases with increasing accuracy for accuracy values above $\approx 0.56$ (see Appendix A.1.3 for a detailed explanation).

ARC: Accuracy-Rejection-Curves [Nadeem et al., 2009, Ashukha et al., 2020, Condessa et al., 2017] and the associated AUC directly correspond to the AURC and are therefore subject to the same pitfalls (R2, R3).

Figure 2: The proposed AUGRC metric resolves shortcomings of AURC. All figures are based on rankings of predictions according to descending associated confidence scores induced by a CSF. All AURC, e-AURC, and AUGRC values are scaled by $\times 1000$. a) shows the contribution of an individual failure case on the AURC and AUGRC metrics depending on its ranking position (for technical details, see Section A.1.1). While AUGRC reflects the intuitive behavior of weighing the failure cases proportional to their ranking position, the AURC puts excessive weight on high-confidence failures. b-d) Toy example of three CSFs ranking 20 predictions to show how AUGRC resolves the broken monotonicity requirement (R2) of AURC. Despite equal $\mathrm{AUROC_f}$ and equal $\mathrm{acc}$ in CSF-1 and CSF-2, the AURC improves. And AURC even improves in CSF-3, which features lower $\mathrm{AUROC_f}$ and lower $\mathrm{acc}$ compared to CSF-1. e-f) The corresponding risk-coverage curves reveal that the non-intuitive behavior of AURC is due to the excessive effect of the high-confidence failure of CSF-1 on the Selective Risk, which is resolved in the Generalized Risk.

3 Area under the Generalized Risk Coverage Curve

In Section 2.2, we illustrate that aggregating SC performance across working points requires shifting the perspective from the Selective Risk to the Generalized Risk (Equation 3) as a holistic assessment of the risk of silent failures for all predictions, before the rejection decision is made. We propose to evaluate SC methods via the Area under the Generalized Risk Coverage Curve (AUGRC). For the binary failure error it becomes an empirical estimator of the following expression:

$$\mathrm{AUGRC}=\int_{0}^{1}P(Y_{\text{f}}=1,\,g(x)\geq\tau)\,\mathrm{d}P(g(x)\geq\tau)\tag{6}$$

This metric evaluates SC models in terms of the expected risk of silent failures across working points and thus provides a practical measurement that is directly interpretable. It is bounded to the $[0,\nicefrac{1}{2}]$ interval, whereby lower AUGRC corresponds to better SC performance. The AUGRC is not subject to the shortcomings of the AURC, and we can derive a direct relationship to $\mathrm{AUROC_f}$ and $\mathrm{acc}$ (the derivation and visualizations are shown in Equation 8 and Figure 5):

$$\mathrm{AUGRC}=(1-\mathrm{AUROC_f})\cdot\mathrm{acc}\cdot(1-\mathrm{acc})+\frac{1}{2}(1-\mathrm{acc})^{2}\tag{7}$$
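Equation 7 can be checked numerically on a small ranking. The Python sketch below is an illustration under stated assumptions: a 0/1 error, no ties in the confidence scores, and a trapezoidal estimate of the area under the Generalized Risk Coverage curve starting at zero coverage. The function names and the toy ranking are hypothetical; this is not the implementation from the FD-Shifts repository.

```python
from typing import Sequence


def augrc_trapezoid(errors_ranked: Sequence[int]) -> float:
    """Trapezoidal area under (coverage, generalized risk), starting at (0, 0).

    `errors_ranked` holds 0/1 errors ordered by descending confidence.
    """
    n = len(errors_ranked)
    area = 0.0
    cum_prev = 0  # failures among the top-(k-1) predictions
    for e in errors_ranked:
        cum = cum_prev + e
        # Generalized risk at top-k is cum / n; each coverage step has width 1 / n.
        area += 0.5 * (cum_prev + cum) / n / n
        cum_prev = cum
    return area


def auroc_f(errors_ranked: Sequence[int]) -> float:
    """Fraction of (correct, failure) pairs in which the correct prediction
    is ranked above the failure (assumes no ties)."""
    n_fail = sum(errors_ranked)
    n_corr = len(errors_ranked) - n_fail
    failures_seen = 0
    inversions = 0  # pairs with the failure ranked above the correct prediction
    for e in errors_ranked:
        if e:
            failures_seen += 1
        else:
            inversions += failures_seen
    return 1.0 - inversions / (n_fail * n_corr)


def augrc_equation_7(auroc: float, acc: float) -> float:
    """Closed form of Equation 7."""
    return (1 - auroc) * acc * (1 - acc) + 0.5 * (1 - acc) ** 2


# Hypothetical ranking of 10 predictions with failures at ranks 3, 7, and 9.
ranking = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
acc = 1 - sum(ranking) / len(ranking)

empirical = augrc_trapezoid(ranking)
closed_form = augrc_equation_7(auroc_f(ranking), acc)
print(empirical, closed_form)
```

Under these assumptions, both values agree up to floating-point rounding (0.125 for the toy ranking).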

The optimal AUGRC is given by the second term in Equation 7, hence subtracting the optimal SC performance yields a pure ranking measure re-scaled by the probability of finding a positive/negative pair. This overcomes the lack of interpretability of AURC and e-AURC (R3). Monotonicity (R2) is ensured since the partial derivatives w.r.t. both $\mathrm{AUROC_f}$ and $\mathrm{acc}$ are negative whenever $0<\mathrm{acc}<1$. Further, it can accommodate arbitrary CSFs (R4) as well as arbitrary classification metrics via the error function $\ell$ (R5).
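The monotonicity statement can be verified directly from Equation 7. The short derivation below is a sketch that treats $\mathrm{AUROC_f}$ and $\mathrm{acc}$ as independent variables in $[0,1]$; the shorthand $A$ and $a$ is introduced here only for brevity.

```latex
% Shorthand (assumed for this sketch): A = AUROC_f, a = acc, both in [0, 1].
\begin{align*}
\mathrm{AUGRC}(A, a) &= (1 - A)\, a\, (1 - a) + \tfrac{1}{2}(1 - a)^{2} \\
\frac{\partial\, \mathrm{AUGRC}}{\partial A} &= -a\,(1 - a) \;\le\; 0 \\
\frac{\partial\, \mathrm{AUGRC}}{\partial a} &= (1 - A)(1 - 2a) - (1 - a)
  = -\bigl[\, a\,(1 - A) + A\,(1 - a) \,\bigr] \;\le\; 0
\end{align*}
```

Both derivatives are strictly negative for $0<a<1$, so improving either the ranking quality or the accuracy while keeping the other fixed lowers the metric.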

Figure 2a depicts the metric contribution of individual failure cases depending on their ranking position. This shows empirically how the excessive over-weighting of high-confidence failures in AURC is resolved by AUGRC, and how the intuitive ranking assessment of $\mathrm{AUROC_f}$ is established (see Section A.1.1 for a detailed derivation). We further showcase how AUGRC resolves the broken monotonicity requirement (R2) of AURC in Figures 2 b-d. Despite equal $\mathrm{AUROC_f}$ and equal $\mathrm{acc}$ in CSF-1 and CSF-2, the AURC improves. And AURC even improves in CSF-3, which features lower $\mathrm{AUROC_f}$ and lower $\mathrm{acc}$ compared to CSF-1. Figures 2 e-f depict the associated risk-coverage curves and reveal that the non-intuitive behavior of AURC is due to the excessive effect of the high-confidence failure of CSF-1 on the Selective Risk, which is resolved in the Generalized Risk. In Section 4.2 we demonstrate implications of AURC’s shortcomings on real-world data.
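A comparable effect can be reproduced with a small, self-constructed toy example (not the data shown in Figure 2). The Python sketch below compares two hypothetical rankings of 10 predictions with two failures each. It assumes a 0/1 error, no ties, and a simple discretization in which both risks are averaged over the top-$k$ predictions for $k=1,\dots,N$.

```python
from typing import Dict, List


def ranking_metrics(failure_ranks: List[int], n: int) -> Dict[str, float]:
    """Metrics for n predictions sorted by descending confidence, where
    `failure_ranks` holds the 1-based positions of the 0/1 failures."""
    errors = [1 if k in failure_ranks else 0 for k in range(1, n + 1)]
    n_fail = sum(errors)
    n_corr = n - n_fail

    cum = 0          # failures among the top-k predictions
    inversions = 0   # (failure, correct) pairs with the failure ranked higher
    aurc = 0.0
    augrc = 0.0
    for k, e in enumerate(errors, start=1):
        cum += e
        if e == 0:
            inversions += cum
        aurc += cum / k    # selective risk of the top-k predictions
        augrc += cum / n   # generalized risk of the top-k predictions
    return {
        "acc": n_corr / n,
        "auroc_f": 1.0 - inversions / (n_fail * n_corr),
        "aurc": aurc / n,
        "augrc": augrc / n,
    }


csf_a = ranking_metrics([1, 10], n=10)  # one very confident failure
csf_b = ranking_metrics([5, 6], n=10)   # two mid-ranked failures

for name, result in (("CSF-A", csf_a), ("CSF-B", csf_b)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

Both rankings have the same accuracy (0.8) and the same $\mathrm{AUROC_f}$ (0.5), and their AUGRC values coincide up to floating-point rounding (0.11). The AURC, however, differs by roughly a factor of two (about 0.30 versus 0.15), driven by the single high-confidence failure of the first ranking.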

4 Empirical Study

We conduct an empirical study based on the existing FD-Shifts benchmarking framework [Jäger et al., 2023], which compares various CSFs across a broad range of datasets. The focus of our study is not on the performance of individual methods (CSFs) but rather on the ranking of methods based on AUGRC evaluation as compared to AURC. We utilize the same experimental setup as in Jäger et al. [2023] with the addition of CSF scores based on temperature-scaled classifier logits. A detailed overview of the datasets and methods used can be found in Appendix A.2. The code for reproducing our results and a PyTorch implementation of the AUGRC are available at: https://github.com/IML-DKFZ/fd-shifts.

4.1 Comparing Method Rankings of AUGRC and AURC

To study the relevance of $\mathrm{AUGRC}$, we investigate the changes induced by $\mathrm{AUGRC}$ on rankings of CSFs compared to $\mathrm{AURC}$ rankings.

CSF Rankings with AUGRC substantially deviate from AURC rankings. Figure 3 illustrates the ranking differences for all test datasets, showing changes in ranks across all CSFs. Notably, $\mathrm{AUGRC}$ induces changes in the top-3 methods (out of 13) on 5 out of 6 datasets. To ensure that these ranking changes are not due to variability in the test data, we evaluate the metrics on 500 bootstrap samples from the test dataset and derive the compared method rankings based on the average rank across these samples ("rank-then-aggregate"). The size of each bootstrap sample corresponds to the size of the test dataset. Metric values for each bootstrap sample are averaged over 5 training initializations (2 for BREEDS and 10 for CAMELYON-17-Wilds). We analyze the robustness of the resulting method rankings for $\mathrm{AURC}$ and $\mathrm{AUGRC}$ separately, based on statistical significance. To that end, we perform pairwise CSF comparisons in the form of a one-sided Wilcoxon signed-rank test [Wilcoxon, 1992] with a significance level of $\alpha = 0.05$ on the bootstrap samples, as proposed in Wiesenfarth et al. [2021]. The resulting significance maps displayed in Figure 3 indicate stable rankings for both $\mathrm{AURC}$ and $\mathrm{AUGRC}$. This suggests that the observed ranking differences are induced by the conceptual difference of $\mathrm{AUGRC}$ compared to $\mathrm{AURC}$ and, as a consequence, that the shortcomings of current metrics and the respective solutions discussed in Section 2 are not only conceptually sound but also highly relevant in practice. For the results in Figure 3, we select the DeepGamblers reward parameter and whether to use dropout for both metrics separately on the validation dataset (details in Table 1 and Table 2).
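The "rank-then-aggregate" step can be sketched in a few lines. This is an illustrative sketch only, not the FD-Shifts code: the function names, the input layout (one list of confidences and one list of 0/1 failure indicators per CSF), and the synthetic data are assumptions, averaging over training initializations is left out, and the subsequent Wilcoxon signed-rank test is omitted.

```python
import random


def augrc(confidences, failures):
    """Average Generalized Risk over all thresholds (lower is better)."""
    n = len(confidences)
    # Sort by descending confidence; ties keep their input order (an assumption).
    order = sorted(range(n), key=lambda i: -confidences[i])
    cum, total = 0, 0.0
    for i in order:
        cum += failures[i]
        total += cum / n
    return total / n


def rank_then_aggregate(csfs, metric, n_boot=500, seed=0):
    """Average rank of each CSF across bootstrap samples of the test set.

    `csfs` maps a CSF name to (confidences, failures), all of equal length.
    """
    rng = random.Random(seed)
    names = list(csfs)
    n = len(csfs[names[0]][0])
    rank_sums = {name: 0.0 for name in names}
    for _ in range(n_boot):
        # Same resampled indices for every CSF; sample size equals test set size.
        idx = [rng.randrange(n) for _ in range(n)]
        values = {
            name: metric([csfs[name][0][i] for i in idx],
                         [csfs[name][1][i] for i in idx])
            for name in names
        }
        # Rank within this bootstrap sample (1 = best, i.e. lowest metric value).
        for rank, name in enumerate(sorted(names, key=values.get), start=1):
            rank_sums[name] += rank
    return {name: rank_sums[name] / n_boot for name in names}


# Synthetic example: both CSFs share the same failures but order them differently.
failures = [1] * 10 + [0] * 40
demo = {
    "failures_ranked_last": ([0.1] * 10 + [0.9] * 40, failures),
    "failures_ranked_first": ([0.9] * 10 + [0.1] * 40, failures),
}
print(rank_then_aggregate(demo, augrc, n_boot=100))
```

Ranking within each bootstrap sample before averaging makes the aggregate insensitive to the scale of metric differences, which is why the final order can differ from the order of mean metric values.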

The shortcomings of AURC affect CSF comparison across datasets and distribution shifts. The method rankings for all datasets and distribution shifts, based on the original test datasets, are shown in Table 4, which also includes results for equal hyperparameters for $\mathrm{AURC}$ and $\mathrm{AUGRC}$. The substantial differences in method rankings across CSFs and datasets underline the relevance of $\mathrm{AUGRC}$ for SC evaluation. The complete $\mathrm{AUGRC}$ results on the FD-Shifts benchmark are shown in Table 3.

In a few cases, as shown in Figure 3, we observe that a CSF A is ranked worse than a neighboring CSF B even though CSF A's ranks are significantly superior to those of CSF B based on the Wilcoxon statistic over the bootstrap samples. This discrepancy can occur if the rank variability in CSF A is larger than in CSF B. While this does not affect our conclusions regarding the relevance of the $\mathrm{AUGRC}$, it indicates that selecting a single best CSF for application based solely on method rankings should be approached with caution. We recommend closer inspection of the individual CSF performance in such cases, although this is not the focus of our study.

Figure 3: Substantial differences in method rankings for AUGRC and AURC. On 5 out of 6 datasets, the top-3 CSFs (out of 13 compared CSFs) change when employing the proposed $\mathrm{AUGRC}$ instead of $\mathrm{AURC}$. This demonstrates the practical relevance of $\mathrm{AUGRC}$ for SC evaluation. CSFs are color-coded and sorted from top (best) to bottom (worst) by average rank based on 500 bootstrap samples from the test dataset to ensure ranking stability. Ranking changes are reflected in changes in the color sequence and highlighted by red arrows. We assess the stability of the rankings for each metric individually using one-sided Wilcoxon signed-rank tests with a 5% significance level based on the bootstrap samples. Adjacent to each ranking, we present the resulting significance maps for the pairwise CSF comparisons. These maps can be interpreted as follows: At each grid position $(x, y)$, filled entries indicate that metric values of CSF $y$ are ranked significantly better than those from CSF $x$ (across bootstrap samples), cross-marks indicate no significant superiority. An ideal ranking exhibits only filled entries above the diagonal.

4.2 Showcasing Implications of AURC Shortcomings on Real-world Data

Figure 4 gives an in-depth look at how the conceptual shortcomings of $\mathrm{AURC}$ affect method assessment on real-world data. The example uses the DG-Res and MCD-PE CSFs on the CIFAR-10 test dataset. Despite DG-Res having lower classification performance and ranking quality than MCD-PE, the $\mathrm{AURC}$ erroneously favors DG-Res over MCD-PE. This violates the monotonicity requirement (R2) and can be attributed to the excessive contributions of only a few high-confidence failures, which aligns with the theoretical findings on failure contributions shown in Figure 2 (R3). In this example, the high-confidence failures are associated with high label ambiguity or incorrect labeling, suggesting that the $\mathrm{AURC}$ may exacerbate the influence of label noise in practice.

Figure 4: The conceptual shortcomings of AURC affect method assessment in practice. We illustrate the practical effects of the excessive weighting of high-confidence failures in $\mathrm{AURC}$ by comparing the performance of two CSFs, DG-Res and MCD-PE, on the CIFAR-10 test dataset. (a) shows the coverage curves based on Selective Risk and Generalized Risk for both CSFs. The $\mathrm{AURC}$ violates the monotonicity requirement (R2) in practice, favoring DG-Res despite a lower classification performance and ranking quality compared to MCD-PE. (b) displays the images associated with the top-$k$ high-confidence failures. For DG-Res, the four failures correspond to the first four peaks in the Selective Risk curve, up to $\text{coverage} \approx 0.27$ (the total number of failures is 446). Only a few high-confidence failures significantly increase the $\mathrm{AURC}$. For both CSFs, the images associated with high-confidence failures exhibit high label ambiguity or are incorrectly labeled, indicating that the $\mathrm{AURC}$ may amplify the influence of label noise in practice.

5 Conclusion

Despite the increasing relevance of SC for reliable translation of machine learning systems to real-world application, we find that the current metrics have significant limitations in providing the comprehensive assessment needed to guide the methodological progress of SC systems.

In this work, we establish a systematic SC evaluation framework which lays a solid foundation for future method development in the field. This contribution promotes the adoption of more comprehensive, interpretable, and task-aligned metrics for comparative benchmarking of SC systems.

We find that none of the existing multi-threshold metrics, particularly the $\mathrm{AURC}$, meet the key requirements we identified for comprehensive SC evaluation, leading to substantial deviations from intended and intuitive performance assessment behaviors. To address this, we introduce the $\mathrm{AUGRC}$ as a suitable metric for comprehensive SC method evaluation. Substantial differences in method rankings between $\mathrm{AURC}$ and $\mathrm{AUGRC}$, demonstrated through extensive empirical studies, highlight the importance of selecting the right SC metric. Thus, we propose the adoption of $\mathrm{AUGRC}$ for meaningful SC performance evaluation.

Acknowledgments and Disclosure of Funding

This work was partly funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science. We thank Lukas Klein, Maximilian Zenk and Fabian Isensee for insightful discussions and feedback.

References

  • Dvijotham et al. [2023] Krishnamurthy Dvijotham, Jim Winkens, Melih Barsbey, Sumedh Ghaisas, Robert Stanforth, Nick Pawlowski, Patricia Strachan, Zahra Ahmed, Shekoofeh Azizi, Yoram Bachrach, et al. Enhancing the reliability and accuracy of ai-enabled diagnosis via complementarity-driven deferral to clinicians. Nature Medicine, 29(7):1814–1820, 2023.
  • Leibig et al. [2022] Christian Leibig, Moritz Brehmer, Stefan Bunk, Danalyn Byng, Katja Pinker, and Lale Umutlu. Combining the strengths of radiologists and ai for breast cancer screening: a retrospective analysis. The Lancet Digital Health, 4(7):e507–e519, 2022.
  • Dembrower et al. [2020] Karin Dembrower, Erik Wåhlin, Yue Liu, Mattie Salim, Kevin Smith, Peter Lindholm, Martin Eklund, and Fredrik Strand. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. The Lancet Digital Health, 2(9):e468–e474, 2020.
  • Yala et al. [2019] Adam Yala, Tal Schuster, Randy Miles, Regina Barzilay, and Constance Lehman. A deep learning model to triage screening mammograms: a simulation study. Radiology, 293(1):38–46, 2019.
  • Geifman and El-Yaniv [2017] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017.
  • Chow [1957] Chi-Keung Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, (4):247–254, 1957.
  • El-Yaniv et al. [2010] Ran El-Yaniv et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010.
  • Liu et al. [2019] Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep gamblers: Learning to abstain with portfolio theory. Advances in Neural Information Processing Systems, 32, 2019.
  • Geifman and El-Yaniv [2019] Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. In International conference on machine learning, pages 2151–2159. PMLR, 2019.
  • Maier-Hein et al. [2022] Lena Maier-Hein, Bjoern Menze, et al. Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
  • Geifman et al. [2018] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018.
  • Jäger et al. [2023] Paul F Jäger, Carsten Lüth, Lukas Klein, and Till Bungert. A call to reflect on evaluation practices for failure detection in image classification. In ICLR 2023, 2023.
  • Bungert et al. [2023] Till J Bungert, Levin Kobelke, and Paul F Jaeger. Understanding silent failures in medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 400–410. Springer, 2023.
  • Cheng et al. [2023] Zhen Cheng, Xu-Yao Zhang, and Cheng-Lin Liu. Unified classification and rejection: A one-versus-all framework. arXiv preprint arXiv:2311.13355, 2023.
  • Zhu et al. [2023a] Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Cheng-Lin Liu. Openmix: Exploring outlier samples for misclassification detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12074–12083, 2023a.
  • Varshney et al. [2020] Neeraj Varshney, Swaroop Mishra, and Chitta Baral. Towards improving selective prediction ability of nlp systems. arXiv preprint arXiv:2008.09371, 2020.
  • Naushad and Voiculescu [2024] Junayed Naushad and ID Voiculescu. Super-trustscore: reliable failure detection for automated skin lesion diagnosis. 2024.
  • Van Landeghem et al. [2024] Jordy Van Landeghem, Sanket Biswas, Matthew Blaschko, and Marie-Francine Moens. Beyond document page classification: Design, datasets, and challenges. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2962–2972, 2024.
  • Van Landeghem et al. [2023] Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. Document understanding dataset and evaluation (dude). In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19528–19540, 2023.
  • Zhu et al. [2022] Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Cheng-Lin Liu. Rethinking confidence calibration for failure prediction. In European Conference on Computer Vision, pages 518–536. Springer, 2022.
  • Xin et al. [2021] Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. The art of abstention: Selective prediction and error regularization for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1040–1051, 2021.
  • Yoshikawa and Okazaki [2023] Hiyori Yoshikawa and Naoaki Okazaki. Selective-lama: Selective prediction for confidence-aware evaluation of language models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2017–2028, 2023.
  • Ding et al. [2020] Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4–5, 2020.
  • Zhu et al. [2023b] Fei Zhu, Xu-Yao Zhang, Zhen Cheng, and Cheng-Lin Liu. Revisiting confidence estimation: Towards reliable failure prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b.
  • Galil and El-Yaniv [2021] Ido Galil and Ran El-Yaniv. Disrupting deep uncertainty estimation without harming accuracy. Advances in Neural Information Processing Systems, 34:21285–21296, 2021.
  • Franc et al. [2023] Vojtech Franc, Daniel Prusa, and Vaclav Voracek. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1–49, 2023.
  • Cen et al. [2023] Jun Cen, Di Luan, Shiwei Zhang, Yixuan Pei, Yingya Zhang, Deli Zhao, Shaojie Shen, and Qifeng Chen. The devil is in the wrongly-classified samples: Towards unified open-set recognition. arXiv preprint arXiv:2302.04002, 2023.
  • Xia and Bouganis [2022] Guoxuan Xia and Christos-Savvas Bouganis. Augmenting softmax information for selective classification with out-of-distribution data. In Proceedings of the Asian Conference on Computer Vision, pages 1995–2012, 2022.
  • Cattelan and Silva [2023] Luís Felipe Prates Cattelan and Danilo Silva. How to fix a broken confidence estimator: Evaluating post-hoc methods for selective classification with deep neural networks. 2023.
  • Tran et al. [2022] Dustin Tran, Jeremiah Liu, Michael W Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, et al. Plex: Towards reliability using pretrained large model extensions. arXiv preprint arXiv:2207.07411, 2022.
  • Kim et al. [2023] Jihyo Kim, Jiin Koo, and Sangheum Hwang. A unified benchmark for the unknown detection capability of deep neural networks. Expert Systems with Applications, 229:120461, 2023.
  • Ashukha et al. [2020] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020.
  • Xia et al. [2024] Guoxuan Xia, Olivier Laurent, Gianni Franchi, and Christos-Savvas Bouganis. Understanding why label smoothing degrades selective classification and how to fix it. arXiv preprint arXiv:2403.14715, 2024.
  • Galil et al. [2023] Ido Galil, Mohammed Dabbah, and Ran El-Yaniv. What can we learn from the selective prediction and uncertainty estimation performance of 523 imagenet classifiers. arXiv preprint arXiv:2302.11874, 2023.
  • Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
  • Kivlichan et al. [2021] Ian D Kivlichan, Zi Lin, Jeremiah Liu, and Lucy Vasserman. Measuring and improving model-moderator collaboration using uncertainty estimation. arXiv preprint arXiv:2107.04212, 2021.
  • Dehghani et al. [2023] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
  • Pugnana and Ruggieri [2023] Andrea Pugnana and Salvatore Ruggieri. Auc-based selective classification. In International Conference on Artificial Intelligence and Statistics, pages 2494–2514. PMLR, 2023.
  • Malinin and Gales [2020] Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650, 2020.
  • Malinin et al. [2021] Andrey Malinin, Neil Band, German Chesnokov, Yarin Gal, Mark JF Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
  • Nadeem et al. [2009] Malik Sajjad Ahmed Nadeem, Jean-Daniel Zucker, and Blaise Hanczar. Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. In Machine Learning in Systems Biology, pages 65–81. PMLR, 2009.
  • Condessa et al. [2017] Filipe Condessa, José Bioucas-Dias, and Jelena Kovačević. Performance measures for classification systems with rejection. Pattern Recognition, 63:437–450, 2017.
  • Wilcoxon [1992] Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics: Methodology and distribution, pages 196–202. Springer, 1992.
  • Wiesenfarth et al. [2021] Manuel Wiesenfarth, Annika Reinke, Bennett A Landman, Matthias Eisenmann, Laura Aguilera Saiz, M Jorge Cardoso, Lena Maier-Hein, and Annette Kopp-Schneider. Methods and open-source toolkit for analyzing and visualizing challenge results. Scientific reports, 11(1):2369, 2021.
  • Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017.
  • [46] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.
  • [47] Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep Gamblers: Learning to Abstain with Portfolio Theory.
  • Corbière et al. [2019] Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, and Patrick Pérez. Addressing failure prediction by learning model confidence. Advances in Neural Information Processing Systems, 32, 2019.
  • DeVries and Taylor [2018] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 7. Granada, Spain, 2011.
  • [51] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images.
  • Le and Yang [2015] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • Koh et al. [2021] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pages 5637–5664. PMLR, 2021.
  • Santurkar et al. [2020] Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020.
  • Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

Appendix A Appendix

A.1 Technical Details

A.1.1 AURC Failure Contribution

Let $\{\tau_t\}_{t=1}^{N}$ provide a total order on the predictions (relaxing the assumption to a partial order leads to a more complicated but very similar formulation). Then, for a rejection threshold $\tau_t$, a failure case at rank $t^*$ contributes to the Selective Risk (Equation 2) by $\frac{1}{t}$ if $t^* \leq t$. Averaging over all thresholds yields a contribution through the failure at rank $t^*$ to the $\mathrm{AURC}$ of $\frac{1}{N}\sum_{t=t^*}^{N}\frac{1}{t}$. For the Generalized Risk, the contribution at $t^* \leq t$ is always $\frac{1}{N}$; summing over the $N - t^* + 1$ thresholds with $t \geq t^*$ and dividing by $N$ results in an overall contribution of $\frac{N-t^*+1}{N^2}$ to the $\mathrm{AUGRC}$. The failure contribution curves are displayed in Figure 2.
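The two contribution terms are easy to tabulate. The snippet below is a minimal sketch of the expressions in this section, assuming a tie-free ranking in which rank 1 is the most confident prediction; the function names are illustrative and not taken from the FD-Shifts code.

```python
def aurc_contribution(t_star, n):
    """Contribution to AURC of one failure at rank t_star among n predictions."""
    # The failure adds 1/t to the Selective Risk at every threshold t >= t_star.
    return sum(1.0 / t for t in range(t_star, n + 1)) / n


def augrc_contribution(t_star, n):
    """Contribution to AUGRC of the same failure."""
    # The failure adds 1/n to the Generalized Risk at each of the
    # n - t_star + 1 thresholds with t >= t_star.
    return (n - t_star + 1) / n**2


n = 1000
for t_star in (1, 10, 100, 1000):
    a = aurc_contribution(t_star, n)
    g = augrc_contribution(t_star, n)
    print(f"rank {t_star:4d}: AURC {a:.6f}  AUGRC {g:.6f}  ratio {a / g:.2f}")
```

For a failure at rank 1 the ratio of the two contributions equals the harmonic number $H_N$ (roughly 7.5 for $N = 1000$), while at rank $N$ both contributions coincide at $1/N^2$.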

A.1.2 Relationship between AUGRC and failure AUROC

Let $Y_{\text{f}} = \mathbb{I}(m(x) \neq y)$ be a binary indicator variable for classification failures. Then, the Generalized Risk corresponds to the joint probability that a sample is accepted and wrongly classified. We can write the $\mathrm{AUGRC}$ as follows:

$$
\begin{split}
\mathrm{AUGRC} &= \int P(Y_{\text{f}}=1,\, g(x)\geq\tau)\,\mathrm{d}P(g(x)\geq\tau)\\
&= \int \big(1-P(Y_{\text{f}}=0 \mid g(x)\geq\tau)\big)\cdot P(g(x)\geq\tau)\,\mathrm{d}P(g(x)\geq\tau)\\
&= \frac{1}{2}-\int P(g(x)\geq\tau \mid Y_{\text{f}}=0)\,P(Y_{\text{f}}=0)\,\mathrm{d}P(g(x)\geq\tau)\\
&\overset{(*)}{=} \frac{1}{2}-\mathrm{acc}\Big[\frac{1}{2}\mathrm{acc}+(1-\mathrm{acc})\cdot\int P(g(x)\geq\tau \mid Y_{\text{f}}=0)\,\mathrm{d}P(g(x)\geq\tau \mid Y_{\text{f}}=1)\Big]\\
&= \frac{1}{2}-\mathrm{acc}\Big[\frac{1}{2}\mathrm{acc}+(1-\mathrm{acc})\cdot\mathrm{AUROC_f}\Big]\\
&= (1-\mathrm{AUROC_f})\cdot\mathrm{acc}\,(1-\mathrm{acc})+\frac{1}{2}(1-\mathrm{acc})^{2}
\end{split}
\tag{8}
$$

At (*), we use that $P(g(x)\geq\tau)=P(g(x)\geq\tau \mid Y_{\text{f}}=0)\cdot\mathrm{acc}+P(g(x)\geq\tau \mid Y_{\text{f}}=1)\cdot(1-\mathrm{acc})$. On choosing two random samples, the second term in the final equation represents the probability that both are failures and the first term represents the probability that one is a failure, the other is not, and that the failure has higher confidence score than the other.

The $\mathrm{AUGRC}$ is monotonic in $\mathrm{acc}$ and $\mathrm{AUROC_f}$ as both partial derivatives are negative $\forall\,\mathrm{acc}\in(0,1),\ \forall\,\mathrm{AUROC_f}\in[0,1]$:

$$
\frac{\partial\,\mathrm{AUGRC}}{\partial\,\mathrm{AUROC_f}} = -\mathrm{acc}\cdot(1-\mathrm{acc}) \tag{9}
$$

$$
\frac{\partial\,\mathrm{AUGRC}}{\partial\,\mathrm{acc}} = 2\cdot\mathrm{AUROC_f}\cdot\mathrm{acc}-\mathrm{acc}-\mathrm{AUROC_f} \tag{10}
$$
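The sign of Equation 10 may be easier to read off after regrouping its terms. The rearrangement below is an added step that follows directly from Equation 10 and is not part of the original derivation:

```latex
\frac{\partial\,\mathrm{AUGRC}}{\partial\,\mathrm{acc}}
  = 2\,\mathrm{AUROC_f}\,\mathrm{acc} - \mathrm{acc} - \mathrm{AUROC_f}
  = -\,\mathrm{acc}\,(1-\mathrm{AUROC_f}) \;-\; \mathrm{AUROC_f}\,(1-\mathrm{acc})
```

For $\mathrm{acc}\in(0,1)$ and $\mathrm{AUROC_f}\in[0,1]$ both summands are non-positive and they cannot vanish simultaneously, so the derivative is strictly negative.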
Figure 5: Visualization of the relationship between AUGRC and $\mathrm{AUROC_f}$. (a) The Selective Risk curve can be transformed into the Generalized Risk curve via multiplication by the respective coverages. The resulting curve is monotonically increasing and bounded by the diagonal; decreasing Selective Risk corresponds to a plateau in Generalized Risk. The $\mathrm{AUGRC}$ corresponds to the $\mathrm{AUGRC}$ of an optimal CSF (shaded red) plus the re-scaled $\mathrm{AUROC_f}$ (shaded in green). The $\mathrm{AUROC_f}$ corresponds to the fraction of the area (parallelogram) enclosed by the green dashed line that lies above the Generalized Risk curve. (b) $\mathrm{AUGRC}$ (color-coded) and its negative gradients (arrows) in the Accuracy-$\mathrm{AUROC_f}$ space.

A.1.3 F1-AUC does not fulfill R2

For the optimal CSF, all failure cases are attributed to lower confidence scores than correct predictions. Hence, the precision $P(Y_{\text{f}}=0 \mid g(x)\geq\tau)$ is one for coverages up to $P(Y_{\text{f}}=0)$ and decreases as $P(Y_{\text{f}}=0)/P(g(x)\geq\tau)$ above. Following the task formulation as defined in Malinin et al. [2021], we can derive the following analytical expression for the optimal F1-AUC:

$$
\begin{split}
\text{F1-AUC}^{*} &= \int_{0}^{1}\text{F1}(\tau)\,\mathrm{d}P(g(x)\geq\tau)\\
&= \int_{0}^{P(Y_{\text{f}}=0)}\frac{2\cdot P(g(x)\geq\tau)}{P(Y_{\text{f}}=0)+P(g(x)\geq\tau)}\,\mathrm{d}P(g(x)\geq\tau)\\
&\quad+\int_{P(Y_{\text{f}}=0)}^{1}\frac{2\cdot P(Y_{\text{f}}=0)}{P(Y_{\text{f}}=0)+P(g(x)\geq\tau)}\,\mathrm{d}P(g(x)\geq\tau)\\
&= 2\cdot\mathrm{acc}\cdot\Big[1+\ln\Big(\frac{1}{4}\cdot\Big(1+\frac{1}{\mathrm{acc}}\Big)\Big)\Big]
\end{split}
\tag{11}
$$

F1-AUC increases with increasing accuracy up to $P(Y_{\text{f}}=0)\approx 0.56$, then it decreases, favoring models with lower classification performance. Thus, the F1-AUC does not fulfill the monotonicity requirement (R2).
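The non-monotonic behavior can be checked numerically. The following is a minimal sketch that evaluates the closed form of Eq. (11) on a grid of accuracies and locates its maximum; the function name `f1_auc_star` and the grid resolution are illustrative choices, not part of the benchmark code.

```python
import math

def f1_auc_star(acc: float) -> float:
    """Closed form of Eq. (11) as a function of the classifier accuracy."""
    return 2.0 * acc * (1.0 + math.log(0.25 * (1.0 + 1.0 / acc)))

# Grid search over accuracies in (0, 1]; the resolution is an arbitrary choice.
grid = [i / 10000 for i in range(1, 10001)]
acc_peak = max(grid, key=f1_auc_star)

print(f"F1-AUC* peaks near acc = {acc_peak:.3f}")
for acc in (0.56, 0.90, 1.00):
    print(f"acc = {acc:.2f} -> F1-AUC* = {f1_auc_star(acc):.4f}")
```

In this sketch, a classifier with accuracy 0.56 receives a higher score than one with accuracy 0.90 or 1.00, which is the violation of R2 described above.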

A.2 Experiment Setup

A.2.1 Datasets and Methods

We compare the following CSFs: from the classifier’s logits we compute the Maximum Softmax Response (MSR), Maximum Logit Score (MLS), Predictive Entropy (PE), and the MLS based on temperature-scaled logits, for which a scalar temperature parameter is tuned based on the validation set [Guo et al., 2017]. Three predictive uncertainty measures are based on Monte Carlo Dropout [Gal and Ghahramani, ]: mean softmax (MCD-MSR), predictive entropy (MCD-PE), and mutual information (MCD-MI). We additionally include the DeepGamblers method [Liu et al., ], which learns a confidence-like reservation score (DG-Res), ConfidNet [Corbière et al., 2019], which is trained as an extension to the classifier, and the work of DeVries and Taylor [2018].
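To make the logit-based and MC-Dropout-based scores concrete, the following is a minimal sketch in plain Python. It is not the FD-Shifts implementation: all function names are hypothetical, and negating the entropy and mutual information so that higher values mean higher confidence is an assumption of this sketch. Temperature scaling and the learned scores (DG-Res, ConfidNet, DeVries and Taylor) are omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Deterministic CSFs from a single logit vector (higher = more confident).
def msr(logits):
    return max(softmax(logits))

def mls(logits):
    return max(logits)

def pe(logits):
    # Assumption: entropy is negated to turn uncertainty into confidence.
    return -entropy(softmax(logits))

def mcd_scores(logit_samples):
    """MC Dropout CSFs from several stochastic forward passes of one input."""
    probs = [softmax(sample) for sample in logit_samples]
    n = len(probs)
    mean_probs = [sum(column) / n for column in zip(*probs)]
    total_entropy = entropy(mean_probs)                    # entropy of the mean
    expected_entropy = sum(entropy(p) for p in probs) / n  # mean of the entropies
    return {
        "MCD-MSR": max(mean_probs),
        "MCD-PE": -total_entropy,
        "MCD-MI": -(total_entropy - expected_entropy),
    }

confident = [4.0, 0.5, -1.0]
uncertain = [1.0, 0.9, 0.8]
print(msr(confident), msr(uncertain))
print(pe(confident), pe(uncertain))
print(mcd_scores([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]))
```

The mutual information is zero when all dropout passes agree and grows with their disagreement, which is why its negation can serve as a confidence score.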

We evaluate SC methods on the FD-Shifts benchmark [Jäger et al., 2023], which considers a broad range of datasets and failure sources through various distribution shifts: SVHN [Netzer et al., 2011], CIFAR-10, and CIFAR-100 [Krizhevsky, ] are evaluated on semantic and non-semantic new-class shifts in a rotating fashion including Tiny ImageNet [Le and Yang, 2015]. Additionally, we consider sub-class shifts on iWildCam [Koh et al., 2021], BREEDS-Entity-13 [Santurkar et al., 2020], CAMELYON-17-Wilds [Koh et al., 2021], and on CIFAR-100 (based on super-classes) as well as corruption shifts based on Hendrycks and Dietterich [2019] on CIFAR-100. The data preparation and splitting are done as described in Jäger et al. [2023]. For the corruption shifts, we reduce the test split size to 75000 (subsampled within each corruption type and intensity level).

The following classifier architectures are used in the benchmark: a small convolutional network on SVHN, VGG-13 [Simonyan and Zisserman, 2014] on CIFAR-10/100, and ResNet-50 [He et al., 2016] on the other datasets.

If the distribution shift is not mentioned explicitly, we evaluate on the respective i.i.d. test datasets.

Our method ranking study focuses on the evaluation of CSF performance based on the existing FD-Shifts benchmark, hence we required no GPUs for the analysis in Section 4. As both $\mathrm{AURC}$ and $\mathrm{AUGRC}$ can be computed efficiently, (CPU) evaluation time for a single test set is less than a minute; evaluation on 500 bootstrap samples on a single CPU core takes around 3 hours.
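The bootstrap evaluation can be sketched as follows. This is an illustrative implementation, not the benchmark code: it assumes that the generalized risk at a given coverage is the fraction of all samples that are both accepted and misclassified, that samples are accepted in order of decreasing confidence, and that the area is approximated by averaging over the N empirical coverage levels. The names `augrc` and `bootstrap_augrc`, the tie handling, and the percentile interval are choices of this sketch.

```python
import random

def augrc(confidences, failures):
    """Empirical AUGRC: mean over coverage levels k/N of (#accepted failures)/N.

    `failures[i]` is 1 if prediction i is wrong and 0 otherwise. Ties in the
    confidence scores are broken by input order (an assumption of this sketch).
    """
    n = len(confidences)
    order = sorted(range(n), key=lambda i: -confidences[i])
    accepted_failures = 0
    area = 0.0
    for i in order:
        accepted_failures += failures[i]
        area += accepted_failures / n  # generalized risk at this coverage
    return area / n

def bootstrap_augrc(confidences, failures, n_boot=500, seed=0):
    """AUGRC on `n_boot` resamples of the test set, drawn with replacement."""
    rng = random.Random(seed)
    n = len(confidences)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(augrc([confidences[i] for i in idx],
                            [failures[i] for i in idx]))
    return scores

# Toy data: the two least confident predictions are the failures.
conf = [0.9, 0.8, 0.7, 0.6]
fail = [0, 0, 1, 1]
point_estimate = augrc(conf, fail)
boot = sorted(bootstrap_augrc(conf, fail))
lower, upper = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot)) - 1]
print(point_estimate, lower, upper)
```

Under this discretization, a classifier that fails on every sample scores (N+1)/(2N), which approaches the upper bound of 0.5 for large test sets.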

A.2.2 Hyperparameters and Model Selection

The experiments are based on the same hyperparameters as reported in Table 4 in Jäger et al. [2023]. Based on the performance on the validation set, we choose the DeepGambler reward hyperparameter and whether to use dropout (for the non-MCD-based CSFs). For the former, we select from $[2.2, 3, 6, 10]$ on Wilds-Camelyon-17, CIFAR-10, and SVHN, from $[2.2, 3, 6, 10, 15]$ on iWildCam and BREEDS-Entity-13, and from $[2.2, 3, 6, 10, 12, 15, 20]$ on CIFAR-100. When performing model selection based on the $\mathrm{AURC}$ metric, we obtain the same configurations as reported in Jäger et al. [2023]. When performing model selection based on the $\mathrm{AUGRC}$ metric, we obtain the parameters as reported in Tab. 1 and Tab. 2. For temperature scaling, we optimize the NLL on the validation set using the L-BFGS algorithm with $\mathrm{lr}=0.01$.

Table 1: Selected DeepGambler reward hyperparameter based on the $\mathrm{AUGRC}$ on the validation set for all confidence scoring functions trained with the DeepGamblers objective.
Method iWildCam BREEDS-Entity-13 Wilds-Camelyon-17 CIFAR-100 CIFAR-10 SVHN
DG-MCD-MSR 15 15 10 20 3 3
DG-PE 15 15 2.2 10 10 3
DG-Res 6 2.2 6 12 2.2 2.2
DG-TEMP-MLS 15 15 2.2 15 10 3
Table 2: Whether or not dropout training has been selected based on the $\mathrm{AUGRC}$ on the validation set. This selection is only done for deterministic confidence scoring methods (no MCD). "1" denotes dropout training and "0" denotes training without dropout.
Method iWildCam BREEDS-Entity-13 Wilds-Camelyon-17 CIFAR-100 CIFAR-10 SVHN
MSR 0 0 1 0 1 1
MLS 0 0 1 1 1 1
PE 0 0 1 0 1 1
ConfidNet 1 0 1 1 1 1
DG-Res 0 1 0 0 0 1
Devries et al. 1 0 1 0 0 1
TEMP-MLS 0 0 1 0 1 1
DG-PE 1 0 0 0 0 1
DG-TEMP-MLS 1 0 0 0 0 1

A.3 Additional Results

Table 3 shows the $\mathrm{AUGRC}$ results for the 13 compared CSFs across all datasets and distribution shifts. While the MSR baseline is not consistently outperformed across all experiments by any of the compared CSFs, MCD improves MSR scores in most settings. Temperature scaling does not exhibit consistent improvement of MLS scores.

Table 3: FD-Shifts benchmark results measured as AUGRC $\times 10^{3}$ (score range: [0, 500], lower is better $\downarrow$). The color heatmap is normalized per column, whereby whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. AUGRC scores are averaged over 10 runs on CAMELYON-17-Wilds, over 2 runs on BREEDS, and over 5 runs on all other datasets. Abbreviations: ncs: new-class shift (s for semantic, ns for non-semantic), iid: independent and identically distributed, sub: sub-class shift, cor: image corruptions, c10/100: CIFAR-10/100, ti: TinyImagenet.
iWildCam BREEDS CAMELYON CIFAR-100 CIFAR-10 SVHN
study     iid sub    iid sub     iid sub     iid sub cor s-ncs ns-ncs     iid cor s-ncs ns-ncs     iid ns-ncs
ncs-data set                 c10 svhn ti    c100 svhn ti    c10 c100 ti
MSR     48.0 55.0    6.86 117     8.13 91.3     54.6 150 187 210 334 203    4.93 67.9 166 291 154    3.44 48.5 48.2 48.6
MLS     60.3 62.4    9.28 121     8.13 91.3     65.6 155 200 213 337 195    5.56 68.4 163 288 150    3.97 46.5 46.1 46.6
PE     48.2 55.2    6.87 116     8.13 91.3     56.4 150 188 210 333 202    4.91 67.2 165 290 152    3.44 47.9 47.5 48.0
MCD-MSR     42.8 54.3    6.89 116     5.64 95.1     53.1 146 176 210 336 202    4.56 60.4 166 294 156    3.40 47.3 47.4 47.9
MCD-PE     43.4 55.1    6.98 116     5.64 95.1     56.6 146 179 210 335 198    4.73 60.6 164 293 153    3.47 46.3 46.4 46.9
MCD-MI     43.7 58.1    7.28 118     6.13 101     59.4 148 182 211 333 197    4.99 64.9 167 302 161    3.55 47.0 47.7 48.2
ConfidNet     82.0 91.2    6.89 117     4.36 86.1     57.5 154 192 214 340 200    4.70 65.4 165 291 153    3.43 48.6 48.4 48.8
DG-MCD-MSR     43.9 59.6    6.31 112     3.47 149     52.5 145 175 210 336 203    4.75 60.8 168 296 159    3.38 47.6 47.7 48.0
DG-Res     66.5 63.8    9.39 122     3.46 124     64.3 230 194 215 330 197    4.40 63.8 167 287 151    4.09 46.8 46.2 46.6
Devries et al.     61.7 74.7    8.24 120     24.3 145     69.6 150 214 222 342 211    4.57 70.8 166 287 152    5.56 48.3 48.5 49.9
TEMP-MLS     48.2 55.1    6.86 116     8.13 91.3     54.8 150 187 210 333 203    4.92 67.6 166 291 153    3.44 48.3 47.9 48.4
DG-PE     52.0 69.7    6.33 114     3.46 123     55.2 151 186 210 329 197    4.09 62.9 167 288 154    3.41 48.9 48.4 48.7
DG-TEMP-MLS     52.2 69.9    6.35 114     3.46 123     55.0 222 184 211 331 199    4.09 63.1 167 288 155    3.41 49.0 48.5 48.8
Table 4: Comparing rankings of AURC $\rightarrow \alpha$ versus AUGRC $\rightarrow \beta$. Differences in the method rankings between $\mathrm{AURC}$ and $\mathrm{AUGRC}$ demonstrate the relevance of the pitfalls of the $\mathrm{AURC}$ discussed in Section 2.4. Upper half: Model selection for dropout and DG hyperparameter was done for both metrics separately. Lower half: Model selection was done based on $\mathrm{AUGRC}$ for both $\alpha$ and $\beta$. The selected hyperparameters are reported in Appendix A.2.2. The color heatmap is normalized per column, whereby whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. AUGRC scores are averaged over 10 runs on CAMELYON-17-Wilds, over 2 runs on BREEDS, and over 5 runs on all other datasets. Abbreviations: ncs: new-class shift (s for semantic, ns for non-semantic), iid: independent and identically distributed, sub: sub-class shift, cor: image corruptions, c10/100: CIFAR-10/100, ti: TinyImagenet.
iWildCam     BREEDS     CAMELYON     CIFAR-100     CIFAR-10     SVHN
study     iid     sub     iid     sub     iid     sub     iid     sub     cor     s-ncs     ns-ncs     iid     cor     s-ncs     ns-ncs     iid     ns-ncs
ncs-data set                                         c10     svhn     ti             c100     svhn     ti         c10     c100     ti
metric     α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β    α β
optimized both (w.r.t. dropout and DG reward parameter)
MSR     5 5    6 2    6 4    7 9    9 9    2 2    3 3    5 5    7 7    3 1    3 6    11 10    12 12    11 11    5 6    9 7    8 9    8 6    10 10    9 9    9 9   
MLS     10 10    10 8    13 12    13 12    9 9    2 2    11 12    7 11    12 12    9 10    7 11    1 1    13 13    12 12    1 1    2 3    1 1    11 11    3 3    1 1    1 1   
PE     6 6    7 5    8 6    6 4    9 9    2 2    10 7    9 5    11 9    10 1    12 4    6 8    10 10    9 9    4 4    6 6    2 3    8 6    7 7    6 6    6 6   
MCD-MSR     1 1    1 1    3 7    3 4    6 6    6 7    2 2    2 2    2 2    1 1    9 8    9 8    3 4    2 2    5 6    12 12    10 12    4 2    5 5    5 5    5 5   
MCD-PE     2 2    2 3    4 9    3 4    6 6    6 7    7 8    2 2    3 3    1 1    6 7    3 4    5 7    3 3    3 2    11 11    8 6    6 9    1 1    4 4    4 4   
MCD-EE     3 3    3 6    5 10    3 4    8 8    6 6    8 9    2 2    4 4    6 8    7 8    6 6    7 9    1 1    2 2    9 7    2 3    8 10    2 2    3 3    3 3   
ConfidNet     13 13    13 13    10 7    9 9    5 5    1 1    9 10    7 10    9 10    11 11    11 12    4 6    9 6    8 8    9 4    6 7    6 6    5 5    11 11    10 10    10 11   
DG-MCD-MSR     4 4    9 7    1 1    1 1    4 4    12 13    1 1    1 1    1 1    3 1    10 8    11 10    6 8    4 4    10 13    13 13    13 13    1 1    6 6    7 7    7 6   
DG-Res     12 12    11 9    12 13    12 13    1 1    9 11    12 11    13 13    10 11    12 12    1 2    2 2    8 3    7 7    13 10    3 1    5 2    12 12    4 4    2 2    2 1   
Devries et al.     11 11    12 12    11 11    11 11    13 13    13 12    13 13    10 5    13 13    13 13    13 13    13 13    4 5    13 13    5 6    1 1    2 3    13 13    8 8    13 12    13 13   
TEMP-MLS     7 6    8 3    8 4    7 4    9 9    2 2    5 4    5 5    8 7    3 1    3 4    10 10    11 11    10 10    5 6    8 7    6 6    7 6    8 8    8 8    8 8   
DG-PE     8 8    4 10    7 2    10 2    1 1    10 9    6 6    11 9    6 6    6 1    2 1    5 2    1 1    5 5    11 10    4 3    10 9    2 3    12 12    10 10    10 10   
DG-TEMP-MLS     8 9    4 11    2 3    2 2    1 1    10 9    4 5    12 12    5 5    8 8    3 3    6 5    1 1    6 6    11 10    5 3    12 11    3 3    13 13    12 12    12 11   
optimized AUGRC (w.r.t. dropout and DG reward parameter)
MSR     5 5    4 2    7 4    8 9    9 9    2 2    3 3    7 5    7 7    4 1    5 6    11 10    12 12    11 11    5 6    9 7    8 9    8 6    10 10    9 9    9 9   
MLS     10 10    8 8    13 12    13 12    9 9    2 2    11 12    10 11    12 12    10 10    8 11    1 1    13 13    12 12    1 1    2 3    1 1    11 11    3 3    1 1    1 1   
PE     6 6    5 5    8 6    7 4    9 9    2 2    9 7    6 5    9 9    1 1    4 4    8 8    10 10    9 9    4 4    6 6    2 3    8 6    7 7    6 6    6 6   
MCD-MSR     1 1    1 1    4 7    4 4    6 6    6 7    2 2    2 2    2 2    1 1    10 8    8 8    3 4    2 2    5 6    12 12    10 12    4 2    5 5    5 5    5 5   
MCD-PE     2 2    2 3    5 9    4 4    6 6    6 7    7 8    2 2    3 3    1 1    7 7    4 4    5 7    3 3    3 2    11 11    8 6    6 9    1 1    4 4    4 4   
MCD-EE     3 3    3 6    6 10    4 4    8 8    6 6    8 9    2 2    4 4    7 8    8 8    7 6    7 9    1 1    2 2    9 7    2 3    8 10    2 2    3 3    3 3   
ConfidNet     13 13    13 13    10 7    10 9    5 5    1 1    10 10    10 10    10 10    11 11    12 12    5 6    9 6    8 8    9 4    6 7    6 6    5 5    11 11    10 10    10 11   
DG-MCD-MSR     4 4    7 7    1 1    1 1    4 4    12 13    1 1    1 1    1 1    4 1    11 8    11 10    6 8    4 4    10 13    13 13    13 13    1 1    6 6    7 7    7 6   
DG-Res     12 12    9 9    12 13    12 13    1 1    9 11    12 11    13 13    11 11    12 12    1 2    2 2    8 3    7 7    13 10    3 1    5 2    12 12    4 4    2 2    2 1   
Devries et al.     11 11    12 12    11 11    11 11    13 13    13 12    13 13    5 5    13 13    13 13    13 13    13 13    4 5    13 13    5 6    1 1    2 3    13 13    8 8    13 12    13 13   
TEMP-MLS     7 6    6 3    8 4    8 4    9 9    2 2    4 4    7 5    8 7    4 1    5 4    10 10    11 11    10 10    5 6    8 7    6 6    7 6    8 8    8 8    8 8   
DG-PE     8 8    10 10    1 2    2 2    1 1    10 9    5 6    9 9    6 6    7 1    1 1    3 2    1 1    5 5    11 10    4 3    10 9    2 3    12 12    10 10    10 10   
DG-TEMP-MLS     9 9    11 11    3 3    2 2    1 1    10 9    6 5    12 12    5 5    9 8    3 3    5 5    1 1    6 6    11 10    5 3    12 11    3 3    13 13    12 12    12 11
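
The β ranks in Table 4 are consistent with competition ranking, in which tied scores share the smallest rank, applied to the scores of Table 3. The following sketch illustrates this for the CAMELYON iid column. It assumes that the rounded values displayed in Table 3 suffice to reproduce the ranks, and that the row labeled MCD-EE in Table 4 corresponds to MCD-MI in Table 3; the helper `competition_ranks` is a hypothetical name.

```python
def competition_ranks(scores):
    """Rank lower-is-better scores; tied scores share the smallest rank."""
    return {
        name: 1 + sum(other < score for other in scores.values())
        for name, score in scores.items()
    }

# AUGRC x 10^3 on the CAMELYON iid test set, copied from Table 3.
camelyon_iid = {
    "MSR": 8.13, "MLS": 8.13, "PE": 8.13,
    "MCD-MSR": 5.64, "MCD-PE": 5.64, "MCD-MI": 6.13,
    "ConfidNet": 4.36, "DG-MCD-MSR": 3.47, "DG-Res": 3.46,
    "Devries et al.": 24.3, "TEMP-MLS": 8.13,
    "DG-PE": 3.46, "DG-TEMP-MLS": 3.46,
}

ranks = competition_ranks(camelyon_iid)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{rank:2d}  {name}")
```

Because three methods tie for rank 1 and four tie for rank 9, ranks 2, 3, 10, 11, and 12 do not appear in that column.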