License: arXiv.org perpetual non-exclusive license
arXiv:2312.10686v2 [cs.CV] 19 Dec 2023

Out-of-Distribution Detection in Long-Tailed Recognition
with Calibrated Outlier Class Learning

Wenjun Miao1, Guansong Pang2*, Xiao Bai1,3, Tianqi Li1, J. Zheng1,4
* Corresponding authors: G. Pang ([email protected]) and J. Zheng ([email protected])
Abstract

Existing out-of-distribution (OOD) detection methods have shown great success on balanced datasets but become ineffective in long-tailed recognition (LTR) scenarios, where 1) OOD samples are often wrongly classified into head classes and/or 2) tail-class samples are treated as OOD samples. To address these issues, current studies fit a prior distribution of auxiliary/pseudo OOD data to the long-tailed in-distribution (ID) data. However, it is difficult to obtain an accurate prior distribution because real OOD samples are unknown and the class imbalance in LTR is heavy. A straightforward way to avoid the need for this prior is to learn an outlier class that encapsulates the OOD samples. The main challenge is then to tackle the aforementioned confusion between OOD samples and head/tail-class samples when learning the outlier class. To this end, we introduce a novel calibrated outlier class learning (COCL) approach, in which 1) a debiased large margin learning method is introduced in the outlier class learning to distinguish OOD samples from both head and tail classes in the representation space and 2) an outlier-class-aware logit calibration method is defined to enhance the long-tailed classification confidence. Extensive empirical results on three popular benchmarks, CIFAR10-LT, CIFAR100-LT, and ImageNet-LT, demonstrate that COCL substantially outperforms state-of-the-art OOD detection methods in LTR while also improving the classification accuracy on ID data. Code is available at https://github.com/mala-lab/COCL.

Figure 1: Visualization and qualitative results on test data of CIFAR100-LT using an LTR model augmented with an outlier learning module (see Eq. 2) for OOD detection. (a) Feature representations of samples randomly selected from the head class, tail class, and OOD samples. The gray areas highlight obscure regions between head/tail samples and OOD samples. (b) The mean prediction confidence of the model classifying six OOD datasets into one of the ID classes. (c) The mean OOD score (i.e., the softmax probability of the outlier class) of samples from each ID class.

Introduction

Deep neural networks (DNNs) have achieved remarkable success across various fields (Russakovsky et al. 2015; Krizhevsky, Sutskever, and Hinton 2017). However, their application in real-world scenarios, such as autonomous driving (Kendall and Gal 2017) and medical diagnosis (Leibig et al. 2017), remains challenging due to the presence of long-tailed distributions and unknown classes (Huang and Li 2021; Wang et al. 2020b). In particular, DNNs often make high-confidence predictions that classify out-of-distribution (OOD) samples from unknown classes as one of the known classes. This issue is further amplified when the in-distribution (ID) data has a class-imbalanced/long-tailed distribution (Zhu et al. 2023; Li et al. 2022; Li, Cheung, and Lu 2022; Wang et al. 2022). As illustrated in Fig. 1(a), DNNs trained on long-tailed data can be heavily biased towards head classes (the majority classes) due to the overwhelming presence of samples from these classes. As a result, long-tailed recognition (LTR) models often misclassify OOD samples into the head classes with high confidence (see Fig. 1(b)). Further, LTR models tend to treat tail samples as OOD samples because tail samples are rare in the training data, i.e., tail samples often have a much higher OOD score than head samples (see Fig. 1(c)).

Compared to OOD detection on balanced ID datasets, significantly less work has been done in LTR scenarios. Recent studies (Wang et al. 2022; Wei et al. 2022a; Jiang et al. 2023; Choi, Jeong, and Choi 2023) are among the seminal works exploring OOD detection in LTR. The current methods in this line focus on distinguishing OOD samples from ID samples through outlier exposure (OE) (Hendrycks, Mazeika, and Dietterich 2018), which fits auxiliary/pseudo OOD data to a prior distribution (e.g., a uniform distribution) over the ID classes. However, unlike balanced ID datasets, LTR datasets have a heavily skewed ID distribution, so the commonly adopted uniform prior becomes ineffective. Estimating this prior from the sample size of the ID classes is a simple way to alleviate this issue, but it can intensify the LTR models' bias toward head classes. Another line of work focuses on learning discriminative representations to separate OOD samples from tail samples. However, the lack of sufficient samples in the tail classes renders this approach less effective. Furthermore, it often fails to distinguish head samples from OOD samples.

In this work, we aim to synthesize both approaches and introduce a novel approach, namely calibrated outlier class learning (COCL). Intuitively, a straightforward way to avoid the need for this prior in the OE-based approach is to learn an outlier class that encapsulates the OOD samples. The main challenge is then to tackle the aforementioned confusion between OOD samples and head/tail-class samples when learning the outlier class. To address this challenge, we introduce a debiased large margin learning method, which is jointly optimized with the outlier class learning to distinguish OOD samples from both head and tail classes in the representation space. We further introduce an outlier-class-aware logit calibration method that takes the outlier class into account when calibrating the ID prediction probability. This helps enhance the long-tailed classification confidence while improving the OOD detection performance. In summary, our main contributions are as follows:

  • We show that outlier class learning is generally more effective for OOD detection in LTR than fitting to a prior distribution when auxiliary OOD data is available.

  • We then introduce a novel calibrated outlier class learning (COCL) approach that learns an accurate LTR model with a strong OOD detector that effectively mitigates the biases towards head and OOD samples. To this end, we introduce two components, debiased large margin learning and outlier-class-aware logit calibration, which operate in the training and inference stages, respectively, enabling substantially improved OOD detection and long-tailed classification performance.

  • Extensive empirical results on three popular benchmarks, CIFAR10-LT, CIFAR100-LT, and ImageNet-LT, demonstrate that COCL substantially outperforms state-of-the-art OOD detection methods in LTR while improving the classification accuracy on ID data.
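
To make the outlier class idea concrete, the following is a minimal sketch of plain outlier class learning, not the full COCL objective. It assumes a classifier with K ID outputs plus one extra output for the outlier class, trained with ordinary cross-entropy in which auxiliary OOD samples receive the label K. The OOD score is the softmax probability of the outlier class, as in the Figure 1 caption. The function names and the toy logits are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def outlier_class_loss(logits, label):
    """Cross-entropy over K+1 outputs.

    label in [0, K-1] marks an ID sample; label == K marks an
    auxiliary OOD sample assigned to the extra outlier class.
    """
    return -math.log(softmax(logits)[label])


def ood_score(logits):
    """Softmax probability of the outlier class (the last output)."""
    return softmax(logits)[-1]


K = 3  # number of ID classes in this toy example (assumption)
id_logits = [4.0, 1.0, 0.5, -2.0]   # hypothetical logits for an ID sample
ood_logits = [0.2, 0.1, 0.0, 3.0]   # hypothetical logits for an OOD sample

print(ood_score(id_logits), ood_score(ood_logits))
```

Because the outlier class is learned directly from the auxiliary data, no target distribution over the ID classes has to be specified for OOD samples, which is the property the first contribution refers to.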

Related Work

OOD Detection

The objective of this task is to determine whether a given input sample belongs to known classes (in-distribution) or unknown classes (out-of-distribution). In recent years, OOD detection has been developed extensively, including post-hoc strategies (Sun, Guo, and Li 2021; Wang et al. 2023; Zhang and Xiang 2023) and training-time strategies (Liu et al. 2020; Wei et al. 2022b; Tian et al. 2022; Yu et al. 2023; Li et al. 2023; Liu et al. 2023). The post-hoc methods focus on devising new OOD scoring functions in the inference phase, such as MSP (Hendrycks and Gimpel 2016), Mahalanobis distance (Lee et al. 2018), and Gram matrix (Sastry and Oore 2020). The training-time methods focus on separating OOD samples from ID samples by utilizing auxiliary data during training. Outlier exposure (OE) (Hendrycks, Mazeika, and Dietterich 2018) is arguably the most popular approach in this line; it utilizes the OOD data by enforcing a uniform distribution of its prediction probability over the ID classes. EnergyOE (Liu et al. 2020) improves on OE by maximizing the free energy of OOD samples instead. UDG (Yang et al. 2021) introduces unsupervised dual grouping to leverage unlabelled auxiliary data for OOD detection. However, all these methods focus on cases with balanced ID training data and fail to work well on imbalanced ID datasets.
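
As a rough illustration of the scoring functions and losses mentioned above, the sketch below computes the MSP score, the OE loss against a uniform target, and the free energy of a logit vector. It is written in plain Python for self-containment. The function names, the default temperature of 1.0, and the toy logits are assumptions made for this example and are not taken from the cited implementations.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def msp_score(logits):
    """Maximum softmax probability; low values suggest an OOD input."""
    return math.exp(max(log_softmax(logits)))


def oe_uniform_loss(logits):
    """Cross-entropy between a uniform target and the softmax prediction.

    Minimized when the prediction on an auxiliary OOD sample is uniform
    over the ID classes.
    """
    logp = log_softmax(logits)
    return -sum(logp) / len(logp)


def free_energy(logits, temperature=1.0):
    """Free energy -T * log(sum(exp(z / T))); higher suggests OOD."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


confident = [10.0, 0.0, 0.0]  # hypothetical logits for a confident prediction
flat = [0.0, 0.0, 0.0]        # hypothetical logits for a uniform prediction

print(msp_score(confident), msp_score(flat))
print(oe_uniform_loss(confident), oe_uniform_loss(flat))
print(free_energy(confident), free_energy(flat))
```

The uniform target in `oe_uniform_loss` is exactly the prior that becomes questionable when the ID classes are long-tailed, since a uniform prediction no longer reflects how the model treats head and tail classes.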

Long-Tailed Recognition (LTR)

LTR aims to improve the accuracy of tail classes with minimal impact on the head classes. Re-sampling (Wang et al. 2020a; Tang et al. 2022; Bai et al. 2023) and re-weighting (Tan et al. 2020; Alshammari et al. 2022; Gou et al. 2023; Hong et al. 2023), which focus on balancing the ratio between head and tail classes, are the most straightforward solutions for LTR. Two-stage methods (Kang et al. 2019; Nam, Jang, and Lee 2023) are another approach: they retrain the classifier on a re-balanced dataset during the fine-tuning stage, leading to significant improvement in LTR. Additionally, logit adjustment (LA) (Menon et al. 2020) has emerged as an effective statistical framework that can be applied in both the training and inference phases to further enhance the ID recognition performance. Although these LTR methods perform well in the long-tailed classification of ID samples, they have no explicit design for handling OOD samples.
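
For intuition, the inference-time variant of logit adjustment can be sketched as subtracting the scaled log class prior from each logit, which boosts rare classes relative to frequent ones. The class counts, the logits, and the scaling parameter `tau` below are hypothetical values chosen for illustration.

```python
import math


def class_priors(class_counts):
    """Empirical class priors from the training-set class sizes."""
    total = sum(class_counts)
    return [c / total for c in class_counts]


def logit_adjust(logits, priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) per class."""
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


counts = [500, 50, 5]     # hypothetical head, medium, and tail class sizes
logits = [2.0, 1.8, 1.7]  # hypothetical raw logits, slightly favoring the head class
adjusted = logit_adjust(logits, class_priors(counts))

print(argmax(logits), argmax(adjusted))
```

In this toy case the raw prediction is the head class, while the adjusted prediction moves to the tail class because its small prior yields the largest correction. Setting `tau` to 0 recovers the unadjusted logits.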

Table 1: Comparison of outlier exposure (OE) and outlier class learning (OCL) approaches when combined with three LTR methods. All methods are trained on CIFAR10/100-LT using ResNet18. Reported are the average performance across six different OOD test sets (including CIFAR, Texture, SVHN, LSUN, Places365, and TinyImagenet) in the commonly-used SC-OOD detection benchmark (Yang et al. 2021) (See the Experiments section for the description of evaluation measures).
                                          CIFAR10-LT                                   CIFAR100-LT
OOD Method              LTR Method        AUC↑     AP-in↑   AP-out↑  FPR↓     ACC↑     AUC↑     AP-in↑   AP-out↑  FPR↓     ACC↑
OE                      None (baseline)   89.76    89.45    87.22    53.19    73.59    73.52    75.06    67.27    86.30    39.42
                        Re-weight         89.34    88.63    86.39    56.24    70.35    73.08    73.86    66.05    87.22    39.45
                        τ-norm            89.58    88.21    85.88    52.84    73.33    73.62    74.67    66.59    86.02    40.87
                        LA                89.46    88.74    86.39    53.38    73.93    73.44    74.33    66.48    86.13    42.06
Outlier class learning  None (baseline)   89.91    88.15    90.38    41.13    74.48    73.56    74.12    69.65    81.93    41.54
                        Re-weight         90.45    89.12    90.58    38.86    74.84    74.23    74.29    70.68    79.45    42.06
                        τ-norm            90.95    89.59    91.11    37.91    75.14    74.57    75.12    70.76    81.27    44.21
                        LA                91.56    90.52    91.51    36.50    76.67    74.77    75.15    71.13    80.33    43.02
Our method              COCL              93.28    92.24    92.89    30.88    81.56    78.25    79.37    73.58    74.09    46.41

Table 2: Comparison results on CIFAR10-LT.
Method          AUC↑     AP-in↑   AP-out↑  FPR↓     ACC↑
MSP             74.33    73.96    72.14    85.33    72.17
OE              89.76    89.45    87.22    53.19    73.59
EnergyOE        91.92    91.03    91.97    33.80    74.57
OCL             89.91    88.15    90.38    41.13    74.48
PASCL           90.99    90.56    89.24    42.90    77.08
Open Sampling   91.94    91.08    89.35    36.92    75.78
Class Prior     92.08    91.17    90.86    34.42    74.33
BERL            92.56    91.41    91.94    32.83    81.37
COCL (Ours)     93.28    92.24    92.89    30.88    81.56

Table 3: Comparison results on CIFAR100-LT.
Method          AUC↑     AP-in↑   AP-out↑  FPR↓     ACC↑
MSP             63.93    64.71    60.76    89.71    40.51
OE              73.52    75.06    67.27    86.30    39.42
EnergyOE        76.40    77.32    72.24    76.33    41.32
OCL             73.56    74.12    69.65    81.93    41.54
PASCL           73.32    74.84    67.18    79.38    43.10
Open Sampling   74.37    75.80    70.42    78.18    40.87
Class Prior     76.03    77.31    72.26    76.43    40.77
BERL            77.75    78.61    73.10    74.86    45.88
COCL (Ours)     78.25    79.37    73.58    74.09    46.41
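
For readers unfamiliar with the detection measures in these tables, the sketch below shows one common way to compute AUC and FPR from OOD scores. It rests on assumptions that this section does not confirm: OOD samples are treated as the positive class, a higher score means "more OOD", and FPR is measured at a 95% true positive rate. The measures actually used are described in the Experiments section. The function names and toy scores are hypothetical.

```python
import math


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of AUROC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """False positive rate at the threshold that detects `tpr` of positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of positives that must be detected
    threshold = ranked[k - 1]
    false_pos = sum(1 for n in neg_scores if n >= threshold)
    return false_pos / len(neg_scores)


ood_scores = [0.9, 0.8, 0.7, 0.6]   # hypothetical scores for OOD samples (positives)
id_scores = [0.1, 0.2, 0.65, 0.3]   # hypothetical scores for ID samples (negatives)

print(auroc(ood_scores, id_scores))       # 0.9375
print(fpr_at_tpr(ood_scores, id_scores))  # 0.25
```

Under these assumptions a higher AUC and a lower FPR both indicate better separation between ID and OOD samples, matching the arrow directions in the table headers.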
(b) Comparison results with different competing methods. The results are averaged over the six OOD test datasets in (a).