Detecting Out-of-distribution through the Lens of Neural Collapse

Litian Liu
MIT
[email protected]
Yao Qin
UC Santa Barbara
[email protected]
Abstract

Efficient and versatile Out-of-Distribution (OOD) detection is essential for the safe deployment of AI yet remains challenging for existing algorithms. Inspired by Neural Collapse, we discover that features of in-distribution (ID) samples cluster closer to the weight vectors than features of OOD samples. In addition, we reveal that ID features tend to expand in space to form a simplex Equiangular Tight Frame (ETF), which explains the prevalent observation that ID features reside further from the origin than OOD features. Taking both insights from Neural Collapse into consideration, we propose to leverage feature proximity to weight vectors for OOD detection and further complement this perspective by using feature norms to filter OOD samples. Extensive experiments on off-the-shelf models demonstrate the efficiency and effectiveness of our method across diverse classification tasks and model architectures, enhancing the generalization capability of OOD detection.

1 Introduction

Machine learning models deployed in practice will inevitably encounter samples that deviate from the training distribution. As a classifier cannot make meaningful predictions on test samples that belong to classes unseen during training, it is important to actively detect and handle Out-of-Distribution (OOD) samples. Considering the diverse and oftentimes time-critical application scenarios, an OOD detector should be computationally efficient and should generalize effectively across various scenarios.

In this work, we focus on post-hoc methods, which address OOD detection independently of the training process. One line of prior work designs OOD scores over the model output space (Djurisic et al., 2022; Hendrycks et al., 2019; Liang et al., 2018; Liu et al., 2020; Sun et al., 2021; Sun and Li, 2022), and another line of work focuses on the feature space, where OOD samples are observed to deviate from the clusters of ID samples (Lee et al., 2018; Mahalanobis, 2018; Sun et al., 2022; Tack et al., 2020). While existing research has made strides in OOD detection, it still faces two major challenges: 1) maintaining detection effectiveness across different scenarios, and 2) ensuring computational efficiency for real-world deployment. For example, both output space and feature space methods suffer from performance discrepancy across different classification tasks, as shown in Table 1(a). Specifically, state-of-the-art algorithms on CIFAR-10 (Krizhevsky et al., 2009) OOD benchmarks perform suboptimally on ImageNet (Deng et al., 2009) OOD benchmarks, and vice versa. Such discrepancy is also observed across different architectures, as shown in Table 2. In addition, feature space methods, which rely on auxiliary models, raise efficiency concerns. For example, Lee et al. (2018) learn a Gaussian mixture model from training features and detect OOD samples based on the Mahalanobis distance (Mahalanobis, 2018); Sun et al. (2022) record the training features and measure OOD-ness based on the k-th nearest neighbor distance to the training features. As shown in Liu and Qin (2023), such reliance on auxiliary models introduces additional cost, posing challenges for time-critical applications.

Figure 1: Illustration of our framework inspired by Neural Collapse. Left: On the penultimate layer, ID samples cluster near the weight vectors (marked by stars) of their predicted classes while OOD samples reside apart, as shown by UMAP (McInnes et al., 2018). Middle: ID and OOD samples are separated by our proposed $\mathtt{pScore}$ (Equation 6), which measures feature proximity to weight vectors. Additionally, ID samples tend to reside further from the origin, illustrated with $\mathtt{L1}$ norms. Right: ID samples cluster near a simplex Equiangular Tight Frame (ETF), illustrated with black arrows denoting weight vectors. We detect OOD samples by thresholding on $\mathtt{pScore}$, which selects the blue-shaded hypercones centered at the weight vectors and leaves OOD samples outside these areas. We also filter out OOD samples characterized by smaller feature norms. Left and Middle visualize a CIFAR-10 ResNet-18 classifier with OOD set SVHN. Right depicts our scheme on a three-class classifier with a 2D penultimate space.

To this end, we aim to develop an efficient and versatile OOD detector. Specifically, we delve into the penultimate layer, i.e., the layer before the linear classification head, and revisit the observation that ID features tend to form clusters while OOD features reside apart. While this observation is well-established in prior literature (Lee et al., 2018; Sun et al., 2022; Tack et al., 2020), the underlying mechanism remains largely unexplained. To better understand and characterize the observation, we draw on Neural Collapse (Papyan et al., 2020), which characterizes the interplay between the linear classification head and the penultimate layer features as training epochs go to infinity. Notably, the Neural Collapse phenomena are observed to be prevalent across canonical classification tasks under diverse architectures (see Appendix E). One particular phenomenon that Neural Collapse reveals is that features of each class gradually converge towards a single point during training. This explains the prevalent clustering observed on off-the-shelf models as a precursor to the final convergence. Therefore, we leverage the landscape of Neural Collapse to understand:

Where do features of ID samples form clusters?

To address the question, we first demonstrate that, as a deterministic effect of Neural Collapse, features of training samples converge towards the weight vectors of the predicted class. Additionally, Neural Collapse reveals that training features also converge towards a simplex Equiangular Tight Frame (ETF) (Equation 1). The spatial structure of an ETF, illustrated in Figure 1 Right, corresponds to the maximum separation achievable in space by equiangular vectors, requiring the features to be sufficiently far from the origin.

Figure 2: Features of ID samples tend to cluster closer to the predicted class weight vectors, indicated by higher average cosine similarity (Equation 5) compared to OOD features. Evaluation across the training of a CIFAR-10 ResNet-18 classifier, with OOD set SVHN.

Such a landscape of Neural Collapse sheds light on the geometric structure of ID clusters. Specifically, for ID test samples, drawn from the same distribution as training samples, we anticipate a similar clustering behavior towards the weight vectors and towards an ETF. Conversely, OOD samples do not undergo the training process that enables the model to align features with weight vectors and to expand features to accommodate the spatial structure of an ETF under Neural Collapse. Therefore, we do not expect the model to effectively align the weight vectors learned from ID features with unseen OOD features. Nor do we anticipate the model to place OOD features far from the origin to form an ETF. To validate our hypotheses, we trace a model's training stages in Figure 2. We observe that ID samples consistently cluster closer to the weight vectors than OOD samples. This observation is reinforced by our UMAP (McInnes et al., 2018) visualization of an off-the-shelf CIFAR-10 classifier with a ResNet-18 backbone in Figure 1 Left. Here, ID features cluster near predicted class weight vectors (marked by stars), while OOD features are distant. Combining our observation with the results of Zhu et al. (2021), which show that the weight vectors form an ETF in training, we conclude that ID features are driven to form the ETF during training, whereas OOD features have no incentive to expand in space to form an ETF. Note that the lack of incentive for OOD features to expand in space explains the well-established observation (Huang et al., 2021; Park et al., 2023; Sun et al., 2022; Tack et al., 2020) that OOD features tend to reside closer to the origin.

Based on our understanding, we design an efficient and versatile OOD detector. We first leverage feature proximity to the weight vectors to characterize ID clustering, bypassing auxiliary models and reducing the computational cost. Specifically, we define an angle-based proximity score as the norm of the projection of the weight vector of the predicted class onto the sample feature. As shown in Figure 1 Middle, our proximity score can effectively separate ID from OOD samples. A higher score indicates closer proximity and a lower chance of OOD-ness. Geometrically, thresholding on the score selects hyper-cones centered at the weight vectors, as illustrated in Figure 1 Right. Notably, our proximity score incorporates class-specific information, which brings performance benefits as well as efficiency gains. To complement the proximity score, which relies on ID clustering, we also consider feature distance to the origin. Specifically, ID features tend to reside further from the origin as they expand in space to form an ETF, whereas OOD features tend to reside near the origin, as illustrated by Figure 1 Right. Using the L1 norm as an example metric for distance to the origin, we observe that ID features can be separated from OOD features, as supported by Figure 1 Middle. Combining both aspects, we propose the Neural Collapse Inspired OOD Detector ($\mathtt{NCI}$).

Notably, prior methods, e.g., KNN (Sun et al., 2022), focus on ID clustering but do not explicitly consider feature distance to the origin. Such approaches fall short in scenarios like ImageNet benchmarks but yield superior performance on CIFAR-10 benchmarks in Table 1(a). Conversely, methods such as Energy (Liu et al., 2020), Energy-based ASH (Djurisic et al., 2022), and Energy-based Scale (Xu et al., 2023) inherently utilize feature distance to the origin by considering the log-sum-exp of logits, yet largely overlook ID clustering. These approaches excel in some scenarios, e.g., ImageNet, but perform sub-optimally in others, e.g., CIFAR-10 (see Table 1(a)). Through the lens of Neural Collapse, we explain, connect, and complete prior methods under a holistic view, leading to improved efficiency and generalizability validated across diverse experiments.

We summarize our main contributions below:

  • Understanding and Observation: By understanding ID clustering as a precursor to Neural Collapse, we provide new experimental evidence for the significance of weight vectors in the clusters. We also explain, from a spatial structure point of view, the previous observation that ID features tend to reside further from the origin.

  • OOD Detector: We propose to leverage feature proximity to the weight vectors of predicted classes for OOD detection, which integrates class-specific information. Complementary to feature clustering, we also propose to detect OOD samples by thresholding the feature distance to the origin, which further enhances the method's generalizability.

  • Experimental Analysis: We evaluate $\mathtt{NCI}$ across diverse classification tasks (CIFAR-10, CIFAR-100, ImageNet) and model architectures (ResNet, DenseNet (Huang et al., 2017), ViT (Dosovitskiy et al., 2020), Swin (Liu et al., 2022)). Our $\mathtt{NCI}$ practically matches the latency of the vanilla softmax-confidence detector while largely mitigating the performance discrepancy of existing methods across different classification tasks and model architectures.

2 Problem Setting

We consider a data space $\mathcal{X}$, a class set $\mathcal{C}$, and a classifier $f: \mathcal{X} \rightarrow \mathcal{C}$, which is trained on samples drawn i.i.d. from a joint distribution $\mathbb{P}_{\mathcal{X}\mathcal{C}}$. We denote the marginal distribution of $\mathbb{P}_{\mathcal{X}\mathcal{C}}$ on $\mathcal{X}$ as $\mathbb{P}^{in}$. Samples drawn from $\mathbb{P}^{in}$ are In-Distribution (ID) samples. In practice, the classifier $f$ may encounter a sample $\bm{x} \in \mathcal{X}$ that is not drawn from $\mathbb{P}^{in}$. We say such samples are Out-of-Distribution (OOD).

In this work, we focus on detecting OOD samples from classes unseen during training, for which the classifiers cannot make meaningful predictions. The OOD detector $D: \mathcal{X} \rightarrow \{\text{ID}, \text{OOD}\}$ is commonly constructed as

$$
D(\bm{x}) = \begin{cases} \text{ID} & \text{if } s(\bm{x}) \geq \tau \\ \text{OOD} & \text{if } s(\bm{x}) < \tau \end{cases}
$$

where $s: \mathcal{X} \rightarrow \mathbb{R}$ is a score function of design and $\tau$ is the threshold. Considering the diverse application scenarios, an ideal OOD detector should be efficient and generalizable across classification tasks and model architectures. Thus, we propose a versatile and efficient OOD detector leveraging insights from Neural Collapse in this work.
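The thresholding rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `detect_ood` and `threshold_at_tpr` are hypothetical, and choosing $\tau$ so that a fixed fraction of held-out ID scores is accepted is a common convention assumed here rather than something this section specifies.

```python
import numpy as np

def detect_ood(scores, tau):
    """Label each sample 'ID' if s(x) >= tau and 'OOD' otherwise."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores >= tau, "ID", "OOD")

def threshold_at_tpr(id_scores, tpr=0.95):
    """Hypothetical helper (assumption): pick tau so that roughly a fraction
    `tpr` of held-out ID scores satisfies s(x) >= tau."""
    id_scores = np.asarray(id_scores, dtype=float)
    return float(np.quantile(id_scores, 1.0 - tpr))

# Toy scores, made up for illustration only.
id_scores = np.array([0.9, 0.8, 0.85, 0.7, 0.95])
tau = threshold_at_tpr(id_scores, tpr=0.8)
labels = detect_ood([0.92, 0.1], tau)
```

Any score function $s$ can be plugged into this rule; only the convention that higher scores indicate ID samples matters.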

3 OOD Detection through the Lens of Neural Collapse

In this section, we re-examine the observation in Lee et al. (2018) and Sun et al. (2022) that ID features tend to form clusters while OOD features deviate from the clusters. We interpret the phenomenon as a precursor to the Neural Collapse (Papyan et al., 2020) convergence. Leveraging the landscape revealed by Neural Collapse, we examine:

Where do features of ID samples form clusters?

Through analytical and empirical study, we hypothesize and validate that (1) ID features tend to cluster closer to the weight vectors compared to OOD features; (2) ID clusters tend to reside further from the origin, as necessitated by their spatial structure. From our understanding, we develop a post-hoc OOD detector with enhanced efficiency and effectiveness.

3.1 Neural Collapse: Convergence of Training Features

Neural Collapse, first observed in Papyan et al. (2020), occurs on the penultimate layer across canonical classification settings during the Terminal Phase of Training (TPT), where the training error vanishes and the training loss is driven towards zero. With $\bm{h}_{i,c}$ denoting the penultimate layer feature of the $i$-th training sample with ground truth / predicted label $c$, Neural Collapse is framed in relation to

  • the feature global mean, $\bm{\mu}_G = \mathrm{Ave}_{i,c}\, \bm{h}_{i,c}$, where $\mathrm{Ave}$ is the average operation;

  • the feature class means, $\bm{\mu}_c = \mathrm{Ave}_{i}\, \bm{h}_{i,c},\ \forall c \in \mathcal{C}$;

  • the within-class covariance, $\bm{\Sigma}_W = \mathrm{Ave}_{i,c}\, (\bm{h}_{i,c} - \bm{\mu}_c)(\bm{h}_{i,c} - \bm{\mu}_c)^T$;

  • the between-class covariance, $\bm{\Sigma}_B = \mathrm{Ave}_{c}\, (\bm{\mu}_c - \bm{\mu}_G)(\bm{\mu}_c - \bm{\mu}_G)^T$;

  • the linear classification head, i.e., the last layer of the NN, $\arg\max_{c \in \mathcal{C}} \bm{w}_c^T \bm{h} + b_c$, where $\bm{w}_c$ and $b_c$ are parameters corresponding to class $c$.
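The quantities listed above can be computed directly from a matrix of penultimate features. The following is a minimal NumPy sketch; the function name `collapse_statistics` and the toy data are assumptions made for illustration, and the features would in practice come from a trained classifier.

```python
import numpy as np

def collapse_statistics(features, labels):
    """Compute mu_G, the class means mu_c, Sigma_W and Sigma_B.

    features: (N, d) array of penultimate-layer features h_{i,c}
    labels:   (N,) array of integer class labels
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)                       # sorted class ids
    mu_g = features.mean(axis=0)                      # global mean over all samples
    mu_c = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Within-class covariance: average over samples of (h - mu_c)(h - mu_c)^T.
    class_index = np.searchsorted(classes, labels)
    centered = features - mu_c[class_index]
    sigma_w = centered.T @ centered / len(features)
    # Between-class covariance: average over classes of (mu_c - mu_G)(mu_c - mu_G)^T.
    diff = mu_c - mu_g
    sigma_b = diff.T @ diff / len(classes)
    return mu_g, mu_c, sigma_w, sigma_b

# Toy example with two classes in a 2D feature space (made-up numbers).
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])
mu_g, mu_c, sigma_w, sigma_b = collapse_statistics(features, labels)
```

Under (NC1), `sigma_w` would shrink towards the zero matrix as training proceeds.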

Neural Collapse comprises four inter-related limiting behaviors:

(NC1) Within-class variability collapse: $\bm{\Sigma}_W \rightarrow \bm{0}$

(NC2) Convergence to a simplex Equiangular Tight Frame (ETF):

$$
\begin{aligned}
\left| \|\bm{\mu}_c - \bm{\mu}_G\|_2 - \|\bm{\mu}_{c'} - \bm{\mu}_G\|_2 \right| &\rightarrow 0, \quad \forall\ c,\ c' \\
\frac{(\bm{\mu}_c - \bm{\mu}_G)^T (\bm{\mu}_{c'} - \bm{\mu}_G)}{\|\bm{\mu}_c - \bm{\mu}_G\|_2 \, \|\bm{\mu}_{c'} - \bm{\mu}_G\|_2} &\rightarrow \frac{|\mathcal{C}|}{|\mathcal{C}| - 1}\, \delta_{c,c'} - \frac{1}{|\mathcal{C}| - 1}
\end{aligned}
\tag{1}
$$

where $\delta_{c,c'}$ is the Kronecker delta symbol.

(NC3) Convergence to self-duality:

$$
\frac{\bm{w}_c}{\|\bm{w}_c\|_2} - \frac{\bm{\mu}_c - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} \rightarrow \bm{0}
$$

(NC4) Simplification to nearest class center:

$$
\arg\max_{c \in \mathcal{C}} \bm{w}_c^T \bm{h} + b_c \rightarrow \arg\min_{c \in \mathcal{C}} \|\bm{h} - \bm{\mu}_c\|_2
$$

We first remark on (NC2) that an ETF achieves the maximum separation possible for globally centered equiangular vectors (Papyan et al., 2020) and extends in space, as visualized in Figure 1 Right. Since training features converge towards an ETF, they need to have sufficient norms to accommodate the spatial arrangement.
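To make the limiting geometry in (NC2) concrete, the sketch below builds a simplex ETF with a standard construction and checks that its pairwise cosines match the right-hand side of Equation 1. The construction $\sqrt{|\mathcal{C}|/(|\mathcal{C}|-1)}\,(\bm{I} - \tfrac{1}{|\mathcal{C}|}\bm{1}\bm{1}^T)$ and the function names are assumptions chosen for illustration; the source does not prescribe a particular construction.

```python
import numpy as np

def simplex_etf(num_classes):
    """Rows are unit-norm, globally centered vectors forming a simplex ETF
    in R^num_classes (one standard construction, assumed here)."""
    c = num_classes
    return np.sqrt(c / (c - 1)) * (np.eye(c) - np.ones((c, c)) / c)

def pairwise_cosines(vectors):
    """Cosine similarity between every pair of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

num_classes = 3
etf = simplex_etf(num_classes)
cosines = pairwise_cosines(etf)
# Right-hand side of Equation 1: |C|/(|C|-1) * delta_{c,c'} - 1/(|C|-1).
target = (num_classes / (num_classes - 1)) * np.eye(num_classes) - 1.0 / (num_classes - 1)
```

For three classes, every pair of distinct vectors has cosine $-1/2$, i.e., the vectors are 120 degrees apart, which matches the three-class illustration described for Figure 1 Right.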

We next build on (NC1) and (NC3) to demonstrate in the following that training features converge towards the weight vectors of the linear classification head, up to a scaling factor.

Theorem 3.1.

(NC1) and (NC3) imply that for any sample $i$ and its predicted class $c$, we have

$$
(\bm{h}_{i,c} - \bm{\mu}_G) \rightarrow \lambda \bm{w}_c \tag{2}
$$

in the Terminal Phase of Training, where $\lambda = \frac{\|\bm{\mu}_c - \bm{\mu}_G\|_2}{\|\bm{w}_c\|_2}$.

Proof.

Since $(\bm{h}_{i,c} - \bm{\mu}_c)(\bm{h}_{i,c} - \bm{\mu}_c)^T$ is positive semi-definite for any $i$ and $c$, $\bm{\Sigma}_W \rightarrow \bm{0}$ implies $(\bm{h}_{i,c} - \bm{\mu}_c)(\bm{h}_{i,c} - \bm{\mu}_c)^T \rightarrow \bm{0}$ and thus $\bm{h}_{i,c} - \bm{\mu}_c \rightarrow \bm{0},\ \forall i, c$. With algebraic manipulations, we have

$$
\frac{\bm{h}_{i,c} - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} - \frac{\bm{\mu}_c - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} \rightarrow \bm{0}, \quad \forall i, c \tag{3}
$$

Applying the triangle inequality, we have

$$
\left\| \frac{\bm{h}_{i,c} - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} - \frac{\bm{w}_c}{\|\bm{w}_c\|_2} \right\|_2 \leq \left\| \frac{\bm{h}_{i,c} - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} - \frac{\bm{\mu}_c - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} \right\|_2 + \left\| \frac{\bm{w}_c}{\|\bm{w}_c\|_2} - \frac{\bm{\mu}_c - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} \right\|_2. \tag{4}
$$

Since both terms on the RHS converge to $0$, as demonstrated by (3) and (NC3), it follows that the LHS also converges to $0$. Rearranging with $\lambda$ as defined in the theorem gives Equation 2. ∎
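The "algebraic manipulations" leading to Equation 3 can be spelled out as follows. This is a sketch that additionally assumes $\|\bm{\mu}_c - \bm{\mu}_G\|_2$ stays bounded away from zero in the limit, an assumption the proof does not state explicitly.

```latex
% Assumption (not stated in the source): \|\bm{\mu}_c - \bm{\mu}_G\|_2 is bounded away from zero.
\begin{align*}
\frac{\bm{h}_{i,c} - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2}
  - \frac{\bm{\mu}_c - \bm{\mu}_G}{\|\bm{\mu}_c - \bm{\mu}_G\|_2}
  &= \frac{(\bm{h}_{i,c} - \bm{\mu}_G) - (\bm{\mu}_c - \bm{\mu}_G)}{\|\bm{\mu}_c - \bm{\mu}_G\|_2} \\
  &= \frac{\bm{h}_{i,c} - \bm{\mu}_c}{\|\bm{\mu}_c - \bm{\mu}_G\|_2}
  \rightarrow \bm{0},
\end{align*}
% The limit uses \bm{h}_{i,c} - \bm{\mu}_c \rightarrow \bm{0}, which follows from (NC1).
```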

3.2 Geometric Structure of the Clusters of ID Features

While Neural Collapse reveals the within-class variability collapse (NC1), it also explains the clustering of features observed on general classifiers as a precursor to the collapse. We thus leverage the landscape of Neural Collapse revealed in Theorem 3.1 and (NC2) to examine the geometry of ID feature clusters. Since ID test samples are drawn from the same distribution as the training samples, we anticipate a similar pattern in their features. Specifically, we expect ID features to cluster towards the weight vectors of their predicted class during training. Additionally, we expect ID features to reside near a simplex Equiangular Tight Frame (ETF), thereby acquiring sufficient norm. Conversely, OOD samples are unseen during training and do not undergo the process of iterative adjustment that drives the Neural Collapse phenomenon. Thus, we expect the model to be less effective in aligning the OOD samples with weight vectors, placing OOD features further from the weight vectors than ID features. Meanwhile, we do not expect the model to effectively align the OOD samples with an ETF.

In Figure 2, we validate our hypothesis across the training process of a CIFAR-10 classifier with a ResNet-18 backbone. Over the ID set (CIFAR-10) and the OOD set (SVHN), we compute the average cosine similarity between the centered feature $\bm{h}_i - \bm{\mu}_G$ and the weight vector $\bm{w}_c$ of the predicted class $c$, i.e.,

$$
\mathrm{Avg}_i \ \frac{(\bm{h}_i - \bm{\mu}_G) \cdot \bm{w}_c}{\|\bm{h}_i - \bm{\mu}_G\|_2 \, \|\bm{w}_c\|_2} \tag{5}
$$

We observe that ID features have higher similarity scores and cluster closer to the weight vectors than OOD features. We further reinforce our observation in Figure 1 Left, where we visualize ID features, OOD features, and weight vectors of a CIFAR-10 classifier with UMAP (McInnes et al., 2018). ID features are color-coded to align with the weight vectors (marked by stars) of their predicted classes, revealing a distinct clustering pattern near the weight vectors. Conversely, OOD features reside further away.
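The average cosine similarity of Equation 5 can be computed as in the following NumPy sketch. The function name, the `eps` guard against zero-norm features, and the toy inputs are assumptions for illustration; in the paper's setting the features, head parameters, and global mean would come from the trained CIFAR-10 classifier.

```python
import numpy as np

def mean_cosine_to_predicted_weight(features, weights, biases, mu_g, eps=1e-12):
    """Average cosine similarity between centered features h_i - mu_G and
    the weight vector w_c of each sample's predicted class (Equation 5).

    features: (N, d), weights: (C, d), biases: (C,), mu_g: (d,)
    """
    features = np.asarray(features, dtype=float)
    logits = features @ weights.T + biases            # linear classification head
    predicted = logits.argmax(axis=1)                 # predicted class c per sample
    centered = features - mu_g
    w_pred = weights[predicted]
    dots = np.sum(centered * w_pred, axis=1)
    norms = np.linalg.norm(centered, axis=1) * np.linalg.norm(w_pred, axis=1)
    return float(np.mean(dots / (norms + eps)))       # eps avoids division by zero

# Toy head with two classes in 2D (made-up numbers).
weights = np.array([[1.0, 0.0], [0.0, 1.0]])
biases = np.zeros(2)
mu_g = np.zeros(2)
aligned = mean_cosine_to_predicted_weight(np.array([[2.0, 0.0], [0.0, 3.0]]), weights, biases, mu_g)
diagonal = mean_cosine_to_predicted_weight(np.array([[1.0, 1.0]]), weights, biases, mu_g)
```

Evaluating this quantity separately on an ID set and an OOD set at successive checkpoints would give curves of the kind described for Figure 2.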

Additionally, we combine our observation with the results from (Zhu et al., 2021), showing that the weight vectors form an ETF during training. Our observed proximity to the weight vectors thus also validates the clustering of ID features near an ETF and the divergence of OOD from this structure. The lack of structure and incentives to extend in space explains the relatively smaller norm of OOD features.
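To make the ETF structure concrete, the sketch below measures how far a set of weight vectors is from a simplex ETF, using the standard characterization that all vectors have equal norm and every pair has cosine similarity $-1/(C-1)$ for $C$ classes. The helper names are hypothetical, and this diagnostic is our illustration rather than a procedure from the paper.

```python
import numpy as np

def simplex_etf(c):
    """Reference simplex ETF: c unit-norm rows in R^c with pairwise cosine -1/(c-1)."""
    return np.sqrt(c / (c - 1)) * (np.eye(c) - np.ones((c, c)) / c)

def etf_deviation(weights):
    """Return (relative spread of row norms, max gap of pairwise cosines from -1/(C-1))."""
    c = weights.shape[0]
    norms = np.linalg.norm(weights, axis=1)
    unit = weights / norms[:, None]
    cos = unit @ unit.T
    off_diag = cos[~np.eye(c, dtype=bool)]          # drop the diagonal (self-similarity)
    norm_spread = float(norms.std() / norms.mean())
    max_cos_gap = float(np.max(np.abs(off_diag + 1.0 / (c - 1))))
    return norm_spread, max_cos_gap
```

Both returned values are zero (up to floating-point error) for an exact simplex ETF and grow as the weight vectors depart from that structure.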

3.3 Out-of-Distribution Detection

Based on our understanding, we design an efficient and versatile OOD detector. Specifically, we propose to detect OOD based on feature proximity to the weight vectors of the predicted class. For the proximity metric, we avoid Euclidean-based metrics as they require estimating the scaling factor $\lambda$ in Equation 2. This estimation tends to be imprecise for general classifiers, which may cease training prior to convergence, resulting in the suboptimal performance of Euclidean-based metrics shown in Appendix B. Instead, we design an angle-based metric, adjusted for class-wise differences. Specifically, we propose to quantify the proximity as the norm of the projection of the weight vector $\bm{w}_{c}$ onto the centered feature $\bm{h}-\bm{\mu}_{G}$, where $c$ corresponds to the predicted class, i.e.,

$$\mathtt{pScore}=\cos(\bm{w}_{c},\bm{h}-\bm{\mu}_{G})\,\|\bm{w}_{c}\|_{2}, \tag{6}$$

where $\cos(\bm{w}_{c},\bm{h}-\bm{\mu}_{G})=\frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}{\|\bm{h}-\bm{\mu}_{G}\|_{2}\|\bm{w}_{c}\|_{2}}$. A higher $\mathtt{pScore}$ indicates closer proximity to the weight vector and thus a lower chance of OOD-ness. Geometrically, thresholding on $\mathtt{pScore}$ selects infinite hyper-cones centered at the weight vectors, as illustrated in Figure 1 Right. Within the same predicted class, $\mathtt{pScore}$ is proportional to the cosine similarity. Across different classes, $\mathtt{pScore}$ adapts to class-wise differences by selecting wider hyper-cones for classes with larger weight vectors, which tend to have larger decision regions. As shown in Appendix B, our $\mathtt{pScore}$ with class-wise adjustment outperforms vanilla cosine similarity. Notably, our $\mathtt{pScore}$ incorporates class-specific information into characterizing ID clustering by using the weight vectors of the predicted class. This brings additional gain in detection effectiveness, as we shall see in Section 4.
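A minimal NumPy sketch of Equation 6 follows, written in the explicit cosine-times-norm form. All names are hypothetical; the features, weights, global mean, and predicted classes are assumed to be available as arrays.

```python
import numpy as np

def cosine(u, v):
    """Row-wise cosine similarity between two (N, P) arrays."""
    return np.sum(u * v, axis=-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    )

def p_score(features, weights, mu_g, preds):
    """pScore = cos(w_c, h - mu_G) * ||w_c||_2 for each sample (Equation 6)."""
    centered = features - mu_g            # h - mu_G
    w_pred = weights[preds]               # weight vector of the predicted class
    return cosine(w_pred, centered) * np.linalg.norm(w_pred, axis=-1)
```

For two samples at the same angle to their respective weight vectors, the one whose predicted class has the larger weight norm receives the higher score, so a fixed threshold admits a wider cone for that class, which matches the class-wise adjustment described above.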

While $\mathtt{pScore}$ enhances efficiency and effectiveness, its performance is intrinsically contingent on the strength of ID clustering. Such contingency, widely exhibited by clustering-based methods Lee et al. (2018); Sun et al. (2022); Tack et al. (2020), poses challenges on classifiers with less pronounced ID clustering, such as the ImageNet ResNet-50 in Section 4.1. To mitigate such discrepancy, we complement $\mathtt{pScore}$ by considering the distance of ID clusters to the origin. Specifically, we enhance our proximity score by incorporating feature norms to filter out OOD samples near the origin, as illustrated in Figure 1 Right. Taking the $\mathtt{L1}$ norm as an example, we define our detection score as $\mathtt{pScore}+\alpha\|\bm{h}\|_{1}$, where $\alpha$ controls the filtering strength and can be selected from a validation set as detailed in Section 4. We refer readers to Section 4.3 for the effect of different orders of $p$-norm. Thresholding on the detection score yields our Neural Collapse Inspired OOD Detector ($\mathtt{NCI}$), where a lower score indicates a higher chance of OOD-ness.
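The detection score can be sketched as follows. This is an illustration under stated assumptions, not the authors' released implementation: the function names are hypothetical, and the choice of a threshold that retains 95% of ID samples is our assumption, borrowed from the FPR95 convention rather than prescribed by the method.

```python
import numpy as np

def nci_score(features, weights, mu_g, preds, alpha, p=1):
    """pScore + alpha * ||h||_p per sample; lower values suggest OOD."""
    centered = features - mu_g
    w_pred = weights[preds]
    # cos(w_c, h - mu_G) * ||w_c||_2 equals (h - mu_G) . w_c / ||h - mu_G||_2
    p_score = np.sum(centered * w_pred, axis=1) / np.linalg.norm(centered, axis=1)
    # The filtering term uses the raw (uncentered) feature h
    return p_score + alpha * np.linalg.norm(features, ord=p, axis=1)

def threshold_at_tpr(id_scores, tpr=0.95):
    """Score threshold that keeps roughly a fraction `tpr` of ID samples (assumed rule)."""
    return float(np.percentile(id_scores, 100.0 * (1.0 - tpr)))

def is_ood(scores, threshold):
    """Flag samples whose score falls below the threshold as OOD."""
    return scores < threshold
```

In use, `threshold_at_tpr` would be fit once on held-out ID scores and `is_ood` applied to the scores of incoming test samples.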

$\mathtt{NCI}$ has $O(P)$ complexity, where $P$ represents the dimension of the penultimate layer. The complexity theoretically ensures the computational scalability of $\mathtt{NCI}$ on large models. Empirically, $\mathtt{NCI}$ maintains inference latency comparable to the vanilla softmax-confidence detector, as we shall see in Section 4.
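One way to see the $O(P)$ cost is to expand the detection score using Equation 6. The derivation below is our own reading of the definitions above, not the paper's formal analysis; $h_{j}$ denotes the $j$-th coordinate of $\bm{h}$ (our notation), and the predicted class $c$ is assumed to be already available from the forward pass.

```latex
\begin{align*}
\mathtt{pScore}
  &= \cos(\bm{w}_{c},\bm{h}-\bm{\mu}_{G})\,\|\bm{w}_{c}\|_{2}
   = \frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}
          {\|\bm{h}-\bm{\mu}_{G}\|_{2}\,\|\bm{w}_{c}\|_{2}}\,\|\bm{w}_{c}\|_{2}
   = \frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}{\|\bm{h}-\bm{\mu}_{G}\|_{2}}, \\[4pt]
\mathtt{pScore}+\alpha\|\bm{h}\|_{1}
  &= \underbrace{\frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}
                      {\|\bm{h}-\bm{\mu}_{G}\|_{2}}}_{\text{one dot product, one } L_{2} \text{ norm: } O(P)}
   \;+\; \alpha\,\underbrace{\sum_{j=1}^{P}|h_{j}|}_{O(P)} .
\end{align*}
```

The weight norm $\|\bm{w}_{c}\|_{2}$ cancels, so each sample requires only a constant number of passes over a $P$-dimensional vector and no auxiliary model or stored training features.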

4 Experiments

Table 1: $\mathtt{NCI}$ achieves high AUROC, low FPR95, and low latency on CIFAR-10 and ImageNet OpenOOD benchmarks. The CIFAR-10 classifier is a pre-trained ResNet-18, and the ImageNet classifier is a pre-trained ResNet-50.
[Uncaptioned image]
(a) $\mathtt{NCI}$ remains effective while baselines exhibit discrepancy. $\uparrow$ indicates that higher values are better and $\downarrow$ that lower values are better. Bold indicates the best results while underline denotes the 2nd and 3rd best results. Methods with * are hyperparameter-free. Scores, except those of the most recent baselines ($\mathtt{fDBD}$, $\mathtt{NECO}$, $\mathtt{ASH}$, $\mathtt{Scale}$), are copied from the OpenOOD report Zhang et al. (2023).
[Uncaptioned image]
(b) $\mathtt{NCI}$ achieves similar latency to $\mathtt{MSP}$, outperforming representative baselines. Per-image latency is reported.
Table 2: $\mathtt{NCI}$ achieves high AUROC and low FPR95 on ImageNet OpenOOD benchmarks across ViT B/16 and Swin v2 classifiers. Both transformers are finetuned on ImageNet. While baselines exhibit discrepancy, our $\mathtt{NCI}$ boosts Swin v2 performance while maintaining ViT effectiveness, even without filtering. Bold indicates the best results while underline denotes the 2nd best results.
[Uncaptioned image]

In this section, we extensively demonstrate the versatility and efficiency of $\mathtt{NCI}$ across classification tasks (CIFAR-10, CIFAR-100 (see App. D), and ImageNet) as well as model architectures (ResNet, DenseNet (see App. D), ViT, and Swin). We compare $\mathtt{NCI}$ against thirteen baseline methods, including the most recent ones, and demonstrate that $\mathtt{NCI}$ effectively mitigates the generalization discrepancies of existing methods. Following the OpenOOD benchmark Zhang et al. (2023), we evaluate on six OOD sets for CIFAR-10 and CIFAR-100 classifiers and five for ImageNet classifiers. Performance is evaluated using two widely recognized metrics: the False Positive Rate at 95% True Positive Rate (FPR95) and the Area Under the Receiver Operating Characteristic Curve (AUROC). Lower FPR95 and higher AUROC values indicate better performance. We also report the per-image inference latency (in milliseconds) evaluated on a Tesla T4 GPU. In our experiments, other than the ablation study in Section 4.3, we use the $L1$-norm as the filtering term and select the filtering strength $\alpha$ from $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$ based on a validation set generated per pixel from the Gaussian $N(0,1)$, following Sun et al. (2021); Sun and Li (2022). In Section 4.3, we explore different choices of norm order and examine the effect of filtering on an alternative clustering-based method. For detailed setups, please see Appendix A. We emphasize that our method and all baselines are post-hoc methods, while all models used are off-the-shelf and do not undergo prolonged training.
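For reference, the two metrics can be sketched in a few lines of NumPy. This is a generic illustration with hypothetical function names, not the OpenOOD evaluation code; it assumes ID is the positive class and that a higher detection score means "more ID", as for $\mathtt{NCI}$.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score (ties count 0.5)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_sorted = np.sort(np.asarray(ood_scores, dtype=float))
    below = np.searchsorted(ood_sorted, id_scores, side="left")      # OOD scores strictly lower
    below_or_equal = np.searchsorted(ood_sorted, id_scores, side="right")
    ties = below_or_equal - below
    return float((below.sum() + 0.5 * ties.sum()) / (id_scores.size * ood_sorted.size))

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID when about 95% of ID samples are accepted."""
    threshold = np.percentile(id_scores, 5.0)   # about 95% of ID scores lie at or above this
    return float(np.mean(np.asarray(ood_scores, dtype=float) >= threshold))
```

Because `np.percentile` interpolates between samples, the true positive rate at the chosen threshold is only approximately 95% on small ID sets.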

4.1 Versatility across Classification Tasks

We first assess the performance of $\mathtt{NCI}$ and baselines across CIFAR-10 and ImageNet classification tasks. The two tasks provide an ideal test bed for evaluating versatility, as they drastically differ in input resolution, number of classes, and classification accuracy. We use the ResNets from the OpenOOD benchmark Zhang et al. (2023): the CIFAR-10 classifier is a ResNet-18 with an accuracy of 95.06% and the ImageNet classifier is a ResNet-50 with an accuracy of 76.65%. Based on validation results, we set the filter strength $\alpha$ of the $L1$-norm to $10^{-2}$ for CIFAR-10 experiments and $10^{-3}$ for ImageNet experiments.
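The selection of $\alpha$ could be sketched as a small grid search. The exact protocol is given in Appendix A and is not reproduced here, so the criterion below (pick the $\alpha$ that maximizes AUROC between held-out ID samples and Gaussian-noise images) is our assumption, and all function names are hypothetical.

```python
import numpy as np

ALPHA_GRID = (1e-4, 1e-3, 1e-2, 1e-1)

def gaussian_noise_images(n, shape, seed=0):
    """n validation images with every pixel drawn i.i.d. from N(0, 1)."""
    return np.random.default_rng(seed).standard_normal((n, *shape))

def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score (ties count 0.5)."""
    ood_sorted = np.sort(ood_scores)
    below = np.searchsorted(ood_sorted, id_scores, side="left")
    ties = np.searchsorted(ood_sorted, id_scores, side="right") - below
    return float((below.sum() + 0.5 * ties.sum()) / (len(id_scores) * len(ood_sorted)))

def select_alpha(id_pscore, id_l1, noise_pscore, noise_l1, grid=ALPHA_GRID):
    """Return (alpha, AUROC) with the best ID-vs-noise AUROC (assumed selection criterion)."""
    best_alpha, best_auc = None, -1.0
    for alpha in grid:
        auc = auroc(id_pscore + alpha * id_l1, noise_pscore + alpha * noise_l1)
        if auc > best_auc:                 # first alpha wins on ties
            best_alpha, best_auc = alpha, auc
    return best_alpha, best_auc
```

Here `id_pscore` and `id_l1` are the proximity scores and feature $L1$ norms of held-out ID samples, and `noise_pscore` and `noise_l1` are the same quantities obtained by passing the noise images through the classifier.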

Datasets For CIFAR-10 experiments, we follow the OpenOOD split of the CIFAR-10 test set and evaluate on the OpenOOD benchmarks, including CIFAR-100 Krizhevsky et al. (2009), Tiny ImageNet Le and Yang (2015), MNIST Deng (2012), SVHN Netzer et al. (2011), Texture (Cimpoi et al., 2014), and Places365 (Zhou et al., 2017). For ImageNet experiments, we follow the OpenOOD split of the ImageNet test set and evaluate on the OpenOOD benchmarks, including SSB-hard Vaze et al. (2021), NINCO Bitterwolf et al. (2023), iNaturalist (Van Horn et al., 2018), Texture (Cimpoi et al., 2014), and OpenImage-O Wang et al. (2022).

Baselines In Table 1(a), we compare our method with thirteen baselines. Some baselines focus more on the CIFAR-10 benchmark while others focus more on the ImageNet benchmark. Based on performance, we categorize the baselines, besides the vanilla confidence-based $\mathtt{MSP}$ (Hendrycks and Gimpel, 2016), into two groups: the “CIFAR-10 Strong” baselines, including $\mathtt{ODIN}$ (Liang et al., 2018), $\mathtt{Energy}$ (Liu et al., 2020), $\mathtt{Mahalanobis}$ (Lee et al., 2018), $\mathtt{KNN}$ (Sun et al., 2022), $\mathtt{ViM}$ (Wang et al., 2022), and $\mathtt{fDBD}$ Liu and Qin (2023); and the “ImageNet Strong” baselines, including $\mathtt{GradNorm}$ (Huang et al., 2021), $\mathtt{NECO}$ Ammar et al. (2023), $\mathtt{React}$ (Sun et al., 2021), $\mathtt{Dice}$ (Sun and Li, 2022), $\mathtt{ASH}$ Djurisic et al. (2022), and $\mathtt{Scale}$ Xu et al. (2023). For details of the baselines, please see Appendix C.

OOD detection performance Table 1(a) highlights the discrepancy among existing methods: CIFAR-10 Strong baselines perform sub-optimally on ImageNet benchmarks, whereas ImageNet Strong baselines cannot achieve state-of-the-art performance on CIFAR-10 benchmarks. Conversely, our $\mathtt{NCI}$ largely mitigates the generalization gap, achieving competitive performance across both tasks. Moreover, $\mathtt{NCI}$ is practically as efficient as $\mathtt{MSP}$, as shown in Table 1(b) (the running times of $\mathtt{KNN}$ and $\mathtt{MDS}$ on ImageNet are copied from Table 4 in Sun et al. (2022)). This enhances the efficiency of feature-space methods, aligning with our complexity analysis in Section 3 and Appendix C.

We highlight the following pairs of comparison:

  • $\mathtt{NCI}$ vs. $\mathtt{NCI}$ w/o filter: The comparison validates the effectiveness of our norm-based filtering and highlights its importance in enhancing generalizability. On the CIFAR-10 classifier, strong ID clustering allows our method to achieve state-of-the-art performance without additional filtering. Conversely, on the ImageNet ResNet-50, ID clustering is less prominent (see Appendix E). There, norm-based filtering effectively complements the performance.

  • $\mathtt{NCI}$ vs. $\mathtt{KNN}$: Compared to $\mathtt{KNN}$, $\mathtt{NCI}$ avoids auxiliary models and significantly enhances efficiency, as shown in Table 1(b). Notably, without filtering, our hyperparameter-free score still outperforms $\mathtt{KNN}$ with selected parameters on ImageNet benchmarks. The performance gain validates the benefit of incorporating class-specific information in our design.

  • $\mathtt{NCI}$ vs. $\mathtt{ASH}$ / $\mathtt{Scale}$: Compared to both baselines, our $\mathtt{NCI}$ achieves competitive performance on the ImageNet benchmark while significantly enhancing the performance on the CIFAR-10 benchmark, thereby largely improving versatility across classification tasks. Additionally, $\mathtt{ASH}$ and $\mathtt{Scale}$ introduce a small delay on the ImageNet benchmark due to the cost of sorting activations for each image. We expect the delay to increase and the latency gap to widen on classifiers with a larger activation dimension.

  • $\mathtt{NCI}$ vs. $\mathtt{NECO}$: Like our $\mathtt{NCI}$, $\mathtt{NECO}$ Ammar et al. (2023) is motivated by Neural Collapse. Similar to our $\mathtt{NCI}$ with filtering, $\mathtt{NECO}$ incorporates distance to the origin using max-logit. However, $\mathtt{NECO}$ exclusively focuses on the subspace analysis of features and does not utilize the classification head. Such subspace operations require expensive matrix multiplication, resulting in the noticeable inference latency shown in Table 1(b). Conversely, our $\mathtt{NCI}$ explores the interplay between features and the classification head, and thus integrates class-specific information revealed by Neural Collapse. Compared to $\mathtt{NECO}$, our $\mathtt{NCI}$ improves in both efficiency and effectiveness.

4.2 Versatility across Model Architectures

Now we assess the performance of $\mathtt{NCI}$ and baselines across different architectures. In particular, we study two transformer-based models: ViT B/16 Dosovitskiy et al. (2020) and Swin-v2 Liu et al. (2022). Both transformers are finetuned on ImageNet, achieving an accuracy of 81.14% and 82.94%, respectively. We follow the setup of the OpenOOD ImageNet benchmark, detailed in Section 4.1. Based on validation results, we set the filter strength $\alpha$ of the $L1$ norm to $10^{-3}$ for both classifiers. In Table 2, we observe that strong baselines on ViT ($\mathtt{KNN}$, $\mathtt{ASH}$, $\mathtt{Scale}$) exhibit discrepancy when applied to Swin v2, echoing the observations in Ammar et al. (2023). Conversely, our $\mathtt{NCI}$, even without filtering, improves upon baseline performance on Swin v2 while maintaining superior effectiveness on ViT. The addition of filtering further enhances overall generalizability. Our experiments on ResNet (Section 4.1) and DenseNet (Appendix D) further validate the versatility of $\mathtt{NCI}$ across model architectures.

4.3 Ablation on the Filtering Effect

Table 3: Ablation on the filtering norm on the ImageNet OpenOOD benchmark with a ResNet-50 backbone. AUROC is reported (higher is better). Bold denotes the best result. Filtering with the $\mathtt{L1}$ norm outperforms alternative choices of norm across OOD datasets.
[Uncaptioned image]

In Table 3, we assess different orders of $p$-norm as the filtering term, compared to the $L1$ norm used so far. To ensure a fair comparison, we report the best performance from the filter strengths $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$. The rest of the setup follows the ImageNet benchmarks in Section 4.1. As shown in Table 3, filtering with the $L1$ norm achieves the best performance across OOD datasets, aligning with prior observations Huang et al. (2021); Park et al. (2023). Meanwhile, we observe that in rare scenarios, e.g., a ResNet-18 on CIFAR-10, the $\mathtt{L1}$ norm cannot effectively characterize OOD’s proximity to the origin, leading to no extra performance gain compared to simply thresholding on $\mathtt{pScore}$. In these cases, our algorithm benefits from its ability to automatically select a low filter strength based on validation results, effectively disregarding the filtering term.

We further apply $\mathtt{L1}$-norm based filtering to $\mathtt{KNN}$ to see if this perspective can mitigate the discrepancy of clustering-based methods in general. In Table 4, we report the best performance of $\mathtt{KNN}$ from filter strengths $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$. (Note that we report our own run of $\mathtt{KNN}$ here to ensure a fair evaluation of the filtering effect. Our results are very similar to the OpenOOD results reported in Table 1(a), with only marginal differences.) We observe a significant performance gain from adding the filter, which further validates our understanding of the ID clustering landscape from Neural Collapse. Note that our method outperforms the standalone $L1$ norm as well as $\mathtt{KNN}$, before and after filtering.

Table 4: Effectiveness of our filtering scheme on $\mathtt{KNN}$. The performance gain validates our understanding of the ID clustering landscape. $\mathtt{NCI}$ outperforms $\mathtt{KNN}$ and the standalone $L1$ norm. AUROC is reported (higher is better). Bold highlights the best result.
[Uncaptioned image]

5 Related Work

OOD Detection Extensive research has focused on developing OOD detection algorithms. One line of work is post-hoc and builds upon pre-trained models. For example, Hendrycks et al. (2019); Liang et al. (2018); Liu et al. (2020); Sun et al. (2021); Sun and Li (2022) design OOD scores over the output space of a classifier. Meanwhile, Lee et al. (2018) and Sun et al. (2022) measure OOD-ness from the perspective of ID clustering in feature space. Our work extends the observation that ID features tend to cluster from the perspective of Neural Collapse. While existing work focuses more on certain classification tasks than others, our proposed OOD detector is shown to be highly versatile.

Another line of work explores the regularization of OOD detection in training. For example, DeVries and Taylor (2018); Hsu et al. (2020) propose OOD-specific architecture whereas Huang and Li (2021); Wei et al. (2022) design OOD-specific training loss. In particular, Tack et al. (2020) brings attention to representation learning for OOD detection and proposes an OOD-specific contrastive learning scheme. Our work does not belong to this school of thought and is not restricted to specific training schemes or architecture.

Neural Collapse Neural Collapse was first observed in Papyan et al. (2020). During Neural Collapse, the penultimate layer features collapse to class means, the class means and the classifier collapse to a simplex equiangular tight frame, and the classifier simplifies to adopt the nearest class-mean decision rule. Further work provides theoretical justification for the emergence of Neural Collapse (Han et al., 2021; Mixon et al., 2020; Zhou et al., 2022; Zhu et al., 2021). In addition, Zhu et al. (2021) derives an efficient training algorithm drawing inspiration from Neural Collapse. The concurrent work of Ammar et al. (2023) also leverages insights from Neural Collapse for OOD detection. However, they approach the problem from the subspace perspective and largely overlook the class-specific information revealed by Neural Collapse, which is essential for our work.

6 Conclusion

This work leverages insights from Neural Collapse to propose a novel OOD detector. Specifically, we study the phenomenon that ID features tend to form clusters whereas OOD features reside far away. Inspired by Neural Collapse, we hypothesize and validate that ID features tend to cluster near weight vectors. We also explain why ID features tend to reside further from the origin and complement our method from this perspective. Experiments show that our method can achieve superior performance with low latency across diverse setups, improving upon the generalizability of existing work. We hope our work can inspire future work to explore the interplay between features and weight vectors for OOD detection and other research problems such as calibration and adversarial robustness.

Limitations:

Despite significantly reducing the generalization discrepancy, our $\mathtt{NCI}$ does not completely close the gap. For example, there is still an absolute difference of 1.72% in AUROC compared to the best result on the ImageNet ResNet-50 benchmark.

References

  • Ammar et al. [2023] Mouïn Ben Ammar, Nacim Belkhir, Sebastian Popescu, Antoine Manzanera, and Gianni Franchi. Neco: Neural collapse based out-of-distribution detection. arXiv preprint arXiv:2310.06823, 2023.
  • Bitterwolf et al. [2023] Julian Bitterwolf, Maximilian Müller, and Matthias Hein. In or out? fixing imagenet out-of-distribution detection evaluation. In International Conference on Machine Learning, pages 2471–2506. PMLR, 2023.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • Cimpoi et al. [2014] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference in Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Deng [2012] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • DeVries and Taylor [2018] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
  • Djurisic et al. [2022] Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. arXiv preprint arXiv:2209.09858, 2022.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Han et al. [2021] XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path. arXiv preprint arXiv:2106.02073, 2021.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • Hendrycks et al. [2019] Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132, 2019.
  • Hsu et al. [2020] Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10951–10960, 2020.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Huang and Li [2021] Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8710–8719, 2021.
  • Huang et al. [2021] Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting distributional shifts in the wild. Advances in Neural Information Processing Systems, 34:677–689, 2021.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Le and Yang [2015] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • Lee et al. [2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018.
  • Liang et al. [2018] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
  • Liu and Qin [2023] Litian Liu and Yao Qin. Fast decision boundary based out-of-distribution detector. arXiv preprint arXiv:2312.11536, 2023.
  • Liu et al. [2020] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33:21464–21475, 2020.
  • Liu et al. [2022] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
  • Mahalanobis [2018] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Sankhyā: The Indian Journal of Statistics, Series A (2008-), 80:S1–S7, 2018.
  • McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • Mixon et al. [2020] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Papyan et al. [2020] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • Park et al. [2023] Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon, and Andrew Beng Jin Teoh. Understanding the feature norm for out-of-distribution detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1557–1567, 2023.
  • Sun and Li [2022] Yiyou Sun and Yixuan Li. Dice: Leveraging sparsification for out-of-distribution detection. In European Conference on Computer Vision, pages 691–708. Springer, 2022.
  • Sun et al. [2021] Yiyou Sun, Chuan Guo, and Yixuan Li. React: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34:144–157, 2021.
  • Sun et al. [2022] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. arXiv preprint arXiv:2204.06507, 2022.
  • Tack et al. [2020] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. Csi: Novelty detection via contrastive learning on distributionally shifted instances. Advances in neural information processing systems, 33:11839–11852, 2020.
  • Van Horn et al. [2018] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
  • Vaze et al. [2021] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2021.
  • Wang et al. [2022] Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4921–4930, 2022.
  • Wei et al. [2022] Hongxin Wei, Renchunzi Xie, Hao Cheng, Lei Feng, Bo An, and Yixuan Li. Mitigating neural network overconfidence with logit normalization. arXiv preprint arXiv:2205.09310, 2022.
  • Wightman [2019] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • Xu et al. [2023] Kai Xu, Rongyu Chen, Gianni Franchi, and Angela Yao. Scaling for training time and post-hoc out-of-distribution detection enhancement. In The Twelfth International Conference on Learning Representations, 2023.
  • Zhang et al. [2023] Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Yixuan Li, Ziwei Liu, Yiran Chen, and Hai Li. Openood v1.5: Enhanced benchmark for out-of-distribution detection. arXiv preprint arXiv:2306.09301, 2023.
  • Zhou et al. [2017] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
  • Zhou et al. [2022] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022.
  • Zhu et al. [2021] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.

NeurIPS Paper Checklist

  1. Claims

    Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    Answer: [Yes]

    Justification: The claims in our abstract and introduction are justified in the rest of our paper.

    Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  2. Limitations

    Question: Does the paper discuss the limitations of the work performed by the authors?

    Answer: [Yes]

    Justification: We discuss our limitations after the conclusion section.

    Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  3. Theory Assumptions and Proofs

    Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

    Answer: [Yes]

    Justification: We include the complete proof and assumptions for our theoretical result in Section 3.1.

    Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  4. Experimental Result Reproducibility

    Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

    Answer: [Yes]

    Justification: We include code to reproduce our results in the supplemental material.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  5. Open access to data and code

    Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

    Answer: [Yes]

    Justification: We include our code in the supplemental material, and all data we use are open source.

    Guidelines:

    • The answer NA means that paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  6. Experimental Setting/Details

    Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

    Answer: [Yes]

    Justification: We specify our train/test split and hyperparameter choices in our Experiments section.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  7. Experiment Statistical Significance

    Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

    Answer: [No]

    Justification: We follow prior work and adhere to standard benchmarks with fixed train/test splits and pre-trained models, resulting in minimal randomness.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  8. Experiments Compute Resources

    Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

    Answer: [Yes]

    Justification: We use off-the-shelf models with references given in the appendix. We also report the computer resources used for inference.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  9. Code Of Ethics

    Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

    Answer: [Yes]

    Justification: We reviewed and followed the code of ethics.

    Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  10. Broader Impacts

    Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

    Answer: [Yes]

    Justification: The work can enhance the safe deployment of AI models. We do not foresee any negative societal impact of this work.

    Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  11. Safeguards

    Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

    Answer: [N/A]

    Justification: The paper poses no such risks.

    Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  12. Licenses for existing assets

    Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

    Answer: [Yes]

    Justification: We properly credited the code, data, and models used in the paper.

    Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  13. New Assets

    Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

    Answer: [Yes]

    Justification: We release our well-documented code alongside the paper.

    Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  14. Crowdsourcing and Research with Human Subjects

    Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

    Answer: [N/A]

    Justification: The paper does not involve crowdsourcing nor research with human subjects.

    Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

    Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

    Answer: [N/A]

    Justification: The paper does not involve crowdsourcing nor research with human subjects.

    Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

Appendix A Implementation Details

A.1 CIFAR-10

ResNet-18 For visualization in Fig. 1 (Left, Middle), we use a CIFAR-10 classifier with a ResNet-18 backbone trained with cross-entropy loss. The classifier is trained for 100 epochs, with the initial learning rate of 0.1 decaying to 0.01, 0.001, and 0.0001 at epochs 50, 75, and 90, respectively. For experiments in Table 1(a), we use the pre-trained model provided by the OpenOOD benchmark and refer readers to Zhang et al. (2023) for its training recipe.

DenseNet-101 For experiments on the CIFAR-10 benchmark presented in Table 6, we evaluate a CIFAR-10 classifier with a DenseNet-101 backbone. The classifier is trained following the setup in Huang et al. (2017) with depth $L=100$ and growth rate $k=12$.

A.2 CIFAR-100

DenseNet-101 For experiments on the CIFAR-100 benchmark presented in Table 6, we evaluate a CIFAR-100 classifier with a DenseNet-101 backbone. The classifier is trained following the setup in Huang et al. (2017) with depth $L=100$ and growth rate $k=12$.

A.3 ImageNet

ResNet-50 For evaluation on the ImageNet benchmark in Table 1(a), we use the default ResNet-50 model trained with cross-entropy loss provided by PyTorch. The training recipe can be found at https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/

ViT B/16 In Table 2, we use the PyTorch implementation and pre-trained checkpoint of ViT B/16, available at https://github.com/lukemelas/PyTorch-Pretrained-ViT/tree/master.

Swin v2 In Table 2, we use the $\mathtt{timm}$ (Wightman, 2019) implementation of Swin v2 together with its pre-trained checkpoint ’swinv2_base_window8_256’.

Appendix B Alternative Proximity Metrics

In this section, we validate that ID features also reside closer to weight vectors under alternative similarity metrics, and we empirically compare the metrics. In addition to our proposed $\mathtt{pScore}$, we consider two standard similarity metrics: cosine similarity and Euclidean distance. For cosine similarity, we evaluate

$$\mathtt{cosScore}=\frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}{\|\bm{h}-\bm{\mu}_{G}\|_{2}\,\|\bm{w}_{c}\|_{2}}. \tag{7}$$

As for Euclidean distance, we first estimate the scaling factor in Theorem 3.1 by $\tilde{\lambda}_{c}=\frac{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}{\|\bm{w}_{c}\|_{2}}$. Based on this estimate, we measure the distance between the centered feature $\bm{h}-\bm{\mu}_{G}$ and the scaled weight vector corresponding to the predicted class $c$ as

$$\mathtt{distScore}=-\|(\bm{h}-\bm{\mu}_{G})-\tilde{\lambda}_{c}\bm{w}_{c}\|_{2}. \tag{8}$$

As with $\mathtt{pScore}$, the larger $\mathtt{cosScore}$ or $\mathtt{distScore}$ is, the closer the feature is to the weight vector.
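As an illustration, Eqs. (7) and (8) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the released implementation: the function names and the toy vectors are hypothetical, and the global mean, class mean, and weight vector are assumed to be given.

```python
import numpy as np

def cos_score(h, mu_g, w_c):
    """Eq. (7): cosine similarity between the centered feature and w_c."""
    centered = h - mu_g
    return float(centered @ w_c / (np.linalg.norm(centered) * np.linalg.norm(w_c)))

def dist_score(h, mu_g, mu_c, w_c):
    """Eq. (8): negative Euclidean distance to the scaled weight vector."""
    # Estimated scaling factor: ||mu_c - mu_G||_2 / ||w_c||_2
    lam = np.linalg.norm(mu_c - mu_g) / np.linalg.norm(w_c)
    return float(-np.linalg.norm((h - mu_g) - lam * w_c))

# Toy values (illustrative only, not taken from the paper)
mu_g = np.zeros(3)                   # global feature mean
w_c = np.array([1.0, 0.0, 0.0])      # weight vector of the predicted class
mu_c = np.array([2.0, 0.0, 0.0])     # class mean of the predicted class
h_near = np.array([2.0, 0.1, 0.0])   # feature aligned with w_c
h_far = np.array([0.0, 2.0, 0.0])    # feature orthogonal to w_c

print(cos_score(h_near, mu_g, w_c), cos_score(h_far, mu_g, w_c))
print(dist_score(h_near, mu_g, mu_c, w_c), dist_score(h_far, mu_g, mu_c, w_c))
```

In this toy setting, the feature aligned with the weight vector receives the larger value under both scores.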

In Table 5, we evaluate OOD detection performance using standalone $\mathtt{pScore}$, $\mathtt{cosScore}$, and $\mathtt{distScore}$ as the scoring function, respectively. The experiments are evaluated with AUROC under the same ImageNet setup as in Section 4.1. We observe in Table 5 that, across OOD datasets, all three scores achieve an AUROC $>50$, indicating that ID features reside closer to weight vectors than OOD features under each metric.

Furthermore, we observe that $\mathtt{pScore}$ outperforms both $\mathtt{cosScore}$ and $\mathtt{distScore}$. Comparing $\mathtt{pScore}$ with $\mathtt{cosScore}$, the superior performance of $\mathtt{pScore}$ implies that ID features corresponding to classes with larger $\|\bm{w}_{c}\|_{2}$ are less compact. This is in line with the decision rule of the classifier, under which classes with larger $\|\bm{w}_{c}\|_{2}$ have larger decision regions. As for the comparison against the Euclidean distance-based $\mathtt{distScore}$, $\mathtt{pScore}$ eliminates the need to estimate the scaling factor, which can be error-prone before convergence and may lead to performance degradation.

Table 5: Ablation on proximity scores. AUROC is reported (higher is better). ID features are closer to weight vectors than OOD features (AUROC $>50$) under all metrics. Across OOD datasets, our proposed $\mathtt{pScore}$ separates ID and OOD features better than $\mathtt{distScore}$ and $\mathtt{cosScore}$.
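To make the AUROC $>50$ criterion concrete, the sketch below computes AUROC (in percent) directly from its pairwise definition: the probability that a randomly drawn ID sample scores higher than a randomly drawn OOD sample, with ties counted as one half. The function name and the toy scores are hypothetical, and the quadratic-memory pairwise comparison is chosen for readability rather than efficiency.

```python
import numpy as np

def auroc_percent(id_scores, ood_scores):
    """AUROC in percent; higher scores are assumed to indicate ID samples."""
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    wins = (id_col > ood_row).mean()    # fraction of pairs where ID scores higher
    ties = (id_col == ood_row).mean()   # tied pairs count as one half
    return 100.0 * (wins + 0.5 * ties)

# Toy scores (illustrative only)
print(auroc_percent([3.0, 4.0], [1.0, 2.0]))   # perfectly separated scores
print(auroc_percent([1.0, 2.0], [1.0, 2.0]))   # identical score distributions
```

A value of 50 corresponds to a score that does not separate ID from OOD samples at all, which is why exceeding 50 indicates that ID features tend to receive higher proximity scores.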
[Uncaptioned image]

Appendix C Baseline Methods

We provide an overview of our baseline methods in this section. We follow our notation in Section 3. In the following, a lower detection score indicates OOD-ness.

MSP Hendrycks and Gimpel (2016) propose to detect OOD samples based on the maximum softmax probability. Given the penultimate feature $\bm{h}$ of a test sample $\bm{x}$, the detection score of MSP can be represented as:

$$\frac{\exp(\bm{w}_{c}^{T}\bm{h}+b_{c})}{\sum_{c^{\prime}\in\mathcal{C}}\exp(\bm{w}_{c^{\prime}}^{T}\bm{h}+b_{c^{\prime}})}, \tag{9}$$

where $c$ is the predicted class for $\bm{x}$.
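A minimal NumPy sketch of Eq. (9) is given below, assuming the last-layer weight matrix `W` (one row per class) and bias `b` are available. The function name and the toy classifier head are hypothetical.

```python
import numpy as np

def msp_score(h, W, b):
    """Eq. (9): softmax probability of the predicted class."""
    logits = W @ h + b
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs.max())           # the predicted class has the largest probability

# Toy classifier head with three classes (illustrative only)
W, b = np.eye(3), np.zeros(3)
print(msp_score(np.array([5.0, 0.0, 0.0]), W, b))   # confident prediction
print(msp_score(np.zeros(3), W, b))                 # uniform softmax
```

Because the predicted class is the arg-max of the logits, taking the maximum softmax probability is equivalent to evaluating Eq. (9) at the predicted class $c$.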

ODIN Liang et al. (2018) propose to amplify the separation between ID and OOD samples on top of MSP through temperature scaling and adversarial perturbation. Given a sample $\bm{x}$, ODIN constructs a noisy sample $\bm{x}^{\prime}$ from $\bm{x}$. Denoting the penultimate feature of the noisy sample $\bm{x}^{\prime}$ as $\bm{h}^{\prime}$, ODIN assigns the detection score:

$$\frac{\exp((\bm{w}_{c}^{T}\bm{h}^{\prime}+b_{c})/T)}{\sum_{c^{\prime}\in\mathcal{C}}\exp((\bm{w}_{c^{\prime}}^{T}\bm{h}^{\prime}+b_{c^{\prime}})/T)}, \tag{10}$$

where $c$ is the predicted class for the perturbed sample and $T$ is the temperature. In our implementation, we set the noise magnitude as 0.0014 and the temperature as 1000.
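The temperature-scaling part of Eq. (10) can be sketched as follows. The sketch assumes that the penultimate feature of the already perturbed input is given, so the input perturbation step itself is not shown; the function name and the toy classifier head are hypothetical.

```python
import numpy as np

def odin_score(h_prime, W, b, T=1000.0):
    """Eq. (10): maximum softmax probability of temperature-scaled logits."""
    z = (W @ h_prime + b) / T
    z = z - z.max()                     # subtract the max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

# Toy classifier head with three classes (illustrative only)
W, b = np.eye(3), np.zeros(3)
h_prime = np.array([5.0, 0.0, 0.0])
print(odin_score(h_prime, W, b, T=1.0))   # T = 1 gives the plain softmax probability
print(odin_score(h_prime, W, b))          # T = 1000 flattens the softmax
```

With a large temperature the softmax output moves toward the uniform value $1/|\mathcal{C}|$, so the score depends on small differences between the scaled logits rather than on saturated probabilities.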

Energy Liu et al. (2020) design an energy-based score function over the logit output. Given a test sample $\bm{x}$ and its penultimate layer feature $\bm{h}$, the energy-based detection score can be represented as:

$$-\log\sum_{c^{\prime}\in\mathcal{C}}\exp(\bm{w}_{c^{\prime}}^{T}\bm{h}+b_{c^{\prime}}). \tag{11}$$
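The expression in Eq. (11) can be evaluated in a numerically stable way by factoring the largest logit out of the sum. The following is a minimal sketch that evaluates the expression exactly as written; the function name and the toy classifier head are hypothetical.

```python
import numpy as np

def eq11_value(h, W, b):
    """Evaluate -log sum_{c'} exp(w_{c'}^T h + b_{c'}) as written in Eq. (11)."""
    z = W @ h + b
    m = z.max()
    # log-sum-exp trick: log sum exp(z) = m + log sum exp(z - m)
    return float(-(m + np.log(np.exp(z - m).sum())))

# Toy classifier head with two classes (illustrative only)
W, b = np.eye(2), np.zeros(2)
print(eq11_value(np.zeros(2), W, b))              # equals -log(2)
print(eq11_value(np.array([1000.0, 0.0]), W, b))  # no overflow for large logits
```

A direct evaluation of `np.exp(1000.0)` would overflow, which is why the maximum logit is subtracted before exponentiation.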

ReAct Sun et al. (2021) build upon the energy score proposed in Liu et al. (2020) and regularize the score by truncating the penultimate layer activations. We set the truncation threshold at the 90th percentile in our experiments.
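The truncation step can be sketched as below. This is a sketch under stated assumptions: the threshold is taken as a percentile of activations collected on ID training data, and the function returns the log-sum-exp of the logits computed from the truncated feature, which is the quantity the energy score of Eq. (11) is built on. All function names and toy values are hypothetical.

```python
import numpy as np

def react_threshold(train_feats, percentile=90.0):
    """Truncation threshold: a percentile over all ID training activations."""
    return float(np.percentile(train_feats, percentile))

def react_logsumexp(h, W, b, c):
    """Clip the penultimate feature at c, then compute log-sum-exp of the logits."""
    h_clipped = np.minimum(h, c)        # truncate activations above the threshold
    z = W @ h_clipped + b
    m = z.max()
    return float(m + np.log(np.exp(z - m).sum()))

# Toy ID activations 0, 1, ..., 100 (illustrative only)
train_feats = np.arange(101.0).reshape(-1, 1)
c = react_threshold(train_feats)

W, b = np.eye(2), np.zeros(2)
h = np.array([50.0, 200.0])             # the second activation is abnormally large
print(c, react_logsumexp(h, W, b, c), react_logsumexp(h, W, b, np.inf))
```

Clipping bounds the contribution of abnormally large activations to the logits, while activations below the threshold are left unchanged.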

Dice Sun and Li (2022) builds upon the energy score proposed in Liu et al. (2020). Leveraging the observation that units and weights are used sparsely in ID inference, Sun and Li (2022) proposes to select and compute the energy score over a selected subset of weights based on their importance. We set a threshold at 90909090 percentile for CIFAR experiments and 70707070 percentile for ImageNet experiments following Sun and Li (2022).

ASH Djurisic et al. (2022) builds upon the energy score proposed in Liu et al. (2020). Prior to computing the energy score, ASH sorts each feature to find the top-k elements, scales up the top-k elements, and sets the rest to zero. We note that in addition to the cost of Energy, ASH introduces a sorting cost of $O(P \log k)$, where $P$ is the penultimate layer dimension.

Scale Xu et al. (2023) builds upon the energy score proposed in Liu et al. (2020). Prior to computing the energy score, Scale sorts each feature to find the top-k elements and, based on their statistics, scales all elements in the feature. We note that in addition to the cost of Energy, Scale also introduces a sorting cost of $O(P \log k)$, where $P$ is the penultimate layer dimension.

Mahalanobis On the feature space, Lee et al. (2018) models the ID feature distribution as a multivariate Gaussian and designs a Mahalanobis distance-based score:

$$\max_{c} \, -(\bm{e_x} - \hat{\bm{\mu}}_c)^{T}\, \hat{\Sigma}^{-1} (\bm{e_x} - \hat{\bm{\mu}}_c), \tag{12}$$

where $\bm{e_x}$ is the feature embedding of $\bm{x}$ in a specific layer, $\hat{\bm{\mu}}_c$ is the feature mean for class $c$ estimated on the training set, and $\hat{\Sigma}$ is the covariance matrix estimated over all classes on the training set.
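The basic score in Equation (12) can be sketched as below, assuming the class means and the inverse covariance (precision) matrix have already been estimated on the training set and are passed in as nested lists. The function name and the toy values are assumptions; noise injection and the cross-layer logistic regressor are not covered here.

```python
def mahalanobis_score(e, class_means, precision):
    """Return max_c of the negative squared Mahalanobis distance to class mean c."""
    best = float("-inf")
    for mu in class_means:
        d = [e_i - m_i for e_i, m_i in zip(e, mu)]
        # Quadratic form d^T * precision * d.
        quad = sum(d[i] * precision[i][j] * d[j]
                   for i in range(len(d)) for j in range(len(d)))
        best = max(best, -quad)
    return best


# Hypothetical two-class example with identity precision matrix.
means = [[0.0, 0.0], [3.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis_score([1.0, 0.0], means, identity))  # closest mean is [0, 0]
```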

On top of the basic score, Lee et al. (2018) also proposes two techniques to enhance the OOD detection performance. The first is to inject noise into samples. The second is to learn a logistic regressor to combine scores across layers. We tune the noise magnitude and learn the logistic regressor on an adversarially constructed OOD dataset. The selected noise magnitude is 0.005 in both our ResNet and DenseNet experiments.

KNN Chen et al. (2020) proposes to detect OOD samples based on the k-th nearest neighbor distance between the normalized embedding of the test sample, $\bm{z_x}/\|\bm{z_x}\|_2$, and the normalized training embeddings in the penultimate-layer space. Chen et al. (2020) also observes that contrastive learning helps improve OOD detection effectiveness.
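A brute-force sketch of the k-th nearest neighbor distance, assuming the training embeddings fit in memory as a list of vectors. The function names are assumptions, and the raw distance is returned (larger means farther from the training data); a practical implementation would typically rely on an approximate nearest-neighbor index rather than a full sort.

```python
import math


def l2_normalize(v):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def kth_neighbor_distance(z, train_embeddings, k):
    """Distance from normalized z to its k-th nearest normalized training embedding."""
    z_hat = l2_normalize(z)
    dists = sorted(math.dist(z_hat, l2_normalize(t)) for t in train_embeddings)
    return dists[k - 1]


# Hypothetical training embeddings (assumed values for illustration).
train = [[1.0, 0.0], [0.0, 3.0], [-4.0, 0.0]]
print(kth_neighbor_distance([2.0, 0.0], train, k=2))
```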

GradNorm Huang et al. (2021) extracts information from the gradient space to detect OOD samples. Specifically, Huang et al. (2021) defines the OOD score function as the $\mathtt{L1}$ norm of the gradient, taken with respect to the weight matrix $\bm{W}$, of the KL divergence between the uniform distribution $\bm{u}$ and the softmax prediction for $\bm{x}$:

$$\left\| \frac{\partial D_{KL}\big(\bm{u} \,\|\, \mathrm{softmax}(f(\bm{x}))\big)}{\partial \bm{W}} \right\|_{1}. \tag{13}$$
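For intuition, assume $\bm{W}$ is the weight matrix of a final linear layer, so that $f(\bm{x}) = \bm{W}\bm{h} + \bm{b}$ with softmax output $\bm{p}$. Under that assumption the gradient in Equation (13) is the outer product $(\bm{p} - \bm{u})\bm{h}^{T}$, and its entrywise L1 norm factorizes as $\|\bm{p} - \bm{u}\|_1 \cdot \|\bm{h}\|_1$. The sketch below uses this closed form; the function names are illustrative and no temperature scaling is applied.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradnorm_score(W, b, h):
    """L1 norm of d KL(u || softmax(Wh + b)) / dW via the closed form."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(w_c, h)) + b_c
              for w_c, b_c in zip(W, b)]
    p = softmax(logits)
    uniform = 1.0 / len(p)
    # Gradient w.r.t. W is (p - u) h^T, so the L1 norm is a product of L1 norms.
    return sum(abs(p_c - uniform) for p_c in p) * sum(abs(x) for x in h)


print(gradnorm_score([[1.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.0, 0.0]))
```

A uniform prediction gives a zero gradient, so confident predictions on large-magnitude features yield larger scores.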

ViM Wang et al. (2022) proposes to integrate class-specific information into feature-space information by adding the energy score to the feature norm in the residual space of the training feature matrix. The feature-norm term of the detection score is designed to be:

$$\alpha \sqrt{\bm{h}^{T}\bm{R}\bm{R}^{T}\bm{h}}, \tag{14}$$

where $\bm{R} \in \mathbb{R}^{P \times (P-D)}$ corresponds to the residual space after subtracting the $D$-dimensional principal space. In the preparation stage, ViM requires evaluating the residual/null space from the training data, which is computationally expensive given the data volume. During inference, a large matrix multiplication is required, resulting in a computational complexity of $O((P-D)^2)$.
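The residual-norm term can be sketched as follows. The sketch assumes $\bm{R}$ has orthonormal columns and reads the quadratic form as $\bm{h}^{T}\bm{R}\bm{R}^{T}\bm{h} = \|\bm{R}^{T}\bm{h}\|_2^2$, which is the dimensionally consistent reading for $\bm{R} \in \mathbb{R}^{P \times (P-D)}$. The function name, `alpha`, and the toy basis are assumptions; estimating $\bm{R}$ from training features and the energy term are not shown.

```python
import math


def residual_norm_term(h, R, alpha):
    """Return alpha * ||R^T h||_2, where R is given as P rows of (P - D) entries."""
    n_cols = len(R[0])
    projected = [sum(R[i][j] * h[i] for i in range(len(h))) for j in range(n_cols)]
    return alpha * math.sqrt(sum(p * p for p in projected))


# Hypothetical setup: P = 3, D = 2, residual space spanned by the third axis.
R = [[0.0], [0.0], [1.0]]
print(residual_norm_term([3.0, 4.0, 5.0], R, alpha=2.0))
```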

NECO is inspired by the ETF structure of Neural Collapse and utilizes a feature subspace for OOD detection. The detection score is designed to be:

$$\text{MaxLogit} \times \frac{\sqrt{\bm{h}^{T}\bm{P}\bm{P}^{T}\bm{h}}}{\sqrt{\bm{h}^{T}\bm{h}}}, \tag{15}$$

where $\bm{P} \in \mathbb{R}^{P \times d}$ corresponds to the $d$-dimensional principal space. In the preparation stage, NECO requires evaluating the residual/null space from the training data, which is computationally expensive given the data volume. During inference, a large matrix multiplication is required, resulting in a computational complexity of $O(d^2 + P)$.
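A small sketch of the NECO score, assuming the principal-space basis has orthonormal columns and reading the numerator as $\|\bm{P}^{T}\bm{h}\|_2$, which is the dimensionally consistent reading for $\bm{P} \in \mathbb{R}^{P \times d}$. The maximum logit is passed in directly; the function name, argument names, and toy values are assumptions for illustration.

```python
import math


def neco_score(h, basis, max_logit):
    """Return max_logit * ||basis^T h||_2 / ||h||_2 for a P x d basis."""
    n_cols = len(basis[0])
    projected = [sum(basis[i][j] * h[i] for i in range(len(h))) for j in range(n_cols)]
    projected_norm = math.sqrt(sum(p * p for p in projected))
    feature_norm = math.sqrt(sum(x * x for x in h))
    return max_logit * projected_norm / feature_norm


# Hypothetical setup: 3-dimensional features, principal space = first two axes.
basis = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(neco_score([3.0, 0.0, 4.0], basis, max_logit=2.0))
```

The norm ratio lies in $[0, 1]$ under the orthonormality assumption, so it acts as a multiplicative discount on the maximum logit for features that fall largely outside the principal space.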

fDBD Liu and Qin (2023) proposes to detect OOD samples based on the estimated feature distance to the decision boundary of each class $c \in \mathcal{C}$ other than the predicted class $f(\bm{x})$:

$$\tilde{D}_f(\bm{h}, c) = \frac{\left|(\bm{w}_{f(\bm{x})} - \bm{w}_c)^{T}\bm{h} + (b_{f(\bm{x})} - b_c)\right|}{\left\lVert \bm{w}_{f(\bm{x})} - \bm{w}_c \right\rVert_2}. \tag{16}$$

The detection score is designed as

$$\frac{1}{|\mathcal{C}|-1} \sum_{c \in \mathcal{C},\ c \neq f(\bm{x})} \frac{\tilde{D}_f(\bm{h}, c)}{\|\bm{h} - \bm{\mu}_{train}\|_2}. \tag{17}$$

fDBD has time complexity $O(|\mathcal{C}| + P)$, where $|\mathcal{C}|$ is the number of training classes and $P$ is the penultimate layer dimension.
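Equations (16) and (17) can be sketched directly as below. This is a naive illustration that loops over classes and recomputes weight differences, so it does not reflect the stated time complexity; the function name and toy inputs are assumptions.

```python
import math


def fdbd_score(W, b, h, mu_train):
    """Average regularized distance from h to the decision boundaries of other classes."""
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    logits = [dot(w_c, h) + b_c for w_c, b_c in zip(W, b)]
    pred = max(range(len(logits)), key=logits.__getitem__)  # predicted class f(x)
    denom = math.dist(h, mu_train)  # ||h - mu_train||_2
    total = 0.0
    for c in range(len(W)):
        if c == pred:
            continue
        diff_w = [a - d for a, d in zip(W[pred], W[c])]
        numerator = abs(dot(diff_w, h) + (b[pred] - b[c]))
        total += numerator / math.sqrt(dot(diff_w, diff_w)) / denom
    return total / (len(W) - 1)


# Hypothetical two-class example (assumed values for illustration).
print(fdbd_score([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [3.0, 1.0], [0.0, 1.0]))
```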

Appendix D Evaluation on DenseNet

In addition to the evaluation on ResNet and transformer-based models in Section 4, we report the performance of our $\mathtt{NCI}$ along with the baselines on DenseNet, measured by AUROC and FPR95 across OpenOOD benchmarks, in Table 6.

Table 6: Our OOD detector achieves high AUROC and low FPR95 across the CIFAR-10 and CIFAR-100 OOD benchmarks on DenseNet. $\uparrow$ indicates that larger values are better and vice versa. Bold highlights the best results and underline denotes the 2nd and 3rd best results. We note that for DenseNet CIFAR-10 and CIFAR-100 classifiers, the discrepancy among existing methods is not as severe as in the examples presented in the main paper. Nevertheless, our $\mathtt{NCI}$ achieves state-of-the-art performance or improves upon existing methods, enhancing overall performance on average.
[Uncaptioned image]

Appendix E The Prevalence of Neural Collapse across Canonical Classification Tasks

The phenomenon of Neural Collapse, as established in the seminal work by Papyan et al. (2020) and corroborated by subsequent studies Han et al. (2021); Mixon et al. (2020); Zhou et al. (2022); Zhu et al. (2021), widely exists across canonical classification datasets and model architectures. The prevalent occurrence of Neural Collapse forms a robust foundation for the design of our versatile OOD detectors. To this end, we review the empirical evidence of Neural Collapse across different datasets and model architectures in Figure 3, Figure 4, Figure 5, Figure 6, and Figure 7. Comparing the CIFAR-10 and ImageNet behaviors with a ResNet backbone in Figure 7, we note that the clustering on CIFAR-10 is more prominent than on ImageNet, as indicated by a higher ratio of between-class to within-class covariance. Note that the figures and captions are sourced from Papyan et al. (2020). The definitions and notation follow Section 3.

Figure 3: (ref. Figure 2 in Papyan et al. (2020)) Train class means become equinorm. In each array cell, the vertical axis shows the coefficient of variation of the centered class-mean norms as well as the network classifier norms. In particular, the blue lines show $\text{Std}_c(\|\bm{\mu}_c - \bm{\mu}_G\|_2)/\text{Avg}_c(\|\bm{\mu}_c - \bm{\mu}_G\|_2)$, where $\{\bm{\mu}_c\}$ are the class means of the last-layer activations of the training data and $\bm{\mu}_G$ is the corresponding train global mean; the orange lines show $\text{Std}_c(\|\bm{w}_c\|_2)/\text{Avg}_c(\|\bm{w}_c\|_2)$, where $\bm{w}_c$ is the last-layer classifier of the $c$-th class. As training progresses, the coefficients of variation of both class means and classifiers decrease.

Figure 4: (ref. Figure 3 in Papyan et al. (2020)) Classifiers and train class means approach equiangularity. In each array cell, the vertical axis shows the standard deviation (SD) of the cosines between pairs of centered class means and classifiers across all distinct pairs of classes $c$ and $c'$. Mathematically, denote $\cos_{\mu}(c,c') = \langle \bm{\mu}_c - \bm{\mu}_G, \bm{\mu}_{c'} - \bm{\mu}_G \rangle / (\|\bm{\mu}_c - \bm{\mu}_G\|_2 \|\bm{\mu}_{c'} - \bm{\mu}_G\|_2)$ and $\cos_{w}(c,c') = \langle \bm{w}_c, \bm{w}_{c'} \rangle / (\|\bm{w}_c\|_2 \|\bm{w}_{c'}\|_2)$, where $\{\bm{w}_c\}_{c=1}^{C}$, $\{\bm{\mu}_c\}_{c=1}^{C}$, and $\bm{\mu}_G$ are as in Figure 3. We measure $\text{Std}_{c,c'}(\cos_{\mu}(c,c'))$ (blue) and $\text{Std}_{c,c'}(\cos_{w}(c,c'))$ (orange). As training progresses, the SDs of the cosines approach zero, indicating equiangularity.

Figure 5: (ref. Figure 4 in Papyan et al. (2020)) Classifiers and train class means approach maximal-angle equiangularity. We plot in the vertical axis of each cell the quantities $\text{Avg}_{c,c'}|\cos_{\mu}(c,c') + 1/(C-1)|$ (blue) and $\text{Avg}_{c,c'}|\cos_{w}(c,c') + 1/(C-1)|$ (orange), where $\cos_{\mu}(c,c')$ and $\cos_{w}(c,c')$ are as in Figure 4. As training progresses, the convergence of these values to zero implies that all cosines converge to $-1/(C-1)$. This corresponds to the maximum separation possible for globally centered, equiangular vectors.
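The equinorm and equiangularity quantities plotted in Figures 3 to 5 can be sketched as follows. The sketch makes several assumptions that are not stated in the captions: the global mean is taken as the average of the class means (which matches the train global mean only for balanced classes), the population standard deviation is used, and each unordered pair of distinct classes is counted once. All function names are illustrative.

```python
import math
import statistics
from itertools import combinations


def center(class_means):
    """Subtract the average of the class means (assumed global mean) from each mean."""
    n = len(class_means)
    global_mean = [sum(col) / n for col in zip(*class_means)]
    return [[x - g for x, g in zip(mu, global_mean)] for mu in class_means]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def equinorm_cv(vectors):
    """Coefficient of variation of the vector norms (Figure 3 quantity)."""
    norms = [norm(v) for v in vectors]
    return statistics.pstdev(norms) / statistics.mean(norms)


def equiangularity(vectors):
    """Return (SD of pairwise cosines, average |cos + 1/(C-1)|) as in Figures 4 and 5."""
    n = len(vectors)
    cosines = [sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
               for u, v in combinations(vectors, 2)]
    gap = statistics.mean([abs(c + 1.0 / (n - 1)) for c in cosines])
    return statistics.pstdev(cosines), gap


# Hypothetical class means forming a shifted simplex ETF with C = 3 in 2-D.
etf_means = [[6.0, 5.0],
             [4.5, 5.0 + math.sqrt(3.0) / 2.0],
             [4.5, 5.0 - math.sqrt(3.0) / 2.0]]
centered = center(etf_means)
print(equinorm_cv(centered), equiangularity(centered))
```

For an exact simplex ETF all three quantities vanish up to floating-point error, which is the limit the curves in the figures approach during training.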

Figure 6: (ref. Figure 5 in Papyan et al. (2020)) Classifier converges to train class means. The formatting and technical details are as described in Section 3. In the vertical axis of each cell, we measure the distance between the classifiers and the centered class means, both rescaled to unit norm. Mathematically, denote $\tilde{\bm{M}} = \dot{\bm{M}}/\|\dot{\bm{M}}\|_F$, where $\dot{\bm{M}} = [\bm{\mu}_c - \bm{\mu}_G,\ c = 1, \dots, C] \in \mathbb{R}^{P \times C}$ is the matrix whose columns consist of the centered train class means; denote $\tilde{\bm{W}} = \bm{W}/\|\bm{W}\|_F$, where $\bm{W} \in \mathbb{R}^{C \times P}$ is the last-layer classifier of the network. We plot the quantity $\|\tilde{\bm{W}}^{T} - \tilde{\bm{M}}\|_F^2$ on the vertical axis. This value decreases as a function of training, indicating that the network classifier and the centered-means matrices become proportional to each other (self-duality).

Figure 7: (ref. Figure 6 in Papyan et al. (2020)) Training within-class variation collapses. In each array cell, the vertical axis (log scaled) shows the magnitude of the between-class covariance compared with the within-class covariance of the train activations. Mathematically, this is represented by $\text{Tr}(\Sigma_W \Sigma_B^{+}/C)$, where $\text{Tr}(\cdot)$ is the trace operator, $\Sigma_W$ is the within-class covariance of the last-layer activations of the training data, $\Sigma_B$ is the corresponding between-class covariance, $C$ is the total number of classes, and $[\cdot]^{+}$ is the Moore–Penrose pseudoinverse. This value decreases as a function of training, indicating collapse of within-class variation.