Detecting Out-of-distribution through the Lens of Neural Collapse

Litian Liu
MIT
[email protected]
Yao Qin
UC Santa Barbara
[email protected]

Abstract

Efficient and versatile Out-of-Distribution (OOD) detection is essential for the safe deployment of AI yet remains challenging for existing algorithms. Inspired by Neural Collapse, we discover that features of in-distribution (ID) samples cluster closer to the weight vectors compared to features of OOD samples. In addition, we reveal that ID features tend to expand in space to structure a simplex Equiangular Tight Frame, which explains the prevalent observation that ID features reside further from the origin than OOD features. Taking both insights from Neural Collapse into consideration, we propose to leverage feature proximity to weight vectors for OOD detection and further complement this perspective by using feature norms to filter OOD samples. Extensive experiments on off-the-shelf models demonstrate the efficiency and effectiveness of our method across diverse classification tasks and model architectures, enhancing the generalization capability of OOD detection.

1 Introduction

Machine learning models deployed in practice will inevitably encounter samples that deviate from the training distribution. As a classifier cannot make meaningful predictions on test samples that belong to classes unseen during training, it is important to actively detect and handle Out-of-Distribution (OOD) samples. Considering the diverse and oftentimes time-critical application scenarios, an OOD detector should be computationally efficient and should generalize effectively across various scenarios.

In this work, we focus on post-hoc methods, which address OOD detection independently of the training process. One line of prior work designs OOD scores over the model output space (Djurisic et al., 2022; Hendrycks et al., 2019; Liang et al., 2018; Liu et al., 2020; Sun et al., 2021; Sun and Li, 2022) and another line of work focuses on the feature space, where OOD samples are observed to deviate from the clusters of ID samples (Lee et al., 2018; Mahalanobis, 2018; Sun et al., 2022; Tack et al., 2020). While existing methods have made strides in OOD detection, they still face two major challenges: 1) maintaining detection effectiveness across different scenarios, and 2) ensuring computational efficiency for real-world deployment. For example, both output space and feature space methods suffer from performance discrepancy across different classification tasks, as shown in Table 1(a). Specifically, state-of-the-art algorithms on CIFAR-10 (Krizhevsky et al., 2009) OOD benchmarks perform suboptimally on ImageNet (Deng et al., 2009) OOD benchmarks, and vice versa. Such discrepancy is also observed across different architectures, as shown in Table 2. In addition, feature space methods, which rely on auxiliary models, raise efficiency concerns. For example, Lee et al. (2018) learn a Gaussian mixture model from training features and detect OOD samples based on the Mahalanobis distance (Mahalanobis, 2018), while Sun et al. (2022) record the training features and measure OOD-ness based on the k-th nearest neighbor distance to the training features. As shown in Liu and Qin (2023), such reliance on auxiliary models introduces additional cost, posing challenges for time-critical applications.

Figure 1: Illustration of our framework inspired by Neural Collapse. Left: On the penultimate layer, ID samples cluster near the weight vectors (marked by stars) of their predicted classes while OOD samples reside separated, as shown by UMAP (McInnes et al., 2018). Middle: ID and OOD samples are separated by our proposed $\mathtt{pScore}$ (Equation 6), which measures feature proximity to weight vectors. Additionally, ID samples tend to reside further from the origin, illustrated with $\mathtt{L1}$ norms. Right: ID samples cluster near a simplex Equiangular Tight Frame, illustrated with black arrows denoting weight vectors. We detect OOD by thresholding on $\mathtt{pScore}$ , selecting blue-shaded hypercones centered at weight vectors, with OOD samples outside these areas. Also, we filter OOD samples characterized by smaller feature norms. Left and Middle visualize a CIFAR-10 ResNet-18 classifier with OOD set SVHN. Right depicts our scheme on a three-class classifier with 2D penultimate space.

To this end, we aim to develop an efficient and versatile OOD detector. Specifically, we delve into the penultimate layer, i.e., the layer before the linear classification head, and revisit the observation that ID features tend to form clusters while OOD features reside apart. While this observation is well-established in prior literature (Lee et al., 2018; Sun et al., 2022; Tack et al., 2020), the underlying mechanism remains largely unexplained. To better understand and characterize the observation, we take insights from Neural Collapse (Papyan et al., 2020). Neural Collapse characterizes the interplay between the linear classification head and the penultimate layer features as training epochs go to infinity. Notably, the Neural Collapse phenomena are observed to be prevalent across canonical classification tasks under diverse architectures (see Appendix E). One particular phenomenon that Neural Collapse reveals is that features of each class gradually converge towards a single point during training. This explains the prevalent clustering observed on off-the-shelf models as a precursor to the final convergence. Therefore, we leverage the landscape of Neural Collapse to understand:

Where do features of ID samples form clusters?

To address the question, we first demonstrate that as a deterministic effect of Neural Collapse, features of training samples will converge towards the weight vectors of the predicted class. Additionally, Neural Collapse reveals that training features also converge towards a simplex Equiangular Tight Frame (ETF) (Equation 1). The spatial structure of an ETF, illustrated in Figure 1 Right, corresponds to the maximum separation achievable in space by equiangular vectors, requiring the features to be sufficiently far from the origin.

Such a landscape of Neural Collapse sheds light on the geometric structure of ID clusters. Specifically, for ID test samples, drawn from the same distribution as training samples, we anticipate a similar clustering behavior towards the weight vectors and towards an ETF. Conversely, OOD samples do not undergo the training process that enables the model to align features with weight vectors and to expand features to accommodate the spatial structure of an ETF in Neural Collapse. Therefore, we do not expect the model to effectively align the weight vectors learned from ID features with unseen OOD features. Nor do we expect the model to place OOD features far from the origin to structure an ETF. To validate our hypotheses, we trace a model’s training stages in Figure 2. We observe that ID samples consistently cluster closer to the weight vectors than OOD samples. This observation is reinforced in our UMAP (McInnes et al., 2018) visualization on an off-the-shelf CIFAR-10 classifier with ResNet-18 backbone in Figure 1 Left. Here, ID features cluster near predicted class weight vectors (marked by stars), while OOD features are distant. Combining our observation with the results of Zhu et al. (2021), which show that the weight vectors form an ETF in training, we conclude that ID features are driven to structure the ETF during training, whereas OOD features have no incentive to expand in space to form an ETF. Note that the lack of incentive for OOD features to expand in space explains the well-established observation (Huang et al., 2021; Park et al., 2023; Sun et al., 2022; Tack et al., 2020) that OOD features tend to reside closer to the origin.

Based on our understanding, we design an efficient and versatile OOD detector. We first leverage feature proximity to the weight vectors to characterize ID clustering, bypassing auxiliary models and reducing the computational cost. Specifically, we define an angle-based proximity score as the norm of the projection of the weight vector of the predicted class onto the sample feature. As shown in Figure 1 Middle, our proximity score can effectively separate ID/OOD. A higher score indicates closer proximity and a lower chance of OOD-ness. Geometrically, thresholding on the score selects hyper-cones centered at the weight vector, as illustrated in Figure 1 Right. Notably, our proximity score effectively incorporates class-specific information and brings in performance benefits as well as efficiency gain. Complementing the proximity score’s contingency on ID clustering, we also consider feature distance to the origin. Specifically, ID features tend to reside further from the origin as they expand in space to form an ETF, whereas OOD features tend to reside near the origin, as illustrated by Figure 1 Right. Using the L1 norm as an example metric for distance to the origin, we observe that ID features can be separated from OOD features, as supported by Figure 1 Middle. Combining both aspects, we propose Neural Collapse Inspired OOD Detector ( $\mathtt{NCI}$ ).

Notably, prior methods, e.g., KNN (Sun et al., 2022), focus on ID clustering but do not explicitly consider feature distance to the origin. Such approaches fall short in scenarios like ImageNet benchmarks but yield superior performance in CIFAR-10 benchmarks in Table 1(a). Conversely, methods such as Energy (Liu et al., 2020), Energy-based ASH (Djurisic et al., 2022), and Energy-based Scale (Xu et al., 2023) inherently utilize feature distance to the origin by considering the log-sum-exp of logits, yet largely overlook ID clustering. These approaches excel in some scenarios, e.g., ImageNet, but perform sub-optimally in others, e.g., CIFAR-10 (see Table 1(a)). Through the lens of Neural Collapse, we explain, connect, and complete prior methods under a holistic view, leading to improved efficiency and generalizability validated across diverse experiments.

We summarize our main contributions below:

•

Understanding and Observation: By understanding ID clustering as a precursor to Neural Collapse, we newly establish through experiments the significance of weight vectors in the clusters. We also explain, from a spatial-structure point of view, the previous observation that ID features tend to reside further from the origin.
•

OOD Detector: We propose to leverage feature proximity to the weight vectors of predicted classes for OOD detection, which integrates class-specific information. Complementary to feature clustering, we also propose to detect OOD samples by thresholding the feature distance to the origin, which further enhances the method's generalizability.
•

Experimental Analysis: We evaluate $\mathtt{NCI}$ across diverse classification tasks (CIFAR-10, CIFAR-100, ImageNet) and model architectures (ResNet, DenseNet (Huang et al., 2017), ViT (Dosovitskiy et al., 2020), Swin (Liu et al., 2022)). Our $\mathtt{NCI}$ practically matches the latency of the vanilla softmax-confidence detector while largely mitigating the performance discrepancy of existing methods across different classification tasks and model architectures.

2 Problem Setting

We consider a data space $\mathcal{X}$ , a class set $\mathcal{C}$ , and a classifier $f:\mathcal{X}\rightarrow\mathcal{C}$ , which is trained on samples drawn i.i.d. from the joint distribution $\mathbb{P}_{\mathcal{X}\mathcal{C}}$ . We denote the marginal distribution of $\mathbb{P}_{\mathcal{X}\mathcal{C}}$ on $\mathcal{X}$ as $\mathbb{P}^{in}$ . Samples drawn from $\mathbb{P}^{in}$ are In-Distribution (ID) samples. In practice, the classifier $f$ may encounter a sample $\bm{x}\in\mathcal{X}$ that is not drawn from $\mathbb{P}^{in}$ . We say such samples are Out-of-Distribution (OOD).

In this work, we focus on detecting OOD samples from classes unseen during training, for which the classifiers cannot make meaningful predictions. The OOD detector $D:\mathcal{X}\rightarrow\{\text{ID},\text{OOD}\}$ is commonly constructed as: $D(\bm{x})=\begin{cases}\text{ID}&\text{if }s(\bm{x})\geq\tau\\ \text{OOD}&\text{if }s(\bm{x})<\tau\end{cases}$ , where $s:\mathcal{X}\rightarrow\mathbb{R}$ is a designed score function and $\tau$ is the threshold. Considering the diverse application scenarios, an ideal OOD detector should be efficient and generalizable across classification tasks and model architectures. Thus, in this work we propose a versatile and efficient OOD detector that leverages insights from Neural Collapse.
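The thresholding rule above can be sketched in a few lines. This is only an illustration: the helper names `choose_threshold` and `detect`, the toy scores, and the choice of setting $\tau$ so that about 95% of ID validation scores pass (mirroring the FPR95 convention) are assumptions made here, not details taken from the paper.

```python
import numpy as np

def choose_threshold(id_scores, tpr=0.95):
    """Pick tau so that roughly a `tpr` fraction of ID scores satisfy s >= tau.

    Assumption: tau is set from held-out ID scores via a quantile.
    """
    return float(np.quantile(np.asarray(id_scores, dtype=float), 1.0 - tpr))

def detect(scores, tau):
    """Return a boolean array: True means ID (s >= tau), False means OOD."""
    return np.asarray(scores, dtype=float) >= tau

# Toy scores (hypothetical values, for illustration only).
id_val_scores = np.array([0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.88, 0.92, 0.81])
tau = choose_threshold(id_val_scores, tpr=0.95)
decisions = detect(np.array([0.9, 0.1]), tau)  # one high-score and one low-score sample
```

Any score function $s$ can be plugged in here; only the convention that higher scores mean "more ID" matters.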

3 OOD Detection through the Lens of Neural Collapse

In this section, we re-examine the observation in Lee et al. (2018); Sun et al. (2022) that ID features tend to form clusters while OOD features deviate from the clusters. We understand the phenomenon as a precursor to the Neural Collapse (Papyan et al., 2020) convergence. Leveraging the landscape revealed in Neural Collapse, we examine:

Where do features of ID samples form clusters?

Through analytical and empirical study, we hypothesize and validate that (1) ID features tend to cluster closer to the weight vectors compared to OOD features; (2) ID clusters tend to reside further from the origin, as necessitated by their spatial structure. From our understanding, we develop a post-hoc OOD detector with enhanced efficiency and effectiveness.

3.1 Neural Collapse: Convergence of Training Features

Neural Collapse, first observed in Papyan et al. (2020), occurs on the penultimate layer across canonical classification settings during the Terminal Phase of Training (TPT), where the training error vanishes and the training loss is driven towards zero. With $\bm{h}_{i,c}$ denoting the penultimate layer feature of the $i$ -th training sample with ground truth / predicted label $c$ , Neural Collapse is framed in relation to

•

the feature global mean, $\bm{\mu}_{G}=\mathrm{Ave}_{i,c}\bm{h}_{i,c}$ , where $\mathrm{Ave}$ is the average operation;
•

the feature class means, $\bm{\mu}_{c}=\mathrm{Ave}_{i}\bm{h}_{i,c},\ \forall c\in\mathcal{C}$ ;
•

the within-class covariance, $\bm{\Sigma}_{W}=\mathrm{Ave}_{i,c}(\bm{h}_{i,c}-\bm{\mu}_{c})(\bm{h}_{i,c}-\bm{\mu}_{c})^{T}$ ;
•

the between-class covariance, $\bm{\Sigma}_{B}=\mathrm{Ave}_{c}(\bm{\mu}_{c}-\bm{\mu}_{G})(\bm{\mu}_{c}-\bm{\mu}_{G})^{T}$ ;
•

the linear classification head, i.e. the last layer of the NN, $\arg\max_{c\in\mathcal{C}}\bm{w}_{c}^{T}\bm{h}+b_{c}$ , where $\bm{w}_{c}$ and $b_{c}$ are parameters corresponding to class $c$ .
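As a rough illustration of the quantities listed above, the following NumPy sketch computes the global mean, class means, and the two covariances from a feature matrix. The function name `collapse_statistics` and the toy data are hypothetical. It assumes integer labels in `0..C-1`, at least one sample per class, and that $\mathrm{Ave}_{i,c}$ is a plain average over all samples.

```python
import numpy as np

def collapse_statistics(features, labels, num_classes):
    """features: (N, P) penultimate features; labels: (N,) integer class ids."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    mu_g = features.mean(axis=0)                              # global mean mu_G
    mu_c = np.stack([features[labels == c].mean(axis=0)
                     for c in range(num_classes)])            # class means, shape (C, P)
    within = features - mu_c[labels]                          # h_{i,c} - mu_c
    sigma_w = within.T @ within / len(features)               # within-class covariance
    between = mu_c - mu_g                                     # mu_c - mu_G
    sigma_b = between.T @ between / num_classes               # between-class covariance
    return mu_g, mu_c, sigma_w, sigma_b

# Toy example: two classes in a 2D feature space (hypothetical values).
toy_features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
toy_labels = np.array([0, 0, 1, 1])
mu_g, mu_c, sigma_w, sigma_b = collapse_statistics(toy_features, toy_labels, 2)
```

Tracking how `sigma_w` shrinks relative to `sigma_b` over training is one simple way to monitor how close a classifier is to collapse.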

Neural Collapse comprises four inter-related limiting behaviors:

(NC1) Within-class variability collapse: $\bm{\Sigma}_{W}\rightarrow\bm{0}$

(NC2) Convergence to a simplex Equiangular Tight Frame (ETF):

		$\displaystyle\big|\,\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}-\|\bm{\mu}_{c^{\prime}}-\bm{\mu}_{G}\|_{2}\,\big|\rightarrow 0,\ \forall\ c,\ c^{\prime}$
		$\displaystyle\frac{(\bm{\mu}_{c}-\bm{\mu}_{G})^{T}(\bm{\mu}_{c^{\prime}}-\bm{\mu}_{G})}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}\|\bm{\mu}_{c^{\prime}}-\bm{\mu}_{G}\|_{2}}\rightarrow\frac{|\mathcal{C}|}{|\mathcal{C}|-1}\delta_{c,c^{\prime}}-\frac{1}{|\mathcal{C}|-1}$		(1)

where $\delta_{c,c^{\prime}}$ is the Kronecker delta symbol.

(NC3) Convergence to self-duality:

\displaystyle\frac{\bm{w}_{c}}{\|\bm{w}_{c}\|_{2}}-\frac{\bm{\mu}_{c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}\rightarrow\bm{0}

(NC4) Simplification to nearest class center:

\displaystyle\arg\max_{c\in\mathcal{C}}\bm{w}_{c}^{T}\bm{h}+b_{c}\rightarrow\arg\min_{c\in\mathcal{C}}\|\bm{h}-\bm{\mu}_{c}\|_{2}

We first remark on (NC2) that an ETF achieves the maximum separation possible for globally centered equiangular vectors Papyan et al. (2020) and extends in space, as visualized in Figure 1 Right. Since training features converge towards an ETF, they need to have sufficient norms to accommodate the spatial arrangement.
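To make the ETF geometry in (NC2) concrete, the sketch below builds a unit-norm simplex ETF using the standard construction $\sqrt{C/(C-1)}\,(\bm{I}-\tfrac{1}{C}\bm{1}\bm{1}^{T})$ and checks the limiting values in Equation 1. The construction is a textbook one rather than something taken from the paper, and the helper names are hypothetical.

```python
import numpy as np

def simplex_etf(num_classes):
    """Return a (C, C) matrix whose rows form a unit-norm simplex ETF."""
    c = num_classes
    return np.sqrt(c / (c - 1)) * (np.eye(c) - np.ones((c, c)) / c)

def pairwise_cosines(vectors):
    """Cosine similarity between every pair of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

etf = simplex_etf(4)
norms = np.linalg.norm(etf, axis=1)       # equal norms (first line of Equation 1)
cosines = pairwise_cosines(etf)           # 1 on the diagonal, -1/(C-1) elsewhere
off_diagonal = cosines[~np.eye(4, dtype=bool)]
```

The rows sum to the zero vector, so the frame is globally centered, and every vector has the same non-zero norm; this is the sense in which features aligned with an ETF must sit away from the origin.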

We next build on (NC1) and (NC3) to demonstrate in the following that training features converge towards the weight vectors of the linear classification head, up to a scaling factor.

Theorem 3.1.

(NC1) and (NC3) imply that for any sample $i$ and its predicted class $c$ , we have

(\bm{h}_{i,c}-\bm{\mu}_{G})\rightarrow\lambda\bm{w}_{c}

(2)

in the Terminal Phase of Training, where $\displaystyle\lambda=\frac{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}{\|\bm{w}_{c}\|_{2}}$ .

Proof.

Since $(\bm{h}_{i,c}-\bm{\mu}_{c})(\bm{h}_{i,c}-\bm{\mu}_{c})^{T}$ is positive semi-definite for any $i$ and $c$ , $\bm{\Sigma}_{W}\rightarrow\bm{0}$ implies $(\bm{h}_{i,c}-\bm{\mu}_{c})(\bm{h}_{i,c}-\bm{\mu}_{c})^{T}\rightarrow\bm{0}$ and thus $\bm{h}_{i,c}-\bm{\mu}_{c}\rightarrow\bm{0},\ \forall i,c$ . With algebraic manipulations, we have

\frac{\bm{h}_{i,c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}-\frac{\bm{\mu}_{c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}\rightarrow\bm{0},\ \forall i,c

(3)

Applying the triangle inequality, we have

\displaystyle\left\|\frac{\bm{h}_{i,c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}-\frac{\bm{w}_{c}}{\|\bm{w}_{c}\|_{2}}\right\|_{2}\leq\left\|\frac{\bm{h}_{i,c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}-\frac{\bm{\mu}_{c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}\right\|_{2}+\left\|\frac{\bm{w}_{c}}{\|\bm{w}_{c}\|_{2}}-\frac{\bm{\mu}_{c}-\bm{\mu}_{G}}{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}\right\|_{2}.

(4)

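Since both terms on the RHS converge to $0$ , as demonstrated by (3) and (NC3), it follows that the LHS also converges to $0$ . The quantity inside the LHS is exactly $(\bm{h}_{i,c}-\bm{\mu}_{G})-\lambda\bm{w}_{c}$ divided by $\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}$ , with $\lambda$ as defined in Theorem 3.1, which gives Equation 2. ∎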
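A small numerical sanity check can make Theorem 3.1 more tangible. The sketch below constructs an idealized, nearly collapsed configuration by hand (class means on a scaled simplex ETF, weights parallel to the centered class means, features with a tiny spread around their class mean) and measures how far $\bm{h}_{i,c}-\bm{\mu}_{G}$ is from $\lambda\bm{w}_{c}$. All numbers, including the scale `kappa`, the radius 4.0, and the noise level, are arbitrary assumptions; this illustrates the statement and does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, kappa = 3, 2.5            # kappa: arbitrary positive weight scale

# Centered class means on a scaled simplex ETF (rows), shifted by a global mean.
etf = np.sqrt(num_classes / (num_classes - 1)) * (
    np.eye(num_classes) - np.ones((num_classes, num_classes)) / num_classes)
mu_g = np.array([0.5, -1.0, 2.0])
mu_c = mu_g + 4.0 * etf                # each class mean lies at distance 4 from mu_g
weights = kappa * (mu_c - mu_g)        # (NC3): w_c is parallel to mu_c - mu_G

# (NC1): features sit within a tiny spread of their class mean.
labels = np.repeat(np.arange(num_classes), 5)
features = mu_c[labels] + 1e-3 * rng.standard_normal((labels.size, num_classes))

# lambda = ||mu_c - mu_G||_2 / ||w_c||_2, computed per class.
lam = np.linalg.norm(mu_c - mu_g, axis=1) / np.linalg.norm(weights, axis=1)
residual = (features - mu_g) - lam[labels, None] * weights[labels]
max_residual = float(np.linalg.norm(residual, axis=1).max())
```

In this construction the residual is just the injected within-class noise, so it shrinks to zero as the spread does, which is the content of Equation 2.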

3.2 Geometric Structure of the Clusters of ID Features

While Neural Collapse reveals the within-class variability collapse (NC1), it also explains the clustering of features observed on general classifiers as a precursor to the collapse. We thus leverage the landscape of Neural Collapse revealed in Theorem 3.1 and (NC2) to examine the geometry of ID feature clusters. Since ID test samples are drawn from the same distribution as the training samples, we anticipate a similar pattern in their features. Specifically, we expect ID features to cluster towards the weight vectors of their predicted class during training. Additionally, we expect ID features to reside near a simplex Equiangular Tight Frame (ETF), thereby acquiring sufficient norm. Conversely, OOD samples are unseen during training and do not undergo the process of iterative adjustment, which drives the Neural Collapse phenomenon. Thus we expect the model to be less effective in aligning the OOD samples with weight vectors, placing OOD further from the weight vectors than ID features. Meanwhile, we do not expect the model to effectively align the OOD samples with an ETF.

In Figure 2, we validate our hypothesis across the training process of a CIFAR-10 classifier with ResNet-18 backbone. There, we compute over the ID set (CIFAR-10) and the OOD set (SVHN) the average cosine similarity between the centered feature $\bm{h}_{i}-\bm{\mu}_{G}$ and the weight vector $\bm{w}_{c}$ of the predicted class $c$ , i.e.,

\mathrm{Ave}_{i}\ \ \frac{(\bm{h}_{i}-\bm{\mu}_{G})\cdot\bm{w}_{c}}{\|\bm{h}_{i}-\bm{\mu}_{G}\|_{2}\|\bm{w}_{c}\|_{2}}

(5)

We observe that ID features have higher similarity scores and cluster closer to the weight vectors than OOD features. We further reinforce our observation in Figure 1 Left where we visualize ID features, OOD features, and weight vectors of a CIFAR-10 classifier with UMAP (McInnes et al., 2018). ID features are color-coded to align with the weight vectors (marked by stars) of their predicted classes, revealing a distinct clustering pattern near the weight vectors. Conversely, OOD features reside further away.
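The average cosine similarity in Equation 5 could be computed along the following lines. This is a hedged sketch: the function name, the argument layout (`weights` of shape `(C, P)`, `biases` of shape `(C,)`), and the toy inputs are assumptions, and it presumes no feature coincides exactly with $\bm{\mu}_{G}$.

```python
import numpy as np

def avg_cosine_to_predicted_weight(features, weights, biases, mu_g):
    """Average cosine between centered features and predicted-class weights."""
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    pred = (features @ weights.T + biases).argmax(axis=1)   # predicted class c per sample
    centered = features - mu_g                              # h_i - mu_G
    w_pred = weights[pred]                                  # w_c for each sample, (N, P)
    cos = np.sum(centered * w_pred, axis=1) / (
        np.linalg.norm(centered, axis=1) * np.linalg.norm(w_pred, axis=1))
    return float(cos.mean())

# Toy two-class head in 2D (hypothetical values).
toy_w = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_b = np.zeros(2)
toy_mu_g = np.zeros(2)
id_like = np.array([[2.0, 0.0], [0.0, 3.0]])     # aligned with a weight vector
ood_like = np.array([[1.0, 1.0], [1.0, 0.9]])    # between the weight vectors
id_avg = avg_cosine_to_predicted_weight(id_like, toy_w, toy_b, toy_mu_g)
ood_avg = avg_cosine_to_predicted_weight(ood_like, toy_w, toy_b, toy_mu_g)
```

In practice $\bm{\mu}_{G}$ would be estimated from training features, and the two averages would be taken over an ID test set and an OOD set respectively.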

Additionally, we combine our observation with the results of Zhu et al. (2021), which show that the weight vectors form an ETF during training. Our observed proximity to the weight vectors thus also validates the clustering of ID features near an ETF and the divergence of OOD features from this structure. The lack of structure and of incentives to extend in space explains the relatively smaller norm of OOD features.

3.3 Out-of-Distribution Detection

Based on our understanding, we design an efficient and versatile OOD detector. Specifically, we propose to detect OOD based on feature proximity to the weight vectors of the predicted class. For the proximity metric, we avoid Euclidean-based metrics as they require estimating the scaling factor $\lambda$ in Equation 2. This estimation tends to be imprecise for general classifiers which may cease training prior to convergence, resulting in suboptimal performance of Euclidean-based metrics shown in Appendix B. Instead, we design an angle-based metric, adjusted for class-wise difference. Specifically, we propose to quantify the proximity as the norm of the projection of the weight vector $\bm{w}_{c}$ onto the centered feature $\bm{h}-\bm{\mu}_{G}$ , where $c$ corresponds to the predicted class, i.e.,

\mathtt{pScore}=\cos(\bm{w}_{c},\bm{h}-\bm{\mu}_{G})\|\bm{w}_{c}\|_{2},

(6)

where $\cos(\bm{w}_{c},\bm{h}-\bm{\mu}_{G})=\frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}{\|\bm{h}-\bm{\mu}_{G}\|_{2}\|\bm{w}_{c}\|_{2}}$ . A higher $\mathtt{pScore}$ indicates closer proximity to the weight vector and thus a lower chance of OOD-ness. Geometrically, thresholding on $\mathtt{pScore}$ selects infinite hyper-cones centered at the weight vectors, as illustrated in Figure 1 Right. Within the same predicted class, $\mathtt{pScore}$ is proportional to the cosine similarity. Across different classes, $\mathtt{pScore}$ adapts to class-wise difference by selecting wider hyper-cones for classes with larger weight vectors, which tend to have larger decision regions. As shown in Appendix B, our $\mathtt{pScore}$ with class-wise adjustment outperforms vanilla cosine similarity. Notably, our $\mathtt{pScore}$ incorporates class-specific information into characterizing ID clustering by using the weight vectors of the predicted class. This brings in additional gain in detection effectiveness, as we shall see in Section 4.
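Equation 6 reduces to a single dot product and one norm per sample, since $\cos(\bm{w}_{c},\bm{h}-\bm{\mu}_{G})\|\bm{w}_{c}\|_{2}=(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}/\|\bm{h}-\bm{\mu}_{G}\|_{2}$ . A minimal NumPy sketch follows; the function name `p_score`, the array layout, and the toy classifier head are assumptions made for illustration.

```python
import numpy as np

def p_score(features, weights, biases, mu_g):
    """pScore per sample: projection of w_c onto the direction of h - mu_G.

    features: (N, P) or (P,); weights: (C, P); biases: (C,); mu_g: (P,).
    Assumes no feature equals mu_g exactly (that would divide by zero).
    """
    features = np.atleast_2d(np.asarray(features, dtype=float))
    weights = np.asarray(weights, dtype=float)
    pred = (features @ weights.T + biases).argmax(axis=1)   # predicted class c
    centered = features - mu_g                              # h - mu_G
    w_pred = weights[pred]                                  # w_c per sample
    # cos(w_c, h - mu_G) * ||w_c||_2 == (h - mu_G) . w_c / ||h - mu_G||_2
    return np.sum(centered * w_pred, axis=1) / np.linalg.norm(centered, axis=1)

# Toy head with unequal weight norms (hypothetical values).
toy_w = np.array([[3.0, 0.0], [0.0, 1.0]])
toy_scores = p_score(np.array([[2.0, 0.0], [0.0, 5.0], [1.0, 1.0]]),
                     toy_w, np.zeros(2), np.zeros(2))
```

A feature perfectly aligned with its predicted weight vector attains the maximum value $\|\bm{w}_{c}\|_{2}$ , which is why a single threshold corresponds to wider cones for classes with larger weight norms.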

While $\mathtt{pScore}$ enhances efficiency and effectiveness, its performance is intrinsically contingent on the strength of ID clustering. Such contingency, widely exhibited by clustering-based methods (Lee et al., 2018; Sun et al., 2022; Tack et al., 2020), poses challenges for classifiers with less pronounced ID clustering, such as the ImageNet ResNet-50 in Section 4.1. To mitigate such discrepancy, we complement $\mathtt{pScore}$ by considering the distance of ID clusters to the origin. Specifically, we enhance our proximity score by incorporating feature norms to filter out OOD samples near the origin, as illustrated in Figure 1 Right. Taking the $\mathtt{L1}$ norm as an example, we define our detection score as $\mathtt{pScore}+\alpha\|\bm{h}\|_{1}$ , where $\alpha$ controls the filtering strength and can be selected from a validation set as detailed in Section 4. We refer readers to Section 4.3 for the effect of different orders of $p$ -norm. Thresholding on the detection score, we obtain the Neural Collapse Inspired OOD Detector ( $\mathtt{NCI}$ ), where a lower score indicates a higher chance of OOD-ness.
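Putting the two terms together, the detection score $\mathtt{pScore}+\alpha\|\bm{h}\|_{1}$ might be computed as sketched below. The function name `nci_score`, the default `alpha`, the generic norm order `p`, and the toy inputs are illustrative assumptions; note that the norm is taken on the raw feature $\bm{h}$ while the proximity term uses the centered feature.

```python
import numpy as np

def nci_score(features, weights, biases, mu_g, alpha=1e-3, p=1):
    """Detection score pScore + alpha * ||h||_p; lower values suggest OOD."""
    features = np.atleast_2d(np.asarray(features, dtype=float))
    weights = np.asarray(weights, dtype=float)
    pred = (features @ weights.T + biases).argmax(axis=1)   # predicted class c
    centered = features - mu_g
    w_pred = weights[pred]
    proximity = np.sum(centered * w_pred, axis=1) / np.linalg.norm(centered, axis=1)
    feature_norm = np.linalg.norm(features, ord=p, axis=1)  # distance of h to the origin
    return proximity + alpha * feature_norm

# Two features with the same direction but different norms (hypothetical values).
toy_features = np.array([[4.0, 0.0], [0.1, 0.0]])
toy_nci = nci_score(toy_features, np.eye(2), np.zeros(2), np.zeros(2), alpha=0.1)
```

In the toy example both features have identical $\mathtt{pScore}$ , so only the norm term separates them: the feature closer to the origin receives the lower score.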

$\mathtt{NCI}$ has $O(P)$ complexity, where $P$ represents the dimension of the penultimate layer. The complexity theoretically ensures computational scalability of $\mathtt{NCI}$ on large models. Empirically, $\mathtt{NCI}$ maintains inference latency comparable to the vanilla softmax-confidence detector, as we shall see in Section 4.

4 Experiments

Table 1: $\mathtt{NCI}$ achieves high AUROC, low FPR95, and low latency on CIFAR-10 and ImageNet OpenOOD benchmarks. The CIFAR-10 classifier is a pre-trained ResNet-18, and the ImageNet classifier is a pre-trained ResNet-50.

In this section, we extensively demonstrate the versatility and efficiency of $\mathtt{NCI}$ across classification tasks (CIFAR-10, CIFAR-100 (see App. D), ImageNet) as well as model architectures (ResNet, DenseNet (see App. D), ViT, Swin). We compare $\mathtt{NCI}$ against thirteen baseline methods, including the most recent ones, and demonstrate that $\mathtt{NCI}$ effectively mitigates the generalization discrepancies of existing methods. Following the OpenOOD benchmark (Zhang et al., 2023), we evaluate on six OOD sets for CIFAR-10 and CIFAR-100 classifiers and five for ImageNet classifiers. Performance is evaluated using two widely recognized metrics: the False Positive Rate at 95% True Positive Rate (FPR95) and the Area Under the Receiver Operating Characteristic Curve (AUROC). Lower FPR95 and higher AUROC values indicate better performance. We also report the per-image inference latency (in milliseconds) evaluated on a Tesla T4 GPU. In our experiments, other than the ablation study in Section 4.3, we use the $L1$ -norm as the filtering term and select the filtering strength $\alpha$ from $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$ based on a validation set generated per pixel from Gaussian $N(0,1)$ , following Sun et al. (2021); Sun and Li (2022). In Section 4.3, we explore different choices of norm order and examine the effect of filtering on an alternative clustering-based method. For detailed setups, please see Appendix A. We emphasize that our method and all baselines are post-hoc methods, while all models used are off-the-shelf and do not undergo prolonged training.
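For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from detection scores, treating ID as the positive class and higher scores as more ID-like. It is not the OpenOOD evaluation code; the function names, the quantile-based threshold (which interpolates between scores), and the toy values are assumptions.

```python
import numpy as np

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when about `tpr` of ID samples are accepted."""
    tau = np.quantile(np.asarray(id_scores, dtype=float), 1.0 - tpr)
    return float(np.mean(np.asarray(ood_scores, dtype=float) >= tau))

def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score (ties count 1/2)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float(np.mean((id_s > ood_s) + 0.5 * (id_s == ood_s)))

# Toy detection scores (hypothetical values).
toy_id = [0.9, 0.8, 0.7, 0.6]
toy_ood = [0.1, 0.2, 0.65]
toy_auroc = auroc(toy_id, toy_ood)
toy_fpr95 = fpr_at_tpr(toy_id, toy_ood)
```

The pairwise AUROC above costs $O(N_{\text{ID}}\cdot N_{\text{OOD}})$ memory, which is fine for a sketch but would typically be replaced by a rank-based computation on full benchmarks.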

4.1 Versatility across Classification Tasks

We first assess the performance of $\mathtt{NCI}$ and baselines across CIFAR-10 and ImageNet classification tasks. The two tasks provide an ideal test bed for evaluating versatility, as they drastically differ in input resolution, number of classes, and classification accuracy. We use the ResNets from the OpenOOD benchmark (Zhang et al., 2023): the CIFAR-10 classifier is a ResNet-18 with an accuracy of 95.06% and the ImageNet classifier is a ResNet-50 with an accuracy of 76.65%. Based on validation results, we set the filter strength $\alpha$ of the $L1$ -norm to $10^{-2}$ for CIFAR-10 experiments and $10^{-3}$ for ImageNet experiments.

Datasets. For CIFAR-10 experiments, we follow the OpenOOD split of the CIFAR-10 test set and evaluate on the OpenOOD benchmarks, including CIFAR-100 (Krizhevsky et al., 2009), Tiny ImageNet (Le and Yang, 2015), MNIST (Deng, 2012), SVHN (Netzer et al., 2011), Texture (Cimpoi et al., 2014), and Places365 (Zhou et al., 2017). For ImageNet experiments, we follow the OpenOOD split of the ImageNet test set and evaluate on the OpenOOD benchmarks, including SSB-hard (Vaze et al., 2021), NINCO (Bitterwolf et al., 2023), iNaturalist (Van Horn et al., 2018), Texture (Cimpoi et al., 2014), and OpenImage-O (Wang et al., 2022).

Baselines. In Table 1(a), we compare our method with thirteen baselines. Some baselines are more focused on the CIFAR-10 benchmark while others are more focused on the ImageNet benchmark. Based on performance, we categorize the baselines, besides the vanilla confidence-based $\mathtt{MSP}$ (Hendrycks and Gimpel, 2016), into two groups: the “CIFAR-10 Strong” baselines, including $\mathtt{ODIN}$ (Liang et al., 2018), $\mathtt{Energy}$ (Liu et al., 2020), $\mathtt{Mahalanobis}$ (Lee et al., 2018), $\mathtt{KNN}$ (Sun et al., 2022), $\mathtt{ViM}$ (Wang et al., 2022), and $\mathtt{fDBD}$ (Liu and Qin, 2023); and the “ImageNet Strong” baselines, including $\mathtt{GradNorm}$ (Huang et al., 2021), $\mathtt{NECO}$ (Ammar et al., 2023), $\mathtt{React}$ (Sun et al., 2021), $\mathtt{Dice}$ (Sun and Li, 2022), $\mathtt{ASH}$ (Djurisic et al., 2022), and $\mathtt{Scale}$ (Xu et al., 2023). For details of the baselines, please see Appendix C.

OOD detection performance. Table 1(a) highlights the discrepancy among existing methods: CIFAR-10 Strong baselines perform sub-optimally on ImageNet benchmarks, whereas ImageNet Strong baselines cannot achieve state-of-the-art performance on CIFAR-10 benchmarks. Conversely, our $\mathtt{NCI}$ largely mitigates the generalization gap, achieving competitive performance across both tasks. Moreover, $\mathtt{NCI}$ is practically as efficient as $\mathtt{MSP}$ , as shown in Table 1(b) (the running times of $\mathtt{KNN}$ and $\mathtt{MDS}$ on ImageNet are copied from Table 4 in Sun et al. (2022)). This enhances the efficiency of feature-space methods, aligning with our complexity analysis in Section 3 and Appendix C.

We highlight the following pairs of comparison:

•

$\mathtt{NCI}$ vs. $\mathtt{NCI}$ w/o filter: The comparison validates the effectiveness of our norm-based filtering and highlights its importance in enhancing generalizability. On the CIFAR-10 classifier, strong ID clustering allows our method to achieve state-of-the-art performance without additional filtering. Conversely, on the ImageNet ResNet-50, ID clustering is less prominent (see Appendix E). There, norm-based filtering effectively complements the performance.
•

$\mathtt{NCI}$ vs. $\mathtt{KNN}$ : Compared to $\mathtt{KNN}$ , $\mathtt{NCI}$ avoids auxiliary models and significantly enhances efficiency as shown in Table 1(b). Notably, without filtering, our hyperparameter-free score still outperforms $\mathtt{KNN}$ with selected parameters on ImageNet benchmarks. The performance gain validates the benefit of incorporating class-specific information in our design.
•

$\mathtt{NCI}$ vs. $\mathtt{ASH}$ / $\mathtt{Scale}$ : Compared to both baselines, our $\mathtt{NCI}$ achieves competitive performance on the ImageNet benchmark while significantly enhancing the performance on the CIFAR-10 benchmark, thereby largely improving versatility across classification tasks. Additionally, $\mathtt{ASH}$ and $\mathtt{Scale}$ introduce a small delay on the ImageNet benchmark due to their cost of activation sorting per image. We expect the delay to increase and the latency gap to widen on classifiers with a larger activation dimension.
•

$\mathtt{NCI}$ vs. $\mathtt{NECO}$ : Like our $\mathtt{NCI}$ , $\mathtt{NECO}$ (Ammar et al., 2023) is motivated by Neural Collapse. Similar to our $\mathtt{NCI}$ with filtering, $\mathtt{NECO}$ incorporates distance to the origin using max-logit. However, $\mathtt{NECO}$ exclusively focuses on the subspace analysis of features and does not utilize the classification head. Such subspace operations require expensive matrix multiplication, resulting in a noticeable inference latency shown in Table 1(b). Conversely, our $\mathtt{NCI}$ explores the interplay between features and the classification head, and thus integrates class-specific information revealed by Neural Collapse. Compared to $\mathtt{NECO}$ , our $\mathtt{NCI}$ improves in both efficiency and effectiveness.

4.2 Versatility across Model Architectures

Now we assess the performance of $\mathtt{NCI}$ and baselines across different architectures. Particularly, we study two transformer-based models: ViT B/16 (Dosovitskiy et al., 2020) and Swin-v2 (Liu et al., 2022). Both transformers are finetuned on ImageNet, achieving an accuracy of 81.14% and 82.94% respectively. We follow the setup of the OpenOOD ImageNet Benchmark, detailed in Section 4.1. Based on validation results, we set the filter strength $\alpha$ of the $L1$ norm to $10^{-3}$ for both classifiers. In Table 2, we observe that the strong baselines on ViT ( $\mathtt{KNN}$ , $\mathtt{ASH}$ , $\mathtt{Scale}$ ) exhibit discrepancy when applied to Swin-v2, echoing the observations in Ammar et al. (2023). Conversely, our $\mathtt{NCI}$ , even without filtering, improves baseline performance on Swin-v2 while maintaining superior effectiveness on ViT. The addition of filtering further enhances overall generalizability. Our experiments on ResNet (Section 4.1) and DenseNet (Appendix D) further validate the versatility of $\mathtt{NCI}$ across model architectures.

4.3 Ablation on the Filtering Effect

In Table 3, we assess different orders of $p$ -norm as the filtering term, compared to the $L1$ norm used so far. To ensure a fair comparison, we report the best performance from the filter strengths $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$ . The rest of the setup follows the ImageNet benchmarks in Section 4.1. As shown in Table 3, filtering with $L1$ norm achieves the best performance across OOD datasets, aligning with prior observations Huang et al. (2021); Park et al. (2023). Meanwhile, we observe that in rare scenarios, e.g., a ResNet-18 on CIFAR-10, the $\mathtt{L1}$ norm cannot effectively characterize OOD’s proximity to the origin, leading to no extra performance gain compared to simply thresholding on $\mathtt{pScore}$ . In these cases, our algorithm benefits from its ability to automatically select a low filter strength based on validation results, effectively disregarding the filtering term.

We further apply $\mathtt{L1}$-norm-based filtering to $\mathtt{KNN}$ to see whether this perspective can mitigate the discrepancy of clustering-based methods in general. In Table 4, we report the best performance of $\mathtt{KNN}$ over the filter strengths $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$. (We report our own run of KNN here to ensure a fair evaluation of the filtering effect; our results differ only marginally from the OpenOOD results reported in Table 1(a).) We observe a significant performance gain from adding the filter, which further validates our understanding of the ID clustering landscape from Neural Collapse. Note that our method outperforms the standalone $L1$ norm as well as $\mathtt{KNN}$, both before and after filtering.
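As a rough illustration of adding the filter to a clustering-based score, the sketch below combines a KNN score with an $L1$-norm term. It assumes that the KNN score is the negative distance to the $k$-th nearest training feature after $L2$ normalization, and that the filter is added in the same additive way as for $\mathtt{NCI}$; both the combination rule and the function names are assumptions made for illustration.

```python
import numpy as np

def knn_score(train_feats, test_feats, k=1):
    """Negative distance to the k-th nearest training feature (L2-normalised)."""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    # Pairwise Euclidean distances, shape (num_test, num_train).
    dists = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=2)
    kth = np.sort(dists, axis=1)[:, k - 1]
    return -kth

def filtered_knn_score(train_feats, test_feats, alpha, k=1):
    """Assumed combination: KNN score plus alpha times the L1 norm of the raw feature."""
    l1 = np.abs(test_feats).sum(axis=1)
    return knn_score(train_feats, test_feats, k) + alpha * l1

# Two test features with the same direction but different norms (toy data).
train = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[2.0, 0.0], [0.1, 0.0]])
print(knn_score(train, test))                # identical after normalisation
print(filtered_knn_score(train, test, 0.1))  # the filter separates them by norm
```

The toy example shows why the filter can help: normalization discards the feature norm, so the two test points are indistinguishable to the KNN score alone and only the added norm term tells them apart.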

5 Related Work

OOD Detection. Extensive research has focused on developing OOD detection algorithms. One line of work is post-hoc and builds upon pre-trained models. For example, Hendrycks et al. (2019); Liang et al. (2018); Liu et al. (2020); Sun et al. (2021); Sun and Li (2022) design OOD scores over the output space of a classifier. Meanwhile, Lee et al. (2018) and Sun et al. (2022) measure OOD-ness from the perspective of ID clustering in feature space. Our work extends the observation that ID features tend to cluster, viewing it from the perspective of Neural Collapse. While existing methods are more effective on certain classification tasks than on others, our proposed OOD detector proves highly versatile in our experiments.

Another line of work explores training-time regularization for OOD detection. For example, DeVries and Taylor (2018); Hsu et al. (2020) propose OOD-specific architectures, whereas Huang and Li (2021); Wei et al. (2022) design OOD-specific training losses. In particular, Tack et al. (2020) brings attention to representation learning for OOD detection and proposes an OOD-specific contrastive learning scheme. Our work does not belong to this school of thought and is not restricted to specific training schemes or architectures.

Neural Collapse. Neural Collapse was first observed in Papyan et al. (2020). During Neural Collapse, the penultimate-layer features collapse to their class means, the class means and the classifier weights collapse to a simplex equiangular tight framework, and the classifier simplifies to the nearest class-mean decision rule. Further work provides theoretical justification for the emergence of Neural Collapse (Han et al., 2021; Mixon et al., 2020; Zhou et al., 2022; Zhu et al., 2021). In addition, Zhu et al. (2021) derives an efficient training algorithm drawing inspiration from Neural Collapse. The concurrent work of Ammar et al. (2023) also leverages insights from Neural Collapse for OOD detection. However, they approach the problem from a subspace perspective and largely overlook the class-specific information revealed by Neural Collapse, which is essential to our work.
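To make the geometry concrete, the sketch below builds the standard simplex equiangular tight frame construction for $C$ classes, $\sqrt{C/(C-1)}\,(I - \tfrac{1}{C}\mathbf{1}\mathbf{1}^{\top})$, and checks that a linear classifier whose weights coincide with equal-norm class means agrees with the nearest class-mean rule. It is an idealized illustration of the collapsed configuration rather than a measurement on a trained network, and the function names are hypothetical.

```python
import numpy as np

def simplex_etf(num_classes):
    """Rows are unit vectors whose pairwise cosine is -1/(C-1)."""
    c = num_classes
    return np.sqrt(c / (c - 1)) * (np.eye(c) - np.ones((c, c)) / c)

def nearest_class_mean(features, class_means):
    """Predict the class whose mean is closest in Euclidean distance."""
    dists = np.linalg.norm(features[:, None, :] - class_means[None, :, :], axis=2)
    return dists.argmin(axis=1)

C = 4
means = simplex_etf(C)   # idealised class means
weights = means          # classifier weights aligned with the class means
gram = means @ means.T
print(np.round(gram, 3))  # 1 on the diagonal, -1/(C-1) elsewhere

# With equal-norm means, the linear decision equals the nearest class-mean decision.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, C))
linear_pred = (feats @ weights.T).argmax(axis=1)
ncm_pred = nearest_class_mean(feats, means)
print((linear_pred == ncm_pred).all())
```

The agreement follows from expanding $\|h-\mu_c\|^2 = \|h\|^2 - 2h^{\top}\mu_c + \|\mu_c\|^2$: when all class means share the same norm, minimizing the distance is the same as maximizing the inner product.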

6 Conclusion

This work leverages insights from Neural Collapse to propose a novel OOD detector. Specifically, we study the phenomenon that ID features tend to form clusters whereas OOD features reside far away. Inspired by Neural Collapse, we hypothesize and validate that ID features tend to cluster near weight vectors. We also explain why ID features tend to reside further from the origin and complement our method from this perspective. Experiments show that our method can achieve superior performance with low latency across diverse setups, improving upon the generalizability of existing work. We hope our work can inspire future work to explore the interplay between features and weight vectors for OOD detection and other research problems such as calibration and adversarial robustness.

Limitations:

Despite significantly reducing the generalization discrepancy, our $\mathtt{NCI}$ does not completely close the gap. For example, there is still an absolute difference of 1.72% in AUROC compared to the best result on the ImageNet ResNet-50 benchmark.

References

Ammar et al. [2023] Mouïn Ben Ammar, Nacim Belkhir, Sebastian Popescu, Antoine Manzanera, and Gianni Franchi. Neco: Neural collapse based out-of-distribution detection. arXiv preprint arXiv:2310.06823, 2023.
Bitterwolf et al. [2023] Julian Bitterwolf, Maximilian Müller, and Matthias Hein. In or out? fixing imagenet out-of-distribution detection evaluation. In International Conference on Machine Learning, pages 2471–2506. PMLR, 2023.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
Cimpoi et al. [2014] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference in Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
Deng [2012] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
DeVries and Taylor [2018] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
Djurisic et al. [2022] Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. arXiv preprint arXiv:2209.09858, 2022.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Han et al. [2021] XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path. arXiv preprint arXiv:2106.02073, 2021.
Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
Hendrycks et al. [2019] Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132, 2019.
Hsu et al. [2020] Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10951–10960, 2020.
Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
Huang and Li [2021] Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8710–8719, 2021.
Huang et al. [2021] Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting distributional shifts in the wild. Advances in Neural Information Processing Systems, 34:677–689, 2021.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Le and Yang [2015] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
Lee et al. [2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018.
Liang et al. [2018] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
Liu and Qin [2023] Litian Liu and Yao Qin. Fast decision boundary based out-of-distribution detector. arXiv preprint arXiv:2312.11536, 2023.
Liu et al. [2020] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33:21464–21475, 2020.
Liu et al. [2022] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
Mahalanobis [2018] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Sankhyā: The Indian Journal of Statistics, Series A (2008-), 80:S1–S7, 2018.
McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
Mixon et al. [2020] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
Papyan et al. [2020] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
Park et al. [2023] Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon, and Andrew Beng Jin Teoh. Understanding the feature norm for out-of-distribution detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1557–1567, 2023.
Sun and Li [2022] Yiyou Sun and Yixuan Li. Dice: Leveraging sparsification for out-of-distribution detection. In European Conference on Computer Vision, pages 691–708. Springer, 2022.
Sun et al. [2021] Yiyou Sun, Chuan Guo, and Yixuan Li. React: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34:144–157, 2021.
Sun et al. [2022] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. arXiv preprint arXiv:2204.06507, 2022.
Tack et al. [2020] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. Csi: Novelty detection via contrastive learning on distributionally shifted instances. Advances in neural information processing systems, 33:11839–11852, 2020.
Van Horn et al. [2018] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
Vaze et al. [2021] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2021.
Wang et al. [2022] Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4921–4930, 2022.
Wei et al. [2022] Hongxin Wei, Renchunzi Xie, Hao Cheng, Lei Feng, Bo An, and Yixuan Li. Mitigating neural network overconfidence with logit normalization. arXiv preprint arXiv:2205.09310, 2022.
Wightman [2019] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
Xu et al. [2023] Kai Xu, Rongyu Chen, Gianni Franchi, and Angela Yao. Scaling for training time and post-hoc out-of-distribution detection enhancement. In The Twelfth International Conference on Learning Representations, 2023.
Zhang et al. [2023] Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Yixuan Li, Ziwei Liu, Yiran Chen, and Hai Li. Openood v1.5: Enhanced benchmark for out-of-distribution detection. arXiv preprint arXiv:2306.09301, 2023.
Zhou et al. [2017] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
Zhou et al. [2022] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022.
Zhu et al. [2021] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: The claims in our abstract and introduction are justified in the rest of our paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss our limitations after the conclusion section.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We include the complete proof and assumptions for our theoretical result in Section 3.1.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We include code to reproduce our results in the supplemental material.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We include our code in the supplemental material, and all data we used are open source.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We specify our train/test split and hyperparameter choices in our Experiments section.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: We follow prior work and adhere to standard benchmarks with fixed train/test splits and pre-trained models, resulting in minimal randomness.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We use off-the-shelf models, with references given in the appendix. We also report the compute resources used for inference.
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We reviewed and followed the code of ethics.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: The work can enhance the safe deployment of AI models. We do not foresee any negative societal impact of this work.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [N/A]
Justification: The paper poses no such risks.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We properly credited the code, data, and models used in the paper.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We release our well-documented code alongside the paper.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
15.

Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- •
  
  The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- •
  
  Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- •
  
  We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- •
  
  For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

Appendix A Implementation Details

A.1 CIFAR-10

ResNet-18 For the visualization in Fig. 1 (Left, Middle), we use a CIFAR-10 classifier with a ResNet-18 backbone trained with cross-entropy loss. The classifier is trained for 100 epochs, with the initial learning rate of 0.1 decaying to 0.01, 0.001, and 0.0001 at epochs 50, 75, and 90, respectively. For the experiments in Table 1(a), we use the pre-trained model provided by the OpenOOD benchmark and refer readers to Zhang et al. (2023) for the training recipe.

DenseNet-101 For experiments on CIFAR-10 Benchmark presented in Table 6, we evaluate a CIFAR-10 classifier of DenseNet-101 backbone. The classifier is trained following the setups in Huang et al. (2017) with depth $L=100$ and growth rate $k=12$ .

A.2 CIFAR-100

DenseNet-101 For experiments on the CIFAR-100 Benchmark presented in Table 6, we evaluate a CIFAR-100 classifier of the DenseNet-101 backbone. The classifier is trained following the setups in Huang et al. (2017) with depth $L=100$ and growth rate $k=12$ .

A.3 ImageNet

ResNet-50 For evaluation on the ImageNet Benchmark in Table 1(a), we use the default ResNet-50 model trained with cross-entropy loss provided by PyTorch. The training recipe can be found at https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/

ViT B/16 In Table 2, we use the PyTorch implementation and pre-trained checkpoint of ViT B/16, available at https://github.com/lukemelas/PyTorch-Pretrained-ViT/tree/master.

Swin v2 In Table 2, we use the $\mathtt{timm}$ Wightman (2019) implementation of Swin v2 as well as their pre-trained checkpoint ’swinv2_base_window8_256’.

Appendix B Alternative Proximity Metrics

In this section, we validate that under alternative similarity metrics, ID features also reside closer to weight vectors and empirically compare the metrics. In addition to our proposed $\mathtt{pScore}$ , we consider two standard similarity metrics, cosine similarity and Euclidean distance. For cosine similarity, we evaluate

\mathtt{cosScore}=\frac{(\bm{h}-\bm{\mu}_{G})\cdot\bm{w}_{c}}{\|\bm{h}-\bm{\mu}_{G}\|_{2}\|\bm{w}_{c}\|_{2}}.

(7)

As for Euclidean distance, we first estimate the scaling factor in Theorem 3.1 by $\displaystyle\tilde{\lambda}_{c}=\frac{\|\bm{\mu}_{c}-\bm{\mu}_{G}\|_{2}}{\|\bm{w}_{c}\|_{2}}$. Based on this estimate, we measure the distance between the centered feature $\bm{h}-\bm{\mu}_{G}$ and the scaled weight vector corresponding to the predicted class $c$ as

\mathtt{distScore}=-\|(\bm{h}-\bm{\mu}_{G})-\tilde{\lambda}_{c}\bm{w}_{c}\|_{2}.

(8)

As with $\mathtt{pScore}$, the larger $\mathtt{cosScore}$ or $\mathtt{distScore}$ is, the closer the feature is to the weight vector.
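As a minimal NumPy sketch of Eqs. (7) and (8), the two alternative scores could be computed as follows. The function names, argument names, and toy vectors are illustrative assumptions rather than the released implementation; the global mean $\bm{\mu}_{G}$ and class mean $\bm{\mu}_{c}$ are assumed to be precomputed from training features.

```python
import numpy as np

def cos_score(h, mu_g, w_c):
    """Eq. (7): cosine similarity between the centered feature and w_c."""
    centered = h - mu_g
    return float(centered @ w_c / (np.linalg.norm(centered) * np.linalg.norm(w_c)))

def dist_score(h, mu_g, mu_c, w_c):
    """Eq. (8): negative Euclidean distance to the scaled weight vector."""
    # Estimated scaling factor lambda_c = ||mu_c - mu_G|| / ||w_c||.
    lam = np.linalg.norm(mu_c - mu_g) / np.linalg.norm(w_c)
    return float(-np.linalg.norm((h - mu_g) - lam * w_c))

# Toy vectors (hypothetical values, for illustration only).
mu_g = np.zeros(3)
w_c = np.array([2.0, 0.0, 0.0])
mu_c = np.array([4.0, 0.0, 0.0])    # gives lambda_c = 2
h_near = np.array([4.0, 0.0, 0.0])  # aligned with the scaled weight vector
h_far = np.array([0.0, 3.0, 0.0])   # orthogonal to w_c

# A feature aligned with w_c scores higher under both metrics.
print(cos_score(h_near, mu_g, w_c), cos_score(h_far, mu_g, w_c))
print(dist_score(h_near, mu_g, mu_c, w_c), dist_score(h_far, mu_g, mu_c, w_c))
```

Both functions take the weight vector of the predicted class, so the class prediction is assumed to be made beforehand.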

Table 5 reports OOD detection performance when $\mathtt{pScore}$, $\mathtt{cosScore}$, and $\mathtt{distScore}$ are each used as a standalone scoring function. Performance is measured by AUROC under the same ImageNet setup as in Section 4.1. Across OOD datasets, all three scores achieve an AUROC above $50$, indicating that ID features reside closer to weight vectors than OOD features under each of the three metrics.

Furthermore, we observe that $\mathtt{pScore}$ outperforms both $\mathtt{cosScore}$ and $\mathtt{distScore}$ . Comparing the performance of $\mathtt{pScore}$ and $\mathtt{cosScore}$ , the superior performance of $\mathtt{pScore}$ implies that ID features corresponding to the classes with larger $\bm{w}_{c}$ are less compact. This is in line with the decision rule of the classifier that classes with larger $\bm{w}_{c}$ have larger decision regions. As for comparison against Euclidean distance based $\mathtt{distScore}$ , $\mathtt{pScore}$ eliminates the need to estimate the scaling factor, which can be error-prone before convergence, potentially leading to performance degradation.

Appendix C Baseline Methods

We provide an overview of our baseline methods in this section. We follow the notation in Section 3. In the following, a lower detection score indicates OOD-ness.

MSP Hendrycks and Gimpel (2016) proposes to detect OOD based on the maximum softmax probability. Given the penultimate feature $\bm{h}$ for a given test sample $\bm{x}$ , the detection score of MSP can be represented as:

\frac{\exp(\bm{w}_{c}^{T}\bm{h}+b_{c})}{\sum_{c^{\prime}\in\mathcal{C}}\exp(\bm{w}_{c^{\prime}}^{T}\bm{h}+b_{c^{\prime}})},

(9)

where $c$ is the predicted class for $\bm{x}$ .
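The MSP score in Eq. (9) can be sketched in NumPy as below. The function name and the layout of `W` (one row per class) and `b` are assumptions made for illustration; subtracting the maximum logit is a standard numerical-stability step that leaves the softmax unchanged.

```python
import numpy as np

def msp_score(h, W, b):
    """Eq. (9): softmax probability of the predicted class.

    h: (P,) penultimate feature; W: (C, P) last-layer weights; b: (C,) biases.
    """
    logits = W @ h + b
    z = logits - logits.max()          # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # The predicted class is the argmax, so its probability is the maximum.
    return float(probs.max())
```

With all-zero weights and biases over three classes, the score reduces to the uniform probability 1/3, the lowest value it can take for three classes.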

ODIN Liang et al. (2018) proposes to amplify the separation between ID and OOD samples on top of MSP through temperature scaling and adversarial perturbation. Given a sample $\bm{x}$, ODIN constructs a noisy sample $\bm{x^{\prime}}$ from $\bm{x}$. Denoting the penultimate feature of the noisy sample $\bm{x^{\prime}}$ as $\bm{h^{\prime}}$, ODIN assigns the OOD score as:

\frac{\exp((\bm{w}_{c}^{T}\bm{h^{\prime}}+b_{c})/T)}{\sum_{c^{\prime}\in\mathcal{C}}\exp((\bm{w}_{c^{\prime}}^{T}\bm{h^{\prime}}+b_{c^{\prime}})/T)},

(10)

where $c$ is the predicted class for the perturbed sample and $T$ is the temperature. In our implementation, we set the noise magnitude as 0.0014 and the temperature as 1000.

Energy Liu et al. (2020) designs an energy-based score function over the logit output. Given a test sample $\bm{x}$ as well as its penultimate layer feature $\bm{h}$ , the energy based detection score can be represented as:

-\log\sum_{c^{\prime}\in\mathcal{C}}\exp(\bm{w}_{c^{\prime}}^{T}\bm{h}+b_{c^{\prime}}).

(11)
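A minimal NumPy sketch of the expression in Eq. (11), following the sign exactly as written there, is given below. The function name and array layout are illustrative assumptions; the log-sum-exp is evaluated in its shifted form so that large logits do not overflow.

```python
import numpy as np

def energy_eq11(h, W, b):
    """Eq. (11) as written: -log sum_c exp(w_c^T h + b_c).

    h: (P,) penultimate feature; W: (C, P) last-layer weights; b: (C,) biases.
    """
    logits = W @ h + b
    m = logits.max()
    # log-sum-exp identity: log sum exp(l) = m + log sum exp(l - m)
    return float(-(m + np.log(np.exp(logits - m).sum())))
```

For all-zero logits over four classes, the expression evaluates to $-\log 4$.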

ReAct Sun et al. (2021) builds upon the energy score proposed in Liu et al. (2020) and regularizes the score by truncating the penultimate layer activations. We set the truncation threshold at the $90$th percentile in our experiments.

Dice Sun and Li (2022) builds upon the energy score proposed in Liu et al. (2020). Leveraging the observation that units and weights are used sparsely in ID inference, Sun and Li (2022) proposes to compute the energy score over a subset of weights selected by their importance. We set the threshold at the $90$th percentile for CIFAR experiments and the $70$th percentile for ImageNet experiments following Sun and Li (2022).

ASH Djurisic et al. (2022) builds upon the energy score proposed in Liu et al. (2020). Prior to the Energy score, ASH sorts each feature to find the top-k elements, scales up the top-k elements, and sets the rest to zero. We note that in addition to the cost of Energy, ASH introduces a sorting cost of $O(P\log k)$ , where $P$ is the penultimate layer dimension.

Scale Xu et al. (2023) builds upon the energy score proposed in Liu et al. (2020). Prior to the Energy score, Scale sorts each feature to find the top-k elements and based on the statistics, scales all elements in the feature. We note that in addition to the cost of Energy, Scale also introduces a sorting cost of $O(P\log k)$ , where $P$ is the penultimate layer dimension.

Mahalanobis On the feature space, Lee et al. (2018) models the ID feature distribution as multivariate Gaussian and designs a Mahalanobis distance-based score:

\max_{c}-(\bm{e_{x}}-\hat{\bm{\mu}_{c}})^{T}\hat{\Sigma}^{-1}(\bm{e_{x}}-\hat{\bm{\mu}_{c}}),

(12)

where $\bm{e_{x}}$ is the feature embedding of $\bm{x}$ in a specific layer, $\hat{\bm{\mu}_{c}}$ is the feature mean for class $c$ estimated on the training set, and $\hat{\Sigma}$ is the covariance matrix estimated over all classes on the training set.

On top of the basic score, Lee et al. (2018) also proposes two techniques to enhance the OOD detection performance. The first is to inject noise into samples. The second is to learn a logistic regressor to combine scores across layers. We tune the noise magnitude and learn the logistic regressor on an adversarially constructed OOD dataset. The selected noise magnitude is 0.005 in both our ResNet and DenseNet experiments.
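The basic score in Eq. (12), without noise injection or the layer-wise logistic regressor, could be sketched in NumPy as follows. The helper names are hypothetical, the shared covariance is estimated here as the average of class-centered outer products, and a pseudo-inverse is used as a guard against a singular estimate; these are assumptions of the sketch, not details taken from Lee et al. (2018).

```python
import numpy as np

def fit_class_gaussians(feats, labels):
    """Estimate per-class means and a shared (tied) precision matrix.

    feats: (N, P) training features; labels: (N,) integer class labels.
    """
    classes = np.unique(labels)                      # sorted unique labels
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Center every sample by the mean of its own class.
    centered = feats - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(feats)
    return means, np.linalg.pinv(cov)

def mahalanobis_score(e, means, precision):
    """Eq. (12): max over classes of the negative squared Mahalanobis distance."""
    diff = means - e                                 # (C, P)
    d2 = np.einsum("cp,pq,cq->c", diff, precision, diff)
    return float((-d2).max())
```

The fitting step is the part that requires a pass over the training features; scoring a test sample then only needs the stored means and precision matrix.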

KNN Chen et al. (2020) proposes to detect OOD based on the k-th nearest neighbor distance between the normalized embedding of the test sample $\bm{z_{x}}/\|\bm{z_{x}}\|_{2}$ and the normalized training embeddings in the penultimate-layer feature space. Chen et al. (2020) also observes that contrastive learning helps improve OOD detection effectiveness.
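A brute-force NumPy sketch of the k-th nearest neighbor score described above is given below. The function name and the exhaustive distance computation are illustrative assumptions; the distance is negated so that, consistent with the convention of this section, a lower score indicates OOD-ness.

```python
import numpy as np

def knn_score(z, train_feats, k):
    """Negative distance to the k-th nearest normalized training embedding.

    z: (P,) test embedding; train_feats: (N, P) training embeddings; 1 <= k <= N.
    """
    z = z / np.linalg.norm(z)
    bank = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(bank - z, axis=1)
    # np.partition places the k-th smallest distance at index k - 1.
    kth = np.partition(dists, k - 1)[k - 1]
    return float(-kth)
```

The sketch compares the test embedding with every stored training embedding, which illustrates why this family of methods depends on keeping the training features available at inference time.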

GradNorm Huang et al. (2021) extracts information from the gradient space to detect OOD samples. Specifically, Huang et al. (2021) defines the OOD score function as the $\mathtt{L1}$ norm of the gradient, with respect to the weight matrix, of the KL divergence between the uniform distribution $\bm{u}$ and the softmax prediction for $\bm{x}$:

\left\|\frac{\partial D_{KL}(\bm{u}\,\|\,\mathrm{softmax}(f(\bm{x})))}{\partial\bm{W}}\right\|_{1}.

(13)
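Under the assumptions that the logits are $f(\bm{x})=\bm{W}\bm{h}+\bm{b}$, that the gradient is taken with respect to the last-layer weight matrix only, and that the norm is the entrywise L1 norm, the score in Eq. (13) admits a closed form. Writing $\bm{p}=\mathrm{softmax}(f(\bm{x}))$, the gradient of the KL divergence with respect to the logits is $\bm{p}-\bm{u}$, so the gradient with respect to $\bm{W}$ is $(\bm{p}-\bm{u})\bm{h}^{T}$ and its L1 norm factorizes as $\|\bm{p}-\bm{u}\|_{1}\|\bm{h}\|_{1}$. The NumPy sketch below implements this closed form; all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_uniform(W, h, b):
    """D_KL(u || softmax(W h + b)) with u the uniform distribution."""
    p = softmax(W @ h + b)
    u = np.full(len(p), 1.0 / len(p))
    return float(np.sum(u * (np.log(u) - np.log(p))))

def gradnorm_score(W, h, b):
    """Closed-form L1 norm of d D_KL / d W = (p - u) h^T."""
    p = softmax(W @ h + b)
    u = 1.0 / len(p)
    # The L1 norm of an outer product is the product of the L1 norms.
    return float(np.abs(p - u).sum() * np.abs(h).sum())
```

The helper `kl_uniform` is included only so that the closed form can be checked against a finite-difference gradient.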

ViM Wang et al. (2022) proposes to integrate class-specific information into feature space information by adding energy score to the feature norm in the residual space of the training feature matrix. The detection score is designed to be:

\alpha\sqrt{\bm{h}^{T}\bm{R}\bm{R}^{T}\bm{h}},

(14)

where $\bm{R}\in\mathbb{R}^{P\times(P-D)}$ corresponds to the residual after subtracting the $D$-dimensional principal space. In the preparation stage, ViM requires evaluating the residual/null space from the training data, which is computationally expensive given the data volume. During inference, a large matrix multiplication is required, resulting in a computational complexity of $O((P-D)^{2})$.
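The residual term of Eq. (14) can be sketched in NumPy as below, reading the matrix product there as $\bm{R}\bm{R}^{T}$, which is the form compatible with $\bm{R}$ having shape $P\times(P-D)$; in that case the square root equals $\|\bm{R}^{T}\bm{h}\|_{2}$. The sketch assumes that the principal space is obtained from an eigendecomposition of the uncentered training feature Gram matrix and that $\alpha$ is supplied by the caller; both choices and all function names are illustrative and may differ from Wang et al. (2022).

```python
import numpy as np

def residual_basis(train_feats, D):
    """Return R of shape (P, P - D): directions outside the D-dim principal space.

    train_feats: (N, P) training features.
    """
    P = train_feats.shape[1]
    # eigh returns eigenvalues in ascending order, so the first P - D
    # eigenvectors span the lowest-variance (residual) directions.
    _, vecs = np.linalg.eigh(train_feats.T @ train_feats)
    return vecs[:, : P - D]

def vim_residual_term(h, R, alpha=1.0):
    """Eq. (14): alpha * sqrt(h^T R R^T h) = alpha * ||R^T h||_2."""
    return float(alpha * np.linalg.norm(R.T @ h))
```

Computing `residual_basis` is the preparation-stage cost that scales with the amount of training data, while `vim_residual_term` is the per-sample projection performed at inference.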

NECO is inspired by the ETF structure of Neural Collapse to utilize feature subspace for OOD detection. The detection score is designed to be

\text{MaxLogit}\times\frac{\sqrt{\bm{h}^{T}\bm{P}\bm{P}^{T}\bm{h}}}{\sqrt{\bm{h}^{T}\bm{h}}},

(15)

where $\bm{P}\in\mathbb{R}^{P\times d}$ corresponds to the $d$-dimensional principal space. In the preparation stage, NECO requires evaluating the residual/null space from the training data, which is computationally expensive given the data volume. During inference, a large matrix multiplication is required, resulting in a computational complexity of $O(d^{2}+P)$.

fDBD Liu and Qin (2023) proposes to detect OOD samples based on the estimated distance from a feature to the decision boundary between its predicted class $f(\bm{x})$ and each other class $c\in\mathcal{C}$:

\tilde{D}_{f}(\bm{h},c)=\frac{|(\bm{w}_{f(\bm{x})}-\bm{w}_{c})^{T}\bm{h}+(b_{f(\bm{x})}-b_{c})|}{\left\lVert\bm{w}_{f(\bm{x})}-\bm{w}_{c}\right\rVert_{2}},

(16)

The detection score is designed as

\frac{1}{|\mathcal{C}|-1}\sum_{c\in\mathcal{C},\ c\neq f(\bm{x})}\frac{\tilde{D}_{f}(\bm{h},c)}{\|\bm{h}-\bm{\mu}_{train}\|_{2}}.

(17)

fDBD has time complexity $O(|\mathcal{C}|+P)$ , where $|\mathcal{C}|$ is the number of training classes and $P$ is the penultimate layer dimension.
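Eqs. (16) and (17) can be combined into a short NumPy sketch as below. The function name and array layout are assumptions for illustration, and $\bm{\mu}_{train}$ is assumed to be the precomputed mean of the training features. The weight-difference norms are recomputed on every call for clarity; they depend only on the classifier and could be cached per predicted class.

```python
import numpy as np

def fdbd_score(h, W, b, mu_train):
    """Average regularized distance to decision boundaries, Eqs. (16)-(17).

    h: (P,) penultimate feature; W: (C, P) weights; b: (C,) biases;
    mu_train: (P,) mean of training features.
    """
    logits = W @ h + b
    pred = int(np.argmax(logits))
    others = np.arange(len(b)) != pred       # mask of classes c != f(x)
    dw = W[pred] - W[others]                 # (C - 1, P)
    db = b[pred] - b[others]                 # (C - 1,)
    # Eq. (16): distance estimate to each boundary.
    dist = np.abs(dw @ h + db) / np.linalg.norm(dw, axis=1)
    # Eq. (17): normalize by the distance to the training mean and average.
    return float(np.mean(dist / np.linalg.norm(h - mu_train)))
```

For a two-class classifier with identity weights, zero biases, and a zero training mean, the feature $(2, 0)$ receives a score of $\sqrt{2}/2$.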

Appendix D Evaluation on DenseNet

In addition to the evaluation on ResNet and transformer-based models in Section 4, we report the performance of our $\mathtt{NCI}$ along with the baselines under AUROC and FPR95 across OpenOOD benchmarks in Table 6.

Appendix E The Prevalence of Neural Collapse across Canonical Classification Tasks

The phenomenon of Neural Collapse, as established in the seminal work of Papyan et al. (2020) and corroborated by subsequent studies Han et al. (2021); Mixon et al. (2020); Zhou et al. (2022); Zhu et al. (2021), widely exists across canonical classification datasets and model architectures. This prevalence forms a robust foundation for the design of our versatile OOD detector. To this end, we review the empirical evidence of Neural Collapse across different datasets and model architectures in Figure 3, Figure 4, Figure 5, Figure 6, and Figure 7. Comparing the CIFAR-10 and ImageNet behaviors with a ResNet backbone in Figure 7, we note that the clustering on CIFAR-10 is more prominent than on ImageNet, as indicated by a higher ratio of between-class variance to within-class covariance. Note that the figures and captions are sourced from Papyan et al. (2020). The definitions and notation follow Section 3.