License: CC BY-NC-SA 4.0
arXiv:2402.17091v1 [cs.CV] 27 Feb 2024
Structural Teacher-Student Normality Learning for Multi-Class Anomaly Detection and Localization

Hanqiu Deng and Xingyu Li
{hanqiu1, xingyu}@ualberta.ca
University of Alberta
Abstract

Visual anomaly detection is a challenging open-set task aimed at identifying unknown anomalous patterns while modeling normal data. The knowledge distillation paradigm has shown remarkable performance in one-class anomaly detection by leveraging teacher-student network feature comparisons. However, extending this paradigm to multi-class anomaly detection introduces novel scalability challenges. In this study, we address the significant performance degradation observed in previous teacher-student models when applied to multi-class anomaly detection, which we identify as resulting from cross-class interference. To tackle this issue, we introduce a novel approach known as Structural Teacher-Student Normality Learning (SNL): (1) We propose spatial-channel distillation and intra-&inter-affinity distillation techniques to measure structural distance between the teacher and student networks. (2) We introduce a central residual aggregation module (CRAM) to encapsulate the normal representation space of the student network. We evaluate our proposed approach on two anomaly detection datasets, MVTecAD and VisA. Our method surpasses the state-of-the-art distillation-based algorithms by a significant margin of 3.9% and 1.5% on MVTecAD and 1.2% and 2.5% on VisA in the multi-class anomaly detection and localization tasks, respectively. Furthermore, our algorithm outperforms the current state-of-the-art unified models on both MVTecAD and VisA.

Figure 1: We visualize the performance degradation of one-class teacher-student networks, RD [7] (left) and FD [23] (right), in the multi-class anomaly detection task on MVTecAD. Our structural normality learning (SNL) strategy on the teacher-student model shows significant improvement of multi-class anomaly detection and localization on both methods. Besides, SNL can also boost the performance on one-class cases.
Figure 2: (a) Demonstration of cross-class interference in multi-class anomaly detection. (b) Empirical analysis of cross-class interference. We generated mixtures as anomaly samples from the “hazelnut” and “screw” of MVTecAD via mixup. FD [23] and RD [7] show no discrepancy in the anomaly scores, whereas our models exhibit significant differences. (c) Qualitative analysis of cross-class interference. We crop a small region from an image in the “cable” category and paste it onto an image in the “wood” category as an anomaly sample on MVTecAD. Both FD and RD fail to identify synthetic anomalous regions, whereas our models can locate the anomalies precisely.

1 Introduction

Visual anomaly detection is a pivotal open-set task in computer vision that aims to identify unknown anomalous patterns within normal data. The task is relevant to many real-world applications, including industrial defect detection [4, 34, 11], video surveillance [14, 1], and medical imaging diagnosis [21, 33]. Traditional anomaly detection approaches often train a separate model for each category. These models are trained on normal samples from their respective categories and can only detect anomalies within the context of that category. While one-class anomaly detection models have shown promise in these contexts [20, 23, 7, 12], their inherent limitation lies in the need to construct a separate model for each class, a paradigm that becomes increasingly inefficient as the number of categories grows. Recent work has highlighted multi-class anomaly detection as a pressing challenge that demands greater scalability and adaptability from anomaly detection models [28, 32]. In response, we propose a scalable solution for multi-class anomaly detection and localization, where one model can identify anomalies of multiple classes.

Feature reconstruction is one of the most influential paradigms in anomaly detection, valued for its robustness and effectiveness. Teacher-student networks are a natural approach to feature reconstruction, in which the student network predicts the outputs of the teacher network [10]. In particular, multi-scale distillation has been proposed to achieve superior anomaly detection performance by accumulating feature differences between teachers and students under multiple receptive fields [5, 20, 23]. Recently, by exposing the over-generalization problem in the forward distillation paradigm, reverse distillation was proposed as a novel paradigm and achieves SOTA performance in one-class anomaly detection scenarios [7]. However, we observe substantial performance degradation for both forward distillation [20, 23] and reverse distillation [7] in multi-class anomaly detection, as shown in Fig. 1. Therefore, we propose the cross-class interference hypothesis in Fig. 2(a), whereby the generalization of the anomaly detection model across different categories causes the model to be somewhat tolerant towards anomalies.

To empirically assess the impact of cross-class interference on anomaly detection, we conduct two straightforward experiments. In the first experiment, we use the Mixup [31] technique to superimpose two images belonging to different classes, creating a mixture that should be considered an anomalous image. We then conduct a statistical analysis of the image-level anomaly scores. As illustrated in Fig. 2(b), both forward distillation (FD) [23] and reverse distillation (RD) [7] fail to distinguish between mixture and normal images when trained on a multi-class dataset. In the second experiment, we employ a CutPaste-like [13] anomaly synthesis on images originating from two distinct classes. The synthesized irregularity should be distinguishable by an effective anomaly detection model [13]. However, as shown in Fig. 2(c), when trained under the multi-class setting, both FD and RD models are unable to identify the anomalous region within the synthesized image. These experiments demonstrate the detrimental influence of cross-class interference on the performance of teacher-student networks in multi-class anomaly detection and localization.
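The two probes described above can be sketched as follows. This is only an illustrative sketch: the function names `mixup` and `cut_paste`, the fixed mixing weight `alpha=0.5`, the patch size, and the random arrays standing in for "hazelnut" and "screw" images are all assumptions, since the text does not specify these details.

```python
import numpy as np


def mixup(img_a, img_b, alpha=0.5):
    """Superimpose two same-shaped images; alpha (assumed) weights img_a."""
    return alpha * img_a + (1.0 - alpha) * img_b


def cut_paste(src, dst, top, left, size, dst_top, dst_left):
    """Paste a size x size patch of src into a copy of dst.

    Returns the synthesized image and a boolean mask of the pasted region.
    """
    out = dst.copy()
    patch = src[top:top + size, left:left + size]
    out[dst_top:dst_top + size, dst_left:dst_left + size] = patch
    mask = np.zeros(dst.shape[:2], dtype=bool)
    mask[dst_top:dst_top + size, dst_left:dst_left + size] = True
    return out, mask


rng = np.random.default_rng(0)
# Random arrays stand in for images of two different classes (assumption).
hazelnut = rng.random((64, 64, 3))
screw = rng.random((64, 64, 3))

mixture = mixup(hazelnut, screw, alpha=0.5)
pasted, mask = cut_paste(hazelnut, screw, top=0, left=0, size=16,
                         dst_top=24, dst_left=24)
```

Both outputs would be scored as anomalous by a detector that is not affected by cross-class interference; the mask gives the region that a localization map should highlight.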

Evidently, the issue of cross-class interference in multi-class anomaly detection arises from shortcomings in previous teacher-student reconstruction networks, a concern that is less prominent in one-class anomaly detection. On the one hand, previous methods primarily train the student network to learn local features from the teacher network without fostering correlations between these features. The absence of such correlations hinders student networks from effectively discerning structural feature differences between the subject and potential anomalies within a sample. Therefore, we propose structural distillation, enabling student networks to discern and capture pairwise feature disparities from teacher networks. Specifically, our structural distillation consists of spatial-channel and intra-&inter-affinity distillation, which represent separate and pairwise feature distances, respectively. On the other hand, the lack of normality constraints leads to weak compactness of multi-class normal representations within teacher-student networks. To tackle this issue, we propose the Central Residual Aggregation Module (CRAM), which is plugged into the student network. CRAM facilitates the learning of compact normality features by aggregating residual projections of student features relative to multiple normality centers. Notably, our multi-class anomaly detection model demonstrates excellent discriminative ability in the experiments presented in Fig. 2. Overall, we propose Structural Teacher-Student Normality Learning (SNL) to address the problem of cross-class interference that hampers the effectiveness of knowledge distillation in multi-class anomaly detection. Our approach generalizes to previous teacher-student networks and improves their performance by a large margin in multi-class anomaly detection and localization. Furthermore, our approach surpasses SOTA on the MVTecAD and VisA datasets.
Our main contributions are summarized as follows:

  • We conduct an in-depth analysis to identify the presence of cross-class interference, which leads to the degradation observed in teacher-student networks when applied to multi-class anomaly detection and localization.

  • To tackle this issue, we propose a structural teacher-student network that learns separate and pairwise feature similarities by spatial-channel and intra-&inter-affinity distillation.

  • We propose CRAM, which is integrated into the student network to learn a compact normality representation, thereby enhancing the model’s sensitivity to cross-class anomalies.

  • Extensive experiments on the MVTecAD and VisA datasets show that our approach improves dramatically over the baseline and also outperforms the state-of-the-art unified models.

2 Related Work

Distillation-based Anomaly Detection:

Reconstruction is the typical paradigm for anomaly detection, e.g., pixel-level structural reconstruction for industrial defect detection [3]. Feature-level reconstruction exhibits impressive performance due to the powerful representation capabilities of pre-trained models [25]. Teacher-student networks, which use a student network to reconstruct the features of a teacher network, are a natural reconstruction paradigm and have been widely studied for anomaly detection. Uninformed student is the first teacher-student network based anomaly detection method [5]. It trains a student network on normal samples to distill from a discriminative teacher network and then detects anomalies by teacher-student differences. Multi-scale knowledge distillation [20, 23] was proposed to train a student network to reconstruct the multi-scale features of the teacher network, which is derived from a network pre-trained on ImageNet [8] with a rich semantic space. In particular, [20] utilizes the disparate gradients generated by the model on novel features to detect anomalies, and [23] utilizes pyramid reconstruction errors to detect anomalies. In this study, we refer to these classical teacher-student networks [20, 23] as forward distillation. Previous studies have found that forward distillation suffers from anomalous leakage to student networks, whereby more powerful student networks overgeneralize to anomalous representations and thus degrade performance [20, 23]. To address this issue, reverse distillation has been proposed to reconstruct shallow multi-scale features progressively from deep features using the student network [7], which brings teacher-student networks to state-of-the-art performance in anomaly detection and localization. Previous approaches have achieved impressive performance on one-class anomaly detection; however, degradation occurs on multi-class anomaly detection.
In this study, we aim to achieve high performance in multi-class anomaly detection and localization using teacher-student networks.

Normality Learning:

DeepSVDD [19] is a one-class normality learning algorithm that detects outliers by training a compact support space for a normality center. Subsequently, the sparse memory mechanism [9] and the compact memory module [17] were proposed for learning normality reconstruction. To achieve few-shot adaptation, dynamic normality learning was proposed to project normal prototypes onto a given feature space [16]. Recently, CFA [12] proposed coupled-hypersphere-based feature adaptation to learn normal centers for one-class anomaly detection. In this study, we aim to present a normality learning module that is adaptable to multi-class settings and sensitive to anomalies.

Figure 3: Overview of our structural teacher-student framework. Left: during training with normal samples, our structural distillation quantifies and minimizes the difference between channel-wise features, spatial-wise features, intra-affinity and inter-affinity metrics for the $k$th block of the teacher-student network. Right: during testing, for query samples, we measure the local and structural differences respectively by the channel-wise feature distance and intra-affinity distances of the teacher-student network for anomaly detection.

Multi-class Anomaly Detection:

UniAD [28] first formulates the task of multi-class visual anomaly detection and proposes a transformer-based feature reconstruction model. In addition, UniAD proposes a layer-wise query in the transformer to learn the complex normal distribution of multiple categories. OmniAL [32] proposes a panel-guided approach to synthesize anomalies and trains reconstruction and discriminative networks on the synthesized anomaly samples to localize the anomalies. Although the synthetic anomaly approaches [30, 32] provide excellent anomaly localization precision on specific datasets, they require a priori knowledge of the anomalies in the dataset. We instead follow the common definition of anomalies as unknown, so that the model can be sensitive to all kinds of anomalies.

3 Methodology

Problem Definition:

For multi-class anomaly detection, we follow a unified setting where the images are from different classes and the category information is inaccessible [28]. Let $L_{train} = \{I_{normal}^{1}, \dots, I_{normal}^{n}\}$ denote the set of $n$ anomaly-free training samples from $C$ potential categories. The inference set is then defined as $L_{test} = \{I_{unknown}^{1}, \dots, I_{unknown}^{m}\}$, which includes $m$ query images from the same $C$ classes. Notably, the training set $L_{train}$ includes only normal samples, whereas the test set $L_{test}$ includes both normal and unknown anomalous samples. We aim to achieve a model that can detect anomalous images and localize the anomalous regions in multiple categories.

Preliminaries:

Lately, the teacher-student networks have made significant strides in advancing anomaly detection [20, 23, 7]. In this paradigm, we begin with a pre-trained teacher network capable of extracting rich and discriminative features from images. We train a student network on normal samples to learn and reconstruct these features from the teacher network. This process is commonly referred to as knowledge distillation [10]. Subsequently, we use the feature reconstruction errors on query samples to detect anomalies. However, when applied to multi-class anomaly detection, the teacher-student network suffers from degradation, resulting in weak performance. As highlighted earlier, our observations indicate that this performance degradation is attributed to cross-class interference, a phenomenon that affects the model’s ability to differentiate anomalies in diverse classes. To overcome this issue, we introduce structural teacher-student normality learning as a novel framework for multi-class anomaly detection and localization. In this section, we present the proposed methodology as follows: (1) structural knowledge distillation, (2) central residual aggregation module for normality learning, and (3) scoring for anomaly detection and localization. These elements collectively form the foundation of our approach, which aims to address the challenge of multi-class anomaly detection by mitigating the impact of cross-class interference.

3.1 Structural Distillation for Anomaly Detection

The teacher-student network consists of a frozen pre-trained teacher model and a trainable student model. In particular, we follow previous work in using the same network architecture and distilling hierarchical knowledge for the teacher-student network [23, 7]. Formally, let $F^{k}_{t} \in \mathbb{R}^{D^{k} \times H^{k} \times W^{k}}$ and $F^{k}_{s} \in \mathbb{R}^{D^{k} \times H^{k} \times W^{k}}$ denote the feature tensors of the $k$th block of the teacher and student models, respectively.
For notational consistency, this paper uses $F_{i}^{k}(:,h,w) \in \mathbb{R}^{D^{k} \times 1}$ to denote the 1-D channel-wise feature at location $(h,w)$ of the feature tensor, and $F_{i}^{k}(d,:,:) \in \mathbb{R}^{H^{k} \times W^{k}}$ to represent the 2-D spatial feature map in channel $d$, where $i \in \{t,s\}$.

During training, the tensor $F^{k}_{t}$ extracted from $I_{normal}$ is treated as the learning target. We then optimize the student network to produce a reconstructed feature tensor $F^{k}_{s}$ that is close to the target tensor $F^{k}_{t}$. Following previous works [20, 23, 10], we compute the channel-wise feature distance along the channel axis for the $k$th teacher-student block:

$$M^{k}(h,w) = 1 - \frac{(F^{k}_{t}(:,h,w))^{T} \cdot F^{k}_{s}(:,h,w)}{\|F^{k}_{t}(:,h,w)\|_{2} \cdot \|F^{k}_{s}(:,h,w)\|_{2}}, \tag{1}$$

where $\|\cdot\|_{2}$ is the L2 norm. By calculating the cosine distance along the channel axis in (1), we obtain a 2-D distance map $M^{k} \in \mathbb{R}^{H^{k} \times W^{k}}$. Considering the hierarchical knowledge distillation, the channel-wise distillation loss is defined as the aggregation of the multi-scale channel-wise distance maps:

$$\mathcal{L}_{cd} = \sum_{k=1}^{K} \left[ \frac{1}{H^{k} W^{k}} \sum_{h=1}^{H^{k}} \sum_{w=1}^{W^{k}} M^{k}(h,w) \right], \tag{2}$$

where $K$ denotes the number of blocks in both teacher and student networks. Note that previous knowledge distillation methods for anomaly detection are typically performed using the channel-wise distance in (2) [20, 23, 10].
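As a minimal NumPy sketch of Eqs. (1) and (2), the distance map and the channel-wise loss could be computed as below. The function names, the small `eps` stabilizer, and the random feature tensors standing in for teacher and student blocks are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def cosine_distance_map(f_t, f_s, eps=1e-8):
    """Eq. (1): per-location cosine distance between teacher and student.

    f_t, f_s: arrays of shape (D, H, W). Returns a map of shape (H, W).
    """
    dot = np.sum(f_t * f_s, axis=0)
    norms = np.linalg.norm(f_t, axis=0) * np.linalg.norm(f_s, axis=0)
    return 1.0 - dot / (norms + eps)  # eps (assumed) avoids division by zero


def channel_distillation_loss(teacher_feats, student_feats):
    """Eq. (2): sum over blocks of the spatially averaged distance map."""
    return sum(
        cosine_distance_map(f_t, f_s).mean()
        for f_t, f_s in zip(teacher_feats, student_feats)
    )


rng = np.random.default_rng(0)
# Hypothetical multi-scale features for K = 3 blocks, each of shape (D, H, W).
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]
teacher = [rng.standard_normal(s) for s in shapes]
student = [rng.standard_normal(s) for s in shapes]
loss_cd = channel_distillation_loss(teacher, student)
```

When the student reproduces the teacher exactly, every entry of the distance map is close to zero, so the loss vanishes; at test time the same map can serve as a per-location reconstruction error.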

Apart from encouraging channel-wise feature consistency, we add spatial feature matching for activation map alignment. Spatial feature distillation has the student network learn the teacher network's features along the spatial feature map of each channel. We use the KL divergence for spatial-wise distillation:

$$\mathcal{L}_{sd} = \sum_{k=1}^{K} \sum_{d=1}^{D^{k}} \Phi(F_{t}^{k}(d,:,:)) \cdot \log \frac{\Phi(F_{t}^{k}(d,:,:))}{\Phi(F_{s}^{k}(d,:,:))}, \tag{3}$$

where $\Phi(\cdot)$ denotes the probability value:

$$\Phi(F^{k}(d,h,w)) = \frac{\exp(F^{k}(d,h,w))}{\sum_{h'=1}^{H^{k}} \sum_{w'=1}^{W^{k}} \exp(F^{k}(d,h',w'))}. \tag{4}$$

Unlike the channel-wise loss in (2), which focuses on local consistency, minimizing the KL divergence of the feature maps between teacher-student models in (3) encourages the global alignment of spatial activations. By spatial-channel distillation via jointly optimizing $\mathcal{L}_{sd}$ and $\mathcal{L}_{cd}$, we allow the teacher-student network to maintain better local-global consistency.
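A minimal NumPy sketch of the spatial-wise loss in Eqs. (3) and (4) follows. It assumes that the product in (3) is summed over all spatial positions of each channel, which is the usual reading of a KL divergence; the function names, the `eps` term inside the logarithms, and the random feature tensors are likewise illustrative assumptions.

```python
import numpy as np


def spatial_softmax(f):
    """Eq. (4): softmax over the H*W positions of each channel.

    f: array of shape (D, H, W). Returns probabilities of shape (D, H*W).
    """
    flat = f.reshape(f.shape[0], -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability only
    e = np.exp(flat)
    return e / e.sum(axis=1, keepdims=True)


def spatial_distillation_loss(teacher_feats, student_feats, eps=1e-12):
    """Eq. (3): KL(teacher || student) summed over channels and blocks."""
    loss = 0.0
    for f_t, f_s in zip(teacher_feats, student_feats):
        p_t = spatial_softmax(f_t)
        p_s = spatial_softmax(f_s)
        loss += np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)))
    return loss


rng = np.random.default_rng(0)
# Hypothetical features for K = 2 blocks, each of shape (D, H, W).
shapes = [(8, 16, 16), (16, 8, 8)]
teacher = [rng.standard_normal(s) for s in shapes]
student = [rng.standard_normal(s) for s in shapes]
loss_sd = spatial_distillation_loss(teacher, student)
```

Because each channel's activation map is normalized into a distribution over locations, this term compares where a channel fires across the whole map rather than how well individual locations match.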

It should be noted that in multi-class anomaly detection, the normal distribution across multiple categories becomes significantly more complex than that in one-class scenarios. Due to the cross-class interference, the student model may have more freedom and stronger generalization capabilities in reconstructing abnormal features. While the introduced spatial-wise distillation can somewhat limit the student model’s ability to reconstruct globally abnormal features, it does not impose strong constraints on the reconstruction of locally abnormal features, resulting in the failures in Fig. 2. To address this issue, we propose structural information distillation. In human vision theory, image structural information describes the inter-dependency between pixels, and these dependencies typically carry crucial information related to objects and semantic understanding [15]. By encouraging alignment of the feature tensors' structural information in knowledge distillation, the student model’s ability to reconstruct features violating local normality can be significantly reduced. To this end, let $\mathcal{R}$ denote the reshape function, where $\mathcal{R}(F_{i}^{k}) \in \mathbb{R}^{D^{k} \times H^{k} W^{k}}$ for $i \in \{s,t\}$.
We use the affinity matrix to represent the structural relation of the feature map: $\mathcal{A}^{k}_{s} = \mathcal{R}(F_{s}^{k})^{T} \times \mathcal{R}(F_{s}^{k})$ and $\mathcal{A}^{k}_{t} = \mathcal{R}(F_{t}^{k})^{T} \times \mathcal{R}(F_{t}^{k})$, where L2 normalization is applied to scale $F_{s}^{k}$ and $F_{t}^{k}$. Then, the intra-affinity distillation objective function for the teacher-student network is:

$$\mathcal{L}_{intra} = \sum_{k=1}^{K} \sum_{i=1}^{H^{k} W^{k}} \sum_{j=1}^{H^{k} W^{k}} \|\mathcal{A}^{k}_{s}(i,j) - \mathcal{A}^{k}_{t}(i,j)\|_{2}. \tag{5}$$
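The intra-affinity term in Eq. (5) can be sketched in NumPy as follows. Two points are assumptions rather than statements from the text: L2 normalization is applied to the channel vector at each spatial location, so that each affinity entry is a cosine similarity, and the norm of a scalar entry in (5) is taken as its absolute value. The function names and the random features are also illustrative.

```python
import numpy as np


def affinity(f_a, f_b, eps=1e-8):
    """Affinity between all spatial locations of f_a and f_b.

    f_a, f_b: arrays of shape (D, H, W). Each location's channel vector is
    L2-normalized (assumed placement of the normalization), so entry (i, j)
    is the cosine similarity of location i in f_a and location j in f_b.
    """
    a = f_a.reshape(f_a.shape[0], -1)  # R(F): shape (D, H*W)
    b = f_b.reshape(f_b.shape[0], -1)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    return a.T @ b  # shape (H*W, H*W)


def intra_affinity_loss(teacher_feats, student_feats):
    """Eq. (5): entrywise distance between student and teacher affinities."""
    loss = 0.0
    for f_t, f_s in zip(teacher_feats, student_feats):
        a_t = affinity(f_t, f_t)
        a_s = affinity(f_s, f_s)
        loss += np.abs(a_s - a_t).sum()  # L2 norm of a scalar is |.|
    return loss


rng = np.random.default_rng(0)
# Hypothetical features for K = 2 blocks; small maps keep H*W x H*W cheap.
shapes = [(8, 6, 6), (16, 3, 3)]
teacher = [rng.standard_normal(s) for s in shapes]
student = [rng.standard_normal(s) for s in shapes]
loss_intra = intra_affinity_loss(teacher, student)
```

Note that the affinity matrix grows quadratically with the number of spatial locations, which is why the sketch uses small feature maps.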

In addition, we impose an external affinity distillation to keep the pairwise similarity consistent across samples. In particular, the training samples from different categories within a batch assist the student network in learning from the teacher about the discrepancies of representations between classes. Our cross affinity matrix is defined as: $\tilde{\mathcal{A}}^{k}_{s} = \mathcal{R}(F_{s}^{k})^{T} \times \mathcal{R}(\tilde{F}_{s}^{k})$ and $\tilde{\mathcal{A}}^{k}_{t} = \mathcal{R}(F_{t}^{k})^{T} \times \mathcal{R}(\tilde{F}_{t}^{k})$. Then the inter-affinity distillation loss is computed as:

$$\mathcal{L}_{inter}=\sum_{k}^{K}\sum_{i}^{H\cdot W}\sum_{j}^{H\cdot W}\|\mathcal{\tilde{A}}^{k}_{s}(i,j)-\mathcal{\tilde{A}}^{k}_{t}(i,j)\|_{2}. \tag{6}$$

Through this, the student network not only distills the pairwise similarity within the sample from the teacher network, but also learns categorical discrepancies. In summary, the structural distillation objective function for the teacher-student network is:

$$\mathcal{L}=\mathcal{L}_{cd}+\lambda_{1}\mathcal{L}_{sd}+\lambda_{2}\mathcal{L}_{intra}+\lambda_{3}\mathcal{L}_{inter}, \tag{7}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are hyper-parameters that weight the loss terms when optimizing the teacher-student network.
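As a rough illustration of Eqs. (5) to (7), the following NumPy sketch computes the affinity losses and the weighted objective. It is not the authors' implementation. The function names are hypothetical, the features are used without any normalization, $\tilde{F}$ is assumed to be the feature map of another sample in the same batch, and the norm of a scalar entry is read literally as its absolute value. The terms $\mathcal{L}_{cd}$ and $\mathcal{L}_{sd}$ are taken as given inputs.

```python
import numpy as np

def flatten_spatial(feat):
    """R(.): reshape a (D, H, W) feature map into a (D, H*W) matrix."""
    d, h, w = feat.shape
    return feat.reshape(d, h * w)

def affinity(feat_a, feat_b):
    """Pairwise affinity R(feat_a)^T x R(feat_b), shape (H*W, H*W)."""
    return flatten_spatial(feat_a).T @ flatten_spatial(feat_b)

def affinity_loss(student_feats, teacher_feats,
                  student_other=None, teacher_other=None):
    """Sum over blocks k and positions (i, j) of |A_s^k(i,j) - A_t^k(i,j)|.

    With the *_other arguments left as None this is the intra-affinity
    loss (Eq. 5). Passing the features of another sample (assumed to be
    what F-tilde denotes) gives the inter-affinity loss (Eq. 6).
    """
    if student_other is None:
        student_other = student_feats
    if teacher_other is None:
        teacher_other = teacher_feats
    total = 0.0
    for fs, ft, fs2, ft2 in zip(student_feats, teacher_feats,
                                student_other, teacher_other):
        total += np.abs(affinity(fs, fs2) - affinity(ft, ft2)).sum()
    return float(total)

def structural_objective(l_cd, l_sd, l_intra, l_inter,
                         lambdas=(1.0, 1.0, 1.0)):
    """Eq. 7: L = L_cd + l1 * L_sd + l2 * L_intra + l3 * L_inter."""
    l1, l2, l3 = lambdas
    return l_cd + l1 * l_sd + l2 * l_intra + l3 * l_inter
```

Each list passed to `affinity_loss` holds one `(D, H, W)` array per network block, so the outer sum over $k$ is the loop over list entries.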

3.2 Central Residual Aggregation Module

Figure 4: Overview of the proposed CRAM. We add CRAM after each CNN block of the baseline student networks [23, 7].

We introduce a Central Residual Aggregation Module (CRAM) that enables the student network to learn the normality pattern, as shown in Fig. 4. Our CRAM is derived from codebook learning [2, 26, 24] and includes learnable assignment parameters $\alpha$ and clustering centers $\mathcal{C}\in\mathbb{R}^{D^{k}\times N}$, where $N$ is the number of clusters. In addition, our CRAM changes the feature aggregation strategy for local feature reorganization. We assume that the normal centers represent a finite set of normal patterns and that unknown abnormal features lie far from the centers. The student model can learn compact normal representations through CRAM and thus produce stronger error responses to abnormal regions with respect to the teacher. Inside a CRAM student block, we denote $f_{s}^{k}\in\mathbb{R}^{D^{k}\times H^{k}\times W^{k}}$ as the features output by the CNN block.
Since the features come from multiple categorical distributions, we first align the unified centers to the current feature vector $f_{s}^{k}(:,h,w)$ through the residual calculation $r^{k}(:,h,w)=f_{s}^{k}(:,h,w)-\mathcal{C}$, with $r^{k}(:,h,w)\in\mathbb{R}^{D^{k}\times N}$, whose $n$-th column $r_{n}^{k}(:,h,w)$ is the residual to the $n$-th center. Then, the soft assignment used to aggregate the residuals is:

$$a_{n}=\frac{\exp(-\alpha\|r_{n}^{k}(:,h,w)\|^{2})}{\sum_{n}^{N}\exp(-\alpha\|r_{n}^{k}(:,h,w)\|^{2})}, \tag{8}$$

which is a softmax over the negative, $\alpha$-scaled squared distances between the feature and the centers. We obtain the compact representation of the student network by aggregation:

$$F_{s}^{k}(:,h,w)=\sum_{n}^{N}a_{n}r_{n}^{k}(:,h,w). \tag{9}$$

During the training phase, the learnable centers construct a normal feature space. When anomalies are encountered, the residuals between their features and the centers differ from those observed for normal features. Thus, the feature discrepancy between the student and teacher networks becomes more pronounced with CRAM normality learning.
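The soft assignment of Eq. (8) and the aggregation of Eq. (9) can be sketched in NumPy as follows. This is a minimal forward-pass illustration under stated assumptions, not the authors' module. The name `cram_aggregate` is hypothetical, $\alpha$ is treated as a single positive scalar, and both $\alpha$ and the centers are fixed arrays here, whereas in the paper they are learned.

```python
import numpy as np

def cram_aggregate(f, centers, alpha):
    """Forward pass of the residual aggregation for one block.

    f:       (D, H, W) student features from the CNN block
    centers: (D, N) cluster centers (learned in the paper, fixed here)
    alpha:   positive scalar (assumption: one shared value)
    Returns the aggregated features (D, H, W) and assignments (N, H, W).
    """
    d, h, w = f.shape
    x = f.reshape(d, 1, h * w)                    # (D, 1, HW)
    r = x - centers[:, :, None]                   # residuals r_n, (D, N, HW)
    sq_dist = (r ** 2).sum(axis=0)                # ||r_n||^2, (N, HW)
    logits = -alpha * sq_dist
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability only
    a = np.exp(logits)
    a /= a.sum(axis=0, keepdims=True)             # Eq. 8: softmax over centers
    agg = (a[None, :, :] * r).sum(axis=1)         # Eq. 9: weighted residual sum
    return agg.reshape(d, h, w), a.reshape(-1, h, w)
```

In this sketch a feature that coincides with a center aggregates to a residual near zero, while a feature far from every center keeps a large residual, which matches the intuition that abnormal features lie far from the normal centers.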

3.3 Scoring for Anomaly Detection & Localization

In the inference phase, we use the reconstruction error and the affinity error between the student and teacher networks as anomaly measures. The intuition is that, because we minimize the structural distillation loss, the outputs of the teacher and student networks are very similar for anomaly-free samples. When confronted with unknown features from abnormal samples, our model produces relatively larger errors. First, similar to previous work, we calculate the cosine similarity of the features at each location [20, 23, 7]. By Eq. (1), we obtain the distance map $M^{k}$ from the $k$-th block of the model. Then, the anomaly map from the separate feature distance, $\mathcal{M}_{fea}\in\mathbb{R}^{H\times W}$, is computed as:

$$\mathcal{M}_{fea}=\sum_{k}^{K}\mathbb{U}^{k}(M^{k}), \tag{10}$$

where $\mathbb{U}(\cdot)$ denotes the upsampling operation that resizes the anomaly map to the size of the input image $(H,W)$. In addition, to tackle cross-class interference, we use the intra-affinity error between the teacher and student features to measure anomaly scores:

$$\mathcal{E}=\|\mathcal{A}^{k}_{s}-\mathcal{A}^{k}_{t}\|_{2}, \tag{11}$$

where $\mathcal{E}\in\mathbb{R}^{H^{k}W^{k}\times H^{k}W^{k}}$. Then, the pairwise similarity difference map is computed as:

$$\mathcal{M}_{aff}=\mathcal{\tilde{R}}\left(\sum_{k}^{K}\mathbb{U}^{k}\left(\frac{1}{H^{k}W^{k}}\sum_{i}^{H^{k}W^{k}}\mathcal{E}(:,i)\right)\right), \tag{12}$$

where $\mathcal{\tilde{R}}(\cdot)\in\mathbb{R}^{H\times W}$ denotes the reshape operation. Then, the overall anomaly map is calculated as:

$$S_{AL}=\mathcal{M}_{fea}+\mathcal{M}_{aff}, \tag{13}$$

where $S_{AL}$ is the pixel-level anomaly map used to evaluate anomaly localization. Additionally, we use the strongest response in the anomaly map for anomaly detection. Thus, the image-level anomaly score is calculated as:

$$S_{AD}=\max(S_{AL}), \tag{14}$$

where $\max(\cdot)$ returns the maximum value of the anomaly map.
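The scoring pipeline of Eqs. (10) to (14) can be sketched as follows. This is an illustrative NumPy version under several assumptions rather than the authors' code. The distance map $M^{k}$ is taken to be one minus the cosine similarity, since Eq. (1) is not restated in this section. Nearest-neighbour resizing stands in for the upsampling $\mathbb{U}^{k}$, whose interpolation mode is not specified here. The affinity error is reshaped before upsampling, which is done only for implementation convenience. All function names are hypothetical.

```python
import numpy as np

def cosine_distance_map(fs, ft, eps=1e-8):
    """Assumed form of M^k: 1 - cosine similarity at each location."""
    num = (fs * ft).sum(axis=0)
    den = np.linalg.norm(fs, axis=0) * np.linalg.norm(ft, axis=0) + eps
    return 1.0 - num / den                          # (h, w)

def affinity_error_map(fs, ft):
    """Eq. 11 and the inner mean of Eq. 12 for one block."""
    d, h, w = fs.shape
    s, t = fs.reshape(d, -1), ft.reshape(d, -1)
    err = np.abs(s.T @ s - t.T @ t)                 # E, (hw, hw)
    return err.mean(axis=1).reshape(h, w)           # average over columns i

def upsample_nearest(m, size):
    """Stand-in for U^k: nearest-neighbour resize to (H, W)."""
    h, w = m.shape
    big_h, big_w = size
    rows = np.arange(big_h) * h // big_h
    cols = np.arange(big_w) * w // big_w
    return m[np.ix_(rows, cols)]

def anomaly_scores(student_feats, teacher_feats, size):
    """Return the pixel-level map S_AL (Eq. 13) and image score S_AD (Eq. 14)."""
    m_fea = np.zeros(size)
    m_aff = np.zeros(size)
    for fs, ft in zip(student_feats, teacher_feats):
        m_fea += upsample_nearest(cosine_distance_map(fs, ft), size)   # Eq. 10
        m_aff += upsample_nearest(affinity_error_map(fs, ft), size)    # Eq. 12
    s_al = m_fea + m_aff
    return s_al, float(s_al.max())
```

With identical student and teacher features both maps are close to zero everywhere, and a mismatch at a single feature location raises the score in the corresponding image region.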

4 Experiment

| MVTecAD | Bottle | Cable | Capsule | Carpet | Grid | Hazelnut | Leather | Metal nut | Pill | Screw | Tile | Toothbrush | Transistor | Wood | Zipper | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniAD [28] | 99.7/100 | 95.2/97.6 | 86.9/85.3 | 99.8/99.9 | 98.2/98.5 | 99.8/99.9 | 100/100 | 99.2/99.0 | 93.7/88.3 | 87.5/91.9 | 99.3/99.0 | 94.2/95.0 | 99.8/100 | 98.6/97.9 | 95.8/96.7 | 96.5/96.6 |
| OmniAL [32] | 100/99.4 | 98.2/97.6 | 95.2/92.4 | 98.7/99.6 | 99.9/100 | 95.6/98.0 | 99.0/97.6 | 99.2/99.9 | 97.2/97.7 | 88.0/81.0 | 99.6/100 | 100/100 | 93.8/93.8 | 93.2/98.7 | 100/100 | 97.2/97.0 |
| FD [23] | 76.8/100 | 96.5/95.1 | 78.9/75.8 | 96.5/99.4 | 98.8/99.1 | 99.3/100 | 96.2/97.3 | 97.9/99.4 | 91.3/94.1 | 76.8/93.0 | 99.8/100 | 88.9/99.7 | 96.2/96.6 | 99.4/99.6 | 79.7/88.7 | 91.5/95.9 |
| SNL (FD) | 100/100 | 98.1/99.6 | 91.7/97.4 | 99.9/100 | 99.2/99.9 | 100/100 | 100/100 | 99.9/100 | 95.8/99.1 | 87.6/95.3 | 100/99.6 | 96.7/92.5 | 98.4/99.6 | 99.7/99.4 | 99.1/97.8 | $\underline{97.7}/\underline{98.7}$ |
| RD [7] | 66.5/100 | 79.3/95.0 | 93.6/96.3 | 97.0/98.9 | 99.0/100 | 100/99.9 | 100/100 | 99.3/100 | 95.0/96.6 | 96.5/97.0 | 98.7/99.3 | 99.1/99.5 | 92.9/96.7 | 99.4/99.2 | 99.1/98.5 | 94.4/98.5 |
| SNL (RD) | 100/100 | 94.2/99.1 | 95.4/97.7 | 98.6/99.3 | 99.2/100 | 100/100 | 100/100 | 100/100 | 95.8/97.9 | 96.6/98.1 | 100/99.7 | 99.4/99.4 | 95.2/99.6 | 99.6/99.2 | 99.7/98.9 | $\mathbf{98.3}/\mathbf{99.3}$ |

Table 1: Unified anomaly detection results with image-level AUROC on MVTecAD. The multi-class/one-class performance is reported for each method. The best mean outcome is noted in bold and the runner-up mean outcome is underlined.

| VisA | PCB1 | PCB2 | PCB3 | PCB4 | Macaroni1 | Macaroni2 | Capsules | Candles | Cashew | Chewing gum | Fryum | Pipe fryum | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniAD [28] | 95.4/95.9 | 93.6/90.5 | 90.2/91.0 | 99.4/98.1 | 93.1/93.4 | 85.5/85.3 | 75.3/79.1 | 96.4/94.5 | 92.4/93.6 | 99.4/98.4 | 90.8/89.3 | 97.4/97.9 | 92.4/92.3 |
| OmniAL [32] | 77.7/96.6 | 81.0/99.4 | 88.1/96.9 | 95.3/97.4 | 92.6/96.9 | 75.2/89.9 | 90.6/87.9 | 86.8/85.1 | 88.6/97.1 | 96.4/94.9 | 94.6/97.0 | 86.1/91.4 | 87.8/94.2 |
| FD [23] | 87.1/93.8 | 79.0/89.3 | 79.0/84.1 | 95.6/96.7 | 87.0/93.4 | 69.3/83.9 | 69.4/85.2 | 91.6/96.4 | 92.5/98.8 | 95.5/96.8 | 94.5/99.5 | 92.6/99.1 | 86.1/93.0 |
| SNL (FD) | 94.1/95.0 | 92.3/93.8 | 90.3/95.0 | 99.3/97.4 | 92.2/93.7 | 73.2/88.9 | 72.3/86.9 | 94.9/96.6 | 93.8/99.0 | 96.0/99.3 | 94.6/99.5 | 94.9/99.2 | 90.7/95.3 |
| RD [7] | 95.9/97.1 | 94.4/97.0 | 92.3/96.4 | 99.7/99.8 | 97.8/97.3 | 85.6/98.6 | 76.8/89.5 | 94.2/94.3 | 92.6/97.6 | 90.8/98.4 | 95.9/96.2 | 97.2/94.6 | $\underline{92.8}/\underline{96.4}$ |
| SNL (RD) | 98.1/97.6 | 94.8/96.4 | 95.0/97.3 | 99.9/99.9 | 96.8/98.2 | 84.3/91.9 | 76.1/91.3 | 94.7/95.3 | 95.4/97.8 | 97.6/98.9 | 95.9/96.5 | 99.5/99.9 | $\mathbf{94.0}/\mathbf{96.8}$ |

Table 2: Unified anomaly detection results with image-level AUROC on VisA.

| MVTecAD | Bottle | Cable | Capsule | Carpet | Grid | Hazelnut | Leather | Metal nut | Pill | Screw | Tile | Toothbrush | Transistor | Wood | Zipper | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniAD [28] | 98.1/98.1 | 97.3/96.8 | 98.5/97.9 | 98.5/98.0 | 98.2/98.5 | 96.5/94.6 | 98.8/98.3 | 94.8/95.7 | 95.0/95.1 | 98.3/97.3 | 91.8/91.8 | 98.4/97.8 | 97.9/98.7 | 93.2/93.4 | 96.8/96.0 | 96.8/96.6 |
| OmniAL [32] | 99.2/99.0 | 97.3/97.1 | 96.9/92.2 | 99.4/99.6 | 99.4/99.6 | 98.4/98.6 | 99.3/99.7 | 99.1/99.1 | 98.9/98.6 | 98.0/97.2 | 99.0/99.4 | 99.4/99.2 | 93.3/91.7 | 97.4/96.9 | 99.5/99.7 | $\mathbf{98.3}/\underline{97.8}$ |
| FD [23] | 97.1/98.8 | 96.5/95.8 | 94.8/98.6 | 98.2/99.0 | 97.6/99.0 | 98.6/98.6 | 98.8/99.1 | 94.4/97.2 | 96.8/97.6 | 87.6/98.8 | 95.5/96.9 | 98.1/99.0 | 92.4/81.9 | 94.0/96.5 | 96.5/98.8 | 95.8/97.0 |
| SNL (FD) | 98.2/99.0 | 97.6/97.6 | 98.1/97.1 | 98.3/99.0 | 97.9/98.7 | 98.9/98.8 | 99.0/99.5 | 96.3/97.5 | 98.4/98.7 | 97.8/98.9 | 94.8/96.1 | 98.5/98.5 | 95.5/94.2 | 94.6/94.7 | 97.3/98.0 | 97.4/$\underline{97.8}$ |
| RD [7] | 92.1/98.7 | 84.7/97.4 | 98.4/98.7 | 98.8/98.9 | 99.1/99.3 | 99.0/98.9 | 99.3/99.4 | 93.5/97.3 | 98.5/98.2 | 99.2/99.6 | 95.9/95.6 | 98.9/99.1 | 87.6/92.5 | 96.1/95.3 | 98.4/98.2 | 96.0/$\underline{97.8}$ |
| SNL (RD) | 98.3/98.6 | 94.1/97.8 | 98.7/98.4 | 98.5/99.3 | 99.0/99.1 | 99.2/99.3 | 99.0/99.4 | 97.1/97.8 | 98.7/98.7 | 99.3/99.5 | 95.5/95.9 | 98.9/99.0 | 91.6/94.0 | 95.7/95.6 | 98.5/97.8 | $\underline{97.5}/\mathbf{98.0}$ |

Table 3: Unified anomaly localization results with pixel-level AUROC on MVTecAD.

| VisA | PCB1 | PCB2 | PCB3 | PCB4 | Macaroni1 | Macaroni2 | Capsules | Candles | Cashew | Chewing gum | Fryum | Pipe fryum | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniAD [28] | 99.3/99.2 | 97.9/96.7 | 98.4/98.0 | 97.9/98.8 | 99.3/98.9 | 98.0/97.1 | 98.3/98.6 | 99.1/98.9 | 98.5/99.2 | 99.1/98.5 | 97.6/97.8 | 99.1/99.4 | $\underline{98.5}/\underline{98.4}$ |
| OmniAL [32] | 97.6/98.7 | 93.9/83.2 | 94.7/98.4 | 97.1/98.5 | 98.6/98.9 | 97.9/99.1 | 99.4/98.6 | 95.8/90.5 | 95.0/98.9 | 99.0/98.7 | 92.1/89.3 | 98.2/99.1 | 96.6/96.0 |
| FD [23] | 99.2/99.7 | 97.3/97.7 | 97.8/98.1 | 98.3/97.8 | 97.9/98.7 | 98.1/98.7 | 96.7/97.9 | 98.9/97.1 | 76.0/99.0 | 97.9/98.3 | 96.6/95.6 | 98.8/98.6 | 96.1/98.1 |
| SNL (FD) | 99.6/99.7 | 98.6/98.3 | 98.5/98.6 | 98.5/98.8 | 98.7/99.0 | 98.3/99.0 | 97.9/97.9 | 99.1/97.8 | 98.4/99.1 | 98.0/98.5 | 97.7/92.5 | 99.0/99.1 | $\underline{98.5}$/98.2 |
| RD [7] | 99.5/99.7 | 97.8/98.0 | 98.6/99.3 | 98.2/98.3 | 99.6/99.6 | 99.1/99.4 | 98.0/99.6 | 98.7/98.5 | 75.2/93.5 | 93.9/98.2 | 97.5/97.1 | 99.2/99.3 | 96.2/$\underline{98.4}$ |
| SNL (RD) | 99.7/99.7 | 98.0/98.3 | 98.2/99.1 | 98.3/98.7 | 99.6/99.6 | 99.2/99.4 | 98.0/99.4 | 99.2/98.7 | 98.4/95.8 | 98.4/97.7 | 97.5/96.8 | 99.3/99.3 | $\mathbf{98.7}/\mathbf{98.5}$ |

Table 4: Unified anomaly localization results with pixel-level AUROC on VisA.

4.1 Dataset

We evaluate the proposed method on two large-scale visual anomaly detection datasets: MVTecAD [4] and VisA [34]. MVTecAD is a comprehensive anomaly detection benchmark that includes 10 object categories and 5 texture categories. In total, there are 3629 normal images for training and 1725 unknown images for testing. VisA contains object images with more complex structures and multiple instances. It includes 9621 normal images and 1200 anomalous images from 12 categories. Note that we train a unified model on data from all categories and evaluate the anomaly detection performance across categories.

4.2 Implementation Details

All images in our experiments are resized to $256\times 256$ and normalized by the mean and variance of ImageNet [8]. By default, we use WideResNet-50 [29] as the backbone, where the teacher network loads the model pre-trained on ImageNet. For forward distillation, our framework is based on STPM [23], which uses the same architecture for the teacher and the student. For reverse distillation, we follow the structure used in RD [7]. We apply the same processing and training strategies to the FD-based and RD-based methods. During training, we use the Adam optimizer with a learning rate of $0.005$ and a batch size of 8. For the hyper-parameters, we simply set $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$ and the cluster number $N=50$. The experiments are implemented in PyTorch [18] on a single RTX 3090 GPU. Following UniAD [28], we use image-level and pixel-level Area Under the Receiver Operating Characteristic curve (AUROC) to evaluate anomaly detection and localization performance.
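For readers unfamiliar with the metric, AUROC can be computed from ranks alone. The sketch below is a generic NumPy illustration and not the evaluation code used in the experiments. The function name `auroc` is hypothetical. Under the usual convention, the image-level score would be computed from one $S_{AD}$ value per test image, and the pixel-level score from the flattened $S_{AL}$ maps against the ground-truth masks.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) statistic.

    scores: anomaly scores, higher means more anomalous
    labels: 1 for anomalous, 0 for normal (same number of elements)
    Tied scores receive their average rank.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs both normal and anomalous samples")
    # Average rank of each distinct score value (ranks start at 1).
    _, inverse, counts = np.unique(scores, return_inverse=True,
                                   return_counts=True)
    last_rank = np.cumsum(counts)
    avg_rank = last_rank - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))
```

A value of 1.0 means every anomalous sample scores above every normal one, and 0.5 corresponds to chance-level ranking.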

Columns 1 to 4: learning objective; column 5: module; columns 6 and 7: evaluation.

| $\mathcal{L}_{cd}$ | $\mathcal{L}_{sd}$ | $\mathcal{L}_{intra}$ | $\mathcal{L}_{inter}$ | CRAM | image-level | pixel-level |
|---|---|---|---|---|---|---|
| ✓ | - | - | - | - | 94.4 | 96.0 |
| ✓ | ✓ | - | - | - | 94.7 | 96.4 |
| ✓ | ✓ | ✓ | - | - | 97.4 | 97.0 |
| ✓ | ✓ | ✓ | ✓ | - | 97.7 | 97.1 |
| ✓ | ✓ | - | - | ✓ | 97.1 | 97.3 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 98.3 | 97.5 |

Table 5: Ablation studies of proposed methods. The evaluation is based on RD [7] and all results are evaluated on MVTecAD.

Figure 5: Visualization of our approach and baseline RD [7] in various anomaly scenarios of MVTecAD and VisA.

4.3 Main Results

The unified models, UniAD [28] and OmniAL [32], which are specifically proposed for multi-class anomaly detection, are the primary comparators of our approach. In addition, we reproduce the performance of forward distillation (FD) [23] and reverse distillation (RD) [7] on multi-class anomaly detection as the baselines. We include a detailed table comparing the performance of other models on multi-class anomaly detection in the supplement [5, 27, 6, 13, 20, 30]. We evaluate anomaly detection with image-level AUROC in Tables 1&2 and anomaly localization with pixel-level AUROC in Tables 3&4 for the MVTecAD and VisA datasets. Notably, we not only report the results of multi-class anomaly detection, but also evaluate on one-class anomaly detection as a reference.

Anomaly Detection.

We show the anomaly detection results on MVTecAD in Tab. 1. FD and RD degrade noticeably in multi-class anomaly detection. In contrast, our proposed SNL improves the performance of FD and RD by 6.2% and 3.9%, respectively. Similarly, as shown in Tab. 2, performance degradation is observed on the VisA dataset, while SNL achieves improvements of 4.6% and 1.2% over FD and RD. Remarkably, we outperform the reconstruction-based SOTA method UniAD by 1.8% on MVTecAD and 1.6% on VisA. Compared with OmniAL, a unified model based on anomaly synthesis, our method gains a modest 1.1% on MVTecAD and a large 6.2% on VisA. For one-class tasks, SNL built on FD and RD achieves the two best results, which shows the generalizability of our method.
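The margins quoted above follow directly from the Mean columns of Tables 1 and 2. The short check below copies the multi-class mean image-level AUROC values from those tables and recomputes the differences. The dictionary layout and the helper name `gain` are illustrative only.

```python
# Multi-class mean image-level AUROC, copied from Tables 1 and 2.
mean_auroc = {
    "MVTecAD": {"UniAD": 96.5, "OmniAL": 97.2, "FD": 91.5,
                "SNL(FD)": 97.7, "RD": 94.4, "SNL(RD)": 98.3},
    "VisA": {"UniAD": 92.4, "OmniAL": 87.8, "FD": 86.1,
             "SNL(FD)": 90.7, "RD": 92.8, "SNL(RD)": 94.0},
}

def gain(dataset, ours, other):
    """Difference in mean AUROC (percentage points), rounded to one decimal."""
    table = mean_auroc[dataset]
    return round(table[ours] - table[other], 1)

for dataset in mean_auroc:
    print(dataset,
          "SNL(FD) vs FD:", gain(dataset, "SNL(FD)", "FD"),
          "| SNL(RD) vs RD:", gain(dataset, "SNL(RD)", "RD"),
          "| SNL(RD) vs UniAD:", gain(dataset, "SNL(RD)", "UniAD"),
          "| SNL(RD) vs OmniAL:", gain(dataset, "SNL(RD)", "OmniAL"))
```

The comparisons with UniAD and OmniAL use SNL(RD), the stronger of the two SNL variants.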

Anomaly Localization.

As shown in Tabs. 3&4, our SNL significantly boosts the anomaly localization capability of FD and RD on MVTecAD and VisA. Compared to UniAD, our SNL(RD) obtains an increase of 0.7% and 0.2% on MVTecAD and VisA, respectively. Although the localization performance of our method is slightly lower than that of OmniAL on MVTecAD, we outperform OmniAL by 2.1% on VisA. Anomaly synthesis lacks generalization ability because it requires certain a priori knowledge. Therefore, a potential direction for future work is to fuse anomaly synthesis into teacher-student networks to enhance localization ability, as in [22]. Furthermore, our SNL(RD) achieves the best one-class anomaly localization performance.

Visualization.

We show in Fig. 5 a qualitative comparison of the localization results of our method with RD [7] on various anomaly scenarios in multi-class anomaly detection. Cross-class interference does not mean that the model loses the ability to localize anomalies, but rather that it lacks sensitivity to anomalous regions. Thus, we can observe that the baseline model RD is able to coarsely localize anomalies, but not as accurately as our approach. Furthermore, such imprecision directly leads to less discriminative image-level anomaly scores, while it is only slightly reflected in the pixel-level anomaly scores. More visualization results are in the supplement.

| Capacity | w/o | 25 | 50 | 75 |
|---|---|---|---|---|
| image-level | 97.7 | 98.2 | 98.3 | 98.0 |
| pixel-level | 97.1 | 97.5 | 97.5 | 97.4 |

Table 6: Ablation study on the number of CRAM centers.

4.4 Ablation Study

We conduct detailed ablation experiments to evaluate each component of our approach. As shown in Tab. 5, we analyze the effects of the structural distillation objective functions and the CRAM normality representation. First, we decompose the feature reconstruction into channel-wise and spatial-wise learning to show the gains of spatial feature distillation. Then, intra-&inter-affinity distillation, which helps the teacher-student network capture pairwise feature similarities, brings 3.0% and 0.7% improvement in image-level and pixel-level anomaly recognition performance. In addition, our normality learning module, CRAM, enables the student network to learn compact normal representations, resulting in gains of 2.4% and 0.9% compared to the baseline. Overall, our SNL improves the baseline by 3.9%/1.5% in anomaly detection/localization performance.

For CRAM, we assess the effect of the number of normality centers on the results in Tab. 6. We observe a significant improvement with CRAM, but a larger codebook capacity does not yield further improvement. Ablation studies regarding model capacities, number of network blocks, etc. can be found in the supplement.
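The gains quoted in this ablation can be traced back to specific rows of Table 5. The check below recomputes them from the reported (image-level, pixel-level) pairs. The row labels are an assumption: they are inferred from the number of disabled components in each row and from the text above, so they should be read as a plausible interpretation rather than a statement of the exact configurations.

```python
# (image-level, pixel-level) AUROC from Table 5; row labels are assumed.
ablation = {
    "cd": (94.4, 96.0),
    "cd+sd": (94.7, 96.4),
    "cd+sd+intra": (97.4, 97.0),
    "cd+sd+intra+inter": (97.7, 97.1),
    "cd+sd+CRAM": (97.1, 97.3),
    "full": (98.3, 97.5),
}

def delta(row, reference):
    """Per-metric gain of `row` over `reference`, rounded to one decimal."""
    return tuple(round(a - b, 1)
                 for a, b in zip(ablation[row], ablation[reference]))

print("affinity distillation:", delta("cd+sd+intra+inter", "cd+sd"))
print("CRAM:", delta("cd+sd+CRAM", "cd+sd"))
print("full model:", delta("full", "cd"))
```

Under this reading, the affinity and CRAM gains are measured against the row with channel and spatial distillation, while the overall 3.9%/1.5% gain is measured against the first row.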

5 Conclusion

In this study, we identify the cross-class interference problem in teacher-student networks for multi-class anomaly detection, which severely reduces the ability to discriminate anomalies. To overcome this obstacle, we propose structural teacher-student normality learning, which consists of structural distillation and the central residual aggregation module (CRAM). Specifically, structural distillation introduces feature affinity reconstruction to capture pairwise feature similarities between teacher and student features. In addition, CRAM assists student networks in learning discriminative normality representations. Extensive experiments show that our method substantially improves baseline results and outperforms state-of-the-art methods.

References

  • Acsintoae et al. [2022] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20143–20153, 2022.
  • Arandjelovic et al. [2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
  • Bergmann et al. [2018] Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, and Carsten Steger. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv preprint arXiv:1807.02011, 2018.
  • Bergmann et al. [2019] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019.
  • Bergmann et al. [2020] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4183–4192, 2020.
  • Defard et al. [2021] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
  • Deng and Li [2022] Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9737–9746, 2022.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Gong et al. [2019] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Jezek et al. [2021] Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions. In 2021 13th International congress on ultra modern telecommunications and control systems and workshops (ICUMT), pages 66–71. IEEE, 2021.
  • Lee et al. [2022] Sungwook Lee, Seunghyun Lee, and Byung Cheol Song. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access, 10:78446–78454, 2022.
  • Li et al. [2021] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9664–9674, 2021.
  • Liu et al. [2018] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.
  • Liu et al. [2019] Yifan Liu, Ke Chen, Kris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019.
  • Lv et al. [2021] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15425–15434, 2021.
  • Park et al. [2020] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14372–14381, 2020.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Ruff et al. [2018] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International conference on machine learning, pages 4393–4402. PMLR, 2018.
  • Salehi et al. [2021] Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R Rabiee. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14902–14912, 2021.
  • Schlegl et al. [2019] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis, 54:30–44, 2019.
  • Tien et al. [2023] Tran Dinh Tien, Anh Tuan Nguyen, Nguyen Hoang Tran, Ta Duc Huy, Soan Duong, Chanh D Tr Nguyen, and Steven QH Truong. Revisiting reverse distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24511–24520, 2023.
  • Wang et al. [2021] Guodong Wang, Shumin Han, Errui Ding, and Di Huang. Student-teacher feature pyramid matching for anomaly detection. arXiv preprint arXiv:2103.04257, 2021.
  • Wu et al. [2021] Aming Wu, Suqi Zhao, Cheng Deng, and Wei Liu. Generalized and discriminative few-shot object detection via svd-dictionary enhancement. Advances in Neural Information Processing Systems, 34:6353–6364, 2021.
  • Yang et al. [2020] Jie Yang, Yong Shi, and Zhiquan Qi. Dfr: Deep feature reconstruction for unsupervised anomaly segmentation. arXiv preprint arXiv:2012.07122, 2020.
  • Yang et al. [2022] Zhiwei Yang, Peng Wu, Jing Liu, and Xiaotao Liu. Dynamic local aggregation network with adaptive clusterer for anomaly detection. In European Conference on Computer Vision, pages 404–421. Springer, 2022.
  • Yi and Yoon [2020] Jihun Yi and Sungroh Yoon. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian conference on computer vision, 2020.
  • You et al. [2022] Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection. Advances in Neural Information Processing Systems, 35:4571–4584, 2022.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zavrtanik et al. [2021] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2021.
  • Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhao [2023] Ying Zhao. Omnial: A unified cnn framework for unsupervised anomaly localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3924–3933, 2023.
  • Zimmerer et al. [2022] David Zimmerer, Peter M Full, Fabian Isensee, Paul Jäger, Tim Adler, Jens Petersen, Gregor Köhler, Tobias Ross, Annika Reinke, Antanas Kascenas, et al. Mood 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Transactions on Medical Imaging, 41(10):2728–2738, 2022.
  • Zou et al. [2022] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.