Structural Teacher-Student Normality Learning for Multi-Class Anomaly Detection and Localization

Hanqiu Deng and Xingyu Li
{hanqiu1, xingyu}@ualberta.ca
University of Alberta

Abstract

Visual anomaly detection is a challenging open-set task aimed at identifying unknown anomalous patterns while modeling normal data. The knowledge distillation paradigm has shown remarkable performance in one-class anomaly detection by leveraging teacher-student network feature comparisons. However, extending this paradigm to multi-class anomaly detection introduces novel scalability challenges. In this study, we address the significant performance degradation observed in previous teacher-student models when applied to multi-class anomaly detection, which we identify as resulting from cross-class interference. To tackle this issue, we introduce a novel approach known as Structural Teacher-Student Normality Learning (SNL): (1) We propose spatial-channel distillation and intra-&inter-affinity distillation techniques to measure structural distance between the teacher and student networks. (2) We introduce a central residual aggregation module (CRAM) to encapsulate the normal representation space of the student network. We evaluate our proposed approach on two anomaly detection datasets, MVTecAD and VisA. Our method surpasses the state-of-the-art distillation-based algorithms by a significant margin of 3.9% and 1.5% on MVTecAD and 1.2% and 2.5% on VisA in the multi-class anomaly detection and localization tasks, respectively. Furthermore, our algorithm outperforms the current state-of-the-art unified models on both MVTecAD and VisA.

Figure 1: We visualize the performance degradation of one-class teacher-student networks, RD [7] (left) and FD [23] (right), in the multi-class anomaly detection task on MVTecAD. Our structural normality learning (SNL) strategy on the teacher-student model shows significant improvement of multi-class anomaly detection and localization on both methods. Besides, SNL can also boost the performance on one-class cases.

1 Introduction

Visual anomaly detection is a pivotal open-set task in computer vision that aims to identify unknown anomalous patterns within normal data. It is relevant to many real-world applications, including industrial defect detection [4, 34, 11], video surveillance [14, 1], and medical imaging diagnosis [21, 33]. Traditional anomaly detection approaches train a separate model for each category. Each model is trained on normal samples from its own category and can only detect anomalies within that category. While one-class anomaly detection models have shown promise in these contexts [20, 23, 7, 12], they require a separate model for each class, a paradigm that becomes increasingly inefficient as the number of categories grows. Recent work has highlighted multi-class anomaly detection as a pressing challenge that demands greater scalability and adaptability from anomaly detection models [28, 32]. In response, we propose a scalable solution for multi-class anomaly detection and localization, in which one model identifies anomalies across multiple classes.

Feature reconstruction is one of the most influential paradigms in anomaly detection, valued for its robustness and effectiveness. Teacher-student networks are a natural approach to feature reconstruction, in which the student network predicts the outputs of the teacher network [10]. In particular, multi-scale distillation achieves superior anomaly detection performance by accumulating feature differences between teacher and student under multiple receptive fields [5, 20, 23]. Recently, reverse distillation was proposed to address the over-generalization problem of the forward distillation paradigm, and it achieves SOTA performance in one-class anomaly detection scenarios [7]. However, we observe substantial performance degradation for both forward distillation [20, 23] and reverse distillation [7] in multi-class anomaly detection, as shown in Fig. 1. We therefore propose the cross-class interference hypothesis, illustrated in Fig. 2(a): generalization of the anomaly detection model across different categories makes the model somewhat tolerant towards anomalies.

To empirically assess the impact of cross-class interference on anomaly detection, we conduct two straightforward experiments. In the first experiment, we use the Mixup [31] technique to superimpose two images belonging to different classes, creating a mixture that should be considered an anomalous image. We then conduct a statistical analysis of the image-level anomaly scores. As illustrated in Fig. 2(b), both forward distillation (FD) [23] and reverse distillation (RD) [7] fail to distinguish between mixture and normal images when trained on a multi-class dataset. In the second experiment, we employ a CutPaste-like [13] anomaly synthesis on images originating from two distinct classes. An effective anomaly detection model should be able to distinguish the synthesized irregularity [13]. However, as shown in Fig. 2(c), when trained under the multi-class setting, both FD and RD models are unable to identify the anomalous region within the synthesized image. These experiments demonstrate the detrimental influence of cross-class interference on the performance of teacher-student networks in multi-class anomaly detection and localization.
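The two probes can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the exact protocol used in the experiments: the blending weight `lam`, the patch position and size, and the function names `mixup` and `cut_paste` are assumptions, and random arrays stand in for images from two different classes.

```python
import numpy as np

def mixup(img_a, img_b, lam=0.5):
    """Superimpose two images; lam is an assumed blending weight."""
    return lam * img_a + (1.0 - lam) * img_b

def cut_paste(target, source, top, left, size):
    """Paste a square patch of `source` into `target`.

    Returns the synthesized image and a boolean mask of the pasted region.
    """
    out = target.copy()
    out[top:top + size, left:left + size] = source[top:top + size, left:left + size]
    mask = np.zeros(target.shape[:2], dtype=bool)
    mask[top:top + size, left:left + size] = True
    return out, mask

# Random stand-ins for two images of different classes (H, W, 3).
rng = np.random.default_rng(0)
img_a = rng.random((64, 64, 3))
img_b = rng.random((64, 64, 3))

mixed = mixup(img_a, img_b)                                   # first probe
pasted, mask = cut_paste(img_a, img_b, top=16, left=16, size=24)  # second probe
```

A detector that is sensitive to cross-class content would be expected to score `mixed` above normal images and to highlight the region marked by `mask`.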

Evidently, cross-class interference in multi-class anomaly detection arises from shortcomings in previous teacher-student reconstruction networks, a concern that is less prominent in one-class anomaly detection. On the one hand, previous methods primarily train the student network to learn local features from the teacher network without fostering correlations between these features. The absence of such correlations hinders student networks from effectively discerning structural feature differences between the subject and potential anomalies within a sample. Therefore, we propose structural distillation, enabling student networks to discern and capture pairwise feature disparities from teacher networks. Specifically, our structural distillation consists of spatial-channel distillation and intra-&inter-affinity distillation, which represent separate and pairwise feature distances, respectively. On the other hand, the lack of normality constraints leads to weak compactness of multi-class normal representations within teacher-student networks. To tackle this issue, we propose the Central Residual Aggregation Module (CRAM), which is plugged into the student network. CRAM facilitates the learning of compact normality features by aggregating residual projections of student features relative to multiple normality centers. Notably, our multi-class anomaly detection model demonstrates excellent discriminative ability in the experiments presented in Fig. 2. Overall, we propose Structural Teacher-Student Normality Learning (SNL) to address the problem of cross-class interference that hampers the effectiveness of knowledge distillation in multi-class anomaly detection. Our approach generalizes to previous teacher-student networks and improves their performance by a large margin in multi-class anomaly detection and localization. Furthermore, our approach surpasses SOTA on the MVTecAD and VisA datasets.
Our main contributions are summarized as follows:

• We conduct an in-depth analysis to identify the presence of cross-class interference, which leads to the degradation observed in teacher-student networks when applied to multi-class anomaly detection and localization.

• To tackle this issue, we propose a structural teacher-student network that learns separate and pairwise feature similarities through spatial-channel and intra-&inter-affinity distillation.

• We propose CRAM, which is integrated into the student network to learn a compact normality representation, thereby enhancing the model’s sensitivity to cross-class anomalies.

• Extensive experiments on the MVTecAD and VisA datasets show that our approach substantially improves over the baselines and also outperforms the state-of-the-art unified models.

2 Related Work

Distillation-based Anomaly Detection:

Reconstruction is the typical paradigm for anomaly detection, e.g., pixel-level structural reconstruction for industrial defect detection [3]. Feature-level reconstruction exhibits impressive performance due to the powerful representation capabilities of pre-trained models [25]. Teacher-student networks, which use a student network to reconstruct the features of a teacher network, are a natural reconstruction paradigm and have been widely studied for anomaly detection. Uninformed student is the first teacher-student network based anomaly detection method [5]. It trains a student network on normal samples to distill from a discriminative teacher network and then detects anomalies by teacher-student differences. Multi-scale knowledge distillation [20, 23] trains a student network to reconstruct the multi-scale features of the teacher network, which is derived from a network pre-trained on ImageNet [8] with a rich semantic space. In particular, [20] utilizes the disparate gradients generated by the model on novel features to detect anomalies, and [23] utilizes pyramid reconstruction errors to detect anomalies. In this study, we refer to these classical teacher-student networks [20, 23] as forward distillation. Previous studies have found that forward distillation suffers from anomalous leakage to student networks, whereby more powerful student networks overgeneralize to anomalous representations and thus suffer performance degradation [20, 23]. To address this issue, reverse distillation has been proposed to reconstruct shallow multi-scale features progressively from deep features using the student network [7], which brings teacher-student networks to the state of the art in anomaly detection and localization. Previous approaches have achieved impressive performance on one-class anomaly detection; however, their performance degrades on multi-class anomaly detection.
In this study, we aim to achieve high performance in multi-class anomaly detection and localization using teacher-student networks.

Normality Learning:

DeepSVDD [19] is a one-class normality learning algorithm that detects outliers by training a compact support space for a normality center. Subsequently, the sparse memory mechanism [9] and the compact memory module [17] were proposed for learning normality reconstruction. To achieve few-shot adaptation, dynamic normality learning was proposed to project normal prototypes onto a given feature space [16]. Recently, CFA [12] proposed coupled-hypersphere-based feature adaptation to learn normal centers for one-class anomaly detection. In this study, we aim to present a normality learning module that is adaptable to multiple classes and sensitive to anomalies.

Multi-class Anomaly Detection:

UniAD [28] first formulates the task of multi-class visual anomaly detection and proposes a transformer-based feature reconstruction model. In addition, UniAD proposes a layer-wise query in the transformer to learn the complex normal distribution of multiple categories. OmniAL [32] proposes a panel-guided approach to synthesize anomalies and trains reconstruction and discriminative networks on the synthesized anomaly samples to localize the anomalies. Although the synthetic anomaly approaches [30, 32] provide excellent anomaly localization precision on specific datasets, they require a priori knowledge of the anomalies in the dataset. In contrast, we treat anomalies as unknown so that the model can be sensitive to all kinds of anomalies.

3 Methodology

Problem Definition:

For multi-class anomaly detection, we follow a unified setting where the images are from different classes and the category information is inaccessible [28]. Let $L_{train}=\{I_{normal}^{1},...,I_{normal}^{n}\}$ denote the set of $n$ anomaly-free training samples from $C$ potential categories. The inference set is defined as $L_{test}=\{I_{unknown}^{1},...,I_{unknown}^{m}\}$, which includes $m$ query images from the same $C$ classes. Notably, the training set $L_{train}$ only includes normal samples, whereas the test set $L_{test}$ includes both normal and unknown anomalous samples. We aim to obtain a model that can detect anomalous images and localize the anomalous regions in multiple categories.

Preliminaries:

Lately, the teacher-student networks have made significant strides in advancing anomaly detection [20, 23, 7]. In this paradigm, we begin with a pre-trained teacher network capable of extracting rich and discriminative features from images. We train a student network on normal samples to learn and reconstruct these features from the teacher network. This process is commonly referred to as knowledge distillation [10]. Subsequently, we use the feature reconstruction errors on query samples to detect anomalies. However, when applied to multi-class anomaly detection, the teacher-student network suffers from degradation, resulting in weak performance. As highlighted earlier, our observations indicate that this performance degradation is attributed to cross-class interference, a phenomenon that affects the model’s ability to differentiate anomalies in diverse classes. To overcome this issue, we introduce structural teacher-student normality learning as a novel framework for multi-class anomaly detection and localization. In this section, we present the proposed methodology as follows: (1) structural knowledge distillation, (2) central residual aggregation module for normality learning, and (3) scoring for anomaly detection and localization. These elements collectively form the foundation of our approach, which aims to address the challenge of multi-class anomaly detection by mitigating the impact of cross-class interference.

3.1 Structural Distillation for Anomaly Detection

The teacher-student network consists of a frozen pre-trained teacher model and a trainable student model. In particular, we follow previous work in using the same network architecture and distilling hierarchical knowledge for the teacher-student network [23, 7]. Formally, let $F^{k}_{t}\in\mathbb{R}^{D^{k}\times H^{k}\times W^{k}}$ and $F^{k}_{s}\in\mathbb{R}^{D^{k}\times H^{k}\times W^{k}}$ denote the feature tensors of the $k$-th block of the teacher and student models, respectively. For notational consistency, we use $F_{i}^{k}(:,h,w)\in\mathbb{R}^{D^{k}\times 1}$ to denote the 1-D channel-wise feature at location $(h,w)$ of the feature tensor, and $F_{i}^{k}(d,:,:)\in\mathbb{R}^{H^{k}\times W^{k}}$ to denote the 2-D spatial feature map of channel $d$, where $i\in\{t,s\}$.

During training, the tensor $F^{k}_{t}$ extracted from $I_{normal}$ is treated as the learning target. We then optimize the student network to produce a reconstructed feature tensor $F^{k}_{s}$ that is close to the target tensor $F^{k}_{t}$. Following previous works [20, 23, 10], we compute the feature distances along the channel axis for the $k$-th teacher-student blocks:

$$M^{k}(h,w)=1-\frac{(F^{k}_{t}(:,h,w))^{T}\cdot F^{k}_{s}(:,h,w)}{\|F^{k}_{t}(:,h,w)\|_{2}\cdot\|F^{k}_{s}(:,h,w)\|_{2}}, \tag{1}$$

where $\|\cdot\|_{2}$ is the $L_{2}$ norm. By calculating the cosine distance along the channel axis in (1), we obtain a 2-D distance map $M^{k}\in\mathbb{R}^{H^{k}\times W^{k}}$. Considering hierarchical knowledge distillation, the channel-wise distillation loss is defined as the aggregation of the multi-scale channel-wise distance maps:

$$\mathcal{L}_{cd}=\sum_{k=1}^{K}\Big[\frac{1}{H^{k}W^{k}}\sum_{h=1}^{H^{k}}\sum_{w=1}^{W^{k}}M^{k}(h,w)\Big], \tag{2}$$

where $K$ denotes the number of blocks in both teacher and student networks. Note, previous knowledge distillation methods for anomaly detection are typically performed using the channel-wise distance in (2) [20, 23, 10].
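As a minimal NumPy sketch of Eqs. (1) and (2), assuming each block's features are stored as an array of shape $(D^{k}, H^{k}, W^{k})$: the function names and the small stabilizer `eps` are illustrative choices, not part of the original formulation.

```python
import numpy as np

def cosine_distance_map(f_t, f_s, eps=1e-8):
    """Eq. (1): per-location cosine distance along the channel axis.

    f_t, f_s: arrays of shape (D, H, W). Returns a map of shape (H, W).
    eps is an assumed stabilizer that avoids division by zero.
    """
    dot = (f_t * f_s).sum(axis=0)
    norms = np.linalg.norm(f_t, axis=0) * np.linalg.norm(f_s, axis=0)
    return 1.0 - dot / (norms + eps)

def channel_distillation_loss(teacher_feats, student_feats):
    """Eq. (2): sum over blocks of the spatially averaged distance map."""
    return sum(
        float(cosine_distance_map(f_t, f_s).mean())
        for f_t, f_s in zip(teacher_feats, student_feats)
    )

# Two blocks with different resolutions, as in multi-scale distillation.
rng = np.random.default_rng(0)
teacher = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
loss_same = channel_distillation_loss(teacher, teacher)
loss_flipped = channel_distillation_loss(teacher, [-f for f in teacher])
```

When the student reproduces the teacher exactly the loss is close to zero, and when every feature vector points in the opposite direction each block contributes a distance close to 2.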

Apart from encouraging the channel-wise feature consistency, we consider adding spatial feature matching for activation map alignment. Spatial feature distillation refers to having the student network learn the features of the teacher network along a feature map for each dimension. We use KL divergence for spatial-wise distillation:

$$\mathcal{L}_{sd}=\sum_{k=1}^{K}\sum_{d=1}^{D^{k}}\Phi(F_{t}^{k}(d,:,:))\cdot\log\frac{\Phi(F_{t}^{k}(d,:,:))}{\Phi(F_{s}^{k}(d,:,:))}, \tag{3}$$

where $\Phi(\cdot)$ denotes the probability value:

$$\Phi(F^{k}(d,h,w))=\frac{\exp(F^{k}(d,h,w))}{\sum_{h'=1}^{H^{k}}\sum_{w'=1}^{W^{k}}\exp(F^{k}(d,h',w'))}. \tag{4}$$

Unlike the channel-wise loss in (2) focusing on local consistency, minimizing the KL divergence of the feature maps between teacher-student models in (3) encourages the global alignment of spatial activations. By spatial-channel distillation via jointly optimizing $\mathcal{L}_{sd}$ and $\mathcal{L}_{cd}$ , we allow the teacher-student network to maintain better local-global continuous consistency.
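The spatial-wise term in Eqs. (3) and (4) can be sketched as follows. This is a hedged NumPy illustration: the function names, the max-subtraction for numerical stability, and the `eps` inside the logarithm are assumptions added for the example.

```python
import numpy as np

def spatial_softmax(f):
    """Eq. (4): softmax over all H*W locations, separately per channel.

    f: array of shape (D, H, W). Returns an array of shape (D, H*W).
    """
    flat = f.reshape(f.shape[0], -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(flat)
    return e / e.sum(axis=1, keepdims=True)

def spatial_distillation_loss(teacher_feats, student_feats, eps=1e-12):
    """Eq. (3): KL(teacher || student) summed over channels and blocks."""
    loss = 0.0
    for f_t, f_s in zip(teacher_feats, student_feats):
        p_t = spatial_softmax(f_t)
        p_s = spatial_softmax(f_s)
        loss += float((p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum())
    return loss

rng = np.random.default_rng(0)
f_t = rng.standard_normal((8, 16, 16))
f_s = rng.standard_normal((8, 16, 16))
kl_same = spatial_distillation_loss([f_t], [f_t])
kl_diff = spatial_distillation_loss([f_t], [f_s])
```

Because each channel's activation map is normalized into a distribution over locations, the loss compares where activations concentrate rather than their absolute magnitude.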

It should be noted that in multi-class anomaly detection, the normal distribution across multiple categories becomes significantly more complex than that in one-class scenarios. Due to cross-class interference, the student model may have more freedom and stronger generalization capabilities in reconstructing abnormal features. While the introduced spatial-wise distillation can somewhat limit the student model’s ability to reconstruct globally abnormal features, it does not impose strong constraints on the reconstruction of locally abnormal features, resulting in the failures in Fig. 2. To address this issue, we propose structural information distillation. In human vision theory, image structural information describes the inter-dependency between pixels, and these dependencies typically carry crucial information related to objects and semantic understanding [15]. By encouraging alignment of the feature tensors’ structural information in knowledge distillation, the student model’s ability to reconstruct features violating local normality can be significantly reduced. To this end, let $\mathcal{R}$ denote the reshape function, where $\mathcal{R}(F_{i}^{k})\in\mathbb{R}^{D^{k}\times H^{k}W^{k}}$ for $i\in\{s,t\}$. We use the affinity matrix to represent the structural relation of the feature map: $\mathcal{A}^{k}_{s}=\mathcal{R}(F_{s}^{k})^{T}\times\mathcal{R}(F_{s}^{k})$ and $\mathcal{A}^{k}_{t}=\mathcal{R}(F_{t}^{k})^{T}\times\mathcal{R}(F_{t}^{k})$, where L2 normalization is applied to scale $F_{s}^{k}$ and $F_{t}^{k}$. Then, the intra-affinity distillation objective function for the teacher-student network is:

$$\mathcal{L}_{intra}=\sum_{k=1}^{K}\sum_{i=1}^{H^{k}W^{k}}\sum_{j=1}^{H^{k}W^{k}}\|\mathcal{A}^{k}_{s}(i,j)-\mathcal{A}^{k}_{t}(i,j)\|_{2}. \tag{5}$$

In addition, we impose an external affinity distillation to keep the pairwise similarity consistent across samples. In particular, training samples from different categories within a batch help the student network learn from the teacher the discrepancies between class representations. Our cross-affinity matrices are defined as $\mathcal{\tilde{A}}^{k}_{s}=\mathcal{R}(F_{s}^{k})^{T}\times\mathcal{R}(\tilde{F}_{s}^{k})$ and $\mathcal{\tilde{A}}^{k}_{t}=\mathcal{R}(F_{t}^{k})^{T}\times\mathcal{R}(\tilde{F}_{t}^{k})$, where $\tilde{F}_{s}^{k}$ and $\tilde{F}_{t}^{k}$ denote the features of another sample in the batch. Then the inter-affinity distillation loss is computed as:

$$\mathcal{L}_{inter}=\sum_{k=1}^{K}\sum_{i=1}^{H^{k}W^{k}}\sum_{j=1}^{H^{k}W^{k}}\|\mathcal{\tilde{A}}^{k}_{s}(i,j)-\mathcal{\tilde{A}}^{k}_{t}(i,j)\|_{2}. \tag{6}$$

Through this, the student network not only distills the pairwise similarity within the sample from the teacher network, but also learns categorical discrepancies. In summary, the structural distillation objective function for the teacher-student network is:

$$\mathcal{L}=\mathcal{L}_{cd}+\lambda_{1}\mathcal{L}_{sd}+\lambda_{2}\mathcal{L}_{intra}+\lambda_{3}\mathcal{L}_{inter}, \tag{7}$$

where the $\lambda_{1}$ , $\lambda_{2}$ , and $\lambda_{3}$ are the hyper-parameters for the optimization of the teacher-student network.
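The affinity terms in Eqs. (5) and (6) and the combined objective in Eq. (7) can be illustrated for a single block as below. This is a sketch under stated assumptions: L2 normalization is taken per location along the channel axis, the entry-wise norm of a scalar difference is implemented as an absolute value, and all function names are hypothetical.

```python
import numpy as np

def unit_columns(f, eps=1e-8):
    """Reshape (D, H, W) -> (D, H*W) and L2-normalize each location's vector.

    Normalizing along the channel axis is an assumption of this sketch.
    """
    x = f.reshape(f.shape[0], -1)
    return x / (np.linalg.norm(x, axis=0, keepdims=True) + eps)

def affinity(f_a, f_b):
    """R(f_a)^T R(f_b): pairwise similarities, shape (H*W, H*W)."""
    return unit_columns(f_a).T @ unit_columns(f_b)

def intra_affinity_loss(f_t, f_s):
    """Eq. (5) for one block: within-sample affinity mismatch."""
    return float(np.abs(affinity(f_s, f_s) - affinity(f_t, f_t)).sum())

def inter_affinity_loss(f_t, f_s, f_t_other, f_s_other):
    """Eq. (6) for one block: affinity mismatch against another sample."""
    return float(np.abs(affinity(f_s, f_s_other) - affinity(f_t, f_t_other)).sum())

def total_loss(l_cd, l_sd, l_intra, l_inter, lambdas=(1.0, 1.0, 1.0)):
    """Eq. (7): weighted sum of the four distillation terms."""
    return l_cd + lambdas[0] * l_sd + lambdas[1] * l_intra + lambdas[2] * l_inter

rng = np.random.default_rng(0)
f_t, f_t_other = rng.standard_normal((2, 8, 4, 4))   # two samples in a batch
f_s = f_t + 0.1 * rng.standard_normal((8, 4, 4))      # imperfect student
f_s_other = f_t_other + 0.1 * rng.standard_normal((8, 4, 4))

l_intra = intra_affinity_loss(f_t, f_s)
l_inter = inter_affinity_loss(f_t, f_s, f_t_other, f_s_other)
```

Both terms vanish when the student reproduces the teacher's features, and they grow as the pairwise similarity structure diverges even if individual feature vectors remain close.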

3.2 Central Residual Aggregation Module

We introduce a Central Residual Aggregation Module (CRAM) to enable the student network to learn the normality pattern, as shown in Fig. 4. Our CRAM is derived from codebook learning [2, 26, 24] and includes learnable assignment parameters $\alpha$ and clustering centers $\mathcal{C}\in\mathbb{R}^{D^{k}\times N}$, where $N$ is the number of clusters. In addition, our CRAM changes the feature aggregation strategy for local feature reorganization. We assume that the normal centers represent finite normal patterns and that unknown abnormal features are far from the centers. The student model can learn compact normal representations through CRAM and thus produce more significant error responses to abnormal regions with respect to the teacher. Inside a CRAM student block, we denote by $f_{s}^{k}\in\mathbb{R}^{D^{k}\times H^{k}\times W^{k}}$ the features output from the CNN block. Since the features come from multiple categorical distributions, we first align the unified centers to the current feature vector $f_{s}^{k}(:,h,w)$ through the residual calculation $r^{k}(:,h,w)=f_{s}^{k}(:,h,w)-\mathcal{C}$, where each column of $\mathcal{C}$ is subtracted from the feature vector, so that $r^{k}(:,h,w)\in\mathbb{R}^{D^{k}\times N}$ and $r_{n}^{k}(:,h,w)$ denotes its $n$-th column. Then, the soft-assignment formula for aggregating the residual centers is:

$$a_{n}=\frac{\exp(-\alpha\|r_{n}^{k}(:,h,w)\|^{2})}{\sum_{n'=1}^{N}\exp(-\alpha\|r_{n'}^{k}(:,h,w)\|^{2})}, \tag{8}$$

where we compute the SoftMax distribution for the distance of features from the centers. We obtain the compact representation of the student network by aggregation:

$$F_{s}^{k}(:,h,w)=\sum_{n=1}^{N}a_{n}r_{n}^{k}(:,h,w). \tag{9}$$

During the training phase, the learnable centers construct a normal feature space. When anomalies are encountered, the residuals between the features and centers appear different. Thus feature discrepancy in the student-teacher network becomes more significant with CRAM normality learning.
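The forward computation of Eqs. (8) and (9) can be sketched in NumPy as follows. In the method, $\alpha$ and the centers $\mathcal{C}$ are learned; in this illustration they are fixed arrays, and the function name `cram_aggregate` and the max-subtraction for stability are assumptions.

```python
import numpy as np

def cram_aggregate(f, centers, alpha=1.0):
    """Soft-assign each location to the centers and aggregate residuals.

    f:       block features, shape (D, H, W)
    centers: cluster centers C, shape (D, N) (learnable in the method,
             fixed here for illustration)
    Returns the aggregated features (D, H, W) and assignments (N, H, W).
    """
    d, h, w = f.shape
    x = f.reshape(d, -1)                              # (D, HW)
    r = x[:, None, :] - centers[:, :, None]           # residuals, (D, N, HW)
    logits = -alpha * (r ** 2).sum(axis=0)            # (N, HW)
    logits = logits - logits.max(axis=0, keepdims=True)
    a = np.exp(logits)
    a = a / a.sum(axis=0, keepdims=True)              # Eq. (8)
    out = (a[None, :, :] * r).sum(axis=1)             # Eq. (9), (D, HW)
    return out.reshape(d, h, w), a.reshape(-1, h, w)

rng = np.random.default_rng(0)
f = rng.standard_normal((6, 3, 4))
centers = rng.standard_normal((6, 5))
out, a = cram_aggregate(f, centers, alpha=0.5)
```

Since the assignments sum to one at every location, Eq. (9) is equivalent to subtracting a soft-selected center from each feature vector, so features near a center produce small outputs while features far from every center do not.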

3.3 Scoring for Anomaly Detection & Localization

In the inference phase, we use the reconstruction error and the affinity error between the student and teacher networks as the measurement. The intuition is that, because we minimize the structural distillation loss, the outputs of the teacher and student networks are quite similar for anomaly-free samples. When confronted with unknown features from abnormal samples, our model produces relatively larger errors. First, similar to previous work, we calculate the cosine similarity of the features at each location [20, 23, 7]. By Eq. (1), we obtain the distance map $M^{k}$ from the $k$-th block of the model. Then, the anomaly map from the separate feature distance, $\mathcal{M}_{fea}\in\mathbb{R}^{H\times W}$, is computed as:

$$\mathcal{M}_{fea}=\sum_{k=1}^{K}\mathbb{U}^{k}(M^{k}), \tag{10}$$

where $\mathbb{U}^{k}(\cdot)$ indicates the upsampling operation that resizes the anomaly map of the $k$-th block to the size of the input image $(H,W)$. In addition, to tackle cross-class interference, we utilize the intra-affinity error between the teacher and student features to measure anomaly scores:

$$\mathcal{E}^{k}=\|\mathcal{A}^{k}_{s}-\mathcal{A}^{k}_{t}\|_{2}, \tag{11}$$

where $\mathcal{E}^{k}\in\mathbb{R}^{H^{k}W^{k}\times H^{k}W^{k}}$ is computed element-wise. Then, the pairwise similarity difference map is computed as:

$$\mathcal{M}_{aff}=\mathcal{\tilde{R}}\Big(\sum_{k=1}^{K}\mathbb{U}^{k}\Big(\frac{1}{H^{k}W^{k}}\sum_{i=1}^{H^{k}W^{k}}\mathcal{E}^{k}(:,i)\Big)\Big), \tag{12}$$

where $\mathcal{\tilde{R}}(\cdot)\in\mathbb{R}^{H\times W}$ denotes the reshape operation. Then, the overall anomaly map is calculated as:

$$S_{AL}=\mathcal{M}_{fea}+\mathcal{M}_{aff}, \tag{13}$$

where $S_{AL}$ is the pixel-level anomaly map used to evaluate anomaly localization. Additionally, we use the most responsive anomaly score for anomaly detection. Thus the image-level anomaly score is calculated as:

$$S_{AD}=\max(S_{AL}), \tag{14}$$

where $max(\cdot)$ denotes calculating the maximum value.
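The scoring pipeline of Eqs. (10) to (14) can be sketched end to end as below. Several details are assumptions made for this illustration: nearest-neighbour upsampling stands in for the unspecified upsampling operator, each block's averaged affinity error is reshaped to $H^{k}\times W^{k}$ before upsampling, and the function names are hypothetical.

```python
import numpy as np

def unit_columns(f, eps=1e-8):
    """Reshape (D, H, W) -> (D, H*W) and L2-normalize every column."""
    x = f.reshape(f.shape[0], -1)
    return x / (np.linalg.norm(x, axis=0, keepdims=True) + eps)

def upsample_nearest(m, out_hw):
    """Nearest-neighbour upsampling (an assumption of this sketch).

    out_hw must be an integer multiple of m.shape in both dimensions.
    """
    ry, rx = out_hw[0] // m.shape[0], out_hw[1] // m.shape[1]
    return np.repeat(np.repeat(m, ry, axis=0), rx, axis=1)

def score(teacher_feats, student_feats, out_hw):
    """Return the pixel-level map S_AL and the image-level score S_AD."""
    m_fea = np.zeros(out_hw)
    m_aff = np.zeros(out_hw)
    for f_t, f_s in zip(teacher_feats, student_feats):
        _, h, w = f_t.shape
        t, s = unit_columns(f_t), unit_columns(f_s)
        dist = 1.0 - (t * s).sum(axis=0)                       # Eq. (1)
        m_fea += upsample_nearest(dist.reshape(h, w), out_hw)  # Eq. (10)
        err = np.abs(s.T @ s - t.T @ t)                        # Eq. (11)
        row_mean = err.mean(axis=1)                            # average over i
        m_aff += upsample_nearest(row_mean.reshape(h, w), out_hw)  # Eq. (12)
    s_al = m_fea + m_aff                                       # Eq. (13)
    return s_al, float(s_al.max())                             # Eq. (14)

# Toy check: the student matches the teacher except at location (1, 2).
rng = np.random.default_rng(0)
f_t = rng.standard_normal((8, 4, 4))
f_s = f_t.copy()
f_s[:, 1, 2] = -f_t[:, 1, 2]
s_al, s_ad = score([f_t], [f_s], out_hw=(16, 16))
peak = np.unravel_index(s_al.argmax(), s_al.shape)
_, s_ad_clean = score([f_t], [f_t], out_hw=(16, 16))
```

In this toy case the peak of `s_al` falls inside the upsampled cell of the corrupted location, and the image-level score is near zero when the student matches the teacher everywhere.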

4 Experiment

MVTecAD	Bottle	Cable	Capsule	Carpet	Grid	Hazelnut	Leather	Metal nut	Pill	Screw	Tile	Toothbrush	Transistor	Wood	Zipper	Mean
UniAD [28]	99.7/100	95.2/97.6	86.9/85.3	99.8/99.9	98.2/98.5	99.8/99.9	100/100	99.2/99.0	93.7/88.3	87.5/91.9	99.3/99.0	94.2/95.0	99.8/100	98.6/97.9	95.8/96.7	96.5/96.6
OmniAL [32]	100/99.4	98.2/97.6	95.2/92.4	98.7/99.6	99.9/100	95.6/98.0	99.0/97.6	99.2/99.9	97.2/97.7	88.0/81.0	99.6/100	100/100	93.8/93.8	93.2/98.7	100/100	97.2/97.0
FD [23]	76.8/100	96.5/95.1	78.9/75.8	96.5/99.4	98.8/99.1	99.3/100	96.2/97.3	97.9/99.4	91.3/94.1	76.8/93.0	99.8/100	88.9/99.7	96.2/96.6	99.4/99.6	79.7/88.7	91.5/95.9
SNL (FD)	100/100	98.1/99.6	91.7/97.4	99.9/100	99.2/99.9	100/100	100/100	99.9/100	95.8/99.1	87.6/95.3	100/99.6	96.7/92.5	98.4/99.6	99.7/99.4	99.1/97.8	\underline{97.7}/\underline{98.7}
RD [7]	66.5/100	79.3/95.0	93.6/96.3	97.0/98.9	99.0/100	100/99.9	100/100	99.3/100	95.0/96.6	96.5/97.0	98.7/99.3	99.1/99.5	92.9/96.7	99.4/99.2	99.1/98.5	94.4/98.5
SNL (RD)	100/100	94.2/99.1	95.4/97.7	98.6/99.3	99.2/100	100/100	100/100	100/100	95.8/97.9	96.6/98.1	100/99.7	99.4/99.4	95.2/99.6	99.6/99.2	99.7/98.9	98.3/99.3

Table 1: Unified anomaly detection results with image-level AUROC on MVTecAD. The multi-class/one-class performance is reported for each method. The best mean outcome is noted in bold and the runner-up mean outcome is underlined.

VisA	PCB1	PCB2	PCB3	PCB4	Macaroni1	Macaroni2	Capsules	Candles	Cashew	Chewing gum	Fryum	Pipe fryum	Mean
UniAD [28]	95.4/95.9	93.6/90.5	90.2/91.0	99.4/98.1	93.1/93.4	85.5/85.3	75.3/79.1	96.4/94.5	92.4/93.6	99.4/98.4	90.8/89.3	97.4/97.9	92.4/92.3
OmniAL [32]	77.7/96.6	81.0/99.4	88.1/96.9	95.3/97.4	92.6/96.9	75.2/89.9	90.6/87.9	86.8/85.1	88.6/97.1	96.4/94.9	94.6/97.0	86.1/91.4	87.8/94.2
FD [23]	87.1/93.8	79.0/89.3	79.0/84.1	95.6/96.7	87.0/93.4	69.3/83.9	69.4/85.2	91.6/96.4	92.5/98.8	95.5/96.8	94.5/99.5	92.6/99.1	86.1/93.0
SNL(FD)	94.1/95.0	92.3/93.8	90.3/95.0	99.3/97.4	92.2/93.7	73.2/88.9	72.3/86.9	94.9/96.6	93.8/99.0	96.0/99.3	94.6/99.5	94.9/99.2	90.7/95.3
RD [7]	95.9/97.1	94.4/97.0	92.3/96.4	99.7/99.8	97.8/97.3	85.6/98.6	76.8/89.5	94.2/94.3	92.6/97.6	90.8/98.4	95.9/96.2	97.2/94.6	\underline{92.8}/\underline{96.4}
SNL(RD)	98.1/97.6	94.8/96.4	95.0/97.3	99.9/99.9	96.8/98.2	84.3/91.9	76.1/91.3	94.7/95.3	95.4/97.8	97.6/98.9	95.9/96.5	99.5/99.9	94.0/96.8

Table 2: Unified anomaly detection results with image-level AUROC on VisA.

MVTecAD	Bottle	Cable	Capsule	Carpet	Grid	Hazelnut	Leather	Metal nut	Pill	Screw	Tile	Toothbrush	Transistor	Wood	Zipper	Mean
UniAD [28]	98.1/98.1	97.3/96.8	98.5/97.9	98.5/98.0	98.2/98.5	96.5/94.6	98.8/98.3	94.8/95.7	95.0/95.1	98.3/97.3	91.8/91.8	98.4/97.8	97.9/98.7	93.2/93.4	96.8/96.0	96.8/96.6
OmniAL [32]	99.2/99.0	97.3/97.1	96.9/92.2	99.4/99.6	99.4/99.6	98.4/98.6	99.3/99.7	99.1/99.1	98.9/98.6	98.0/97.2	99.0/99.4	99.4/99.2	93.3/91.7	97.4/96.9	99.5/99.7	98.3/\underline{97.8}
FD [23]	97.1/98.8	96.5/95.8	94.8/98.6	98.2/99.0	97.6/99.0	98.6/98.6	98.8/99.1	94.4/97.2	96.8/97.6	87.6/98.8	95.5/96.9	98.1/99.0	92.4/81.9	94.0/96.5	96.5/98.8	95.8/97.0
SNL (FD)	98.2/99.0	97.6/97.6	98.1/97.1	98.3/99.0	97.9/98.7	98.9/98.8	99.0/99.5	96.3/97.5	98.4/98.7	97.8/98.9	94.8/96.1	98.5/98.5	95.5/94.2	94.6/94.7	97.3/98.0	97.4/\underline{97.8}
RD [7]	92.1/98.7	84.7/97.4	98.4/98.7	98.8/98.9	99.1/99.3	99.0/98.9	99.3/99.4	93.5/97.3	98.5/98.2	99.2/99.6	95.9/95.6	98.9/99.1	87.6/92.5	96.1/95.3	98.4/98.2	96.0/\underline{97.8}
SNL(RD)	98.3/98.6	94.1/97.8	98.7/98.4	98.5/99.3	99.0/99.1	99.2/99.3	99.0/99.4	97.1/97.8	98.7/98.7	99.3/99.5	95.5/95.9	98.9/99.0	91.6/94.0	95.7/95.6	98.5/97.8	\underline{97.5}/98.0

Table 3: Unified anomaly localization results with pixel-level AUROC on MVTecAD.

VisA	PCB1	PCB2	PCB3	PCB4	Macaroni1	Macaroni2	Capsules	Candles	Cashew	Chewing gum	Fryum	Pipe fryum	Mean
UniAD [28]	99.3/99.2	97.9/96.7	98.4/98.0	97.9/98.8	99.3/98.9	98.0/97.1	98.3/98.6	99.1/98.9	98.5/99.2	99.1/98.5	97.6/97.8	99.1/99.4	\underline{98.5}/\underline{98.4}
OmniAL [32]	97.6/98.7	93.9/83.2	94.7/98.4	97.1/98.5	98.6/98.9	97.9/99.1	99.4/98.6	95.8/90.5	95.0/98.9	99.0/98.7	92.1/89.3	98.2/99.1	96.6/96.0
FD [23]	99.2/99.7	97.3/97.7	97.8/98.1	98.3/97.8	97.9/98.7	98.1/98.7	96.7/97.9	98.9/97.1	76.0/99.0	97.9/98.3	96.6/95.6	98.8/98.6	96.1/98.1
SNL(FD)	99.6/99.7	98.6/98.3	98.5/98.6	98.5/98.8	98.7/99.0	98.3/99.0	97.9/97.9	99.1/97.8	98.4/99.1	98.0/98.5	97.7/92.5	99.0/99.1	\underline{98.5}/98.2
RD [7]	99.5/99.7	97.8/98.0	98.6/99.3	98.2/98.3	99.6/99.6	99.1/99.4	98.0/99.6	98.7/98.5	75.2/93.5	93.9/98.2	97.5/97.1	99.2/99.3	96.2/\underline{98.4}
SNL(RD)	99.7/99.7	98.0/98.3	98.2/99.1	98.3/98.7	99.6/99.6	99.2/99.4	98.0/99.4	99.2/98.7	98.4/95.8	98.4/97.7	97.5/96.8	99.3/99.3	98.7/98.5

Table 4: Unified anomaly localization results with pixel-level AUROC on VisA.

4.1 Dataset

We evaluate the proposed method on two large-scale visual anomaly detection datasets: MVTecAD [4] and VisA [34]. MVTecAD is a comprehensive anomaly detection benchmark that includes 10 object categories and 5 texture categories. In total, there are 3629 normal images for training and 1725 unknown images for testing. VisA contains object images with more complex structures and multiple instances. It includes 9621 normal images and 1200 anomalous images from 12 categories. Note that we train a unified model on data from all categories and evaluate the anomaly detection performance across categories.

4.2 Implementation Details

All images in our experiments are resized to $256\times 256$ and normalized by the mean and variance of ImageNet [8]. By default, we use WideResNet-50 [29] as the backbone, where the teacher network loads the model pre-trained on ImageNet. For forward distillation, our framework is based on STPM [23], which uses the same teacher and student architectures. For reverse distillation, we follow the structure used in RD [7]. In this study, we apply the same processing and training strategies to FD-based and RD-based methods. During training, we use the Adam optimizer with a learning rate of $0.005$ and a batch size of 8. For hyper-parameters, we simply set $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$ and the cluster number $N=50$. The experiments are implemented using PyTorch [18] on a single RTX3090 GPU. Following UniAD [28], we use image-level and pixel-level Area Under the Receiver Operating Characteristic Curve (AUROC) to evaluate anomaly detection and localization performance.
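For reference, AUROC can be computed directly from its pairwise-ranking definition. The sketch below is a plain NumPy illustration, not the evaluation code used in the experiments, and the function name `auroc` is hypothetical. It compares every anomalous score with every normal score, so it suits image-level scores; pixel-level evaluation over full maps would normally use a sort-based implementation instead.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that an anomalous sample outranks a normal one.

    scores: anomaly scores (higher means more anomalous)
    labels: 1 for anomalous, 0 for normal
    Ties between an anomalous and a normal score count as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Image-level usage: one S_AD score per test image.
perfect = auroc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
partial = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

For pixel-level AUROC the same function would be applied to the flattened anomaly maps and ground-truth masks, subject to the memory caveat above.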

Learning Objective				Module	Evaluation
$L_{cd}$	$L_{sd}$	$L_{intra}$	$L_{inter}$	CRAM	image-level	pixel-level
✓	-	-	-	-	94.4	96.0
✓	✓	-	-	-	94.7	96.4
✓	✓	✓	-	-	97.4	97.0
✓	✓	✓	✓	-	97.7	97.1
✓	✓	-	-	✓	97.1	97.3
✓	✓	✓	✓	✓	98.3	97.5

Table 5: Ablation studies of proposed methods. The evaluation is based on RD [7] and all results are evaluated on MVTecAD.

4.3 Main Results

The unified models, UniAD [28] and OmniAL [32], which are specifically proposed for multi-class anomaly detection, are the primary comparators of our approach. In addition, we reproduce the performance of forward distillation (FD) [23] and reverse distillation (RD) [7] on multi-class anomaly detection as the baselines. We include a detailed table comparing the performance of other models on multi-class anomaly detection in the supplement [5, 27, 6, 13, 20, 30]. We evaluate anomaly detection with image-level AUROC in Tables 1&2 and anomaly localization with pixel-level AUROC in Tables 3&4 for the MVTecAD and VisA datasets. Notably, we not only report the results of multi-class anomaly detection, but also evaluate on one-class anomaly detection as a reference.

Anomaly Detection.

We show the anomaly detection results on MVTecAD in Tab. 1. The degradation of FD and RD in multi-class anomaly detection is evident. Meanwhile, our proposed SNL improves the multi-class performance of FD and RD by 6.2% and 3.9%, respectively. Similarly, as shown in Tab. 2, performance degradation is observed on the VisA dataset, while SNL achieves 4.6% and 1.2% improvement over FD and RD. Remarkably, we outperform the reconstruction-based SOTA method UniAD by 1.8% on MVTecAD and 1.6% on VisA. Compared with OmniAL, a unified model based on anomaly synthesis, our method gains 1.1% on MVTecAD and exceeds it by a large margin of 6.2% on VisA. For one-class tasks, our SNL achieves the best two results based on FD and RD, which shows the generalizability of our method.

Anomaly Localization.

As shown in Tabs. 3&4, our SNL significantly boosts the anomaly localization capability of FD and RD on MVTecAD and VisA. Compared to UniAD, our SNL(RD) obtains an increase of 0.7% and 0.2% on MVTecAD and VisA, respectively. Although the localization performance of our method is slightly lower than that of OmniAL on MVTecAD, we outperform OmniAL by 2.1% on VisA. Anomaly synthesis generalizes poorly because it requires certain a priori knowledge. Therefore, a potential future study is to fuse anomaly synthesis into teacher-student networks to enhance localization ability, as in [22]. Furthermore, our SNL(RD) achieves the best one-class anomaly localization performance.

Visualization.

In Fig. 5, we show a qualitative comparison of the localization results of our method and RD [7] on various anomaly scenarios in multi-class anomaly detection. Cross-class interference does not mean that the model loses the ability to localize anomalies; rather, the model becomes less sensitive to anomalous regions. Thus, we can observe that the baseline model RD is able to coarsely localize anomalies, but not as accurately as our approach. Furthermore, such imprecision directly leads to less discriminative image-level anomaly scores, while it is only slightly reflected in the pixel-level anomaly scores. More visualization results are in the supplement.
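To make the last observation concrete, the toy sketch below contrasts a weakly highlighted (coarse) anomaly map with a sharply highlighted one. It assumes that the image-level score is the maximum value of the anomaly map, a common reduction in teacher-student detectors that is not restated in this section; the helper `image_score` and all map values are hypothetical and chosen only for illustration.

```python
def image_score(anomaly_map):
    """Assumed reduction: image-level score = maximum pixel score."""
    return max(v for row in anomaly_map for v in row)


# Hypothetical 2x2 anomaly maps; the defect sits at position (1, 0).
normal_map = [[0.10, 0.12], [0.11, 0.13]]  # defect-free image
coarse_map = [[0.10, 0.12], [0.20, 0.13]]  # defect found, weakly highlighted
sharp_map = [[0.10, 0.12], [0.80, 0.13]]   # defect found, strongly highlighted

# In both anomalous maps the defect pixel has the highest score, so a
# ranking-based pixel-level metric treats them alike within this image.
for m in (coarse_map, sharp_map):
    assert max(v for row in m for v in row) == m[1][0]

# The image-level margin over the normal image differs substantially.
margin_coarse = image_score(coarse_map) - image_score(normal_map)
margin_sharp = image_score(sharp_map) - image_score(normal_map)
print(f"coarse margin: {margin_coarse:.2f}, sharp margin: {margin_sharp:.2f}")
```

Under this assumption, a weakly highlighted defect leaves only a small gap between the scores of normal and anomalous images, so image-level separation suffers even when the defect is still ranked highest within its own map.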

Capacity	w/o	25	50	75
image-level	97.7	98.2	98.3	98.0
pixel-level	97.1	97.5	97.5	97.4

Table 6: Ablation study on the number of CRAM centers.

4.4 Ablation Study

We conduct detailed ablation experiments to evaluate each component of our approach. As shown in Tab. 5, we analyze the effects of the structural distillation objective functions and the CRAM normality representation. First, we decompose the feature reconstruction into channel-wise and spatial-wise learning; adding spatial feature distillation yields gains of 0.3% and 0.4% in image-level and pixel-level AUROC. Then, intra-&inter-affinity distillation, which helps the teacher-student network capture pairwise feature similarities, brings a further 3.0% and 0.7% improvement in image-level and pixel-level performance. In addition, our normality learning module, CRAM, enables the student network to learn compact normal representations, resulting in gains of 2.4% and 0.9% over the spatial-channel distillation baseline. Overall, our SNL boosts the baseline by 3.9%/1.5% in anomaly detection/localization performance.
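The gains quoted above follow directly from the rows of Tab. 5. The short sketch below reproduces the arithmetic; the dictionary keys are shorthand labels introduced here for the table rows and are not names used elsewhere in the paper.

```python
# Rows of Tab. 5 (RD-based, MVTecAD): (image-level AUROC, pixel-level AUROC).
table5 = {
    "cd": (94.4, 96.0),
    "cd+sd": (94.7, 96.4),
    "cd+sd+intra": (97.4, 97.0),
    "cd+sd+intra+inter": (97.7, 97.1),
    "cd+sd+cram": (97.1, 97.3),
    "full": (98.3, 97.5),
}


def gain(after, before):
    """Absolute AUROC gain (image-level, pixel-level) between two rows."""
    return tuple(round(a - b, 1) for a, b in zip(table5[after], table5[before]))


print(gain("cd+sd", "cd"))                 # spatial distillation: (0.3, 0.4)
print(gain("cd+sd+intra+inter", "cd+sd"))  # affinity distillation: (3.0, 0.7)
print(gain("cd+sd+cram", "cd+sd"))         # CRAM alone: (2.4, 0.9)
print(gain("full", "cd"))                  # full SNL vs. baseline: (3.9, 1.5)
```

Note that the affinity and CRAM gains are both measured against the row with channel and spatial distillation, whereas the overall gain is measured against the channel-only baseline.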

For CRAM, we assess the effect of the number of normality centers on the results in Tab. 6. We observe a clear improvement once CRAM is added (97.7/97.1 without CRAM versus 98.2/97.5 with 25 centers), but increasing the codebook capacity beyond 50 centers does not yield further improvement. Ablation studies regarding model capacities, number of network blocks, etc. can be found in the supplement.

5 Conclusion

In this study, we identify the cross-class interference problem in teacher-student networks for multi-class anomaly detection, which severely reduces their ability to discriminate anomalies. To overcome this obstacle, we propose structural teacher-student normality learning, which consists of structural distillation and the central residual aggregation module (CRAM). Specifically, structural distillation introduces feature affinity reconstruction to capture pairwise feature similarities between teacher and student features. In addition, CRAM assists the student network in learning discriminative normality representations. Extensive experiments show that our method substantially improves the baseline results and outperforms state-of-the-art methods.

References

Acsintoae et al. [2022] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnormal: New benchmark for supervised open-set video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20143–20153, 2022.
Arandjelovic et al. [2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
Bergmann et al. [2018] Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, and Carsten Steger. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv preprint arXiv:1807.02011, 2018.
Bergmann et al. [2019] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019.
Bergmann et al. [2020] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4183–4192, 2020.
Defard et al. [2021] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
Deng and Li [2022] Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9737–9746, 2022.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
Gong et al. [2019] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1705–1714, 2019.
Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Jezek et al. [2021] Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions. In 2021 13th International congress on ultra modern telecommunications and control systems and workshops (ICUMT), pages 66–71. IEEE, 2021.
Lee et al. [2022] Sungwook Lee, Seunghyun Lee, and Byung Cheol Song. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access, 10:78446–78454, 2022.
Li et al. [2021] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9664–9674, 2021.
Liu et al. [2018] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.
Liu et al. [2019] Yifan Liu, Ke Chen, Kris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019.
Lv et al. [2021] Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. Learning normal dynamics in videos with meta prototype network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15425–15434, 2021.
Park et al. [2020] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14372–14381, 2020.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
Ruff et al. [2018] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International conference on machine learning, pages 4393–4402. PMLR, 2018.
Salehi et al. [2021] Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R Rabiee. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14902–14912, 2021.
Schlegl et al. [2019] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis, 54:30–44, 2019.
Tien et al. [2023] Tran Dinh Tien, Anh Tuan Nguyen, Nguyen Hoang Tran, Ta Duc Huy, Soan Duong, Chanh D Tr Nguyen, and Steven QH Truong. Revisiting reverse distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24511–24520, 2023.
Wang et al. [2021] Guodong Wang, Shumin Han, Errui Ding, and Di Huang. Student-teacher feature pyramid matching for anomaly detection. arXiv preprint arXiv:2103.04257, 2021.
Wu et al. [2021] Aming Wu, Suqi Zhao, Cheng Deng, and Wei Liu. Generalized and discriminative few-shot object detection via svd-dictionary enhancement. Advances in Neural Information Processing Systems, 34:6353–6364, 2021.
Yang et al. [2020] Jie Yang, Yong Shi, and Zhiquan Qi. Dfr: Deep feature reconstruction for unsupervised anomaly segmentation. arXiv preprint arXiv:2012.07122, 2020.
Yang et al. [2022] Zhiwei Yang, Peng Wu, Jing Liu, and Xiaotao Liu. Dynamic local aggregation network with adaptive clusterer for anomaly detection. In European Conference on Computer Vision, pages 404–421. Springer, 2022.
Yi and Yoon [2020] Jihun Yi and Sungroh Yoon. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian conference on computer vision, 2020.
You et al. [2022] Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection. Advances in Neural Information Processing Systems, 35:4571–4584, 2022.
Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Zavrtanik et al. [2021] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2021.
Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Zhao [2023] Ying Zhao. Omnial: A unified cnn framework for unsupervised anomaly localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3924–3933, 2023.
Zimmerer et al. [2022] David Zimmerer, Peter M Full, Fabian Isensee, Paul Jäger, Tim Adler, Jens Petersen, Gregor Köhler, Tobias Ross, Annika Reinke, Antanas Kascenas, et al. Mood 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Transactions on Medical Imaging, 41(10):2728–2738, 2022.
Zou et al. [2022] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.