A SAM-guided Two-stream Lightweight Model for Anomaly Detection

Chenghao Li [email protected], Lei Qi [email protected] and Xin Geng [email protected]. School of Computer Science and Engineering (Southeast University), and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, Jiangsu, China 210019

Abstract.

In industrial anomaly detection, model efficiency and mobile-friendliness become the primary concerns in real-world applications. Simultaneously, the impressive generalization capabilities of Segment Anything (SAM) have garnered broad academic attention, making it an ideal choice for localizing unseen anomalies and diverse real-world patterns. In this paper, considering these two critical factors, we propose a SAM-guided Two-stream Lightweight Model for unsupervised anomaly detection (STLM) that not only meets these two practical requirements but also harnesses the robust generalization capabilities of SAM. We employ two lightweight image encoders, i.e., our two-stream lightweight module, guided by SAM’s knowledge. Specifically, one stream is trained to generate discriminative and general feature representations in both normal and anomalous regions, while the other stream reconstructs the same images without anomalies, which effectively enhances the differentiation of the two-stream representations in anomalous regions. Furthermore, we employ a shared mask decoder and a feature aggregation module to generate anomaly maps. Our experiments on the MVTec AD benchmark show that STLM, with about 16M parameters and an inference time of 20 ms, is competitive with state-of-the-art methods, reaching 98.26% pixel-level AUC and 94.92% PRO. We further experiment on more difficult datasets, e.g., VisA and DAGM, to demonstrate the effectiveness and generalizability of STLM. Code is available online at https://github.com/StitchKoala/STLM.

Corresponding author: Lei Qi

1. Introduction

Image anomaly detection and localization tasks have attracted great attention in various domains, e.g., industrial quality control (Bergmann et al., 2019a; Roth et al., 2022), medical diagnoses (Fernando et al., 2021), and video surveillance (Liu and Ma, 2019; Guo et al., 2017). These tasks aim to identify both abnormal images and anomalous pixels within images, based on normal or pseudo-anomalous samples seen during training. However, anomaly detection and localization are especially hard, as anomalous images occur rarely and anomalies can vary from subtle changes to large defects such as broken parts. Creating a dataset that includes sufficient anomalous samples with all possible anomaly types for training is a formidable challenge. Therefore, most methods tackle anomaly detection (AD) tasks with unsupervised models (Tien et al., 2023; Deng and Li, 2022; Chen et al., 2020) that rely only on normal samples, which is of great significance in practice.

Figure 1. Comparisons of different anomaly detection methods in terms of pixel-level AUROC (vertical axis), inference time (horizontal axis), and the ratios of parameter numbers (circle radius). Our STLM achieves competitive pixel-level AUROC for anomaly detection while being 8× faster than PatchCore, 4× faster than FOD which achieves the highest pixel-level AUROC, 1× faster than SimpleNet, and 0.5× faster than RD++ (154.87M). In addition, STLM requires only 16.56M of parameters for inference, making it one of the most efficient methods.

To this end, memory-bank methods have been proposed, in which a core set stores features extracted from a pre-trained backbone and anomalies are detected by computing a patch-level distance between the core set and the test sample (Yao et al., 2023a; Zhang et al., 2023d). However, building memory banks comes at the cost of increased computational complexity and large memory consumption.

Recent efforts have been directed towards proposing effective anomaly detection methods, with particular emphasis on reconstruction-based methods (Chen et al., 2023; Vasilev et al., 2020; Zhang et al., 2023b), which are favored for their simplicity and interpretability. These methods typically employ autoencoders (Bergmann et al., 2019b; Sarafijanovic-Djukic and Davis, 2019; Zhou and Paffenroth, 2017; Shi et al., 2021), variational autoencoders (Vasilev et al., 2020; Liu et al., 2020), or generative adversarial networks (Goodfellow et al., 2014). Based on the assumption that the reconstruction error of normal regions is lower than that of abnormal ones, they compare the test images with their reconstructed counterparts. However, recent studies have revealed that deep models can generalize so effectively that they accurately restore anomalous regions (Zavrtanik et al., 2021b), undermining the effectiveness of these methods for anomaly detection.

To address this challenge, memory modules have been integrated into reconstruction-based methods (Park et al., 2020; Gong et al., 2019). These modules store representations of normal images, and the representations of test images are used as queries to retrieve the most relevant memory items for reconstruction. However, they suffer from high memory requirements and search times, making them impractical for real-world industrial applications.

The knowledge distillation (KD) (Hinton et al., 2015) based framework has shown its effectiveness in anomaly detection and localization (Tien et al., 2023; Roth et al., 2022; Salehi et al., 2021). For example, Salehi et al. (Salehi et al., 2021) established a student-teacher (S-T) network pair where knowledge is transferred from the teacher to the student. The underlying hypothesis is that the student network sees only normal samples during training, leading it to generate out-of-distribution representations for anomalous queries during inference. However, this hypothesis does not always hold true due to architectural similarities and shared data flow (Deng and Li, 2022). To overcome this limitation, DeSTSeg (Zhang et al., 2023c) introduces a denoising student network to generate distinct feature representations from those of the teacher when handling anomalous inputs.

Recently, foundation models, e.g., Segment Anything (SAM) (Kirillov et al., 2023), have demonstrated great zero-shot abilities through the retrieval of prior knowledge. SAM is trained on millions of annotated images, enabling it to generate high-quality segmentation results for previously unseen images. Nevertheless, foundation models often have limitations within certain domains (Ji et al., 2023), and SAM contains 615M parameters, which conflicts with the mobile-friendly requirement.

Considering the generality problem of unseen anomalies and the real-world application requirements, we propose a SAM-guided Two-stream Lightweight Model for unsupervised anomaly detection (STLM). It takes advantage of the robust generalization capabilities of foundation models and aligns with the mobile-friendly requirements. As illustrated in Figure 2, STLM consists of a fixed SAM teacher, a trainable two-stream lightweight model (TLM), and a feature aggregation (FA) module. We start with a Pseudo Anomaly Generation preprocessing step on the normal training images to balance the number of normal and anomalous images.

After processing the data, the TLM is introduced. First, instead of directly utilizing the pre-trained teacher network during inference as in KD, we distill the comprehensive knowledge of the fixed SAM into a student stream that serves as a lightweight “teacher”, called the plain student stream, which is more precise and generalized for representing features related to the anomaly detection (AD) task. Second, inspired by (Zhang et al., 2023c), we also train a denoising stream, likewise distilled from SAM. The denoising student stream takes a pseudo-anomalous image as input, whereas the teacher SAM operates on the original, clean image. Our method effectively enhances the differentiation of the two-stream features in anomalous regions. The impressive capabilities of SAM alleviate the generality problem that arises because genuine anomalies, as well as the various normal patterns found in practical scenarios, are never seen during training. In our method, we employ the encoder of MobileSAM (Zhang et al., 2023a) as the backbone for our TLM’s encoder. In addition, we design a shared mask decoder and a feature aggregation module to generate anomaly maps.

We evaluate our method on the MVTec AD benchmark (Bergmann et al., 2019a), which is specifically designed for anomaly detection and localization. Extensive experimental results show that our method is competitive with state-of-the-art while tackling the task of image-level and pixel-level anomaly detection. With remarkable parameter efficiency and swift inference speed, our STLM achieves competitive results on MVTec, VisA and DAGM. In order to validate the effectiveness of our proposed components, we conduct comprehensive ablation studies. The contributions of the proposed method are highlighted as follows:

$\bullet$ We propose a SAM-guided Two-stream Lightweight Model for unsupervised anomaly detection that not only conforms to the model efficiency and mobile-friendliness demands of practical industrial applications but also takes advantage of the robust generalization capabilities of SAM for effectively exploring unseen anomalies and diverse normal patterns.

$\bullet$ We conduct extensive experiments on MVTec, VisA and DAGM. Results show that our method with about 16M parameters is competitive with state-of-the-art methods on detection and localization, underscoring its robust generalization capabilities. Notably, taking both performance and model parameter size into consideration, our method is a promising solution for practical applications.

2. Related Work

In this section, we review work related to ours, including deep learning methods for anomaly detection and localization, vision foundation models, and data augmentation strategies for AD tasks.

2.1. Deep Learning Methods for Anomaly Detection and Localization

Prior works of anomaly detection and localization have explored various methodologies. The method of image reconstruction is widely adopted, which posits that accurately reconstructing anomalous regions can be challenging due to their absence in the training samples. Generative models such as autoencoders (Bergmann et al., 2019b; Li et al., 2021a), variational autoencoders (Vasilev et al., 2020; Xu et al., 2023), and generative adversarial networks (Goodfellow et al., 2014; Peng and Qi, 2019) are also utilized to reconstruct normal images from anomalous ones (Akrami et al., 2022). Nonetheless, these methods face certain limitations, especially when reconstructing complex textures and patterns. Later methods use deep models to enhance the quality of reconstructing images (Zavrtanik et al., 2021a, b; Zhang et al., 2023b).

Recently, memory-bank methods, in which a core set stores features extracted from a pre-trained backbone, compute a patch-level distance between the core set and the sample to detect anomalies (Yao et al., 2023a). The method in (Zhang et al., 2023d) focuses on learning feature residuals of varying scales and sizes between anomalous and normal patterns, which proves beneficial in accurately reconstructing the segmentation maps of anomalous regions. However, these methods, which involve creating memory banks, come at the cost of increased computational complexity.

Knowledge distillation (Hinton et al., 2015) relies on a pre-trained teacher network and a trainable student network. Since the student network is trained only on anomaly-free samples, its feature representations are expected to differ from those of the teacher network on anomalous inputs (Bergmann et al., 2020; Salehi et al., 2021; Wang et al., 2021). For instance, to capture anomalies at multiple scales, multi-resolution knowledge distillation (Salehi et al., 2021) is used to distinguish unusual features on multi-level features. Prior studies aim to enhance the similarity of the features when processing normal images, whereas DeSTSeg (Zhang et al., 2023c) tries to separate their representations specifically when dealing with anomalous regions. However, the teacher network struggles to accurately represent unseen normal textures and patterns with some variations, and the denoising student network generalizes poorly because it never sees genuine anomalies.

2.2. Vision Foundation Models

Vision Foundation Models showcase remarkable zero-shot capabilities to address numerous downstream tasks. CLIP (Radford et al., 2021) can represent visual content using textual prompts and is trained on 400 million image-text pairs, rivaling the image classification accuracy of fully supervised works. Recently, SAM (Kirillov et al., 2023), which is trained on millions of annotated images, generates high-quality segmentation results for previously unseen images. However, foundation models often exhibit limitations in specific domains, including medical imaging and the field of anomaly detection (Ji et al., 2023). Since they are mostly transformer-based frameworks, they also suffer from high computational complexity and latency at inference time (Pope et al., 2023). Moreover, because they are mostly trained on natural images, they tend to prioritize foreground objects and may struggle to accurately segment small or irregular objects. Therefore, our work seeks to explore how to effectively harness the extensive knowledge provided by these off-the-shelf models to detect anomalies.

2.3. Data augmentation strategies

Data augmentation strategies play a pivotal role in the field. One widely used technique involves simulating pseudo-anomalous data by introducing artificial anomalies into the provided anomaly-free samples. This method effectively transforms the one-class classification anomaly detection task into a supervised learning framework (Li et al., 2021b; Zavrtanik et al., 2021b; Collin and De Vleeschouwer, 2021). Classical anomaly simulation strategies, such as rotation (Gidaris et al., 2018) and cutout (DeVries and Taylor, 2017), have exhibited limitations in effectively identifying subtle anomalous patterns. A simple yet effective strategy called CutPaste (Li et al., 2021b) generates pseudo-anomalies by copying and pasting a rectangular region from one part of the image to another. Since the resulting model focuses on detecting local features such as edge discontinuity and texture perturbations, it may fall short when detecting and localizing larger defects and global structural anomalies. In addition, it typically generates only one type of pseudo-anomaly, limiting its adaptability to the various anomaly types commonly encountered in industrial settings.

More recent work has adopted the ideas presented in (Zavrtanik et al., 2021a) for anomaly simulation, and we likewise use pseudo-anomalous data with the capacity to generalize across diverse types of unseen defects. However, introducing pseudo-anomalies onto normal images renders these models more sensitive to anomalous regions, so factors such as noisy backgrounds or fluctuating illumination in test images may act as distractions and increase false positive detections (Collin and De Vleeschouwer, 2021). Notably, SimpleNet synthesizes anomalous features by adding Gaussian noise to normal features, a process that requires careful hyperparameter tuning.

3. Method

Together with the pseudo anomalies introduced into normal training images with predefined probabilities (in Section 3.1) to maintain balanced data distribution, we propose the SAM-guided Two-stream Lightweight Model for unsupervised anomaly detection (STLM) to effectively generate an anomaly map for anomaly detection and localization. As illustrated in Figure 2, the two-stream lightweight model (TLM, in Section 3.2), consisting of a plain student stream and a denoising student stream, distills different information from a fixed SAM (Kirillov et al., 2023) during training. Consequently, a feature aggregation module (FA, in Section 3.3) is trained to fuse anomaly features with the aid of additional supervision signals. For inference (in Section 3.4), anomaly maps are generated solely using the TLM and the FA module. In the subsequent sections, we will provide a detailed description.

3.1. Pseudo Anomaly Generation

To address the issue of data imbalance, the training of our model relies on diverse types of pseudo-anomalous images, generated using the methodology introduced in (Zavrtanik et al., 2021a). A noise image is generated using a Perlin noise generator and then binarized to obtain an anomaly mask $M$ . The pseudo-anomalous image $I_{a}$ is defined as follows:

$$I_{a}=\bar{M}\odot N+(1-\beta)(M\odot A)+\beta(M\odot N), \tag{1}$$

where $N$ is the normal sample, $A$ is an arbitrary image from an external data source, $\bar{M}$ is the inverse of $M$ , and $\odot$ denotes element-wise multiplication. $\beta$ is the opacity parameter that controls how abnormal and normal regions are blended. Figure 3 shows pseudo-anomalous images generated by this strategy. In practice, this generation method is applied to each training image with a probability of 0.5, with the ablation study shown in Figure 10. Notably, the generation is performed on the fly during training.
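As an illustration, the blending in Equation (1) can be sketched in NumPy as follows. This is a minimal sketch, not the released implementation: the function names are hypothetical, the mask generator is a simple thresholded block-noise stand-in for binarized Perlin noise, and the default value of `beta` is an assumption chosen only for the example.

```python
import numpy as np

def smooth_noise_mask(height, width, cell=8, threshold=0.5, seed=0):
    """Stand-in for binarized Perlin noise (assumption): upsample coarse
    random noise into blocks, then threshold it into a binary mask M."""
    rng = np.random.default_rng(seed)
    coarse = rng.random((height // cell + 1, width // cell + 1))
    # Nearest-neighbour upsampling produces blob-like regions.
    noise = np.kron(coarse, np.ones((cell, cell)))[:height, :width]
    return (noise > threshold).astype(np.float32)

def blend_pseudo_anomaly(normal, source, mask, beta):
    """Equation (1): I_a = (1 - M) * N + (1 - beta) * (M * A) + beta * (M * N).

    normal, source: float arrays of shape (H, W, C); mask: binary (H, W).
    """
    mask = mask[..., None]          # broadcast the mask over channels
    inv_mask = 1.0 - mask           # \bar{M}
    return inv_mask * normal + (1.0 - beta) * (mask * source) + beta * (mask * normal)

def maybe_corrupt(normal, source, beta=0.5, prob=0.5, seed=0):
    """Apply the pseudo anomaly with probability `prob` (0.5 in the text).
    Returns the (possibly corrupted) image and its ground-truth mask."""
    rng = np.random.default_rng(seed)
    h, w = normal.shape[:2]
    if rng.random() >= prob:
        return normal.copy(), np.zeros((h, w), dtype=np.float32)
    mask = smooth_noise_mask(h, w, seed=seed)
    return blend_pseudo_anomaly(normal, source, mask, beta), mask
```

Outside the mask the output equals the normal image, and inside the mask it is a convex combination of the external texture and the normal image weighted by `beta`.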

3.2. Two-Stream Lightweight Model

In traditional KD, the teacher network is typically a large and deep expert network with extensive capabilities, while the student network adopts a neural network structure similar to that of the teacher (Bergmann et al., 2019b). Although this method enhances precision compared to early works, the assumption that the student and teacher produce different features on anomalies does not always hold true due to architectural similarities and shared data flow. Although (Zhang et al., 2023c) introduces a denoising student network to address this issue, it still fails to consider that the teacher network struggles to accurately represent unseen normal textures and patterns that vary from the training set. Moreover, the denoising student network generalizes poorly because it never sees genuine anomalies in the training stage.

3.2.1. Mobile Distillation

SAM (Kirillov et al., 2023) has attracted notable academic interest owing to its remarkable zero-shot capabilities and versatility across various vision applications. In our paper, we aim to harness the robust generalization abilities of SAM and render SAM more mobile-friendly.

First, we follow the idea, adopted by (Zhang et al., 2023d, c), of encouraging the student network to generate anomaly-specific features that diverge from those of the teacher network. This strategy effectively helps overcome the limitations posed by network architecture similarities and identical data flows. However, as noted in (Zhang et al., 2023b), test data in many real-world applications can exhibit large distribution shifts and may correspond to multiple normal patterns, leading to imprecise representations from the teacher network. The plain student stream is therefore trained to produce discriminative and general feature representations in both normal and anomalous regions as a superior alternative to the teacher network, as demonstrated in Section 4.4. Similarly guided by SAM’s knowledge, the denoising student stream aligns its feature representations with those of the same images without any corruption, which effectively enhances the differentiation of the representations from the two student streams in anomalous regions.

Second, the principal aim of our paper is to develop a model that can deliver satisfactory performance while significantly reducing the number of parameters and inference time. SAM has demonstrated its ability to function on resource-constrained devices, primarily thanks to its lightweight mask decoder. However, the default image encoder in the original SAM relies on ViT-H, which has over 600M parameters and is considered heavyweight. Since the 3.2M-parameter decoder is already lightweight enough, we replace the default image encoder with ViT-Tiny (Wu et al., 2022) while keeping the original architecture of the mask decoder.

Consequently, SAM serves as the teacher network for the TLM. The output feature maps are extracted from the two-layer decoder of SAM, shown in Figure 4 (a). With no need for segmentation outcomes, we only use the Two-Way-Transformer blocks of the mask decoder, and introduce our FA module. The output features from the pre-trained SAM with normal images as input are denoted as $T_{P}^{1}$ and $T_{P}^{2}$ , respectively. Similarly, those originating from pseudo-anomalous images are denoted as $T_{D}^{1}$ and $T_{D}^{2}$ . TLM adopts the image encoder of MobileSAM (Zhang et al., 2023a), i.e., ViT-Tiny (Wu et al., 2022), and the output features of its two streams are denoted as $S_{P}^{1}$ , $S_{P}^{2}$ , $S_{D}^{1}$ , and $S_{D}^{2}$ , respectively.

To distill knowledge from SAM to the plain student stream, we minimize the cosine distance between features from $T_{P}^{k}$ and $S_{P}^{k}$ , where $k$ = 1,2. Additionally, we minimize the cosine distance between features from $T_{D}^{k}$ and $S_{D}^{k}$ , for $k$ = 1,2, to supervise the denoising student stream in reconstructing normal features. The cosine similarity can be computed through Equation (2) and the loss function for optimizing the network is formulated as Equation (3) and Equation (4).

$$X_{k}(i,j)=\frac{T_{k}(i,j)\odot S_{k}(i,j)}{\|T_{k}(i,j)\|_{2}\,\|S_{k}(i,j)\|_{2}}, \tag{2}$$

$$\mathcal{L}_{p}=\sum_{k=1}^{2}\left\{\frac{1}{H_{k}W_{k}}\sum_{i,j=1}^{H_{k},W_{k}}\left(1-X_{k}^{P}(i,j)\right)\right\}, \tag{3}$$

$$\mathcal{L}_{de}=\sum_{k=1}^{2}\left\{\frac{1}{H_{k}W_{k}}\sum_{i,j=1}^{H_{k},W_{k}}\left(1-X_{k}^{D}(i,j)\right)\right\}, \tag{4}$$

where $k$ indexes the feature layers used in training, and $i$ and $j$ denote the spatial coordinates on the feature map. In particular, $i=1\ldots H_{k}$ and $j=1\ldots W_{k}$ , where $H_{k}$ and $W_{k}$ denote the height and width of the $k$ -th feature map. The cosine similarity between features from SAM and the plain stream is denoted as $X_{k}^{P}(i,j)$ , while that between features from SAM and the denoising stream is denoted as $X_{k}^{D}(i,j)$ .
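The distillation losses in Equations (3) and (4) share one form and can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: feature maps are taken to have shape (C, H, W), the element-wise product in Equation (2) is summed over the channel dimension to obtain a scalar cosine similarity per location, and the function names and the small `eps` stabilizer are hypothetical additions.

```python
import numpy as np

def cosine_similarity_map(teacher, student, eps=1e-8):
    """Equation (2): cosine similarity at every spatial location.

    teacher, student: arrays of shape (C, H, W). Returns an (H, W) map.
    """
    dot = (teacher * student).sum(axis=0)
    norms = np.linalg.norm(teacher, axis=0) * np.linalg.norm(student, axis=0)
    return dot / (norms + eps)  # eps avoids division by zero (assumption)

def distillation_loss(teacher_feats, student_feats):
    """Equations (3)/(4): sum over layers of the mean cosine distance.

    teacher_feats, student_feats: lists with one (C, H_k, W_k) array per layer.
    """
    return sum(
        float(np.mean(1.0 - cosine_similarity_map(t, s)))
        for t, s in zip(teacher_feats, student_feats)
    )
```

Passing the plain-stream features gives the plain-stream loss, and passing the denoising-stream features gives the denoising loss. The loss is zero when student and teacher features point in the same direction at every location, regardless of their magnitudes.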

3.2.2. The Decoder

As observed in (Zhang et al., 2023a), the image embeddings generated by the student encoder can closely approximate those of the original teacher encoder. This observation suggests that using distinct decoders for the two streams of the TLM may not be necessary. Instead, a shared two-layer decoder, as illustrated in Figure 4, is an efficient alternative. We observe that the shared decoder, while economical in terms of parameters and computation time, achieves satisfactory results with only a slight decrease compared to employing two separate decoders, as shown in Figure 5.

3.2.3. Remark

The difference between our STLM and DeSTSeg (Zhang et al., 2023c) is twofold. 1) Our method learns not only a denoising student but also a plain student in our TLM. The plain student stream can learn generalized knowledge related to the anomaly detection task, which enables it to effectively represent normal features and even patterns not seen in the training set, ensuring that differences between the features of the two streams are well captured. 2) The highly generalizable SAM, as the teacher network of our TLM, also helps the denoising student to generate high-quality reconstructed results for genuine anomalies.

Besides, it is worth noting that our method exhibits superior performance and a more compact model compared to DeSTSeg.

3.3. Feature Aggregation Module

Previous studies (Salehi et al., 2021; Wang et al., 2021) have noted that sub-optimal results can occur when distinctions among features at different levels are not uniformly precise. Following an extensive series of experiments, we find that including a segmentation network, which guides feature fusion through additional supervision signals, improves performance. The feature aggregation (FA) module is composed of two residual blocks and an atrous spatial pyramid pooling (ASPP) module (Chen et al., 2017). Notably, we do not freeze the weights of the TLM when training the segmentation network.

Notably, our findings indicate that simultaneous training of these components yields enhanced results. Furthermore, it has been observed that reducing the depth of the FA module does not significantly impact the performance as indicated in Figure 6. As a result, an effort has been made to make this module more lightweight by reducing the channels from 256 to 128 and adjusting the dilation rate to [1, 1, 3], following a similar method in (Zhang et al., 2023e). It mitigates memory costs associated with both training and inference, which is a factor of paramount importance in practical implementations.

When training, synthetic anomalous images are employed as inputs for the TLM, with the corresponding binary anomaly mask serving as the ground truth. The decoder generates features that have the same shape as the ground truth mask $M$ . The similarities are calculated using Equation (2) and then concatenated as $\hat{X}$ , which is subsequently fed into the FA module. Inspired by (Zavrtanik et al., 2021a; Yang et al., 2023), a focal loss (Lin et al., 2017) and an L1 loss (Girshick, 2015) are applied to increase the robustness toward accurate segmentation of challenging examples and reduce over-sensitivity to outliers, respectively. The FA module outputs an anomaly score map $M_{ij}^{o}$ , which is of the same shape as the ground truth mask $M$ .

$$p_{ij}=M_{ij}M_{ij}^{o}+(1-M_{ij})(1-M_{ij}^{o}), \tag{5}$$

$$\mathcal{L}_{focal}=-\frac{1}{HW}\sum_{i,j=1}^{H,W}(1-p_{ij})^{\gamma}\log(p_{ij}), \tag{6}$$

$$\mathcal{L}_{l1}=\frac{1}{HW}\sum_{i,j=1}^{H,W}|M_{ij}-M_{ij}^{o}|, \tag{7}$$

where the focusing parameter $\gamma$ is set to 4, following the setting in (Zhang et al., 2023d, c).
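The two segmentation losses can be sketched in NumPy as follows. This is a minimal sketch: the function names and the clipping constant `eps` are hypothetical, and the L1 term is written as a non-negative mean absolute error, which is the usual convention for a quantity that is minimized.

```python
import numpy as np

def focal_loss(mask, pred, gamma=4.0, eps=1e-7):
    """Equations (5) and (6): focal loss between the binary ground-truth
    mask M and the predicted anomaly score map M^o (values in [0, 1])."""
    pred = np.clip(pred, eps, 1.0 - eps)             # keep log() finite (assumption)
    p = mask * pred + (1.0 - mask) * (1.0 - pred)    # Equation (5)
    return float(np.mean(-((1.0 - p) ** gamma) * np.log(p)))

def l1_loss(mask, pred):
    """Mean absolute error between the mask and the predicted score map."""
    return float(np.mean(np.abs(mask - pred)))
```

With $\gamma=4$, pixels that are already classified confidently contribute almost nothing to the focal term, so training concentrates on the hard pixels, which in this setting are mostly the few anomalous ones.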

3.4. Training and Inference

The training stage is illustrated in Figure 2 (a), and the loss functions are defined in Section 3.2 and Section 3.3. The total loss is:

$$\mathcal{L}_{total}=\mathcal{L}_{p}+\mathcal{L}_{de}+\mathcal{L}_{focal}+\mathcal{L}_{l1}, \tag{8}$$

where each loss function carries equal weight.

For inference, the fixed SAM is discarded and the procedure is presented in Figure 2 (b). For pixel $ij$ , the pixel-level anomaly score $M_{ij}^{o}$ is given by the output of the FA module at the end of the network. The output is expected to take higher values at anomalous pixels. To compute the image-level anomaly score, we take the average of the top- $K$ anomalous pixel values from the anomaly score map, following (Zhang et al., 2023d, c).
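The image-level score described above can be sketched in NumPy as follows. This is a minimal sketch with a hypothetical function name; the default of 100 pixels matches the value reported in the implementation details, and clamping `k` to the map size is an added safeguard.

```python
import numpy as np

def image_level_score(anomaly_map, k=100):
    """Average of the k largest values in a pixel-level anomaly score map."""
    flat = np.asarray(anomaly_map, dtype=np.float64).ravel()
    k = min(k, flat.size)  # guard against maps smaller than k (assumption)
    # np.partition places the k largest values in the last k positions.
    top_k = np.partition(flat, flat.size - k)[flat.size - k:]
    return float(top_k.mean())
```

Averaging the top values rather than taking the single maximum makes the image-level score less sensitive to one isolated noisy pixel.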

4. Experiments

4.1. Experimental Details

4.1.1. Datasets

We validate the effectiveness of our method using the MVTec AD (Bergmann et al., 2019a) dataset, a renowned benchmark in the field of anomaly detection and localization. MVTec AD comprises 5 texture and 10 object categories, each providing hundreds of normal images for training, along with a diverse set of both anomalous and normal images for evaluation. It also provides pixel-level ground truths for defective test images.

Additionally, we extend our experiments to the VisA (Zou et al., 2022) and DAGM (Wieler and Hahn, 2007) datasets to showcase the generalization capabilities of STLM on more complex datasets. The VisA dataset comprises 10,821 high-resolution color images (9,621 normal and 1,200 anomalous samples) encompassing 12 objects across 3 domains, establishing it as the most extensive anomaly detection dataset in the industrial domain to date. DAGM contains 10 textured objects with small abnormal regions that bear a strong visual resemblance to the background. For each class, we first move all anomalous samples from the original training set to the original test set and then move all normal samples from the test set to the training set. Furthermore, we randomly select 30 normal samples from the training set and designate them as "good" samples in the test set, removing them from the training set.

Additionally, we use the Describable Textures Dataset (DTD) (Cimpoi et al., 2014) as the source of anomaly images (denoted as A in Equation (1)).

4.1.2. Evaluation Metrics

In line with previous research, we employ AUROC (Area Under the ROC Curve, AUC) to evaluate image-level and pixel-level anomaly detection. However, anomalous regions typically occupy only a tiny fraction of the entire image. Consequently, pixel-level AUROC may not accurately reflect localization accuracy, as the false positive rate is dominated by the vast number of anomaly-free pixels (Tao et al., 2022). To offer a more comprehensive measure of localization performance, we also report the Per Region Overlap (PRO) (Bergmann et al., 2020), which assigns equal weight to anomaly regions of varying sizes, and pixel-level Average Precision (AP) (Saito and Rehmsmeier, 2015), which is more suitable for highly imbalanced classes, particularly in industrial anomaly localization where accuracy plays a critical role.
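For reference, AUROC can be computed directly from its probabilistic definition: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The NumPy sketch below uses a hypothetical function name and an all-pairs comparison, which is quadratic in memory and intended only for small illustrative inputs, not for full-resolution pixel-level evaluation.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly.

    scores: anomaly scores (any shape); labels: 1 for anomalous, 0 for normal.
    """
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one anomalous and one normal sample")
    diff = pos[:, None] - neg[None, :]   # all anomalous-vs-normal score gaps
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

Because every normal pixel enters the comparison, a large anomaly-free background can keep this value high even when small defects are poorly localized, which is the motivation for also reporting PRO and AP.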

Table 1. Anomaly detection and localization results in terms of AUROC (%) at image-level and pixel-level on the MVTec dataset (Bergmann et al., 2019a)


Method	DRAEM (Zavrtanik et al., 2021a)	CFLOW (Gudovskiy et al., 2022)	PatchCore (Roth et al., 2022)	RD4AD (Deng and Li, 2022)	SimpleNet (Liu et al., 2023)	DeSTSeg (Zhang et al., 2023c)	RD++ (Tien et al., 2023)	FOD (Yao et al., 2023a)	Ours
Carpet	96.90/97.50	97.60/99.20	99.10/99.00	98.70/98.90	99.70/98.50	-/98.30	100/99.20	100/-	99.48/99.91
Grid	99.90/99.70	98.10/98.90	97.30/98.70	100/98.30	99.70/98.80	-/99.20	100/99.30	100/-	95.57/95.40
Leather	100/99.00	99.90/99.70	100/99.30	100/99.40	100/99.20	-/99.70	100/99.40	100/-	100/99.11
Tile	100/99.20	97.10/96.20	99.30/95.80	99.70/95.70	99.80/97.00	-/99.10	99.70/96.60	100/-	100/99.58
Wood	99.50/95.50	98.70/86.00	99.60/95.10	99.50/95.80	100/94.50	-/98.00	99.30/95.80	99.1/-	100/97.99
Bottle	98.00/99.10	99.90/97.20	100/98.60	100/98.80	100/98.00	-/99.40	100/98.80	100/-	100/97.77
Cable	90.90/95.20	97.60/97.80	99.90/98.50	96.10/97.20	99.90/97.60	-/97.70	99.20/98.40	99.50/-	98.98/96.95
Capsule	91.30/88.10	97.00/99.10	98.00/99.00	96.10/98.70	97.70/98.90	-/99.10	99.00/98.80	100.00/-	98.69/98.48
Hazelnut	100/99.70	100/98.80	100/98.70	100/99.00	100/97.90	-/99.80	100/99.20	100/-	100/99.69
Metal_nut	100/99.60	98.50/98.60	99.90/98.30	100/97.30	100/98.80	-/99.00	100/98.10	100/-	100/99.45
Pill	97.10/97.30	96.20/98.90	97.50/97.60	98.70/98.10	99.00/98.60	-/99.10	98.40/98.30	98.40/-	98.20/98.61
Screw	98.70/99.30	93.10/98.90	98.20/99.50	97.80/99.70	98.20/99.30	-/98.80	98.90/99.70	96.7/-	99.47/97.56
Toothbrush	100/97.30	98.80/99.00	100/98.60	100/99.10	99.70/98.50	-/99.40	100/99.10	94.4/-	99.06/99.61
Transistor	91.70/85.20	92.90/98.20	99.90/96.50	95.50/92.30	100/97.60	-/92.50	98.50/94.30	100.0/-	97.58/94.81
Zipper	100/99.10	97.10/99.10	99.50/98.90	97.90/98.30	99.90/98.90	-/99.60	98.60/98.80	99.70/-	99.74/98.93
Average	97.60/96.70	97.50/97.70	99.20/98.10	98.70/97.80	99.57/98.14	99.00/98.20	99.44/98.25	99.20/98.30	99.05/98.26
Inference time (ms)	159	37	180	28	39	18	30	103	20
Parameters (M)	97	94.7	186.55	150.64	52.88	35.16	154.87	28.83	16.56

Bold indicates the best and underline the second best. All experiments are consistently conducted on an NVIDIA GeForce RTX 3090.

Method	VisA				MVTec				DAGM				Average
Method	I ↑	P ↑	O ↑	A ↑	I ↑	P ↑	O ↑	A ↑	I ↑	P ↑	O ↑	A ↑	I ↑	P ↑	O ↑	A ↑
PatchCore (Roth et al., 2022) (186.6M)	95.10	98.80	91.20	40.10	99.20	98.10	93.40	56.10	93.60	96.70	89.30	51.70	95.97	97.87	91.30	49.30
RD4AD (Deng and Li, 2022) (150.6M)	96.00	90.10	70.90	27.70	98.70	97.80	93.93	58.00	95.80	97.50	93.00	53.40	97.95	98.12	94.06	55.30
RD++ (Tien et al., 2023) (154.9M)	95.90	98.70	93.40	40.80	99.44	98.25	94.99	60.80	98.50	97.40	93.80	64.30	96.70	98.20	94.30	52.40
DeSTSeg (Zhang et al., 2023c) (35.2M)	91.95	97.73	90.02	40.76	99.00	98.20	95.11	75.80	97.44	93.55	87.88	56.99	96.13	96.49	91.00	57.85
SimpleNet (Liu et al., 2023) (52.9M)	96.80	97.80	88.70	36.30	99.57	98.14	90.00	54.80	95.30	97.10	91.30	48.10	97.22	97.68	90.00	46.40
FastFlow (92M)	82.20	88.20	59.80	15.60	90.50	95.50	85.60	39.80	87.40	91.10	79.90	34.20	86.70	91.60	75.10	29.87
DRAEM (Zavrtanik et al., 2021a) (97M)	88.70	94.40	73.70	30.50	97.60	96.70	92.10	68.40	90.80	86.80	71.00	30.60	92.37	92.63	78.93	43.17
Ours (16.6M)	96.73	98.36	93.83	47.63	99.05	98.26	94.92	76.32	98.30	96.33	91.14	64.91	98.03	97.65	93.30	62.95

“I”, “P”, “O” and “A” refer to the four metrics image-level AUROC, pixel-level AUROC, PRO and AP, respectively. The best results on PRO are highlighted in bold.

Table 2. Anomaly Detection and Localization on MVTec (Bergmann et al., 2019a), VisA (Zou et al., 2022) and DAGM (Zavrtanik et al., 2021a)

The SOTA methods we compare against include DRAEM (Zavrtanik et al., 2021a), CFLOW (Gudovskiy et al., 2022), CFA (Lee et al., 2022), PatchCore (Roth et al., 2022), RD4AD (Deng and Li, 2022), SimpleNet (Liu et al., 2023), DeSTSeg (Zhang et al., 2023c), RD++ (Tien et al., 2023), FastFlow (Yu et al., 2021), and FOD (Yao et al., 2023a).

4.1.3. Implementation Details

The two-stream encoders of TLM are ViT-Tiny (Wu et al., 2022) networks initialized with MobileSAM (Zhang et al., 2023a) weights, while SAM (Kirillov et al., 2023) serves as the teacher network. All images in the two datasets are resized to 1024 × 1024. We employ the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0005 for the TLM. For the FA module, we use the Stochastic Gradient Descent optimizer with the same configuration as in (Zhang et al., 2023c). All loss functions in our paper carry equal weight. We train for 200 epochs with a batch size of 2 and compute the average of the top 100 anomalous pixels as the image-level anomaly score. Notably, for a fair comparison, we use the same augmentation method, i.e., the pseudo anomaly generation strategy, that most state-of-the-art methods use. Prototypical (Zhang et al., 2023d) presents more complex anomaly generation strategies to obtain better performance.
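The image-level scoring rule described above (averaging the top 100 anomalous pixels) can be illustrated with a minimal sketch. The function name `image_level_score`, the nested-list representation of the anomaly map, and the fallback for maps with fewer than `top_k` pixels are assumptions made for illustration; they are not taken from the released code.

```python
from typing import Sequence


def image_level_score(anomaly_map: Sequence[Sequence[float]], top_k: int = 100) -> float:
    """Average the top_k largest pixel scores of a 2-D anomaly map."""
    # Flatten the map into a single list of per-pixel anomaly scores.
    pixels = [score for row in anomaly_map for score in row]
    if not pixels:
        raise ValueError("anomaly_map must contain at least one pixel")
    # Assumption: if the map has fewer than top_k pixels, use all of them.
    k = min(top_k, len(pixels))
    top = sorted(pixels, reverse=True)[:k]
    return sum(top) / k


# Toy 2x2 map: the two largest scores are 0.9 and 0.5.
toy_map = [[0.1, 0.9], [0.5, 0.3]]
print(image_level_score(toy_map, top_k=2))
```

Averaging several of the highest responses instead of taking the single maximum makes the image-level score less sensitive to one isolated noisy pixel.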

Table 3. Anomaly localization results in terms of PRO ($\%$) on the MVTec dataset (Bergmann et al., 2019a)

Class	CFA (Lee et al., 2022)	PatchCore (Roth et al., 2022)	RD4AD (Deng and Li, 2022)	DeSTSeg (Zhang et al., 2023c)	RD++ (Tien et al., 2023)	Ours
Carpet	96.54	96.60	97.00	97.12	97.70	96.97
Grid	94.04	96.00	97.60	96.40	97.70	91.56
Leather	97.43	98.90	99.10	98.88	99.20	98.05
Tile	89.26	87.30	90.60	97.63	92.40	99.23
Wood	90.54	89.40	90.90	95.26	93.30	98.41
Avg. Text.	93.56	93.64	95.04	97.06	96.06	96.84
Bottle	95.76	96.20	96.60	97.09	97.00	90.92
Cable	94.17	92.50	91.00	83.19	93.90	92.11
Capsule	93.66	95.50	95.80	95.12	96.40	97.28
Hazelnut	95.75	93.80	95.50	98.09	96.30	98.29
Metal_nut	94.54	91.40	92.30	94.54	93.00	92.95
Pill	97.19	93.20	96.40	95.12	97.00	94.69
Screw	95.23	97.90	98.20	93.66	98.60	98.60
Toothbrush	91.14	91.50	94.50	96.22	94.20	96.14
Transistor	95.35	83.70	78.00	90.94	81.80	82.50
Zipper	95.95	97.10	95.40	97.35	96.30	96.15
Avg. Obj.	94.87	93.28	93.37	94.13	94.45	93.96
Average	94.44	93.4	93.93	95.11	94.99	94.92
Parameters	25.6M	186.6M	150.6M	35.1M	154.9M	16.6M

Bold indicates the best and underline the second best. All experiments are consistently conducted on an NVIDIA GeForce RTX 3090.
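As a reading aid for Table 3, the following sketch recomputes the "Avg. Text.", "Avg. Obj." and "Average" entries of the "Ours" column from the per-class PRO values. It assumes these rows are unweighted means over classes; this is inferred from the numbers in the table and is not stated explicitly in the text.

```python
# Per-class PRO (%) of the "Ours" column, copied from Table 3.
texture = {"Carpet": 96.97, "Grid": 91.56, "Leather": 98.05, "Tile": 99.23, "Wood": 98.41}
objects = {
    "Bottle": 90.92, "Cable": 92.11, "Capsule": 97.28, "Hazelnut": 98.29,
    "Metal_nut": 92.95, "Pill": 94.69, "Screw": 98.60, "Toothbrush": 96.14,
    "Transistor": 82.50, "Zipper": 96.15,
}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


avg_text = mean(texture.values())
avg_obj = mean(objects.values())
# The overall average is taken over all 15 classes, not over the two group means.
avg_all = mean(list(texture.values()) + list(objects.values()))

print(f"Avg. Text. {avg_text:.2f}  Avg. Obj. {avg_obj:.2f}  Average {avg_all:.2f}")
```

Under this assumption the recomputed values agree with the reported 96.84, 93.96 and 94.92 to two decimals.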

4.2. Quantitative Results and Comparison

4.2.1. Anomaly Detection and Localization on MVTec

For the purpose of ensuring fair comparisons with other studies, we use the results reported in the original papers or the results re-evaluated by (Zhang et al., 2023c, d). In cases where results are not available, we indicate with a hyphen (‘-’).

Table 2 presents the results of anomaly detection and localization on the MVTec dataset. On average, our method achieves competitive performance and secures the highest score for 9 out of 15 classes. To provide a comprehensive overview of anomaly localization capabilities, we include additional PRO results in Table 3. In 5 out of 15 classes, our method achieves the best PRO score, and it remains comparable to the state of the art in the remaining classes. The specific values behind Figure 1 are listed in Table 2: our method has the fewest parameters and the second-fastest inference speed, demonstrating the model efficiency and mobile-friendliness of STLM.

All experiments are consistently conducted on an NVIDIA GeForce RTX 3090. Notably, STLM has fewer parameters than DeSTSeg but a longer inference time. We attribute this to the fact that software and hardware optimizations for ResNet-like frameworks are more mature than those for Transformers (Li et al., 2023). The optimal inference speed of the image encoder requires further study, which we leave for future work.
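Since the comparison above depends on how inference time is measured, a generic latency harness is sketched below. It is not the authors' benchmarking script: the helper `time_inference`, the warm-up count and the stand-in workload are all hypothetical. On a GPU, the device would additionally have to be synchronized before each clock read, because kernel launches are asynchronous.

```python
import time
from statistics import mean


def time_inference(run_once, warmup=3, runs=10):
    """Return the mean wall-clock latency of run_once() in milliseconds."""
    # Warm-up calls are excluded so one-off setup costs do not skew the mean.
    for _ in range(warmup):
        run_once()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(latencies_ms)


# Stand-in workload; a real measurement would call the model's forward pass.
latency = time_inference(lambda: sum(i * i for i in range(10_000)))
print(f"{latency:.3f} ms")
```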

4.2.2. Evaluations on other more difficult benchmarks

To evaluate the anomaly detection capabilities more thoroughly, we further benchmark our networks on two additional widely recognized datasets, VisA (Zou et al., 2022) and DAGM (Zavrtanik et al., 2021a). To provide a comprehensive overview of anomaly localization capabilities, we include 4 metrics: image-level AUROC, pixel-level AUROC, PRO and AP. The results, as presented in Table 2, indicate that our method achieves SOTA pixel-level PRO and AP of 93.83/47.63 on the VisA dataset. On the DAGM dataset, our method attains the highest AP (64.91) and the second-highest image-level AUROC (98.30). On average, STLM performs well across the three datasets, supporting its effectiveness and scalability when faced with increasingly complex datasets.
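To make the AUROC metric concrete, the sketch below computes it from its ranking definition: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The function name `auroc` and the toy scores are illustrative; evaluation code typically relies on an optimized library routine instead of this quadratic loop.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Each (positive, negative) pair contributes 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one anomalous image (score 0.3) is ranked below a normal one (0.8).
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The same function applies at the pixel level if every pixel is treated as one sample, which is why pixel-level AUROC tends to be dominated by the many normal pixels and is complemented by PRO and AP.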

4.3. Qualitative Results and Comparison

4.3.1. Visualization on MVTec

To further analyze the model performance, we present a qualitative evaluation of anomaly localization in Figure 7. The visual results demonstrate our model’s precise localization of anomalies. However, our method is not without failure cases. In Figure 9, we analyze instances where it does not perform as expected. As evidenced in the first and second columns, the sensitivity of our FA module is primarily responsible. However, as mentioned in (Zhang et al., 2023c), several uncertain ground truths also account for the observed failures. In the third and fourth columns, the ground truth highlights the designated location, whereas our masks cover only the anomalous region on the same images.

4.3.2. Visualization on other benchmarks

We conduct extensive qualitative experiments on the VisA dataset to visually demonstrate the localization accuracy of our method on more difficult benchmarks. The Feature Aggregation (FA) module also remains effective in these examples. As shown in Figure 8, the visual results demonstrate our model’s robust generalization.

4.4. Ablation Studies

4.4.1. The importance of Large Teacher, Plain Student Stream, Mask Decoder, and Feature Aggregation

In Table 4, we assess the effectiveness of our four design components through the following experiments: (1) we replace the fixed teacher network with a pre-trained MobileSAM to examine the contribution of SAM knowledge; (2) we remove the plain student stream and use the fixed SAM as the plain stream in the TLM, as in early KD frameworks (while still using both normal and pseudo-anomalous images to train the denoising student stream); (3) we employ the image embeddings extracted from the encoders as the FA module input and train with the cosine loss between the paired outputs of the two encoders; and (4) we substitute the FA module with an empirical feature fusion strategy (Wang et al., 2021). The best results are achieved when all four key design components are combined.

Notably, our full model outperforms the “w/o-PlainS” variant, which suggests that the plain student stream learns generalized knowledge related to anomaly detection tasks. This enables the plain student stream to effectively represent previously unseen patterns, ensuring that differences between the features of the two streams are well captured.

Table 4. Ablation studies on our main designs on the MVTec dataset (Bergmann et al., 2019a).

Large Teacher	Plain Student Stream	Mask Decoder	Feature Aggregation	I-AUROC	P-AUROC	PRO
Mobile Teacher	✓	✓	✓	96.36	97.57	93.07
✓	-	✓	✓	98.03	95.77	89.96
✓	✓	-	✓	97.74	95.65	88.05
✓	✓	✓	strategy (Wang et al., 2021)	93.68	96.81	91.58
✓	✓	✓	✓	99.05	98.26	94.92

I-AUROC, P-AUROC, and PRO ( $\%$ ) are used to evaluate image-level and pixel-level detection. The best results on PRO are highlighted in bold.

Table 5. Ablation studies on training strategy between one-stage and two-stage training on the MVTec dataset (Bergmann et al., 2019a).

Method	I-AUROC	P-AUROC	PRO
One-stage	99.05	98.26	94.92
Two-stage	98.98	98.06	90.12

4.4.2. Effect of the one-stage training strategy

We conduct ablation studies to investigate the impact of the two training strategies mentioned in Section 3.3, as detailed in Table 5. The results reveal that one-stage training delivers superior performance compared to the separate training of the TLM and FA module proposed in (Zhang et al., 2023c), particularly in terms of PRO scores. This suggests that training the TLM with the additional supervised signal of the pseudo anomaly mask $M$ enhances the network’s ability to locate tiny defects and to accurately delineate defect boundaries.

Table 6. Ablation studies on the input of feature aggregation module on the MVTec dataset (Bergmann et al., 2019a).

Method	I-AUROC	P-AUROC	PRO
Cosine	98.95	94.61	92.63
Subtraction	99.00	94.30	93.96
Concat	98.89	97.98	94.51
STLM (Ours)	99.05	98.26	94.92

4.4.3. Discrepancy among different manners for the input of the Feature Aggregation module

As mentioned in Section 3.3, the input of our feature aggregation module is two element-wise products of the feature maps from TLM, as defined by Equation (2). To validate the rationality of this setting, we evaluate three alternative feature combinations as input.

The first way computes the cosine distance between the feature maps of our two-stream lightweight model, making use of more prior information from STLM, which is trained through optimization of the cosine loss function. The second way involves concatenating features of the normal image and the anomalous residual representation defined by (Zhang et al., 2023d). The anomalous feature residuals denote the element-wise Euclidean distance between tensors from two streams. The third way directly concatenates the feature maps $S_{P}^{k}$ and $S_{D}^{k}$ of TLM as the input of our FA module.

The second and the third ways preserve the information from TLM more effectively. The results are presented in Table 6, indicating our method maintains a balance between prior knowledge and representation of information.
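The candidate inputs compared in Table 6 can be sketched for a single spatial location as follows. The vectors `s_p` and `s_d` stand for the plain-stream and denoising-stream features at that location. The helper names, the use of the absolute difference as the element-wise distance, and the L2 normalization before the product are assumptions for illustration; Equation (2) remains the authoritative definition of the default input.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a feature vector to unit length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def cosine_distance(a, b):
    """Cosine variant: a single scalar, 1 - cosine similarity."""
    return 1.0 - sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def residual(a, b):
    """Residual variant: element-wise distance, assumed here to be |a - b|."""
    return [abs(x - y) for x, y in zip(a, b)]


def concat(a, b):
    """Concat variant: keeps both streams unchanged, doubling the channels."""
    return list(a) + list(b)


def elementwise_product(a, b):
    """Product variant: channel-wise product of normalized features (assumed)."""
    return [x * y for x, y in zip(l2_normalize(a), l2_normalize(b))]


s_p, s_d = [3.0, 4.0], [4.0, 3.0]
print(cosine_distance(s_p, s_d), residual(s_p, s_d))
print(concat(s_p, s_d), elementwise_product(s_p, s_d))
```

In this sketch, summing the channels of the product gives the cosine similarity, so the product keeps the per-channel information that the scalar cosine distance collapses, while staying tied to the quantity optimized by the cosine loss.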

4.4.4. Effect of the knowledge distillation framework

We investigate the importance of the fixed SAM teacher by removing it, i.e., utilizing only TLM during the training and inference phases, as illustrated in Figure 12. The results are shown in Table 7. We find that the knowledge distillation framework, comparing the two-stream representations, is indispensable for detecting anomalous regions.

4.4.5. Analysis of the probability of Pseudo Anomaly Generation

The probability with which the Pseudo Anomaly Generation process is activated governs the extent to which the training data deviate from the original normal data. Specifically, a higher probability results in a larger proportion of abnormal images and a smaller proportion of normal images. As shown in Figure 10, a probability of 0.5 provides the best balance between pseudo-anomalous and normal images and achieves the best performance. Since the test dataset contains “good” samples, a probability of 1.0 leaves normal images “unseen” during training, leading to a high false negative rate. We design experiments to verify this explanation in Section 4.4.6.
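The role of the activation probability can be sketched as a wrapper around an arbitrary corruption function. The names `maybe_corrupt` and `corrupt_fn` are hypothetical, and the corruption used in the usage example is a trivial placeholder, not the pseudo anomaly generation strategy of the paper.

```python
import random


def maybe_corrupt(image, corrupt_fn, prob=0.5, rng=None):
    """Apply corrupt_fn with probability prob; return (image, was_corrupted)."""
    if rng is None:
        rng = random.Random()
    # rng.random() lies in [0, 1), so prob=1.0 always corrupts and prob=0.0 never does.
    if rng.random() < prob:
        return corrupt_fn(image), True
    return image, False


# Placeholder corruption: overwrite the first pixel of a flat "image".
def set_first_pixel(img):
    return [1.0] + list(img[1:])


rng = random.Random(0)
flags = [maybe_corrupt([0.0, 0.0], set_first_pixel, prob=0.5, rng=rng)[1] for _ in range(10_000)]
print(sum(flags) / len(flags))  # fraction of corrupted samples, close to 0.5
```

With `prob=1.0` every training sample is corrupted, so the network never sees an unmodified normal image, which is the situation the paragraph above identifies as harmful.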

4.4.6. Additional experiments on the probability of Pseudo Anomaly Generation

In Table 10, we provide further quantitative results on the false negative rate (FNR) (Altman and Bland, 1994) to verify the analysis in the ablation study that a probability of 1.0 leaves normal images “unseen” during training, leading to a high false negative rate. The false negative rate, i.e., miss rate, measures the proportion of actually positive instances that are incorrectly identified as negative. The FNR is calculated as:

$$FNR=\frac{FN}{TP+FN}, \tag{9}$$

where false negative (FN) refers to the number of instances that are actually positive but incorrectly predicted as negative, and true positive (TP) refers to the number of instances that are actually positive and correctly predicted as positive.
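Equation (9) can be checked on a small example with the sketch below. The function name and the label convention (1 for positive, 0 for negative) are illustrative assumptions; which class counts as positive is determined by the experimental protocol, not by this snippet.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (TP + FN), with 1 = positive and 0 = negative."""
    # FN: actually positive, predicted negative.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # TP: actually positive, predicted positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    if tp + fn == 0:
        raise ValueError("FNR is undefined without positive samples")
    return fn / (tp + fn)


# Four positives, two of which are missed: FNR = 2 / (2 + 2).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]
print(false_negative_rate(y_true, y_pred))
```

Negative samples do not enter the formula at all, so the false positive in the last position leaves the result unchanged.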

4.4.7. Ablation studies on the feature extracted from the $k$-th layer of mask decoder

In Equation (3) and Equation (4), we define the layer of the Mask Decoder from which features are extracted as the $k$-th layer, which is also shown in Figure 2 (Element-wise Production). We conduct a series of experiments to investigate the impact of $k$, and the corresponding results are presented in Figure 11. We denote the two layers as L1 and L2. We include image-level AUROC, pixel-level AUROC and PRO to provide a comprehensive comparison of the anomaly detection and localization capabilities of each layer.

Features extracted from L1 alone already yield excellent image-level AUROC performance, while pixel-level AUROC and PRO benefit from incorporating information from both L1 and L2. This indicates that exchanging information across layers is necessary. Therefore, we select the combination L1 + L2 as our default setting.

Table 7. Ablation studies on the knowledge distillation framework: I-AUROC, P-AUROC, and PRO ($\%$) are used to evaluate image-level and pixel-level detection on the MVTec dataset (Bergmann et al., 2019a).

Class	w/o fixed SAM			STLM(Ours)
Class	I↑	P↑	O↑	I↑	P↑	O↑
Carpet	87.92	91.13	69.53	99.48	99.91	96.97
Grid	98.50	97.10	87.62	95.57	95.40	91.56
Leather	99.93	98.30	98.22	100	99.11	98.05
Tile	100	98.98	98.45	100	99.58	99.23
Wood	95.97	91.24	85.26	100	97.99	98.41
Avg. Text.	96.46	95.35	87.82	99.01	98.40	96.84
Bottle	74.13	76.65	74.63	100	97.77	90.92
Cable	69.11	85.15	58.32	98.98	96.95	92.11
Capsule	77.45	87.00	62.05	98.69	98.48	97.28
Hazelnut	86.00	95.27	86.86	100	99.69	98.29
Metal_nut	98.54	83.56	78.07	100	99.45	92.95
Pill	85.45	86.66	87.84	97.20	98.61	94.69
Screw	99.63	96.18	87.44	99.47	97.56	98.60
Toothbrush	87.39	93.25	65.65	99.06	99.61	96.14
Transistor	93.29	79.09	67.94	97.58	94.81	82.50
Zipper	87.66	90.75	79.95	99.74	98.93	96.15
Avg. Obj.	85.87	87.36	74.88	99.07	98.19	93.96
Average	89.40	90.02	79.19	99.05	98.26	94.92

Table 8. Ablation studies on the initial weight of TLM’s image encoder: I-AUROC ($\%$), P-AUROC ($\%$), PRO ($\%$) and the number of epochs are used to evaluate the performance on the MVTec dataset (Bergmann et al., 2019a).

Method	Avg. Text.				Avg. Obj.				Average
Method	I↑	P↑	O↑	E↓	I↑	P↑	O↑	E↓	I↑	P↑	O↑	E↓
w/o initial weight	99.76	98.87	97.63	129	98.51	98.18	93.03	106	98.93	98.41	94.57	141
STLM (Ours)	99.01	98.40	96.84	38	99.07	98.19	93.96	63	99.05	98.26	94.92	55

Table 9. Ablation studies on the distillation methods: I-AUROC, P-AUROC, and PRO ($\%$) are used to evaluate image-level and pixel-level detection on the MVTec dataset (Bergmann et al., 2019a).

Method	I-AUROC	P-AUROC	PRO
Logit-based	93.45	91.65	80.38
Feature-based	99.05	98.26	94.92

Method	Avg. Text.↓	Avg. Obj.↓	Average↓
Prob = 1.0	35.77	38.59	37.71
Prob = 0.5	35.87	30.64	32.39

Table 10. Additional experiments on the probability of Pseudo Anomaly Generation: FNR ($\%$) on the MVTec dataset (Bergmann et al., 2019a).

Table 11. Ablation studies on different anomaly generation strategies on the MVTec dataset (Bergmann et al., 2019a).

Method	I-AUROC	P-AUROC	PRO
NSA	99.05	98.26	94.92
Cutpaste(Patch)	84.74	89.85	82.03
Cutpaste(Scar)	96.63	96.30	90.52
Cutpaste(Union)	97.49	97.16	91.06

4.4.8. Effect of the initial weight of TLM’s image encoder

We compare employing pre-trained parameters from MobileSAM (Zhang et al., 2023a) with utilizing randomly initialized parameters for ViT-Tiny (Wu et al., 2022) in our TLM, as shown in Table 8. We observe that a pre-trained image encoder allows our method to achieve optimal results in fewer epochs. At first glance, the two methods seem comparable for AD tasks. However, upon closer inspection, the “w/o initial weight” method performs better on texture classes, while our STLM is better suited for object classes. This motivates future research into whether SAM’s prior knowledge, centered around object segmentation, plays a role in its superior performance on object classes, despite its less effective representation of texture classes.

4.4.9. Ablation studies on the distillation methods

SAM generates high-quality segmentation results for previously unseen images, making it suitable for logit distillation. As noted in (Zhao et al., 2022), distillation methods often focus on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. We compare the feature-based method with the logit-based method, which minimizes the focal loss (Lin et al., 2017) and L1 loss (Girshick, 2015) between the segmentation masks of the teacher SAM and the student streams (13). The experimental results, shown in Table 9, indicate that feature distillation performs better on average than directly utilizing segmentation results from a SAM-like architecture.
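A rough sketch of such a logit-based objective on flattened masks is given below. It assumes that the teacher mask is binarized before the focal term, that both terms are summed with equal weight, and that the common focal-loss defaults (gamma = 2, alpha = 0.25) are used; none of these choices is confirmed by the text, and the L1 term is the plain mean absolute difference.

```python
import math


def focal_loss(student_prob, teacher_mask, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss of student probabilities against a 0/1 teacher mask."""
    total = 0.0
    for p, t in zip(student_prob, teacher_mask):
        p = min(max(p, eps), 1.0 - eps)      # keep log() finite
        p_t = p if t == 1 else 1.0 - p       # probability assigned to the target class
        a_t = alpha if t == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(student_prob)


def l1_loss(student_prob, teacher_prob):
    """Mean absolute difference between two masks."""
    return sum(abs(s - t) for s, t in zip(student_prob, teacher_prob)) / len(student_prob)


def logit_distillation_loss(student_prob, teacher_prob):
    """Equal-weight sum of focal and L1 terms (weighting is an assumption)."""
    teacher_mask = [1 if t >= 0.5 else 0 for t in teacher_prob]
    return focal_loss(student_prob, teacher_mask) + l1_loss(student_prob, teacher_prob)


print(logit_distillation_loss([0.9, 0.2, 0.6], [1.0, 0.0, 1.0]))
```

The focal term down-weights pixels the student already classifies confidently, so the gradient concentrates on the pixels where student and teacher masks disagree.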

4.4.10. Evaluation on different anomaly generation strategies

To evaluate the pseudo anomaly generation strategies, we present experimental results with the well-known strategies NSA (Schlüter et al., 2022) and Cutpaste (Li et al., 2021b) in Table 11. In most SOTA methods, e.g., DRAEM (Zavrtanik et al., 2021a), Prototypical (Zhang et al., 2023d) and DeSTSeg (Zhang et al., 2023c), the authors leverage synthetic anomalies generated by NSA, which are also used in our paper. Furthermore, building on NSA and Cutpaste, (Kim et al., 2024; Yao et al., 2023b; Zhang et al., 2023d; Tien et al., 2023) employ a variety of complex or artificial operations that more closely resemble real anomalies. Notably, SimpleNet (Liu et al., 2023) simulates anomalies by adding Gaussian noise to normal features, a process that requires careful tuning of hyperparameters.
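For orientation, the basic idea behind the Cutpaste patch variant (copying a rectangular patch to another location of the same image) can be sketched as follows. This is a deliberately simplified illustration on a nested-list grayscale image: the function name, the fixed patch size and the returned mask are assumptions, and the original method additionally varies patch size and aspect ratio and applies further augmentations.

```python
import random


def cutpaste_patch(image, patch_h, patch_w, rng):
    """Copy a random patch to a random location; return (new_image, paste_mask)."""
    h, w = len(image), len(image[0])
    if not (0 < patch_h <= h and 0 < patch_w <= w):
        raise ValueError("patch must fit inside the image")
    # Source and destination top-left corners are drawn independently.
    sy, sx = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    dy, dx = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    out = [row[:] for row in image]          # leave the input untouched
    mask = [[0] * w for _ in range(h)]
    for i in range(patch_h):
        for j in range(patch_w):
            out[dy + i][dx + j] = image[sy + i][sx + j]
            mask[dy + i][dx + j] = 1         # marks the pasted region
    return out, mask


rng = random.Random(0)
img = [[r * 4 + c for c in range(4)] for r in range(4)]
aug, mask = cutpaste_patch(img, patch_h=2, patch_w=2, rng=rng)
print(sum(map(sum, mask)))  # number of pasted pixels
```

The mask marks the pasted region even when source and destination happen to coincide; a training pipeline may prefer to resample in that case.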

5. Conclusion

In industrial anomaly detection, the primary concerns in real-world applications include two aspects: model efficiency and mobile-friendliness. In this paper, we propose a novel framework called SAM-guided Two-stream Lightweight Model for unsupervised anomaly detection tasks, tailored to meet the demands of real-world industrial applications while capitalizing on the strong generalization ability of SAM. Our STLM effectively distills distinct knowledge from SAM into the Two-stream Lightweight Model (TLM), assigning different tasks to each stream. To be specific, one stream focuses on generating discriminative and generalized feature representations in both normal and anomalous locations, while the other stream reconstructs features without anomalies. Finally, we employ a shared mask decoder and a feature aggregation module to generate anomaly maps. Through the design of our Two-stream Lightweight Model, shared mask decoder and lightweight feature aggregation module, the computation and the number of parameters of the networks are significantly reduced and the inference speed is improved without noticeable accuracy loss. Experiments on the well-known dataset MVTec AD and two more difficult benchmarks, VisA and DAGM, demonstrate that our method achieves competitive results on anomaly detection tasks. Furthermore, it exhibits model efficiency and mobile-friendliness compared to state-of-the-art methods.

References

Akrami et al. (2022) Haleh Akrami, Anand A Joshi, Jian Li, Sergül Aydöre, and Richard M Leahy. 2022. A robust variational autoencoder using beta divergence. Knowledge-Based Systems (2022), 107886. https://doi.org/10.1016/j.knosys.2021.107886
Altman and Bland (1994) Douglas G Altman and J Martin Bland. 1994. Statistics Notes: Diagnostic tests 2: predictive values. British Medical Journal (1994).
Bergmann et al. (2019a) Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. 2019a. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 9592–9600. https://doi.org/10.1109/cvpr.2019.00982
Bergmann et al. (2020) Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. 2020. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 4183–4192. https://doi.org/10.1109/cvpr42600.2020.00424
Bergmann et al. (2019b) Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, and Carsten Steger. 2019b. Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders. In Proceedings of the 14th International Conference on Computer Vision Theory and Applications. IEEE, 372–380. https://doi.org/10.5220/0007364503720380
Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017), 834–848. https://doi.org/10.1109/tpami.2017.2699184
Chen et al. (2023) Yadang Chen, Mei Wang, Duolin Wang, and Dichao Li. 2023. Robust Anomaly Detection and Localization via Simulated Anomalies. In Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry. ACM Press, 1–8. https://doi.org/10.1145/3574131.3574463
Chen et al. (2020) Zhi Chen, Sen Wang, Jingjing Li, and Zi Huang. 2020. Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches. Proceedings of the 28th ACM International Conference on Multimedia. https://doi.org/10.1145/3394171.3413813
Cimpoi et al. (2014) Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing textures in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 3606–3613. https://doi.org/10.1109/cvpr.2014.461
Collin and De Vleeschouwer (2021) Anne-Sophie Collin and Christophe De Vleeschouwer. 2021. Improved anomaly detection by training an autoencoder with skip connections on images corrupted with stain-shaped noise. In Proceedings of the 25th International Conference on Pattern Recognition. IEEE, 7915–7922.
Deng and Li (2022) Hanqiu Deng and Xingyu Li. 2022. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 9737–9746. https://doi.org/10.1109/cvpr52688.2022.00951
DeVries and Taylor (2017) Terrance DeVries and Graham W Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017).
Fernando et al. (2021) Tharindu Fernando, Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. 2021. Deep Learning for Medical Anomaly Detection – A Survey. Comput. Surveys (2021), 37. https://doi.org/10.1145/3464423
Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised Representation Learning by Predicting Image Rotations. In Proceedings of the International Conference on Learning Representations. OpenReview.net, 8330–8339.
Girshick (2015) Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 1440–1448. https://doi.org/10.1109/iccv.2015.169
Gong et al. (2019) Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 2019. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 1705–1714.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems (2014). https://doi.org/10.3156/JSOFT.29.5_177_2
Gudovskiy et al. (2022) Denis Gudovskiy, Shun Ishizaka, and Kazuki Kozuka. 2022. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. IEEE, 98–107. https://doi.org/10.1109/wacv51458.2022.00188
Guo et al. (2017) Jianting Guo, Peijia Zheng, and Jiwu Huang. 2017. An Efficient Motion Detection and Tracking Scheme for Encrypted Surveillance Videos. ACM Transactions on Multimedia Computing, Communications, and Applications (2017), 1–23. https://doi.org/10.1145/3131342
Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
Ji et al. (2023) Wei Ji, Jingjing Li, Qi Bi, Wenbo Li, and Li Cheng. 2023. Segment anything is not always perfect: An investigation of sam on different real-world applications. arXiv preprint arXiv:2304.05750 (2023).
Kim et al. (2024) Daehyun Kim, Sungyong Baik, and Tae Hyun Kim. 2024. SANFlow: Semantic-Aware Normalizing Flow for Anomaly Detection. Advances in Neural Information Processing Systems 36 (2024).
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. 2023. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
Lee et al. (2022) Sungwook Lee, Seunghyun Lee, and Byung Cheol Song. 2022. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access (2022). https://doi.org/10.1109/access.2022.3193699
Li et al. (2021b) Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. 2021b. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 9664–9674. https://doi.org/10.1109/cvpr46437.2021.00954
Li et al. (2023) Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. 2023. Rethinking Vision Transformers for MobileNet Size and Speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 16889–16900. https://doi.org/10.1109/iccv51070.2023.01549
Li et al. (2021a) Yang Li, Guangcan Liu, Yubao Sun, Qingshan Liu, and Shengyong Chen. 2021a. 3D Tensor Auto-encoder with Application to Video Compression. ACM Transactions on Multimedia Computing, Communications, and Applications (2021), 1–18. https://doi.org/10.1145/3431768
Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 2980–2988. https://doi.org/10.1109/iccv.2017.324
Liu and Ma (2019) Kun Liu and Huadong Ma. 2019. Exploring Background-bias for Anomaly Detection in Surveillance Videos. In Proceedings of the 27th ACM International Conference on Multimedia. ACM Press, 1490–1499. https://doi.org/10.1145/3343031.3350998
Liu et al. (2020) Wenqian Liu, Runze Li, Meng Zheng, Srikrishna Karanam, Ziyan Wu, Bir Bhanu, Richard J. Radke, and Octavia Camps. 2020. Towards Visually Explaining Variational Autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 8642–8651.
Liu et al. (2023) Zhikang Liu, Yiming Zhou, Yuansheng Xu, and Zilei Wang. 2023. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 20402–20411. https://doi.org/10.1109/cvpr52729.2023.01954
Park et al. (2020) Hyunjong Park, Jongyoun Noh, and Bumsub Ham. 2020. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, 14372–14381.
Peng and Qi (2019) Yuxin Peng and Jinwei Qi. 2019. CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning. ACM Transactions on Multimedia Computing, Communications, and Applications (2019), 1–24. https://doi.org/10.1145/3284750
Pope et al. (2023) Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems (2023).
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
Roth et al. (2022) Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. 2022. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 14318–14328. https://doi.org/10.1109/cvpr52688.2022.01392
Saito and Rehmsmeier (2015) Takaya Saito and Marc Rehmsmeier. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE (2015), 1–21. https://doi.org/10.1371/journal.pone.0118432
Salehi et al. (2021) Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R Rabiee. 2021. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 14902–14912. https://doi.org/10.1109/cvpr46437.2021.01466
Sarafijanovic-Djukic and Davis (2019) Natasa Sarafijanovic-Djukic and Jesse Davis. 2019. Fast distance-based anomaly detection in images using an inception-like autoencoder. In International Conference on Discovery Science. Springer-Verlag, 493–508. https://doi.org/10.1007/978-3-030-33778-0_37
Schlüter et al. (2022) Hannah M Schlüter, Jeremy Tan, Benjamin Hou, and Bernhard Kainz. 2022. Natural synthetic anomalies for self-supervised anomaly detection and localization. In Proceedings of the European Conference on Computer Vision. Springer-Verlag, 474–489.
Shi et al. (2021) Yong Shi, Jie Yang, and Zhiquan Qi. 2021. Unsupervised anomaly segmentation via deep feature reconstruction. Neurocomputing (2021), 9–22. https://doi.org/10.1016/j.neucom.2020.11.018
Tao et al. (2022) Xian Tao, Xinyi Gong, Xin Zhang, Shaohua Yan, and Chandranath Adak. 2022. Deep learning for unsupervised anomaly localization in industrial images: A survey. IEEE Transactions on Instrumentation and Measurement (2022). https://doi.org/10.1109/tim.2022.3196436
Tien et al. (2023) Tran Dinh Tien, Anh Tuan Nguyen, Nguyen Hoang Tran, Ta Duc Huy, Soan Duong, Chanh D Tr Nguyen, and Steven QH Truong. 2023. Revisiting reverse distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 24511–24520. https://doi.org/10.1109/cvpr52729.2023.02348
Vasilev et al. (2020) Aleksei Vasilev, Vladimir Golkov, Marc Meissner, Ilona Lipp, Eleonora Sgarlata, Valentina Tomassini, Derek K Jones, and Daniel Cremers. 2020. q-Space novelty detection with variational autoencoders. In Computational Diffusion MRI. Springer-Verlag, 113–124. https://doi.org/10.1007/978-3-030-52893-5_10
Wang et al. (2021) Guodong Wang, Shumin Han, Errui Ding, and Di Huang. 2021. Student-teacher feature pyramid matching for anomaly detection. arXiv preprint arXiv:2103.04257 (2021).
Wieler and Hahn (2007) Matthias Wieler and Tobias Hahn. 2007. Weakly supervised learning for industrial optical inspection. In DAGM symposium in.
Wu et al. (2022) Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. 2022. TinyViT: Fast pretraining distillation for small vision transformers. In Proceedings of the European Conference on Computer Vision. Springer-Verlag, 68–85. https://doi.org/10.1007/978-3-031-19803-8_5
Xu et al. (2023) Jing Xu, Bing Liu, Yong Zhou, Mingming Liu, Rui Yao, and Zhiwen Shao. 2023. Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning. ACM Transactions on Multimedia Computing, Communications, and Applications (2023), 1–16. https://doi.org/10.1145/3614435
Yang et al. (2023) Minghui Yang, Peng Wu, and Hui Feng. 2023. MemSeg: A semi-supervised method for image surface defect detection using differences and commonalities. Engineering Applications of Artificial Intelligence (2023).
Yao et al. (2023a) Xincheng Yao, Ruoqi Li, Zefeng Qian, Yan Luo, and Chongyang Zhang. 2023a. Focus the Discrepancy: Intra-and Inter-Correlation Learning for Image Anomaly Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 6803–6813. https://doi.org/10.1109/iccv51070.2023.00626
Yao et al. (2023b) Xincheng Yao, Ruoqi Li, Jing Zhang, Jun Sun, and Chongyang Zhang. 2023b. Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 24490–24499.
Yu et al. (2021) Jiawei Yu, Ye Zheng, Xiang Wang, Wei Li, Yushuang Wu, Rui Zhao, and Liwei Wu. 2021. FastFlow: Unsupervised anomaly detection and localization via 2D normalizing flows. arXiv preprint arXiv:2111.07677 (2021).
Zavrtanik et al. (2021a) Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. 2021a. DRAEM: A discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 8330–8339. https://doi.org/10.1109/iccv48922.2021.00822
Zavrtanik et al. (2021b) Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. 2021b. Reconstruction by inpainting for visual anomaly detection. Pattern Recognition (2021), 107706. https://doi.org/10.1016/j.patcog.2020.107706
Zhang et al. (2023a) Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. 2023a. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289 (2023).
Zhang et al. (2023d) Hui Zhang, Zuxuan Wu, Zheng Wang, Zhineng Chen, and Yu-Gang Jiang. 2023d. Prototypical residual networks for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 16281–16291. https://doi.org/10.1109/cvpr52729.2023.01562
Zhang et al. (2023b) Xinyi Zhang, Naiqi Li, Jiawei Li, Tao Dai, Yong Jiang, and Shu-Tao Xia. 2023b. Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 6782–6791. https://doi.org/10.1109/iccv51070.2023.00624
Zhang et al. (2023c) Xuan Zhang, Shiyu Li, Xi Li, Ping Huang, Jiulong Shan, and Ting Chen. 2023c. DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 3914–3923. https://doi.org/10.1109/cvpr52729.2023.00381
Zhang et al. (2023e) Zhengbin Zhang, Zhenhao Xu, Xingsheng Gu, and Juan Xiong. 2023e. Cross-CBAM: A Lightweight network for Scene Segmentation. arXiv preprint arXiv:2306.02306 (2023).
Zhao et al. (2022) Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. 2022. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 11953–11962. https://doi.org/10.1109/cvpr52688.2022.01165
Zhou and Paffenroth (2017) Chong Zhou and Randy C. Paffenroth. 2017. Anomaly Detection with Robust Deep Autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 665–674. https://doi.org/10.1145/3097983.3098052
Zou et al. (2022) Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. 2022. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In Proceedings of the European Conference on Computer Vision. Springer-Verlag, 392–408. https://doi.org/10.1007/978-3-031-20056-4_23
