ToCoAD: Two-Stage Contrastive Learning for Industrial Anomaly Detection

Yun Liang, Zhiguang Hu, Junjie Huang, Donglin Di, Anyang Su, Lei Fan

Abstract

Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called ToCoAD. In the first stage, a discriminative network is trained by using synthetic anomalies in a self-supervised learning manner. This network is then utilized in the second stage to provide a negative feature guide, aiding in the training of the feature extractor through bootstrap contrastive learning. This approach enables the model to progressively learn the distribution of anomalies specific to industrial datasets, effectively enhancing its generalizability to various types of anomalies. Extensive experiments are conducted to demonstrate the effectiveness of our proposed two-stage training strategy, and our model produces competitive performance, achieving pixel-level AUROC scores of 98.21%, 98.43% and 97.70% on MVTec AD, VisA and BTAD respectively.

Keywords: Anomaly detection, Contrastive learning, Self-supervised learning, Industrial manufacturing

Journal: Elsevier

Affiliations:

[1] College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China

[2] Space AI, Li Auto, Beijing 101399, China

[3] College of Software, Jilin University, Changchun 130012, China

[4] School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia

1 Introduction

Industrial anomaly detection aims to identify defective products during quality inspection in the production process, with the goal of improving the yield rate. Recently, interest in this field has surged due to the growing needs of industrial development [1, 2]. Because sufficient anomalous samples are difficult to obtain, the distribution of anomalies cannot be estimated, so this scenario is typically treated as an unsupervised learning task. In such unsupervised settings, the objective is to train models using only defect-free samples, enabling them to detect and localize anomalous regions during the testing phase.

Existing anomaly detection methods, predominantly leveraging deep learning techniques, have shown superior performance on industrial benchmarks [3, 4]. As illustrated in Figure 1, the majority of existing methods, including feature embedding-based [5, 6, 7, 8] and synthetic anomaly-based methods [9, 10, 11], train a classifier or discriminative network to identify anomalies on top of features extracted by a truncated pre-trained model. However, it is challenging to accurately identify all anomalies, primarily because frozen pre-trained models do not separate normal and anomalous features well. This issue arises from the domain gaps [12, 13] between pre-training datasets (e.g., ImageNet [14]) and target-specific domains (i.e., industrial images). Recently, reconstruction-based methods [15, 16, 17] have been developed to train a decoder and feature extractor jointly by learning an identity mapping between the input and output. However, the performance of these methods remains suboptimal, limited by the training overhead and the ability to reconstruct large areas of anomalies.

Recently, the application of contrastive learning has demonstrated considerable progress in cross-domain learning [18, 19]. High-quality image classification [20], segmentation [21, 22, 23], or deraining [24] is achieved by introducing several branches of contrastive learning methods, which can minimize the semantic gaps between target datasets and pre-trained datasets. Fine-tuning the network using positive sample pairs or negative samples has been demonstrated to be effective in enhancing the network’s generalizability to domain-specific datasets, indicating significant potential in the field of anomaly detection as well.

To tackle these challenges, we propose a two-stage training strategy that incorporates contrastive learning due to its advantages in learning robust features from target samples through self-supervised learning [25]. Specifically, in the first stage we leverage a frozen feature extractor to obtain generalized features and progressively train a discriminative network to identify anomalies using synthetic anomalous samples. In the second stage, the discriminative network is fixed to provide a negative guide, while bootstrap contrastive learning is employed to fine-tune the feature extractor and the contrastive learning network jointly. This joint learning can be viewed as a form of adversarial learning [26, 27] of defect and defect-free features, where normal features are compactly enclosed and distinct from defective features. Finally, the fine-tuned feature extractor is used to extract patch features to construct a memory bank through coreset subsampling, and distance metrics between the test sample features and the memory bank features are computed to localize anomalies during the inference phase.

Figure 1: Existing methods rely on frozen pre-trained feature extractors, which can lead to inaccuracies in anomaly detection. In contrast, our method utilizes a two-stage training strategy to fine-tune the feature extractor under the contrastive learning paradigm.

Our contributions can be summarized as follows:

1. We propose a two-stage training strategy to fine-tune the feature extractor and bridge the domain gap between pre-trained and target features. In the first stage, a network is progressively trained to coarsely localize anomalies, and it is then employed to facilitate the fine-tuning of the extractor in the second stage.

2. We introduce a negative-guided contrastive learning paradigm, in which the discriminative network guides negative bootstrap contrastive learning to fine-tune the feature extractor. A joint contrastive loss is introduced to regularize negative bootstrap learning on the model.

3. Our model shows competitive performance on three popular anomaly detection datasets. It achieves AUROC scores of 99.10% / 98.21% (image / pixel-level) on MVTec AD, 95.35% / 98.43% on VisA, and 97.70% (pixel-level) on BTAD.

2 Related Work

2.1 Anomaly Detection

Most anomaly detection methods utilize a feature extractor pre-trained on ImageNet to obtain generalized features. These methods can be classified into three types: feature embedding-based, reconstruction-based and synthetic anomaly-based methods.

Feature embedding-based methods: They typically utilize pre-trained models to extract features and directly model the feature distributions using various machine learning techniques. For example, SPADE [7] uses the K-Nearest-Neighbors (KNN) algorithm [28] to obtain representative features extracted by a pre-trained ResNet [29]. PaDiM [5] decomposes the image into patches to obtain a probabilistic representation of the normal class using multivariate Gaussian distributions. PatchCore [6] constructs a memory bank for storing neighborhood-aware patch-level features. Some methods are based on Normalizing Flow (NF) [30] to convert pre-trained feature distributions into simple distributions. CFLOW-AD [8] utilizes a pre-trained encoder with multi-scale pyramid pooling to capture rich features and leverages conditional normalizing flows to enhance anomaly detection efficiency. PyramidFlow [31] utilizes invertible pyramids and coupled pyramid blocks to localize anomalies through multi-scale feature interaction. SANF [32] uses a pre-trained Vision Transformer (ViT) [33] to extract semantic and spatial features from an image for feature fusion. These methods use pre-trained feature extractors, such as ResNet [29], WideResNet [34], EfficientNet [35] and ViT [33], to extract features for modeling distributions.

Reconstruction-based methods: They assume that models trained on only normal samples cannot accurately reconstruct anomalous regions, thus allowing for the localization of anomalies by identifying differences between the input and reconstruction results. For instance, RIAD [15] and InTra [16] perform mask operations on normal samples, and train reconstruction models to recover the masked regions. DiffusionAD [36] uses a diffusion model [37] to reconstruct normal samples as near-normal samples, and localizes anomalous regions using a segmentation network.

Synthetic anomaly-based methods: They formulate unsupervised anomaly detection as a binary classification task, in which pseudo-anomaly samples are generated to train a discriminative network for identifying anomalies. CutPaste [11] employs an augmentation strategy where a small patch is replaced by content from another region. MemSeg [38] and DRAEM [9] combine Perlin noise [39] and binarized masks of samples to generate anomalous images for training their discriminative networks. NSA [10] employs Poisson image editing techniques to achieve seamless fusion of anomalies.

However, most of these methods rely on feature extractors pre-trained on ImageNet, leading to a domain gap between pre-trained features and industrial target features. To alleviate this problem, our ToCoAD employs a two-stage training approach that uses positive and negative samples to jointly fine-tune the feature extractor, guiding it to acquire adaptive feature representation capabilities.

2.2 Contrastive Learning

Recently, contrastive learning [40, 41, 42, 28] has played a significant role in self-supervised learning, offering a promising paradigm for exploiting unlabeled data without the need for human annotations. The concept aims to enhance feature consistency and obtain invariant feature representations between different views of the same sample, while preserving the differences among other samples. For example, SimCLR [43] and MoCo [42] construct positive pairs from the same sample and negative pairs from different samples, employing an N-pair loss [44] to keep positive pairs close and negative pairs far apart. BYOL [41] avoids representation collapse and removes negative pairs by using an exponential moving average model. SwAV [45] compares cluster assignments under different views instead of directly comparing features for self-supervised learning. SimSiam [40] addresses the issue of collapsing solutions [46] without using negative samples by employing a stop-gradient operation and a predictor.

Given the capability of contrastive learning methods to enable models to acquire robust and generalized feature representations [47, 19, 48, 18, 22], we propose a negative bootstrap contrastive learning to fine-tune the feature extractor. Our approach uses augmented positive samples to train the model by learning feature representations of normal samples while simultaneously using synthetic anomaly samples to bootstrap the model, ensuring its sensitivity to anomalous features.

3 Method

Given a training set $\mathcal{T}_{\text{train}}$ comprising $N$ images and a test set $\mathcal{T}_{\text{test}}$ consisting of $N^{\prime}$ images, each image $I\in\mathbb{R}^{H\times W\times C}$ corresponds to a binary label $c\in\{0,1\}$, where $H$, $W$ and $C$ denote the height, width and number of channels of the image. For unsupervised anomaly detection, all images in $\mathcal{T}_{\text{train}}$ are normal (i.e., defect-free) images with $c=0$, while $\mathcal{T}_{\text{test}}$ may contain anomalous images with $c=1$ and a corresponding binary annotation $y\in\{0,1\}^{H\times W\times C}$ indicating the anomalous regions. The normal images in the training set and the test set come from the same normal data distribution with similar sample characteristics, while the anomalous images deviate from these normal samples. The goal is to classify anomalous images and localize anomalous regions in $\mathcal{T}_{\text{test}}$.

As shown in Figure 2, we propose a two-stage training strategy to learn a feature representation adapted to the target data distribution. The first stage trains a discriminative network to detect anomalies and coarsely localize anomalous regions (Section 3.1). It includes an anomaly generator $\mathbf{G}$, a feature extractor $\mathbf{F}$ and a discriminative network $\mathbf{D}$. In the second stage, the discriminative network pre-trained in stage I is combined with the contrastive learning network $\mathbf{C}$, and these networks are jointly trained in a bootstrap contrastive learning manner (Section 3.2). Finally, a memory bank $\mathcal{M}$ is constructed for storing the normal features, which is used to estimate the anomaly score maps during the testing phase (Section 3.3).

3.1 Discriminative Network Pre-training

Given the unavailability of anomalous samples for training a discriminative network, we initially utilize an anomaly generator $\mathbf{G}$ to synthesize pseudo anomalies from the training set. The generation of anomalies can be implemented in various forms by using self-supervised learning techniques [11, 10] or by injecting noises into normal images [9, 49, 38]. To obtain discriminative representations of various artificial defects, we employ widely-used Perlin noise [39] as the anomaly generator $\mathbf{G}$ , which synthesizes pseudo-anomaly samples by injecting Perlin noise into normal images.

As shown in Figure 3, random slight-angle rotations and random {0, 90, 180, 270} degree rotations are applied to a base (normal) image, thereby increasing the diversity of the normal samples. Then, a random seed is employed to generate Perlin noise, and the threshold value is adjusted to control the size, quantity and position of the resulting noise regions. An additional texture dataset, such as the Describable Textures Dataset [50], is introduced as a reference: its textures are combined with the normal sample inside the Perlin noise mask, and the masked areas are treated as synthesized anomalous regions. Finally, the anomalous regions are pasted onto the base sample image to obtain the synthetic anomaly image and its corresponding mask. A normal image $I\in\mathcal{T}_{\text{train}}$ is passed through the anomaly generator $\mathbf{G}$ to produce a pseudo-anomaly sample $I_{\mathbf{G}}$ with an anomaly mask $y_{\mathbf{G}}\in\{0,1\}^{H\times W\times C}$. A pseudo-anomaly dataset can thus be obtained:

$$\mathcal{T}_{\mathbf{G}}=\mathbf{G}(I_{i}),\ \forall\ I_{i}\in\mathcal{T}_{\text{train}}, \tag{1}$$

where $i\in\{1,\dots,N\}$ denotes the index in the training set.
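The generation procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' generator: the function names `perlin_2d` and `synthesize_anomaly`, the lattice size `cells`, and the `threshold` value are hypothetical, the rotation augmentations are omitted, and the mask is kept as a single-channel $H\times W$ array for simplicity.

```python
import numpy as np


def perlin_2d(height, width, cells, rng):
    """Return a (height, width) array of 2D Perlin noise on a cells x cells lattice."""
    ys = np.arange(height) * cells / height          # row coordinates in [0, cells)
    xs = np.arange(width) * cells / width            # column coordinates in [0, cells)
    y0, x0 = ys.astype(int), xs.astype(int)          # lattice cell containing each pixel
    fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]  # offsets inside the cell

    # One random unit gradient per lattice corner.
    angles = rng.uniform(0.0, 2.0 * np.pi, size=(cells + 1, cells + 1))
    grad_y, grad_x = np.sin(angles), np.cos(angles)

    def corner(iy, ix, dy, dx):
        # Dot product between the corner gradient and the offset to the pixel.
        gy = grad_y[iy[:, None], ix[None, :]]
        gx = grad_x[iy[:, None], ix[None, :]]
        return gy * dy + gx * dx

    n00 = corner(y0, x0, fy, fx)
    n10 = corner(y0 + 1, x0, fy - 1.0, fx)
    n01 = corner(y0, x0 + 1, fy, fx - 1.0)
    n11 = corner(y0 + 1, x0 + 1, fy - 1.0, fx - 1.0)

    def fade(t):
        # Smooth interpolation weights (quintic fade curve).
        return 6 * t**5 - 15 * t**4 + 10 * t**3

    u, v = fade(fx), fade(fy)
    top = n00 * (1 - u) + n01 * u
    bottom = n10 * (1 - u) + n11 * u
    return top * (1 - v) + bottom * v


def synthesize_anomaly(image, texture, rng, cells=4, threshold=0.2):
    """Paste `texture` into `image` wherever thresholded Perlin noise is active."""
    h, w = image.shape[:2]
    noise = perlin_2d(h, w, cells, rng)
    mask = (noise > threshold).astype(np.float32)    # binary anomaly mask (H, W)
    blended = image * (1 - mask[..., None]) + texture * mask[..., None]
    return blended, mask
```

Raising `threshold` shrinks the anomalous regions, while increasing `cells` produces more and smaller blobs, which mirrors the role of the threshold described in the text.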

Then, each pseudo-anomaly image is used to train the discriminative network $\mathbf{D}$ , where the pre-trained feature extractor is fixed during the early stages of model training. The architecture of the discriminative network is symmetric with the feature extractor, such as the WideResNet50 [34]. This network operates by successively up-sampling and splicing the extracted features to finally output a feature map $\hat{y}_{\mathbf{G}}$ as a predicted mask with the same dimensions as the input image $I_{\mathbf{G}}$ . The training loss $L_{dis}$ is computed as:

$$L_{dis}=L_{focal}(\hat{y}_{\mathbf{G}},y_{\mathbf{G}})=-\alpha_{t}\left(1-p_{t}\right)^{\gamma}\log\left(p_{t}\right), \tag{2}$$

where $\alpha_{t}$ is the scaling factor associated with the category $t$, $\gamma$ is an adjustable focusing parameter, and $p_{t}$ is the predicted probability of the ground-truth category of a pixel, with $t=1$ for the anomalous category and $t=0$ for the normal category. The Focal Loss [51] effectively addresses the sample imbalance problem in pixel-level classification, where normal pixels greatly outnumber anomalous ones. Upon completion of the discriminative network training, the predicted feature map coarsely localizes anomalies in the input image by assigning each pixel an anomaly probability.
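As an illustration of Equation 2, the following NumPy sketch evaluates a binary focal loss on a map of predicted anomaly probabilities. The defaults `alpha=0.25` and `gamma=2.0` are the commonly used values from the Focal Loss paper and are assumptions here, as is the mean reduction over pixels; the source does not state which values were used.

```python
import numpy as np


def focal_loss(prob_anomaly, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss.

    prob_anomaly: predicted probability of the anomalous class per pixel.
    target:       ground-truth mask with 1 for anomalous and 0 for normal pixels.
    """
    p = np.clip(prob_anomaly, eps, 1.0 - eps)                # avoid log(0)
    p_t = np.where(target == 1, p, 1.0 - p)                  # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)      # class-dependent scaling
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return float(loss.mean())
```

The factor $(1-p_{t})^{\gamma}$ shrinks the contribution of pixels that are already classified confidently, so the abundant easy normal pixels do not dominate the gradient.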

3.2 Negative-guided Contrastive Learning

In stage II, the discriminative network $\mathbf{D}$ is employed to train the feature extractor $\mathbf{F}$ and the contrastive learning network $\mathbf{C}$. Specifically, the discriminative network $\mathbf{D}$, pre-trained in stage I, acts as a guide that provides negative sample features in bootstrap learning. We employ SimSiam [40] as the contrastive learning network $\mathbf{C}$ due to its simplicity and its capability to achieve robust and generalized feature representations. As shown in Figure 4, for training the contrastive learning network $\mathbf{C}$, a normal image $I$ is augmented using a data augmentation set $\mathbf{S}$ (such as random crop, color jitter, grayscale) to generate $M$ different views $v^{1},\dots,v^{M}$. These views are fed into the feature extractor $\mathbf{F}$ to extract features from different layers. These features are regarded as generic representations of a normal image and contain both low-level texture details and high-level abstract semantics.

For constructing contrastive instance pairs, we take an input image $I$ with two augmented views $v^{1}$ and $v^{2}$ (i.e., $M=2$). As an example, the features $f^{1}$ are extracted from $v^{1}$ using the feature extractor $\mathbf{F}$ and are then passed into the projector and predictor sequentially to obtain $z_{1}$ and $p_{1}$ respectively. Then, the cosine similarity loss can be obtained:

$$\mathcal{D}\left(p_{1},z_{2}\right)=-\frac{p_{1}}{\left\|p_{1}\right\|_{2}}\cdot\frac{z_{2}}{\left\|z_{2}\right\|_{2}}, \tag{3}$$

where $\left\|\cdot\right\|_{2}$ denotes the L2 norm of an output feature. Following previous work [40, 22], the stop-gradient (SG) operation is employed to halt gradient propagation through $z_{1}$ and $z_{2}$, effectively preventing collapsing solutions. Since $\mathcal{D}$ is asymmetric, the average bidirectional similarity between $f^{1}$ and $f^{2}$ is computed to symmetrize the loss, as shown below:

$$L_{cossim}(f^{1},f^{2})=\frac{1}{2}\mathcal{D}\left(p_{1},\text{SG}(z_{2})\right)+\frac{1}{2}\mathcal{D}\left(p_{2},\text{SG}(z_{1})\right). \tag{4}$$

When generating $M$ augmented samples from each normal sample, the symmetric cosine similarity loss is computed as:

$$\begin{aligned} L_{sym} &= \frac{2}{M(M-1)}\sum_{i}^{M}\sum_{j\neq i}^{M}L_{cossim}(f^{i},f^{j}) \\ &= \sum_{i}^{M}\sum_{j\neq i}^{M}\frac{\left[\mathcal{D}\left(p_{j},\operatorname{SG}\left(z_{i}\right)\right)+\mathcal{D}\left(p_{i},\operatorname{SG}\left(z_{j}\right)\right)\right]}{M(M-1)}. \end{aligned} \tag{5}$$
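To make Equations 3 to 5 concrete, the sketch below computes the loss values with NumPy. It is a forward-only illustration under stated assumptions: the function names are hypothetical, inputs are batches of predictor outputs `p` and projector outputs `z` of shape `(batch, dim)`, and the stop-gradient operation is not modelled because it only affects backpropagation, not the loss value.

```python
import numpy as np


def neg_cosine(p, z):
    """Equation 3: negative cosine similarity, averaged over the batch."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(-np.sum(p * z, axis=-1).mean())


def l_cossim(p_i, z_i, p_j, z_j):
    """Equation 4: average of the two prediction directions for one view pair."""
    return 0.5 * neg_cosine(p_i, z_j) + 0.5 * neg_cosine(p_j, z_i)


def l_sym(ps, zs):
    """Equation 5, read literally: sum over all ordered pairs (i, j) with j != i."""
    m = len(ps)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if j != i:
                total += l_cossim(ps[i], zs[i], ps[j], zs[j])
    return 2.0 * total / (m * (m - 1))
```

Read literally, the double sum visits each unordered pair of views twice, so for $M=2$ this sketch returns $2\,L_{cossim}(f^{1},f^{2})$. If the sum is instead taken over unordered pairs, the value coincides with Equation 4; the constant factor does not change the minimizer either way.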

For a synthetic anomaly sample, we obtain its pseudo-anomaly mask $y^{\prime}_{\mathbf{G}}$ and predicted feature map $\hat{y}^{\prime}_{\mathbf{G}}$, and compute the Focal Loss $L_{neg}=L_{focal}(\hat{y}^{\prime}_{\mathbf{G}},y^{\prime}_{\mathbf{G}})$ as in Equation 2. The training loss in the second training stage is:

$$L_{ncl}=\lambda\cdot L_{sym}+(1-\lambda)\cdot L_{neg}, \tag{6}$$

where $\lambda$ is a hyperparameter used to balance these two losses.
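As a small worked example of Equation 6, consider purely hypothetical values $\lambda=0.5$, $L_{sym}=-1.8$ and $L_{neg}=0.1$ (these numbers are chosen for illustration and are not taken from the experiments):

$$L_{ncl}=0.5\cdot(-1.8)+(1-0.5)\cdot 0.1=-0.9+0.05=-0.85.$$

Because $L_{sym}$ is built from negative cosine similarities it is non-positive, whereas the Focal Loss $L_{neg}$ is non-negative, so $\lambda$ trades off agreement between positive views against the penalty on mislocalized synthetic anomalies.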

3.3 Memory Modeling and Anomaly Detection

During the inference phase, we utilize the feature extractor as a backbone to extract adapted features from various layers. These features are then compressed into a memory bank via the coreset selection [6], as shown in Figure 5. To concurrently harvest shallow texture information and deep semantic content, features extracted from both layer 2 and layer 3 are aggregated to derive a comprehensive feature for each image. These features are then used to construct the original memory bank $\mathcal{M}_{O}$ . To minimize storage and computation overhead, we use the greedy coreset sampling algorithm to identify the most representative features. This results in a sampled memory bank $\mathcal{M}$ obtained by solving Equation 7 using iterative greedy approximation suggested in [52].

$$\mathcal{M}^{*}=\underset{\mathcal{M}\subset\mathcal{M}_{O}}{\arg\min}\max_{m\in\mathcal{M}_{O}}\min_{n\in\mathcal{M}}\|m-n\|_{2}. \tag{7}$$
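A common way to approximate the minimax objective in Equation 7 is farthest-point (k-center greedy) selection. The NumPy sketch below illustrates that idea under stated assumptions: `greedy_coreset` is a hypothetical name, the starting point is chosen at random, and practical refinements such as random projection of the features before selection are omitted.

```python
import numpy as np


def greedy_coreset(features, ratio, rng):
    """Return indices of a coreset covering `features` (shape: n x d).

    At each step the feature farthest from the current coreset is added,
    which greedily reduces the maximum distance in Equation 7.
    """
    n = features.shape[0]
    k = max(1, int(round(n * ratio)))                # coreset size
    first = int(rng.integers(n))                     # arbitrary starting element
    selected = [first]
    # Distance from every feature to its nearest selected element.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))               # farthest from current coreset
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)
```

With `ratio=0.1`, as in the subsampling percentage reported in the implementation details, the memory bank keeps one tenth of the original patch features.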

When testing an input image $I_{t}\in\mathcal{T}_{\text{test}}$, the pixel-level anomaly score $s_{t}$ is derived from the Euclidean distance between its adapted patch feature $p_{t}$ and the nearest adapted normal feature $c^{*}$ in the memory bank $\mathcal{M}$, re-weighted by the neighborhood of $c^{*}$:

$$s_{t}^{\prime}=\min_{c^{*}\in\mathcal{M}}E\left(p_{t},c^{*}\right), \tag{8}$$

$$s_{t}=\left(1-\frac{\exp(s_{t}^{\prime})}{\sum_{c^{\prime}\in\mathcal{N}_{b}\left(c^{*}\right)}\exp(E\left(p_{t},c^{\prime}\right))}\right)s_{t}^{\prime}, \tag{9}$$

where $E(\cdot)$ is the Euclidean distance and $\mathcal{N}_{b}(c^{*})$ is the set of the $b$ nearest neighbors of $c^{*}$ in the memory bank. Finally, the image-level anomaly score is the maximum anomaly score over all patches in the image.
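The scoring rule in Equations 8 and 9 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names and the default `b=3` are hypothetical, and $\mathcal{N}_{b}(c^{*})$ is assumed to include $c^{*}$ itself, which keeps the weight in $[0, 1)$.

```python
import numpy as np


def patch_score(p_t, memory, b=3):
    """Anomaly score of one patch feature `p_t` against `memory` (n x d)."""
    dists = np.linalg.norm(memory - p_t, axis=1)     # E(p_t, c) for every c in M
    nearest = int(np.argmin(dists))
    s_prime = dists[nearest]                         # Equation 8
    # b nearest neighbours of c* inside the memory bank (c* included, distance 0).
    d_to_cstar = np.linalg.norm(memory - memory[nearest], axis=1)
    neighbours = np.argsort(d_to_cstar)[:b]
    # Equation 9, with the maximum subtracted inside exp for numerical stability.
    shift = dists[neighbours].max()
    weight = 1.0 - np.exp(s_prime - shift) / np.sum(np.exp(dists[neighbours] - shift))
    return float(weight * s_prime)


def image_score(patches, memory, b=3):
    """Image-level score: maximum patch score over all patches of the image."""
    return max(patch_score(p, memory, b) for p in patches)
```

Under this assumption the weight vanishes for `b=1`, because the only neighbor is $c^{*}$ itself, so the re-weighting is only meaningful for $b>1$.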

Table 1: Image-level and pixel-level AUROC (%) on the MVTec AD dataset. Average results are reported for the 5 texture categories, the 10 object categories, and all categories, respectively. The best results are shown in bold.

	SPADE	PaDiM	PatchCore	FAPM	RD4AD	CFLOW-AD	DeSTSeg	PyramidFlow	MMR	ToCoAD	ToCoAD
	[7]	[5]	[6]	[53]	[54]	[8]	[55]	[31]	[56]	+ CutPaste [11]	+ Perlin [39]
Bottle	98.1 / 98.4	- / 98.3	100 / 98.6	100 / 98.2	98.7 / 96.6	98.9 / 96.8	- / 99.2	- / 95.9	100 / 98.3	100 / 98.4	100 / 98.5
Cable	93.2 / 97.1	- / 96.7	99.5 / 98.4	99.2 / 98.5	97.4 / 91.0	97.6 / 93.5	- / 97.3	- / 92.1	97.8 / 95.4	98.9 / 98.2	98.7 / 98.3
Capsule	98.2 / 99.0	- / 98.5	98.1 / 98.8	98.6 / 99.0	98.7 / 95.8	98.8 / 93.4	- / 99.1	- / 96.1	96.9 / 98.0	98.6 / 98.9	98.7 / 99.0
Carpet	98.5 / 97.5	- / 99.1	98.4 / 99.0	99.0 / 98.9	98.9 / 98.9	99.0 / 97.7	- / 96.1	- / 90.8	99.6 / 98.8	97.9 / 98.8	98.0 / 98.7
Grid	99.0 / 93.7	- / 97.3	98.2 / 98.7	98.0 / 97.8	99.3 / 97.6	98.7 / 96.0	- / 99.1	- / 94.2	100 / 99.0	98.5 / 98.4	98.7 / 98.5
Hazelnut	98.7 / 99.1	- / 98.1	100 / 98.7	99.9 / 98.6	98.9 / 95.5	98.7 / 96.6	- / 99.6	- / 98.0	100 / 98.5	100 / 98.3	100 / 98.4
Leather	99.1 / 97.5	- / 99.2	100 / 99.3	99.7 / 99.0	99.4 / 99.1	99.6 / 99.3	- / 99.5	- / 99.6	100 / 99.2	100 / 99.1	100 / 99.2
Metal_nut	96.7 / 98.1	- / 97.0	100 / 98.4	100 / 98.0	97.3 / 92.3	98.6 / 91.6	- / 98.6	- / 92.8	99.9 / 95.9	99.9 / 98.5	100 / 98.4
Pill	96.5 / 96.5	- / 95.7	96.2 / 97.4	96.0 / 98.0	98.2 / 96.4	98.8 / 95.3	- / 98.7	- / 96.2	98.2 / 98.4	94.8 / 98.3	97.0 / 98.3
Screw	99.5 / 98.7	- / 98.5	98.0 / 99.3	95.2 / 99.0	99.6 / 98.2	98.8 / 95.3	- / 98.5	- / 94.0	92.5 / 99.5	96.8 / 99.5	97.0 / 99.5
Tile	89.8 / 87.4	- / 94.1	98.4 / 95.6	99.4 / 95.2	95.6 / 90.6	98.0 / 94.3	- / 98.0	- / 97.9	98.7 / 95.6	99.2 / 97.7	99.5 / 97.5
Toothbrush	98.9 / 97.9	- / 98.5	99.7 / 98.7	100 / 98.5	99.1 / 94.5	98.8 / 95.0	- / 99.3	- / 98.9	100 / 98.4	100 / 98.7	100 / 98.7
Transistor	81.0 / 94.1	- / 97.5	100 / 96.3	100 / 98.2	92.5 / 78.0	98.0 / 81.3	- / 89.0	- / 97.4	95.1 / 90.2	98.4 / 94.5	99.8 / 95.5
Wood	95.0 / 88.4	- / 94.9	99.2 / 95.0	99.1 / 94.0	95.3 / 90.9	96.6 / 95.8	- / 97.7	- / 93.8	99.1 / 94.8	99.8 / 95.8	99.6 / 95.7
Zipper	98.8 / 96.5	- / 98.3	99.4 / 98.8	99.3 / 98.6	98.2 / 95.4	99.0 / 96.6	- / 99.1	- / 95.4	97.6 / 98.0	99.4 / 98.9	99.5 / 98.9
Texture avg.	96.28 / 92.90	- / 96.92	98.84 / 97.52	99.04 / 96.98	97.36 / 95.80	98.38 / 96.62	- / 98.08	- / 95.26	99.48 / 97.48	99.08 / 97.96	99.16 / 97.92
Object avg.	95.96 / 97.54	- / 97.71	99.09 / 98.34	98.82 / 98.46	99.01 / 98.81	98.60 / 93.54	- / 97.84	- / 95.68	97.80 / 97.06	98.69 / 98.22	99.07 / 98.35
Total avg.	96.06 / 95.99	- / 97.44	99.01 / 98.07	98.89 / 97.96	98.46 / 97.80	98.22 / 94.56	- / 97.92	- / 95.54	98.36 / 97.20	98.82 / 98.13	99.10 / 98.21

4 Experiments

We present the experimental settings in Section 4.1, including datasets, evaluation metrics, and implementation details. Then, we demonstrate the anomaly detection efficacy of our proposed method in comparison with existing models on three datasets, and show ablation experiments in Section 4.2.

4.1 Experimental Settings

4.1.1 Datasets

We use the MVTec Anomaly Detection (MVTec AD) dataset [3], the Visual Anomaly (VisA) dataset [57] and the BeanTech Anomaly Detection (BTAD) dataset [58] for our experiments.

MVTec AD is widely used as a benchmark for industrial anomaly detection. It consists of 15 categories including 5 types of texture and 10 types of object, and comprises 3,629 training images and 1,725 test images. All image sizes range from $700\times 700$ to $1,024\times 1,024$ pixels and each class has at least one type of defect.

VisA contains 12 subsets with a total of 10,821 images, of which 9,621 are normal images and 1,200 are anomalous images. Images can be classified into three principal categories based on the intrinsic properties of the objects depicted. The first category comprises single-object images, typically centered in the frame. The second category encompasses multi-object images. The third category includes complex printed circuit board images.

BTAD consists of three industrial products, ranging in size from $600\times 600$ to $1,600\times 1,600$ pixels. It includes 1,799 images and 1,031 images for training and testing respectively.

4.1.2 Evaluation Metrics

Since distinguishing normal from anomalous samples is regarded as a binary classification problem, we use the Area Under the Receiver Operating Characteristic Curve (AUROC) metric to evaluate the anomaly detection performance. The ROC curve illustrates the performance of a binary classifier across varying thresholds by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). The AUROC is the area under the ROC curve and measures the model's ability to differentiate between positive and negative categories. Following previous studies [6, 5], to evaluate detection performance, we calculate the image-level AUROC score between the anomaly scores output by the model and the image-level labels. For segmentation evaluation, we measure pixel-level AUROC scores between the anomaly score map and the ground truth mask of anomalous samples.
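For intuition, the AUROC equals the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The NumPy sketch below uses this pairwise formulation; the function name `auroc` is hypothetical, and the pairwise comparison is only practical for image-level evaluation, since pixel-level evaluation would normally use a sorting-based routine.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via pairwise comparison of anomalous (1) and normal (0) scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]                        # anomalous samples
    neg = scores[labels == 0]                        # normal samples
    greater = (pos[:, None] > neg[None, :]).sum()    # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()      # tied pairs count one half
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

For example, with scores `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, three of the four anomalous-normal pairs are ordered correctly, giving an AUROC of 0.75.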

Table 2: Image-level and pixel-level AUROC (%) on VisA dataset. Average results are reported in 12 categories. The best results are shown in bold.

	SPADE	DRAEM	PatchCore	FastFlow	CFLOW-AD	ToCoAD	ToCoAD
	[7]	[9]	[6]	[59]	[8]	+ CutPaste [11]	+ Perlin [39]
Candle	91.0 / 97.9	91.8 / 96.6	99.1 / 98.8	92.4 / 94.2	93.5 / 98.5	96.4 / 98.9	96.4 / 98.9
Capsules	61.4 / 60.7	74.7 / 98.5	75.1 / 99.1	71.2 / 75.3	63.2 / 94.7	82.9 / 99.4	84.8 / 99.5
Cashew	97.8 / 86.4	95.1 / 83.5	97.2 / 98.5	90.7 / 91.2	94.8 / 99.1	96.5 / 98.6	96.9 / 98.5
Chewinggum	85.8 / 98.6	94.8 / 96.8	99.0 / 98.9	91.4 / 98.6	99.1 / 98.5	98.1 / 98.9	98.2 / 98.9
Fryum	88.6 / 96.7	97.4 / 87.2	95.7 / 92.7	88.6 / 97.3	93.1 / 95.9	95.8 / 92.0	96.5 / 92.1
Macaroni1	95.2 / 96.2	97.2 / 99.9	97.4 / 99.3	98.3 / 97.3	88.2 / 98.7	98.3 / 99.7	98.5 / 99.7
Macaroni2	87.9 / 87.5	85.0 / 99.2	76.4 / 98.5	86.3 / 89.2	66.5 / 96.7	81.4 / 99.1	81.1 / 99.2
PCB1	72.1 / 66.9	47.6 / 88.7	98.0 / 99.5	77.4 / 75.2	97.0 / 99.1	98.7 / 99.8	98.8 / 99.8
PCB2	50.7 / 71.1	89.8 / 91.3	97.5 / 98.7	62.2 / 67.3	89.4 / 96.6	96.4 / 98.3	96.6 / 98.3
PCB3	90.5 / 95.1	92.0 / 98.0	98.2 / 98.1	74.3 / 94.8	97.9 / 83.2	97.7 / 99.3	97.8 / 99.3
PCB4	83.1 / 89.0	98.6 / 96.8	99.5 / 98.2	80.9 / 89.9	98.6 / 98.1	99.7 / 98.1	99.8 / 98.2
Pipe fryum	81.1 / 81.8	99.8 / 85.8	99.7 / 98.7	72.2 / 87.3	99.1 / 99.3	98.8 / 98.7	98.8 / 98.7
Total avg.	82.1 / 85.65	88.65 / 93.52	94.40 / 98.25	81.25 / 88.13	90.05 / 96.56	95.06 / 98.40	95.35 / 98.43

4.1.3 Implementation Details

We used a WideResNet50 [34] pre-trained on ImageNet as the feature extractor. An inverse WideResNet50 is utilized as the discriminative network, which applies up-sampling, concatenation, and dimension compression operations. It finally predicts a 2-channel feature map with the same spatial dimensions as the input image. For the contrastive learning network, we performed an adaptive average pooling operation on the feature maps of each layer and then fed them to the projector and predictor sequentially. The projector and predictor consist of a 3-layer and a 1-layer MLP respectively. During training, the stop-gradient operation was applied only to the projector branch, and the projectors of both branches shared weights. For the anomaly generator, we experimented with CutPaste [11] and Perlin noise [39] to generate a variety of pseudo anomalies. For the contrastive learning network, we used the same approach as [40] to construct two cropped images that provide positive samples. In building the memory bank, we employed a coreset subsampling percentage of 0.1 to obtain a compact memory bank.

In the discriminative network pre-training (DNP) phase, we used the Adam [60] optimizer with an initial learning rate of 0.0001. The learning rate was dynamically adjusted with a decay factor ( $\gamma=0.2$ ) from epoch 80 to epoch 90. In the negative-guided contrastive learning (NCL) phase, we used the Stochastic Gradient Descent optimizer with a weight decay of 0.0001 and a momentum of 0.9, and employed the cosine annealing algorithm to adjust the learning rate. All images were first resized to $256\times 256$ pixels and then center-cropped to $224\times 224$ pixels for training and inference. For all classes of the MVTec AD dataset, we trained for 100 epochs in the first stage and another 100 epochs in the second stage. For the BTAD dataset, we trained for 150 epochs in the second stage. The batch size was set to 16. All of our experiments were conducted on a single NVIDIA RTX3090Ti.
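The cosine annealing schedule mentioned for the NCL phase can be written in closed form. The sketch below is illustrative only: the function name `cosine_lr` is hypothetical, and the base learning rate passed to it is an assumption, since the initial SGD learning rate is not stated above.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing from base_lr down to min_lr."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)
```

The rate starts at `base_lr`, reaches the midpoint between `base_lr` and `min_lr` halfway through training, and decays smoothly to `min_lr` at the final epoch.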

4.2 Experiment Results and Analysis

4.2.1 Performances on MVTec AD

We evaluated the anomaly detection performance of our proposed model against advanced methods [7, 5, 6, 53, 54, 8, 55, 31, 56]. We grouped carpet, grid, leather, tile, and wood as texture categories, while the remaining categories are treated as object categories. We report image-level and pixel-level AUROC scores for each class and the averages over texture, object, and all categories in Table 1. For texture classes, our method achieved image-level and pixel-level AUROC scores of 99.08% and 97.96% when using CutPaste as the anomaly generator, and competitive scores of 99.16% and 97.92% when using Perlin noise. For object classes, our models obtained image-level and pixel-level AUROC scores of 98.69% and 98.22% with CutPaste, and 99.07% and 98.35% with Perlin noise. In summary, our models achieved the best results for at least seven categories in image-level anomaly detection, with five of them reaching 100%, and produced the best average AUROC scores of 99.10% at the image level and 98.21% at the pixel level compared to advanced methods. We also provide qualitative visualizations of representative samples for anomaly localization in the upper two rows of Figure 6, where our models accurately localize the anomalous regions.

4.2.2 Performances on VisA

We evaluated our proposed method and existing methods [7, 9, 6, 59, 8] on the VisA dataset, as shown in Table 2. Our model achieved a 95.06% image-level AUROC score and a 98.40% pixel-level AUROC score based on CutPaste, while it achieved a 95.35% image-level AUROC score and a 98.43% pixel-level AUROC score based on Perlin noise. Furthermore, when our method employed Perlin noise as the anomaly generator, it attained the best image-level results in four subsets and the best pixel-level results in seven subsets. In the lower two rows of Figure 6, we additionally visualize samples from selected categories of VisA.

4.2.3 Performances on BTAD

We calculated the pixel-level AUROC scores of the three categories and then obtained the average score. We compared our proposed model with existing methods [7, 5, 6, 58, 61, 62]. As shown in Table 3, our model based on CutPaste achieved the best pixel-level AUROC score of 96.7% for product 2. Our models also obtained competitive performance for the other two categories. When using Perlin noise as the anomaly generator, our model achieved the best average pixel-level AUROC score of 97.70% compared to other methods.

Table 3: Pixel-level AUROC (%) on the BTAD dataset. Results are reported for the three categories together with their average. The best results are shown in bold.

Method	01	02	03	Avg.
SPADE [7]	97.3	94.4	99.1	96.93
PaDiM [5]	96.6	95.9	98.6	97.03
PatchCore [6]	95.5	94.7	99.3	96.50
VT-ADL [58]	99.0	94.0	77.0	90.00
FYD [61]	96.1	95.3	99.7	97.03
SPR [62]	96.7	96.2	95.2	96.03
Ours + NSA [10]	97.0	96.4	99.1	97.50
Ours + CutPaste [11]	97.0	96.7	99.1	97.60
Ours + Perlin [39]	97.4	96.6	99.1	97.70

4.2.4 Ablation Study

We explored the optimal configuration of our proposed method. CutPaste was employed as the anomaly generator for the majority of the experiments, and the anomaly generator used in each experiment is stated explicitly.

Contrastive learning network. This study was performed to determine the optimal hierarchy levels for the contrastive learning network. Table 4 shows the results with different hierarchy levels when using CutPaste as the anomaly generator. The model using features extracted from layer 3 and layer 4 achieved the best performance, which indicates that the positive-sample contrastive learning process requires rich semantic information. Using features from layer 4 alone did not provide enough detail to reach the same performance. Adding features from layer 2 slightly decreased performance, which suggests that the contrastive learning process is more sensitive to shallow features. Combining layer 2 and layer 3 without layer 4 also fell short of the best result because less high-level semantic information was available. In summary, our approach requires rich deep semantic information to foster the capability to represent semantic features.

Table 4: Ablation study results on MVTec AD dataset, using different hierarchy levels in NCL training phase. The best results are shown in bold.

Hierarchy levels	I-AUROC (%)	P-AUROC (%)
Layer 4	98.20	97.94
Layer 2+3	98.16	97.98
Layer 2+4	98.16	97.95
Layer 2+3+4	98.08	97.92
Layer 3+4	98.82	98.13

Two-stage training strategy. We conducted experiments to investigate the role of discriminative network pre-training (DNP) and negative-guided contrastive learning (NCL), using Perlin noise as the anomaly generator. All results are shown in Table 5. The anomaly detection performance is compared in three scenarios: without DNP and NCL, with NCL only, and with both DNP and NCL. Since DNP only takes effect through NCL, the case of using DNP without NCL is not included in the comparison. In the absence of DNP and NCL, the proposed network adopted a structure similar to PatchCore [6], using the pre-trained features for anomaly detection. When only NCL was used, the model was effectively trained with only augmented normal samples, which resulted in unsatisfactory performance due to the lack of explicit negative-sample guidance. When both DNP and NCL were used, both negative and positive samples were leveraged to fine-tune the feature extractor, which achieved higher pixel-level localization results. This demonstrates that using negative samples to guide the contrastive learning process enables the model to learn feature representations adapted to the normal data samples. As shown in Figure 7, we also visualized the results obtained with and without DNP and NCL. Our model is sensitive to anomalous areas, and the anomalous locations are highlighted more prominently on the heat map than in the results without two-stage training. We attribute this to the feature extractor learning a robust and generalized feature representation while reducing feature redundancy and the domain gap.

Table 5: Ablation study results on the MVTec AD dataset with neither DNP nor NCL, with NCL only, and with both DNP and NCL. The anomaly generator of the model is based on Perlin noise. The best results are shown in bold.

NCL	DNP	I-AUROC (%)	P-AUROC (%)
✗	✗	98.90	98.02
✓	✗	98.11	97.94
✓	✓	99.10	98.21

Loss function. This study was conducted to explore the effect of different loss function settings while using CutPaste as the anomaly generator. Table 6 reports the results of using Cross-Entropy (CE) Loss and Focal Loss for different values of the hyperparameter $\lambda$. We observed that setting $\lambda$ to 0.5 balanced the positive-sample learning process and the negative-sample bootstrap learning process, and gave the best results. When $\lambda$ was greater or less than 0.5, the results decreased slightly. With $\lambda=0.5$, the model using Focal Loss outperformed its counterpart using Cross-Entropy Loss. This suggests that Focal Loss better accounts for the imbalance between normal and anomalous samples.

Table 6: Ablation study results of different loss function settings. We compared the results for different values of $\lambda$ and different $L_{neg}$, where CE denotes Cross-Entropy Loss while Focal denotes Focal Loss. The best results are shown in bold.

$\lambda$	CE	Focal	I-AUROC (%)	P-AUROC (%)
0.1	✓		98.12	97.98
0.1		✓	98.15	97.98
0.2	✓		98.19	97.98
0.2		✓	98.04	97.96
0.5	✓		98.15	97.95
0.5		✓	98.82	98.13
0.8	✓		98.18	97.94
0.8		✓	98.13	97.94
0.9	✓		98.02	97.96
0.9		✓	98.30	97.95
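To make the class-imbalance argument concrete, the sketch below compares per-pixel binary Cross-Entropy with Focal Loss [51] on a toy mask dominated by easy normal pixels. It is only an illustration: `gamma=2.0` and `alpha=0.25` are the defaults suggested in [51] and are not necessarily the values used in our experiments, the weighting by $\lambda$ is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Per-element CE; p = predicted anomaly probability, y = 0/1 label."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return -np.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-element focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Toy mask: 9 easy normal pixels and 1 hard anomalous pixel.
y = np.array([0] * 9 + [1])
p = np.array([0.05] * 9 + [0.3])
ce, fl = binary_cross_entropy(p, y), focal_loss(p, y)

# Fraction of the total loss contributed by the single anomalous pixel.
ce_share = ce[-1] / ce.sum()
fl_share = fl[-1] / fl.sum()
print(f"CE share: {ce_share:.3f}, Focal share: {fl_share:.3f}")
```

In this toy case the hard anomalous pixel accounts for a larger fraction of the total Focal Loss than of the total Cross-Entropy Loss, because the factor $(1 - p_t)^{\gamma}$ suppresses the many well-classified normal pixels.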

Synthetic anomaly strategies. We compared three commonly used synthetic defect methods [11, 10, 39] as the anomaly generator. These three methods generate random patches, random seamless patches, and Perlin noise, respectively. As shown in Table 7, our method incorporating Perlin noise achieved the best performance of 99.10% image-level AUROC and 98.21% pixel-level AUROC. We attribute this mainly to the high randomness in the shape and distribution of the anomalies generated by Perlin noise, which enhances the coarse localization and bootstrap learning ability of the discriminative network.

Table 7: Ablation study results on the MVTec AD dataset, using different anomaly generators. The best results are shown in bold.

Synthetic Method	I-AUROC (%)	P-AUROC (%)
CutPaste [11]	98.82	98.13
NSA [10]	98.04	97.96
Perlin [39]	99.10	98.21
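For readers unfamiliar with patch-based synthesis, the following sketch shows a simplified CutPaste-style generator: a patch is copied from one random location of a normal image to another, and the pasted region is returned as the pseudo-anomaly mask. This is a sketch under stated assumptions rather than the generator used in our experiments: the fixed patch size, the absence of color jitter or rotation, and the name `cutpaste_like` are all illustrative simplifications of [11].

```python
import numpy as np

def cutpaste_like(image, patch_hw=(16, 16), rng=None):
    """Copy a random patch of `image` to another random location.

    Returns the augmented image and a binary mask of the pasted region.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ph, pw = patch_hw
    if ph > h or pw > w:
        raise ValueError("patch must fit inside the image")
    # Source and destination top-left corners, sampled independently.
    sy, sx = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    dy, dx = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]
    # The mask marks where the patch was pasted, i.e. the synthetic anomaly.
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[dy:dy + ph, dx:dx + pw] = 1
    return out, mask

# Usage on a random stand-in for a normal training image.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
aug, mask = cutpaste_like(img, (16, 16), rng)
```

The rectangular patches produced this way have regular shapes, which is consistent with the observation above that the more irregular anomalies generated by Perlin noise give the discriminative network a more varied training signal.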

Feature fusion in contrastive learning network. We investigated the effect of different feature fusion operations. As shown in Table 8, the model achieved better performance when the concatenation operation was not used. We suggest that fused features make it harder to provide an appropriate optimization direction when training multiple layers.

Table 8: Ablation study results on MVTec AD dataset, using features with/without concatenation between layer 3 and layer 4. The best results are shown in bold.

Concatenation	I-AUROC (%)	P-AUROC (%)
✓	98.56	98.11
✗	98.82	98.13
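The two settings compared in Table 8 can be pictured with the sketch below, which either concatenates layer 3 and layer 4 feature maps into a single tensor or keeps them as separate per-layer tensors. The shapes assume a ResNet-50-style backbone with a 224x224 input, and nearest-neighbour upsampling is used to align resolutions; both are assumptions made for illustration, and the helper names are hypothetical.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_by_concat(feat3, feat4):
    """Concatenate layer-3 and upsampled layer-4 features along channels."""
    factor = feat3.shape[1] // feat4.shape[1]
    return np.concatenate([feat3, upsample_nearest(feat4, factor)], axis=0)

# Assumed shapes for a ResNet-50-style backbone with a 224x224 input.
feat3 = np.zeros((1024, 14, 14))
feat4 = np.zeros((2048, 7, 7))

fused = fuse_by_concat(feat3, feat4)  # with concatenation: one fused tensor
separate = [feat3, feat4]             # without concatenation: one tensor per layer
print(fused.shape, [f.shape for f in separate])
```

With concatenation, a single objective has to drive both layers through one fused representation, whereas keeping the tensors separate lets each layer receive its own contrastive signal.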

Contrastive learning method. We further conducted experiments to compare two widely used contrastive learning architectures, SimSiam [40] and BYOL [41]. The BYOL architecture additionally updates its target network with an exponential moving average of the online network. As shown in Table 9, the model achieved better performance when using SimSiam, which we attribute to its ability to avoid collapsing solutions [46].

Table 9: Ablation study on MVTec AD dataset, using two different contrastive learning methods. The best results are shown in bold.

CL Method	I-AUROC (%)	P-AUROC (%)
w/ SimSiam	98.82	98.13
w/ BYOL	98.45	98.08
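The structural difference between the two architectures can be summarized in a few lines. The sketch below shows the symmetric negative-cosine objective of SimSiam [40] and the exponential moving average (EMA) target update of BYOL [41]. It is a NumPy illustration only: stop-gradient cannot be expressed without an autodiff framework and is indicated in a comment, the value of `tau` is illustrative, and the function names are hypothetical.

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity, averaged over the batch (rows)."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return float(-(p * z).sum(axis=1).mean())

def simsiam_loss(p1, z1, p2, z2):
    """Symmetric SimSiam loss on predictor outputs p and projections z.

    In an autodiff framework z1 and z2 are treated as constants (stop-gradient).
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

def byol_ema_update(target, online, tau=0.99):
    """BYOL target update: target <- tau * target + (1 - tau) * online."""
    return {k: tau * target[k] + (1.0 - tau) * online[k] for k in target}

# Usage: perfectly aligned views give the minimum loss of -1.
v = np.array([[1.0, 0.0], [0.0, 2.0]])
print(simsiam_loss(v, v, v, v))
print(byol_ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, tau=0.9))
```

SimSiam shares one encoder between both branches and relies on stop-gradient alone, so it needs no second set of weights, while BYOL maintains a separate, slowly updated target network.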

5 Conclusion

To bridge the domain gap between the pre-trained and target-specific features for industrial anomaly detection tasks, we proposed a novel two-stage training strategy, named ToCoAD. In the first stage, a discriminative network is trained to coarsely localize anomalies. In the second stage, the pre-trained discriminative network provides negative-guided information, and the contrastive learning network is fine-tuned jointly with the feature extractor. Extensive experiments verified the competitive performance of our proposed models, which achieved 99.10% image-level AUROC and 98.21% pixel-level AUROC on the MVTec AD dataset, 95.35% image-level AUROC and 98.43% pixel-level AUROC on the VisA dataset, and 97.70% pixel-level AUROC on the BTAD dataset.

References

[1] X. Tao, X. Gong, X. Zhang, S. Yan, C. Adak, Deep learning for unsupervised anomaly localization in industrial images: A survey, IEEE Transactions on Instrumentation and Measurement 71 (2022) 1–21.
[2] J. Liu, G. Xie, J. Wang, S. Li, C. Wang, F. Zheng, Y. Jin, Deep industrial image anomaly detection: A survey, Machine Intelligence Research 21 (1) (2024) 104–135.
[3] P. Bergmann, M. Fauser, D. Sattlegger, C. Steger, Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9592–9600.
[4] K. Batzner, L. Heckler, R. König, Efficientad: Accurate visual anomaly detection at millisecond-level latencies, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 128–138.
[5] T. Defard, A. Setkov, A. Loesch, R. Audigier, Padim: a patch distribution modeling framework for anomaly detection and localization, in: International Conference on Pattern Recognition, Springer, 2021, pp. 475–489.
[6] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, P. Gehler, Towards total recall in industrial anomaly detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14318–14328.
[7] N. Cohen, Y. Hoshen, Sub-image anomaly detection with deep pyramid correspondences, arXiv preprint arXiv:2005.02357 (2020).
[8] D. Gudovskiy, S. Ishizaka, K. Kozuka, Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows, in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 98–107.
[9] V. Zavrtanik, M. Kristan, D. Skočaj, Draem-a discriminatively trained reconstruction embedding for surface anomaly detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8330–8339.
[10] H. M. Schlüter, J. Tan, B. Hou, B. Kainz, Natural synthetic anomalies for self-supervised anomaly detection and localization, in: European Conference on Computer Vision, Springer, 2022, pp. 474–489.
[11] C.-L. Li, K. Sohn, J. Yoon, T. Pfister, Cutpaste: Self-supervised learning for anomaly detection and localization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9664–9674.
[12] K. You, M. Long, Z. Cao, J. Wang, M. I. Jordan, Universal domain adaptation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2720–2729.
[13] H. Nam, H. Lee, J. Park, W. Yoon, D. Yoo, Reducing domain gap by reducing style bias, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8690–8699.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, IEEE, 2009, pp. 248–255.
[15] V. Zavrtanik, M. Kristan, D. Skočaj, Reconstruction by inpainting for visual anomaly detection, Pattern Recognition 112 (2021) 107706.
[16] J. Pirnay, K. Chai, Inpainting transformer for anomaly detection, in: International Conference on Image Analysis and Processing, Springer, 2022, pp. 394–406.
[17] J. Wyatt, A. Leach, S. M. Schmon, C. G. Willcocks, Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 650–656.
[18] J. Huang, D. Guan, A. Xiao, S. Lu, L. Shao, Category contrast for unsupervised domain adaptation in visual tasks, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1203–1214.
[19] M. Thota, G. Leontidis, Contrastive domain adaptation, in: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2021, pp. 2209–2218.
[20] S. D. Dao, H. Zhao, D. Phung, J. Cai, Contrastively enforcing distinctiveness for multi-label image classification, Neurocomputing 555 (2023) 126605.
[21] M. Ki, Y. Uh, W. Lee, H. Byun, Contrastive and consistent feature learning for weakly supervised object localization and semantic segmentation, Neurocomputing 445 (2021) 244–254.
[22] H. Wang, E. Ahn, J. Kim, A dual-branch self-supervised representation learning framework for tumour segmentation in whole slide images, arXiv preprint arXiv:2303.11019 (2023).
[23] B. Wang, Q. Li, Z. You, Self-supervised learning based transformer and convolution hybrid network for one-shot organ segmentation, Neurocomputing 527 (2023) 1–12.
[24] X. Chen, J. Pan, K. Jiang, Y. Li, Y. Huang, C. Kong, L. Dai, Z. Fan, Unpaired deep image deraining using dual contrastive learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 2017–2026.
[25] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, P. Isola, What makes for good views for contrastive learning?, Advances in neural information processing systems 33 (2020) 6827–6839.
[26] M. Kim, J. Tack, S. J. Hwang, Adversarial self-supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020) 2983–2994.
[27] C.-H. Ho, N. Vasconcelos, Contrastive learning with adversarial examples, Advances in Neural Information Processing Systems 33 (2020) 17081–17093.
[28] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE transactions on information theory 13 (1) (1967) 21–27.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[30] D. Rezende, S. Mohamed, Variational inference with normalizing flows, in: International conference on machine learning, PMLR, 2015, pp. 1530–1538.
[31] J. Lei, X. Hu, Y. Wang, D. Liu, Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14143–14152.
[32] W. Ma, Y. Li, S. Lan, W. Wang, W. Huang, W. Zhu, Semantic-aware normalizing flow with feature fusion for image anomaly detection, Neurocomputing 590 (2024) 127728.
[33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[34] S. Zagoruyko, N. Komodakis, Wide residual networks, arXiv preprint arXiv:1605.07146 (2016).
[35] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114.
[36] H. Zhang, Z. Wang, Z. Wu, Y.-G. Jiang, Diffusionad: Denoising diffusion for anomaly detection, arXiv preprint arXiv:2303.08730 (2023).
[37] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in neural information processing systems 33 (2020) 6840–6851.
[38] M. Yang, P. Wu, H. Feng, Memseg: A semi-supervised method for image surface defect detection using differences and commonalities, Engineering Applications of Artificial Intelligence 119 (2023) 105835.
[39] K. Perlin, An image synthesizer, ACM Siggraph Computer Graphics 19 (3) (1985) 287–296.
[40] X. Chen, K. He, Exploring simple siamese representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15750–15758.
[41] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., Bootstrap your own latent-a new approach to self-supervised learning, Advances in neural information processing systems 33 (2020) 21271–21284.
[42] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
[43] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.
[44] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, Advances in neural information processing systems 29 (2016).
[45] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, Advances in neural information processing systems 33 (2020) 9912–9924.
[46] C. Zhang, K. Zhang, C. Zhang, T. X. Pham, C. D. Yoo, I. S. Kweon, How does simsiam avoid collapse without negative samples? a unified understanding with self-supervised contrastive learning, arXiv preprint arXiv:2203.16262 (2022).
[47] Z. Zhang, W. Chen, H. Cheng, Z. Li, S. Li, L. Lin, G. Li, Divide and contrast: Source-free domain adaptation via adaptive contrastive learning, Advances in Neural Information Processing Systems 35 (2022) 5137–5149.
[48] R. Wang, Z. Wu, Z. Weng, J. Chen, G.-J. Qi, Y.-G. Jiang, Cross-domain contrastive learning for unsupervised domain adaptation, IEEE Transactions on Multimedia (2022).
[49] Z. Liu, Y. Zhou, Y. Xu, Z. Wang, Simplenet: A simple network for image anomaly detection and localization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20402–20411.
[50] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, A. Vedaldi, Describing textures in the wild, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
[51] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[52] O. Sener, S. Savarese, Active learning for convolutional neural networks: A core-set approach, arXiv preprint arXiv:1708.00489 (2017).
[53] D. Kim, C. Park, S. Cho, S. Lee, Fapm: Fast adaptive patch memory for real-time industrial anomaly detection, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[54] H. Deng, X. Li, Anomaly detection via reverse distillation from one-class embedding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9737–9746.
[55] X. Zhang, S. Li, X. Li, P. Huang, J. Shan, T. Chen, Destseg: Segmentation guided denoising student-teacher for anomaly detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3914–3923.
[56] Z. Zhang, Z. Zhao, X. Zhang, C. Sun, X. Chen, Industrial anomaly detection with domain shift: A real-world dataset and masked multi-scale reconstruction, Computers in Industry 151 (2023) 103990.
[57] Y. Zou, J. Jeong, L. Pemula, D. Zhang, O. Dabeer, Spot-the-difference self-supervised pre-training for anomaly detection and segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 392–408.
[58] P. Mishra, R. Verk, D. Fornasier, C. Piciarelli, G. L. Foresti, Vt-adl: A vision transformer network for image anomaly detection and localization, in: 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), IEEE, 2021, pp. 01–06.
[59] J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, L. Wu, Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows, arXiv preprint arXiv:2111.07677 (2021).
[60] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[61] Y. Zheng, X. Wang, R. Deng, T. Bao, R. Zhao, L. Wu, Focus your distribution: Coarse-to-fine non-contrastive learning for anomaly detection and localization, in: 2022 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2022, pp. 1–6.
[62] W. Shin, J. Lee, T. Lee, S. Lee, J. P. Yun, Anomaly detection using score-based perturbation resilience, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 23372–23382.