Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Ankan Bhunia Changjian Li Hakan Bilen
University of Edinburgh
https://github.com/VICO-UoE/OddOneOutAD

Abstract

This paper introduces a novel anomaly detection (AD) problem that focuses on identifying ‘odd-looking’ objects relative to the other instances within a scene. Unlike in traditional AD benchmarks, anomalies in our setting are scene-specific, defined by the regular instances that make up the majority. Since object instances are often only partly visible from a single viewpoint, our setting provides multiple views of each scene as input. To provide a testbed for future research in this task, we introduce two benchmarks, ToysAD-8K and PartsAD-15K. We propose a novel method that generates 3D object-centric representations for each instance and detects the anomalous ones through a cross-examination between the instances. We rigorously analyze our method quantitatively and qualitatively on the presented benchmarks.

1 Introduction

Anomaly detection (AD) [13, 34] aims to detect patterns that deviate from expected behavior. In standard computer vision AD benchmarks, the non-conforming (or anomalous) patterns in images are often either due to high-level variations such as the introduction of an object instance from an unseen category [2, 8, 12] or low-level variations in object shape and texture [6, 16] (see Fig. 1(a)). In these benchmarks, the definitions of normal and anomalous patterns are typically not predefined but implicitly learned through representations and/or classifiers [21, 2, 16] that are invariant to appearance changes within the normal data distribution while being sensitive to the anomalous ones. However, in many real-world applications such as quality control in production lines, the product specification, and hence the definition of ‘normality’, is not only known in advance but also specific to each object instance. For example, a coffee cup with a red handle is considered normal and one with a blue handle anomalous when the goal is to produce red-handled cups, but the opposite is true if the goal is to produce blue-handled cups. This instance-specific definition of normality cannot be addressed by methods that rely on a global and implicit definition of normality.

In this paper, inspired by real-world visual inspection scenarios, we introduce a new AD problem, along with two new benchmarks with different characteristics and a novel solution. As depicted in Fig. 1(b), in this proposed setting, each image contains a scene with multiple instances of the same object (e.g., coffee cups) including either only normal, or a mix of normal and anomalous instances where the anomalous instances constitute the minority. As defining precise visual specifications for shape and appearance is often infeasible, we assume that normal instances, which form the majority, provide a scene-specific reference for ‘normality’. Our objective is to detect anomalous instances in a scene while generalizing to previously unseen scenes including novel objects and spatial configurations.

Unlike existing benchmarks, our task often requires a holistic understanding of the scene by comparing instances with each other, although some anomalies, such as cracks and misaligned parts, can be identified by inspecting individual instances. Additionally, since performing AD from a single view can be ambiguous due to self-occlusion and occlusion between the instances, we provide multiple views of the scene to cover its entire relevant extent, unlike the existing benchmarks that provide a single view as input (see Fig. 1(c)). Our goal is to detect anomalous samples in previously unseen scenes from multiple views.

Figure 1: The standard and our new AD settings.

The proposed problem presents several challenges requiring the following capabilities: i) 3D understanding of the scene and registering the views from multiple camera viewpoints without groundtruth 3D knowledge while identifying potential occlusions, ii) aligning and comparing object instances with each other without their relative pose information in both training and evaluation, iii) learning representations that generalize to unseen object instances during testing. To address these challenges, we propose a novel method that takes as input multiple views of the same scene, projects them into a 3D voxel grid, produces a 3D object-centric representation for each instance, and predicts their labels by cross-correlating instances through an efficient attention mechanism. We leverage recent advances in differentiable rendering [31] and self-supervised learning [33] to supervise 3D representation learning. Specifically, we render the voxel representations for multiple viewpoints and match them with the given 2D views to produce geometrically consistent features. Additionally, we encourage our model to learn part-aware 3D representations by distilling features from the 2D self-supervised model DINOv2 [33], enhancing the correspondence matching across instances. Finally, since there is no prior benchmark for this task, we propose two new benchmarks, ToysAD-8K and PartsAD-15K, which include instances from common semantic object categories and mechanical parts, respectively, to provide a testbed for future research. Our method significantly outperforms various baselines that do not compare instances with each other, and we rigorously analyze various aspects of our model and benchmarks.

2 Related Work

AD benchmarks. A key challenge in AD research is the scarcity of large datasets containing realistic anomalies. Earlier works [11, 36] focusing on high-level semantic anomalies often use existing classification datasets by treating a subset of classes as anomalies and the remainder as normal. There also exist several datasets containing real-world anomaly instances. For example, MVTec-AD [5] includes industrial objects with various defects like scratches, dents, and contaminations, Carrera et al. [10] present various defects in nanofibrous material, and VisA [59] comprises complex industrial objects such as PCBs, as well as simpler objects like capsules and cashews, spanning a total of 12 categories. These datasets assume that objects are pose-aligned, and both normal images and their anomaly counterparts have the same pose. Zhou et al. [58] propose a pose-agnostic framework by introducing the PAD dataset, which comprises images of 20 LEGO bricks of animal toys from diverse viewpoints/poses. Unlike the works discussed above, we focus on AD in multi-object multi-view scene environments, where anomalies are predicted by assessing their mutual similarity with other objects in the scene. The proposed setting enables our framework to work seamlessly on novel object instances without requiring further training.

Few-shot AD. There are some works [18, 23, 52, 50, 7] aiming to detect anomalies from a small number of normal samples as support images. In our setting, the concept of normality in each image is also learned from a few instances only. However, unlike them, the concept of normality is scene-specific, and our method can generalize to previously unseen instances without requiring any modification by learning from a support set. In addition, our setting involves multi-object multi-view data samples as input, unlike the single-object single-view in theirs.

Multi-view 3D vision. Multi-view 3D detection [37, 40, 51, 48] is a related problem that aims to predict the locations and classes of objects in 3D space, given multi-view images of a scene along with their corresponding camera poses as input. Most existing works first project 2D image features onto a 3D voxel grid, followed by a detection head [29, 53, 9] that outputs the final 3D bounding boxes and class labels. While our task can be naively solved by treating it as a 3D object detection problem with anomaly and normal as the two possible classes, this approach has limited ability to perform effective comparisons with other instances in the scene, which is required for fine-grained AD. We compare our method to a 3D object detection technique in Sec. 4. Another related area involves taking multi-view images as input and training a feedforward model for 3D volume-based reconstruction [32, 55, 42] and novel view synthesis [45, 47, 54, 15, 24]. Similarly, in this work, we focus on learning a feedforward model in a multi-object scene environment using sparse multi-view images but for AD.

Leveraging foundation models. Large-scale pretraining on image datasets has shown impressive generalization capabilities on various tasks [35, 25, 33]. Previous works [57, 3] demonstrate that features extracted from DINOv2 [33] serve as effective dense visual descriptors with localized semantic information for the dense correspondence estimation task. Some recent works [26, 46] use feature distillation techniques to leverage 2D foundation vision models for 3D tasks. Inspired by these works, we utilize DINOv2 to distill its dense semantic knowledge into our 3D network, which enables the network to infer robust local correspondences and aids fine-grained object matching.

3 Method

3.1 Overview

Consider a scene containing multiple rigid objects $\{o_{n}\}_{n=1}^{N}$ of the same instance, where $N$ represents the number of objects in the scene. Each object’s pose is arbitrary and unknown. The goal is to identify anomalies in the group of objects by assessing their mutual similarity. An observation in our setting consists of $M$ views, given as $H\times W$ RGB images $\mathcal{I}=\{\bm{I}_{t}\}_{t=1}^{M}$ , and their corresponding camera projection matrices $\mathcal{P}=\{\bm{P}_{t}\}_{t=1}^{M}$ , $\bm{P}_{t}=\bm{K}_{t}\left[\bm{R}_{t}|\bm{T}_{t}\right]$ , with intrinsic $\bm{K}_{t}$ , rotation $\bm{R}_{t}$ and translation $\bm{T}_{t}$ matrices. In our setting, $M$ is $5$ , forming sparse-view inputs. Our goal is to learn a mapping $\psi$ from the multi-view images to object-centric anomaly labels ${y}_{n}\in\{0,1\}$ and their corresponding 3D bounding boxes $\bm{b}_{n}$ , defined as:

$$\psi:\{(\bm{I}_{t},\bm{P}_{t})\}_{t=1}^{M}\mapsto\{({y}_{n},\bm{b}_{n})\}_{n=1}^{N}. \tag{1}$$

Note, labels $y_{n}$ are defined relative to other objects in the scene. For example, consider a group of three coffee cups, of which two have red handles, and one has a blue handle. In this example, the latter cup is considered an anomaly.

Our architecture, illustrated in Fig. 2, comprises three main components. First, the 3D feature fusion module encodes each view image and projects it to 3D, forming a fused 3D feature volume. Second, the feature distillation block enhances the 3D feature volume through differentiable rendering, which facilitates finding local correspondences between objects. Finally, the cross-instance matching module leverages the established correspondences to compare all similar object regions in the scene using a sparse voxel attention mechanism. Next, we elaborate on the details.

3.2 3D Feature Volume Construction

We first extract 2D features $\bm{F}_{t}=\mathcal{E}_{2D}(\bm{I}_{t})\in\mathbb{R}^{d\times h\times w}$ for each input view using a shared CNN encoder $\mathcal{E}_{2D}$ , where $d$ is the feature dimension. These 2D features are then projected into 3D voxel space as follows:

$$\bm{F}_{v}=\mathcal{E}_{3D}(\texttt{aggr}(\{\Pi_{\texttt{proj}}(\bm{F}_{t},\bm{P}_{t})\}_{t=1}^{M})), \tag{2}$$

where $\Pi_{\texttt{proj}}$ back-projects each view feature $\bm{F}_{t}$ using the known camera intrinsics and extrinsics, generating 3D feature volumes of size $d\times v_{x}\times v_{y}\times v_{z}$ . These feature volumes are aggregated over all input views using an average operation as in [32, 42]. Finally, a 3D CNN-based network $\mathcal{E}_{3D}$ is employed to refine the aggregated feature volume, resulting in a final voxel representation $\bm{F}_{v}$ .
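As a rough illustration of the back-projection and averaging in Eq. 2, the following pure-Python sketch projects each voxel centre into every view with $\bm{P}_{t}$ and averages the features it lands on. The function name `backproject_and_average`, the nested-list data layout, and the nearest-neighbour lookup are assumptions made for brevity, not the authors' implementation; the refinement network $\mathcal{E}_{3D}$ is omitted.

```python
import math

def backproject_and_average(feature_maps, projections, voxel_centers):
    """Fuse per-view 2D features into per-voxel features by averaging.

    feature_maps:  list of M maps, indexed as feat[row][col] -> list of d floats
    projections:   list of M 3x4 projection matrices P_t (nested lists)
    voxel_centers: list of (x, y, z) world coordinates
    """
    d = len(feature_maps[0][0][0])
    fused = []
    for x, y, z in voxel_centers:
        total, hits = [0.0] * d, 0
        for feat, P in zip(feature_maps, projections):
            # Homogeneous pixel coordinates of the voxel centre in this view.
            hu, hv, hw = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in P)
            if hw <= 0:  # the voxel lies behind this camera
                continue
            col = math.floor(hu / hw + 0.5)  # nearest-neighbour lookup (a simplification)
            row = math.floor(hv / hw + 0.5)
            if 0 <= row < len(feat) and 0 <= col < len(feat[0]):
                total = [a + b for a, b in zip(total, feat[row][col])]
                hits += 1
        # Average over the views that see the voxel; unseen voxels stay zero.
        fused.append([a / hits for a in total] if hits else total)
    return fused
```

Averaging only over the views in which a voxel is visible is one possible reading of the aggregation step; a plain mean over all $M$ views would be an equally simple alternative.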

We use volume rendering [31] to reconstruct the geometry and appearance of the scene. We implement the rendering operation as in [24], where for each 3D query point on a ray, we retrieve its corresponding features by bilinearly interpolating between the neighboring voxel grids. Specifically, we first apply a two-layered $1\times 1\times 1$ convolution block, i.e., $\alpha_{c}$ and $\alpha_{\sigma}$ respectively, to obtain color and density volumes denoted as ( $\bm{V}_{c}$ , $\bm{V}_{\sigma}$ ). Then, the pixel-wise color and density maps are composed by integrating along a camera ray using volume renderer $\mathcal{R}$ . Following this, we compute the L2 image reconstruction loss $\mathcal{L}^{\text{im}}_{t}$ . Formally, the image rendering and its corresponding loss function for a single viewpoint $\bm{P}_{t}$ are shown below:

$$[\hat{\bm{I}}_{t},\hat{\bm{I}}_{t\sigma}]=\mathcal{R}([\bm{V}_{c},\bm{V}_{\sigma}],\bm{P}_{t}),\quad\mathcal{L}^{\text{im}}_{t}=\|\bm{I}_{t}-\hat{\bm{I}}_{t}\|^{2}+\lambda_{\sigma}\|\bm{I}_{t\sigma}-\hat{\bm{I}}_{t\sigma}\|^{2}, \tag{3}$$

where $\hat{\bm{I}}_{t}$ and $\hat{\bm{I}}_{t\sigma}$ are the rendered image and mask, respectively, and $\lambda_{\sigma}$ is a loss weight.
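To make the role of the renderer $\mathcal{R}$ more concrete, here is a minimal sketch of the standard emission-absorption quadrature used in NeRF-style volume rendering [31], applied to a single ray. It assumes uniformly spaced samples whose colour and density have already been interpolated from the volumes; the function name `composite_ray` is hypothetical, and the paper's renderer may differ in its details.

```python
import math

def composite_ray(densities, colors, step):
    """Alpha-composite samples along one ray, ordered near to far.

    densities: non-negative density value per sample
    colors:    (r, g, b) per sample
    step:      spacing between consecutive samples (assumed uniform here)
    Returns the rendered pixel colour and the accumulated opacity (mask value).
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    opacity = 0.0
    for sigma, rgb in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        opacity += weight
        transmittance *= 1.0 - alpha
    return pixel, opacity
```

Under this reading, the accumulated opacity plays the role of the rendered mask $\hat{\bm{I}}_{t\sigma}$, which is compared against the segmentation mask in the second term of the image loss.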

We propose to improve the 3D voxel representation $\bm{F}_{v}$ by also reconstructing neural features instead of just color and density. We supervise the feature reconstruction by a pretrained 2D image encoder $\Phi$ as a teacher network. We choose DINOv2 [33] as the teacher network due to its excellent ability to capture various object geometries and correspondences.

We use a projector function $\beta$ , implemented as a four-layered $1\times 1\times 1$ convolution block that projects $\bm{F}_{v}$ to a neural feature field $\bm{V}_{f}$ , changing the channel dimension from $d$ to $d_{f}$ . Similar to color rendering, we generate rendered features $\hat{\Phi}_{t}$ of size $d_{f}\times h_{f}\times w_{f}$ at a given viewpoint using volume renderer $\mathcal{R}$ . The objective is to minimize the difference between the rendered features and the teacher’s features $\Phi(\bm{I}_{t})$ . We choose cosine distance as our feature loss ( $\mathcal{L}^{\text{feat}}_{t}$ ), which we find easier to optimize than the standard L2 loss. We apply stop-gradient to the density when rendering the features $\hat{\Phi}_{t}$ , as the teacher’s features are not fully multi-view consistent [3], which could harm the quality of the reconstructed geometry. The final reconstruction loss is the sum of all image and feature reconstruction losses, with the feature term weighted by a factor $\lambda_{f}$ :

$$\mathcal{L}^{r}=\sum_{t=1}^{M}(\mathcal{L}^{\text{im}}_{t}+\lambda_{f}\mathcal{L}^{\text{feat}}_{t}). \tag{4}$$
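The feature term of Eq. 4 can be sketched as follows. The text only states that a cosine distance is used, so taking $1-\cos$ averaged over pixel locations is an assumption, as are the names `cosine_distance`, `feature_loss`, and `reconstruction_loss`.

```python
import math

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)

def feature_loss(rendered, teacher):
    """Mean cosine distance over pixel locations (assumed reduction)."""
    return sum(cosine_distance(r, t) for r, t in zip(rendered, teacher)) / len(rendered)

def reconstruction_loss(image_losses, feature_losses, lambda_f=1.0):
    """Eq. 4: sum over views of the image loss plus the weighted feature loss."""
    return sum(l_im + lambda_f * l_feat
               for l_im, l_feat in zip(image_losses, feature_losses))
```

Because the cosine distance ignores the magnitude of the feature vectors, it only asks the rendered features to point in the same direction as the teacher's, which may be part of why it is easier to optimize than an L2 loss.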

The key benefits of reconstructing DINOv2 features are twofold. 1) Distilling features from general-purpose feature extractors pre-trained on large external datasets incorporates open-world knowledge into the 3D representation. This enables our model to perform significantly better on unseen object instances or even on novel categories, as demonstrated in the experiments. 2) The distillation enforces a consistent 3D scene representation, leading to identical features for the same object geometries. This enables the model to infer robust local correspondences (see Fig. 4), which aids in fine-grained object matching.

3.3 Object-centric 3D Features Extraction

We extract bounding box regions of objects using a box estimator $\mathcal{B}$ . First, we obtain a voxel reconstruction of the scene by applying a threshold to the predicted density $\bm{V}_{\sigma}$ . Then, we employ DBSCAN [20], a density-based clustering method, to retrieve all bounding box regions $\{\bm{b}_{n}\}_{n=1}^{N}$ corresponding to the objects in the scene. Next, we obtain object-centric feature volumes $\{\bm{z}_{n}\}_{n=1}^{N}$ , each with a size of $8\times 8\times 8$ , from these regions by applying RoI pooling [17].
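The box estimator can be pictured with the sketch below, which thresholds a density volume, clusters the occupied voxels with a small textbook DBSCAN, and takes the axis-aligned extent of each cluster. The function names, the dictionary layout of the density volume, and the `eps` and `min_samples` values are illustrative assumptions; in particular, they are not claimed to match the defaults of whichever DBSCAN implementation the authors used.

```python
def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one cluster id per point (-1 marks noise)."""
    def region(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_samples:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:  # noise reached from a core point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = region(j)
            if len(nearby) >= min_samples:  # only core points grow the cluster
                seeds.extend(nearby)
    return labels

def object_boxes(density, threshold=0.2, eps=1.5, min_samples=2):
    """density: dict mapping voxel index (x, y, z) -> predicted density."""
    occupied = sorted(v for v, sigma in density.items() if sigma > threshold)
    labels = dbscan(occupied, eps, min_samples)
    boxes = []
    for c in range(max(labels, default=-1) + 1):
        members = [v for v, label in zip(occupied, labels) if label == c]
        lo = tuple(min(axis) for axis in zip(*members))
        hi = tuple(max(axis) for axis in zip(*members))
        boxes.append((lo, hi))
    return boxes
```

Since clustering only separates groups of occupied voxels that are spatially apart, this step relies on objects not touching each other in the scene.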

To extract correspondence indices given two objects with feature volumes $\bm{z}_{n}$ and $\bm{z}_{m}$ , we formally define a function $\mathcal{C}_{k}$ :

$$\mathcal{C}_{k}(\bm{z}_{n},\bm{z}_{m})=\texttt{top}_{k}[\beta(\bm{z}_{n})^{T}\beta(\bm{z}_{m})]. \tag{5}$$

The function returns the top- $k$ most relevant feature locations in $\bm{z}_{m}$ for each voxel location in $\bm{z}_{n}$ . This is achieved by first projecting each voxel feature using the same projector function $\beta$ and then computing their voxel-level pairwise similarity. We use $\mathcal{C}_{k}$ to perform sparse attention-based comparisons between multiple object volumes, as described next.
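A minimal sketch of $\mathcal{C}_{k}$, assuming the voxel features of both objects have already been passed through $\beta$ and flattened into lists of vectors, could look like this (the name `top_k_correspondences` is hypothetical):

```python
def top_k_correspondences(feats_n, feats_m, k):
    """For each voxel of object n, return the indices of the k voxels of
    object m with the highest dot-product similarity.

    feats_n, feats_m: lists of projected voxel features (lists of floats)
    """
    result = []
    for f in feats_n:
        sims = [sum(a * b for a, b in zip(f, g)) for g in feats_m]
        order = sorted(range(len(feats_m)), key=lambda j: sims[j], reverse=True)
        result.append(order[:k])
    return result
```

For example, with two voxels in object $n$ and three in object $m$, `top_k_correspondences([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8], [0.5, 0.4]], k=2)` yields one list of two candidate indices per voxel of object $n$.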

3.4 Cross-instance Matching

The cross-instance matching module $\mathcal{M}$ takes the object features $\{\bm{z}_{n}\}_{n=1}^{N}$ , effectively learns to correlate them using sparse voxel attention, and subsequently predicts their object-specific labels.

The vanilla self-attention module in standard transformers [19] uses all tokens for the attention computation, which is inefficient for our task and may introduce noisy interactions with irrelevant features, potentially degrading performance. To overcome this, we compute the sparse voxel attention only among geometrically corresponding voxel locations, as shown in the inset.

Let $z_{n}[i]\in\mathbb{R}^{d}$ denote the $i$ -th voxel of the $n$ -th object volume in the scene. The query, key, and value embeddings are calculated using linear projections as:

$$\bm{Q}_{n}[i]=\bm{W}^{Q}{\bm{z}}_{n}[i],\quad\bm{K}_{n}[i]=\bm{W}^{K}{\bm{z}}_{n}[i],\quad\bm{V}_{n}[i]=\bm{W}^{V}{\bm{z}}_{n}[i], \tag{6}$$

where the weights $\bm{W}^{Q}$ , $\bm{W}^{K}$ and $\bm{W}^{V}$ are shared across all objects.

Let $c_{k}^{nm}[i]$ denote the set of all corresponding voxel indices obtained using Eq. 5. Then, the attention is calculated as:

$$\bar{\bm{z}}_{n}[i]=\sum_{\begin{subarray}{c}m=1\\ m\neq n\end{subarray}}^{N}\sum_{j\in c_{k}^{nm}[i]}\text{softmax}\left(\frac{\bm{Q}_{n}[i]\bm{K}_{m}[j]}{\sqrt{d}}\right)\bm{V}_{m}[j]. \tag{7}$$
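The sketch below illustrates one way to evaluate Eq. 7 for a single head. It assumes that the queries, keys, and values of Eq. 6 are precomputed, and that the softmax is normalized jointly over all candidate voxels gathered from the other objects; the equation leaves the normalization set implicit, so this is an interpretation, and the name `sparse_voxel_attention` and the `corr` dictionary layout are hypothetical.

```python
import math

def sparse_voxel_attention(Q, K, V, corr, n):
    """Update the voxels of object n by attending only to corresponding voxels.

    Q, K, V: indexed as X[object][voxel] -> list of floats
    corr:    corr[(n, m)][i] -> candidate voxel indices in object m for voxel i
             of object n (for instance the top-k correspondences of Eq. 5)
    Assumes every voxel has at least one candidate.
    """
    d = len(Q[n][0])
    dv = len(V[n][0])
    updated = []
    for i, q in enumerate(Q[n]):
        # Gather candidates from every other object in the scene.
        cands = [(m, j) for m in range(len(Q)) if m != n for j in corr[(n, m)][i]]
        logits = [sum(a * b for a, b in zip(q, K[m][j])) / math.sqrt(d)
                  for m, j in cands]
        peak = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        updated.append([sum(w / total * V[m][j][c] for w, (m, j) in zip(weights, cands))
                        for c in range(dv)])
    return updated
```

Compared with dense attention over all voxels of all objects, each query here touches only $k$ keys per other object, which is where the efficiency gain and the suppression of irrelevant interactions come from.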

The updated feature volume $\bar{\bm{z}}_{n}$ is passed through 3D CNN blocks to downsample by a factor of $1/8$ , which is finally reshaped into a vector and fed to a 2-layer MLP outputting the final prediction $\hat{y}_{n}$ . The classification loss is calculated as:

$$\mathcal{L}^{\text{bce}}=\sum_{n=1}^{N}\ell_{\text{bce}}(\hat{y}_{n},y_{n}), \tag{8}$$

where $\ell_{\text{bce}}$ is the binary cross-entropy loss function. The total training loss of our framework is $\mathcal{L}=\mathcal{L}^{\text{bce}}+\lambda_{p}\mathcal{L}^{r}$ , where $\lambda_{p}$ is a loss weight. We employ stage-wise training: first pretraining with only the reconstruction loss ( $\mathcal{L}^{r}$ ), followed by end-to-end training with both losses.
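As a small worked example of Eq. 8 and the total loss, the following sketch assumes that $\hat{y}_{n}$ is a probability in $(0,1)$, for instance after a sigmoid, which the text does not state explicitly. The function names are hypothetical.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def total_loss(preds, labels, recon_loss, lambda_p=1.0):
    """Classification loss summed over objects plus the weighted reconstruction loss."""
    return sum(bce(p, y) for p, y in zip(preds, labels)) + lambda_p * recon_loss
```

In the first training stage only the reconstruction term would be optimized, and the sum above would be used in the second, end-to-end stage.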

4 Experiments

4.1 Datasets

As no prior dataset exists for the task, we propose two challenging scene datasets, both essential for thorough evaluation: ToysAD-8K and PartsAD-15K. ToysAD-8K includes real-world objects from multiple categories. This allows us to evaluate our model’s ability to generalize to unseen object categories. PartsAD-15K comprises a more diverse collection of mechanical object parts with arbitrary shapes, thus being free from any class-level inductive biases. Both datasets include a wide range of fine-grained anomaly instances motivated by real-world applications in inspection and quality control. Scenes are generated with diverse backgrounds, illuminations, and camera viewpoints using photo-realistic ray tracing [43]. Next, we discuss the data generation for both datasets.

ToysAD-8K. We begin with a subset of $1050$ shapes from the Toys4K dataset [41]. The subset covers a wide range of objects from $51$ categories. We automatically create anomalies for a given 3D shape by applying various deformations to both the geometry and texture. This includes generating realistic cracks and fractures using [39], applying random geometric deformations [43] like bumps, bends, and twists, as well as randomly translating, rotating, and swapping materials in different parts of the shapes. In total, we generated $2345$ anomaly shapes. To generate each scene, we first randomly choose a set of objects consisting of normal shapes and anomalous versions of the same instance. We ensure that the majority of objects in each scene are normal. We also include a few scenes where no anomaly is present. The rotational poses for the objects are obtained using rigid body simulation [4]. The objects are scaled and placed into the scene at random locations, ensuring collisions do not occur. We generated a total of $8K$ scenes. Each scene consists of $3$ - $6$ objects rendered in $20$ views. For the training set, we randomly select $5K$ scenes from $39$ categories. We build two disjoint test sets. The first one (seen) contains $1K$ scenes from the seen categories but with unseen object instances. The second one (unseen) contains $2K$ scenes from the remaining $12$ novel categories.

PartsAD-15K. We use a subset of the ABC dataset [27] that consists of $4200$ shapes. We follow the strategy above to generate anomalies. Additionally, for each shape, we sample geometrically close instances from the dataset and assign them as anomalies to use in the same scene. This approach generates a large set of high-quality anomalies that closely resemble their normal counterparts in high-level geometry, but with subtle shape variations that make them anomalous in the context of the normal ones. In total, we generated $10,203$ anomaly shapes. Using these shapes, we created $15K$ scenes, each consisting of $3$ to $12$ objects rendered from $20$ different viewpoints. We divide the dataset into a $12K$ training set and use the rest as the test set. Note that the proposed datasets, source code, and models will be made public upon publication.

4.2 Implementation details

We use a ResNet50-FPN [30] as our 2D encoder backbone. Our 3D backbone consists of a four-scale encoder-decoder-based 3D CNN [32]. We use $M=5$ images as input, each with a resolution of $256\times 256$ . Our architecture can accept a different $M$ during inference. During training, we consider a total of $2M$ views, which we separate into two sets, each containing $M$ views. We use one set to build the neural volume and the cameras of the other set to render the results, and vice versa. The 3D volume $\bm{F}_{v}$ contains $96\times 96\times 16$ voxels with a voxel size of $4$ cm. We sample $128$ points on each ray for rendering. We render the features with a spatial dimension of $32\times 32$ . The $\bm{I}_{t\sigma}$ in Eq. 3 is the ground truth segmentation mask of the input image; we use it only for training. The threshold applied to the density volume $\bm{V}_{\sigma}$ is set to $0.2$ , and we run the DBSCAN algorithm with its default parameters. We resize the teacher’s (DINOv2) features to the same spatial dimension for loss computation. These features are pre-computed for all scenes using publicly available weights, which are not updated during distillation. These DINOv2 features are then reduced to $d_{f}=128$ channel dimensions using PCA before distillation. We employ three sparse voxel attention blocks, each applying 8-headed attention. The value of $k$ is chosen as $20$ . The loss weights $\lambda_{\sigma}$ , $\lambda_{f}$ , and $\lambda_{p}$ are all set to $1$ . We first pretrain the network with only the reconstruction loss for $50$ epochs. Then, we train the network end-to-end with both the reconstruction loss and the binary classification loss for another $50$ epochs. We use a batch size of $4$ and the Adam optimizer with a learning rate of $2\times 10^{-5}$ . Finally, the run-time of our method is $65$ ms on a single A40 GPU for a typical scene with 5 views as input.
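The two-set training scheme described above can be sketched as follows. That the $2M$ views are drawn at random from the $20$ rendered views of a scene is an assumption, as is the function name `split_views`.

```python
import random

def split_views(num_views=20, M=5, seed=None):
    """Pick 2M distinct views and return the two build/render pairings."""
    rng = random.Random(seed)
    chosen = rng.sample(range(num_views), 2 * M)
    set_a, set_b = chosen[:M], chosen[M:]
    # Each pair is (views used to build the volume, views whose cameras are rendered).
    return [(set_a, set_b), (set_b, set_a)]
```

Rendering from cameras that did not contribute to the volume encourages the representation to be consistent across viewpoints rather than memorizing its own inputs.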

4.3 Baseline Comparisons

We quantitatively evaluate the anomaly classification results using two evaluation metrics: the area under the ROC curve (AUC) and accuracy. A prediction is considered correct if the bounding box IoU is greater than 0.5 and the corresponding anomaly classification is correct. We have not designed a separate localization metric as our estimated bounding boxes are very accurate. These metrics are calculated object-wise and then averaged across all test scenes. Since no prior work exists on this task, we define several competitive baselines for comparison. Specifically, in Tab. 1, we compare our method with two types of approaches: a reconstruction-based baseline and two multi-view 3D object detection methods. For the reconstruction-based design, we first employ COLMAP [38] to obtain a point cloud reconstruction of the multi-object scene environment. We use default dense reconstruction parameters but utilize the provided ground truth camera matrices. Since 5 views are insufficient, we use a total of 20 views to reconstruct the scene. We extract individual object point clouds from the reconstructed scenes, and train a Siamese-style network to obtain their pairwise similarity. We use DGCNN [49] (pretrained on ShapeNet [14]) as the point cloud feature extractor and the triplet loss [22] to supervise the network. Finally, we employ a voting strategy to aggregate the pairwise distances of all objects in the scene and obtain their individual binary labels. We observe that the accuracy of this method is sensitive to the reconstruction quality, leading to poor performance on both datasets. Next, we adapt ImVoxelNet [37] and DETR3D [48], two multi-view object detection frameworks, to our problem setting, aiming to locate and classify each object in the scene as either anomaly or normal. ImVoxelNet uses a 2D-3D projection similar to ours to construct the voxel representation, and then a 3D detection head [29] outputs the final prediction. DETR3D is a transformer-based design and uses the set prediction loss [9] for end-to-end detection without NMS. We use the ground truth 3D bounding boxes to train these models, and the reported scores are calculated object-wise based on the output of the classification head. We observe that while both methods perform well for large cracks or fractures, they struggle when intra-group comparison is necessary. This is because they tend to memorize certain anomaly types without learning to compare with other objects in the scene. Our method significantly outperforms all baselines (Tab. 1) on both datasets, highlighting the effectiveness of our dedicated architecture for matching corresponding regions. Notably, the performance drop from the seen to the unseen set is smaller for our method than for the two detection baselines (2.6% vs. 4.9% and 4.6% AUC), which we attribute to the robust 3D representation that effectively generalizes to novel categories. We show qualitative results of our method in Fig. 3.
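The correctness criterion stated above (box IoU greater than 0.5 and a correct anomaly label) can be written down as in the sketch below. It assumes axis-aligned boxes given by their minimum and maximum corners, which the text does not specify, and the function names are hypothetical.

```python
import math

def box_volume(box):
    """Volume of an axis-aligned box ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    return math.prod(hi - lo for lo, hi in zip(box[0], box[1]))

def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = math.prod(max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
                      for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, pred_label, gt_box, gt_label, iou_threshold=0.5):
    """A prediction counts only if it is both well localized and correctly labeled."""
    return iou_3d(pred_box, gt_box) > iou_threshold and pred_label == gt_label
```

Under this criterion, a correctly labeled object whose box overlaps the ground truth too little still counts as an error, so the reported accuracy reflects localization as well as classification.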

Table 1: Quantitative results. We compare our method to three related works on two datasets and report the results in terms of anomaly detection AUC and accuracy.

Dataset	COLMAP [38] AUC	COLMAP [38] Accuracy	ImVoxelNet [37] AUC	ImVoxelNet [37] Accuracy	DETR3D [48] AUC	DETR3D [48] Accuracy	Ours AUC	Ours Accuracy
ToysAD-8K-Seen	73.45	60.48	78.13	65.55	79.16	67.37	91.78	83.21
ToysAD-8K-Unseen	72.86	58.12	73.19	60.12	74.60	62.98	89.15	81.57
PartsAD-15K	72.78	61.34	72.80	64.34	74.49	65.11	86.12	79.68

4.4 Ablations and Model Analysis

We conduct several studies into the performance of the proposed model including changes to architecture, robustness analysis and real-world experiments. Unless stated otherwise, all experiments in this section are carried out on the ToysAD-8K unseen set using 5 input views.

Ablation of architecture design. We ablate the core components in our model and report the results in Tab. 2. All variants include the 3D feature fusion module and are optimized at least for the image reconstruction loss, which is essential for constructing scene geometry. Variant A directly maps object-centric features to their binary labels using an MLP without comparing them, while for variant B, standard attention layers are applied to the object-centric features to learn cross-object correlations. Despite the attention layers, B only shows a slight improvement (+2.1% AUC) over A.

Table 2: Ablation Results on ToysAD-8K

Variants	AUC	Accuracy
A: baseline	79.13	66.78
B: A + vanilla attention	81.24	68.20
C: B + feature distillation	87.05	79.56
Final: C + sparse voxel attention	89.15	81.57

We then introduce DINOv2 feature distillation (variant C), which significantly boosts performance by 5.8% AUC and 11.4% accuracy, indicating the importance of the part-aware 3D representation and correspondences for our task (also see Fig. 4). Our final design achieves a further gain (+2.1% AUC, +2.0% accuracy) by utilizing sparse voxel attention, which focuses on the top- $k$ most relevant features. This leverages the robust correspondences learned through DINOv2 feature distillation, eliminating noisy correlations and directing attention solely to corresponding object regions.

Robustness. Our method achieves some robustness to occlusion by effectively using input from multiple viewpoints. Fig. 7 illustrates examples where our model accurately identifies the anomalous region, even when occluded in some views. In Fig. 5 (left), we show how well our model can perform with additional or reduced input views at test time. We train our model with 5 views and then evaluate it using 1, 3, 5, 10 and 20 views. These results clearly indicate that our model can perform reasonably well with only 3-5 views; however, additional views can be used to boost performance at test time.

We analyze the impact of object count on model performance (see Fig. 5 (right)) using the PartsAD-15K dataset. Increasing the number of objects in a scene usually leads to more occlusion and lower resolution for each individual object. However, the performance drop between the two extremes is minimal ( $<$ 1.5% AUC) as shown in the figure. In addition, we investigate the model’s ability to generalize to more objects at test time than it observed during training. To this end, we trained our model on scenes with 3-7 objects and tested on two sets: one with 3-7 objects (AUC: 86.75) and another with 8-12 objects (AUC: 85.10). These results demonstrate our model’s adaptability to varying object counts.

In another study, we test our primary objective of choosing the majority group as normal and the rest as anomalies using an example shown in Fig. 6. We create five scenes (a-e) from two geometrically similar objects, increasing the number of copies of the first object from 1 to 5 across the scenes while maintaining a total of six objects. For example, in scene (a), the first object appears only once, making it an anomaly. Similarly, in scene (e), the second object appears only once and is hence considered an anomaly. We utilize a consistent background across all scenes to ensure uniformity. As shown in the figure, our model correctly classifies the anomalies in each scene. We note that scene (c) presents an ambiguous case, where both objects appear in equal numbers. Despite this, our model is able to separate the two groups.

Real world testing. Here we apply our model, which is trained on the synthetic dataset, to a small set of real test scenes. Each scene is set up in an indoor environment with adequate lighting and is captured using 3D scanning software [1]. This results in a set of input views with globally optimized cameras. Fig. 8 illustrates the results for three such scenes.

Limitations. Our benchmark and model also have a few limitations. Firstly, this work focuses solely on a limited set of anomalies that are common in manufacturing scenarios, potentially missing other anomalies that occur in real-world scenarios. Since acquiring real damaged objects is expensive and difficult, our dataset primarily uses shapes of synthetic objects. Secondly, our model assumes that object instances are rigid and cannot handle articulations or deformations. It also assumes that objects are not touching or completely occluded in space. Thirdly, a scene with mostly anomalies can be challenging for our model due to the lack of ‘normal’ data points for comparison (except for fractures or cracks). In addition, the performance depends on the anomalous region being captured in at least one view. Finally, in real-world scene testing, noisy camera poses as well as the gap between synthetic and real-world environments may degrade the performance.

5 Conclusion

In this paper, we have introduced a novel AD problem inspired by real-world applications, along with two new benchmarks. The proposed task goes beyond the traditional AD setting and involves a cross-examination of objects in a scene from multiple camera viewpoints to identify the ‘odd-looking’ minority group. We show that our model is robust to varying numbers of views and objects, and outperforms the baselines that do not consider cross-object correlations.

Broader Impacts.

The proposed techniques could potentially be used to improve quality checks and product safety by providing warnings to users in production lines. The authors are not aware of any potential harm that may arise when the technology is used.

References

[1] Polycam. https://github.com/PolyCam/polyform.
[2] Faruk Ahmed and Aaron Courville. Detecting semantic anomalies. In AAAI, 2020.
[3] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. arXiv preprint arXiv:2404.08636, 2024.
[4] David Baraff. Physically based modeling: Rigid body simulation. SIGGRAPH Course Notes, ACM SIGGRAPH, 2(1):2–1, 2001.
[5] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 2019.
[6] Paul Bergmann, Xin Jin, David Sattlegger, and Carsten Steger. The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization. arXiv preprint arXiv:2112.09045, 2021.
[7] Ankan Bhunia, Changjian Li, and Hakan Bilen. Looking 3d: Anomaly detection with 2d-3d alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[8] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. The fishyscapes benchmark: Measuring blind spots in semantic segmentation. IJCV, 2021.
[9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
[10] Diego Carrera, Fabio Manganini, Giacomo Boracchi, and Ettore Lanzarone. Defect detection in sem images of nanofibrous materials. IEEE Transactions on Industrial Informatics, 2016.
[11] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
[12] Robin Chan, Krzysztof Lis, Svenja Uhlemeyer, Hermann Blum, Sina Honari, Roland Siegwart, Pascal Fua, Mathieu Salzmann, and Matthias Rottmann. Segmentmeifyoucan: A benchmark for anomaly segmentation. arXiv preprint arXiv:2104.14812, 2021.
[13] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys, 2009.
[14] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[15] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14124–14133, 2021.
[16] Lucas Deecke, Lukas Ruff, Robert A Vandermeulen, and Hakan Bilen. Transfer-based semantic anomaly detection. In ICML, 2021.
[17] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, pages 1201–1209, 2021.
[18] Choubo Ding, Guansong Pang, and Chunhua Shen. Catching both gray and black swans: Open-set supervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7388–7398, 2022.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[20] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. Density-based spatial clustering of applications with noise. In Int. Conf. knowledge discovery and data mining, 1996.
[21] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
[22] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3, pages 84–92. Springer, 2015.
[23] Chaoqin Huang, Haoyan Guan, Aofan Jiang, Ya Zhang, Michael Spratling, and Yan-Feng Wang. Registration based few-shot anomaly detection. In European Conference on Computer Vision, pages 303–319. Springer, 2022.
[24] Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, and Yuke Zhu. Few-view object reconstruction with unknown categories and camera poses. arXiv preprint arXiv:2212.04492, 2022.
[25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[26] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems, 35:23311–23330, 2022.
[27] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. Abc: A big cad model dataset for geometric deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[28] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal of computer vision, 38:199–218, 2000.
[29] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
[30] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[31] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[32] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 414–431. Springer, 2020.
[33] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[34] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM computing surveys, 2021.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[36] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, 2018.
[37] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022.
[38] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[39] Silvia Sellán, Jack Luong, Leticia Mattos Da Silva, Aravind Ramakrishnan, Yuchuan Yang, and Alec Jacobson. Breaking good: Fracture modes for realtime destruction. ACM Transactions on Graphics, 42(1):1–12, 2023.
[40] Xuepeng Shi, Qi Ye, Xiaozhi Chen, Chuangrong Chen, Zhixiang Chen, and Tae-Kyun Kim. Geometry-based distance decomposition for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15172–15181, 2021.
[41] Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021.
[42] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15598–15607, 2021.
[43] Blender Development Team. Blender (version 3.1.0) [computer software]. https://blender.org/, 2022.
[44] Anh Thai, Ahmad Humayun, Stefan Stojanov, Zixuan Huang, Bikram Boote, and James M Rehg. Low-shot object learning with mutual exclusivity bias. Advances in Neural Information Processing Systems, 36, 2024.
[45] Alex Trevithick and Bo Yang. Grf: Learning a general radiance field for 3d scene representation and rendering. In Proceedings of the IEEE/CVF international conference on computer vision, 2021.
[46] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In 2022 International Conference on 3D Vision (3DV), pages 443–453. IEEE, 2022.
[47] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
[48] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
[49] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5):1–12, 2019.
[50] Jhih-Ciang Wu, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Learning unsupervised metaformer for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4369–4378, 2021.
[51] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose M Alvarez. M²BEV: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088, 2022.
[52] Guoyang Xie, Jinbao Wang, Jiaqi Liu, Feng Zheng, and Yaochu Jin. Pushing the limits of fewshot anomaly detection in industry vision: Graphcore. arXiv preprint arXiv:2301.12082, 2023.
[53] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
[54] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
[55] Weihao Yuan, Xiaodong Gu, Heng Li, Zilong Dong, and Siyu Zhu. 3d former: Monocular scene reconstruction with 3d sdf transformers. arXiv preprint arXiv:2301.13510, 2023.
[56] Greg Zaal, Rob Tuytel, Rico Cilliers, James Ray Cock, Andreas Mischok, Sergej Majboroda, Dimitrios Savva, and Jurita Burger. Polyhaven: a curated public asset library for visual effects artists and game designers, 2021.
[57] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36, 2024.
[58] Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, and Hao Zhao. Pad: A dataset and benchmark for pose-agnostic anomaly detection. In NeurIPS, 2024.
[59] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.

Appendix A Supplemental Material

This appendix is structured as follows: we present additional details of the proposed datasets in Sec. A.1, training details in Sec. A.2, and additional qualitative results in Sec. A.3.

A.1 Data Generation Details

Our proposed scene AD datasets, ToysAD-8K and PartsAD-15K, are built upon two publicly available 3D shape datasets: Toys4K [41] (Creative Commons and royalty-free licenses) and ABC [27] (MIT license). For ToysAD-8K, we selected 1,050 shapes from the Toys4K dataset, focusing on the most common real-world objects across 51 categories. A complete list of these categories is provided in Table 3. PartsAD-15K, on the other hand, is a non-categorical dataset, for which we randomly selected a subset of 4,200 shapes from the large-scale ABC dataset.

For both datasets, we consider the following anomaly types: cracks, fractures, geometric deformations (e.g., bumps, bends, and twists), translation, rotation, material mismatch, and missing parts. Additionally, for PartsAD-15K, we use a different but geometrically similar instance in the same scene as an anomaly. To obtain a geometrically similar shape, we use feature-based k-nearest-neighbor (KNN) retrieval. Specifically, we extract DINOv2 features from multi-view images of each 3D shape (rendered from multiple fixed viewpoints), concatenate these features, and use the concatenated features to build a KNN index. We then retrieve a similar 3D shape by querying this index. To ensure the retrieved shape is geometrically similar, we calculate the Chamfer distance between the two shapes and only accept the retrieved shape if the distance is below a certain threshold.
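
The retrieval-then-verify procedure above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the toy feature vectors stand in for concatenated DINOv2 features, the shape names, the number of neighbors `k`, and the threshold `max_chamfer` are hypothetical, and a brute-force search replaces whatever index structure was actually used.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance (mean squared nearest-neighbour distance, both ways)."""
    a_to_b = sum(min(sq_dist(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(sq_dist(q, p) for p in points_a) for q in points_b) / len(points_b)
    return a_to_b + b_to_a

def retrieve_similar(query_id, features, point_clouds, k=3, max_chamfer=0.05):
    """Return the closest shape in feature space that also passes the Chamfer check, else None."""
    neighbours = sorted(
        (sid for sid in features if sid != query_id),
        key=lambda sid: sq_dist(features[query_id], features[sid]),
    )[:k]
    for sid in neighbours:
        if chamfer_distance(point_clouds[query_id], point_clouds[sid]) < max_chamfer:
            return sid
    return None

# Toy data (hypothetical): two near-identical bolts and one unrelated gear.
features = {"bolt_a": [0.0, 1.0], "bolt_b": [0.1, 0.9], "gear": [1.0, 0.0]}
point_clouds = {
    "bolt_a": [(0, 0, 0), (1, 0, 0)],
    "bolt_b": [(0, 0, 0.1), (1, 0, 0.1)],
    "gear": [(5, 5, 5), (6, 5, 5)],
}
print(retrieve_similar("bolt_a", features, point_clouds))
print(retrieve_similar("gear", features, point_clouds))
```

The feature search proposes candidates cheaply, while the Chamfer check guards against shapes that look alike in feature space but differ geometrically.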

Table 3: Dataset composition of ToysAD-8K

	Categories
Seen	dinosaur, fish, frog, monkey, light, lizard, orange, boat, dog, lion, pig, cookie, panda, chicken,
	orange, ice, horse, car, airplane, cake, shark, donut, hat, cow, apple, bowl, hamburger, octopus,
	giraffe, chess, bread, butterfly, cupcake, bunny, elephant, fox, deer, bus, bottle
Unseen	mug, plate, robot, glass, sheep, shoe, train, banana, cup, key, penguin, hammer

Our anomaly generation process is automatic. To ensure the quality of the generated anomalies, we perform several checks. For example, during positional or rotational anomaly creation, if a part detaches from the main body during deformation, we reject the anomaly and try again with adjusted parameters. Similarly, if removing a part makes the shape impractical, we discard it and try removing a different part. For fracture anomalies, if a fracture removes more than 90% or less than 10% of an object, we discard the sample and generate another fracture. Finally, we ensure the anomalous region of an object is visible from at least one viewpoint.
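
The fracture check can be read as a simple rejection loop, sketched below. The callable `sample_fracture` is a hypothetical placeholder that returns the fraction of the object removed by one candidate fracture, and the retry limit `max_tries` is an assumption; only the 10% and 90% bounds come from the text.

```python
def accept_fracture(removed_fraction, low=0.10, high=0.90):
    """Accept a fracture only if it removes between 10% and 90% of the object."""
    return low <= removed_fraction <= high

def generate_fracture(sample_fracture, max_tries=20):
    """Resample until a fracture passes the check; return None if none does."""
    for _ in range(max_tries):
        removed = sample_fracture()
        if accept_fracture(removed):
            return removed
    return None

# Usage with a fixed sequence of candidates: the first two are rejected.
candidates = iter([0.95, 0.05, 0.40])
print(generate_fracture(lambda: next(candidates)))
```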

To generate a realistic scene environment, we use PBR materials [44] for floors and HDRI environment maps [56] for image-based lighting to illuminate scenes. For each scene, we randomly select a PBR material and an HDRI environment map from the assets to randomize the scene background. Objects are placed randomly such that no collisions occur, and each object’s rotational pose is obtained using Blender’s rigid body simulation [43]. We use Blender 2.93 [43] with the Cycles ray-tracing renderer for photo-realistic rendering. Blender 2.93 is released under the GNU General Public License (GPL, or “free software”), and the PBR and HDRI maps are released under the CC0 license.
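
One common way to place objects without collisions is rejection sampling, sketched below. This is an illustrative assumption rather than the pipeline's actual placement code: each object is approximated by a bounding circle on the floor plane, and the floor size, retry limit, and seed are hypothetical.

```python
import random

def place_objects(radii, floor_half_size=1.0, max_tries=1000, seed=0):
    """Sample (x, y, radius) positions so that no two bounding circles overlap."""
    rng = random.Random(seed)
    placed = []
    for r in radii:
        for _ in range(max_tries):
            x = rng.uniform(-floor_half_size + r, floor_half_size - r)
            y = rng.uniform(-floor_half_size + r, floor_half_size - r)
            # Accept only if the new circle is clear of every placed circle.
            if all((x - px) ** 2 + (y - py) ** 2 >= (r + pr) ** 2
                   for px, py, pr in placed):
                placed.append((x, y, r))
                break
        else:
            raise RuntimeError("could not place all objects without overlap")
    return placed

layout = place_objects([0.1] * 6)
print(layout)
```

In a Blender-based pipeline, positions found this way would only initialize the scene; the resting rotational pose would still come from the rigid body simulation described above.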

Our framework can easily be trained with real-world manufacturing scene environments. Our model relies solely on 2D supervision, making the data collection process much easier. We also do not need precise annotations of 3D bounding boxes, as they are not used during training. For annotating instance-wise anomaly labels, we can use 2D bounding boxes, which can be projected into 3D using a visual hull [28] and then used for coarse localization.
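
The idea of lifting 2D boxes to a coarse 3D box by space carving can be sketched as follows. This is a simplified illustration under strong assumptions: views are orthographic projections along coordinate axes, whereas a real setup would use calibrated perspective cameras, and all names (`carve_box`, `inside`) are hypothetical.

```python
def inside(box, uv):
    """Check whether a 2D point lies inside an axis-aligned box (u0, v0, u1, v1)."""
    u0, v0, u1, v1 = box
    u, v = uv
    return u0 <= u <= u1 and v0 <= v <= v1

def carve_box(views, grid_points):
    """Keep the 3D points whose projection falls inside the 2D box in every view.

    views: list of (project_fn, box_2d) pairs.
    Returns the 3D bounding box (mins, maxs) of the surviving points, or None.
    """
    kept = [p for p in grid_points
            if all(inside(box, project(p)) for project, box in views)]
    if not kept:
        return None
    mins = tuple(min(p[i] for p in kept) for i in range(3))
    maxs = tuple(max(p[i] for p in kept) for i in range(3))
    return mins, maxs

# A 5x5x5 integer grid and two orthographic views (top and front).
grid = [(x, y, z) for x in range(5) for y in range(5) for z in range(5)]
views = [
    (lambda p: (p[0], p[1]), (1, 1, 2, 2)),  # top view: drops z
    (lambda p: (p[0], p[2]), (1, 0, 2, 3)),  # front view: drops y
]
print(carve_box(views, grid))
```

The result is only a coarse hull of the object, which is sufficient when the box is used for localization rather than reconstruction.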

A.2 Training Details

We train our model in two stages. In the first stage, we train it with just image and feature reconstruction losses. In the second stage, we train the model end-to-end with both reconstruction and binary classification losses. All experiments are performed on a single NVIDIA A40 GPU with a batch size of 4, utilizing 28GB of GPU memory. The first stage takes 36 hours to complete, followed by an additional 24 hours for the second stage.
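
The two-stage objective can be summarized in a short sketch. The loss weighting `cls_weight` and the use of a mean binary cross-entropy are assumptions for illustration; the text only states which loss terms are active in each stage.

```python
import math

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted anomaly probability."""
    p = min(max(prob, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def total_loss(stage, image_rec, feature_rec, probs=None, labels=None, cls_weight=1.0):
    """Stage 1: reconstruction losses only. Stage 2: reconstruction plus classification."""
    loss = image_rec + feature_rec
    if stage == 2:
        cls = sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)
        loss += cls_weight * cls
    return loss

print(total_loss(1, image_rec=0.2, feature_rec=0.3))
print(total_loss(2, image_rec=0.2, feature_rec=0.3, probs=[0.9, 0.2], labels=[1, 0]))
```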

The ablation experiments are conducted on the same workstation with the same GPU by removing one or a few core components from the full method. Specifically, variants A and B take 36 hours to train in the first stage but only 14 hours in the second stage. Variant C takes 36 and 24 hours for the two stages, the same as the full method.

We compare our method with COLMAP (BSD license), ImVoxelNet (MIT license), and DETR3D (MIT license). For the COLMAP-based approach, we use DGCNN (MIT license) as a feature extractor. In terms of training time, the COLMAP-based approach takes 10 hours to train, ImVoxelNet takes 24 hours, and DETR3D takes 2 days to converge.

The standard deviation of all our experiments (including the ablations and our method) under multiple runs is less than 0.5.

A.3 Additional Qualitative Results

In Fig. 9 and Fig. 10, we present additional qualitative results on ToysAD-8K and PartsAD-15K, respectively.