Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Ankan Bhunia  Changjian Li  Hakan Bilen
University of Edinburgh
https://github.com/VICO-UoE/OddOneOutAD
Abstract

This paper introduces a novel anomaly detection (AD) problem that focuses on identifying ‘odd-looking’ objects relative to the other instances within a scene. Unlike in traditional AD benchmarks, anomalies in our setting are scene-specific, defined by the regular instances that make up the majority. Since object instances are often only partly visible from a single viewpoint, our setting provides multiple views of each scene as input. To provide a testbed for future research on this task, we introduce two benchmarks, ToysAD-8K and PartsAD-15K. We propose a novel method that generates 3D object-centric representations for each instance and detects the anomalous ones through a cross-examination between the instances. We rigorously analyze our method quantitatively and qualitatively on the presented benchmarks.

1 Introduction

Anomaly detection (AD) [13, 34] aims to detect patterns that deviate from expected behavior. In standard computer vision AD benchmarks, the non-conforming (or anomalous) patterns in images are often due either to high-level variations such as the introduction of an object instance from an unseen category [2, 8, 12] or to low-level variations in object shape and texture [6, 16] (see Fig. 1(a)). In these benchmarks, the definitions of normal and anomalous patterns are typically not predefined but implicitly learned through representations and/or classifiers [21, 2, 16] that are invariant to appearance changes within the normal data distribution while being sensitive to the anomalous ones. However, in many real-world applications such as quality control in production lines, the product specification, and hence the definition of ‘normality’, is not only known in advance but also specific to each object instance. For example, a coffee cup with a red handle is considered normal and one with a blue handle anomalous when the goal is to produce red-handled cups, but the opposite is true if the goal is to produce blue-handled cups. This instance-specific definition of normality cannot be addressed by methods that rely on a global and implicit definition of normality.

In this paper, inspired by real-world visual inspection scenarios, we introduce a new AD problem, along with two new benchmarks with different characteristics and a novel solution. As depicted in Fig. 1(b), in this proposed setting, each image contains a scene with multiple instances of the same object (e.g., coffee cups) including either only normal instances, or a mix of normal and anomalous instances where the anomalous instances constitute the minority. As defining precise visual specifications for shape and appearance is often infeasible, we assume that normal instances, which form the majority, provide a scene-specific reference for ‘normality’. Our objective is to detect anomalous instances in a scene while generalizing to previously unseen scenes including novel objects and spatial configurations.

Unlike existing benchmarks, our task often requires a holistic understanding of the scene by comparing instances with each other, although some anomalies, such as cracks and misaligned parts, can be identified by inspecting individual instances. Additionally, since performing AD from a single view can be ambiguous due to self-occlusion and occlusion between the instances, we provide multiple views of the scene to cover its entire relevant extent, unlike the existing benchmarks that provide a single view as input (see Fig. 1(c)). Our goal is to detect anomalous samples in previously unseen scenes from multiple views.

Figure 1: The standard and our new AD settings.

The proposed problem presents several challenges requiring the following capabilities: i) 3D understanding of the scene and registering the views from multiple camera viewpoints without ground-truth 3D knowledge while identifying potential occlusions, ii) aligning and comparing object instances with each other without their relative pose information in both training and evaluation, iii) learning representations that generalize to unseen object instances during testing. To address these challenges, we propose a novel method that takes as input multiple views of the same scene, projects them into a 3D voxel grid, produces a 3D object-centric representation for each instance, and predicts their labels by cross-correlating instances through an efficient attention mechanism. We leverage recent advances in differentiable rendering [31] and self-supervised learning [33] to supervise 3D representation learning. Specifically, we render the voxel representations for multiple viewpoints and match them with the given 2D views to produce geometrically consistent features. Additionally, we encourage our model to learn part-aware 3D representations by distilling features from the 2D self-supervised model DINOv2 [33], enhancing the correspondence matching across instances. Finally, since there is no prior benchmark for this task, we propose two new benchmarks, ToysAD-8K and PartsAD-15K, which include instances from common semantic object categories and mechanical parts, respectively, to provide a testbed for future research. Our method significantly outperforms various baselines that do not compare instances with each other, and we rigorously analyze various aspects of our model and benchmarks.

2 Related Work

AD benchmarks. A key challenge in AD research is the scarcity of large datasets containing realistic anomalies. Earlier works [11, 36] focusing on high-level semantic anomalies often use existing classification datasets by treating a subset of classes as anomalies and the remainder as normal. There also exist several datasets containing real-world anomaly instances. For example, MVTec-AD [5] includes industrial objects with various defects like scratches, dents, and contaminations, Carrera et al. [10] present various defects in nanofibrous material, and VisA [59] comprises complex industrial objects such as PCBs, as well as simpler objects like capsules and cashews, spanning a total of 12 categories. These datasets assume that objects are pose-aligned, and both normal images and their anomaly counterparts have the same pose. Zhou et al. [58] propose a pose-agnostic framework by introducing the PAD dataset, which comprises images of 20 LEGO bricks of animal toys from diverse viewpoints/poses. Unlike the works discussed above, we focus on AD in multi-object multi-view scene environments, where anomalies are predicted by assessing their mutual similarity with other objects in the scene. The proposed setting enables our framework to work seamlessly on novel object instances without requiring further training.

Few-shot AD. There are some works [18, 23, 52, 50, 7] aiming to detect anomalies from a small number of normal samples as support images. In our setting, the concept of normality in each image is also learned from a few instances only. However, unlike them, the concept of normality is scene-specific, and our method can generalize to previously unseen instances without requiring any modification by learning from a support set. In addition, our setting involves multi-object multi-view data samples as input, unlike the single-object single-view in theirs.

Multi-view 3D vision. Multi-view 3D detection [37, 40, 51, 48] is a related problem that aims to predict the locations and classes of objects in 3D space, given multi-view images of a scene along with their corresponding camera poses as input. Most existing works first project 2D image features onto a 3D voxel grid, followed by a detection head [29, 53, 9] that outputs the final 3D bounding boxes and class labels. While our task can be naively solved by treating it as a 3D object detection problem with anomaly and normal as the two possible classes, this approach has limited ability to perform effective comparisons with other instances in the scene, which is required for fine-grained AD. We compare our method to a 3D object detection technique in Sec. 4. Another related area involves taking multi-view images as input and training a feedforward model for 3D volume-based reconstruction [32, 55, 42] and novel view synthesis [45, 47, 54, 15, 24]. Similarly, in this work, we focus on learning a feedforward model in a multi-object scene environment using sparse multi-view images but for AD.

Leveraging foundation models. Large-scale pretraining on image datasets has shown impressive generalization capabilities on various tasks [35, 25, 33]. Previous works [57, 3] demonstrate that features extracted from DINOv2 [33] serve as effective dense visual descriptors with localized semantic information for the dense correspondence estimation task. Some recent works [26, 46] use feature distillation techniques to leverage 2D foundation vision models for 3D tasks. Inspired by these works, we utilize DINOv2 and distill its dense semantic knowledge into our 3D network, which enables the network to infer robust local correspondences and aids in fine-grained object matching.

3 Method

3.1 Overview

Consider a scene containing multiple rigid objects $\{o_n\}_{n=1}^{N}$ of the same instance, where $N$ represents the number of objects in the scene. Each object’s pose is arbitrary and unknown. The goal is to identify anomalies in the group of objects by assessing their mutual similarity. An observation in our setting consists of $M$ views given as $H \times W$ RGB images $\mathcal{I} = \{\bm{I}_t\}_{t=1}^{M}$, and their corresponding camera projection matrices $\mathcal{P} = \{\bm{P}_t\}_{t=1}^{M}$, $\bm{P}_t = \bm{K}[\bm{R}_t | \bm{T}_t]$, with intrinsic $\bm{K}$, rotation $\bm{R}_t$ and translation $\bm{T}_t$ matrices. In our setting, $M$ is $5$, forming sparse-view inputs.
Our goal is to learn a mapping $\psi$ from the multi-view images to object-centric anomaly labels $y_n \in \{0, 1\}$ and their corresponding 3D bounding boxes $\bm{b}_n$, defined as:

$$\psi : \{(\bm{I}_t, \bm{P}_t)\}_{t=1}^{M} \mapsto \{(y_n, \bm{b}_n)\}_{n=1}^{N}. \tag{1}$$

Note that the labels $y_n$ are defined relative to other objects in the scene. For example, consider a group of three coffee cups, of which two have red handles and one has a blue handle. In this example, the latter cup is considered an anomaly.

Figure 2: Overview of our framework. We extract features from a sequence of input views using a 2D CNN and then back-project them into a 3D volume, which is then refined using a 3D CNN, resulting in $\bm{F}_v$. We then extract object-centric feature volumes $\{\bm{z}_n\}_{n=1}^{N}$ using RoI pooling, where $\mathcal{B}$ estimates the RoI region of each object. The cross-instance matching module $\mathcal{M}$ then learns a correlation among the objects using a sparse voxel attention. To improve the 3D representation of the scene, we distill the knowledge of a 2D vision model, namely DINOv2, and integrate the learned knowledge into our 3D network via differentiable rendering.

Our architecture, illustrated in Fig. 2, is comprised of three main components: First, the 3D feature fusion module encodes each view image and projects it to 3D, forming a fused 3D feature volume. Second, the feature distillation block is employed to enhance the 3D feature volume through differentiable rendering. This facilitates finding local correspondences between objects. Finally, the cross-instance matching module leverages established correspondences to compare all similar object regions in the scene using a sparse voxel attention mechanism. Next, we elaborate on details.

3.2 3D Feature Volume Construction

We first extract 2D features $\bm{F}_t = \mathcal{E}_{2D}(\bm{I}_t) \in \mathbb{R}^{d \times h \times w}$ for each input view using a shared CNN encoder $\mathcal{E}_{2D}$, where $d$ is the feature dimension. These 2D features are then projected into 3D voxel space as follows:

$$\bm{F}_v = \mathcal{E}_{3D}(\texttt{aggr}(\{\Pi_{\texttt{proj}}(\bm{F}_t, \bm{P}_t)\}_{t=1}^{M})), \tag{2}$$

where $\Pi_{\texttt{proj}}$ back-projects each view feature $\bm{F}_t$ using the known camera intrinsics and extrinsics, generating 3D feature volumes of size $d \times v_x \times v_y \times v_z$. These feature volumes are aggregated over all input views using an average operation as in [32, 42]. Finally, a 3D CNN-based network $\mathcal{E}_{3D}$ is employed to refine the aggregated feature volume, resulting in a final voxel representation $\bm{F}_v$.
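As a rough illustration of Eq. 2, the sketch below back-projects per-view feature maps into a voxel grid and averages them. It is a minimal NumPy sketch under several assumptions that are not stated in the text: nearest-pixel sampling, averaging only over the views in which a voxel is visible, projection matrices expressed in feature-map pixel coordinates, and hypothetical function names (`back_project`, `fuse_views`). The 3D CNN refinement $\mathcal{E}_{3D}$ is omitted.

```python
import numpy as np

def back_project(feat, P, grid_pts, grid_shape):
    """Sample a 2D feature map at the projections of voxel centres.

    feat: (d, h, w) feature map of one view.
    P: (3, 4) projection matrix K[R|T], assumed to map world points to
       pixel coordinates of the feature map (assumption).
    grid_pts: (V, 3) voxel centres in world coordinates, V = vx*vy*vz.
    Returns a (d, vx, vy, vz) volume and a (vx, vy, vz) visibility mask.
    """
    d, h, w = feat.shape
    homo = np.concatenate([grid_pts, np.ones((len(grid_pts), 1))], axis=1)
    uvw = homo @ P.T                                   # (V, 3) homogeneous pixels
    depth = uvw[:, 2]
    safe = np.where(np.abs(depth) < 1e-8, 1e-8, depth)  # avoid division by zero
    u = np.rint(uvw[:, 0] / safe).astype(int)           # nearest-pixel lookup
    v = np.rint(uvw[:, 1] / safe).astype(int)
    valid = (depth > 1e-8) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    vol = np.zeros((d, len(grid_pts)))
    vol[:, valid] = feat[:, v[valid], u[valid]]
    return vol.reshape(d, *grid_shape), valid.reshape(grid_shape)

def fuse_views(feats, Ps, grid_pts, grid_shape):
    """Average the back-projected volumes over the views that see each voxel."""
    total, count = 0.0, 0.0
    for feat, P in zip(feats, Ps):
        vol, valid = back_project(feat, P, grid_pts, grid_shape)
        total = total + vol
        count = count + valid
    # Voxels seen by no view keep a zero feature.
    return total / np.maximum(count, 1)
```

A learned model would typically use bilinear sampling so that gradients flow to the 2D encoder; nearest-pixel lookup is used here only to keep the example short.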

We use volume rendering [31] to reconstruct the geometry and appearance of the scene. We implement the rendering operation as in [24], where for each 3D query point on a ray, we retrieve its corresponding features by bilinearly interpolating between the neighboring voxel grids. Specifically, we first apply two-layered $1 \times 1 \times 1$ convolution blocks, $\alpha_c$ and $\alpha_\sigma$, to obtain the color and density volumes $(\bm{V}_c, \bm{V}_\sigma)$, respectively. Then, the pixel-wise color and density maps are composed by integrating along a camera ray using the volume renderer $\mathcal{R}$. Following this, we compute the L2 image reconstruction loss $\mathcal{L}^{\text{im}}_t$. Formally, the image rendering and its corresponding loss function for a single viewpoint $\bm{P}_t$ are shown below:

$$[\hat{\bm{I}}_t, \hat{\bm{I}}_{t\sigma}] = \mathcal{R}([\bm{V}_c, \bm{V}_\sigma], \bm{P}_t), \quad \mathcal{L}^{\text{im}}_t = \|\bm{I}_t - \hat{\bm{I}}_t\|^2 + \lambda_\sigma \|\bm{I}_{t\sigma} - \hat{\bm{I}}_{t\sigma}\|^2, \tag{3}$$

where $\hat{\bm{I}}_t$ and $\hat{\bm{I}}_{t\sigma}$ are the rendered image and mask, and $\lambda_\sigma$ is a loss weight.
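To make Eq. 3 concrete, the following sketch composites color and density samples along a single camera ray and evaluates the image loss. It assumes the standard emission-absorption quadrature commonly used for volume rendering; the text does not spell out the exact renderer $\mathcal{R}$, so the compositing rule, the sample spacing `deltas`, and the function names are illustrative assumptions.

```python
import numpy as np

def composite_ray(colors, sigmas, deltas):
    """Alpha-composite samples along one ray (assumed quadrature rule).

    colors: (S, 3) colors sampled from V_c at S points along the ray.
    sigmas: (S,) non-negative densities sampled from V_sigma.
    deltas: (S,) distances between consecutive samples.
    Returns the rendered pixel color (3,) and the accumulated opacity,
    which plays the role of the rendered mask value.
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability that the ray reaches sample s unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights.sum()

def image_loss(I, I_hat, I_sigma, I_sigma_hat, lam_sigma=1.0):
    """Squared L2 image loss plus weighted squared L2 mask loss (Eq. 3)."""
    return float(np.sum((I - I_hat) ** 2)
                 + lam_sigma * np.sum((I_sigma - I_sigma_hat) ** 2))
```

For example, a ray whose first sample is empty and whose second sample is opaque renders the second sample's color with an opacity of one.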

We propose to improve the 3D voxel representation $\bm{F}_v$ by also reconstructing neural features instead of just color and density. We supervise the feature reconstruction with a pretrained 2D image encoder $\Phi$ as a teacher network. We choose DINOv2 [33] as the teacher network due to its excellent ability to capture various object geometries and correspondences.

We use a projector function $\beta$, implemented as a four-layered $1 \times 1 \times 1$ convolution block, that projects $\bm{F}_v$ to a neural feature field $\bm{V}_f$, changing the channel dimension from $d$ to $d_f$. Similar to color rendering, we generate rendered features $\hat{\Phi}_t$ of size $d_f \times h_f \times w_f$ at a given viewpoint using the volume renderer $\mathcal{R}$. The objective is to minimize the difference between the rendered features and the teacher’s features $\Phi(\bm{I}_t)$. We choose cosine distance as our feature loss ($\mathcal{L}^{\text{feat}}_t$), which we find easier to optimize than the standard L2 loss. We apply stop-gradient to the density when rendering the features $\hat{\Phi}_t$, as the teacher’s features are not fully multi-view consistent [3], which could harm the quality of the reconstructed geometry. The final reconstruction loss is the sum of all image and feature reconstruction losses, with the feature term weighted by a factor $\lambda_f$:

$$\mathcal{L}^{r} = \sum_{t=1}^{M} (\mathcal{L}^{\text{im}}_t + \lambda_f \mathcal{L}^{\text{feat}}_t). \tag{4}$$

The key benefits of reconstructing DINOv2 features are twofold. 1) Distilling features from general-purpose feature extractors pre-trained on large external datasets incorporates open-world knowledge into the 3D representation. This enables our model to perform significantly better on unseen object instances or even on novel categories, as demonstrated in the experiments. 2) The distillation enforces a consistent 3D scene representation, leading to identical features for the same object geometries. This enables the model to infer robust local correspondences (see Fig. 4), which aids in fine-grained object matching.
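The cosine feature loss and the combined reconstruction loss of Eq. 4 can be sketched as follows. This is an illustrative NumPy version, assuming that the cosine distance is averaged over feature-map pixels (the text does not specify the reduction); the function names are hypothetical. The stop-gradient on density described above has no counterpart in plain NumPy and is therefore not shown.

```python
import numpy as np

def cosine_feature_loss(rendered, teacher, eps=1e-8):
    """Mean cosine distance between rendered and teacher features.

    rendered, teacher: (d_f, h_f, w_f) feature maps for one view.
    The mean over pixels is an assumed reduction.
    """
    dot = (rendered * teacher).sum(axis=0)
    norms = np.linalg.norm(rendered, axis=0) * np.linalg.norm(teacher, axis=0)
    return float(np.mean(1.0 - dot / (norms + eps)))

def reconstruction_loss(im_losses, feat_losses, lam_f):
    """Eq. 4: sum over views of image loss plus weighted feature loss."""
    return sum(l_im + lam_f * l_feat
               for l_im, l_feat in zip(im_losses, feat_losses))
```

Because the cosine distance ignores feature magnitude, it only asks the rendered features to point in the same direction as the teacher's, which is one plausible reason it is easier to optimize than an L2 objective.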

3.3 Object-centric 3D Features Extraction

We extract bounding box regions of objects using a box estimator $\mathcal{B}$. First, we obtain a voxel reconstruction of the scene by applying a threshold to the predicted density $\bm{V}_\sigma$. Then, we employ DBSCAN [20], a density-based clustering method, to retrieve all bounding box regions $\{\bm{b}_n\}_{n=1}^{N}$ corresponding to the objects in the scene. Next, we obtain object-centric feature volumes $\{\bm{z}_n\}_{n=1}^{N}$, each of size $8 \times 8 \times 8$, from these regions by applying RoI pooling [17].
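The box estimation step can be approximated with a few lines of NumPy. Note that this sketch swaps in simpler techniques than the ones named above: 6-connected component labelling stands in for DBSCAN, and nearest-neighbour resampling stands in for RoI pooling. The threshold value and the function names are hypothetical.

```python
import numpy as np
from collections import deque

def boxes_from_density(density, thresh):
    """Threshold a (vx, vy, vz) density volume and return one axis-aligned
    box (min_index, max_index) per 6-connected occupied component.
    Connected components are a simplified stand-in for DBSCAN."""
    occ = density > thresh
    seen = np.zeros_like(occ, dtype=bool)
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    boxes = []
    for start in zip(*np.nonzero(occ)):
        if seen[start]:
            continue
        seen[start] = True
        queue, members = deque([start]), []
        while queue:                       # breadth-first flood fill
            p = queue.popleft()
            members.append(p)
            for s in steps:
                q = tuple(int(a + b) for a, b in zip(p, s))
                inside = all(0 <= c < n for c, n in zip(q, occ.shape))
                if inside and occ[q] and not seen[q]:
                    seen[q] = True
                    queue.append(q)
        members = np.array(members)
        boxes.append((members.min(axis=0), members.max(axis=0)))
    return boxes

def crop_resize(feat_vol, box, size=8):
    """Resample the box region of a (d, vx, vy, vz) feature volume to
    (d, size, size, size) by nearest-neighbour lookup (stand-in for RoI pooling)."""
    lo, hi = box
    ix, iy, iz = [np.rint(np.linspace(l, h, size)).astype(int)
                  for l, h in zip(lo, hi)]
    return feat_vol[:, ix[:, None, None], iy[None, :, None], iz[None, None, :]]
```

With well-separated objects, connected components and DBSCAN yield the same clusters; DBSCAN is more robust when the thresholded density contains isolated noisy voxels.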

To extract correspondence indices given two objects with feature volumes $\bm{z}_n$ and $\bm{z}_m$, we formally define a function $\mathcal{C}_k$:

$$\mathcal{C}_k(\bm{z}_n, \bm{z}_m) = \texttt{top}_k[\beta(\bm{z}_n)^T \beta(\bm{z}_m)]. \tag{5}$$

The function returns the top-$k$ most relevant feature locations in $\bm{z}_m$ for each voxel location in $\bm{z}_n$. This is achieved by first projecting each voxel feature using the same projector function $\beta$ and then computing their voxel-level pairwise similarity. We use $\mathcal{C}_k$ to perform sparse attention-based comparisons between multiple object volumes, as described next.
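A minimal NumPy sketch of the correspondence function in Eq. 5 is given below. It assumes the object volumes have already been projected by $\beta$ and flattened to one row per voxel; the function name and the row-major layout are illustrative choices, and the similarity is the plain inner product written in Eq. 5.

```python
import numpy as np

def topk_correspondences(f_n, f_m, k):
    """Top-k most similar voxels of object m for every voxel of object n.

    f_n: (V_n, d_f) projected voxel features beta(z_n), one row per voxel.
    f_m: (V_m, d_f) projected voxel features beta(z_m).
    Returns a (V_n, k) integer array of indices into f_m, best match first.
    """
    sim = f_n @ f_m.T                        # (V_n, V_m) pairwise similarity
    return np.argsort(-sim, axis=1)[:, :k]   # sort descending, keep k best
```

For an $8 \times 8 \times 8$ volume there are 512 voxels per object, so the similarity matrix has $512 \times 512$ entries per object pair, and only $k$ columns per row are kept for the attention step.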

[Inset figure: sparse voxel attention computed among corresponding voxel locations across object volumes.]

3.4 Cross-instance Matching

The cross-instance matching module $\mathcal{M}$ takes the object features $\{\bm{z}_n\}_{n=1}^{N}$, learns to correlate them using sparse voxel attention, and subsequently predicts their object-specific labels.

The vanilla self-attention module in standard transformers [19] uses all tokens for the attention computation, which is inefficient for our task and may introduce noisy interactions with irrelevant features, potentially degrading performance. To overcome this, we compute the sparse voxel attention only among geometrically corresponding voxel locations, as shown in the inset.

Let $\bm{z}_n[i] \in \mathbb{R}^{d}$ denote the $i$-th voxel of the $n$-th object volume in the scene. The query, key, and value embeddings are calculated using linear projections as:

$$\bm{Q}_n[i] = \bm{W}^{Q} \bm{z}_n[i], \quad \bm{K}_n[i] = \bm{W}^{K} \bm{z}_n[i], \quad \bm{V}_n[i] = \bm{W}^{V} \bm{z}_n[i], \tag{6}$$

where the weights $\bm{W}^{Q}$, $\bm{W}^{K}$ and $\bm{W}^{V}$ are shared across all objects.

Let $c_k^{nm}[i]$ denote the set of all corresponding voxel indices obtained using Eq. 5. Then, the attention is calculated as:

$$\bar{\bm{z}}_n[i] = \sum_{\substack{m=1 \\ m \neq n}}^{N} \sum_{j \in c_k^{nm}[i]} \text{softmax}\left(\frac{\bm{Q}_n[i] \bm{K}_m[j]}{\sqrt{d}}\right) \bm{V}_m[j]. \tag{7}$$
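Eqs. 6 and 7 can be sketched in NumPy as below. Two points are assumptions rather than statements from the text: the softmax is normalized jointly over all $(N-1)k$ gathered keys of a query voxel, and the correspondence features `f` (standing for $\beta(\bm{z}_n)$) are passed in separately. The function name is hypothetical, a single attention head is used, and at least two objects are required.

```python
import numpy as np

def sparse_voxel_attention(z, f, Wq, Wk, Wv, k):
    """Sparse cross-instance attention over corresponding voxels.

    z: (N, V, d) object volumes, each flattened to V voxels (N >= 2).
    f: (N, V, d_f) projected features used to find correspondences (Eq. 5).
    Wq, Wk, Wv: (d, d) projection weights shared across objects (Eq. 6).
    Returns updated volumes of shape (N, V, d).
    """
    N, V, d = z.shape
    Q, Kmat, Vmat = z @ Wq.T, z @ Wk.T, z @ Wv.T
    out = np.zeros_like(Q)
    for n in range(N):
        keys, vals = [], []
        for m in range(N):
            if m == n:
                continue                                 # skip self (m != n)
            idx = np.argsort(-(f[n] @ f[m].T), axis=1)[:, :k]   # (V, k)
            keys.append(Kmat[m][idx])                    # (V, k, d)
            vals.append(Vmat[m][idx])
        keys = np.concatenate(keys, axis=1)              # (V, (N-1)k, d)
        vals = np.concatenate(vals, axis=1)
        logits = np.einsum('vd,vjd->vj', Q[n], keys) / np.sqrt(d)
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)                # joint softmax (assumed)
        out[n] = np.einsum('vj,vjd->vd', w, vals)
    return out
```

Each query voxel attends to $(N-1)k$ keys instead of all $(N-1)V$ voxels of the other objects, which is where the efficiency gain over vanilla self-attention comes from.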

The updated feature volume $\bar{\bm{z}}_n$ is passed through 3D CNN blocks to downsample it by a factor of $1/8$; the result is then reshaped into a vector and fed to a 2-layer MLP that outputs the final prediction $\hat{y}_n$. The classification loss is calculated as:

$$\mathcal{L}^{\text{bce}} = \sum_{n=1}^{N} \ell_{\text{bce}}(\hat{y}_n, y_n), \tag{8}$$

where $\ell_{\text{bce}}$ is the binary cross-entropy loss function. The total training loss of our framework is $\mathcal{L} = \mathcal{L}^{\text{bce}} + \lambda_p \mathcal{L}^{r}$, where $\lambda_p$ is a loss weight. We employ stage-wise training: first pretraining with only the reconstruction loss ($\mathcal{L}^{r}$), followed by end-to-end training with both losses.
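The loss of Eq. 8 and the stage-wise objective can be written compactly as follows. This is only a sketch: the stage names `"pretrain"` and `"joint"`, the clipping constant, and the function names are illustrative assumptions, and predictions are assumed to be probabilities in $[0, 1]$.

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-7):
    """Eq. 8: binary cross-entropy summed over the N objects of a scene."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(-(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)).sum())

def training_loss(y_hat, y, loss_r, lam_p, stage):
    """Stage-wise objective: reconstruction only, then both losses."""
    if stage == "pretrain":
        return loss_r
    return bce_loss(y_hat, y) + lam_p * loss_r
```

For instance, two maximally uncertain predictions of 0.5 contribute $2 \ln 2$ to the classification term regardless of the labels.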

4 Experiments

4.1 Datasets

As no prior dataset exists for the task, we propose two challenging scene datasets, both essential for thorough evaluation: ToysAD-8K and PartsAD-15K. ToysAD-8K includes real-world objects from multiple categories. This allows us to evaluate our model’s ability to generalize to unseen object categories. PartsAD-15K comprises a more diverse collection of mechanical object parts with arbitrary shapes, thus being free from any class-level inductive biases. Both datasets include a wide range of fine-grained anomaly instances motivated by real-world applications in inspection and quality control. Scenes are generated with diverse backgrounds, illuminations, and camera viewpoints using photo-realistic ray tracing [43]. Next, we discuss the data generation for both datasets.

ToysAD-8K. We begin with a subset of 1050 shapes from the Toys4K dataset [41]. The subset covers a wide range of objects from 51 categories. We automatically create anomalies for a given 3D shape by applying various deformations to both the geometry and texture. This includes generating realistic cracks and fractures using [39], applying random geometric deformations [43] such as bumps, bends, and twists, as well as randomly translating, rotating, and swapping materials in different parts of the shapes. In total, we generated 2345 anomaly shapes. To generate each scene, we first randomly choose a set of objects consisting of normal copies of an instance and anomalous versions of the same instance. We ensure that the majority of objects in each scene are normal. We also include a few scenes in which no anomaly is present. The rotational poses of the objects are obtained using rigid body simulation [4]. The objects are scaled and placed into the scene at random locations such that no collisions occur. We generated a total of 8K scenes. Each scene consists of 3-6 objects rendered in 20 views. For the training set, we randomly select 5K scenes from 39 categories. We build two disjoint test sets. The first one (seen) contains 1K scenes from the seen categories but with unseen object instances. The second one (unseen) contains 2K scenes from the remaining 12 novel categories.
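The scene composition constraints can be illustrated with a short sketch. It only covers the label layout of a scene, not the rendering, and it assumes that the number of anomalies is drawn uniformly, which the text does not specify (the text only states that normal objects form the majority and that a few scenes contain no anomaly). The names `sample_scene_labels` and `SPLITS` are hypothetical.

```python
import random

def sample_scene_labels(rng, min_objects=3, max_objects=6):
    """Return a shuffled list of labels (0 = normal, 1 = anomaly) for one scene."""
    n = rng.randint(min_objects, max_objects)   # 3-6 objects per ToysAD-8K scene
    max_anomalies = (n - 1) // 2                # strictly fewer anomalies than normals
    n_anom = rng.randint(0, max_anomalies)      # 0 yields an anomaly-free scene
    labels = [1] * n_anom + [0] * (n - n_anom)
    rng.shuffle(labels)
    return labels

# Split sizes quoted in the text: 5K train + 1K seen test + 2K unseen test.
SPLITS = {"train": 5000, "test_seen": 1000, "test_unseen": 2000}
```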

PartsAD-15K. We use a subset of the ABC dataset [27] that consists of 4200 shapes. We follow the strategy above to generate anomalies. Additionally, for each shape, we sample geometrically close instances from the dataset and assign them as anomalies to use in the same scene. This approach generates a large set of high-quality anomalies that closely resemble their normal counterparts in high-level geometry, but with subtle shape variations that make them anomalous in the context of the normal ones. In total, we generated 10,203 anomaly shapes. Using these shapes, we created 15K scenes, each consisting of 3 to 12 objects rendered from 20 different viewpoints. We divide the dataset into a 12K training set and use the rest as the test set. Note that the proposed datasets, source code, and models will be made public upon publication.

4.2 Implementation details

We use a ResNet50-FPN [30] as our 2D encoder backbone. Our 3D backbone consists of a four-scale encoder-decoder-based 3D CNN [32]. We use $M=5$ images as input, each with a resolution of $256\times 256$. Our architecture can accept a different $M$ during inference. During training, we consider a total of $2M$ views, which we separate into two sets, each containing $M$ views. We use one set to build the neural volume and the cameras of the other set to render the results, and vice versa. The 3D volume $\bm{F}_{v}$ contains $96\times 96\times 16$ voxels with a voxel size of 4 cm. We sample 128 points on each ray for rendering. We render the features with a spatial dimension of $32\times 32$. The $I_{t\sigma}$ in Eq. 3 is the ground truth segmentation mask of the input image; we only use it for training. The threshold applied to the density volume $\bm{V}_{\sigma}$ is set to 0.2, and we run the DBSCAN algorithm with its default parameters. We resize the teacher’s (DINOv2) features to the same spatial dimension for loss computation. These features are pre-computed for all scenes using publicly available weights, which are not updated during distillation. The DINOv2 features are then reduced to $d_{f}=128$ channels using PCA before distillation. We employ three sparse voxel attention blocks, each applying 8-headed attention. The value of $k$ is set to 20.
The loss weights $\lambda_{\sigma}$, $\lambda_{f}$, and $\lambda_{p}$ are all set to 1. We first pretrain the network with only the reconstruction loss for 50 epochs. Then, we train the network end-to-end with both the reconstruction loss and the binary classification loss for another 50 epochs. We use a batch size of 4 and the Adam optimizer with a learning rate of $2\times 10^{-5}$. Finally, the run-time of our method is 65 ms on a single A40 GPU for a typical scene with 5 views as input.
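Two details of this setup lend themselves to a small worked example: the metric extent implied by the voxel grid, and the split of $2M$ training views into a volume-building set and a rendering set. The sketch below is illustrative; the class and function names are hypothetical, and drawing the $2M$ views at random from the 20 rendered views of a scene is an assumption.

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class VolumeConfig:
    grid: tuple = (96, 96, 16)    # voxels along x, y, z
    voxel_size_m: float = 0.04    # 4 cm voxels

    def extent_m(self):
        """Metric size of the volume along each axis, in meters."""
        return tuple(round(g * self.voxel_size_m, 2) for g in self.grid)

def split_views(view_ids, m, rng):
    """Pick 2M views and split them into a build set and a render set of M views each."""
    if len(view_ids) < 2 * m:
        raise ValueError("need at least 2M views")
    chosen = rng.sample(list(view_ids), 2 * m)
    return chosen[:m], chosen[m:]
```

With the values quoted above, the grid covers 3.84 m by 3.84 m by 0.64 m. The two returned sets are disjoint, and their roles can be swapped to obtain the "vice versa" pass.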

4.3 Baseline Comparisons

We quantitatively evaluate the anomaly classification results using two evaluation metrics: the area under the ROC curve (AUC) and accuracy. A prediction is considered correct if the bounding box IoU is greater than 0.5 and the corresponding anomaly classification is correct. We do not report a separate localization metric, as our estimated bounding boxes are very accurate. These metrics are calculated object-wise and then averaged across all test scenes. Since no prior work exists on this task, we define several competitive baselines for comparison. Specifically, in Tab. 1, we compare our method with two families of approaches: a reconstruction-based baseline and two multi-view 3D object detection methods. For the reconstruction-based baseline, we first employ COLMAP [38] to obtain a point cloud reconstruction of the multi-object scene. We use the default dense reconstruction parameters but utilize the provided ground truth camera matrices. Since 5 views are insufficient, we use a total of 20 views to reconstruct the scene. We extract individual object point clouds from the reconstructed scenes and train a Siamese-style network to obtain their pairwise similarity. We use DGCNN [49] (pretrained on ShapeNet [14]) as the point cloud feature extractor and the triplet loss [22] to supervise the network. Finally, we employ a voting strategy to aggregate the pairwise distances of all objects in the scene and obtain their individual binary labels. We observe that the accuracy of this method is sensitive to the reconstruction quality, leading to poor performance on both datasets. Next, we adapt ImVoxelNet [37] and DETR3D [48], two multi-view object detection frameworks, to our problem setting, aiming to locate each object in the scene and classify it as either anomalous or normal. ImVoxelNet uses a 2D-3D projection similar to ours to construct the voxel representation, and a 3D detection head [29] then outputs the final prediction. 
DETR3D is a transformer-based design and uses the set prediction loss [9] for end-to-end detection without NMS. We use the ground truth 3D bounding boxes to train these models, and the reported scores are calculated object-wise based on the output of the classification head. We observe that while both methods perform well for large cracks or fractures, they struggle when intra-group comparison is necessary. This is because they tend to memorize certain anomaly types without learning to compare each object with the other objects in the scene. Our method significantly outperforms all baselines (Tab. 1) on both datasets, highlighting the effectiveness of our dedicated architecture for matching corresponding regions. Notably, the performance drop on the unseen set is smaller in our case, which we attribute to the robust 3D representation that generalizes effectively to novel categories. We show qualitative results of our method in Fig. 3.
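The evaluation protocol described above can be sketched as follows. This is a simplified illustration, not the evaluation code used for Tab. 1: it assumes axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), a 0.5 decision threshold on the anomaly score, and a rank-based AUC; the per-scene averaging is omitted and all function names are hypothetical.

```python
def box_volume(b):
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[:3], a[3:], b[:3], b[3:]):
        inter *= max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, score, label, iou_thr=0.5, score_thr=0.5):
    """Correct only if the box overlaps enough AND the anomaly decision matches."""
    return iou_3d(pred_box, gt_box) > iou_thr and (score >= score_thr) == bool(label)

def auc(scores, labels):
    """Probability that a random anomaly outranks a random normal (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one anomaly and one normal object")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```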

Table 1: Quantitative results. We compare our method to three related works in two datasets and report the results in terms of anomaly detection AUC and accuracy.
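| Datasets | COLMAP [38] AUC | COLMAP [38] Accuracy | ImVoxelNet [37] AUC | ImVoxelNet [37] Accuracy | DETR3D [48] AUC | DETR3D [48] Accuracy | Ours AUC | Ours Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ToysAD-8K-Seen | 73.45 | 60.48 | 78.13 | 65.55 | 79.16 | 67.37 | 91.78 | 83.21 |
| ToysAD-8K-Unseen | 72.86 | 58.12 | 73.19 | 60.12 | 74.60 | 62.98 | 89.15 | 81.57 |
| PartsAD-15K | 72.78 | 61.34 | 72.80 | 64.34 | 74.49 | 65.11 | 86.12 | 79.68 |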
Figure 3: AD qualitative results on (a) the unseen test categories of ToysAD-8K and (b) the test set of PartsAD-15K using our proposed framework. The green box denotes correct predictions, while the red box indicates where the model misses the anomaly. Due to limited space, one view is shown per scene. Please refer to the supplementary materials for more results.

4.4 Ablations and Model Analysis

We conduct several studies into the performance of the proposed model including changes to architecture, robustness analysis and real-world experiments. Unless stated otherwise, all experiments in this section are carried out on the ToysAD-8K unseen set using 5 input views.

Figure 4: Correspondences are obtained in 3D space using the neural feature field $\bm{V}_{f}$, then projected onto 2D views using camera matrices for visualization. The feature field is rendered at the respective viewpoints and visualized using its first 3 PCA components, each mapped to a different color channel.

Ablation of architecture design. We ablate the core components in our model and report the results in Tab. 2. All variants include the 3D feature fusion module and are optimized at least for the image reconstruction loss, which is essential for constructing scene geometry. Variant A directly maps object-centric features to their binary labels using an MLP without comparing them, while for variant B, standard attention layers are applied to the object-centric features to learn cross-object correlations. Despite the attention layers, B only shows a slight improvement (+2.1% AUC) over A.

Table 2: Ablation Results on ToysAD-8K
| Variants | AUC | Accuracy |
| --- | --- | --- |
| A: baseline | 79.13 | 66.78 |
| B: A + vanilla attention | 81.24 | 68.20 |
| C: B + feature distillation | 87.05 | 79.56 |
| Final: C + sparse voxel attention | 89.15 | 81.57 |

We then introduce DINOv2 feature distillation (variant C), which significantly boosts performance over B by 5.8% AUC and 11.4% accuracy, indicating the importance of the part-aware 3D representation and correspondences for our task (also see Fig. 4). Our final design achieves a further gain (+2.1% AUC) by utilizing sparse voxel attention, which focuses on the top-$k$ most relevant features. This leverages the robust correspondences learned through DINOv2 feature distillation, effectively eliminating noisy correlations and directing attention solely to corresponding object regions.
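The contribution of each component follows from Table 2 by subtracting consecutive rows. The short sketch below reproduces that arithmetic from the reported numbers; the helper name `step_gains` is illustrative.

```python
# (variant, AUC, accuracy) rows as reported in Table 2.
ABLATION = [
    ("A: baseline", 79.13, 66.78),
    ("B: A + vanilla attention", 81.24, 68.20),
    ("C: B + feature distillation", 87.05, 79.56),
    ("Final: C + sparse voxel attention", 89.15, 81.57),
]

def step_gains(rows):
    """Return (name, AUC gain, accuracy gain) of each variant over the previous one."""
    return [
        (cur[0], round(cur[1] - prev[1], 2), round(cur[2] - prev[2], 2))
        for prev, cur in zip(rows, rows[1:])
    ]

for name, d_auc, d_acc in step_gains(ABLATION):
    print(f"{name}: +{d_auc:.2f} AUC, +{d_acc:.2f} accuracy")
```

The feature distillation step accounts for the largest jump in both metrics.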

Robustness. Our method achieves some robustness to occlusion by effectively using input from multiple viewpoints. Fig. 7 illustrates examples where our model accurately identifies the anomalous region, even when occluded in some views. In Fig. 5 (left), we show how well our model can perform with additional or reduced input views at test time. We train our model with 5 views and then evaluate it using 1, 3, 5, 10 and 20 views. These results clearly indicate that our model can perform reasonably well with only 3-5 views; however, additional views can be used to boost performance at test time.

Figure 5: Impact of number of views (left) and object count (right) on model performance.

We analyze the impact of object count on model performance (see Fig. 5 (right)) using the PartsAD-15K dataset. Increasing the number of objects in a scene usually leads to more occlusion and lower resolution for each individual object. However, the performance drop between the two extremes is minimal (< 1.5% AUC) as shown in the figure. In addition, we investigate the model’s ability to generalize to more objects at test time than it observed during training. To this end, we trained our model on scenes with 3-7 objects and tested on two sets: one with 3-7 objects (AUC: 86.75) and another with 8-12 objects (AUC: 85.10). These results demonstrate our model’s adaptability to varying object counts.

In another study, we test our primary objective of choosing the majority group as normal and the rest as anomalies using an example shown in Fig. 6. We create five scenes (a-e) from two geometrically similar objects, gradually increasing the count of the first object (from 1 to 5, respectively) while maintaining a total of six objects. For example, in scene (a), the first object appears only once, making it an anomaly. Similarly, in scene (e), the second object appears only once and is hence considered an anomaly. We use the same background across all scenes to ensure uniformity. As shown in the figure, our model correctly classifies the anomalies in each scene. We note that scene (c) presents an ambiguous case, where both objects appear in equal numbers. Despite this, our model is able to separate the two groups.
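The labeling rule tested in this study, majority as normal and the rest as anomalies, can be written down directly. The sketch below describes the ground-truth rule only, not the model; it assumes the group identity of each object is known, and returning `None` for a tie is an illustrative choice for the ambiguous case (c).

```python
from collections import Counter

def minority_as_anomalies(group_ids):
    """Label members of the largest group 0 (normal) and all others 1 (anomaly).

    Returns None when two groups tie for the largest size, since the
    majority is then undefined.
    """
    counts = Counter(group_ids).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    majority = counts[0][0]
    return [0 if g == majority else 1 for g in group_ids]
```

For a six-object scene, one object of the first type and five of the second gives a single anomaly, while three of each gives no well-defined labels.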

Real world testing. Here we apply our model, which is trained on the synthetic dataset, to a small set of real test scenes. Each scene is set up in an indoor environment with adequate lighting and is captured using 3D scanning software [1]. This results in a set of input views with globally optimized cameras. Fig. 8 illustrates the results for three such scenes.

Limitations. Our benchmark and model have a few limitations. First, this work focuses on a limited set of anomalies that are common in manufacturing scenarios and may therefore miss other anomalies that occur in real-world settings. Since acquiring real damaged objects is expensive and difficult, our dataset primarily uses synthetic object shapes. Our model also assumes that object instances are rigid and cannot handle articulations or deformations, and that objects are neither touching nor completely occluded. In addition, a scene with mostly anomalies can be challenging for our model due to the lack of ‘normal’ data points for comparison (except for fractures or cracks). The performance also depends on the anomalous region being captured in at least one view. Finally, in real-world testing, noisy camera poses as well as the gap between synthetic and real-world environments may degrade the performance.

Figure 6: We create five scenes (a-e) with two similar objects, varying their counts per scene while maintaining a constant total. Our model correctly identifies minorities as anomalies in all cases except (c), where the equal number of objects creates ambiguity. Our model still selects one group.
Figure 7: Resolving (a) occlusion and (b) 3D ambiguity using multi-view images. The anomaly ‘sheep’ in (a) has a missing tail (only visible in the last view due to occlusion), and the ‘hammer’ handle in (b) is bent (only apparent from the last view-angle due to 3D ambiguities).
Figure 8: Three real-world scenes are tested using our method and it can successfully detect all anomalous instances. Five input views are used for testing, while one view for each scene is used for visualization.

5 Conclusion

In this paper, we have introduced a novel AD problem inspired by real-world applications along with two new benchmarks. The proposed task goes beyond the traditional AD setting and involves a cross-examination of objects in a scene from multiple camera viewpoints to identify the ‘odd-looking’ minority group. We show that our model is robust to varying numbers of views and objects, and outperforms the baselines that do not consider cross-object correlations.

Broader Impacts.

The proposed techniques could be potentially used in improving quality checks and product safety by providing warnings to the users in production lines. The authors are not aware of any potential harm that may arise when the technology is used.

References

  • [1] Polycam. https://github.com/PolyCam/polyform.
  • [2] Faruk Ahmed and Aaron Courville. Detecting semantic anomalies. In AAAI, 2020.
  • [3] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. arXiv preprint arXiv:2404.08636, 2024.
  • [4] David Baraff. Physically based modeling: Rigid body simulation. SIGGRAPH Course Notes, ACM SIGGRAPH, 2(1):2–1, 2001.
  • [5] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 2019.
  • [6] Paul Bergmann, Xin Jin, David Sattlegger, and Carsten Steger. The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization. arXiv preprint arXiv:2112.09045, 2021.
  • [7] Ankan Bhunia, Changjian Li, and Hakan Bilen. Looking 3d: Anomaly detection with 2d-3d alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [8] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. The fishyscapes benchmark: Measuring blind spots in semantic segmentation. IJCV, 2021.
  • [9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • [10] Diego Carrera, Fabio Manganini, Giacomo Boracchi, and Ettore Lanzarone. Defect detection in sem images of nanofibrous materials. IEEE Transactions on Industrial Informatics, 2016.
  • [11] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
  • [12] Robin Chan, Krzysztof Lis, Svenja Uhlemeyer, Hermann Blum, Sina Honari, Roland Siegwart, Pascal Fua, Mathieu Salzmann, and Matthias Rottmann. Segmentmeifyoucan: A benchmark for anomaly segmentation. arXiv preprint arXiv:2104.14812, 2021.
  • [13] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys, 2009.
  • [14] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [15] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14124–14133, 2021.
  • [16] Lucas Deecke, Lukas Ruff, Robert A Vandermeulen, and Hakan Bilen. Transfer-based semantic anomaly detection. In ICML, 2021.
  • [17] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, pages 1201–1209, 2021.
  • [18] Choubo Ding, Guansong Pang, and Chunhua Shen. Catching both gray and black swans: Open-set supervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7388–7398, 2022.
  • [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [20] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. Density-based spatial clustering of applications with noise. In Int. Conf. knowledge discovery and data mining, 1996.
  • [21] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
  • [22] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3, pages 84–92. Springer, 2015.
  • [23] Chaoqin Huang, Haoyan Guan, Aofan Jiang, Ya Zhang, Michael Spratling, and Yan-Feng Wang. Registration based few-shot anomaly detection. In European Conference on Computer Vision, pages 303–319. Springer, 2022.
  • [24] Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, and Yuke Zhu. Few-view object reconstruction with unknown categories and camera poses. arXiv preprint arXiv:2212.04492, 2022.
  • [25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • [26] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems, 35:23311–23330, 2022.
  • [27] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. Abc: A big cad model dataset for geometric deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [28] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal of computer vision, 38:199–218, 2000.
  • [29] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
  • [30] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [31] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • [32] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 414–431. Springer, 2020.
  • [33] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • [34] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM computing surveys, 2021.
  • [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [36] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, 2018.
  • [37] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022.
  • [38] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  • [39] Silvia Sellán, Jack Luong, Leticia Mattos Da Silva, Aravind Ramakrishnan, Yuchuan Yang, and Alec Jacobson. Breaking good: Fracture modes for realtime destruction. ACM Transactions on Graphics, 42(1):1–12, 2023.
  • [40] Xuepeng Shi, Qi Ye, Xiaozhi Chen, Chuangrong Chen, Zhixiang Chen, and Tae-Kyun Kim. Geometry-based distance decomposition for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15172–15181, 2021.
  • [41] Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021.
  • [42] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15598–15607, 2021.
  • [43] Blender Development Team. Blender (version 3.1.0) [computer software]. https://blender.org/, 2022.
  • [44] Anh Thai, Ahmad Humayun, Stefan Stojanov, Zixuan Huang, Bikram Boote, and James M Rehg. Low-shot object learning with mutual exclusivity bias. Advances in Neural Information Processing Systems, 36, 2024.
  • [45] Alex Trevithick and Bo Yang. Grf: Learning a general radiance field for 3d scene representation and rendering. In Proceedings of the IEEE/CVF international conference on computer vision, 2021.
  • [46] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In 2022 International Conference on 3D Vision (3DV), pages 443–453. IEEE, 2022.
  • [47] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  • [48] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
  • [49] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5):1–12, 2019.
  • [50] Jhih-Ciang Wu, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Learning unsupervised metaformer for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4369–4378, 2021.
  • [51] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose M Alvarez. M2BEV: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088, 2022.
  • [52] Guoyang Xie, Jinbao Wang, Jiaqi Liu, Feng Zheng, and Yaochu Jin. Pushing the limits of fewshot anomaly detection in industry vision: Graphcore. arXiv preprint arXiv:2301.12082, 2023.
  • [53] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
  • [54] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  • [55] Weihao Yuan, Xiaodong Gu, Heng Li, Zilong Dong, and Siyu Zhu. 3d former: Monocular scene reconstruction with 3d sdf transformers. arXiv preprint arXiv:2301.13510, 2023.
  • [56] Greg Zaal, Rob Tuytel, Rico Cilliers, James Ray Cock, Andreas Mischok, Sergej Majboroda, Dimitrios Savva, and Jurita Burger. Polyhaven: a curated public asset library for visual effects artists and game designers, 2021.
  • [57] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36, 2024.
  • [58] Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, and Hao Zhao. Pad: A dataset and benchmark for pose-agnostic anomaly detection. In NeurIPS, 2024.
  • [59] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.

Appendix A Appendix / supplemental material

This appendix is structured as follows: we present additional details of the proposed datasets in Sec. A.1, training details in Sec. A.2, and additional qualitative results in Sec. A.3.

A.1 Data Generation Details

Our proposed scene AD datasets, ToysAD-8K and PartsAD-15K, are built upon two publicly available 3D shape datasets: Toys4K [41] (Creative Commons and royalty-free licenses) and ABC [27] (MIT license). For ToysAD-8K, we selected 1,050 shapes from the Toys4K dataset, focusing on the most common real-world objects across 51 categories. A complete list of these categories is provided in Table 3. PartsAD-15K, on the other hand, is a non-categorical dataset, for which we randomly selected a subset of 4,200 shapes from the large-scale ABC dataset.

For both datasets, we consider the following anomaly types: cracks, fractures, geometric deformations (e.g., bumps, bends, and twists), translation, rotation, material mismatch, and missing parts. Additionally, for PartsAD-15K, we use a different but geometrically similar instance in the same scene as an anomaly. To obtain a geometrically similar shape, we use feature-based k-nearest-neighbor (KNN) retrieval. Specifically, we extract DINOv2 features from multi-view images (rendered from multiple fixed viewpoints) of the 3D shapes, concatenate these features, and use the resulting concatenated features to build a KNN index. Then, we retrieve a similar 3D shape by querying this index. To ensure the retrieved shape is geometrically similar, we calculate the Chamfer distance between the shapes and only accept a shape if the distance is below a certain threshold.
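The two-step retrieval described above (nearest neighbors in feature space, followed by a Chamfer distance check) can be sketched in pure Python. This is a brute-force illustration under stated assumptions: the per-shape feature vectors stand in for the concatenated DINOv2 features, the Chamfer distance uses mean squared nearest-neighbor distances, and the values of `k` and `max_chamfer` are hypothetical since the text does not report them.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(A, B):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance, both ways."""
    a_to_b = sum(min(sq_dist(p, q) for q in B) for p in A) / len(A)
    b_to_a = sum(min(sq_dist(q, p) for p in A) for q in B) / len(B)
    return a_to_b + b_to_a

def retrieve_similar(query_id, feats, points, k=5, max_chamfer=0.01):
    """Return ids of the k nearest shapes in feature space that pass the Chamfer check.

    feats:  dict shape_id -> feature vector (stand-in for concatenated DINOv2 features)
    points: dict shape_id -> list of 3D points sampled from the shape
    """
    others = [s for s in feats if s != query_id]
    nearest = sorted(others, key=lambda s: sq_dist(feats[query_id], feats[s]))[:k]
    return [s for s in nearest
            if chamfer_distance(points[query_id], points[s]) < max_chamfer]
```

The geometric check matters because appearance features alone can rank a shape as close even when its geometry differs substantially from the query.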

Table 3: Dataset composition of ToysAD-8K
| Split | Categories |
| --- | --- |
| Seen | dinosaur, fish, frog, monkey, light, lizard, orange, boat, dog, lion, pig, cookie, panda, chicken, orange, ice, horse, car, airplane, cake, shark, donut, hat, cow, apple, bowl, hamburger, octopus, giraffe, chess, bread, butterfly, cupcake, bunny, elephant, fox, deer, bus, bottle |
| Unseen | mug, plate, robot, glass, sheep, shoe, train, banana, cup, key, penguin, hammer |

Our anomaly generation process is automatic. To ensure the quality of the generated anomalies, we perform several checks. For example, during positional or rotational anomaly creation, if a part detaches from the main body during deformation, we reject the anomaly and try again with adjusted parameters. Similarly, if removing a part makes the shape impractical, we discard it and try removing a different part. For fracture anomalies, if a fracture removes more than 90% or less than 10% of an object, we discard the sample and generate another fracture. Finally, we ensure the anomalous region of an object is visible from at least one viewpoint.
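The fracture check is a simple rejection loop. The sketch below assumes that the removed fraction is measured by volume, which the text does not specify, and uses a hypothetical callback `make_fracture` that returns the volume removed by one candidate fracture; the retry limit is also an illustrative choice.

```python
def accept_fracture(removed_volume, total_volume, low=0.10, high=0.90):
    """Keep a fracture only if it removes between 10% and 90% of the object."""
    if total_volume <= 0:
        raise ValueError("total_volume must be positive")
    return low <= removed_volume / total_volume <= high

def generate_with_retries(make_fracture, total_volume, max_tries=10):
    """Call make_fracture() until a sample passes the check; return None if none does."""
    for _ in range(max_tries):
        removed = make_fracture()
        if accept_fracture(removed, total_volume):
            return removed
    return None
```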

To generate a realistic scene environment, we use PBR materials [44] for floors and HDRI environment maps [56] for image-based lighting. For each scene, we randomly select a PBR material and an HDRI environment map from the assets to randomize the background. Objects are placed randomly such that no collisions occur, and each object’s rotational pose is obtained using Blender’s rigid body simulation [43]. We use Blender 2.93 [43] with the Cycles ray-tracing renderer for photo-realistic rendering. Blender 2.93 is released under the GNU General Public License (GPL, or “free software”), and the PBR and HDRI maps are released under the CC0 license.
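Collision-free random placement is commonly done by rejection sampling, which the sketch below illustrates. It is an assumption about one way to realize the placement described here, not the pipeline's actual code: objects are approximated by bounding circles on the floor plane, and `extent`, `max_tries`, and `seed` are illustrative parameters. The rigid body simulation that settles rotational poses is not modeled.

```python
import math
import random

def place_objects(radii, extent=1.0, max_tries=1000, seed=0):
    """Sample a floor position for each object so that no bounding circles overlap.

    radii:  bounding-circle radius of each object (assumed approximation).
    extent: half-width of the square floor region centered at the origin.
    Returns a list of (x, y, radius) tuples in placement order.
    """
    rng = random.Random(seed)
    placed = []
    for r in radii:
        for _ in range(max_tries):
            x = rng.uniform(-extent + r, extent - r)
            y = rng.uniform(-extent + r, extent - r)
            # Accept only if the new circle is clear of every placed circle.
            if all(math.hypot(x - px, y - py) >= r + pr for px, py, pr in placed):
                placed.append((x, y, r))
                break
        else:
            raise RuntimeError("could not place object without collision")
    return placed
```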

Our framework can easily be trained with real-world manufacturing scene environments. Our model relies solely on 2D supervision, making the data collection process much easier. We also do not need precise annotation of 3D bounding boxes, as they are not used during training. For annotating instance-wise anomaly labels, we can use 2D bounding boxes, which can be projected into 3D using Visual Hull [28] and then used for coarse localization.
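The idea of lifting 2D boxes to a coarse 3D region can be sketched with a simple voxel-carving routine: a 3D point is kept only if it projects inside the annotated box in every view. This is an illustrative simplification under stated assumptions (known 3x4 camera matrices, a coarse candidate grid, boxes given as `(u_min, v_min, u_max, v_max)`); the function names are hypothetical and not taken from the paper.

```python
def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P; return (u, v), or None if behind the camera."""
    x, y, z = X
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:
        return None
    return u / w, v / w

def visual_hull_box(cameras, boxes, grid):
    """Intersect the back-projections of per-view 2D boxes over a set of candidate points.

    cameras: list of 3x4 projection matrices (nested lists), one per view.
    boxes:   list of (u_min, v_min, u_max, v_max), one per view.
    grid:    iterable of candidate 3D points.
    Returns (mins, maxs) of the surviving points, or None if the hull is empty.
    """
    def inside(uv, box):
        return uv is not None and box[0] <= uv[0] <= box[2] and box[1] <= uv[1] <= box[3]

    kept = [X for X in grid
            if all(inside(project(P, X), b) for P, b in zip(cameras, boxes))]
    if not kept:
        return None
    mins = tuple(min(p[i] for p in kept) for i in range(3))
    maxs = tuple(max(p[i] for p in kept) for i in range(3))
    return mins, maxs
```

The resulting axis-aligned box is only as tight as the views allow, which is consistent with using it for coarse localization rather than precise 3D annotation.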

A.2 Training Details

We train our model in two stages. In the first stage, we train it with just image and feature reconstruction losses. In the second stage, we train the model end-to-end with both reconstruction and binary classification losses. All experiments are performed on a single NVIDIA A40 GPU with a batch size of 4, utilizing 28GB of GPU memory. The first stage takes 36 hours to complete, followed by an additional 24 hours for the second stage.
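The two-stage schedule amounts to switching the classification term on in the second stage. The sketch below shows that switch only; the unweighted sum of the reconstruction terms and the `cls_weight` parameter are assumptions, since the loss weighting is not specified in this section.

```python
def total_loss(stage, image_loss, feature_loss, cls_loss=0.0, cls_weight=1.0):
    """Combine loss terms for the given training stage.

    Stage 1: image and feature reconstruction only.
    Stage 2: reconstruction plus binary classification (weight is an assumption).
    """
    reconstruction = image_loss + feature_loss
    if stage == 1:
        return reconstruction
    if stage == 2:
        return reconstruction + cls_weight * cls_loss
    raise ValueError(f"unknown training stage: {stage}")
```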

The ablation experiments are conducted on the same workstation with the same GPU by removing one or a few core components from the full method. Specifically, variant methods A and B take 36 hours to train the first stage, while only taking 14 hours to train the second stage. Variant method C takes a similar 36 and 24 hours for the two stages, as in the full method.

We compare our method with COLMAP (BSD license), ImVoxelNet (MIT license), and DETR3D (MIT license). For the COLMAP-based approach, we use DGCNN (MIT license) as a feature extractor. In terms of training time for these baseline methods, COLMAP takes 10 hours to train, ImVoxelNet takes 24 hours to train, and DETR3D takes 2 days to converge.

The standard deviation of all our experiments (including the ablations and our method) under multiple runs is less than 0.5.

A.3 Additional Qualitative Results

In Fig. 9 and Fig. 10, we present additional qualitative results on ToysAD-8K and PartsAD-15K, respectively.

Figure 9: Additional results on the unseen set of ToysAD-8K dataset. For rows 1 to 3, the anomalies are easy to spot and self-explanatory. In row 4, there are two anomalies: one with broken outer parts at the back (see view 5), and another with a tilted roof (see views 1 and 2). For row 5, one leg is broken (see view 2). In row 6, the eye is missing (see view 5).
Figure 10: Additional results on the PartsAD-15K dataset.