Localizing Anomalies via Multiscale Score Matching Analysis

Ahsan Mahmood, Junier Oliva, Martin Styner

Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Abstract

Anomaly detection and localization in medical imaging remain critical challenges in healthcare. This paper introduces Spatial-MSMA (Multiscale Score Matching Analysis), a novel unsupervised method for anomaly localization in volumetric brain MRIs. Building upon the MSMA framework, our approach incorporates spatial information and conditional likelihoods to enhance anomaly detection capabilities. We employ a flexible normalizing flow model conditioned on patch positions and global image features to estimate patch-wise anomaly scores. The method is evaluated on a dataset of 1,650 T1- and T2-weighted brain MRIs from typically developing children, with simulated lesions added to the test set. Spatial-MSMA significantly outperforms existing methods, including reconstruction-based, generative-based, and interpretation-based approaches, in lesion detection and segmentation tasks. Our model achieves superior performance in both distance-based metrics (99th percentile Hausdorff Distance: 7.05 ± 0.61, Mean Surface Distance: 2.10 ± 0.43) and component-wise metrics (True Positive Rate: 0.83 ± 0.01, Positive Predictive Value: 0.96 ± 0.01). These results demonstrate Spatial-MSMA’s potential for accurate and interpretable anomaly localization in medical imaging, with implications for improved diagnosis and treatment planning in clinical settings.

Our code is available at https://github.com/ahsanMah/sade/.

Keywords: machine learning, anomaly detection, unsupervised, score matching, diffusion models

1 Introduction

Multiscale score matching analysis (MSMA) was introduced by Mahmood et al. (2021) as a novel methodology for sample-wise anomaly detection. The method learns the distribution of the magnitudes of the gradient of log-likelihoods i.e. the norm of the score function: $\left|\left|{s(x)}\right|\right|=\left|\left|{\nabla_{x}\log p(x)}\right|\right|$ . Sample-wise anomaly detection operates on the full input and makes binary predictions on whether a sample is anomalous. It does not provide information about the input components that led to that assessment.

However, it is often desirable to identify the specific regions within an image that are contributing to its atypicality. Such localization allows for increased model interpretability as well as directing future investigation. For instance, in healthcare, the ability to interpret a model’s prediction empowers medical practitioners to visually corroborate the identified regions of interest. Interpretation may also pave the way for novel insights into the disease. Furthermore, localizing anomalies enables targeted diagnosis and intervention planning based on the factors contributing to the detected outlier.

To address this, we propose Spatial-MSMA, an extension of MSMA that incorporates spatial information to enable precise anomaly localization. Spatial-MSMA leverages the concept of patch-wise analysis, considering not only the content of individual patches but also their spatial context within the image. By modeling conditional likelihoods that account for patch position and global image features, Spatial-MSMA provides a more nuanced and interpretable approach to anomaly detection and localization. For example, Spatial-MSMA is capable of highlighting brain lesions even though it was never trained with labeled data.

2 Background

Denoising score matching and MSMA (Multiscale Score Matching Analysis) play a pivotal role in Spatial-MSMA. This section describes score matching, its extensions such as noise-conditioned score matching, and its application in MSMA.

2.1 Score Matching

A score is defined as the gradient of the log probability density with respect to the data. Conceptually, a score is a vector field that points in the direction where the log density grows the most. Hyvärinen (2005) introduced score matching as a means of estimating the parameters of an unnormalized probability density model. Hyvärinen proved the remarkable property that the objective can be rewritten so that it involves only the model score and its gradient, without access to the unknown data score, as shown in Equation 2. Following the naming scheme used in Vincent (2011), this objective is called Implicit Score Matching.

$\displaystyle J_{ISM}(\theta)=\mathbb{E}_{x\sim p(x)}\frac{1}{2}\left[\left\|s_{\theta}(x)-\nabla_{x}\log p(x)\right\|^{2}\right]$ (1)
$\displaystyle \phantom{J_{ISM}(\theta)}=\mathbb{E}_{x\sim p(x)}\left[\frac{1}{2}\left\|s_{\theta}(x)\right\|^{2}+\sum^{d}_{i=1}\partial_{x_{i}}s_{\theta,i}(x)\right]+\text{const}$ (2)

Here $s_{\theta,i}$ denotes the $i$-th component of $s_{\theta}$, and the constant does not depend on $\theta$.
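As a sanity check on the implicit form, consider a one-dimensional worked example (our own illustration, not taken from Hyvärinen (2005)). It uses the standard form of the objective with the factor $\frac{1}{2}$ on the squared norm. Assume data $x\sim\mathcal{N}(m,v)$ and a Gaussian model score $s_{\theta}(x)=-(x-\mu)/\tau^{2}$ with $\theta=(\mu,\tau^{2})$; the symbols $m$, $v$, $\mu$, and $\tau$ are introduced only for this example.

$\displaystyle J_{ISM}(\theta)=\mathbb{E}_{x\sim\mathcal{N}(m,v)}\left[\frac{1}{2}s_{\theta}(x)^{2}+\partial_{x}s_{\theta}(x)\right]=\frac{v+(m-\mu)^{2}}{2\tau^{4}}-\frac{1}{\tau^{2}}$

$\displaystyle \partial_{\mu}J_{ISM}=-\frac{m-\mu}{\tau^{4}}=0\;\Rightarrow\;\mu=m$

$\displaystyle \partial_{\tau^{-2}}J_{ISM}\Big|_{\mu=m}=v\,\tau^{-2}-1=0\;\Rightarrow\;\tau^{2}=v$

The minimizer recovers the data mean and variance even though the data score $\nabla_{x}\log p(x)$ never appears in the objective.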

Denoising Score Matching

Vincent (2011) formalized a connection between denoising autoencoders and score matching, and proposed the denoising score matching (DSM) objective. Vincent noted that Hyvärinen (2005) had suggested the possibility of an alternate score matching objective, one based on regressing against the data gradients of a Parzen window density estimator. This so-called Explicit Score Matching objective is shown in Equation 3.

$\displaystyle J_{ESM}(\theta)=\mathbb{E}_{x\sim q_{\sigma}(x)}\frac{1}{2}\left[\left\|s_{\theta}(x)-\nabla_{x}\log q_{\sigma}(x)\right\|^{2}\right]$ (3)

Vincent (2011) proved that under certain regularity conditions (for any window size $\sigma>0$, the kernel $q_{\sigma}$ is differentiable, converges to 0 at infinity, and has a finite gradient norm), the Parzen window based objective is equivalent to the original objective proposed by Hyvärinen (2005) in Equation 1.

Taking it one step further, assume the Parzen density estimate is chosen to estimate the joint density of clean and corrupted samples $(x,\tilde{x})$, i.e. $q_{\sigma}(x,\tilde{x})=q_{\sigma}(\tilde{x}|x)p(x)$. The DSM objective is then simply:

$\displaystyle J_{DSM}(\theta)=\mathbb{E}_{(x,\tilde{x})\sim q_{\sigma}(x,\tilde{x})}\frac{1}{2}\left[\left\|s_{\theta}(\tilde{x})-\nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x}|x)\right\|^{2}\right]$ (4)

DSM avoids the second-order gradients required by the implicit objective in Equation 2. Furthermore, if $q_{\sigma}$ is set as the Gaussian kernel $\mathcal{N}(\tilde{x}|x,\,\sigma^{2}I)$, then $\nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x}|x)=\frac{(x-\tilde{x})}{\sigma^{2}}$. One can now see the connection between score matching and the denoising autoencoder objective (when using a Gaussian kernel). The score model is effectively being trained to estimate the noise that was added to the image, up to sign and scale.
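For completeness, the Gaussian-kernel score follows from a standard two-line computation, written here in the notation above with $d$ the data dimension:

$\displaystyle \log q_{\sigma}(\tilde{x}|x)=-\frac{1}{2\sigma^{2}}\left\|\tilde{x}-x\right\|^{2}-\frac{d}{2}\log(2\pi\sigma^{2})$

$\displaystyle \nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x}|x)=-\frac{\tilde{x}-x}{\sigma^{2}}=\frac{x-\tilde{x}}{\sigma^{2}}$

Writing $\tilde{x}=x+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$, the regression target becomes $-\epsilon/\sigma$, which makes the noise-prediction reading of the objective explicit.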

It should be emphasized that while Vincent (2011) and many subsequent works (Song and Ermon, 2019, 2020; Song et al., 2020) use the Gaussian distribution in DSM, the proof for the validity of the objective by Vincent (2011) holds for any differentiable noise distribution.

Noise Conditioned Score Matching

Song and Ermon (2019) extended the DSM objective in Equation 4 to incorporate multiple scales $\sigma$, and trained a so-called Noise Conditioned Score Network (NCSN). The authors further outlined an iterative sampling algorithm, dubbed annealed Langevin dynamics, enabling the score network to be employed as a deep generative model. Let $\{\sigma_{i}\}_{i=1}^{L}$ be a positive geometric sequence that satisfies $\frac{\sigma_{1}}{\sigma_{2}}=\cdots=\frac{\sigma_{L-1}}{\sigma_{L}}>1$. NCSN is a conditional network, $s_{\theta}(x,\sigma)$, trained to jointly estimate scores for various levels of noise $\sigma_{i}$ such that $\forall\sigma\in\{\sigma_{i}\}_{i=1}^{L}:s_{\theta}(x,\sigma)\approx\nabla_{x}\log q_{\sigma}(x)$. In Song and Ermon (2019), the conditioning information is explicitly provided via a one-hot vector denoting the noise level used to perturb the data. The network is then trained via a modified denoising score matching objective as shown in Equation 5.

$\displaystyle J_{NCSN}(\theta)=\frac{1}{L}\sum_{i=1}^{L}\lambda(\sigma_{i})\left[\frac{1}{2}\,\mathbb{E}_{\tilde{x}\sim q_{\sigma_{i}}(\tilde{x}|x)p_{\text{data}}(x)}\left[\left\|s_{\theta}(\tilde{x},\sigma_{i})+\frac{\tilde{x}-x}{\sigma_{i}^{2}}\right\|^{2}_{2}\right]\right]$ (5)
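The objective in Equation 5 can be illustrated with a small Monte Carlo sketch. This is not the authors' training code: the helper names `geometric_sigmas` and `ncsn_loss` are hypothetical, the score function is passed in as a plain Python callable, and the weighting $\lambda(\sigma)=\sigma^{2}$ is the choice suggested by Song and Ermon (2019), assumed here for concreteness.

```python
import math
import random

def geometric_sigmas(sigma_max, sigma_min, num_scales):
    """Return sigma_1 > ... > sigma_L with a constant ratio (needs num_scales >= 2)."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_scales - 1))
    return [sigma_max * ratio ** i for i in range(num_scales)]

def ncsn_loss(score_fn, data, sigmas, rng):
    """Monte Carlo estimate of Equation 5, assuming lambda(sigma) = sigma**2."""
    total = 0.0
    for sigma in sigmas:
        scale_loss = 0.0
        for x in data:
            # Perturb the clean sample with Gaussian noise of scale sigma.
            x_tilde = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
            score = score_fn(x_tilde, sigma)
            # Squared norm of s(x~, sigma) + (x~ - x) / sigma^2.
            sq = sum((si + (ti - xi) / sigma**2) ** 2
                     for si, ti, xi in zip(score, x_tilde, x))
            scale_loss += 0.5 * sq
        total += sigma**2 * scale_loss / len(data)
    return total / len(sigmas)

# Toy usage: all data sit at the origin, so the exact score is -x~ / sigma^2.
rng = random.Random(0)
sigmas = geometric_sigmas(10.0, 0.1, 5)
data = [[0.0, 0.0] for _ in range(4)]
print(ncsn_loss(lambda xt, s: [-t / s**2 for t in xt], data, sigmas, rng))
```

In the toy usage the exact score makes every residual vanish, so the estimated loss is zero; any other score function yields a strictly positive value.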

Connecting Score Matching to Diffusion Models

In a follow-up work, Song et al. (2020) described a connection between noise-conditioned score matching and diffusion models (Sohl-Dickstein et al., 2015). The connection is presented under the lens of generative modeling, and provides a framework which unifies Markov-based and continuous-time diffusion models. The key insight is that successive perturbation of a data point using a scale-dependent noise distribution (as done in NCSNs) follows a Stochastic Differential Equation (SDE). Thus, the ‘forward’ diffusion process can be modeled as an SDE. If one has access to the scores at each time point, it is possible to construct a reverse-time SDE as proved by Anderson (1982). These reverse-time SDEs can be solved numerically using any differential equation solver, requiring access only to the score function. The authors used this formulation to generate images that surpassed those of the state-of-the-art generative models. This work was seminal in the development of subsequent diffusion models.

Continuous-Time Score Matching

More relevant to this research, Song et al. (2020) enabled a continuous relaxation to the discretized nature of noise conditioned score matching. Defining the noising process as a continuous forward SDE alleviates the need to predetermine the number of noise scales (as was the case for NCSN models). For generative models, this translates to faster sampling as one can control the number of gradient steps to take. For this work, it gives us the ability to observe different noise scales at test time while maintaining the advantages of using multiple noise scales during training. Namely, it forces the model to learn smoother transitions between noise scales, which helps test time generalizability.

2.2 Multiscale Score Matching Analysis

While Song and Ermon (2019) demonstrated the generative capabilities of NCSNs, Mahmood et al. (2021) outlined how these networks can be repurposed for outlier detection. Their methodology, Multiscale Score Matching Analysis (MSMA), incorporates noisy score estimators to separate in- and out-of-distribution (OOD) points. Recall that a score is the gradient of the log-likelihood. A typical point, residing in a region of high probability density, will need to take only a small gradient step in order to improve its likelihood. Conversely, a point further away from the typical region (an outlier) will need to take a comparatively larger gradient step towards the high density region. When we have multiple noisy score estimates, it is difficult to know a priori which noise scale accurately represents the gradient of the outliers. However, Mahmood et al. (2021) showed that learning the typical space of score-norms for all noise levels is sufficient to identify anomalies.

Concretely, assume we have a score estimator that is trained on $L$ noise levels and a set of inlier samples $X_{\text{IN}}$. Computing the inlier score estimates for all noise levels and taking the L2-norms across the input dimensions results in an $L$-dimensional feature vector for each input sample: $[\left\|s(X_{\text{IN}},\sigma_{1})\right\|_{2}^{2},...,\left\|s(X_{\text{IN}},\sigma_{L})\right\|_{2}^{2}]$. The authors in Mahmood et al. (2021) argue that inliers tend to concentrate in this multiscale score-norm embedding space. It follows that one could train an auxiliary model (such as a clustering model or a density estimator) to learn this score-norm space of inliers. At test time, the output of the auxiliary model (e.g. likelihoods in the case of density estimators) is used as an anomaly score. Results in Mahmood et al. (2021) show MSMA to be effective at identifying OOD samples in image datasets (e.g. CIFAR-10 as inliers and SVHN as OOD).
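The pipeline described above can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are ours and not part of MSMA: the inliers are drawn from $\mathcal{N}(0,I)$ so that the noise-perturbed score is known in closed form (no network is trained), the features are unsquared L2 norms, and the auxiliary model is a diagonal Gaussian rather than a mixture model. All function names are hypothetical.

```python
import math
import random

def toy_score(x, sigma):
    """Exact score of N(0, I) data perturbed with N(0, sigma^2 I) noise."""
    return [-xi / (1.0 + sigma**2) for xi in x]

def score_norm_features(x, sigmas, score_fn=toy_score):
    """L-dimensional vector of score L2 norms, one entry per noise scale."""
    return [math.sqrt(sum(s * s for s in score_fn(x, sigma))) for sigma in sigmas]

def fit_diag_gaussian(features):
    """Per-dimension mean and variance of the inlier feature vectors."""
    n, dim = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(dim)]
    variances = [sum((f[j] - means[j]) ** 2 for f in features) / n + 1e-12
                 for j in range(dim)]
    return means, variances

def anomaly_score(feature, means, variances):
    """Negative log-likelihood under the fitted diagonal Gaussian."""
    return sum(0.5 * math.log(2 * math.pi * v) + 0.5 * (f - m) ** 2 / v
               for f, m, v in zip(feature, means, variances))

rng = random.Random(0)
sigmas = [0.1, 1.0, 10.0]
inliers = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
means, variances = fit_diag_gaussian(
    [score_norm_features(x, sigmas) for x in inliers])

# A point near the typical set versus a point far from it.
typical = anomaly_score(score_norm_features([1.0] * 8, sigmas), means, variances)
outlier = anomaly_score(score_norm_features([5.0] * 8, sigmas), means, variances)
print(typical < outlier)
```

The point far from the typical set produces larger score norms at every noise scale, so it receives the higher anomaly score from the auxiliary density model.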

Our work builds on the MSMA methodology for anomaly detection. We will localize anomalies within the image by estimating the likelihoods of patch score-norms. This requires careful consideration of model architecture and the likelihood method applied, as we will discuss in Section 4.

2.3 Normalizing Flows

Spatial-MSMA employs deep likelihood (generative) models to estimate the likelihoods of the patch score norms. Normalizing flow models are a flexible class of generative models which can efficiently estimate the likelihood of a sample by using the change-of-variables formula. They utilize invertible transformations to project the data into the space of a predefined base distribution, such as a standard Gaussian. Examples of deep normalizing flows are models such as RealNVP (Dinh et al., 2017) or Neural Spline Flows (Durkan et al., 2019), which are typically trained via the maximum likelihood objective. However, these models fail to detect outlying samples, much like deep autoregressive models. The work by Kirichenko et al. (2020) investigated the failure cases of flow models. The authors' analyses suggest that flow models encode the visual appearance directly, without learning any semantic content. The anomaly detection performance of flow-based models can improve if they are trained on high-level semantic representations (e.g. from a pretrained neural network) rather than the raw images themselves.
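To make the change-of-variables formula concrete, the following minimal sketch evaluates $\log p_{X}(x)=\log p_{Z}(f(x))+\log\left|\frac{df}{dx}\right|$ for a single one-dimensional affine map. The affine map and the function names are illustrative assumptions; deep flows such as RealNVP stack many learned invertible layers instead.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def affine_flow_logpdf(x, shift, scale):
    """log p_X(x) for the invertible map z = (x - shift) / scale, scale > 0."""
    z = (x - shift) / scale
    log_abs_det = -math.log(scale)  # log |dz/dx| for the affine map
    return standard_normal_logpdf(z) + log_abs_det

def gaussian_logpdf(x, mean, std):
    """Closed-form log-density of N(mean, std^2), used as a reference."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

# The flow density agrees with the closed-form Gaussian density.
print(math.isclose(affine_flow_logpdf(2.0, 0.5, 2.0),
                   gaussian_logpdf(2.0, 0.5, 2.0)))
```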

3 Related Works

There are many existing approaches for anomaly detection. We will focus on a few that are commonly employed for medical imaging. These approaches can be organized into four broad categories as described below.

3.1 Reconstruction based Approaches

Reconstruction-based anomaly detectors are trained to produce typical counterparts (so-called reconstructions) of anomalous images. The methods may take some form of a deep autoencoder (Kascenas et al., 2022; Baur et al., 2021), trained with a reconstruction error objective such as mean squared error. At test time, the models are presumed to output an anomaly-free image, with the reconstruction error as the metric of atypicality. A known drawback of these models is the lack of specificity in their detection. As no reconstruction is pixel-perfect (especially in terms of image intensities), the output error maps have significant false-positives (Baur et al., 2021). Another drawback of autoencoders is that as their reconstruction abilities improve, their anomaly detection capabilities decrease as the models are better at reconstructing the anomalies.

3.2 Generative Modeling based Approaches

Some methods propose to employ generative models for removing anomalies via imputation or “restoration” (Schlegl et al., 2019; You et al., 2019). Imputation-based approaches utilize masking strategies to mask out certain regions of the image and use the generative model to in-paint only the masked regions. Restoration-based approaches do not use any masks and instead modify the entire image, using the original image as the starting point in the sampling procedure.

Recently, owing to the success of score-based diffusion models, much of the research has focused on using diffusion models as the generative model, replacing the GANs used in earlier work (Wyatt et al., 2022; Pinaya et al., 2022; Liu et al., 2023; Behrendt et al., 2023). All of these models provide slight modifications to the diffusion sampling process, and start the generation process from an input image rather than random noise. The sampling process starts by adding noise to the input image and iteratively denoises the sample to generate an anomaly-free counterpart of the original image. Once the cleaned sample is generated, a voxel-wise difference between the input and its anomaly-free counterpart is used as the anomaly score. The main differentiating factors between these methods are the hyperparameters used for training and the sampling strategies used during inference.

3.3 Feature Embedding based Approaches

Some methods aim to detect anomalies in a learned embedding space. The feature embeddings are computed by a neural network trained on the typical samples. At test time, it is assumed that the model will output feature embeddings that are close to those of the training population if the sample is an inlier, and far from them otherwise. A popular method in this category is the Student-Teacher anomaly detector by Bergmann et al. (2020). In this setting, we have two models: a high parameter-count Teacher network and a low parameter-count Student network. The Student model is considered to be a “weaker” version of the Teacher, and is trained using neural network distillation techniques. It is assumed that the Student model will fail to generalize to unseen datasets, producing a discrepancy between the Teacher’s features and those of the Student. This discrepancy is used to produce an anomaly score.

3.4 Attribution based Approaches

Certain techniques draw on the insights of interpretability research. The task is to identify features of the data that contribute to the model’s output. The identified features are often assigned a score relative to their importance, as determined by the rules of the interpretation technique. Examples of such methods include SHAP (Lundberg and Lee, 2017), Saliency Maps (Simonyan et al., 2013), and GradCAM (Selvaraju et al., 2017).

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explain the output of any machine learning model. SHAP values attribute the prediction of an instance to the different features, highlighting the positive or negative impact of each feature. Saliency Maps are a visualization technique that highlights areas of an input image that most influence the output of a network. The saliency map is computed by taking the gradient of the output with respect to the input image. Areas with high gradient values correspond to regions in the input that have a significant impact on the model’s prediction. GradCAM (Gradient-weighted Class Activation Mapping) is an extension of saliency maps that computes gradients with respect to feature vectors of an image rather than the image itself.

All of the mentioned attribution-based approaches aim to identify the features or input regions that contribute most to the model’s predictions. This information can be used for anomaly detection, as it can localize the patches that lead a model to classify an instance as an anomaly.

3.5 Situating Spatial-MSMA within Existing Works

Spatial-MSMA addresses key limitations of existing anomaly localization techniques. Unlike reconstruction-based approaches, it avoids the pitfall of decreasing detection capabilities as reconstruction quality improves. In contrast to generative modeling methods, Spatial-MSMA does not require complex sampling procedures or modifications to pretrained models. It goes beyond feature embedding techniques by considering both local and global context, potentially leading to more nuanced anomaly detection. Unlike attribution-based methods that often require labeled data or focus solely on model interpretability, Spatial-MSMA offers unsupervised learning with built-in localization capabilities. By leveraging conditional likelihoods and spatial information, Spatial-MSMA is expected to provide more accurate and interpretable anomaly localization, making it particularly valuable for applications in healthcare and manufacturing where precise identification of atypical regions is crucial.

4 Spatial-MSMA: Incorporating Spatial Information into MSMA

Figure 1: Overview of Spatial-MSMA. A neural score estimator produces score tensors at multiple noise scales. The score tensors are divided into patches and processed by a conditional flow to estimate patch-wise anomaly scores. Positional encodings are applied to patch locations, which are then combined with global image features extracted by a convolutional network. The patch score norms and conditioning vectors are fed into a normalizing flow model with conditional coupling blocks. The result is a negative likelihood heatmap that highlights anomalous patches within the image. Spatial-MSMA thus enables precise localization of anomalies based on the patch scores and their spatial context.

The basic assumption of MSMA, as explained in Section 2.2, is that inliers will occupy distinct regions in the score-norm space. At test time, we ask the question: Does the given sample belong to the inliers? MSMA consequently estimates the likelihood of a sample belonging to the inlier region in the score-norm space. Thus, MSMA looks at the data samples holistically, i.e. it considers the entire set of features available (e.g. all the pixels in an image).

However, we note that MSMA is also amenable to subsets of features. For instance, we may divide an image into patches and consider the score-norms of each patch independently. Now, we can ask the question: Does this patch belong to the inliers? As before, MSMA will output a likelihood estimate of a test patch belonging to the inliers, but this time only considering information present at the given patch location.

It is possible to naively extend MSMA to consider patches. One can decompose the image into a regular grid, and train an independent MSMA model for each grid location. One may even reduce computational costs by running training/inference in parallel for each patch location. However, while this approach is straightforward, it leaves much room for improvement.

Namely, we can leverage spatial locality: the notion that neighbouring image patches are highly correlated. Furthermore, even patches which are spatially far apart may depend on each other. Consider an image of a face. Observing patches of the left eye gives us rich information about what we may observe in the location of the right eye, even though the right eye is distal to the left. One can incorporate this information into the decision making process to reason about the typicality of a queried patch. For instance, observing a brown-colored right eye is typical. However, observing a brown-colored right eye given a black-colored left eye is atypical.

4.1 Modeling Conditional Likelihoods

Following the motivation above, one can employ a conditional model where, in addition to the contents of a patch, its position and surrounding context are also taken into account. As such, we propose to use a conditional likelihood model as the basis of our patch-based anomaly detector.

Concretely, the model will be conditioned on the patch position and the image features. Let $s_{p}=\{s(x_{p},\sigma_{i})\}_{i=1}^{L}$ be the multi-scale score tensor for a given patch $x_{p}$ at location $p$, belonging to the image $x$. Let $h(x)$ be the feature vectors of the image $x$ computed by a convolutional network $h$. We propose to estimate the conditional likelihood model $p(s_{p}|p,h(x))$. As this model will output likelihoods of score-norms for each patch conditioned on the surrounding spatial information, the model is called Spatial-MSMA.

Spatial-MSMA uses a flexible class of likelihood estimators called normalizing flows, introduced in Section 2.3. The patch locations are modeled via sinusoidal positional embeddings, commonly used in Transformer models (Vaswani et al., 2017). In order to capture global image context, the original image is passed through a convolutional network with a large receptive field. The resulting feature embeddings are concatenated with the positional embeddings and fed into the flow model as contextual information. The model then transforms the input samples, following the constraints of the change-of-variables technique. Finally, the flow estimates likelihoods of the transformed samples via a trainable conditional Gaussian Mixture Model.
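The construction of the conditioning vector can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the embedding width `dim_per_axis`, and the choice to embed each axis of the patch index separately before concatenation are all ours. The sinusoidal form itself follows Vaswani et al. (2017).

```python
import math

def sinusoidal_embedding(position, dim, max_period=10000.0):
    """Transformer-style embedding of one scalar position; dim must be even."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        # Interleave sine and cosine terms for each frequency.
        emb.extend([math.sin(position * freq), math.cos(position * freq)])
    return emb

def patch_context(patch_index, global_features, dim_per_axis=8):
    """Concatenate per-axis positional embeddings with global image features."""
    pos = []
    for coordinate in patch_index:  # e.g. grid indices along each image axis
        pos.extend(sinusoidal_embedding(coordinate, dim_per_axis))
    return pos + list(global_features)

# Placeholder global features standing in for the convolutional network output.
context = patch_context((0, 1, 2), [0.5, 0.5, 0.5, 0.5])
print(len(context))
```

A conditional flow would receive such a vector alongside the patch score norms, so that the same score-norm values can be typical at one location and atypical at another.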

5 Lesion Detection in Volumetric Brain MRIs

To illustrate the anomaly detection capabilities of Spatial-MSMA, we consider the task of detecting lesions in medical images. We choose this task to reflect the real-world use case of automatic detection and segmentation of pathologies. Recall that the model is trained without any labels and is not given any prior knowledge about the type of anomalies at test time.

To minimize confounding factors that can be introduced due to a distributional shift between the training (healthy) and testing (lesioned) populations, the anomalies will be simulated on a held out inlier test set, ensuring that the introduced anomalies are the primary factor differentiating the test set from the inlier population.

5.1 Constructing a Healthy Population

Our inlying, healthy population comprises typically developing school-age children. We chose this cohort due to the availability of public datasets within this demographic. Specifically, we retrieved data from two studies: the Adolescent Brain Cognitive Development (ABCD) Study (Casey et al., 2018) and the Human Connectome Project Development Study (HCP-D). Samples from these studies were preprocessed to remove any outliers. To keep the inlier cohort as nominal as possible, we used the Child Behavior Checklist (CBCL; Achenbach, 1999) scores as our filtering mechanism. This checklist assesses the behavioral and emotional competencies of children. Children with behavioral problems tend to score high on this test. For our analysis, all children who scored above a t-score of 66 ($\sim 95$th percentile) on the summary scores or on any of the subscores were removed. Note that this is more conservative than only using the summary scales. The data was then split into an 80/10/10 train/validation/test split. Our processing resulted in 1320 training, 165 validation, and 165 testing samples.

We use both T1-weighted and T2-weighted images. As the images are high-resolution 3D MRIs, they require a lot of GPU memory during training. In order to fit a batch size of 4 per GPU, the images were downsampled to a voxel spacing of 2 mm isotropic. They were further cropped by the largest brain mask, computed from the training data. After some padding to make the image dimensions multiples of 2, the resulting 3D volume was of size 96×112×80.

5.2 Simulating Lesions

The lesions were simulated using a lesion simulator tool (da S Senra Filho et al., 2019), available as the MSLesionSimulator extension (https://www.slicer.org/wiki/Documentation/Nightly/Modules/MSLesionSimulator) of the Slicer3D software package (Fedorov et al., 2012). The lesion load parameter was set to 20 and the rest of the hyperparameters were kept at their default values. A post-processing step was performed to enhance the lesion intensity by a factor of $1.5$. The lesions were generated on the test set.

Training Details

The score-norms were computed from a diffusion model with a 3D convolutional UNet-like architecture. We followed the SDE formulation of Song et al. (2020), using the ‘Variance-Exploding’ SDE with 2000 timesteps. The minimum sigma was set to 0.06, which is the average standard deviation of the image intensities. This is done so that the smallest noise scale is at least able to capture the intensity variation within an image. Following the suggestion of Song et al. (2020), the maximum sigma was set to 545.0, which is the 99th percentile of the pairwise distances in the training set. This is done to allow the largest noise distribution to maximally cover the support of the data distribution, i.e. $p_{\sigma_{\text{max}}}(x)\approx\mathcal{N}(x|0,\sigma_{\text{max}}^{2}I)$. The model was trained for 1.5 million iterations, by which point the validation loss had started to flatten out. The batch size was doubled at roughly the halfway point during training. This is a simple yet effective method proposed by Smith et al. (2018) to anneal the learning rate without having to use a decay schedule. The authors also reported that increasing the batch size reduced the number of parameter updates required to reach the same test accuracies when compared to strategies for decaying the learning rate.
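For reference, the noise schedule implied by these settings can be sketched as below. The schedule $\sigma(t)=\sigma_{\text{min}}(\sigma_{\text{max}}/\sigma_{\text{min}})^{t}$ is the Variance-Exploding form given by Song et al. (2020); the uniform discretization of $t\in[0,1]$ into 2000 steps and the name `ve_sigma` are assumptions made for illustration and may differ from the released code.

```python
import math

# Values quoted in the text above.
SIGMA_MIN, SIGMA_MAX, NUM_STEPS = 0.06, 545.0, 2000

def ve_sigma(t, sigma_min=SIGMA_MIN, sigma_max=SIGMA_MAX):
    """Noise scale of the Variance-Exploding SDE at time t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t

# Assumed uniform grid over [0, 1]; consecutive scales share a constant ratio.
timesteps = [i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
sigmas = [ve_sigma(t) for t in timesteps]
print(sigmas[0], sigmas[-1])
```

Because the schedule is continuous in $t$, any intermediate noise scale can be queried at test time without retraining, which is the property exploited in Section 2.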

During inference, the voxel-wise anomaly scores are first masked by the brain mask and then thresholded. The threshold is determined for each sample by searching for the value that gives the lowest symmetric mean surface distance between the ground truth and the thresholded segmentation. Searching for a threshold in this way is common practice when evaluating anomaly detectors (Baur et al., 2019). The segmentations are post-processed by removing connected components of size less than 3 voxels (using a connectivity of 1). The remaining segmentation mask is dilated using a disk of radius 1 as the structuring element. Note that this inference procedure is performed for all methods tested in the experiment.

5.3 Baseline Methodologies

Spatial-MSMA was compared to a selection of models that encompass a broad range of anomaly detection methodologies that have been successfully used in the medical imaging field. Namely, the baselines represent reconstruction-based, generative-based, and interpretation-based methods.

For the reconstruction-based baseline, we chose an autoencoder model by Luo et al. (2023) owing to its success on volumetric brain MRIs. The model uses a ResNet-like architecture and is trained using a reconstruction objective based on the Mean Squared Error (MSE). The authors also provide a publicly available implementation. This method is denoted as AE in Table 1.

Two generative-model based approaches were also included in the comparison. The first is an imputation-based approach inspired by Liu et al. (2023), which uses a checkerboard mask to in-paint different regions of the image (denoted as Inpaint in Table 1). This method performs multiple runs of imputation, alternating the checkerboard pattern each time and computing the average error across all runs. The second is a restoration-based approach (denoted as Restoration in Table 1), which first adds noise to the image and then invokes the sampling procedure of the diffusion model to iteratively generate the restored counterpart. Following Wyatt et al. (2022), the sampling procedure was initiated from 1/4th of the original timesteps. However, unlike Wyatt et al. (2022), we did not use Simplex noise during training/inference, as work by Kascenas et al. (2023) has shown that it may not be necessary (and is sometimes detrimental). Note that these techniques are agnostic to the diffusion framework, which allows us to use the same diffusion model that was used as the backbone of Spatial-MSMA. This keeps the comparison fair and limits confounding factors, as the diffusion model was trained to convergence and is able to generate realistic-looking samples.

Lastly, GradCAM was included as a representative of attribution-based approaches. Specifically, we used Guided-GradCAM (Selvaraju et al., 2016) which combines saliency maps (with some modifications) and GradCAM to give superior results to vanilla GradCAM. The gradients were computed using the outputs of a non-spatial MSMA. Concretely, a GMM was trained on the whole-image score norms and the GradCAM gradients were computed using the negative likelihood estimates. This corresponds to computing voxel-wise attribution maps for an MSMA anomaly score. Thus, the method is denoted as GradCAM-MSMA in Table 1.

5.4 Segmentation Metrics for Analysis

We chose mean surface distance (MSD) and the Hausdorff distance (HD) as the primary segmentation metrics for comparison. MSD calculates the average distance between the surfaces of the predictions and ground truth, providing a measure of overall segmentation accuracy. The Hausdorff distance represents the maximum distance between the two surfaces, capturing the worst-case scenario. We use the 99th percentile of the Hausdorff distance to mitigate the impact of outliers. These metrics are less biased towards over-segmentations compared to the more popular Dice score, making them particularly suitable for anomaly detection tasks where false positives can be problematic. Both distances are computed in a directed manner, i.e., from the ground truth to the prediction.
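The two distance metrics can be sketched for surfaces that are already given as lists of 3D points. Surface extraction from the voxel masks is omitted, the percentile uses linear interpolation between order statistics (other conventions exist), and all function names are hypothetical; this is an illustration of the definitions rather than the evaluation code used for Table 1.

```python
import math

def directed_distances(source_points, target_points):
    """Distance from every source point to its nearest target point."""
    return [min(math.dist(p, q) for q in target_points) for p in source_points]

def percentile(values, q):
    """q-th percentile (0-100), linearly interpolating between order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def mean_surface_distance(gt_surface, pred_surface):
    """Directed MSD: average distance from ground truth to prediction."""
    d = directed_distances(gt_surface, pred_surface)
    return sum(d) / len(d)

def hausdorff_percentile(gt_surface, pred_surface, q=99.0):
    """Directed Hausdorff distance, taken at the q-th percentile."""
    return percentile(directed_distances(gt_surface, pred_surface), q)

gt = [(0, 0, 0), (1, 0, 0)]
pred = [(0, 0, 0), (4, 0, 0)]
print(mean_surface_distance(gt, pred), hausdorff_percentile(gt, pred))
```

Taking the 99th percentile instead of the maximum discards the few largest ground-truth-to-prediction distances, which is what reduces the sensitivity to isolated outlying surface points.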

In addition to distance metrics, we computed component-wise metrics to assess the model’s ability to identify distinct anomalous regions. Connected components were determined from the voxel-wise segmentation masks using an 8-connectivity-neighborhood (including diagonals). We assign a true positive (TP) label to a predicted component that overlaps with any ground-truth component at any voxel location. The absence of any overlap is tallied as a false positive (FP). False negatives (FN) are ground-truth components with no corresponding prediction overlap. Table 1 reports the True Positive Rate (TPR = TP/(TP+FN), measuring sensitivity) and the Positive Predictive Value (PPV = TP/(TP+FP), measuring precision). These component-wise metrics provide insights into the model’s performance in detecting and delineating individual anomalous regions.
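The component-wise counting rules above can be sketched as follows. This is an illustrative sketch, not the evaluation code behind Table 1. The function name is hypothetical, and the connectivity structure is an assumption: full connectivity including diagonals, which equals 8-connectivity for 2D masks and generalizes to 26-connectivity for 3D volumes.

```python
import numpy as np
from scipy import ndimage

def component_metrics(ground_truth, prediction):
    """Component-wise TPR and PPV from two binary masks."""
    ndim = ground_truth.ndim
    # Full connectivity (diagonals included); assumption for 3D inputs.
    structure = ndimage.generate_binary_structure(ndim, ndim)
    gt_labels, n_gt = ndimage.label(ground_truth, structure=structure)
    pred_labels, n_pred = ndimage.label(prediction, structure=structure)

    overlap = (gt_labels > 0) & (pred_labels > 0)
    # TP: predicted components overlapping any ground-truth voxel.
    tp = len(np.unique(pred_labels[overlap]))
    # FP: predicted components with no overlap at all.
    fp = n_pred - tp
    # FN: ground-truth components with no overlapping prediction.
    fn = n_gt - len(np.unique(gt_labels[overlap]))

    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return tpr, ppv
```

Because a single overlapping voxel is enough for a true positive, these metrics measure whether a lesion was found at all, independently of how well its boundary was delineated.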

5.5 Results

Table 1 presents the segmentation performance of Spatial-MSMA compared to baseline methodologies across various metrics. We report the mean across the 165 test samples alongside the standard error for each metric. Note that due to the relatively small size of the lesions, the segmentation task was difficult for all models. However, the results demonstrate Spatial-MSMA’s superior performance in lesion detection and localization tasks.
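For reference, the mean and standard error of a per-sample metric can be computed as in the short sketch below. The use of the sample standard deviation (`ddof=1`) is an assumption, as the text does not state which estimator was used, and the function name is hypothetical.

```python
import numpy as np

def mean_and_standard_error(values):
    """Mean and standard error of the mean for per-sample metric values."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error: sample standard deviation divided by sqrt(n).
    sem = values.std(ddof=1) / np.sqrt(values.size)
    return mean, sem
```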

Spatial-MSMA achieved the lowest 99th percentile Hausdorff Distance (99-HD) of 7.05 and Mean Surface Distance (MSD) of 2.10. These metrics reflect the model’s ability to accurately delineate lesion boundaries. The significantly lower distances compared to baselines (e.g., Restoration: 99-HD of 8.67, MSD of 2.68) indicate that Spatial-MSMA produces tighter and more precise segmentations around anomalies.

The component-wise metrics reflect the sensitivity of the model to anomalous regions in the image, regardless of size. Spatial-MSMA exhibited exceptional performance in detecting individual lesion components, with a True Positive Rate (TPR) of 0.83 and a Positive Predictive Value (PPV) of 0.96. Recall that the maximum possible value for either metric is 1.0. The high TPR indicates that Spatial-MSMA successfully identified 83% of all lesions, significantly outperforming the next best method (Restoration, TPR: 0.68). The high PPV of 0.96 indicates that very few of the components predicted by Spatial-MSMA are false positives, a crucial factor in clinical applications where false alarms can lead to unnecessary interventions or patient anxiety.

Figure 2 provides a visual comparison of anomaly heatmaps generated by different methods. The plotted heatmaps are clipped at the 90th percentile for each sample, i.e., the displayed range represents the top $10\%$ of the anomaly scores. Spatial-MSMA consistently detects all lesions in the image, including smaller ones that other methods often miss. This is particularly evident in the third column, where Spatial-MSMA correctly identifies a small lesion (bottom left) that goes undetected by other approaches. The heatmaps also demonstrate Spatial-MSMA’s ability to provide more focused and accurate localization, with less diffuse activation around the lesions compared to methods like GradCAM. Note that while Spatial-MSMA tends towards over-segmentation, it manages to detect most if not all of the lesions. Other baselines such as Inpaint and Restoration are overly biased towards larger anomalies and often fail to detect smaller lesions.
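The per-sample clipping used for the heatmap visualization can be sketched as follows. This is an illustrative assumption about how such clipping may be implemented, with a hypothetical function name; it is not the plotting code used for Figure 2.

```python
import numpy as np

def clip_to_top_scores(heatmap, percentile=90):
    """Raise all scores below the given percentile to that percentile,
    so that the color range covers only the highest anomaly scores."""
    lower = np.percentile(heatmap, percentile)
    return np.maximum(heatmap, lower)
```

Since the threshold is computed per sample, the displayed range is relative to each image: it highlights where a method places its highest scores, not how large those scores are in absolute terms.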

Table 1: Segmentation metrics for lesion detection. Each model was trained on inlier samples only and tested on the same 165 test samples. We report the mean across the test samples alongside the standard errors. The left columns show distance-based metrics: 99th percentile of the Hausdorff Distance (99-HD) and Mean Surface Distance (MSD). The right columns show component-wise metrics: True Positive Rate (TPR = TP/(TP+FN)) and Positive Predictive Value (PPV = TP/(TP+FP)). Spatial-MSMA significantly outperforms the baseline methodologies, especially for component-wise metrics.

Method	99-HD $\downarrow$	MSD $\downarrow$	TPR $\uparrow$	PPV $\uparrow$
AE Luo et al. (2023)	12.27 $\pm$ 0.51	3.63 $\pm$ 0.35	0.44 $\pm$ 0.02	0.19 $\pm$ 0.01
Inpaint	13.26 $\pm$ 0.50	3.71 $\pm$ 0.27	0.63 $\pm$ 0.02	0.50 $\pm$ 0.02
Restoration	8.67 $\pm$ 0.53	2.68 $\pm$ 0.36	0.68 $\pm$ 0.02	0.17 $\pm$ 0.01
GradCAM-MSMA	12.68 $\pm$ 0.54	3.75 $\pm$ 0.37	0.43 $\pm$ 0.02	0.16 $\pm$ 0.01
Spatial-MSMA	7.05 $\pm$ 0.61	2.10 $\pm$ 0.43	0.83 $\pm$ 0.01	0.96 $\pm$ 0.01

6 Discussion and Limitations

The results of our study demonstrate the effectiveness of Spatial-MSMA in addressing the challenging task of unsupervised anomaly detection and localization in medical imaging. The superior performance of Spatial-MSMA across various metrics highlights its potential as a powerful tool for automated lesion detection in brain MRIs. The difficulty of unsupervised anomaly detection in medical imaging cannot be overstated. Unlike supervised learning approaches that rely on large datasets of labeled abnormalities, unsupervised methods must learn to identify anomalies without prior knowledge of what constitutes an anomaly. This is particularly challenging in medical contexts where anomalies can be subtle, diverse, and often mimicked by normal anatomical variations. Spatial-MSMA’s ability to achieve high detection rates and low false positive rates in this unsupervised setting is therefore especially noteworthy.

Furthermore, the incorporation of spatial information and conditional likelihoods in Spatial-MSMA addresses key limitations of previous approaches. By considering both local patch content and global image context, Spatial-MSMA can better distinguish between true anomalies and normal variations that may appear anomalous when viewed in isolation. This is evident in the model’s ability to detect small lesions that other methods miss, as well as its high precision in avoiding false positives.

Note that this combination of high sensitivity and high precision (as illustrated by the TPR and PPV achieved by Spatial-MSMA, respectively) has significant implications for clinical applications. A high true positive rate ensures that clinicians are alerted to potential abnormalities, while a high positive predictive value minimizes false alarms. This balance is crucial in medical settings where both missed diagnoses and overdiagnosis can have serious consequences for patient care and resource allocation.

However, despite its strong performance, Spatial-MSMA is not without limitations. The model’s tendency towards slight over-segmentation, while preferable to under-segmentation in many clinical contexts, could be further refined. Future work could explore ways to fine-tune the model’s sensitivity to strike an even better balance between detection and precision. Another area for improvement is the computational cost of both training and inference when compared to more efficient baselines such as an autoencoder. This is due to the high parameter count of the underlying score-based diffusion models. Note that Spatial-MSMA is still an order of magnitude faster than iterative sampling-based solutions such as AnoDDPM (Wyatt et al., 2022), on which the Restoration baseline is based.

7 Conclusion

This work introduced Spatial-MSMA, a novel extension of Multiscale Score Matching Analysis that incorporates spatial information for improved anomaly detection and localization. By leveraging conditional likelihoods and patch-based analysis, Spatial-MSMA demonstrated superior performance in detecting and localizing simulated lesions in volumetric brain MRIs compared to several state-of-the-art baselines. Our results showed that Spatial-MSMA significantly outperformed existing methods across multiple metrics, including mean surface distance, Hausdorff distance, true positive rate, and positive predictive value. The model’s ability to detect lesions of varying sizes while maintaining a low false positive rate highlights its potential as a powerful tool for unsupervised anomaly detection in medical imaging.

Future work could explore several avenues such as extending the model to handle multi-modal imaging data, and unifying the learning objectives so as to train the score-estimator and the normalizing flow model in an end-to-end fashion. One may also explore the application of Spatial-MSMA to other domains beyond medical imaging, such as manufacturing or satellite imagery, where spatial context is crucial for anomaly detection.

Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

Conflicts of Interest

We declare we don’t have conflicts of interest.

References

Achenbach (1999) Thomas M. Achenbach. The Child Behavior Checklist and related instruments. In The Use of Psychological Testing for Treatment Planning and Outcomes Assessment, 2nd Ed., pages 429–466. Lawrence Erlbaum Associates Publishers, 1999. ISBN 0-8058-2761-7 (Hardcover).
Anderson (1982) Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
Baur et al. (2019) Christoph Baur, Benedikt Wiestler, Shadi Albarqouni, and Nassir Navab. Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part I 4, pages 161–169. Springer, 2019.
Baur et al. (2021) Christoph Baur, Stefan Denner, Benedikt Wiestler, Nassir Navab, and Shadi Albarqouni. Autoencoders for unsupervised anomaly segmentation in brain MR images: A comparative study. Medical Image Analysis, 69:101952, 2021. ISSN 1361-8415.
Behrendt et al. (2023) Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, and Alexander Schlaefer. Patched Diffusion Models for Unsupervised Anomaly Detection in Brain MRI. In Medical Imaging with Deep Learning, 2023.
Bergmann et al. (2020) Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4183–4192, 2020.
Casey et al. (2018) BJ Casey, Tariq Cannonier, May I Conley, Alexandra O Cohen, Deanna M Barch, Mary M Heitzeg, Mary E Soules, Theresa Teslovich, Danielle V Dellarco, Hugh Garavan, et al. The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites. Developmental cognitive neuroscience, 32:43–54, 2018.
da S Senra Filho et al. (2019) Antonio Carlos da S Senra Filho, Fabrício Henrique Simozo, Antonio Carlos dos Santos, and Luiz Otavio Murta Junior. Multiple sclerosis multimodal lesion simulation tool (ms-mist). Biomedical Physics and Engineering Express, 5(3):035003, March 2019. URL https://dx.doi.org/10.1088/2057-1976/ab08fc.
Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.
Durkan et al. (2019) Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/7ac71d433f282034e088473244df8c02-Paper.pdf.
Fedorov et al. (2012) Andriy Fedorov, Reinhard Beichel, Jayashree Kalpathy-Cramer, Julien Finet, Jean-Christophe Fillion-Robin, Sonia Pujol, Christian Bauer, Dominique Jennings, Fiona Fennessy, Milan Sonka, John Buatti, Stephen Aylward, James V. Miller, Steve Pieper, and Ron Kikinis. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magnetic Resonance Imaging, 30(9):1323–1341, 2012. ISSN 1873-5894, 0730-725X.
Hyvärinen (2005) Aapo Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. Technical report, 2005.
Kascenas et al. (2022) Antanas Kascenas, Nicolas Pugeault, and Alison Q. O’Neil. Denoising autoencoders for unsupervised anomaly detection in brain MRI. In Ender Konukoglu, Bjoern Menze, Archana Venkataraman, Christian Baumgartner, Qi Dou, and Shadi Albarqouni, editors, Proceedings of the 5th International Conference on Medical Imaging with Deep Learning, volume 172 of Proceedings of Machine Learning Research, pages 653–664. PMLR, 2022. URL https://proceedings.mlr.press/v172/kascenas22a.html.
Kascenas et al. (2023) Antanas Kascenas et al. The role of noise in denoising models for anomaly detection in medical images. Medical Image Analysis, 90:102963, 2023. ISSN 1361-8415. URL https://www.sciencedirect.com/science/article/pii/S1361841523002232.
Kirichenko et al. (2020) Polina Kirichenko, Pavel Izmailov, and Andrew G Wilson. Why normalizing flows fail to detect out-of-distribution data. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20578–20589. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ecb9fe2fbb99c31f567e9823e884dbec-Paper.pdf.
Liu et al. (2023) Zhenzhen Liu, Jin Peng Zhou, Yufan Wang, and Kilian Q Weinberger. Unsupervised Out-of-Distribution Detection with Diffusion Inpainting. In International Conference on Machine Learning, pages 22528–22538. PMLR, 2023.
Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
Luo et al. (2023) Guoting Luo, Wei Xie, Ronghui Gao, Tao Zheng, Lei Chen, and Huaiqiang Sun. Unsupervised anomaly detection in brain MRI: Learning abstract distribution from massive healthy brains. Computers in Biology and Medicine, 154:106610, 2023. ISSN 0010-4825. URL https://www.sciencedirect.com/science/article/pii/S0010482523000756.
Mahmood et al. (2021) Ahsan Mahmood, Junier Oliva, and Martin Andreas Styner. Multiscale score matching for out-of-distribution detection. In International Conference on Learning Representations, 2021.
Pinaya et al. (2022) Walter HL Pinaya, Mark S Graham, Robert Gray, Pedro F Da Costa, Petru-Daniel Tudosiu, Paul Wright, Yee H Mah, Andrew D MacKinnon, James T Teo, Rolf Jager, et al. Fast unsupervised brain anomaly detection and segmentation with diffusion models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 705–714. Springer, 2022.
Schlegl et al. (2019) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis, 54:30–44, 2019.
Selvaraju et al. (2016) Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128:336 – 359, 2016. URL https://api.semanticscholar.org/CorpusID:15019293.
Selvaraju et al. (2017) Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013. URL https://api.semanticscholar.org/CorpusID:1450294.
Smith et al. (2018) Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1Yy1BxCZ.
Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11918–11930, 2019.
Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
Wyatt et al. (2022) Julian Wyatt, Adam Leach, Sebastian M Schmon, and Chris G Willcocks. Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 650–656, 2022.
You et al. (2019) Suhang You, Kerem C. Tezcan, Xiaoran Chen, and Ender Konukoglu. Unsupervised lesion detection via image restoration with a normative prior. In M. Jorge Cardoso, Aasa Feragen, Ben Glocker, Ender Konukoglu, Ipek Oguz, Gozde Unal, and Tom Vercauteren, editors, Proceedings of The 2nd International Conference on Medical Imaging with Deep Learning, volume 102 of Proceedings of Machine Learning Research, pages 540–556. PMLR, 08–10 Jul 2019. URL https://proceedings.mlr.press/v102/you19a.html.