
¹ Arizona State University
² ASU-Mayo Center for Innovative Imaging

AnoFPDM: Anomaly Segmentation with Forward Process of Diffusion Models for Brain MRI

Yiming Che¹,², Fazle Rafsani¹,²,*, Jay Shah¹,²,*, Md Mahfuzur Rahman Siddiquee¹,², Teresa Wu¹,²

Abstract

Weakly-supervised diffusion models (DMs) in anomaly segmentation, leveraging image-level labels, have attracted significant attention for their superior performance compared to unsupervised methods. They eliminate the need for pixel-level labels in training, offering a more cost-effective alternative to supervised methods. However, existing methods are not fully weakly-supervised because they heavily rely on costly pixel-level labels for hyperparameter tuning in inference. To tackle this challenge, we introduce Anomaly Segmentation with Forward Process of Diffusion Models (AnoFPDM), a fully weakly-supervised framework that operates without the need for pixel-level labels. Leveraging the unguided forward process as a reference for the guided forward process, we select hyperparameters such as the noise scale, the threshold for segmentation and the guidance strength. We aggregate anomaly maps from the guided forward process, enhancing the signal strength of anomalous regions. Remarkably, our proposed method outperforms recent state-of-the-art weakly-supervised approaches, even without utilizing pixel-level labels.

Code is available at https://github.com/SoloChe/AnoFPDM

* Equal contribution

Keywords: Anomaly segmentation, Diffusion models, Weakly-supervision

1 Introduction

Anomaly segmentation often requires a large number of pixel-level annotations. However, acquiring pixel-level annotations is not only costly but also prone to human annotator bias. Hence, weakly-supervised generative methods [21, 15], which leverage image-level labels in training, are gaining attention. Among these methods, diffusion models [6, 17, 18] are commonly chosen as the backbone due to their superior performance compared to other generative methods such as generative adversarial networks (GANs) [5] and variational autoencoders (VAEs) [10]. However, current weakly-supervised methods are not fully weakly-supervised. They still heavily depend on pixel-level labels for hyperparameter tuning during the inference stage. For diffusion model-based methods, this tuning includes determining the appropriate guidance strength, the amount of noise (noise scale) added to the input image, and the threshold for the anomaly map. Consequently, the need for pixel-level labels in hyperparameter tuning reintroduces the cost and bias.

In this paper, we propose a fully weakly-supervised framework named AnoFPDM, built upon diffusion models with classifier-free guidance [7], to eliminate the need for pixel-level labels in hyperparameter tuning. In the training stage, our model follows the standard training process of classifier-free guidance as used in [7] with image-level labels, e.g., healthy and unhealthy. In the inference stage, our framework utilizes the forward process of DMs instead of the sampling (backward) process. Rather than relying directly on pixel-level labels, we select hyperparameters by leveraging the unguided forward process as a reference for the guided forward process.

Our key idea for the inference stage is illustrated in Fig. 1. During the forward process, we gradually corrupt the inputs. The denoised inputs, representing the prediction of the inputs from the noised inputs, are then obtained with respect to the healthy label guidance and no guidance. Note that the denoised inputs in our framework are obtained by reverse mapping between the noised inputs and the original inputs. The denoised inputs without guidance do not remove the anomalous regions but compress both the anomalous and non-anomalous regions during the forward process. Conversely, under ideal conditions, the denoised inputs with healthy label guidance remove the anomalous regions while compressing the non-anomalous regions. Our objective is to select the end step of the forward process such that the anomalous regions are maximally removed in the guided forward process. We utilize the difference between the guided and unguided forward processes to indicate the removal of anomalous regions. The guided forward process focuses more on removing the anomalous regions at the beginning of the forward process, while the unguided forward process compresses the anomalous regions later, as these regions are typically low-frequency. We aggregate the anomaly maps from the guided forward process to enhance the signal strength of the anomalous regions. The threshold of the aggregated anomaly map is determined by its quantile, which is selected based on this property. The guidance strength directly controls the removal of anomalous regions and the discrimination between healthy and unhealthy samples. We select the guidance strength by balancing these factors.

Contributions: (i) Eliminating the need for pixel-level labels in hyperparameter selection by using unguided denoised inputs as a reference; (ii) a novel anomaly segmentation framework using the forward process of DMs; (iii) a novel anomaly map aggregation strategy to enhance the signal strength of anomalous regions.

Related Work. Prior to the development of DMs, GANs and VAEs dominated the field of anomaly segmentation [16, 3, 23]. However, GANs are often criticized for their unstable training, while VAEs are not ideal because of their limited expressiveness in the latent space and a tendency to produce blurry reconstructions. DMs are a promising alternative due to their success in image synthesis [4]. In unsupervised anomaly segmentation, AnoDDPM [22] applied the unguided denoising diffusion probabilistic model (DDPM) with simplex noise [11] to achieve notable segmentation outcomes. Furthermore, Behrendt et al. [2] enhanced model performance by utilizing patched images as inputs to mitigate generative errors, while Iqbal et al. [8] explored the use of masked images in both physical and frequency domains. The latent diffusion model of Rombach et al. [13] was incorporated for fast segmentation by Pinaya et al. [12]. Among weakly-supervised approaches, Wolleb et al. [21] leveraged DDIM with classifier guidance [4], and Sanchez et al. [15] adopted DDIM with classifier-free guidance [7], similar to the DDIB approach [19].

Figure 1: Overview of the proposed AnoFPDM framework. The forward process is utilized to extract the denoised inputs $\tilde{\boldsymbol{x}}_{0,t}^{h}$ and $\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}$ for $t>0$ in terms of healthy label guidance and no guidance. Then, we collect the mean square error $MSE_{t}^{\emptyset}$ as a reference for hyperparameter tuning (end step $t_{e}$ and threshold) and $MSE_{t}^{h}$ for the aggregated anomaly map.

2 Background

2.1 Diffusion Models

To approximate the unknown data distribution $q(\boldsymbol{x})$, DMs perturb the data with increasing noise levels and then learn to reverse the perturbation using a model $\boldsymbol{\epsilon_{\theta}}$, typically a U-net-like architecture [14], parameterized by $\theta$. More specifically, the forward process of DDPM is factorized as $q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})=\prod_{t=1}^{T}q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{t-1})$, which is fixed to a Markov chain with variance schedule $\beta_{1},\dots,\beta_{T}$. The transition kernel from $\boldsymbol{x}_{t-1}$ to $\boldsymbol{x}_{t}$ is modeled as the Gaussian $q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{t-1})=\mathcal{N}\left(\boldsymbol{x}_{t}\mid\sqrt{1-\beta_{t}}\boldsymbol{x}_{t-1},\beta_{t}\boldsymbol{I}\right)$, and hence the transition from the input $\boldsymbol{x}_{0}$ to an arbitrary step $t$, i.e., the forward process, has the closed form

$$\boldsymbol{x}_{t}=\sqrt{\bar{\alpha}_{t}}\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}, \tag{1}$$

where $\boldsymbol{\epsilon}\sim\mathcal{N}\left(\boldsymbol{0},\boldsymbol{I}\right)$ , $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ and $\alpha_{t}=1-\beta_{t}$ .
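As a minimal sketch of Eq. 1, the closed-form noising step can be written in a few lines of NumPy. The schedule endpoints (`beta_start`, `beta_end`) and the function names are assumptions chosen for illustration; only the linear schedule and the 1000 diffusion steps come from the training settings in this paper.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """alpha_bar_t for a linear variance schedule (endpoint values are assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, eps):
    """Eq. 1: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, for t in 1..T."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(2, 128, 128))  # toy two-modality slice in [-1, 1]
eps = rng.standard_normal(x0.shape)
x_t = q_sample(x0, 300, alpha_bar, eps)           # noised input at step t = 300
```

Because `alpha_bar` decreases monotonically, later steps keep less of `x0` and more of the noise.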

The sampling process $p_{\theta}\left(\boldsymbol{x}_{0:T}\right)=p(\boldsymbol{x}_{T})\prod_{t=1}^{T}p_{\theta}(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t})$ is the transition from $\boldsymbol{x}_{T}$ to $\boldsymbol{x}_{0}$ with the learned distribution $p_{\theta}\left(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t}\right)$, which approximates the inference distribution $q(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t},\boldsymbol{x}_{0})=q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{t-1},\boldsymbol{x}_{0})\frac{q(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{0})}{q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{0})}$. Both distributions $q$ and $p_{\theta}$ are assumed to be Gaussian. Training minimizes the KL divergence

$$KL\left[q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})\,\|\,p_{\theta}(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})\right]=\mathbb{E}_{\boldsymbol{x}_{1:T}\sim q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})}\left[\log\frac{q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})p_{\theta}(\boldsymbol{x}_{0})}{p_{\theta}(\boldsymbol{x}_{0:T})}\right]. \tag{2}$$

A variation of DDPM is DDIM [17]. It is non-Markovian, incorporating the input $\boldsymbol{x}_{0}$ into the forward process. The inference distribution is factorized as $q_{\sigma}(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})=q_{\sigma}(\boldsymbol{x}_{T}\mid\boldsymbol{x}_{0})\prod_{t=2}^{T}q_{\sigma}(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t},\boldsymbol{x}_{0})$ for all $t>1$. The forward process can be obtained through Bayes’ theorem in closed form:

$$q_{\sigma}(\boldsymbol{x}_{t+1}\mid\boldsymbol{x}_{t},\boldsymbol{x}_{0})=\mathcal{N}\left(\sqrt{\bar{\alpha}_{t+1}}\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t+1}-\sigma_{t}^{2}}\frac{\boldsymbol{x}_{t}-\sqrt{\bar{\alpha}_{t}}\boldsymbol{x}_{0}}{\sqrt{1-\bar{\alpha}_{t}}},\sigma_{t}^{2}\boldsymbol{I}\right), \tag{3}$$

where $\sigma_{t}$ is set to 0. DDPM and DDIM share the same training objective.
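With $\sigma_{t}=0$ the Gaussian in Eq. 3 collapses to its mean, so one forward step is deterministic. The sketch below implements only that mean, as written in Eq. 3; the function name and the toy values of $\bar{\alpha}_{t}$ are illustrative assumptions.

```python
import numpy as np

def ddim_forward_step(x_t, x0, a_t, a_next):
    """Mean of Eq. 3 with sigma_t = 0: deterministic move from x_t to x_{t+1}."""
    direction = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)  # noise implied by x_t
    return np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * direction

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal(x0.shape)
a_t, a_next = 0.9, 0.8                      # toy alpha_bar_t and alpha_bar_{t+1}
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_next = ddim_forward_step(x_t, x0, a_t, a_next)
```

In this toy setting the step reuses the same noise `eps` at the next noise level, which is why the encoding is deterministic given `x_t` and `x0`.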

2.2 Classifier-free Guidance

Classifier-free guidance exhibits superior performance in generative tasks compared to classifier guidance [7]. Additionally, both the training and sampling processes are simplified because no external classifier is needed. In the sampling process, for any $t\geq 0$, the noise predictor $\boldsymbol{\epsilon_{\theta}}^{i,t}$ with guidance $c_{i}$ is $\boldsymbol{\epsilon_{\theta}}^{i,t}=(1+w)\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,c_{i})-w\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)$. The parameter $w$ controls the strength of guidance. For unguided sampling, the predicted noise is $\boldsymbol{\epsilon_{\theta}}^{i,t}=\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)$. The model $\boldsymbol{\epsilon_{\theta}}$ with guidance is implemented using attention [20].
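As a short worked rearrangement (added here for intuition, not taken from the cited work), the guided predictor can be written as the conditional prediction plus a scaled correction:

$$\boldsymbol{\epsilon_{\theta}}^{i,t}=\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,c_{i})+w\left[\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,c_{i})-\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)\right]$$

In this form, $w=0$ gives the plain conditional prediction, and a larger $w$ moves the prediction further along the direction that separates the conditional estimate from the unconditional one.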

3 Methodology

Our method aims to eliminate the need for pixel-level labels in hyperparameter tuning by utilizing the unguided forward process as a reference for the guided forward process. We first introduce the guided and unguided denoised inputs. Then, we describe the hyperparameter selection process.

3.1 Denoised Inputs with Classifier-free Guidance

We can add noise to the inputs in the forward process either in DDPM style using Eq. 1 or in DDIM style using Eq. 3, without label information. After adding noise at step $t$, we obtain the guided and unguided denoised inputs by reverse mapping according to Eq. 4 and Eq. 5, respectively. Note that the denoised inputs are not obtained from the iterative sampling process.

$$\tilde{\boldsymbol{x}}_{0,t}^{h}=\frac{\boldsymbol{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\left[(1+w)\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,h)-w\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)\right]}{\sqrt{\bar{\alpha}_{t}}} \tag{4}$$

$$\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}=\frac{\boldsymbol{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)}{\sqrt{\bar{\alpha}_{t}}} \tag{5}$$

Then, we collect $MSE_{t}^{h}=\left(\tilde{\boldsymbol{x}}_{0,t}^{h}-\boldsymbol{x}_{0}\right)^{2}$ and $MSE_{t}^{\emptyset}=\left(\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}-\boldsymbol{x}_{0}\right)^{2}$ for $t\in\{t_{i}\}_{i=0}^{T}$, where the square is taken element-wise. On the testing set without image-level labels, we first determine the health status of the input $\boldsymbol{x}_{0}$ by assessing the cosine similarity between the two MSEs. The details can be found in supplementary material B.
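The reverse mapping of Eq. 4 and Eq. 5 and the resulting squared-error maps can be sketched as follows. The network outputs are passed in as arrays (`eps_h`, `eps_null`), so no particular model API is assumed; the function names and the "perfect predictor" used in the toy call are purely illustrative.

```python
import numpy as np

def denoise_x0(x_t, eps_pred, a_bar_t):
    """Reverse mapping shared by Eq. 4 and Eq. 5: solve Eq. 1 for x0 given a noise estimate."""
    return (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)

def squared_error_maps(x0, x_t, eps_h, eps_null, a_bar_t, w):
    """Element-wise MSE_t^h and MSE_t^null for a single step t."""
    eps_guided = (1.0 + w) * eps_h - w * eps_null   # guided noise inside Eq. 4
    x0_h = denoise_x0(x_t, eps_guided, a_bar_t)     # Eq. 4
    x0_null = denoise_x0(x_t, eps_null, a_bar_t)    # Eq. 5
    return (x0_h - x0) ** 2, (x0_null - x0) ** 2

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(2, 16, 16))
eps = rng.standard_normal(x0.shape)
a_bar_t = 0.7
x_t = np.sqrt(a_bar_t) * x0 + np.sqrt(1.0 - a_bar_t) * eps
# Hypothetical perfect predictor: both outputs equal the true noise, so both maps vanish.
mse_h, mse_null = squared_error_maps(x0, x_t, eps, eps, a_bar_t, w=2.0)
```

With a real model, `eps_h` and `eps_null` differ on anomalous inputs, and that difference is what the later selection steps measure.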

3.2 Hyperparameter Selection

Dynamical Noise Scale and Threshold. We utilize the unguided forward process as a reference to select the end step $t_{e}$, i.e., the noise scale, for each input. It is chosen based on the average absolute difference between the two MSEs, given by

$$M_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}-MSE_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{T}, \tag{6}$$

such that

$$t_{e}=\arg\max_{t}M_{t},\quad t\in\{t_{i}\}_{0}^{T}. \tag{7}$$

Note that $MSE_{t}^{h},MSE_{t}^{\emptyset}\in\mathbb{R}_{+}^{D\times H\times W}$, where $D$, $H$ and $W$ are the number of modalities, the height and the width, respectively. We assume that the anomalous regions are removed from $\tilde{\boldsymbol{x}}_{0,t}^{h}$ more completely at step $t_{e}$ than at other steps. To obtain the anomaly map for segmentation, we aggregate $MSE_{t}^{h}$ for $t\in\{t_{i}\}_{0}^{t_{e}}$, i.e., $\frac{\sum_{t=0}^{t_{e}}MSE_{t}^{h}}{t_{e}}$. Similarly, the segmentation threshold is determined individually for each input. We observe that $M_{t_{e}}$ is roughly linearly related to the size of the anomalous regions. A larger anomalous region corresponds to a larger value of $M_{t_{e}}$, as more regions are removed. Our threshold selection strategy is to select a smaller quantile for a larger anomalous region so that more candidate pixels are included. For a comprehensive understanding of our selection process, please refer to Algorithm 1, detailed in supplementary material D.
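The end-step selection of Eq. 6 and Eq. 7 and the aggregation can be sketched as below, assuming the per-step maps are stacked into arrays of shape `(S, D, H, W)`. Two details are assumptions of this sketch rather than statements of the paper: the aggregate is divided by the number of maps (a constant factor that does not affect quantile thresholding), and the modalities are averaged to obtain a single 2D map.

```python
import numpy as np

def select_end_step(mse_h, mse_null):
    """Return the index of t_e in the step grid and the curve M_t (Eq. 6 and Eq. 7)."""
    m = np.abs(mse_h - mse_null).mean(axis=(1, 2, 3))  # average over D, H, W
    return int(np.argmax(m)), m

def aggregate_anomaly_map(mse_h, k):
    """Average guided maps for steps 0..k, then average over modalities (assumption)."""
    return mse_h[: k + 1].mean(axis=(0, 1))

# Toy stack of 5 steps: the guided error on a 2x2 patch peaks at step index 3.
S, D, H, W = 5, 2, 8, 8
mse_null = np.zeros((S, D, H, W))
mse_h = np.zeros((S, D, H, W))
for s, v in enumerate([0.0, 0.1, 0.3, 0.5, 0.2]):
    mse_h[s, :, 2:4, 2:4] = v
k, m_curve = select_end_step(mse_h, mse_null)
anomaly_map = aggregate_anomaly_map(mse_h, k)
```

Here `m_curve[k]` plays the role of $M_{t_{e}}$, which is later reused to pick the segmentation quantile.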

Guidance Strength. We aim to select a guidance strength $w$ that is sufficiently large to remove anomalous regions from $\tilde{\boldsymbol{x}}_{0,t}^{h}$, but not so large that non-anomalous regions are significantly changed. At the same time, it must preserve the classification accuracy between healthy and unhealthy samples. Our selection strategy is to first ensure that the classification accuracy is above a certain threshold, and then to select the smallest $w$ that achieves the maximal $M_{t_{e}}$.
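The two-stage rule above can be sketched in plain Python. The sketch assumes each candidate $w$ has already been summarized on a validation set by its classification accuracy and an average $M_{t_{e}}$; how $M_{t_{e}}$ is summarized across samples, the tolerance `tol`, and the numbers in `grid` are hypothetical and serve only to illustrate the rule.

```python
def select_guidance_strength(candidates, acc_threshold=0.9, tol=1e-9):
    """candidates: iterable of (w, accuracy, mean_M_te) measured on a validation set."""
    admissible = [c for c in candidates if c[1] >= acc_threshold]
    if not admissible:
        raise ValueError("no guidance strength reaches the accuracy threshold")
    best_m = max(c[2] for c in admissible)
    # Smallest w among the admissible candidates that attain the maximal M_te.
    return min(c[0] for c in admissible if c[2] >= best_m - tol)

# Hypothetical measurements, for illustration only (not results from the paper).
grid = [(1.0, 0.95, 0.010), (2.0, 0.93, 0.014), (3.0, 0.91, 0.014), (4.0, 0.85, 0.020)]
w_star = select_guidance_strength(grid)
```

In this toy grid, $w=4$ is excluded by the accuracy constraint even though its $M_{t_{e}}$ is largest, and $w=2$ is preferred over $w=3$ because it reaches the same value with weaker guidance.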

4 Experiments

We train and evaluate the proposed method on the BraTS21 dataset [1]. BraTS21 comprises three-dimensional magnetic resonance (MR) brain images of subjects with cerebral tumors, accompanied by pixel-wise annotations serving as ground truth labels. Each subject is scanned with four MR sequences: T1-weighted, T2-weighted, FLAIR, and T1-weighted with contrast enhancement. Since our method is two-dimensional, our analysis is confined to axial slices. There are 1,254 patients, which we split into 939 patients for training, 63 patients for validation, and 252 patients for testing. We randomly select 1,000 samples from the validation set for hyperparameter tuning and 10,000 samples from the testing set for evaluation. For training, we stack all four modalities, while only the FLAIR and T2-weighted modalities are used in inference. For preprocessing, we normalize each slice by dividing by the 99th percentile of the foreground voxel intensity, and pixels are then scaled to the range $\left[-1,1\right]$. Finally, all samples are interpolated to $128\times 128$. The training setup is described in supplementary material A. The backbone U-net is taken from previous work [15].

4.1 Results: Weakly-supervised Segmentation

We report the pixel-level DICE score, intersection over union (IoU) and area under the precision-recall curve (AUPRC) in terms of foreground area in Table 1. The performance on 10,000 mixed samples, comprising 4,656 unhealthy samples and 5,344 healthy samples, and the performance on the 4,656 unhealthy samples alone are reported separately. For our method, we present results for four setups: (i) DDPM forward (stochastic encoding): Eq. 1 is used to add noise to the inputs; (ii) DDIM forward (deterministic encoding): the inputs are noised by Eq. 3; (iii) tuned with labels: the hyperparameters are tuned with pixel-level labels for all inputs (non-dynamical) and noise is added using DDIM forward; (iv) DDIM forward $t_{e}$: the same setup as DDIM forward, but only the anomaly map at $t_{e}$ is used for segmentation. For comparison, we report the results of AnoDDPM with Gaussian noise [22], pure DDIM, DDIM with classifier guidance [21] and DDIM with classifier-free guidance [15]. The first two methods are trained only on healthy data, and the hyperparameters of all comparison methods are tuned using 1,000 samples from the validation set. Note that our methods do not require pixel-level labels for tuning. We adopt a median filter [9] and a connected component filter for postprocessing. The details are provided in supplementary material E. Our method surpasses the previous methods in the quantitative evaluation. The qualitative results are shown in Fig. 6, with more qualitative results in supplementary material F. They show that our method can enhance the signal strength of the anomalous regions. We further exhibit the four components used in our framework, i.e., $\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}$, $\tilde{\boldsymbol{x}}_{0,t}^{h}$, $MSE_{t}^{\emptyset}$ and $MSE_{t}^{h}$, in Fig. 7, which illustrates the removal of anomalous regions. More analysis can be found in supplementary material C.

	Mixed			Unhealthy
Methods	DICE	IoU	AUPRC	DICE	IoU	AUPRC
AnoDDPM (Gaussian) [22]	66.1 $\pm$ 0.1	61.7 $\pm$ 0.1	51.8 $\pm$ 0.1	37.6 $\pm$ 0.1	28.1 $\pm$ 0.1	61.3 $\pm$ 0.1
DDIM unguided	68.4 $\pm$ 0.1	63.7 $\pm$ 0.1	54.3 $\pm$ 0.1	40.7 $\pm$ 0.7	31.0 $\pm$ 0.1	63.4 $\pm$ 0.1
DDIM clf [21]	76.5 $\pm$ 0.1	71.0 $\pm$ 0.1	58.4 $\pm$ 0.3	52.2 $\pm$ 0.2	40.4 $\pm$ 0.2	61.6 $\pm$ 0.2
DDIM clf-free [15]	74.3	69.1	59.9	49.1	38.1	61.4
Ours (DDPM forward)	77.6 $\pm$ 0.1	71.3 $\pm$ 0.1	72.4 $\pm$ 0.1	56.0 $\pm$ 4.0	45.7 $\pm$ 3.5	76.1 $\pm$ 0.1
Ours (DDIM forward)	77.8	72.0	69.7	54.0	43.5	72.3
Ours (tuned with labels)	77.7	71.8	69.5	53.2	42.6	72.4
Ours (DDIM forward $t_{e}$)	75.6	70.6	66.0	52.9	42.5	70.7

Table 1: Segmentation performance on all slices and unhealthy slices. If the methods involve random noise, the standard deviations are reported based on three rounds of experiments. The best performance is in bold.
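For reference, the DICE score and IoU of a binary prediction can be computed as in the sketch below. The handling of slices where both the prediction and the ground truth are empty (scored as 1.0 here) is an assumption of this sketch; the evaluation code behind Table 1 may use a different convention.

```python
import numpy as np

def dice_and_iou(pred, target):
    """DICE and IoU for binary masks; empty-vs-empty is scored as perfect (assumption)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0, 1.0
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(2.0 * inter / total), float(inter / union)

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
dice, iou = dice_and_iou(pred, target)   # one shared pixel, three foreground pixels in total
```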

4.2 Ablation Study: Hyperparameter Selection

To validate the effectiveness of our hyperparameter selection, we randomly select 1,000 samples and calculate the DICE score under various settings. In Fig. 2 (a), we first show the selection of the guidance strength $w$. We set the accuracy threshold to $0.9$, and hence $w=2$ is selected. Then, we illustrate the DICE score for various guidance strengths $w$ in Fig. 2 (b). Notably, the maximal DICE score is achieved at $w=2$, supporting the feasibility of our selection. Another hyperparameter under consideration is the end step $t_{e}$. The sensitivity of the DICE score to the end step with $w=2$ is depicted in Fig. 2 (c), where the end step is expressed as a ratio of the selected $t_{e}$. The maximal DICE score is achieved at $t_{e}$, which is consistent with our hypothesis that the anomalous regions are maximally removed at $t_{e}$. When the ratio is smaller than 1, the DICE score is significantly reduced, indicating that the anomalous regions are not fully removed. Conversely, when the ratio is larger than 1, the DICE score is not reduced significantly, suggesting that the anomalous regions are fully removed but more noise is added to the aggregated anomaly map. However, the aggregation alleviates the impact of this noise.

5 Conclusion

In this paper, we propose a novel anomaly segmentation framework that eliminates the need for pixel-level labels in hyperparameter tuning by utilizing the denoised inputs without guidance as a reference. By aggregating anomaly maps from each forward step, we enhance the signal strength of anomalous regions, which improves the quality of the anomaly maps. Our method surpasses previous methods in weakly-supervised segmentation on the BraTS21 dataset.

References

[1] Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629 (2018)
[2] Behrendt, F., Bhattacharya, D., Krüger, J., Opfer, R., Schlaefer, A.: Patched diffusion models for unsupervised anomaly detection in brain mri. In: Medical Imaging with Deep Learning (2023)
[3] Chen, X., You, S., Tezcan, K.C., Konukoglu, E.: Unsupervised lesion detection via image restoration with a normative prior. Medical image analysis 64, 101713 (2020)
[4] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
[5] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural information processing systems 27 (2014)
[6] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. pp. 6840–6851 (2020)
[7] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021)
[8] Iqbal, H., Khalid, U., Chen, C., Hua, J.: Unsupervised anomaly detection in medical images using masked diffusion model. In: International Workshop on Machine Learning in Medical Imaging. pp. 372–381. Springer (2023)
[9] Kascenas, A., Pugeault, N., O’Neil, A.Q.: Denoising autoencoders for unsupervised anomaly detection in brain mri. In: International Conference on Medical Imaging with Deep Learning. pp. 653–664. PMLR (2022)
[10] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
[11] Perlin, K.: Improving noise. In: Proceedings of the 29th annual conference on Computer graphics and interactive techniques. pp. 681–682 (2002)
[12] Pinaya, W.H., Graham, M.S., Gray, R., da Costa, P.F., Tudosiu, P.D., Wright, P., Mah, Y.H., MacKinnon, A.D., Teo, J.T., Jager, R., et al.: Fast unsupervised brain anomaly detection and segmentation with diffusion models. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 705–714 (2022)
[13] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[14] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)
[15] Sanchez, P., Kascenas, A., Liu, X., O’Neil, A.Q., Tsaftaris, S.A.: What is healthy? generative counterfactual diffusion for lesion localization. In: MICCAI Workshop on Deep Generative Models. pp. 34–44. Springer (2022)
[16] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, 30–44 (2019)
[17] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2020)
[18] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2020)
[19] Su, X., Song, J., Meng, C., Ermon, S.: Dual diffusion implicit bridges for image-to-image translation. In: The Eleventh International Conference on Learning Representations (2022)
[20] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[21] Wolleb, J., Bieder, F., Sandkühler, R., Cattin, P.C.: Diffusion models for medical anomaly detection. In: International Conference on Medical image computing and computer-assisted intervention. pp. 35–45. Springer (2022)
[22] Wyatt, J., Leach, A., Schmon, S.M., Willcocks, C.G.: Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 650–656 (2022)
[23] Zimmerer, D., Isensee, F., Petersen, J., Kohl, S., Maier-Hein, K.: Unsupervised anomaly localization using variational auto-encoders. In: Medical Image Computing and Computer Assisted Intervention. pp. 289–297. Springer (2019)

Supplementary Material

A. Training Hyperparameter Settings

Our model is trained on 2 Nvidia A100 GPUs with 80GB memory. The training hyperparameters are summarized in Table 2.

Diffusion steps	1000
Noise schedule	linear
Channels	128
Heads	2
Attention resolution	32,16,8
Dropout	0.1
EMA rate	0.9999
Optimiser	AdamW
Learning rate	$1\times 10^{-4}$
$\beta_{1}$ , $\beta_{2}$	0.9, 0.999
Batch size	64
Null label ratio	0.1

Table 2: Training hyperparameters used in our method.

B. Determining Health Status by Cosine Similarity

To determine the label of inputs $\boldsymbol{x}_{0}$ without the pixel-level labels, we utilize the cosine similarity between two MSEs

$$Cos\left(\overline{MSE}^{h},\overline{MSE}^{\emptyset}\right)=\frac{\overline{MSE}^{h}\cdot\overline{MSE}^{\emptyset}}{\|\overline{MSE}^{h}\|_{2}\cdot\|\overline{MSE}^{\emptyset}\|_{2}}, \tag{8}$$

where $\overline{MSE}^{\emptyset}=\{\overline{MSE}^{\emptyset}_{1},\dots,\overline{MSE}^{\emptyset}_{T}\}$ and $\overline{MSE}^{h}=\{\overline{MSE}^{h}_{1},\dots,\overline{MSE}^{h}_{T}\}$. The element for each step is defined as

$$\overline{MSE}^{\emptyset}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{\emptyset}\big|}{D\times H\times W} \tag{9}$$

$$\overline{MSE}^{h}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}\big|}{D\times H\times W}. \tag{10}$$

Here $T$ is the number of diffusion steps, $D$ is the number of modalities, and $H$ and $W$ are the height and width. In practice, we do not need to calculate the cosine similarity over all steps because the input becomes significantly corrupted after a certain number of steps. Therefore, we calculate the cosine similarity over the first $600$ steps in our work.

On the validation set with image-level labels, we calculate the cosine similarity for each input $\boldsymbol{x}_{0}$ and determine the threshold $Cos_{thr}$ that achieves maximal accuracy. On the testing set without image-level labels, we use the threshold $Cos_{thr}$ to determine the label of the input $\boldsymbol{x}_{0}$ , i.e., if $Cos\left(\overline{MSE}^{h},\overline{MSE}^{\emptyset}\right)>Cos_{thr}$ , the input is healthy; otherwise, it is unhealthy. The threshold $Cos_{thr}$ is determined by the validation set and is fixed for the testing set.
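A minimal sketch of this health-status classifier is given below. The per-step averaging follows Eq. 9 and Eq. 10 and the similarity follows Eq. 8; the threshold search over midpoints between sorted validation similarities is an assumption of this sketch, since the search procedure is not specified, and the similarity values in the toy example are hypothetical.

```python
import numpy as np

def step_mean(mse_maps):
    """Eq. 9 / Eq. 10: average each (D, H, W) map, one scalar per step."""
    return mse_maps.reshape(mse_maps.shape[0], -1).mean(axis=1)

def cosine_similarity(u, v):
    """Eq. 8 for two per-step MSE vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fit_threshold(sims, is_healthy):
    """Pick Cos_thr maximizing validation accuracy; 'healthy' means sim > Cos_thr."""
    sims = np.asarray(sims, dtype=float)
    is_healthy = np.asarray(is_healthy, dtype=bool)
    s = np.sort(sims)
    cuts = np.concatenate(([s[0] - 1e-6], (s[:-1] + s[1:]) / 2.0))  # candidate cuts
    accs = [np.mean((sims > c) == is_healthy) for c in cuts]
    return float(cuts[int(np.argmax(accs))])

# Hypothetical validation similarities (illustration only).
val_sims = [0.99, 0.98, 0.97, 0.80, 0.75]
val_healthy = [True, True, True, False, False]
cos_thr = fit_threshold(val_sims, val_healthy)
```

At test time, an input would be labeled healthy when its similarity exceeds `cos_thr`, with the threshold kept fixed as described above.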

C. Analysis of MSEs

We analyze the MSEs using 100 samples (50 healthy and 50 unhealthy) from the BraTS21 dataset, selected from the validation set. The MSEs are calculated for the first 600 steps in terms of healthy label guidance and no guidance. We first show that if the input is healthy, the two MSEs, $MSE_{t}^{\emptyset}$ and $MSE_{t}^{h}$ are similar. Conversely, if the input is unhealthy, they differ. We calculate the mean of $\overline{MSE}^{h}_{t}$ and $\overline{MSE}^{\emptyset}_{t}$ across the samples, as well as the corresponding standard deviation. The results are shown in Fig. 5 (a) for healthy samples and (b) for unhealthy samples.

We decompose the difference between the two MSEs of unhealthy samples into two parts: the removal process of the anomalous regions and the impact of non-anomalous regions. We mask out the tumor regions in $MSE_{t}^{\emptyset}$ and $MSE_{t}^{h}$, i.e., $\widehat{MSE}_{t}^{\emptyset}=MSE_{t}^{\emptyset}-MSE_{t}^{\emptyset}\odot\mathbf{M}$ and $\widehat{MSE}_{t}^{h}=MSE_{t}^{h}-MSE_{t}^{h}\odot\mathbf{M}$, where $\odot$ denotes element-wise multiplication and $\mathbf{M}$ is the tumor mask. Hence, the removal process of the anomalous regions can be observed by comparing $MSE_{t}^{h}$ with $\widehat{MSE}_{t}^{\emptyset}$. The impact of non-anomalous regions can be observed by comparing $\widehat{MSE}_{t}^{\emptyset}$ with $\widehat{MSE}_{t}^{h}$. We define the following metrics for quantification:

$$M_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}-MSE_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{600}, \tag{11}$$

$$\widehat{M}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}-\widehat{MSE}_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{600}, \tag{12}$$

$$\widehat{\widehat{M}}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|\widehat{MSE}_{t}^{h}-\widehat{MSE}_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{600}. \tag{13}$$

These metrics are shown in Fig. 5 (c). The green curve $\widehat{\widehat{M}}_{t}$ shows the impact of non-anomalous regions, while the black curve $\widehat{M}_{t}$ shows the removal of anomalous regions. However, in practice, we can only observe the red curve $M_{t}$ without pixel-level labels. The difference between $M_{t}$ and $\widehat{M}_{t}$ is shown in Fig. 5 (c) as the black area, representing the anomalous regions compressed in the unguided forward process.

Our aim is to find the step $t_{e}$ corresponding to the maximal $\widehat{M}_{t}$ (black plus green area) where the anomalous regions are mostly removed. The maximal $M_{t}$ is close to the maximal $\widehat{M}_{t}$ in most cases. Therefore, we use the maximal $M_{t}$ to select the end step $t_{e}$ .

D. Selection of Quantile for Anomaly Segmentation

In Fig. 5 (d), we depict the relationship between the size of the anomalous region (number of pixels) and the maximal difference value $M_{t_{e}}$. Notably, they exhibit a roughly linear relationship. Hence, $M_{t_{e}}$ serves as a rough indication of the size of the anomalous region and can be utilized to select the quantile $Q$ without relying on pixel-level labels. We select the segmentation quantile $Q^{*}\in\left[a,b\right]$ of the anomaly map $H$ for each input $\boldsymbol{x}_{0}$ by Algorithm 1. The input $M_{max}$ is obtained from the validation set for scaling and is fixed for the testing set. A smaller quantile is selected for a larger anomalous region to include more candidate pixels. In our case, we select $a=0.90$ and $b=0.98$. Then, the predicted pixel-level labels are obtained as $H\geq Q^{*}$.

Algorithm 1 Selection of quantile $Q$ for a single input $\boldsymbol{x}_{0}$

Input: $M_{max}$, $a$, $b$, $H$

$range=reverse(linspace(a,b,101))$   # Set quantile range

$M_{s}=clamp\left(\frac{M_{t_{e}}}{M_{max}},0,1\right)$

$index=round(M_{s},2)\times 100$   # Keep 2 digits

Return $Q^{*}=quantile(H,range\left[index\right])$
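Algorithm 1 translates almost line by line into NumPy. The sketch below is one possible reading: the function and argument names are illustrative, `np.quantile` with its default linear interpolation stands in for the unspecified `quantile` routine, and the toy anomaly map is synthetic.

```python
import numpy as np

def select_quantile_threshold(anomaly_map, m_te, m_max, a=0.90, b=0.98):
    """Return (Q*, quantile level) for one input, following Algorithm 1."""
    levels = np.linspace(a, b, 101)[::-1]          # reversed range: b first, a last
    m_s = float(np.clip(m_te / m_max, 0.0, 1.0))   # scaled M_te in [0, 1]
    index = int(round(m_s * 100))                  # two-digit rounding as an integer index
    level = float(levels[index])
    return float(np.quantile(anomaly_map, level)), level

amap = np.arange(100.0).reshape(10, 10)            # synthetic anomaly map
# Large M_te (large anomaly): lowest level a, so more pixels pass the threshold.
thr_large, q_large = select_quantile_threshold(amap, m_te=0.02, m_max=0.02)
# Small M_te (small anomaly): highest level b, so fewer pixels pass.
thr_small, q_small = select_quantile_threshold(amap, m_te=0.0, m_max=0.02)
mask_large = amap >= thr_large
mask_small = amap >= thr_small
```

Reversing the range is what makes a larger $M_{t_{e}}$ map to a smaller quantile level.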

E. Postprocessing

After we obtain the anomaly map, we apply a median filter with kernel size 5, which improves the segmentation performance. Then, we apply a connected component filter to remove small connected components, which are regarded as noise. We apply the same postprocessing to all methods for a fair comparison.
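The two postprocessing steps can be sketched with NumPy and the standard library only. The edge padding of the median filter, the 4-connectivity, and the `min_size` cutoff are assumptions of this sketch, since only the kernel size is stated above.

```python
from collections import deque
import numpy as np

def median_filter2d(img, size=5):
    """Median filter with edge padding (the padding mode is an assumption)."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))

def remove_small_components(mask, min_size):
    """Drop 4-connected foreground components with fewer than min_size pixels."""
    mask = np.asarray(mask, dtype=bool)
    keep = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for i, j in zip(*np.nonzero(mask)):
        if seen[i, j]:
            continue
        seen[i, j] = True
        queue, comp = deque([(i, j)]), []
        while queue:                               # breadth-first flood fill
            y, x = queue.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(comp) >= min_size:
            ys, xs = zip(*comp)
            keep[list(ys), list(xs)] = True
    return keep

noisy = np.zeros((9, 9)); noisy[4, 4] = 1.0        # single outlier pixel
smoothed = median_filter2d(noisy)
mask = np.zeros((9, 9), dtype=bool); mask[1:4, 1:4] = True; mask[7, 7] = True
cleaned = remove_small_components(mask, min_size=4)
```

In the toy example, the median filter removes the isolated outlier and the component filter keeps the 3x3 block while discarding the single stray pixel.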

F. Additional Qualitative Results

We provide additional qualitative results in Fig. 6 for the BraTS21 dataset. The results show that our method can effectively segment the anomalous regions without pixel-level labels. We also provide more examples of removal of anomalous regions in Fig. 7.