
1 Arizona State University
2 ASU-Mayo Center for Innovative Imaging

AnoFPDM: Anomaly Segmentation with Forward Process of Diffusion Models for Brain MRI

Yiming Che 1,2    Fazle Rafsani 1,2,*    Jay Shah 1,2,*    Md Mahfuzur Rahman Siddiquee 1,2    Teresa Wu 1,2
Abstract

Weakly-supervised diffusion models (DMs) for anomaly segmentation, which leverage image-level labels, have attracted significant attention for their superior performance compared to unsupervised methods. They eliminate the need for pixel-level labels in training, offering a more cost-effective alternative to supervised methods. However, existing methods are not fully weakly-supervised because they rely heavily on costly pixel-level labels for hyperparameter tuning at inference. To tackle this challenge, we introduce Anomaly Segmentation with Forward Process of Diffusion Models (AnoFPDM), a fully weakly-supervised framework that operates without the need for pixel-level labels. Using the unguided forward process as a reference for the guided forward process, we select hyperparameters such as the noise scale, the segmentation threshold and the guidance strength. We aggregate anomaly maps from the guided forward process, enhancing the signal strength of anomalous regions. Remarkably, our proposed method outperforms recent state-of-the-art weakly-supervised approaches, even without utilizing pixel-level labels.

Code is available at https://github.com/SoloChe/AnoFPDM
* Equal contribution
Keywords: Anomaly segmentation · Diffusion models · Weakly-supervision

1 Introduction

Anomaly segmentation often requires a large number of pixel-level annotations. However, acquiring pixel-level annotations is not only costly but also prone to human annotator bias. Hence, weakly-supervised generative methods [21, 15], which leverage image-level labels in training, are gaining attention. Among these methods, diffusion models [6, 17, 18] are commonly chosen as the backbone due to their superior performance compared to other generative methods such as generative adversarial networks (GANs) [5] and variational autoencoders (VAEs) [10]. However, current weakly-supervised methods are not fully weakly-supervised: they still depend heavily on pixel-level labels for hyperparameter tuning during the inference stage. For diffusion model-based methods, this tuning includes determining the appropriate guidance strength, the amount of noise (noise scale) added to the input image, and the threshold for the anomaly map. Consequently, the need for pixel-level labels in hyperparameter tuning reintroduces the cost and bias.

In this paper, we propose a fully weakly-supervised framework named AnoFPDM, built upon diffusion models with classifier-free guidance [7], to eliminate the need for pixel-level labels in hyperparameter tuning. In the training stage, our model follows the standard training process of classifier-free guidance as used in [7] with image-level labels, e.g., healthy and unhealthy. In the inference stage, our framework utilizes the forward process of DMs instead of the sampling (backward) process. Rather than relying directly on pixel-level labels, we select hyperparameters by leveraging the unguided forward process as a reference for the guided forward process.

Our key idea for the inference stage is illustrated in Fig. 1. During the forward process, we gradually corrupt the inputs. The denoised inputs, which are the predictions of the original inputs from the noised inputs, are then obtained under healthy-label guidance and under no guidance. Note that the denoised inputs in our framework are obtained by reverse mapping between the noised inputs and the original inputs. The denoised inputs without guidance do not remove the anomalous regions but compress both the anomalous and non-anomalous regions during the forward process. Conversely, under ideal conditions, the denoised inputs with healthy-label guidance remove the anomalous regions while compressing the non-anomalous regions. Our objective is to select the end step of the forward process such that the anomalous regions are maximally removed in the guided forward process. We use the difference between the guided and unguided forward processes to indicate the removal of anomalous regions. The guided forward process focuses on removing the anomalous regions at the beginning of the forward process, while the unguided forward process compresses the anomalous regions later, as these regions are typically low-frequency. We aggregate the anomaly maps from the guided forward process to enhance the signal strength of the anomalous regions. The threshold of the aggregated anomaly map is determined by its quantile, which is selected based on this property. The guidance strength directly controls the removal of anomalous regions and the discrimination between healthy and unhealthy samples. We select the guidance strength by balancing these factors.

Contributions: (i) Eliminating the need for pixel-level labels in hyperparameter selection by using unguided denoised inputs as a reference; (ii) A novel anomaly segmentation framework using the forward process of DMs; (iii) A novel anomaly map aggregation strategy to enhance the signal strength of anomalous regions.

Related Work. Prior to the development of DMs, GANs and VAEs dominated the field of anomaly segmentation [16, 3, 23]. However, GANs are often criticized for their unstable training, while VAEs are not ideal because of their limited expressiveness in the latent space and a tendency to produce blurry reconstructions. DMs are a promising alternative due to their success in image synthesis [4]. In unsupervised anomaly segmentation, AnoDDPM [22] applies the unguided denoising diffusion probabilistic model (DDPM) with simplex noise [11] to achieve notable segmentation outcomes. Furthermore, Behrendt et al. [2] enhance model performance by using patched images as inputs to mitigate generative errors, while Iqbal et al. [8] explore the use of masked images in both physical and frequency domains. The latent diffusion model of Rombach et al. [13] is incorporated for fast segmentation by Pinaya et al. [12]. Among weakly-supervised approaches, Wolleb et al. [21] leverage DDIM with classifier guidance [4], and Sanchez et al. [15] adopt DDIM with classifier-free guidance [7], similar to the DDIB approach [19].

Figure 1: Overview of the proposed AnoFPDM framework. The forward process is used to extract the denoised inputs $\tilde{\boldsymbol{x}}_{0,t}^{h}$ and $\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}$ for $t>0$ under healthy-label guidance and no guidance, respectively. Then, we collect the mean square error $MSE_{t}^{\emptyset}$ as a reference for hyperparameter tuning (end step $t_{e}$ and threshold) and $MSE_{t}^{h}$ for the aggregated anomaly map.

2 Background

2.1 Diffusion Models

To approximate the unknown data distribution $q(\boldsymbol{x})$, DMs perturb the data with increasing noise levels and then learn to reverse the perturbation using a model $\boldsymbol{\epsilon_{\theta}}$, typically a U-Net-like architecture [14], parameterized by $\theta$. More specifically, the forward process of DDPM is factorized as $q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})=\prod_{t=1}^{T}q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{t-1})$, which is fixed to a Markov chain with variance schedule $\beta_{1},\dots,\beta_{T}$. The transition kernel from $\boldsymbol{x}_{t-1}$ to $\boldsymbol{x}_{t}$ is modeled as the Gaussian $q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{t-1})=\mathcal{N}\left(\boldsymbol{x}_{t}\mid\sqrt{1-\beta_{t}}\,\boldsymbol{x}_{t-1},\beta_{t}\boldsymbol{I}\right)$ and hence the transition from the input $\boldsymbol{x}_{0}$ to an arbitrary step $t$, i.e., the forward process, has the closed form

$$\boldsymbol{x}_{t}=\sqrt{\bar{\alpha}_{t}}\,\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon}, \tag{1}$$

where $\boldsymbol{\epsilon}\sim\mathcal{N}\left(\boldsymbol{0},\boldsymbol{I}\right)$, $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ and $\alpha_{t}=1-\beta_{t}$.
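As a minimal sketch of Eq. 1, the snippet below draws a noised input at an arbitrary step in one shot. The linear variance schedule, its end points, the number of steps and the helper names `make_alpha_bar` and `q_sample` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)  # entry t - 1 holds alpha_bar_t

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) using the closed form of Eq. 1 (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(2, 128, 128))  # stand-in for a two-modality slice
xt, eps = q_sample(x0, t=300, alpha_bar=alpha_bar, rng=rng)
```

Because $\bar{\alpha}_{t}$ decreases monotonically in $t$, a later step keeps less of $\boldsymbol{x}_{0}$ and more of the noise.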

The sampling process $p_{\theta}\left(\boldsymbol{x}_{0:T}\right)=p(\boldsymbol{x}_{T})\prod_{t=1}^{T}p_{\theta}(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t})$ is the transition from $\boldsymbol{x}_{T}$ to $\boldsymbol{x}_{0}$ with the learned distribution $p_{\theta}\left(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t}\right)$, which approximates the inference distribution $q(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t},\boldsymbol{x}_{0})=q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{t-1},\boldsymbol{x}_{0})\frac{q(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{0})}{q(\boldsymbol{x}_{t}\mid\boldsymbol{x}_{0})}$. Both distributions $q$ and $p_{\theta}$ are assumed to be Gaussian. Training minimizes the KL-divergence

$$KL\left[q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})\,\|\,p_{\theta}(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})\right]=\mathbb{E}_{\boldsymbol{x}_{1:T}\sim q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})}\left[\log\frac{q(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})\,p_{\theta}(\boldsymbol{x}_{0})}{p_{\theta}(\boldsymbol{x}_{0:T})}\right]. \tag{2}$$

A variation of DDPM is DDIM [17]. It is non-Markovian, incorporating the input $\boldsymbol{x}_{0}$ into the forward process. The inference distribution is factorized as $q_{\sigma}(\boldsymbol{x}_{1:T}\mid\boldsymbol{x}_{0})=q_{\sigma}(\boldsymbol{x}_{T}\mid\boldsymbol{x}_{0})\prod_{t=2}^{T}q_{\sigma}(\boldsymbol{x}_{t-1}\mid\boldsymbol{x}_{t},\boldsymbol{x}_{0})$, where the factors are defined for all $t>1$. The forward process can be obtained through Bayes' theorem in closed form:

$$q_{\sigma}(\boldsymbol{x}_{t+1}\mid\boldsymbol{x}_{t},\boldsymbol{x}_{0})=\mathcal{N}\left(\sqrt{\bar{\alpha}_{t+1}}\,\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t+1}-\sigma_{t}^{2}}\,\frac{\boldsymbol{x}_{t}-\sqrt{\bar{\alpha}_{t}}\,\boldsymbol{x}_{0}}{\sqrt{1-\bar{\alpha}_{t}}},\ \sigma_{t}^{2}\boldsymbol{I}\right), \tag{3}$$

where $\sigma_{t}$ is set to 0, which makes the forward process deterministic. DDPM and DDIM share the same objective function in training.
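To make the deterministic case concrete, the following sketch evaluates the mean of Eq. 3 with $\sigma_{t}=0$. It uses the true $\boldsymbol{x}_{0}$ exactly as Eq. 3 is written; the function name `ddim_forward_step` and the two $\bar{\alpha}$ values are illustrative assumptions.

```python
import numpy as np

def ddim_forward_step(x_t, x0, a_t, a_next):
    """Mean of Eq. 3 with sigma_t = 0: carry x_t to the next, noisier step."""
    implied_noise = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * implied_noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal(x0.shape)
a_t, a_next = 0.9, 0.8  # illustrative alpha_bar_t and alpha_bar_{t+1}
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_next = ddim_forward_step(x_t, x0, a_t, a_next)
```

The step reuses the noise implied by $\boldsymbol{x}_{t}$ rather than drawing a fresh sample, so `x_next` is the point that Eq. 1 would give at step $t+1$ for the same noise.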

2.2 Classifier-free Guidance

Classifier-free guidance exhibits superior performance in generative tasks compared to classifier guidance [7]. Additionally, both training and sampling are simplified because no external classifier is needed. In the sampling process, for any $t\geq 0$, the noise predictor $\boldsymbol{\epsilon_{\theta}}^{i,t}$ with guidance $c_{i}$ is $\boldsymbol{\epsilon_{\theta}}^{i,t}=(1+w)\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,c_{i})-w\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)$. The parameter $w$ controls the strength of guidance. For unguided sampling, the predicted noise is $\boldsymbol{\epsilon_{\theta}}^{i,t}=\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)$. The model $\boldsymbol{\epsilon_{\theta}}$ with guidance is implemented using the attention mechanism [20].
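The guidance rule is a linear extrapolation between two predictions, which a small numerical sketch can illustrate. The arrays below are stand-ins for network outputs, not values produced by a trained model, and `guided_noise` is a hypothetical helper name.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_cond = np.array([0.2, -0.1, 0.4])   # stand-in for eps_theta(x_t, t, c_i)
eps_uncond = np.array([0.1, 0.0, 0.1])  # stand-in for eps_theta(x_t, t, ∅)
eps_w0 = guided_noise(eps_cond, eps_uncond, w=0.0)  # purely conditional prediction
eps_w2 = guided_noise(eps_cond, eps_uncond, w=2.0)  # pushed further along (cond - uncond)
```

With $w=0$ the rule returns the conditional prediction, and larger $w$ moves the result further away from the unconditional one. Setting $w=-1$ recovers the unguided prediction.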

3 Methodology

Our method aims to eliminate the need for pixel-level labels in hyperparameter tuning by utilizing the unguided forward process as a reference for the guided forward process. We first introduce the guided and unguided denoised inputs. Then, we describe the hyperparameter selection process.

3.1 Denoised Inputs with Classifier-free Guidance

We can add noise to the inputs in the forward process in either DDPM style using Eq. 1 or DDIM style using Eq. 3, without label information. After adding the noise at step $t$, we obtain the guided and unguided denoised inputs by reverse mapping according to Eq. 4 and Eq. 5, respectively. Note that the denoised inputs are NOT obtained from the iterative sampling process.

$$\tilde{\boldsymbol{x}}_{0,t}^{h}=\frac{\boldsymbol{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\left[(1+w)\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,h)-w\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)\right]}{\sqrt{\bar{\alpha}_{t}}} \tag{4}$$
$$\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}=\frac{\boldsymbol{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon_{\theta}}(\boldsymbol{x}_{t},t,\emptyset)}{\sqrt{\bar{\alpha}_{t}}} \tag{5}$$

Then, we collect $MSE_{t}^{h}=\left(\tilde{\boldsymbol{x}}_{0,t}^{h}-\boldsymbol{x}_{0}\right)^{2}$ and $MSE_{t}^{\emptyset}=\left(\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}-\boldsymbol{x}_{0}\right)^{2}$ for $t\in\{t_{i}\}_{i=0}^{T}$. On the testing set, which has no image-level labels, we first determine the health status of the input $\boldsymbol{x}_{0}$ by assessing the cosine similarity between the two MSEs. The details can be found in supplementary material B.
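The reverse mapping and the two error maps can be sketched as follows. The noise predictions are passed in as plain arrays standing in for the outputs of $\boldsymbol{\epsilon_{\theta}}$; the function names `predict_x0` and `mse_maps` and the value of $\bar{\alpha}_{t}$ are assumptions made for illustration.

```python
import numpy as np

def predict_x0(x_t, eps_hat, a_bar_t):
    """Reverse mapping shared by Eq. 4 and Eq. 5: a one-shot estimate of x_0."""
    return (x_t - np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(a_bar_t)

def mse_maps(x0, x_t, eps_healthy, eps_uncond, a_bar_t, w):
    """Pixel-wise squared errors of the guided and unguided denoised inputs."""
    eps_guided = (1.0 + w) * eps_healthy - w * eps_uncond
    x0_guided = predict_x0(x_t, eps_guided, a_bar_t)    # Eq. 4
    x0_unguided = predict_x0(x_t, eps_uncond, a_bar_t)  # Eq. 5
    return (x0_guided - x0) ** 2, (x0_unguided - x0) ** 2

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(2, 16, 16))
eps = rng.standard_normal(x0.shape)
a_bar_t = 0.7  # illustrative alpha_bar_t
x_t = np.sqrt(a_bar_t) * x0 + np.sqrt(1.0 - a_bar_t) * eps
# Passing the true noise as both predictions makes both error maps vanish.
mse_h, mse_u = mse_maps(x0, x_t, eps, eps, a_bar_t, w=2.0)
```

Each denoised input costs a single reverse mapping per step, with no iterative sampling. Any mismatch between the predicted and true noise is amplified by the factor $\sqrt{1-\bar{\alpha}_{t}}/\sqrt{\bar{\alpha}_{t}}$, which grows with $t$.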

3.2 Hyperparameter Selection

Dynamical Noise Scale and Threshold. We utilize the unguided forward process to select the end step $t_{e}$, i.e., the noise scale, for each input. It is chosen based on the mean absolute difference between the two MSEs, given by

$$M_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}-MSE_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{i=0}^{T}, \tag{6}$$

such that

$$t_{e}=\arg\max_{t}M_{t},\quad t\in\{t_{i}\}_{i=0}^{T}. \tag{7}$$

Note that $MSE_{t}^{h},MSE_{t}^{\emptyset}\in\mathbb{R}_{+}^{D\times H\times W}$, where $D$, $H$ and $W$ are the number of modalities, height and width, respectively. We assume that the anomalous regions are mostly removed on $\tilde{\boldsymbol{x}}_{0,t}^{h}$ at step $t_{e}$ compared to other steps. To obtain the anomaly map for segmentation, we aggregate $MSE_{t}^{h}$ for $t\in\{t_{i}\}_{i=0}^{t_{e}}$, i.e., $\frac{\sum_{t=0}^{t_{e}}MSE_{t}^{h}}{t_{e}}$. Similarly, the segmentation threshold is determined individually for each input. We observe that $M_{t_{e}}$ is roughly linearly related to the size of anomalous regions. A larger anomalous region corresponds to a larger value of $M_{t_{e}}$, as more regions are removed. Our threshold selection strategy is to select a smaller quantile for a larger anomalous region, so that more candidate pixels are included. For a comprehensive understanding of our selection process, please refer to Algorithm 1, detailed in supplementary material D.
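The per-input selection described above can be sketched on toy data as follows. Eq. 6 and Eq. 7 are implemented directly; the mapping from $M_{t_{e}}$ to a quantile (`quantile_for`, with its bounds `m_lo`, `m_hi`, `q_min` and `q_max`) is a hypothetical linear rule introduced only for illustration, since the actual procedure is given in Algorithm 1 of the supplementary material. The sketch also averages over the number of collected steps rather than dividing by $t_{e}$, which rescales the map by a constant and leaves a quantile threshold unchanged.

```python
import numpy as np

def select_end_step(mse_h, mse_u):
    """Eq. 6 and Eq. 7. Inputs have shape (num_steps, D, H, W)."""
    m = np.abs(mse_h - mse_u).mean(axis=(1, 2, 3))
    return int(np.argmax(m)), m

def aggregate(mse_h, end_idx):
    """Average the guided error maps over the collected steps up to the end step."""
    return mse_h[: end_idx + 1].mean(axis=0)

def quantile_for(m_te, m_lo, m_hi, q_min=0.90, q_max=0.98):
    """Hypothetical linear rule: a larger M_te maps to a smaller quantile."""
    frac = float(np.clip((m_te - m_lo) / (m_hi - m_lo), 0.0, 1.0))
    return q_max - (q_max - q_min) * frac

def segment(anomaly_map, q):
    """Threshold the modality-averaged map at its own q-quantile."""
    flat = anomaly_map.mean(axis=0)
    return flat > np.quantile(flat, q)

# Toy data: 5 collected steps, 2 modalities, 8x8 pixels, one 3x3 "lesion".
rng = np.random.default_rng(0)
lesion = np.zeros((8, 8), dtype=bool)
lesion[2:5, 2:5] = True
removal = np.array([0.0, 0.2, 0.5, 1.0, 0.4])  # how strongly guidance alters the lesion
mse_u = rng.uniform(0.0, 0.01, size=(5, 2, 8, 8))
mse_h = mse_u + removal[:, None, None, None] * lesion

end_idx, m = select_end_step(mse_h, mse_u)
anomaly_map = aggregate(mse_h, end_idx)
mask = segment(anomaly_map, quantile_for(m[end_idx], m_lo=0.0, m_hi=0.5))
```

In this toy setting the end step is the one where the guided and unguided maps differ most, and the thresholded pixels fall inside the synthetic lesion.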

Guidance Strength. We aim to select a guidance strength $w$ that is sufficiently large to remove anomalous regions on $\tilde{\boldsymbol{x}}_{0,t}^{h}$, but not so large that non-anomalous regions are significantly changed. At the same time, it must preserve the classification accuracy between healthy and unhealthy samples. Our selection strategy is to first ensure that the classification accuracy is above a certain threshold, and then to select the smallest $w$ that achieves the maximal $M_{t_{e}}$.
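The two-stage rule can be written as a short filter-then-select routine. In this sketch the candidate values, their accuracies, the accuracy threshold of 0.9 and the tolerance used to decide when two $M_{t_{e}}$ values tie are all hypothetical; the paper only states that the accuracy must exceed a certain threshold.

```python
def select_guidance_strength(candidates, min_accuracy=0.9, tol=1e-6):
    """candidates: (w, accuracy, mean M_te) tuples measured on validation data."""
    admissible = [c for c in candidates if c[1] >= min_accuracy]
    if not admissible:
        raise ValueError("no guidance strength reaches the accuracy threshold")
    best_m = max(m for _, _, m in admissible)
    # Among candidates that attain the maximal M_te, prefer the smallest w.
    return min(w for w, _, m in admissible if m >= best_m - tol)

candidates = [
    (1.0, 0.95, 0.010),
    (2.0, 0.94, 0.014),
    (3.0, 0.93, 0.014),
    (4.0, 0.80, 0.020),  # largest M_te, but fails the accuracy requirement
]
w_star = select_guidance_strength(candidates)
```

Here `w_star` is 2.0: the candidate with $w=4$ is rejected for its accuracy, and $w=2$ is preferred over $w=3$ because both reach the same $M_{t_{e}}$.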

4 Experiments

We train and evaluate the proposed method on the BraTS21 dataset [1]. BraTS21 comprises three-dimensional magnetic resonance (MR) brain images of subjects with a cerebral tumor, accompanied by pixel-wise annotations serving as ground truth labels. Each subject is scanned with four distinct MR sequences, specifically T1-weighted, T2-weighted, FLAIR, and T1-weighted with contrast enhancement. Since our method is two-dimensional, our analysis is confined to axial slices. There are 1,254 patients, and we split the dataset into 939 patients for training, 63 patients for validation and 252 patients for testing. We randomly select 1,000 samples from the validation set for hyperparameter tuning and 10,000 samples from the testing set for evaluation. For training, we stack all four modalities, while only the FLAIR and T2-weighted modalities are used in inference. For preprocessing, we normalize each slice by dividing by the $99$th percentile of the foreground voxel intensity, and then the pixels are scaled to the range $\left[-1,1\right]$. Finally, all samples are interpolated to $128\times 128$. The training setup is described in supplementary material A. The backbone U-net is from the previous work [15].
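A minimal sketch of the slice preprocessing is given below. Several details are assumptions rather than statements from the paper: foreground is taken to be the non-zero voxels, values are clipped to $[0,1]$ before being mapped to $[-1,1]$, and the resize uses nearest-neighbour index sampling in place of whichever interpolation the authors used. The function name `preprocess_slice` is hypothetical.

```python
import numpy as np

def preprocess_slice(slice_2d, out_size=128):
    """Normalise one axial slice and resize it to out_size x out_size."""
    foreground = slice_2d[slice_2d > 0]  # assumed definition of foreground
    scale = np.percentile(foreground, 99) if foreground.size else 1.0
    x = np.clip(slice_2d / scale, 0.0, 1.0) * 2.0 - 1.0  # map [0, 1] to [-1, 1]
    rows = np.arange(out_size) * slice_2d.shape[0] // out_size
    cols = np.arange(out_size) * slice_2d.shape[1] // out_size
    return x[np.ix_(rows, cols)]  # nearest-neighbour resize (assumed)

rng = np.random.default_rng(0)
raw = rng.gamma(2.0, 100.0, size=(240, 240))  # synthetic intensities
raw[:40, :] = 0.0                             # synthetic background rows
out = preprocess_slice(raw)
```

Dividing by a high percentile instead of the maximum keeps a few very bright voxels from compressing the rest of the intensity range.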

4.1 Results: Weakly-supervised Segmentation

We report the pixel-level DICE score, intersection over union (IoU), and area under the precision-recall curve (AUPRC) with respect to the foreground area in Table 1. The performance on 10,000 mixed samples, comprising 4,656 unhealthy samples and 5,344 healthy samples, and the performance on the 4,656 unhealthy samples alone are reported separately. For our method, we present results for four setups: (i) DDPM forward (stochastic encoding): Eq. 1 is used to add noise to the inputs; (ii) DDIM forward (deterministic encoding): the inputs are noised by Eq. 3; (iii) tuned with labels: the hyperparameters are tuned with pixel-level labels for all inputs (non-dynamical) and noise is added using DDIM forward; (iv) DDIM forward $t_e$: the same setup as DDIM forward, but only the anomaly map at $t_e$ is used for segmentation. For comparison, we report the results of AnoDDPM with Gaussian noise [22], pure DDIM, DDIM with classifier guidance [21], and DDIM with classifier-free guidance [15]. The first two methods are trained only on healthy data, and the hyperparameters of all methods are tuned using 1,000 samples from the validation set. Note that our methods do not require pixel-level labels for tuning. We adopt a median filter [9] and a connected component filter for postprocessing; the details are provided in supplementary material E. Our method surpasses the previous methods in the quantitative evaluation. The qualitative results are shown in Fig. 3, and more qualitative results are given in supplementary material F. They show that our method can enhance the signal strength of the anomalous regions.
We further exhibit the four components used in our framework, i.e., $\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}$, $\tilde{\boldsymbol{x}}_{0,t}^{h}$, $MSE_{t}^{\emptyset}$, and $MSE_{t}^{h}$, in Fig. 4, which illustrates the process of anomaly removal. More analysis can be found in supplementary material C.
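For reference, the two overlap metrics can be computed as in the following sketch. The function name `dice_and_iou` is hypothetical, and the convention that an empty prediction on an empty (healthy) mask receives a perfect score is an assumption of this sketch, since the handling of healthy slices is not spelled out here.

```python
def dice_and_iou(pred, target):
    """Return (DICE, IoU) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumption: empty prediction on an empty mask counts as a perfect score.
        return 1.0, 1.0
    return 2.0 * intersection / (pred_area + target_area), intersection / union


dice, iou = dice_and_iou([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```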

| Methods | Mixed DICE | Mixed IoU | Mixed AUPRC | Unhealthy DICE | Unhealthy IoU | Unhealthy AUPRC |
|---|---|---|---|---|---|---|
| AnoDDPM (Gaussian) [22] | 66.1 ± 0.1 | 61.7 ± 0.1 | 51.8 ± 0.1 | 37.6 ± 0.1 | 28.1 ± 0.1 | 61.3 ± 0.1 |
| DDIM unguided | 68.4 ± 0.1 | 63.7 ± 0.1 | 54.3 ± 0.1 | 40.7 ± 0.7 | 31.0 ± 0.1 | 63.4 ± 0.1 |
| DDIM clf [21] | 76.5 ± 0.1 | 71.0 ± 0.1 | 58.4 ± 0.3 | 52.2 ± 0.2 | 40.4 ± 0.2 | 61.6 ± 0.2 |
| DDIM clf-free [15] | 74.3 | 69.1 | 59.9 | 49.1 | 38.1 | 61.4 |
| Ours (DDPM forward) | 77.6 ± 0.1 | 71.3 ± 0.1 | **72.4** ± 0.1 | **56.0** ± 4.0 | **45.7** ± 3.5 | **76.1** ± 0.1 |
| Ours (DDIM forward) | **77.8** | **72.0** | 69.7 | 54.0 | 43.5 | 72.3 |
| Ours (tuned with labels) | 77.7 | 71.8 | 69.5 | 53.2 | 42.6 | 72.4 |
| Ours (DDIM forward $t_e$) | 75.6 | 70.6 | 66.0 | 52.9 | 42.5 | 70.7 |

Table 1: Segmentation performance on all slices and unhealthy slices. If the methods involve random noise, the standard deviations are reported based on three rounds of experiments. The best performance is in bold.

4.2 Ablation Study: Hyperparameter Selection

To validate the effectiveness of our hyperparameter selection, we randomly select 1,000 samples and calculate the DICE score under various settings. In Fig. 2 (a), we first show the selection of the guidance strength $w$. We set the accuracy threshold to $0.9$, and hence $w=2$ is selected. Then, we illustrate the DICE score for various guidance strengths $w$ in Fig. 2 (b). Notably, the maximal DICE score is achieved at $w=2$, supporting the feasibility of our selection. Another hyperparameter under consideration is the end step $t_e$. The sensitivity of the DICE score to $t_e$ with $w=2$ is depicted in Fig. 2 (c). The maximal DICE score is achieved at the selected $t_e$. The results are consistent with our hypothesis that the anomalous regions are maximally removed at $t_e$. When the ratio of the end step to the selected $t_e$ is smaller than 1, the DICE score is significantly reduced, indicating that the anomalous regions are not fully removed. Conversely, when the ratio is larger than 1, the DICE score is not reduced significantly, suggesting that the anomalous regions are fully removed but more noise is added to the aggregated anomaly map. However, the aggregation alleviates the impact of noise.

Figure 2: (a) Selection of the guidance strength $w$; (b) Sensitivity of the guidance strength $w$; (c) Sensitivity of the end step $t_e$. The red points indicate the selection of hyperparameters.
Figure 3: Qualitative comparison of anomaly heatmaps. The first column displays the original input images from the FLAIR modality, while the second column illustrates the corresponding ground truth (GT) for anomaly segmentation. The subsequent columns show the anomaly maps obtained from our method with DDIM forward and from the comparison methods (AnoDDPM Gaussian, DDIM with clf, DDIM clf-free, and DDIM unguided). Each row is a different sample.
Figure 4: An example of the four components used in our framework. Panel labels: $\tilde{\boldsymbol{x}}_{0,t}^{h}$, $\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}$, $MSE_{t}^{h}$, and $MSE_{t}^{\emptyset}$; steps shown: $t = 1, 100, 200, 300, 400, 500, 600$.

5 Conclusion

In this paper, we propose a novel anomaly segmentation framework that eliminates the need for pixel-level labels in hyperparameter tuning by using the denoised inputs without guidance as a reference. By aggregating anomaly maps from each forward step, we enhance the signal strength of anomalous regions, which improves the quality of the anomaly maps. Our method surpasses previous methods in weakly-supervised segmentation on the BraTS21 dataset.

References

  • [1] Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629 (2018)
  • [2] Behrendt, F., Bhattacharya, D., Krüger, J., Opfer, R., Schlaefer, A.: Patched diffusion models for unsupervised anomaly detection in brain mri. In: Medical Imaging with Deep Learning (2023)
  • [3] Chen, X., You, S., Tezcan, K.C., Konukoglu, E.: Unsupervised lesion detection via image restoration with a normative prior. Medical image analysis 64, 101713 (2020)
  • [4] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  • [5] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural information processing systems 27 (2014)
  • [6] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. pp. 6840–6851 (2020)
  • [7] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021)
  • [8] Iqbal, H., Khalid, U., Chen, C., Hua, J.: Unsupervised anomaly detection in medical images using masked diffusion model. In: International Workshop on Machine Learning in Medical Imaging. pp. 372–381. Springer (2023)
  • [9] Kascenas, A., Pugeault, N., O’Neil, A.Q.: Denoising autoencoders for unsupervised anomaly detection in brain mri. In: International Conference on Medical Imaging with Deep Learning. pp. 653–664. PMLR (2022)
  • [10] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • [11] Perlin, K.: Improving noise. In: Proceedings of the 29th annual conference on Computer graphics and interactive techniques. pp. 681–682 (2002)
  • [12] Pinaya, W.H., Graham, M.S., Gray, R., da Costa, P.F., Tudosiu, P.D., Wright, P., Mah, Y.H., MacKinnon, A.D., Teo, J.T., Jager, R., et al.: Fast unsupervised brain anomaly detection and segmentation with diffusion models. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 705–714 (2022)
  • [13] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [14] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)
  • [15] Sanchez, P., Kascenas, A., Liu, X., O’Neil, A.Q., Tsaftaris, S.A.: What is healthy? generative counterfactual diffusion for lesion localization. In: MICCAI Workshop on Deep Generative Models. pp. 34–44. Springer (2022)
  • [16] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, 30–44 (2019)
  • [17] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2020)
  • [18] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2020)
  • [19] Su, X., Song, J., Meng, C., Ermon, S.: Dual diffusion implicit bridges for image-to-image translation. In: The Eleventh International Conference on Learning Representations (2022)
  • [20] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [21] Wolleb, J., Bieder, F., Sandkühler, R., Cattin, P.C.: Diffusion models for medical anomaly detection. In: International Conference on Medical image computing and computer-assisted intervention. pp. 35–45. Springer (2022)
  • [22] Wyatt, J., Leach, A., Schmon, S.M., Willcocks, C.G.: Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 650–656 (2022)
  • [23] Zimmerer, D., Isensee, F., Petersen, J., Kohl, S., Maier-Hein, K.: Unsupervised anomaly localization using variational auto-encoders. In: Medical Image Computing and Computer Assisted Intervention. pp. 289–297. Springer (2019)

Supplementary Material

A. Training Hyperparameter Settings

Our model is trained on 2 Nvidia A100 GPUs with 80GB memory. The training hyperparameters are summarized in Table 2.

| Hyperparameter | Value |
|---|---|
| Diffusion steps | 1000 |
| Noise schedule | linear |
| Channels | 128 |
| Heads | 2 |
| Attention resolution | 32, 16, 8 |
| Dropout | 0.1 |
| EMA rate | 0.9999 |
| Optimiser | AdamW |
| Learning rate | $1 \times 10^{-4}$ |
| $\beta_{1}$, $\beta_{2}$ | 0.9, 0.999 |
| Batch size | 64 |
| Null label ratio | 0.1 |

Table 2: Training hyperparameters used in our method.

B. Determining Health Status by Cosine Similarity

To determine the label of inputs $\boldsymbol{x}_{0}$ without the pixel-level labels, we utilize the cosine similarity between two MSEs

$$Cos\left(\overline{MSE}^{h},\overline{MSE}^{\emptyset}\right)=\frac{\overline{MSE}^{h}\cdot\overline{MSE}^{\emptyset}}{\left(\|\overline{MSE}^{h}\|_{2}\cdot\|\overline{MSE}^{\emptyset}\|_{2}\right)}, \tag{8}$$

where $\overline{MSE}^{\emptyset}=\{\overline{MSE}^{\emptyset}_{1},\dots,\overline{MSE}^{\emptyset}_{T}\}$ and $\overline{MSE}^{h}=\{\overline{MSE}^{h}_{1},\dots,\overline{MSE}^{h}_{T}\}$. The element for each step is defined as

$$\overline{MSE}^{\emptyset}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{\emptyset}\big|}{D\times H\times W}, \tag{9}$$
$$\overline{MSE}^{h}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}\big|}{D\times H\times W}. \tag{10}$$

Here $T$ is the number of diffusion steps, $D$ is the number of modalities, and $H$ and $W$ are the height and width. In practice, we do not need to calculate the cosine similarity for all the steps because the input becomes significantly corrupted after a certain number of steps. Therefore, we calculate the cosine similarity for the first $600$ steps in our work.

On the validation set with image-level labels, we calculate the cosine similarity for each input $\boldsymbol{x}_{0}$ and determine the threshold $Cos_{thr}$ that achieves maximal accuracy. On the testing set without image-level labels, we use the threshold $Cos_{thr}$ to determine the label of the input $\boldsymbol{x}_{0}$, i.e., if $Cos\left(\overline{MSE}^{h},\overline{MSE}^{\emptyset}\right)>Cos_{thr}$, the input is healthy; otherwise, it is unhealthy. The threshold $Cos_{thr}$ is determined by the validation set and is fixed for the testing set.
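The health-status rule can be illustrated with the following sketch, which uses only the Python standard library. The function names and the way candidate thresholds are enumerated (midpoints between sorted validation similarities) are hypothetical choices for this example; the inputs are assumed to be the per-step mean MSE sequences $\overline{MSE}^{h}$ and $\overline{MSE}^{\emptyset}$.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally long sequences of per-step mean MSEs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_threshold(similarities, is_healthy):
    """Return the candidate threshold with maximal validation accuracy.

    Candidates are midpoints between sorted similarities (a hypothetical choice).
    """
    values = sorted(set(similarities))
    candidates = [(lo + hi) / 2.0 for lo, hi in zip(values, values[1:])] or values

    def accuracy(thr):
        hits = sum((s > thr) == h for s, h in zip(similarities, is_healthy))
        return hits / len(similarities)

    return max(candidates, key=accuracy)


def is_healthy_input(mse_h, mse_null, cos_thr):
    """Label an input healthy when the similarity exceeds the threshold."""
    return cosine_similarity(mse_h, mse_null) > cos_thr


# Illustrative validation similarities and image-level labels only.
val_sims = [0.999, 0.998, 0.97, 0.95]
val_labels = [True, True, False, False]
cos_thr = fit_threshold(val_sims, val_labels)
```

Once fitted on the validation set, `cos_thr` would be kept fixed and `is_healthy_input` applied to each test input.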

C. Analysis of MSEs

We analyze the MSEs using 100 samples (50 healthy and 50 unhealthy) from the BraTS21 dataset, selected from the validation set. The MSEs are calculated for the first 600 steps in terms of healthy label guidance and no guidance. We first show that if the input is healthy, the two MSEs, $MSE_{t}^{\emptyset}$ and $MSE_{t}^{h}$, are similar. Conversely, if the input is unhealthy, they differ. We calculate the mean of $\overline{MSE}^{h}_{t}$ and $\overline{MSE}^{\emptyset}_{t}$ across the samples, as well as the corresponding standard deviation. The results are shown in Fig. 5 (a) for healthy samples and (b) for unhealthy samples.

We decompose the difference between the two MSEs of unhealthy samples into two parts: the removal process of the anomalous regions and the impact of non-anomalous regions. We mask out the tumor regions in $MSE_{t}^{\emptyset}$ and $MSE_{t}^{h}$, i.e., $\widehat{MSE}_{t}^{\emptyset}=MSE_{t}^{\emptyset}-MSE_{t}^{\emptyset}\odot\mathbf{M}$ and $\widehat{MSE}_{t}^{h}=MSE_{t}^{h}-MSE_{t}^{h}\odot\mathbf{M}$, where $\odot$ denotes element-wise multiplication and $\mathbf{M}$ is the mask of the tumor.
Hence, the removal process of the anomalous regions can be observed by comparing $MSE_{t}^{h}$ and $\widehat{MSE}_{t}^{\emptyset}$. The impact of non-anomalous regions can be observed by comparing $\widehat{MSE}_{t}^{\emptyset}$ and $\widehat{MSE}_{t}^{h}$. We define the following metrics for quantification:

$$M_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}-MSE_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{600}, \tag{11}$$
$$\widehat{M}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|MSE_{t}^{h}-\widehat{MSE}_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{600}, \tag{12}$$
$$\widehat{\widehat{M}}_{t}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|\widehat{MSE}_{t}^{h}-\widehat{MSE}_{t}^{\emptyset}\big|}{D\times H\times W},\quad t\in\{t_{i}\}_{0}^{600}. \tag{13}$$

These metrics are shown in Fig. 5 (c). The green curve $\widehat{\widehat{M}}_{t}$ shows the impact of non-anomalous regions, while the black curve $\widehat{M}_{t}$ shows the removal of anomalous regions. However, in practice, we can only observe the red curve $M_{t}$ without pixel-level labels. The difference between $M_{t}$ and $\widehat{M}_{t}$ is shown in Fig. 5 (c) as the black area, representing the anomalous regions compressed in the unguided forward process.
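For a single step, the three quantities in Eqs. 11-13 can be computed as in the sketch below. It is intended for analysis only, since it requires the ground-truth tumor mask; the function name `decomposition_metrics`, the array shapes, and the toy values are assumptions made for illustration.

```python
import numpy as np


def decomposition_metrics(mse_h, mse_null, tumor_mask):
    """Return (M_t, M_hat_t, M_hathat_t) for one step, following Eqs. 11-13.

    mse_h, mse_null: arrays of shape (D, H, W); tumor_mask: binary array of shape (H, W).
    """
    mse_h = np.asarray(mse_h, dtype=np.float64)
    mse_null = np.asarray(mse_null, dtype=np.float64)
    outside = 1.0 - np.asarray(tumor_mask, dtype=np.float64)  # 1 outside the tumor
    masked_h = mse_h * outside        # MSE^h minus its tumor part
    masked_null = mse_null * outside  # unguided MSE minus its tumor part
    m = np.abs(mse_h - mse_null).mean()
    m_hat = np.abs(mse_h - masked_null).mean()
    m_hathat = np.abs(masked_h - masked_null).mean()
    return m, m_hat, m_hathat


# Toy example: a 2x2 tumor in a 4x4 slice with one modality.
tumor = np.zeros((4, 4))
tumor[:2, :2] = 1.0
unguided = np.full((1, 4, 4), 0.1)
unguided[0, :2, :2] = 0.3
guided = np.full((1, 4, 4), 0.1)
guided[0, :2, :2] = 0.9
m_t, m_hat_t, m_hathat_t = decomposition_metrics(guided, unguided, tumor)
```

In this toy case the two maps agree outside the tumor, so the non-anomalous term vanishes and the gap between $\widehat{M}_{t}$ and $M_{t}$ equals the mean unguided error inside the tumor.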

Our aim is to find the step $t_{e}$ corresponding to the maximal $\widehat{M}_{t}$ (black plus green area) where the anomalous regions are mostly removed. The maximal $M_{t}$ is close to the maximal $\widehat{M}_{t}$ in most cases. Therefore, we use the maximal $M_{t}$ to select the end step $t_{e}$.
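The selection of the end step from the observable curve $M_{t}$ can be sketched as follows. The function name `select_end_step` and the array layout are hypothetical, and the returned index is a position in the array of evaluated steps, so mapping it back to a diffusion step depends on how the steps were sampled.

```python
import numpy as np


def select_end_step(mse_h, mse_null):
    """Return (t_e, M), where M[t] is the mean absolute difference at step index t.

    mse_h, mse_null: arrays of shape (T, D, H, W) with guided and unguided MSE maps.
    """
    mse_h = np.asarray(mse_h, dtype=np.float64)
    mse_null = np.asarray(mse_null, dtype=np.float64)
    # Eq. 11: average |MSE^h_t - MSE^null_t| over modalities and pixels.
    m = np.abs(mse_h - mse_null).mean(axis=(1, 2, 3))
    # t_e is the step index with the maximal M_t (the first one in case of ties).
    return int(np.argmax(m)), m


# Toy example: the guided and unguided maps differ most at step index 3.
T, D, H, W = 6, 2, 4, 4
unguided_maps = np.zeros((T, D, H, W))
guided_maps = np.zeros((T, D, H, W))
for t, gap in enumerate([0.0, 0.1, 0.3, 0.5, 0.4, 0.2]):
    guided_maps[t] = gap
t_e, m_curve = select_end_step(guided_maps, unguided_maps)
```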

Figure 5: Analysis of MSEs. (a) Mean of the two MSEs for healthy samples. (b) Mean of the two MSEs for unhealthy samples. (c) Decomposition of the difference between the two MSEs of unhealthy samples. (d) Relationship between the size of the anomalous region and the maximal difference value $M$.

D. Selection of Quantile for Anomaly Segmentation

In Fig. 5 (d), we depict the relationship between the size of the anomalous region (number of pixels) and the maximal difference value $M_{t_{e}}$. Notably, they exhibit a roughly linear relationship. Thus, $M_{t_{e}}$ serves as a rough indication of the size of the anomalous region and can be used to select the quantile $Q$ without relying on pixel-level labels. We select the segmentation quantile $Q^{*}\in\left[a,b\right]$ of the anomaly map $H$ for each input $\boldsymbol{x}_{0}$ using Algorithm 1. The input $M_{max}$ is obtained from the validation set for scaling and is fixed for the testing set. A smaller quantile is selected for a larger anomalous region to include more possible pixels. In our case, we select $a=0.90$ and $b=0.98$. Then, the predicted pixel-level labels are obtained as $H\geq Q^{*}$.

Input: $M_{max}$, $a$, $b$, $H$
$range = reverse(linspace(a, b, 101))$ # Set quantile range
$M_s = clamp\left(\frac{M_{t_e}}{M_{max}}, 0, 1\right)$
$index = round(M_s, 2) \times 100$ # Keep 2 digits
Return $Q^{*} = quantile(H, range[index])$
Algorithm 1 Selection of quantile $Q$ for a single input $\boldsymbol{x}_0$
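
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration assuming NumPy and assuming that $M_{t_e}$ has already been computed for the input; the function name `select_quantile_threshold` and its argument names are hypothetical and are not taken from the released code.

```python
import numpy as np


def select_quantile_threshold(H, M_te, M_max, a=0.90, b=0.98):
    """Sketch of Algorithm 1: pick a quantile level in [a, b] from M_te.

    H     : anomaly map (array of any shape)
    M_te  : maximal difference value for this input (assumed precomputed)
    M_max : scaling constant taken from the validation set
    Returns the selected quantile level and the threshold value Q*.
    """
    # Quantile levels running from b down to a (reverse of linspace(a, b, 101)).
    levels = np.linspace(a, b, 101)[::-1]
    # Scale by the validation maximum and clamp to [0, 1].
    M_s = float(np.clip(M_te / M_max, 0.0, 1.0))
    # Keep two digits, then convert to an integer index in 0..100.
    # The outer round guards against values such as 28.999999999999996.
    index = int(round(round(M_s, 2) * 100))
    level = float(levels[index])
    # Q* is the value of H at the selected quantile level.
    q_star = float(np.quantile(H, level))
    return level, q_star


# Usage: a larger M_te selects a smaller quantile level, so more pixels pass.
H = np.arange(100, dtype=float).reshape(10, 10)
level, q_star = select_quantile_threshold(H, M_te=1.0, M_max=2.0)
mask = H >= q_star
```

Returning the level alongside the threshold is a choice made here for inspection only; Algorithm 1 itself returns $Q^{*}$.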

E. Postprocessing

After obtaining the anomaly map, we apply a median filter with kernel size 5 to enhance the performance. We then apply a connected-component filter to remove small connected components, which are regarded as noise. We apply the same postprocessing to all methods for a fair comparison.
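
The postprocessing described above could look roughly like the following sketch, which assumes SciPy's `ndimage` module and a 2D anomaly map. The minimum component size `min_size`, the point at which thresholding is applied, and the function name `postprocess` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np
from scipy import ndimage


def postprocess(anomaly_map, threshold, min_size=20):
    """Median-filter the anomaly map, threshold it, and drop small components.

    min_size is a hypothetical cutoff (in pixels); the paper does not state one.
    """
    # Median filter with a 5x5 kernel applied to the anomaly map.
    smoothed = ndimage.median_filter(anomaly_map, size=5)
    # Binarize the smoothed map (assumed to happen after filtering).
    mask = smoothed >= threshold
    # Label connected components (default structure: 4-connectivity in 2D).
    labels, num = ndimage.label(mask)
    if num == 0:
        return mask
    # Pixel count per label; entry 0 is the background.
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_size
    keep[0] = False
    # Map each pixel's label to its keep/drop decision.
    return keep[labels]


# Usage: a large region survives, a small one is treated as noise.
demo = np.zeros((32, 32))
demo[5:15, 5:15] = 1.0    # large anomalous region
demo[24:28, 24:28] = 1.0  # small blob
clean = postprocess(demo, threshold=0.5)
```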

F. Additional Qualitative Results

We provide additional qualitative results in Fig. 6 for the BraTS21 dataset. The results show that our method can effectively segment the anomalous regions without pixel-level labels. We also provide more examples of removal of anomalous regions in Fig. 7.

[Figure 6 column labels: Input, GT, Ours, AnoDDPM Gaussian, DDIM with clf, DDIM clf-free, DDIM unguided]
Figure 6: Qualitative comparison of anomaly heatmaps. The first column displays the original input images from the FLAIR modality, while the second column illustrates the corresponding ground truth for anomaly segmentation. The subsequent columns show the anomaly maps obtained from our method with DDIM forward and various comparison methods. Each row is a different sample.
[Figure 7, three example panels, each labeled with $\tilde{\boldsymbol{x}}_{0,t}^{h}$, $\tilde{\boldsymbol{x}}_{0,t}^{\emptyset}$, $MSE_t^{h}$, $MSE_t^{\emptyset}$ at $t = 1, 100, 200, 300, 400, 500, 600$]
Figure 7: More examples of the four components used in our framework.