DoseDiff: Distance-aware Diffusion Model for Dose Prediction in Radiotherapy

Yiwen Zhang, Chuanpu Li, Liming Zhong, Zeli Chen, Wei Yang, and Xuetao Wang

This work was supported in part by the National Natural Science Foundation of China under Grant 82172020 and Grant 62101239, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515011336, and in part by the Guangdong Provincial Key Laboratory of Medical Image Processing under Grant 2020B1212060039. (Yiwen Zhang and Chuanpu Li are co-first authors.) (Corresponding authors: Xuetao Wang; Wei Yang.)

Yiwen Zhang, Chuanpu Li, Liming Zhong, Zeli Chen, and Wei Yang are with the School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China, and also with the Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Xuetao Wang is with the Department of Radiation Therapy, The Second Affiliated Hospital, Guangzhou University of Chinese Medicine, Guangzhou 510006, China (e-mail: [email protected]).

Abstract

Treatment planning, a critical component of the radiotherapy workflow, is typically carried out by a medical physicist in a time-consuming trial-and-error manner. Previous studies have proposed knowledge-based or deep-learning-based methods for predicting dose distribution maps to help medical physicists plan treatments more efficiently. However, these dose prediction methods usually fail to effectively utilize distance information between surrounding tissues and targets or organs-at-risk (OARs). Moreover, they are poor at maintaining the distribution characteristics of ray paths in the predicted dose distribution maps, resulting in a loss of valuable information. In this paper, we propose a distance-aware diffusion model (DoseDiff) for precise prediction of dose distribution. We define dose prediction as a sequence of denoising steps, wherein the predicted dose distribution map is generated conditioned on the computed tomography (CT) image and signed distance maps (SDMs). The SDMs are obtained by applying a distance transform to the masks of targets or OARs and provide the distance from each pixel in the image to the outline of the targets or OARs. We further propose a multi-encoder and multi-scale fusion network (MMFNet) that incorporates multi-scale and transformer-based fusion modules to enhance information fusion between the CT image and SDMs at the feature level. We evaluate our model on two in-house datasets and one public dataset. The results demonstrate that DoseDiff outperforms state-of-the-art dose prediction methods in terms of both quantitative performance and visual quality.

Index Terms

Deep learning, Diffusion model, Dose prediction, Radiotherapy, Signed distance map.

1 Introduction

Radiation therapy (RT) is an essential cancer treatment modality; approximately 50% of cancer patients receive RT during their course of illness and it contributes to around 40% of curative treatment cases [1, 2]. Treatment planning plays an important part in the current RT workflow as it is used to determine the optimal radiation dose, technique, and schedule to target cancer while minimizing exposure to healthy tissue, with the aim of maximizing effectiveness and minimizing side effects. Clinically, a medical physicist typically spends hours adjusting a set of hyper-parameters and weightings in a trial-and-error manner to ensure that the RT plan can achieve the desired treatment effect, which is time-consuming and labor-intensive [3, 4]. If an appropriate dose distribution map can be obtained in advance, the medical physicist will be able to use it as a reference, enabling the RT planning to be completed with fewer hyper-parameter adjustments [5, 6]. Therefore, dose prediction is of great value in enhancing efficiency and streamlining the workflow of RT.

Various approaches have been proposed for dose prediction in RT. Knowledge-based planning (KBP) provides a traditional paradigm for dose prediction that leverages planning information from historical patients to predict dosimetry for a new patient [7, 8]. Initially, some handcrafted features related to dosimetry are directly used to match historical patients with new patients, e.g., the overlap volume histogram [9], distance-to-target histogram [10], and planning volume shapes [11]. With the development of machine learning technologies, support vector regression, random forests, and gradient boosting have been used to find more effective features [12, 13, 14]. However, these KBP-based methods typically concentrate on predicting dosimetric endpoints or dose-volume histograms (DVHs), which are dosimetric-related statistical indicators and lack spatial information [5]. Furthermore, the accuracy of methods that rely on traditional feature extraction often falls short of expectations.

Deep learning is currently a popular approach for dose prediction [5, 6, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. With its powerful feature extraction and analysis capabilities, a deep learning model can learn mapping relationships from computed tomography (CT) images and region-of-interest (ROI) masks, i.e., targets and organs-at-risk (OARs), to dose distribution maps. Nguyen et al. [21] and Fan et al. [16] used a two-dimensional (2D) convolutional neural network (CNN) to predict dose distribution slice by slice, with the ultimate objective of achieving three-dimensional (3D) dose prediction. To make full use of spatial information, Kearney et al. [17] proposed DoseNet, a modified 3D UNet, for volumetric dose prediction. Recent research has focused on constructing more effective loss functions to improve the accuracy of dose prediction, e.g., adversarial loss [18] and ROI-weighted loss [5].

Most deep-learning-based dose prediction methods use CT images and ROI masks directly as model inputs; however, they rarely exploit distance information between surrounding tissues and targets or OARs. In RT planning, distance information is important because the relative positions of ROIs determine their mutual influence, e.g., the dose levels of tissues close to targets are relatively large, whereas those of tissues near OARs are relatively small [23, 25]. Although CNNs are capable of implicitly learning distance relationships among ROIs from masks, most deep-learning-based methods use a slice-based or patch-based training strategy owing to memory constraints, which means they are unable to leverage useful information from mask images when the slice or patch contains little or no ROI. While some methods incorporate attention modules into the networks to capture long-range dependencies, the attention of these modules remains confined to slices or blocks [26, 27]. To address this challenge, certain studies have introduced distance information into neural networks through distance maps. For example, Kontaxis et al. [19] fed the distance map from the central beamline to the model to improve the accuracy of dose prediction. However, their method requires knowledge of the parameters of beamlets, which is often not practical in a clinical setting. Zhang et al. [28] introduced a distance image of planning target volumes (PTVs) for esophageal radiotherapy dose prediction, but ignored OARs and other targets. Yue et al. [23] utilized ROI masks for distance map calculation and proposed a normalization technique to rescale the numerical range of the distance map for network training, but their distance map only considered pixel distance in image space, rather than real-world physical distance, and had a limited dynamic range. In addition, these methods simply concatenated the distance maps and CT images as network inputs, restricting the potential performance of their models. Various studies have shown that feature-level fusion is more capable of capturing complex relationships among different modalities than input-level fusion [29, 30, 31].

Ensuring the authenticity and feasibility of the predicted dose distribution map is also a challenge, e.g., it involves ensuring that the radiation paths in the map adhere to straight-line propagation and that the predicted cumulative dose distribution can be achieved within an acceptable number of treatment fields. However, most previous deep-learning-based dose prediction methods only learn the mapping from input to output images and so cannot fully capture the prior data distribution of real dose distribution maps. As a result, the predicted dose distribution maps appear excessively smooth and distort ray path characteristics. To address this problem, Wang et al. [32] proposed a beam-wise dose network to refine the predicted dose within beam masks, but these artificially predefined beam masks derived from the PTV masks are quite coarse. Some studies have adopted frameworks based on generative adversarial networks (GANs) to enhance the realism of the predicted dose distribution map [5, 8], but GANs can be challenging to train and often drop modes in the output distribution [33, 34, 35]. Recently, diffusion models [36, 37] have attracted attention as powerful generative models that are capable of modeling complex image distributions. Diffusion models have been shown to provide superior image sampling quality and more stable training compared with GANs [38]. Conditional diffusion models introduce conditional information to guide the model in generating specific images. Researchers have applied conditional diffusion models to various image-to-image translation tasks, e.g., image inpainting [39], colorization [35], super-resolution [40, 41], and medical image synthesis [42], and achieved state-of-the-art (SOTA) results. Unlike previous methods that learn pixel mappings between input and output, conditional diffusion models can learn the data distribution from training images and sample the images that best match the given conditions based on that distribution. A conditional diffusion model also partitions image-to-image generation into a sequence of denoising steps, which typically recover the general outline first and then produce details. This can be considered a recursive image generation method and has been proven effective by previous studies [43, 44]. Despite these advantages of diffusion models, further exploration is required to fully realize their potential in the task of predicting dose distribution in RT. To the best of our knowledge, only Feng et al. [45] have proposed a diffusion-based dose prediction (DiffDP) model for predicting the radiotherapy dose distribution of cancer patients. However, DiffDP is limited to utilizing 2D in-plane information for slice-based dose prediction.

In this paper, we propose a novel distance-aware conditional diffusion model for dose prediction, named DoseDiff, which takes CT images and signed distance maps (SDMs) as conditions to accurately generate dose distribution maps. Based on the mechanism of the conditional diffusion model, we define dose prediction as a sequence of denoising steps, wherein the predicted dose distribution maps are generated from Gaussian noise images under the guidance of the conditions. The SDMs are obtained by performing a distance transform on the masks and indicate the distance of each voxel in the image from the contours of the ROIs in 3D space. Unlike Yue et al. [23], who simply combined CT images and distance maps through channel concatenation at the input, we propose a multi-encoder and multi-scale fusion network (MMFNet) to enhance information fusion at the feature level. While early fusion methods, such as channel concatenation, are straightforward and efficient, their performance is constrained by the considerable noise and redundant information in low-level features. Extensive experiments have demonstrated that high-level feature fusion is valuable for precise prediction in complex tasks [46, 47]. We design independent encoders for CT images and SDMs and perform information fusion on feature maps at multiple scales, which integrates both low-level and high-level features. A fusion module based on the self-attention mechanism [48, 49] is also included in MMFNet to further fuse global information.

Figure 1: The overall workflow for DoseDiff.

2 Method

2.1 Background

Conditional diffusion models are a class of conditional generative methods that transform a Gaussian distribution into an empirical data distribution [36, 37, 50]. They involve two processes running in opposite directions: the forward (diffusion) process and the reverse process. We denote by $x$ and $y_0$ the condition and target-domain images, respectively. The forward process aims to collapse the target image distribution $p(y_0)$ to a standard Gaussian distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$ by gradually adding Gaussian noise of varying scales to the target image:

$$q\left(y_t \mid y_{t-1}\right)=\mathcal{N}\left(y_t;\sqrt{\alpha_t}\,y_{t-1},\left(1-\alpha_t\right)\mathbf{I}\right); \tag{1}$$
$$q\left(y_t \mid y_0\right)=\mathcal{N}\left(y_t;\sqrt{\bar{\alpha}_t}\,y_0,\left(1-\bar{\alpha}_t\right)\mathbf{I}\right), \tag{2}$$

where $t\in[1,T]$ is the timestep and $T$ denotes the total number of diffusion steps. The scale of the added Gaussian noise is determined by a set of hyper-parameters, $\alpha_t$, that change with $t$, with $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$. Under the theoretical assumption of the diffusion model, when $T$ is sufficiently large, $p(y_T)$ can be approximated as a standard Gaussian distribution [36, 37].
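The closed form in Eq. (2) means that a noisy image at any timestep can be drawn in one step, without iterating Eq. (1). The following NumPy sketch illustrates this. The linear schedule for $1-\alpha_t$, its endpoints, and the function names `make_alpha_bar` and `q_sample` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)   # beta_t = 1 - alpha_t
    alphas = 1.0 - betas                           # alpha_t
    return np.cumprod(alphas)                      # index 0 corresponds to t = 1

def q_sample(y0, t, alpha_bar, rng):
    """Draw y_t ~ q(y_t | y_0) from Eq. (2) in a single step."""
    eps = rng.standard_normal(y0.shape)            # eps ~ N(0, I)
    a = alpha_bar[t - 1]                           # timesteps are 1-based in the text
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps
    return y_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
y0 = rng.uniform(-1.0, 1.0, size=(4, 4))           # toy stand-in for a dose slice
y_T, _ = q_sample(y0, t=1000, alpha_bar=alpha_bar, rng=rng)
```

With this schedule $\bar{\alpha}_T$ is close to zero, so `y_T` is almost pure Gaussian noise, which is the property the forward process relies on.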

The target image is generated by fitting other distributions that are parameterized by $\theta$ in the reverse process:

$$p_\theta\left(y_{t-1} \mid y_t\right)=\mathcal{N}\left(y_{t-1};\mu_\theta\left(y_t,x,t\right),\left(1-\alpha_t\right)\mathbf{I}\right), \tag{3}$$

where $\mu_\theta\left(y_t,x,t\right)$ can be parameterized by a neural network $\epsilon_\theta\left(y_t,x,t\right)$. The objective of training the neural network is to minimize the variational upper bound of the negative log-likelihood:

$$\mathcal{L}_\theta=\mathbb{E}_{y_0,x,t,\epsilon}\left\|\epsilon-\epsilon_\theta\left(y_t,x,t\right)\right\|, \tag{4}$$

where $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
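For reference, the link between the forward process, the mean of the reverse step, and the noise predictor can be written out explicitly. The two identities below are the standard denoising diffusion probabilistic model (DDPM) parameterization; that the model here follows it unchanged is an assumption, since the text only states that the mean can be parameterized by the network.

```latex
% Reparameterized sample from Eq. (2)
y_t = \sqrt{\bar{\alpha}_t}\, y_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

% Mean of the reverse step in Eq. (3), expressed through the noise predictor
\mu_\theta(y_t, x, t) = \frac{1}{\sqrt{\alpha_t}}
\left( y_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,
\epsilon_\theta(y_t, x, t) \right)
```

Under this parameterization, predicting the noise $\epsilon$ accurately is equivalent to predicting the mean of the reverse step, which is why the objective in Eq. (4) compares $\epsilon$ with the network output.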

2.2 DoseDiff

In contrast to previous methods, the conditional diffusion model can model the data distribution, which enables it to capture the dose distribution characteristics of real dose distribution maps. Furthermore, the conditional diffusion model trains more stably and yields higher image quality than GAN-based methods. Therefore, we propose DoseDiff for dose prediction, as shown in Fig. 1. DoseDiff is essentially a conditional diffusion model conditioned on CT images and SDMs, which treats dose prediction as a sequence of denoising steps, i.e., the reverse process. To accommodate the two conditional terms, we redefine the neural network as $\epsilon_\theta\left(y_t,x_{CT},x_{sdm},t\right)$ and accordingly modify the reverse process and objective function in Eqs. 3 and 4:

$$p_\theta\left(y_{t-1} \mid y_t\right)=\mathcal{N}\left(y_{t-1};\mu_\theta\left(y_t,x_{CT},x_{sdm},t\right),\left(1-\alpha_t\right)\mathbf{I}\right), \tag{5}$$
$$\mathcal{L}_\theta=\mathbb{E}_{y_0,x_{CT},x_{sdm},t,\epsilon}\left\|\epsilon-\epsilon_\theta\left(y_t,x_{CT},x_{sdm},t\right)\right\|, \tag{6}$$

where $x_{CT}$ and $x_{sdm}$ are the CT image and SDM, respectively, and $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
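A single-sample estimate of the objective in Eq. (6) can be sketched as follows. Several details are assumptions made for illustration: the norm is taken to be a mean squared error, the timestep is drawn uniformly, the schedule is linear, and `zero_eps_model` is a hypothetical placeholder standing in for the actual network.

```python
import numpy as np

def dosediff_loss(eps_model, y0, x_ct, x_sdm, alpha_bar, rng):
    """Monte Carlo estimate of Eq. (6) for one training sample."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))                   # t ~ Uniform{1, ..., T} (assumed)
    eps = rng.standard_normal(y0.shape)               # eps ~ N(0, I)
    a = alpha_bar[t - 1]
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps    # noisy dose map, Eq. (2)
    eps_hat = eps_model(y_t, x_ct, x_sdm, t)          # network sees both conditions
    return float(np.mean((eps - eps_hat) ** 2))       # squared error (assumed norm)

def zero_eps_model(y_t, x_ct, x_sdm, t):
    """Hypothetical placeholder for the network: always predicts zero noise."""
    return np.zeros_like(y_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))  # assumed schedule
y0 = np.zeros((8, 8))            # toy dose slice
x_ct = np.zeros((8, 8))          # toy CT slice
x_sdm = np.zeros((3, 8, 8))      # toy SDMs for three ROIs
loss = dosediff_loss(zero_eps_model, y0, x_ct, x_sdm, alpha_bar, rng)
```

In practice the placeholder would be replaced by a trainable network and the loss averaged over a mini-batch; the sketch only shows where the two conditions enter the computation.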

The conditional diffusion model assumes that the reverse process is a Markov chain and that the number of diffusion steps $T$ is sufficiently large. Consequently, inference with DoseDiff is computationally expensive. To reduce the inference time, we adopt the accelerated sampling technique of the denoising diffusion implicit model (DDIM) [51]. Based on a non-Markovian hypothesis, the DDIM allows a reduced number of sampling steps (fewer than $T$) in the reverse process. Furthermore, the DDIM requires modifications only to the sampling procedure of the reverse process, with no impact on the training and forward processes. Let $\tau$ represent a sub-sequence of $[1,\ldots,T]$ of length $S$ with $\tau_S=T$; then, the accelerated reverse process of DoseDiff can be expressed as:

$$p_\theta\left(y_{\tau_{i-1}} \mid y_{\tau_i}\right)=\mathcal{N}\left(\sqrt{\bar{\alpha}_{\tau_{i-1}}}\,y_{0|\tau_i}+\sqrt{1-\bar{\alpha}_{\tau_{i-1}}-\sigma_{\tau_i}^2}\cdot\epsilon_\theta\left(y_{\tau_i},x_{CT},x_{sdm},\tau_i\right),\ \sigma_{\tau_i}^2\mathbf{I}\right)\quad\text{if } i\in[S],\ i>1, \tag{7}$$
$$p_\theta\left(y_0 \mid y_t\right)=\mathcal{N}\left(y_{0|t},\sigma_t^2\mathbf{I}\right)\quad\text{otherwise}, \tag{8}$$

where $y_{0|t}=\left(y_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta\left(y_t,x_{CT},x_{sdm},t\right)\right)/\sqrt{\bar{\alpha}_t}$ and $\sigma_t=\sqrt{(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)}\sqrt{1-\bar{\alpha}_t/\bar{\alpha}_{t-1}}$.
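The accelerated reverse process in Eqs. (7) and (8) can be sketched in NumPy as below. This is an illustration under stated assumptions: the sub-sequence $\tau$ is evenly spaced, $\bar{\alpha}_0$ is taken to be 1 for the final step, the scale factor `eta` (with `eta=1.0` matching the $\sigma_t$ given above and `eta=0.0` giving deterministic sampling) is added for convenience, and the zero-noise model is a hypothetical placeholder for the trained network.

```python
import numpy as np

def ddim_step(y_t, eps_hat, a_t, a_prev, rng, eta=1.0):
    """One reverse step from tau_i to tau_{i-1}, following Eqs. (7)-(8).

    a_t and a_prev are alpha_bar at tau_i and tau_{i-1}. Passing a_prev=1.0
    handles the final step, where the result reduces to y_{0|t}.
    """
    y0_pred = (y_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)   # y_{0|t}
    sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
    # Clamp at zero to guard against tiny negative values from rounding.
    dir_coef = np.sqrt(np.maximum(1.0 - a_prev - sigma**2, 0.0))
    mean = np.sqrt(a_prev) * y0_pred + dir_coef * eps_hat
    return mean + sigma * rng.standard_normal(y_t.shape)

def ddim_sample(eps_model, x_ct, x_sdm, alpha_bar, shape, S, rng, eta=1.0):
    """Generate a sample with S <= T steps on an evenly spaced sub-sequence."""
    T = len(alpha_bar)
    tau = np.linspace(T / S, T, S).round().astype(int)   # tau_1 < ... < tau_S = T
    y = rng.standard_normal(shape)                       # y_T ~ N(0, I)
    for i in range(S - 1, -1, -1):
        t = int(tau[i])
        a_t = alpha_bar[t - 1]
        a_prev = alpha_bar[tau[i - 1] - 1] if i > 0 else 1.0
        y = ddim_step(y, eps_model(y, x_ct, x_sdm, t), a_t, a_prev, rng, eta)
    return y

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))   # assumed schedule

# Sanity check: with the exact noise, the final step recovers y_0.
y0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal((4, 4))
y_t = np.sqrt(0.5) * y0 + np.sqrt(0.5) * eps
recovered = ddim_step(y_t, eps, a_t=0.5, a_prev=1.0, rng=rng)

# Hypothetical placeholder network that always predicts zero noise.
sample = ddim_sample(lambda y, ct, sdm, t: np.zeros_like(y),
                     None, None, alpha_bar, shape=(8, 8), S=8, rng=rng)
```

Only the sampling loop changes relative to the full $T$-step procedure; the schedule and the trained noise predictor are reused as they are, which is consistent with the statement that training is unaffected.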

Figure 2: Architecture of MMFNet for dose prediction.

2.3 Signed distance map

Most previous dose prediction methods have directly used masks to provide location and distance information related to ROIs. A limitation of using masks in this way is that current slice-based and patch-based CNN training strategies cannot guarantee that all foreground regions in the mask images are completely included in the sampled images. Moreover, although CNNs are capable of implicitly learning distance relationships among ROIs from mask images, their convolution kernels can only extract local features, and thus they may struggle to perceive long-range distance relationships. In contrast, a distance map provides the distance from each voxel to the contour of the ROI in the 3D space of the image. Each value in the distance map conveys distance information by itself, regardless of incomplete sampling of the image or the limited receptive field of CNNs. The SDM additionally uses the sign to indicate whether a voxel is inside or outside the ROI.

Let Mcsubscript𝑀𝑐M_{c}italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT denote the mask image, where c[1,C]𝑐1𝐶c\in[1,C]italic_c ∈ [ 1 , italic_C ] and C𝐶Citalic_C is the total number of targets and OARs. We denote by Mcinsuperscriptsubscript𝑀𝑐𝑖𝑛M_{c}^{in}italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n end_POSTSUPERSCRIPT, Mcoutsuperscriptsubscript𝑀𝑐𝑜𝑢𝑡M_{c}^{out}italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT, and Mcsubscript𝑀𝑐\partial M_{c}∂ italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT inside, outside, and boundary of the ROI, respectively. The SDM in image space (ISDM) is defined as:

xcisdm={infvMc𝒟i(u,v),uMcin0,uMcinfvMc𝒟i(u,v),uMcout,superscriptsubscript𝑥𝑐𝑖𝑠𝑑𝑚casessubscriptinfimum𝑣subscript𝑀𝑐subscript𝒟𝑖𝑢𝑣𝑢superscriptsubscript𝑀𝑐𝑖𝑛0𝑢subscript𝑀𝑐subscriptinfimum𝑣subscript𝑀𝑐subscript𝒟𝑖𝑢𝑣𝑢superscriptsubscript𝑀𝑐𝑜𝑢𝑡\displaystyle x_{c}^{isdm}=\left\{\begin{array}[]{c}\inf_{v\in\partial M_{c}}% \mathcal{D}_{i}(u,v),u\in M_{c}^{in}\\ 0,u\in\partial M_{c}\\ -\inf_{v\in\partial M_{c}}\mathcal{D}_{i}(u,v),u\in M_{c}^{out}\end{array}% \right.,italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_s italic_d italic_m end_POSTSUPERSCRIPT = { start_ARRAY start_ROW start_CELL roman_inf start_POSTSUBSCRIPT italic_v ∈ ∂ italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_u , italic_v ) , italic_u ∈ italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 0 , italic_u ∈ ∂ italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL - roman_inf start_POSTSUBSCRIPT italic_v ∈ ∂ italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_u , italic_v ) , italic_u ∈ italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT end_CELL end_ROW end_ARRAY , (9)

where $\mathcal{D}_{i}(u,v)=\sqrt{(i_{u}-i_{v})^{2}+(j_{u}-j_{v})^{2}+(k_{u}-k_{v})^{2}}$ is the Euclidean distance between voxels $u$ and $v$ in image space, where $(i_{u},j_{u},k_{u})$ and $(i_{v},j_{v},k_{v})$ are voxel coordinates of $u$ and $v$. However, owing to inconsistent voxel resolution, ISDM values are not comparable between different images and have no practical connotation.
Therefore, we consider spacing as a factor when calculating the SDM, transforming the ISDM from image space to physical space (PSDM): $\mathcal{D}_{i}(u,v)\rightarrow\mathcal{D}_{p}(u,v,\textbf{s})=\sqrt{i_{\textbf{s}}(i_{u}-i_{v})^{2}+j_{\textbf{s}}(j_{u}-j_{v})^{2}+k_{\textbf{s}}(k_{u}-k_{v})^{2}}$, where $\textbf{s}(i_{\textbf{s}},j_{\textbf{s}},k_{\textbf{s}})$ is the voxel spacing of the image.
Finally, the input SDM $x_{sdm}$ of the neural network $\epsilon_{\theta}\left(y_{t},x_{CT},x_{sdm},t\right)$ is a collection of the SDMs of all ROIs:

$$x_{sdm}=\left\{x_{1}^{psdm},x_{2}^{psdm},...,x_{C-1}^{psdm},x_{C}^{psdm}\right\}. \tag{10}$$
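As a rough illustration of the difference between ISDM and PSDM, the following brute-force sketch computes a signed distance map for a toy mask. The sign convention (negative inside the ROI, positive outside, zero on the boundary), the 6-connected boundary definition, and the choice to multiply each index difference by its spacing before squaring (the usual anisotropic Euclidean distance) are assumptions of this sketch; the function name `signed_distance_map` and the toy spacing values are hypothetical.

```python
import itertools
import math


def signed_distance_map(mask, spacing=(1.0, 1.0, 1.0)):
    """Brute-force signed distance map of a 3D binary mask given as nested lists.

    Assumptions of this sketch (not taken from the paper): values are negative
    inside the ROI, positive outside, and zero on boundary voxels; each index
    difference is multiplied by its spacing before squaring.
    """
    ni, nj, nk = len(mask), len(mask[0]), len(mask[0][0])
    voxels = list(itertools.product(range(ni), range(nj), range(nk)))

    def inside(i, j, k):
        # Positions beyond the grid count as outside the ROI.
        return 0 <= i < ni and 0 <= j < nj and 0 <= k < nk and bool(mask[i][j][k])

    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    # Boundary voxels: ROI voxels with at least one 6-connected neighbour outside.
    boundary = [
        v for v in voxels
        if inside(*v) and any(not inside(v[0] + a, v[1] + b, v[2] + c) for a, b, c in steps)
    ]
    if not boundary:
        raise ValueError("mask contains no ROI voxels")

    def distance(u, v):
        return math.sqrt(sum((s * (a - b)) ** 2 for s, a, b in zip(spacing, u, v)))

    sdm = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    for u in voxels:
        d = min(distance(u, v) for v in boundary)
        sdm[u[0]][u[1]][u[2]] = -d if inside(*u) else d
    return sdm


# Toy volume: 3 slices of 5 x 5 pixels with a single ROI voxel in the centre.
mask = [[[0] * 5 for _ in range(5)] for _ in range(3)]
mask[1][2][2] = 1

isdm = signed_distance_map(mask)                             # unit spacing
psdm = signed_distance_map(mask, spacing=(5.0, 0.95, 0.95))  # mm, slice axis first
```

With unit spacing, the neighbouring voxel in the adjacent slice and the neighbouring pixel in the same slice are both one unit away; with the toy physical spacing they are 5.0 mm and 0.95 mm away, respectively. The brute-force search is only practical for tiny volumes; a production pipeline would use a dedicated distance transform.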

2.4 MMFNet

The conditions of DoseDiff contain multimodal images, i.e., the CT image and SDM. To effectively extract and fuse their features, we propose MMFNet for $\epsilon_{\theta}$, as shown in Fig. 2. MMFNet uses three encoders to extract the features of $x_{CT}$, $x_{sdm}$, and $y_{t}$, respectively. The encoder part accomplishes multi-scale information fusion by summing feature maps of the same resolution from the three inputs at every level. Our encoders and decoder are implemented in the same way as those of [38], with four down-sampling or up-sampling operations. Each level of the encoders and decoder contains two residual blocks (Res-blocks), each of which is composed of one group normalization (GN) layer [52], two sigmoid linear units, two $3\times 3$ convolution layers, one adaptive GN (AdaGN) layer [38], and one residual connection. As in [38], the timestep $t$ is embedded in each Res-block through AdaGN, which is defined as:

$$\operatorname{AdaGN}\left(f,\mathbf{e_{t}}\right)=e_{t}^{s}\mathrm{GN}(f)+e_{t}^{b}, \tag{11}$$

where $f$ is the input feature map and $\mathbf{e_{t}}=[e_{t}^{s},e_{t}^{b}]$ is obtained from a linear projection of the timestep $t$.
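Eq. (11) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' PyTorch implementation: the feature map is a list of channels with flattened spatial values, GN is applied without learnable affine parameters, and the timestep is assumed to arrive as an embedding vector `t_emb` that a linear layer projects to per-channel scale and shift. The names `group_norm`, `linear`, and `ada_gn` and all toy weights are hypothetical.

```python
import math


def group_norm(f, num_groups, eps=1e-5):
    """Normalize each group of channels to zero mean and unit variance.

    f is a list of C channels, each a flat list of spatial values.
    """
    size = len(f) // num_groups
    out = []
    for g in range(num_groups):
        group = f[g * size:(g + 1) * size]
        vals = [x for ch in group for x in ch]
        mean = sum(vals) / len(vals)
        var = sum((x - mean) ** 2 for x in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(x - mean) * inv_std for x in ch] for ch in group])
    return out


def linear(x, weight, bias):
    """Dense layer: weight has one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def ada_gn(f, t_emb, weight, bias, num_groups):
    """AdaGN(f, e_t) = e_t^s * GN(f) + e_t^b with e_t = linear(t_emb)."""
    c = len(f)
    e_t = linear(t_emb, weight, bias)   # length 2C: scales first, then shifts
    scale, shift = e_t[:c], e_t[c:]
    normed = group_norm(f, num_groups)
    return [[s * x + b for x in ch] for ch, s, b in zip(normed, scale, shift)]


# Toy example: two channels, two spatial positions, a 1-D timestep embedding.
f = [[1.0, 2.0], [3.0, 4.0]]
weight = [[2.0], [2.0], [0.5], [0.5]]   # maps t_emb to [scale_0, scale_1, shift_0, shift_1]
bias = [0.0, 0.0, 0.0, 0.0]
out = ada_gn(f, [1.0], weight, bias, num_groups=1)
```

In the toy example every channel is scaled by 2 and shifted by 0.5 after normalization, so the output has mean 0.5 and a variance close to 4.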

We propose a transformer-based fusion module, fusionFormer, for further global information fusion. Unlike the convolution layer, which extracts local features, the transformer [48, 49] uses an attention mechanism to capture global context information. A currently popular approach to enhance model performance is to combine the strengths of CNNs and transformers for local and global feature extraction [53, 54]. To limit memory cost, fusionFormer is added only to the path at the lowest resolution level. In fusionFormer, the feature maps of $x_{CT}$, $x_{sdm}$, and $y_{t}$ are partitioned into $2\times 2$ non-overlapping patches, i.e., the channels remain unchanged while the maps are evenly sliced along the length and height dimensions. The patch size is associated with the size of the original input image. The three feature maps are then transformed into the three fundamental components of attention using an embedding layer consisting of linear and normalization layers: the query ($f_{q}$), key ($f_{k}$), and value ($f_{v}$). As both $x_{CT}$ and $x_{sdm}$ are conditions, we use their feature maps as the query and key (or vice versa) to calculate the attention matrix. The attention matrix is then used to weight the feature map of the noisy dose distribution map, enabling conditional guidance. The attention block in the transformer can be expressed as follows:

$$f_{att}=\operatorname{Softmax}\left(\beta\left(f_{q}\right)\times\gamma\left(f_{k}\right)^{\mathrm{T}}\right)\times\delta\left(f_{v}\right)+\delta\left(f_{v}\right), \tag{12}$$

where $\beta(\cdot)$, $\gamma(\cdot)$, and $\delta(\cdot)$ are linear operations. Then, a feed-forward block is used for the nonlinear transformation of features:

$$f_{ffb}=\operatorname{MLP}\left(\operatorname{LN}\left(f_{att}\right)\right)+f_{att}, \tag{13}$$

where $\operatorname{MLP}(\cdot)$ is a multi-layer perceptron consisting of two linear layers and a nonlinear activation function and $\operatorname{LN}(\cdot)$ is layer normalization. Last, a linear layer is used to adjust the dimensionality of the feature vector to the original size; it is then reshaped to be the same shape as the input feature maps.
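Eqs. (12) and (13) can be traced with a small list-based sketch. It assumes token matrices of shape (tokens, features), bias-free linear maps for $\beta$, $\gamma$, and $\delta$, layer normalization without learnable parameters, and ReLU as the MLP activation (the text does not name the activation). All function names and toy weights are hypothetical, and the patch embedding and final reshaping steps are omitted.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)                          # subtract the max for numerical stability
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out


def layer_norm(a, eps=1e-5):
    out = []
    for row in a:
        mu = sum(row) / len(row)
        var = sum((x - mu) ** 2 for x in row) / len(row)
        out.append([(x - mu) / math.sqrt(var + eps) for x in row])
    return out


def attention_block(f_q, f_k, f_v, w_beta, w_gamma, w_delta):
    """Eq. (12): Softmax(beta(f_q) gamma(f_k)^T) delta(f_v) + delta(f_v)."""
    q, k, v = matmul(f_q, w_beta), matmul(f_k, w_gamma), matmul(f_v, w_delta)
    attn = softmax_rows(matmul(q, transpose(k)))
    return add(matmul(attn, v), v)


def feed_forward_block(f_att, w1, w2):
    """Eq. (13): MLP(LN(f_att)) + f_att, with ReLU assumed as the activation."""
    hidden = [[max(0.0, x) for x in row] for row in matmul(layer_norm(f_att), w1)]
    return add(matmul(hidden, w2), f_att)


# Toy example: two tokens with two features each.
identity = [[1.0, 0.0], [0.0, 1.0]]
f_ct = [[0.0, 0.0], [0.0, 0.0]]      # condition features used as the query
f_sdm = [[0.0, 0.0], [0.0, 0.0]]     # condition features used as the key
f_dose = [[1.0, 0.0], [0.0, 1.0]]    # noisy dose features used as the value
f_att = attention_block(f_ct, f_sdm, f_dose, identity, identity, identity)
```

With all-zero queries and keys the attention weights are uniform, so each output token is the average of the value tokens plus its own value token; the conditions therefore steer the dose features only through the attention weights, which matches the role described for the query and key.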

2.5 Implementation details

Our network was implemented in PyTorch, and the code was run on a server with two RTX 2080Ti GPUs. Following [36], we set the total number of diffusion steps ($T$) to 1000 and the forward process variances $\alpha_{t}$ to constants decreasing linearly from $\alpha_{1}=0.9999$ to $\alpha_{T}=0.98$. To ensure the input value of the model fell within a suitable range, we set the distance unit for the PSDM to decimeters. We used the AdamW [55] optimizer with a step-decay learning rate to train our model for one million iterations. The initial learning rate and batch size were 0.0001 and 8, respectively. Some simple online data augmentation operators, including random flipping, rotating, and zooming, were used on the training set to improve the generalization capacity of the model. To reduce the inference time, we set the number of generation steps $S$ in DDIM to 8. The full implementation is available at https://github.com/whisney/DoseDiff.

3 Experiments

3.1 Datasets and Preprocessing

3.1.1 In-house Datasets

The breast cancer and nasopharyngeal cancer (NPC) datasets were constructed by collecting data for patients receiving RT at Guangdong Provincial Hospital of Traditional Chinese Medicine, China, from 2016 to 2020. The patients in the breast cancer dataset underwent postoperative RT following breast-conserving surgery, whereas the patients in the NPC dataset received RT for primary lesions. The CT volumes were obtained using a Siemens Sensation Open (Siemens Healthcare, Forchheim, Germany) scanner. RT plans were obtained from a treatment-planning system (Philips, Pinnacle3, Netherlands). All the targets and OARs were delineated by experienced oncologists, and all the plans were clinically approved. The details of the breast cancer and NPC datasets are as follows.

Breast cancer dataset: The breast cancer dataset consisted of data from 119 patients. The in-plane pixel spacings of the CT images ranged from 0.77 to 0.97 mm with an average of 0.95 mm, and the slice thickness was 5.0 mm for all images. The in-plane resolution was $512\times 512$, and the number of slices ranged from 50 to 129. We used a binary body mask to exclude unnecessary background from each image; all the images were cropped to the minimal bounding box of the body mask and then resized to $320\times 192$ in-plane. The target areas, namely the tumor bed (TB) and clinical target volume (CTV), and the OARs, namely the heart, spinal cord, and left and right lungs, were included in our experiments.

NPC dataset: The NPC dataset consisted of data from 139 patients. The in-plane pixel spacings of the CT images ranged from 0.75 to 0.97 mm with an average of 0.95 mm, and the slice thickness was 3.0 mm. The in-plane resolution was $512\times 512$, and the number of slices ranged from 69 to 176. All images were likewise masked and cropped using the body masks but then resized to $448\times 224$ in-plane. The target areas included the gross tumor volume (GTV) and planning target volume (PTV), and the OARs included the eyes, optic nerve, temporal lobe, brain stem, parotid gland, mandible, and spinal cord.

The intensity ranges for CT images and dose distribution maps were set to $[-1000,1500]$ HU and $[0,75]$ Gy, respectively. Both were uniformly and linearly normalized to $[-1,1]$ for training. Both datasets were split into training, validation, and test sets using a patient-wise ratio of $7:1:2$; these sets were used for model training, model selection, and performance evaluation, respectively.
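The linear normalization described above can be written as a short sketch. Clipping values that fall outside the stated range before rescaling is an assumption of this example, as are the function names `normalize` and `denormalize`.

```python
def normalize(values, lo, hi):
    """Clip to [lo, hi] (an assumption) and map linearly to [-1, 1]."""
    return [2.0 * (min(max(v, lo), hi) - lo) / (hi - lo) - 1.0 for v in values]


def denormalize(values, lo, hi):
    """Map values in [-1, 1] back to the physical range [lo, hi]."""
    return [(v + 1.0) / 2.0 * (hi - lo) + lo for v in values]


ct_hu = [-1000.0, 250.0, 1500.0, 3000.0]
ct_norm = normalize(ct_hu, -1000.0, 1500.0)      # CT range in HU

dose_gy = [0.0, 37.5, 75.0]
dose_norm = normalize(dose_gy, 0.0, 75.0)        # dose range in Gy
dose_back = denormalize(dose_norm, 0.0, 75.0)    # predicted maps are converted back to Gy
```

The inverse mapping is what turns a network output in $[-1,1]$ back into a dose in Gy before any dosimetric metric is computed.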

3.1.2 Public Dataset

We utilized the public head and neck cancer dataset from the AAPM OpenKBP challenge [56], which contains 200 training cases, 40 validation cases, and 100 testing cases. The RT plans were prescribed 70, 63, and 56 Gy in 35 fractions to the gross disease ($\mathrm{PTV}_{70}$), intermediate-risk target volumes ($\mathrm{PTV}_{63}$), and elective target volumes ($\mathrm{PTV}_{56}$), respectively. The spacing and size of all images were consistent at 3.906 mm $\times$ 3.906 mm $\times$ 2.5 mm and $128\times 128\times 128$, respectively. The ROIs included the PTVs as targets and the brain stem, spinal cord, parotid gland, larynx, esophagus, and mandible as OARs.

Table 1: Results of ablation studies for DoseDiff on the breast cancer dataset.
UNet Diffusion model PSDM MS FF MAE (Gy)↓ SSIM↑ PSNR (dB)↑ Dose score (Gy)↓ Volume score (%)↓
✓ 3.416±1.741 0.797±0.077 19.675±4.175 11.206±7.456 18.910±9.909
✓ ✓ 2.863±1.199 0.811±0.059 20.900±3.570 9.228±5.114 15.039±8.963
✓ ✓ 2.984±1.392 0.813±0.059 20.539±3.680 8.369±5.129 15.981±8.909
✓ ✓ 1.438±0.224 0.896±0.032 26.662±1.240 1.880±0.611 11.301±8.167
✓ ✓ ✓ 1.420±0.211 0.890±0.034 26.710±1.140 1.150±0.333 1.718±1.357
✓ ✓ ✓ ✓ 1.221±0.241 0.913±0.030 27.813±1.649 1.047±0.438 1.256±0.740
✓ 2.937±1.132 0.807±0.052 20.270±3.134 10.391±6.376 19.413±7.791
✓ ✓ 2.654±1.156 0.820±0.054 21.333±3.475 8.317±5.361 14.790±9.077
✓ ✓ 2.493±1.011 0.819±0.044 22.114±3.039 7.047±6.557 13.143±8.746
✓ ✓ 1.223±0.206 0.903±0.209 27.710±1.539 1.575±0.440 7.031±7.549
✓ ✓ ✓ 1.201±0.208 0.906±0.028 27.700±1.485 0.848±0.233 1.138±0.563
✓ ✓ ✓ ✓ 1.190±0.227 0.900±0.028 28.118±1.699 0.698±0.131 0.922±0.296
Table 2: Quantitative comparison of DoseDiff with original mask, TSBDM, ISDM, and PSDM input, respectively.
Input MAE (Gy)↓ SSIM↑ PSNR (dB)↑ Dose score (Gy)↓ Volume score (%)↓
Mask 1.536±0.244 0.859±0.031 25.632±1.271 0.856±0.238 0.921±0.372
TSBDM 1.287±0.219 0.859±0.029 28.099±1.673 0.788±0.238 0.928±0.304
ISDM 1.248±0.214 0.878±0.028 28.042±1.632 0.705±0.181 0.928±0.300
PSDM 1.190±0.227 0.900±0.028 28.118±1.699 0.698±0.131 0.922±0.296

3.2 Evaluation Metrics

To evaluate the model's performance quantitatively, we used both common image similarity metrics and dosimetry-related metrics. The image similarity metrics included the mean absolute error (MAE), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR):

$$\operatorname{MAE}(u,v)=\frac{1}{N}\sum_{i=1}^{N}\left|u_{i}-v_{i}\right|, \tag{14}$$
$$\operatorname{SSIM}(u,v)=\frac{1}{M}\sum_{j=1}^{M}\frac{\left(2\mu_{u,j}\mu_{v,j}+c_{1}\right)\left(2\sigma_{uv,j}+c_{2}\right)}{\left(\mu_{u,j}^{2}+\mu_{v,j}^{2}+c_{1}\right)\left(\sigma_{u,j}^{2}+\sigma_{v,j}^{2}+c_{2}\right)}, \tag{15}$$
$$\operatorname{PSNR}(u,v)=10\log_{10}\left(\frac{c_{range}^{2}}{\frac{1}{N}\sum_{i=1}^{N}\left(u_{i}-v_{i}\right)^{2}}\right), \tag{16}$$

where $N$ is the total number of voxels; $u_{i}$ and $v_{i}$ are the $i^{th}$ voxel values in volumes $u$ and $v$, respectively. $M$ is the number of local windows in the image. The symbols $\mu_{u,j}$, $\mu_{v,j}$, $\sigma_{u,j}$, $\sigma_{v,j}$, and $\sigma_{uv,j}$ denote the local means, standard deviations, and covariance of local window $j$ in volumes $u$ and $v$. $c_{range}$ denotes the distance between the minimum and maximum possible values of the input images, which is set to 75 in our experiments. Following [57], $c_{1}=(0.01c_{range})^{2}$ and $c_{2}=(0.03c_{range})^{2}$ are constants to stabilize the division.
To eliminate the influence of the background, the mean values over pixels contained within the body masks were computed for these three metrics.
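A minimal sketch of the masked MAE and PSNR computations follows, using flattened volumes. The function names are hypothetical, and the value passed as `c_range` in the example (75.0, the width of the dose range stated for the in-house datasets) is an assumption of this example rather than a confirmed setting.

```python
import math


def masked_mae(u, v, mask):
    """Mean absolute error over the voxels where mask is non-zero (Eq. 14)."""
    diffs = [abs(a - b) for a, b, m in zip(u, v, mask) if m]
    return sum(diffs) / len(diffs)


def masked_psnr(u, v, mask, c_range):
    """PSNR in dB over the voxels where mask is non-zero (Eq. 16)."""
    squared = [(a - b) ** 2 for a, b, m in zip(u, v, mask) if m]
    mse = sum(squared) / len(squared)
    return 10.0 * math.log10(c_range ** 2 / mse)


# Flattened toy volumes in Gy; the last voxel lies outside the body mask.
pred = [10.0, 20.0, 30.0, 0.0]
true = [12.0, 20.0, 27.0, 50.0]
body = [1, 1, 1, 0]

mae = masked_mae(pred, true, body)
psnr = masked_psnr(pred, true, body, c_range=75.0)
```

Because the last voxel is masked out, its large error of 50 Gy does not affect either metric, which is the purpose of restricting the computation to the body mask.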

The dosimetry-related metrics included $\triangle D_{V}$ and $\triangle V_{D}$. The symbol $\triangle$ denotes the absolute error of a dosimetric metric between the predicted and ground-truth dose distribution maps. $D_{V}$ denotes the dose received by $V\%$ of the volume of the ROI. $D_{min}$, $D_{mean}$, and $D_{max}$, indicating the minimum, average, and maximum doses in the ROI, respectively, were also considered to be members of $D_{V}$. For the treatment targets, $V_{D}$ was the volume percentage of the target region receiving at least $D\%$ of the prescribed dose; for OARs, $V_{D}$ was the volume percentage of the ROI receiving a dose of at least $D$ Gy. In clinical practice, different indicators are considered for different ROIs. Therefore, we assigned appropriate evaluation metrics for different ROIs in the two datasets.
In the breast cancer dataset, we used $\triangle D_{95}$ and $\triangle V_{95}$ for TB and CTV; $\triangle D_{mean}$ and $\triangle V_{30}$ for heart; $\triangle D_{mean}$ and $\triangle V_{20}$ for ipsilateral lung; and $\triangle D_{max}$ for spinal cord. In the NPC dataset, we used $\triangle D_{95}$ and $\triangle V_{95}$ for GTV and PTV; $\triangle D_{max}$ for eyes, optic nerve, temporal lobe, brain stem, and spinal cord; $\triangle D_{mean}$ and $\triangle V_{30}$ for parotid gland; and $\triangle D_{min}$ for mandible. To facilitate model comparison and selection, we defined the mean values of $\triangle D_{V}$ and $\triangle V_{D}$ for all ROIs as the "dose score" and "volume score", respectively.
The DVH was used for a comprehensive comparison of the differences between the predicted results obtained with different methods and the real dose distribution.
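The dosimetric quantities can be illustrated with a small sketch over the voxel doses of one ROI. The nearest-rank rule used for $D_{V}$, the "at least" thresholding for $V_{D}$, the 50 Gy prescription, and all function names are assumptions of this example; clinical software may interpolate the DVH differently.

```python
import math


def dose_at_volume(doses, v_percent):
    """D_V: highest dose level received by at least V% of the ROI voxels (nearest rank)."""
    ranked = sorted(doses, reverse=True)
    count = max(math.ceil(v_percent * len(ranked) / 100), 1)
    return ranked[count - 1]


def volume_at_dose(doses, threshold_gy):
    """V_D: percentage of ROI voxels receiving at least threshold_gy."""
    return 100.0 * sum(d >= threshold_gy for d in doses) / len(doses)


def mean_abs_error(pred_values, true_values):
    """Average absolute error over ROIs, as used for the dose and volume scores."""
    return sum(abs(p - t) for p, t in zip(pred_values, true_values)) / len(true_values)


# Toy target ROI with 20 voxels; the prediction is 1 Gy too low everywhere.
true_dose = [float(d) for d in range(41, 61)]
pred_dose = [d - 1.0 for d in true_dose]
prescription_gy = 50.0                      # hypothetical prescription
threshold = 0.95 * prescription_gy          # V95 threshold for a target

delta_d95 = abs(dose_at_volume(pred_dose, 95) - dose_at_volume(true_dose, 95))
delta_v95 = abs(volume_at_dose(pred_dose, threshold) - volume_at_dose(true_dose, threshold))

# With several ROIs, the scores average the per-ROI errors.
dose_score = mean_abs_error([dose_at_volume(pred_dose, 95)], [dose_at_volume(true_dose, 95)])
```

Sweeping `threshold_gy` over the full dose range and plotting `volume_at_dose` against it yields the cumulative DVH of the ROI.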

The dose score and DVH score were employed to evaluate the results on the OpenKBP dataset, using the evaluation code (https://github.com/ababier/open-kbp) provided directly by the event organizers.


Figure 3: Visual comparison of the mask, TSBDM, ISDM, and PSDM with respect to the ROIs of breast cancer.

3.3 Ablation studies for DoseDiff

We conducted ablation studies to quantitatively validate the contribution of each component of DoseDiff. Specifically, we sequentially added the PSDM input, multi-scale fusion (MS), and the fusionFormer module (FF) to the baseline conditional diffusion model, which initially used only CT images as input. Moreover, we added the proposed elements to a baseline UNet model to further verify their effectiveness. As shown in Table 1, introducing PSDM, MS, or FF individually improved performance, with PSDM contributing the largest gain. Although multi-scale fusion led to limited improvement in the common image similarity metrics, it significantly enhanced the dosimetric metrics. Compared with the conditional diffusion model baseline, DoseDiff led to improvements of 1.747 Gy in MAE, 0.093 in SSIM, 7.848 dB in PSNR, 9.693 Gy in dose score, and 18.491% in volume score. Furthermore, the experimental results demonstrate that the proposed elements were effective in both the UNet and conditional diffusion model frameworks, with the conditional diffusion model generally outperforming UNet. The results of the ablation experiments indicate that incorporating distance information and enhancing the fusion strategy help deep-learning models predict dose distributions accurately.


Figure 4: Performance and inference time of different generation steps based on DDIM reverse process. Gray dotted lines indicate the metrics of DoseDiff without DDIM ($T=1000$).

3.4 Performance of PSDM

To verify the benefit of the proposed PSDM, we compared DoseDiff implementations with different distance information inputs, including the mask, transformed signed boundary distance map (TSBDM) [23], ISDM, and PSDM. Note that we shrank all ISDM values by a factor of 100 to stabilize the training. Table 2 shows the quantitative comparison results, demonstrating that PSDM achieved the best overall performance. A visual comparison of the mask, TSBDM, ISDM, and PSDM on the ROIs of breast cancer is presented in Fig. 3. As shown, the distance maps provided more distinct and effective distance information than the mask images. However, the voxel-wise transformation in TSBDM quickly zeroed out the pixels far from the contour, rendering many background pixels ineffective. In addition, the dynamic change in distance was not obvious in TSBDM. PSDM converts the image distance to the real-world distance, which enables the model to extract the distance map information in different volumes under a consistent measure. PSDM is also more practical than ISDM in the clinic: for example, ISDM equates the in-plane and inter-slice pixel distances, whereas the slice thickness of CT images is usually larger than the in-plane spacing.


Figure 5: Boxplots of dose prediction results using compared methods on the breast cancer dataset.
Table 3: Quantitative comparison with SOTA methods for dose prediction on breast cancer dataset.
ROI Metrics DeepLabV3+ DoseNet DoseGAN MCGAN DGDL C3D DiffDP DoseDiff (Ours)
Body MAE (Gy)↓ 1.602±0.307 1.385±0.256 1.706±0.268 1.510±0.268 1.330±0.242 1.195±0.256 1.455±0.289 1.076±0.232
SSIM↑ 0.851±0.034 0.857±0.030 0.821±0.031 0.858±0.026 0.880±0.027 0.913±0.027 0.875±0.027 0.913±0.024
PSNR (dB)↑ 25.861±1.367 28.555±1.452 26.121±1.312 25.287±1.305 28.425±1.376 27.981±1.476 25.411±1.411 28.887±1.481
TB $\triangle D_{95}$ (Gy)↓ 0.794±0.737 0.623±0.416 0.854±0.759 1.204±0.669 1.029±0.657 0.567±0.404 1.183±0.665 1.035±0.602
$\triangle V_{95}$ (%)↓ 0.098±0.254 0.029±0.074 0.503±0.528 0.037±0.061 0.185±0.327 0.031±0.074 0.107±0.235 0.084±0.194
CTV $\triangle D_{95}$ (Gy)↓ 0.713±0.387 1.115±0.420 0.620±0.299 0.844±0.332 1.038±0.385 0.975±0.404 0.378±0.313 0.328±0.261
$\triangle V_{95}$ (%)↓ 0.377±0.301 0.562±0.463 0.495±0.448 0.593±0.462 0.582±0.456 0.557±0.466 0.290±0.411 0.201±0.214
Heart $\triangle D_{mean}$ (Gy)↓ 1.289±1.408 1.120±1.454 1.503±1.802 0.996±1.330 1.293±1.596 1.191±1.537 1.044±1.179 0.938±1.061
$\triangle V_{30}$ (%)↓ 1.679±2.541 2.021±3.549 2.094±3.775 1.081±1.810 2.026±3.645 2.231±3.819 1.337±2.091 1.315±2.026
Ipsilateral lung $\triangle D_{mean}$ (Gy)↓ 1.179±0.986 0.910±0.704 1.491±1.069 1.071±0.787 1.105±0.927 1.011±0.741 0.818±0.717 0.872±0.737
$\triangle V_{20}$ (%)↓ 1.566±1.465 1.709±1.269 1.772±1.695 2.092±1.325 1.868±1.246 1.508±1.185 1.229±1.034 1.345±1.021
Spinal cord $\triangle D_{max}$ (Gy)↓ 1.037±1.050 1.376±1.317 1.557±1.273 1.183±1.018 1.238±1.284 1.308±1.307 1.044±1.016 0.877±0.812


Figure 6: Visual comparison of predicted dose distribution maps obtained with different methods on the breast cancer dataset. Red dashed arrows represent the estimated input directions of certain rays identified from the real dose distribution maps.


Figure 7: DVHs of the dose distribution maps predicted by different methods for a breast cancer patient in the test set. The solid line represents the ground-truth DVH; the dotted line represents the DVH of the predicted dose distribution map.

3.5 Trade-off between performance and inference speed

In diffusion models, the generative process is defined as the reverse of a particular Markovian diffusion process. Generating an image therefore requires running the complete sampling sequence, which is time-consuming. DDIM replaces the Markovian diffusion process with a non-Markovian one, which allows computation to be traded for sample quality. Based on DDIM, we compared the performance and inference time of DoseDiff across varying numbers of generation steps. As shown in Fig. 4, the performance of the model improved with more generation steps but stabilized after eight steps. The inference time was approximately proportional to the number of generation steps. To balance performance and inference time, we suggest setting the number of generation steps to 8 in the DDIM reverse process, resulting in an inference time of 4.7 s per volume; our subsequent experiments all adopt this setting.
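
The step-skipping idea can be illustrated with a minimal NumPy sketch of DDIM-style sampling over a short sub-sequence of timesteps. This is not the DoseDiff implementation: the linear noise schedule, the 1000 training steps, the `eps_model(x, cond, t)` signature, and the `eta` parameter are all assumptions made for illustration, and the update rule follows the generic DDIM formulation rather than the paper's own equations.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; returns the cumulative products alpha_bar_t.
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def ddim_timesteps(num_train_steps, num_sample_steps):
    # Evenly spaced sub-sequence of training timesteps, from noisiest to cleanest.
    ts = np.linspace(0, num_train_steps - 1, num_sample_steps).round().astype(int)
    return ts[::-1]

def ddim_sample(eps_model, cond, shape, alpha_bar, timesteps, eta=0.0, rng=None):
    # eps_model is a hypothetical noise predictor conditioned on `cond`.
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, cond, t)
        # Estimate of the clean image implied by the current noise prediction.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # eta = 0 gives a deterministic update; eta > 0 re-injects Gaussian noise.
        sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
        direction = np.sqrt(np.maximum(1.0 - a_prev - sigma**2, 0.0)) * eps
        noise = sigma * rng.standard_normal(shape) if sigma > 0 else 0.0
        x = np.sqrt(a_prev) * x0 + direction + noise
    return x
```

With `ddim_timesteps(1000, 8)` the network is evaluated only eight times per sample, so the cost grows linearly with the number of steps, in line with the roughly proportional inference time reported above.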

Table 4: Quantitative comparison with SOTA methods for dose prediction on the NPC dataset.

| ROI | Metric | DeepLabV3+ | DoseNet | DoseGAN | MCGAN | DGDL | C3D | DiffDP | DoseDiff (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Body | MAE (Gy) ↓ | 2.075±0.309 | 1.923±0.346 | 2.495±0.435 | 1.973±0.359 | 1.926±0.346 | 1.851±0.378 | 1.965±0.424 | 1.676±0.387 |
| | SSIM ↑ | 0.827±0.042 | 0.839±0.034 | 0.776±0.082 | 0.856±0.030 | 0.837±0.044 | 0.860±0.031 | 0.867±0.037 | 0.905±0.018 |
| | PSNR (dB) ↑ | 25.880±1.048 | 26.430±1.304 | 24.592±1.044 | 25.580±1.071 | 26.445±1.191 | 26.256±1.341 | 25.526±1.254 | 26.397±1.249 |
| GTV | $\triangle D_{95}$ (Gy) ↓ | 0.758±0.585 | 0.596±0.475 | 0.858±0.684 | 0.756±0.518 | 0.637±0.374 | 1.831±0.553 | 1.777±1.282 | 1.243±0.677 |
| | $\triangle V_{95}$ (%) ↓ | 0.494±0.488 | 0.500±0.548 | 0.889±0.975 | 0.430±0.560 | 0.516±0.504 | 0.585±0.691 | 0.996±1.177 | 0.697±0.842 |
| PTV | $\triangle D_{95}$ (Gy) ↓ | 8.695±9.358 | 7.896±7.309 | 7.822±9.289 | 3.072±3.389 | 7.475±7.005 | 6.666±6.334 | 2.977±2.220 | 2.623±2.322 |
| | $\triangle V_{95}$ (%) ↓ | 4.285±3.283 | 3.740±2.199 | 6.595±4.172 | 5.534±3.660 | 3.669±2.282 | 2.441±1.84 | 5.032±3.284 | 4.247±2.811 |
| Parotid gland | $\triangle D_{mean}$ (Gy) ↓ | 2.229±1.492 | 1.804±1.497 | 2.312±1.196 | 1.581±1.237 | 1.962±1.361 | 1.538±1.006 | 2.144±1.308 | 1.944±1.118 |
| | $\triangle V_{30}$ (%) ↓ | 8.684±5.510 | 5.898±4.276 | 4.814±3.722 | 4.808±3.878 | 5.736±3.420 | 6.779±5.191 | 5.404±3.640 | 5.112±3.300 |
| Eyes | $\triangle D_{max}$ (Gy) ↓ | 6.106±5.561 | 5.801±5.836 | 5.601±3.816 | 6.442±6.387 | 6.163±4.553 | 5.087±4.245 | 7.926±4.247 | 6.449±4.271 |
| Optic nerve | $\triangle D_{max}$ (Gy) ↓ | 5.753±4.525 | 2.808±2.637 | 2.947±3.031 | 6.164±6.851 | 4.244±3.479 | 2.871±2.976 | 7.879±7.959 | 6.324±6.922 |
| Temporal lobe | $\triangle D_{max}$ (Gy) ↓ | 3.130±1.593 | 4.446±2.241 | 3.739±2.863 | 2.659±1.615 | 3.743±2.528 | 3.361±2.530 | 3.948±2.849 | 4.165±2.531 |
| Brain stem | $\triangle D_{max}$ (Gy) ↓ | 4.030±1.863 | 4.033±1.770 | 4.960±1.865 | 3.842±1.675 | 4.334±1.927 | 3.623±1.810 | 4.118±2.649 | 3.616±2.276 |
| Mandible | $\triangle D_{min}$ (Gy) ↓ | 7.206±5.779 | 7.006±6.568 | 5.466±5.957 | 5.226±5.801 | 8.627±6.366 | 6.639±5.807 | 4.674±4.500 | 4.483±4.798 |
| Spinal cord | $\triangle D_{max}$ (Gy) ↓ | 2.760±1.543 | 1.602±0.914 | 1.421±0.913 | 4.855±0.868 | 2.405±0.888 | 2.558±1.418 | 2.302±1.252 | 1.992±1.293 |

Figure 8: Boxplots of dose prediction results using compared methods on the NPC dataset.

3.6 Comparison with SOTA methods

To demonstrate the superiority of our method, we compared DoseDiff with several SOTA dose prediction methods, including the 2D models DeepLabV3+ [6], MCGAN [5], and DiffDP [45], and the 3D models DoseNet [17], DoseGAN [18], distance-guided deep learning (DGDL) [23], and cascade 3D U-Net (C3D) [58]. DGDL differs from the other methods in its use of TSBDM distance maps as input; the other methods use mask images. We conducted both quantitative and visual comparisons on the breast cancer and NPC datasets.

Comparison on breast cancer dataset: Table 3 and Figure 5 report the common image similarity metrics and detailed dosimetry-related metrics for all methods compared on the breast cancer dataset. As shown in Table 3, our method achieved consistently top performance on most evaluation metrics. Although C3D and DoseNet exhibit excellent predictive performance on the TB, their performance on other ROIs is relatively poor. We speculate that they may be overfitting to the dose distribution in the TB, which is a region with relatively uniform and fixed cumulative dose values (e.g., 63.8 Gy for our dataset). Figure 6 shows predicted dose distribution maps obtained with different methods, as well as the differences between these maps and the ground truth. The dose distribution predicted by our method was closest to the real dose distribution. Specifically, our method learns clear ray paths that reflect the straight-line propagation of the radiation beams, which helps medical physicists to obtain precise parameters, such as the number of radiation fields and the incident angles, from the predicted dose distribution map. Furthermore, Fig. 7 presents the DVHs of the dose distribution maps predicted by the compared methods, allowing for a comprehensive comparison of the dose distributions. The DVH curves of the dose distribution map predicted by our method were the most similar to the ground-truth curves.
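
For readers less familiar with the dosimetric quantities compared here, the following NumPy sketch shows one common way to compute a cumulative DVH and the D_x and V_x metrics from a dose array and an ROI mask. The function names and the percentile-based definitions are assumptions for illustration; the exact definitions used in this paper (for example, whether a V_x threshold is an absolute dose or a fraction of the prescription) may differ.

```python
import numpy as np

def cumulative_dvh(dose, mask, bin_width=0.5):
    # Percentage of ROI voxels receiving at least each dose level (Gy).
    roi = dose[mask > 0]
    levels = np.arange(0.0, roi.max() + bin_width, bin_width)
    volume = np.array([(roi >= d).mean() * 100.0 for d in levels])
    return levels, volume

def d_x(dose, mask, x):
    # D_x: dose (Gy) received by at least x% of the ROI volume.
    return np.percentile(dose[mask > 0], 100.0 - x)

def v_x(dose, mask, threshold_gy):
    # V_x: percentage of the ROI volume receiving at least threshold_gy (assumed absolute dose).
    return (dose[mask > 0] >= threshold_gy).mean() * 100.0

def delta(metric, pred, truth, mask, *args):
    # Absolute difference of a dosimetric metric between prediction and ground truth.
    return abs(metric(pred, mask, *args) - metric(truth, mask, *args))
```

For example, `delta(d_x, pred, truth, ptv_mask, 95)` would give a value of the kind reported as $\triangle D_{95}$, and `delta(v_x, pred, truth, heart_mask, 30.0)` one of the kind reported as $\triangle V_{30}$, where `ptv_mask` and `heart_mask` are hypothetical binary masks.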

Table 5: Quantitative comparison with SOTA methods for dose prediction on the OpenKBP dataset.

| Methods | Dose score ↓ | DVH score ↓ |
|---|---|---|
| DeepLabV3+ [6] | 3.551±1.126 | 2.081±2.169 |
| DoseNet [17] | 2.861±1.107 | 1.647±1.926 |
| DoseGAN [18] | 2.873±1.125 | 1.695±1.973 |
| MCGAN [5] | 3.442±1.081 | 1.796±2.210 |
| DGDL [23] | 3.026±1.085 | 1.748±2.033 |
| C3D [58] | 2.429±1.101 | 1.478±1.931 |
| DiffDP [45] | 3.173±0.966 | 1.602±1.889 |
| DoseDiff (Ours) | 2.382±0.925 | 1.476±1.645 |

Comparison on NPC dataset: Table 4 and Figure 8 present the results of a quantitative comparison between our method and the SOTA approaches on the NPC dataset. Our method achieved competitive results in terms of both image similarity and dosimetry-related metrics. The transverse and coronal plane samples of dose distribution maps predicted by all the compared methods are shown in Fig. 9. Consistent with the performance observed on the breast cancer dataset, the dose distribution maps predicted by previous methods were overly smooth, whereas our method’s predictions retained more realistic ray path characteristics. Furthermore, we could observe from the coronal images that our method maintained a high level of prediction accuracy for regions that were far from the ROIs, owing to the distance information provided by the PSDM. The DVH curves of the dose distribution maps predicted by the compared methods on the NPC dataset are presented in Fig. 10. Compared to other methods, the DVH curves of the target and OARs predicted by our method are closer to the ground truth.

Figure 9: Visual comparison of predicted dose distribution maps obtained with different methods on the NPC dataset.

Figure 10: The dose-volume histograms (DVHs) of the dose distribution maps predicted by the different methods for an NPC patient in the test set. Solid lines represent the ground-truth DVHs, while dotted lines represent the DVHs of the predicted dose distribution maps.

Table 6: Variance of each mean metric for DoseDiff over 50 different Gaussian samplings on the two datasets.

| Dataset | MAE (Gy) | SSIM | PSNR (dB) | Dose score (Gy) | Volume score (%) |
|---|---|---|---|---|---|
| Breast cancer | 0.0011 | 0.0002 | 0.0099 | 0.0058 | 0.0029 |
| NPC | 0.0012 | 0.0001 | 0.0076 | 0.0226 | 0.0153 |

Comparison on OpenKBP dataset: The results of a quantitative comparison between our method and the SOTA methods on the OpenKBP dataset are shown in Table 5. The images in the OpenKBP dataset have larger in-plane spacings and smaller slice thicknesses than those in our in-house datasets, implying that dose prediction on this dataset relies more heavily on 3D information. Consequently, 2D models generally underperform 3D models on the OpenKBP dataset. Although DoseDiff is also a 2D model, the 3D distance information introduced by the PSDM enables it to achieve competitive results. Figure 11 shows the transverse and coronal plane samples of dose distribution maps predicted by all the compared methods. The visualization results show that our method yields the smallest errors between predicted values and ground truth.
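
As a rough guide to the two columns of Table 5, the sketch below computes a dose score as the mean absolute voxel-wise error inside the region that can receive dose, and a simplified DVH score as the mean absolute difference of a few DVH points per ROI. It is an illustration under stated assumptions, not the official OpenKBP evaluation code: the function names and the `rois` dictionary layout are hypothetical, and the near-maximum OAR metric used in the challenge is omitted for brevity.

```python
import numpy as np

def dose_score(pred, ref, possible_dose_mask):
    # Mean absolute dose error (Gy) over voxels inside the possible-dose region.
    m = possible_dose_mask > 0
    return np.abs(pred[m] - ref[m]).mean()

def dvh_points(dose, mask, is_target):
    roi = dose[mask > 0]
    if is_target:
        # D1, D95, D99: dose received by at least 1%, 95%, and 99% of the target volume.
        return np.percentile(roi, [99.0, 5.0, 1.0])
    # OARs: mean dose only in this simplified sketch.
    return np.array([roi.mean()])

def dvh_score(pred, ref, rois):
    # rois maps an ROI name to (mask, is_target); averages over all DVH points.
    diffs = [np.abs(dvh_points(pred, m, t) - dvh_points(ref, m, t)) for m, t in rois.values()]
    return np.concatenate(diffs).mean()
```

Both scores are errors in Gy, so lower values are better, as the arrows in Table 5 indicate.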

Figure 11: Visual comparison of predicted dose distribution maps obtained with different methods on the OpenKBP dataset.

3.7 The influence of non-deterministic prediction

Each dose prediction in DoseDiff begins with a randomly generated Gaussian noise image. Moreover, each step in the DDIM-based reverse process involves Gaussian sampling, as in Eqs. 7 and 8. The randomness introduced by these two factors makes the dose prediction of our model non-deterministic. To investigate the influence of this non-determinism on the prediction results, we conducted 50 dose predictions for both the breast cancer and NPC datasets using different random seeds. The variance of each mean metric for DoseDiff on the two datasets is shown in Table 6; the results indicate only a small degree of fluctuation for each metric.
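
The protocol can be summarized in a few lines of Python: for each seed, run the full evaluation and record the test-set mean of a metric, then take the variance of those means across seeds. In the sketch below, `mean_metric_for_seed` is a hypothetical stand-in that draws synthetic per-patient values instead of calling a trained model, and the use of the sample variance is an assumption, since the estimator is not specified here.

```python
import random
import statistics

def mean_metric_for_seed(seed, num_patients=20):
    # Stand-in for one full evaluation run with a fixed random seed.
    rng = random.Random(seed)
    # Synthetic per-patient MAE values; a real run would sample doses with the model.
    per_patient = [1.0 + rng.gauss(0.0, 0.05) for _ in range(num_patients)]
    return statistics.fmean(per_patient)

def variance_of_mean_metric(seeds):
    # One test-set mean per seed, then the sample variance across seeds.
    means = [mean_metric_for_seed(s) for s in seeds]
    return statistics.variance(means), means

variance, means = variance_of_mean_metric(range(50))
```

Because each entry of `means` already averages over the whole test set, this quantity measures run-to-run fluctuation of the reported metric rather than patient-to-patient variability.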

4 Discussion

In this study, we aimed to develop a model that accurately predicts the dose distribution for individual RT patients, enabling medical physicists to achieve acceptable RT plans with less trial and error. Clinically, both CT images and ROI masks are essential for RT planning, so most previous studies have used deep learning to directly learn mappings from them to dose distribution maps. However, the mask image prevents the deep learning model from effectively extracting distance information between surrounding tissues and targets or OARs. Some studies have replaced masks with distance maps and observed improved dose prediction accuracy as a result [19, 23], because each voxel of the distance map explicitly provides distance information relative to the ROI contour. Here, inspired by the success of diffusion models, we propose a distance-aware conditional diffusion model called DoseDiff for dose prediction. Using distance maps as input, we also explore a fusion strategy involving distance maps and CT images. We incorporate a multi-scale fusion strategy and a fusionFormer module into the proposed MMFNet to achieve effective fusion of complex information.

Diffusion models naturally possess a strong ability to model data distributions, as they were specifically developed for this purpose. Therefore, one advantage of introducing a conditional diffusion model into image-to-image translation tasks is that, in addition to learning the mapping from conditions to output, it can also effectively capture the distribution characteristics of the generated image itself. By contrast, previous methods have only learned the mapping between input and output images. As shown in Figs. 6 and 9, the dose distribution maps generated by our method are more realistic than those obtained with previous methods, in which the ray path characteristics are unclear and distorted. Such distorted dose distribution maps may not exist in the real world and may be of limited use in helping medical physicists to obtain the relevant parameters for treatment planning.

Unlike previous single-step methods, our conditional diffusion model predicts the dose distribution through a multi-step denoising process. This paradigm progressively recovers the prediction from Gaussian noise in a recursive manner, which improves the reliability and realism of the results. As shown in Table 1, the conditional diffusion model outperformed the single-step UNet model on various dose prediction metrics. However, the multi-step paradigm inevitably increases the time required for model training and inference. We explored the relationship between the performance and inference time of DoseDiff for different numbers of generation steps in Section 3.5. Our experimental results demonstrate that DoseDiff, using DDIM sampling, is able to generate an accurate dose distribution map for a new patient in only eight steps.

Proper use of distance maps can improve the performance of dose prediction models. The unit distance in the vanilla SDM (i.e., ISDM) is one voxel, so the magnitude of the values in ISDM is usually comparable to the volume size. Such large values are usually not conducive to the stable training of neural networks. Therefore, Yue et al. [23] proposed a voxel-wise transformation to shrink the values in the SDM. The transformation keeps values inside the ROI large, while gradually decreasing the values outside the ROI so that they eventually approach zero; this makes the numerical distribution trend in the SDM similar to that of the dose distribution. However, this transformation is too drastic, as it causes large numbers of voxels outside the ROI to become zero, rendering them unable to provide effective distance information. Our experiments demonstrate that the ISDM with simple division by a normalization constant achieves better performance than TSBDM, as shown in Table 2. The PSDM can be regarded as a globally normalized version of ISDM with the introduction of spacing information. The transformation in TSBDM is still an instance-wise normalization, whereas our proposed PSDM normalizes the values in all SDMs to a unified distance unit (e.g., a decimeter). Therefore, the model can extract and analyze the information in PSDM more efficiently. In addition, PSDM is computed on the 3D volume, so every voxel in PSDM contains some 3D information. Although DoseDiff is a 2D model, with the support of PSDM, its performance is not inferior to that of SOTA 3D models (i.e., DoseNet, DoseGAN, and DGDL).
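
A minimal sketch of a spacing-aware signed distance map in the spirit of PSDM is given below, using SciPy's Euclidean distance transform with its `sampling` argument so that distances are measured in millimetres rather than voxels. The sign convention (negative inside the ROI, positive outside), the function names, and the division by 100 mm to express distances in decimeters are assumptions for illustration and may not match the authors' implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask, spacing_mm=(1.0, 1.0, 1.0)):
    # Signed Euclidean distance in mm: negative inside the ROI, positive outside (assumed).
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("empty ROI mask")
    # Distance from each outside voxel to the nearest ROI voxel (0 inside the ROI).
    outside = distance_transform_edt(~mask, sampling=spacing_mm)
    # Distance from each ROI voxel to the nearest outside voxel (0 outside the ROI).
    inside = distance_transform_edt(mask, sampling=spacing_mm)
    return outside - inside

def physical_sdm(mask, spacing_mm, unit_mm=100.0):
    # Express every distance in one fixed physical unit (here decimeters) for all patients.
    return signed_distance_map(mask, spacing_mm) / unit_mm
```

Because the transform runs on the full 3D mask with the true voxel spacing, a 2D slice taken from the result still encodes how far each pixel is from the ROI in the through-plane direction, and the same numerical value corresponds to the same physical distance for every patient.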

Existing dose prediction methods usually adopt early fusion for the CT image and mask/SDM, i.e., they concatenate them at the input level. This fusion strategy is easy to implement, but it is often suboptimal. Therefore, we propose MMFNet to extract more valuable interactive information between the CT image and PSDM. MMFNet uses multi-scale feature-level fusion for the CT image and PSDM. Moreover, a fusionFormer module is used to capture long-range dependencies and fuse global information. Long-range perception is important for dose prediction because the dose of a voxel is related to all the tissues through which the ray passes before reaching that voxel. Transformers have better long-range perception capability than CNNs, so we designed a transformer-based fusion module. The experimental results in Table 1 show the effectiveness of our fusion strategy.
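
The difference between the two fusion strategies can be illustrated with a toy NumPy example. The sketch below contrasts input-level concatenation with a single-head self-attention layer applied to tokens built from two feature maps. It is not the fusionFormer module: the function names are hypothetical, the projection weights are random stand-ins for learned parameters, and the real module may differ in every architectural detail.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion(ct, sdm):
    # Input-level fusion: stack the CT image and distance maps as channels.
    return np.concatenate([ct, sdm], axis=0)

def attention_fusion(ct_feat, sdm_feat, rng):
    # Feature-level fusion: one spatial token per position, built from both feature maps.
    c, h, w = ct_feat.shape
    tokens = np.concatenate([ct_feat, sdm_feat], axis=0).reshape(2 * c, h * w).T
    d = tokens.shape[1]
    # Random projections stand in for learned query, key, and value weights.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    # Every spatial position attends to every other position.
    attn = softmax(q @ k.T / np.sqrt(d))
    fused = attn @ v
    return fused.T.reshape(d, h, w), attn
```

The attention matrix has one row per spatial position and one column for every other position, so a fused feature can depend on tissue anywhere along a ray path, whereas a convolution only sees a local neighbourhood at each layer.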

Our work has several limitations. First, the 3D information provided by PSDM is relatively limited, and our method is unable to use 3D structural information from CT images, which would be beneficial to dose prediction. Therefore, our future work will focus on extending DoseDiff to a 3D model. Second, DoseDiff has no advantage in inference speed compared with one-step dose prediction models, because the diffusion model paradigm inevitably incurs higher inference costs. Fortunately, approaches such as DDIM have been proposed to reduce the inference time of diffusion models while preserving their performance. Third, although non-deterministic predictions have limited influence on model performance, as shown in Table 6, they remain an unresolved issue that requires attention. This is a common challenge when applying conditional diffusion models to image-to-image tasks.

5 Conclusion

In this paper, we propose DoseDiff, a distance-aware conditional diffusion model, for predicting dose distribution maps. DoseDiff uses CT images and SDMs as conditions, and dose distribution map prediction is defined as a sequence of denoising steps guided by these conditions. Using PSDM as a model input results in better performance than using a mask, as it effectively provides distance information. Moreover, MMFNet is proposed to effectively extract and fuse features from CT images and SDMs. We used ablation studies to evaluate the contribution of each element of DoseDiff. Comparison with other methods on three datasets showed that DoseDiff achieves SOTA performance in dose prediction.

References

  • [1] B. Sahiner, A. Pezeshk, L. M. Hadjiiski, X. Wang, K. Drukker, K. H. Cha, R. M. Summers, and M. L. Giger, “Deep learning in medical imaging and radiation therapy,” Medical Physics, vol. 46, no. 1, pp. 1–36, 2019.
  • [2] G. Delaney, S. Jacob, C. Featherstone, and M. Barton, “The role of radiotherapy in cancer treatment: estimating optimal utilization from a review of evidence-based clinical guidelines,” Cancer: Interdisciplinary International Journal of the American Cancer Society, vol. 104, no. 6, pp. 1129–1137, 2005.
  • [3] D. L. Craft, T. S. Hong, H. A. Shih, and T. R. Bortfeld, “Improved planning time and plan quality through multicriteria optimization for intensity-modulated radiotherapy,” International Journal of Radiation Oncology* Biology* Physics, vol. 82, no. 1, pp. 83–90, 2012.
  • [4] L. J. Schreiner, “On the quality assurance and verification of modern radiation therapy treatment,” Journal of Medical Physics/Association of Medical Physicists of India, vol. 36, no. 4, p. 189, 2011.
  • [5] B. Zhan, J. Xiao, C. Cao, X. Peng, C. Zu, J. Zhou, and Y. Wang, “Multi-constraint generative adversarial network for dose prediction in radiotherapy,” Medical Image Analysis, vol. 77, p. 102339, 2022.
  • [6] Y. Song, J. Hu, Y. Liu, H. Hu, Y. Huang, S. Bai, and Z. Yi, “Dose prediction using a deep neural network for accelerated planning of rectal cancer radiotherapy,” Radiotherapy and Oncology, vol. 149, pp. 111–116, 2020.
  • [7] S. Shiraishi, J. Tan, L. A. Olsen, and K. L. Moore, “Knowledge-based prediction of plan quality metrics in intracranial stereotactic radiosurgery,” Medical Physics, vol. 42, no. 2, pp. 908–917, 2015.
  • [8] O. Nwankwo, H. Mekdash, D. S. K. Sihono, F. Wenz, and G. Glatting, “Knowledge-based radiation therapy (kbrt) treatment planning versus planning by experts: validation of a kbrt algorithm for prostate cancer treatment planning,” Radiation Oncology, vol. 10, no. 1, pp. 1–5, 2015.
  • [9] B. Wu, F. Ricchetti, G. Sanguineti, M. Kazhdan, P. Simari, R. Jacques, R. Taylor, and T. McNutt, “Data-driven approach to generating achievable dose–volume histogram objectives in intensity-modulated radiotherapy planning,” International Journal of Radiation Oncology* Biology* Physics, vol. 79, no. 4, pp. 1241–1247, 2011.
  • [10] T. Song, D. Staub, M. Chen, W. Lu, Z. Tian, X. Jia, Y. Li, L. Zhou, S. B. Jiang, and X. Gu, “Patient-specific dosimetric endpoints based treatment plan quality control in radiotherapy,” Physics in Medicine & Biology, vol. 60, no. 21, p. 8213, 2015.
  • [11] R. R. Deshpande, J. DeMarco, J. W. Sayre, and B. J. Liu, “Knowledge-driven decision support for assessing dose distributions in radiation therapy of head and neck cancer,” International Journal of Computer Assisted Radiology and Surgery, vol. 11, pp. 2071–2083, 2016.
  • [12] G. Valdes, C. B. Simone II, J. Chen, A. Lin, S. S. Yom, A. J. Pattison, C. M. Carpenter, and T. D. Solberg, “Clinical decision support of radiotherapy treatment planning: A data-driven machine learning strategy for patient-specific dosimetric decision making,” Radiotherapy and Oncology, vol. 125, no. 3, pp. 392–397, 2017.
  • [13] C. McIntosh and T. G. Purdie, “Contextual atlas regression forests: multiple-atlas-based automated dose prediction in radiation therapy,” IEEE Transactions on Medical Imaging, vol. 35, no. 4, pp. 1000–1012, 2015.
  • [14] O. Morin, M. Vallières, A. Jochems, H. C. Woodruff, G. Valdes, S. E. Braunstein, J. E. Wildberger, J. E. Villanueva-Meyer, V. Kearney, S. S. Yom et al., “A deep look into the future of quantitative imaging in oncology: a statement of working principles and proposal for change,” International Journal of Radiation Oncology* Biology* Physics, vol. 102, no. 4, pp. 1074–1082, 2018.
  • [15] P. Dong and L. Xing, “Deep dosenet: a deep neural network for accurate dosimetric transformation between different spatial resolutions and/or different dose calculation algorithms for precision radiation therapy,” Physics in Medicine & Biology, vol. 65, no. 3, p. 035010, 2020.
  • [16] J. Fan, J. Wang, Z. Chen, C. Hu, Z. Zhang, and W. Hu, “Automatic treatment planning based on three-dimensional dose distribution predicted from deep learning technique,” Medical Physics, vol. 46, no. 1, pp. 370–381, 2019.
  • [17] V. Kearney, J. W. Chan, S. Haaf, M. Descovich, and T. D. Solberg, “Dosenet: a volumetric dose prediction algorithm using 3d fully-convolutional neural networks,” Physics in Medicine & Biology, vol. 63, no. 23, p. 235022, 2018.
  • [18] V. Kearney, J. W. Chan, T. Wang, A. Perry, M. Descovich, O. Morin, S. S. Yom, and T. D. Solberg, “Dosegan: a generative adversarial network for synthetic dose prediction using attention-gated discrimination and generation,” Scientific Reports, vol. 10, no. 1, p. 11073, 2020.
  • [19] C. Kontaxis, G. Bol, J. Lagendijk, and B. Raaymakers, “Deepdose: towards a fast dose calculation engine for radiation therapy using deep learning,” Physics in Medicine & Biology, vol. 65, no. 7, p. 075013, 2020.
  • [20] D. Nguyen, X. Jia, D. Sher, M.-H. Lin, Z. Iqbal, H. Liu, and S. Jiang, “3d radiotherapy dose prediction on head and neck cancer patients with a hierarchically densely connected u-net deep learning architecture,” Physics in Medicine & Biology, vol. 64, no. 6, p. 065020, 2019.
  • [21] D. Nguyen, T. Long, X. Jia, W. Lu, X. Gu, Z. Iqbal, and S. Jiang, “A feasibility study for predicting optimal radiation therapy dose distributions of prostate cancer patients from patient anatomy using deep learning,” Scientific Reports, vol. 9, no. 1, p. 1076, 2019.
  • [22] I. Sumida, T. Magome, I. J. Das, H. Yamaguchi, H. Kizaki, K. Aboshi, H. Yamaguchi, Y. Seo, F. Isohashi, and K. Ogawa, “A convolution neural network for higher resolution dose prediction in prostate volumetric modulated arc therapy,” Physica Medica, vol. 72, pp. 88–95, 2020.
  • [23] M. Yue, X. Xue, Z. Wang, R. L. Lambo, W. Zhao, Y. Xie, J. Cai, and W. Qin, “Dose prediction via distance-guided deep learning: Initial development for nasopharyngeal carcinoma radiotherapy,” Radiotherapy and Oncology, vol. 170, pp. 198–204, 2022.
  • [24] J. Hu, Y. Song, Q. Wang, S. Bai, and Z. Yi, “Incorporating historical sub-optimal deep neural networks for dose prediction in radiotherapy,” Medical Image Analysis, vol. 67, p. 101886, 2021.
  • [25] L. Yuan, Y. Ge, W. R. Lee, F. F. Yin, J. P. Kirkpatrick, and Q. J. Wu, “Quantitative analysis of the factors which affect the interpatient organ-at-risk dose sparing variation in imrt plans,” Medical Physics, vol. 39, no. 11, pp. 6868–6878, 2012.
  • [26] Z. Jiao, X. Peng, Y. Wang, J. Xiao, D. Nie, X. Wu, X. Wang, J. Zhou, and D. Shen, “Transdose: Transformer-based radiotherapy dose prediction from ct images guided by super-pixel-level gcn classification,” Medical Image Analysis, vol. 89, p. 102902, 2023.
  • [27] F. Xiao, J. Cai, X. Zhou, L. Zhou, T. Song, and Y. Li, “Transdose: a transformer-based unet model for fast and accurate dose calculation for mr-linacs,” Physics in Medicine & Biology, vol. 67, no. 12, p. 125013, 2022.
  • [28] J. Zhang, S. Liu, H. Yan, T. Li, R. Mao, and J. Liu, “Predicting voxel-level dose distributions for esophageal radiotherapy using densely connected network with dilated convolutions,” Physics in Medicine & Biology, vol. 65, no. 20, p. 205013, 2020.
  • [29] T. Zhou, S. Ruan, and S. Canu, “A review: Deep learning for medical image segmentation using multi-modality fusion,” Array, vol. 3, p. 100004, 2019.
  • [30] T. Zhou, H. Fu, G. Chen, J. Shen, and L. Shao, “Hi-net: hybrid-fusion network for multi-modal mr image synthesis,” IEEE Transactions on Medical Imaging, vol. 39, no. 9, pp. 2772–2781, 2020.
  • [31] Z. Meng, Y. Zhu, W. Pang, J. Tian, F. Nie, and K. Wang, “Msmfn: An ultrasound based multi-step modality fusion network for identifying the histologic subtypes of metastatic cervical lymphadenopathy,” IEEE Transactions on Medical Imaging, vol. 42, no. 4, pp. 996–1008, 2023.
  • [32] B. Wang, L. Teng, L. Mei, Z. Cui, X. Xu, Q. Feng, and D. Shen, “Deep learning-based head and neck radiotherapy planning dose prediction via beam-wise dose decomposition,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 575–584.
  • [33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [34] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
  • [35] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
  • [36] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [37] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021.
  • [38] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
  • [39] S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 428–22 437.
  • [40] S. Gao, X. Liu, B. Zeng, S. Xu, Y. Li, X. Luo, J. Liu, X. Zhen, and B. Zhang, “Implicit diffusion models for continuous super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 021–10 030.
  • [41] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713–4726, 2022.
  • [42] Q. Lyu and G. Wang, “Conversion between ct and mri images using diffusion and score-matching models,” arXiv preprint arXiv:2209.12104, 2022.
  • [43] L. Cai, H. Gao, and S. Ji, “Multi-stage variational auto-encoders for coarse-to-fine image generation,” in Proceedings of the 2019 SIAM International Conference on Data Mining.   SIAM, 2019, pp. 630–638.
  • [44] Y. Ma, X. Liu, S. Bai, L. Wang, D. He, and A. Liu, “Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation.” in IJCAI, 2019, pp. 3123–3129.
  • [45] Z. Feng, L. Wen, P. Wang, B. Yan, X. Wu, J. Zhou, and Y. Wang, “Diffdp: Radiotherapy dose prediction via a diffusion model,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2023, pp. 191–201.
  • [46] M. A. Wajid and A. Zafar, “Multimodal fusion: A review, taxonomy, open challenges, research roadmap and future directions,” Neutrosophic Sets and Systems, vol. 45, no. 1, p. 8, 2021.
  • [47] H. Hermessi, O. Mourali, and E. Zagrouba, “Multimodal medical image fusion review: Theoretical background and recent advances,” Signal Processing, vol. 183, p. 108036, 2021.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [49] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [50] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8162–8171.
  • [51] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  • [52] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
  • [53] S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” in International Conference on Machine Learning.   PMLR, 2021, pp. 2286–2296.
  • [54] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, “Incorporating convolution designs into visual transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 579–588.
  • [55] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [56] A. Babier, B. Zhang, R. Mahmood, K. L. Moore, T. G. Purdie, A. L. McNiven, and T. C. Chan, “Openkbp: the open-access knowledge-based planning grand challenge and dataset,” Medical Physics, vol. 48, no. 9, pp. 5549–5561, 2021.
  • [57] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [58] S. Liu, J. Zhang, T. Li, H. Yan, and J. Liu, “A cascade 3d u-net for dose prediction in radiotherapy,” Medical Physics, vol. 48, no. 9, pp. 5574–5582, 2021.