DoseDiff: Distance-aware Diffusion Model for Dose Prediction in Radiotherapy
Abstract
Treatment planning, which is a critical component of the radiotherapy workflow, is typically carried out by a medical physicist in a time-consuming trial-and-error manner. Previous studies have proposed knowledge-based or deep-learning-based methods for predicting dose distribution maps to assist medical physicists in improving the efficiency of treatment planning. However, these dose prediction methods usually fail to effectively utilize distance information between surrounding tissues and targets or organs-at-risk (OARs). Moreover, they are poor at maintaining the distribution characteristics of ray paths in the predicted dose distribution maps, resulting in a loss of valuable information. In this paper, we propose a distance-aware diffusion model (DoseDiff) for precise prediction of dose distribution. We define dose prediction as a sequence of denoising steps, wherein the predicted dose distribution map is generated with the conditions of the computed tomography (CT) image and signed distance maps (SDMs). The SDMs are obtained by distance transformation from the masks of targets or OARs, which provide the distance from each pixel in the image to the outline of the targets or OARs. We further propose a multi-encoder and multi-scale fusion network (MMFNet) that incorporates multi-scale and transformer-based fusion modules to enhance information fusion between the CT image and SDMs at the feature level. We evaluate our model on two in-house datasets and a public dataset, respectively. The results demonstrate that our DoseDiff method outperforms state-of-the-art dose prediction methods in terms of both quantitative performance and visual quality.
Deep learning, Diffusion model, Dose prediction, Radiotherapy, Signed distance map.
1 Introduction
Radiation therapy (RT) is an essential cancer treatment modality; approximately 50% of cancer patients receive RT during their course of illness and it contributes to around 40% of curative treatment cases [1, 2]. Treatment planning plays an important part in the current RT workflow as it is used to determine the optimal radiation dose, technique, and schedule to target cancer while minimizing exposure to healthy tissue, with the aim of maximizing effectiveness and minimizing side effects. Clinically, a medical physicist typically spends hours adjusting a set of hyper-parameters and weightings in a trial-and-error manner to ensure that the RT plan can achieve the desired treatment effect, which is time-consuming and labor-intensive [3, 4]. If an appropriate dose distribution map can be obtained in advance, the medical physicist will be able to use it as a reference, enabling the RT planning to be completed with fewer hyper-parameter adjustments [5, 6]. Therefore, dose prediction is of great value in enhancing efficiency and streamlining the workflow of RT.
Various approaches have been proposed for dose prediction in RT. Knowledge-based planning (KBP) provides a traditional paradigm for dose prediction that leverages planning information from historical patients to predict dosimetry for a new patient [7, 8]. Initially, some handcrafted features related to dosimetry are directly used to match historical patients with new patients, e.g., the overlap volume histogram [9], distance-to-target histogram [10], and planning volume shapes [11]. With the development of machine learning technologies, support vector regression, random forests, and gradient boosting have been used to find more effective features [12, 13, 14]. However, these KBP-based methods typically concentrate on predicting dosimetric endpoints or dose-volume histograms (DVHs), which are dosimetric-related statistical indicators and lack spatial information [5]. Furthermore, the accuracy of methods that rely on traditional feature extraction often falls short of expectations.
Deep learning is currently a popular approach for dose prediction [5, 6, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. With its powerful feature extraction and analysis capabilities, a deep learning model can learn mapping relationships from computed tomography (CT) images and region-of-interest (ROI) masks, i.e., targets and organs-at-risk (OARs), to dose distribution maps. Nguyen et al. [21] and Fan et al. [16] used a two-dimensional (2D) convolutional neural network (CNN) to predict dose distribution slice by slice, with the ultimate objective of achieving three-dimensional (3D) dose prediction. To make full use of spatial information, Kearney et al. [17] proposed DoseNet, a modified 3D UNet, for volumetric dose prediction. Recent research has focused on constructing more effective loss functions to improve the accuracy of dose prediction, e.g., adversarial loss [18] and ROI-weighted loss [5].
Most deep-learning-based dose prediction methods use CT images and ROI masks directly as model inputs; however, they rarely exploit distance information between surrounding tissues and targets or OARs. In RT planning, distance information is important as the relative positions of ROIs determine their mutual influence, e.g., the dose levels of tissues close to targets are relatively large, whereas those of the tissues near OARs are relatively small [23, 25]. Although CNNs are capable of implicitly learning distance relationships among ROIs from masks, most deep-learning-based methods use a slice-based or patch-based training strategy owing to memory constraints, which means they are unable to leverage useful information from mask images when the slice or patch contains little or no ROI. While some methods incorporate attention modules into the networks to perceive long-range dependency information, the attention perception of these modules remains confined to slices or blocks [26, 27]. To address this challenge, certain studies have introduced distance information into neural networks through distance maps. For example, Kontaxis et al. [19] fed the distance map from the central beamline to the model to improve the accuracy of dose prediction. However, their method requires knowledge of the parameters of beamlets, which is often not practical in a clinical setting. Zhang et al. [28] introduced a distance image of planning target volumes (PTV) for esophageal radiotherapy dose prediction, but ignored OARs and other targets. Yue et al. [23] utilized ROI masks for distance map calculation and proposed a normalization technique to rescale the numerical range of the distance map for network training, but their distance map only considered pixel distance in image space, rather than real-world physical distance, and had a limited dynamic range.
In addition, these methods simply concatenated the distance maps and CT images as network inputs, restricting the potential performance of their model. Various studies have shown that feature-level fusion is more capable of capturing complex relationships among different modalities compared with input-level fusion [29, 30, 31].
Ensuring the authenticity and feasibility of the predicted dose distribution map is also a challenge, e.g., it involves ensuring that the radiation paths in the map adhere to straight-line propagation and that the predicted cumulative dose distribution can be achieved within an acceptable number of treatment fields. However, most previous deep-learning-based dose prediction methods only learn the mapping from input to output images and so cannot fully capture the prior data distribution of real dose distribution maps. As a result, the predicted dose distribution maps appear excessively smooth and distort ray path characteristics. To address this problem, Wang et al. [32] proposed a beam-wise dose network to refine the predicted dose within beam masks, but these artificially predefined beam masks derived from the PTV masks are quite coarse. Some studies have adopted frameworks based on generative adversarial networks (GANs) to enhance the realism of the predicted dose distribution map [5, 8], but GANs can be challenging to train and often drop modes in the output distribution [33, 34, 35]. Recently, diffusion models [36, 37] have attracted attention as powerful generative models that are capable of modeling complex image distributions. Diffusion models have been shown to provide superior image sampling quality and more stable training compared with GANs [38]. Conditional diffusion models introduce conditional information to guide the model in generating specific images. Researchers have applied conditional diffusion models to various image-to-image translation tasks, e.g., image inpainting [39], colorization [35], super-resolution [40, 41], and medical image synthesis [42], and achieved state-of-the-art (SOTA) results. Unlike previous methods that learn pixel mappings between input and output, conditional diffusion models can learn the data distribution from training images and sample the images that best match the given conditions based on the distribution.
A conditional diffusion model also partitions image-to-image generation into a sequence of denoising steps, which typically recover the general outline initially and then produce details. This can be considered to be a recursive image generation method and has been proven to be effective by previous studies [43, 44]. Despite these advantages of diffusion models, further exploration is required to fully realize their potential in the task of predicting dose distribution in RT. To the best of our knowledge, only Feng et al. [45] proposed a diffusion-based dose prediction (DiffDP) model for predicting the radiotherapy dose distribution of cancer patients. However, DiffDP is limited to utilizing 2D in-plane information for slice-based dose prediction.
In this paper, we propose a novel distance-aware conditional diffusion model for dose prediction, named DoseDiff, which takes CT images and signed distance maps (SDMs) as conditions to accurately generate dose distribution maps. Based on the mechanism of the conditional diffusion model, we define dose prediction using a sequence of denoising steps, wherein the predicted dose distribution maps are generated from Gaussian noise images with the guidance of conditions. The SDMs are obtained by performing a distance transform on the masks and indicate the distance of each voxel in the image from the contours of the ROIs in 3D space. Unlike Yue et al. [23], who simply combined CT images and distance maps through channel concatenation at the input, we propose a multi-encoder and multi-scale fusion network (MMFNet) to enhance information fusion at the feature level. While early fusion methods, such as channel concatenation, are straightforward and efficient, their performance is constrained by the presence of considerable noise and redundant information in the low-level features. Extensive experiments have demonstrated that high-level feature fusion is valuable for precise prediction in complex tasks [46, 47]. We design independent encoders for CT images and SDMs and perform information fusion on feature maps of multiple scales, which integrates both low-level and high-level features. A fusion module based on the self-attention mechanism [48, 49] is also proposed for inclusion in MMFNet to further fuse global information.
2 Method
2.1 Background
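Conditional diffusion models represent a class of conditional generative methods aimed at transforming a Gaussian distribution to an empirical data distribution [37, 36, 50]. This usually involves two processes in opposite directions: the forward (diffusion) process and the reverse process. We denote by $c$ and $x_0$ the condition and target domain images, respectively. The forward process aims to collapse the target image distribution to a standard Gaussian distribution by gradually adding Gaussian noise of varying scales to the target image:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{1}$$
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right) \tag{2}$$
where $t$ is the timestep and $T$ denotes the total number of diffusion steps. The scale of the added Gaussian noise is determined by a set of hyper-parameters, $\beta_{1:T}$, that change depending on $t$, with $\beta_t \in (0, 1)$. Using the theoretical assumption of the diffusion model, when $T$ is sufficiently large, $x_T$ can be approximated as a standard Gaussian distribution [36, 37].
The target image is generated by fitting other distributions that are parameterized by $\theta$ in the reverse process:
$$p_\theta(x_{0:T} \mid c) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c) \tag{3}$$
where $p_\theta(x_{t-1} \mid x_t, c)$ can be parameterized by a neural network $\epsilon_\theta(x_t, c, t)$. The objective of training the neural network is to minimize the variational upper bound of the negative log-likelihood:
$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, c, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta(x_t, c, t)\right\|_2^2\right] \tag{4}$$
where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and $x_t$ is the noisy image obtained from $x_0$ by the forward process.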
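The forward process and training objective described above can be illustrated with a minimal pure-Python sketch. It assumes the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$; the schedule endpoints (`1e-4`, `0.02`), the function names, and the toy four-value "image" are illustrative assumptions rather than values taken from this paper.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoints are assumed, not from the paper)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in closed form; t is 1-indexed. Returns (x_t, eps)."""
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    return xt, eps


def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = [0.5, -0.25, 1.0, 0.0]          # toy "dose map" flattened to a list
xt, eps = q_sample(x0, 500, alpha_bars, rng)
loss_if_perfect = noise_prediction_loss(eps, eps)  # a perfect predictor has zero loss
```

With 1000 steps, `alpha_bars[-1]` is close to zero, which is the sense in which the final noisy image approaches a standard Gaussian sample.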
2.2 DoseDiff
In contrast to previous methods, the conditional diffusion model can model the data distribution, which enables it to perceive the dose distribution characteristics in the real dose distribution map. Furthermore, the conditional diffusion model exhibits more stable training and yields higher image quality compared with GAN-based methods. Therefore, we propose DoseDiff for dose prediction, as shown in Fig. 1. DoseDiff is essentially a conditional diffusion model conditioned by CT images and SDMs, which considers dose prediction as a sequence of denoising steps, i.e., the reverse process. To adapt the two conditional terms, we redefine the neural network as $\epsilon_\theta(x_t, c_{ct}, c_{sdm}, t)$ and accordingly modify the reverse process and objective function outlined in Eqs. 3 and 4:
$$p_\theta(x_{0:T} \mid c_{ct}, c_{sdm}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c_{ct}, c_{sdm}) \tag{5}$$
$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, c_{ct}, c_{sdm}, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta(x_t, c_{ct}, c_{sdm}, t)\right\|_2^2\right] \tag{6}$$
where $c_{ct}$ and $c_{sdm}$ are the CT image and SDM, respectively, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
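The conditional diffusion model assumes that the reverse process is a Markov chain, and the number of diffusion steps $T$ should be sufficiently large. Therefore, the inference of DoseDiff consumes significant computational time. To reduce the inference time, we adopt the accelerated generation technology of the denoising diffusion implicit model (DDIM) [51]. Based on the non-Markovian hypothesis, the DDIM allows a reduced number of sampling steps (fewer than $T$) in the reverse process. Furthermore, the implementation of DDIM requires modifications only to the sampling technique of the reverse process, with no impact on the training and forward processes. Let $\tau$ represent a sub-sequence of $[1, \ldots, T]$ of length $S$ with $\tau_S = T$; then, the accelerated reverse process of DoseDiff can be expressed as:
$$x_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}}\,\epsilon_\theta(x_{\tau_i}, c_{ct}, c_{sdm}, \tau_i) \tag{7}$$
$$\hat{x}_0 = \frac{x_{\tau_i} - \sqrt{1-\bar{\alpha}_{\tau_i}}\,\epsilon_\theta(x_{\tau_i}, c_{ct}, c_{sdm}, \tau_i)}{\sqrt{\bar{\alpha}_{\tau_i}}} \tag{8}$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.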
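The accelerated reverse process can be sketched as follows. This is a minimal illustration that assumes the deterministic DDIM update (no added noise), an evenly spaced sub-sequence, and a stand-in `eps_model` callable in place of the trained network; the schedule endpoints and all names are hypothetical. To make the behaviour checkable, the demo uses an "oracle" predictor that returns the exact noise, in which case the sampler recovers the clean signal.

```python
import math


def make_subsequence(T, S):
    """Evenly spaced timesteps tau_1 < ... < tau_S with tau_S = T (spacing is an assumption)."""
    return [round((i + 1) * T / S) for i in range(S)]


def ddim_step(x_t, eps_pred, a_t, a_prev):
    """One deterministic DDIM update from cumulative alpha a_t to a_prev."""
    x0_hat = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
              for x, e in zip(x_t, eps_pred)]
    return [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
            for x0, e in zip(x0_hat, eps_pred)]


def ddim_sample(x_T, eps_model, alpha_bars, taus):
    """Walk the sub-sequence backwards; the cumulative alpha before tau_1 is taken as 1."""
    x = list(x_T)
    for i in range(len(taus) - 1, -1, -1):
        a_t = alpha_bars[taus[i] - 1]
        a_prev = alpha_bars[taus[i - 1] - 1] if i > 0 else 1.0
        x = ddim_step(x, eps_model(x, taus[i]), a_t, a_prev)
    return x


T, S = 1000, 8
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)
taus = make_subsequence(T, S)

# Oracle demo: noise a known signal, then denoise it with the exact noise.
x0_true = [0.2, -0.7, 0.9]
eps_true = [0.3, -1.1, 0.5]
a_T = alpha_bars[T - 1]
x_T = [math.sqrt(a_T) * v + math.sqrt(1.0 - a_T) * e
       for v, e in zip(x0_true, eps_true)]
x0_rec = ddim_sample(x_T, lambda x, t: eps_true, alpha_bars, taus)
```

Only the sampling loop changes relative to full-length sampling; the schedule and the trained noise predictor are reused unchanged, which matches the statement that DDIM does not affect training.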
2.3 Signed distance map
Most previous dose prediction methods have directly used masks to provide location and distance information related to ROIs. A limitation of using masks in this way is that current slice-based and patch-based CNN training strategies cannot guarantee that all foreground regions in the mask images are completely included in the sampled images. Moreover, although CNNs are capable of implicitly learning distance relationship among ROIs from mask images, their convolution kernels can only extract local features, and thus they may struggle to perceive long-range distance relationships. On the contrary, a distance map provides the distance from each voxel to the contour of the ROI in the 3D space of the image. The distance image value itself is capable of conveying distance information, regardless of incomplete sampling of the image or the limited receptive field of CNNs. The SDM uses sign to identify whether a voxel is inside or outside the ROI.
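Let $m_i$ denote the mask image, where $i \in \{1, \ldots, n\}$ and $n$ is the total number of targets and OARs. We denote by $\Omega_{in}$, $\Omega_{out}$, and $\partial\Omega$ the inside, outside, and boundary of the ROI, respectively. The SDM in image space (ISDM) is defined as:
$$\mathrm{ISDM}(u) = \begin{cases} -\min\limits_{v \in \partial\Omega} d(u, v), & u \in \Omega_{in} \\ 0, & u \in \partial\Omega \\ +\min\limits_{v \in \partial\Omega} d(u, v), & u \in \Omega_{out} \end{cases} \tag{9}$$
where $d(u, v) = \sqrt{(x_u - x_v)^2 + (y_u - y_v)^2 + (z_u - z_v)^2}$ is the Euclidean distance between voxels $u$ and $v$ in image space, where $(x_u, y_u, z_u)$ and $(x_v, y_v, z_v)$ are the voxel coordinates of $u$ and $v$. However, owing to inconsistent voxel resolution, ISDM values are not comparable between different images and have no physical meaning. Therefore, we consider spacing as a factor when calculating the SDM, transforming the ISDM from image space to physical space (PSDM): $d_{phy}(u, v) = \sqrt{\left(s_x (x_u - x_v)\right)^2 + \left(s_y (y_u - y_v)\right)^2 + \left(s_z (z_u - z_v)\right)^2}$, where $(s_x, s_y, s_z)$ is the voxel spacing of the image. Finally, the input SDM of the neural network is a collection of the SDMs of all ROIs:
$$c_{sdm} = \{\mathrm{PSDM}_1, \mathrm{PSDM}_2, \ldots, \mathrm{PSDM}_n\} \tag{10}$$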
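The difference between image-space and physical-space distances can be seen in a brute-force sketch. It is written for clarity rather than speed and rests on several assumptions that are not stated in the text: boundary voxels are taken to be foreground voxels with at least one 6-connected background neighbour, distances are negative inside the ROI and positive outside, and the function names are hypothetical.

```python
import math

NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def boundary_voxels(mask):
    """Foreground voxels that touch background (or the volume edge) in 6-connectivity."""
    D, H, W = len(mask), len(mask[0]), len(mask[0][0])
    border = []
    for z in range(D):
        for y in range(H):
            for x in range(W):
                if not mask[z][y][x]:
                    continue
                for dz, dy, dx in NEIGHBOURS:
                    nz, ny, nx = z + dz, y + dy, x + dx
                    in_volume = 0 <= nz < D and 0 <= ny < H and 0 <= nx < W
                    if not in_volume or not mask[nz][ny][nx]:
                        border.append((z, y, x))
                        break
    return border


def signed_distance_map(mask, spacing=(1.0, 1.0, 1.0)):
    """Signed distance to the ROI boundary (assumed sign: negative inside, positive outside).

    spacing = (sz, sy, sx). Unit spacing gives image-space distances;
    the true voxel spacing gives physical distances in the same unit.
    """
    border = boundary_voxels(mask)
    if not border:
        raise ValueError("mask contains no foreground voxels")
    sz, sy, sx = spacing
    D, H, W = len(mask), len(mask[0]), len(mask[0][0])
    sdm = [[[0.0] * W for _ in range(H)] for _ in range(D)]
    for z in range(D):
        for y in range(H):
            for x in range(W):
                d = min(math.sqrt((sz * (z - bz)) ** 2
                                  + (sy * (y - by)) ** 2
                                  + (sx * (x - bx)) ** 2)
                        for bz, by, bx in border)
                sdm[z][y][x] = -d if (mask[z][y][x] and d > 0) else d
    return sdm


# One foreground voxel in a 3-slice volume; spacing in mm as (slice, row, column).
mask = [[[0, 0, 0]], [[0, 1, 0]], [[0, 0, 0]]]
isdm = signed_distance_map(mask)
psdm = signed_distance_map(mask, spacing=(5.0, 0.95, 0.95))
```

In image space the neighbouring slice and the neighbouring in-plane voxel are both one unit away from the ROI, whereas with the assumed 5.0 mm slice thickness and 0.95 mm in-plane spacing they differ by roughly a factor of five. Dividing millimetre values by 100 would express them in decimeters.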
2.4 MMFNet
The conditions of DoseDiff contain multimodal images, i.e., the CT image and SDM. To effectively extract and fuse their features, we propose MMFNet for $\epsilon_\theta$, as shown in Fig. 2. MMFNet adopts three encoders to extract the features of the noisy dose distribution map $x_t$, the CT image $c_{ct}$, and the SDM $c_{sdm}$, respectively. The encoder part accomplishes multi-scale information fusion by summing feature maps of the same resolution from the three inputs at every level. Our encoders and decoder are implemented in the same way as those of [38], with four down-sampling or up-sampling operations. Each level of the encoders and decoder contains two residual blocks (Res-blocks), each of which is composed of one group normalization (GN) layer [52], two sigmoid linear units, two convolution layers, one adaptive GN (AdaGN) layer [38], and one residual connection. Similarly, the timestep $t$ is embedded in each Res-block through AdaGN, which is defined as [38]:
$$\mathrm{AdaGN}(h, t) = t_s \cdot \mathrm{GN}(h) + t_b \tag{11}$$
where $h$ is the input feature map and $(t_s, t_b)$ is obtained from a linear projection of the timestep $t$.
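The AdaGN operation can be sketched in a few lines. This is an illustrative pure-Python version, assuming one scale and one shift value per channel and feature maps stored as flat lists; the helper names and the toy values are hypothetical, and a real model would use a tensor library instead.

```python
import math


def group_norm(h, num_groups, eps=1e-5):
    """Normalise each group of channels to zero mean and (nearly) unit variance.

    h is a list of channels, each channel a flat list of values.
    """
    channels = len(h)
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = h[g * size:(g + 1) * size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out


def ada_gn(h, t_scale, t_shift, num_groups):
    """Scale and shift the group-normalised features with timestep-derived values."""
    normed = group_norm(h, num_groups)
    return [[s * v + b for v in channel]
            for channel, s, b in zip(normed, t_scale, t_shift)]


# Four channels with two values each, split into two groups.
h = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
out = ada_gn(h, t_scale=[2.0] * 4, t_shift=[1.0] * 4, num_groups=2)
```

Because the normalised features have zero mean within each group, a uniform shift of 1.0 becomes the new group mean, which shows how the timestep can modulate the feature statistics of every Res-block.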
We propose a transformer-based fusion module, fusionFormer, for further global information fusion. Unlike the convolution layer used to extract local features, the transformer [48, 49] uses an attention mechanism to capture global context information. A currently popular approach to enhance model performance is to combine the strengths of CNNs and transformers for local and global feature extraction [53, 54]. To minimize memory cost, fusionFormer is added to the path at the lowest resolution level. In fusionFormer, the feature maps of the noisy dose distribution map $x_t$, the CT image $c_{ct}$, and the SDM $c_{sdm}$ are partitioned into non-overlapping patches, i.e., the channels remain unchanged while the maps are evenly sliced along the length and height dimensions. The patch size is associated with the size of the original input image. The three feature maps are then transformed into the three fundamental components of attention using an embedding layer consisting of linear and normalization layers: the query ($Q$), key ($K$), and value ($V$). As both $c_{ct}$ and $c_{sdm}$ are conditions, we use their feature maps as the query and key (or vice versa) to calculate the attention matrix. The attention matrix is then used to weight the feature map of the noisy dose distribution map, enabling conditional guidance. The attention block in the transformer can be expressed as follows:
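$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{W_Q(Q)\,W_K(K)^{\top}}{\sqrt{d}}\right) W_V(V) \tag{12}$$
where $W_Q$, $W_K$, and $W_V$ are linear operations and $d$ is the feature dimension. Then, a feed-forward block is used for the nonlinear transformation of features:
$$\mathrm{FFN}(z) = \mathrm{MLP}(\mathrm{LN}(z)) + z \tag{13}$$
where $z$ is the output of the attention block, $\mathrm{MLP}$ is a multi-layer perceptron consisting of two linear layers and a nonlinear activation function, and $\mathrm{LN}$ is layer normalization. Last, a linear layer is used to adjust the dimensionality of the feature vector to the original size; it is then reshaped to be the same shape as the input feature maps.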
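The conditional weighting performed by the attention block can be illustrated with a small sketch. It assumes scaled dot-product attention in which CT tokens act as queries, SDM tokens as keys, and noisy-dose tokens as values; the learned linear projections, normalization, and feed-forward block are omitted, and all names and token values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def fusion_attention(ct_tokens, sdm_tokens, dose_tokens):
    """Attention weights come from the two conditions and are applied to the dose tokens."""
    d = len(ct_tokens[0])
    scores = matmul(ct_tokens, transpose(sdm_tokens))
    scores = [[v / math.sqrt(d) for v in row] for row in scores]
    weights = softmax_rows(scores)
    return matmul(weights, dose_tokens), weights


# Three patches per feature map, two features per patch (toy values).
ct_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sdm_tokens = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]
dose_tokens = [[0.2, 0.4], [0.2, 0.4], [0.2, 0.4]]
fused, weights = fusion_attention(ct_tokens, sdm_tokens, dose_tokens)
```

Each output token is a convex combination of the dose tokens, so the conditions decide where information is gathered from while the content itself comes from the noisy dose features.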
2.5 Implementation details
Our network implementation was based on PyTorch, and the code was run on a server with two RTX 2080Ti GPUs. Following [36], we set the total number of diffusion steps ($T$) to 1000 and the forward process variances to constants decreasing linearly from to . To ensure the input value of the model fell within a suitable range, we set the distance unit for the PSDM to decimeters. We used the AdamW [55] optimizer with a step-decay learning rate to train our model for one million iterations. The initial learning rate and batch size were 0.0001 and 8, respectively. Some simple online data augmentation operators, including random flipping, rotation, and zooming, were used on the training set to improve the generalization capacity of the model. To reduce the inference time, we set the reduced number of generation steps in DDIM to 8. The full implementation is available at https://github.com/whisney/DoseDiff.
3 Experiments
3.1 Datasets and Preprocessing
3.1.1 In-house Datasets
The breast cancer and nasopharyngeal cancer (NPC) datasets were constructed by collecting data for patients receiving RT at Guangdong Provincial Hospital of Traditional Chinese Medicine, China, from 2016 to 2020. The patients in the breast cancer dataset underwent postoperative RT following breast-conserving surgery, whereas the patients in the NPC dataset received RT for primary lesions. The CT volumes were obtained using a Siemens Sensation Open (Siemens Healthcare, Forchheim, Germany) scanner. RT plans were received from a treatment-planning system (Philips, Pinnacle3, Netherlands). All the targets and OARs were delineated by experienced oncologists and all the plans were clinically approved. The details of the breast cancer and NPC datasets were as follows.
Breast cancer dataset: The breast cancer dataset consisted of data from 119 patients. The in-plane pixel spacings of the CT images ranged from 0.77 to 0.97 mm with an average of 0.95 mm, and slice thicknesses were all 5.0 mm. The in-plane resolutions were , and the number of slices ranged from 50 to 129. We used a binary body mask to exclude unnecessary background from each image, and all the images were cropped by the minimal external cube of the body masks and then resized to in-plane. The target areas of tumor bed (TB) and clinical target volume (CTV), and the OARs for heart, spinal cord, and left and right lungs were included in our experiments.
NPC dataset: The NPC dataset consisted of data from 139 patients. The in-plane pixel spacings of the CT images ranged from 0.75 to 0.97 mm with an average of 0.95 mm, and slice thicknesses were 3.0 mm. The in-plane resolutions were , and the number of slices ranged from 69 to 176. All images were also masked and cropped by body masks but then resized to in-plane. The target areas included gross tumor volume (GTV) and PTV, and the OARs included eyes, optic nerve, temporal lobe, brain stem, parotid gland, mandible, and spinal cord.
The intensity ranges for CT images and dose distribution maps were set to HU and Gy, respectively. Both were uniformly and linearly normalized to for training. Both datasets were split into training, validation, and test sets using a patient-wise ratio of ; these sets were used for model training, model selection, and performance evaluation, respectively.
3.1.2 Public Dataset
We utilized the public head and neck cancer dataset from the AAPM OpenKBP challenge [56], which contains 200 training cases, 40 validation cases, and 100 testing cases. The RT plans were prescribed 70, 63, and 56 Gy in 35 fractions to the gross disease (PTV70), intermediate-risk target volumes (PTV63), and elective target volumes (PTV56), respectively. The spacing and size of all images were consistent at 3.906 mm × 3.906 mm × 2.5 mm and , respectively. The ROIs included PTV, brain stem, spinal cord, parotid gland, larynx, esophagus, and mandible.
Table 1: Ablation results for the UNet-based models (first six rows) and the conditional-diffusion-based models (last six rows).
UNet | Diffusion model | PSDM | MS | FF | MAE (Gy) | SSIM | PSNR (dB) | Dose score (Gy) | Volume score (%) |
 | | | | | 3.416 ± 1.741 | 0.797 ± 0.077 | 19.675 ± 4.175 | 11.206 ± 7.456 | 18.910 ± 9.909 |
 | | | | | 2.863 ± 1.199 | 0.811 ± 0.059 | 20.900 ± 3.570 | 9.228 ± 5.114 | 15.039 ± 8.963 |
 | | | | | 2.984 ± 1.392 | 0.813 ± 0.059 | 20.539 ± 3.680 | 8.369 ± 5.129 | 15.981 ± 8.909 |
 | | | | | 1.438 ± 0.224 | 0.896 ± 0.032 | 26.662 ± 1.240 | 1.880 ± 0.611 | 11.301 ± 8.167 |
 | | | | | 1.420 ± 0.211 | 0.890 ± 0.034 | 26.710 ± 1.140 | 1.150 ± 0.333 | 1.718 ± 1.357 |
 | | | | | 1.221 ± 0.241 | 0.913 ± 0.030 | 27.813 ± 1.649 | 1.047 ± 0.438 | 1.256 ± 0.740 |
 | | | | | 2.937 ± 1.132 | 0.807 ± 0.052 | 20.270 ± 3.134 | 10.391 ± 6.376 | 19.413 ± 7.791 |
 | | | | | 2.654 ± 1.156 | 0.820 ± 0.054 | 21.333 ± 3.475 | 8.317 ± 5.361 | 14.790 ± 9.077 |
 | | | | | 2.493 ± 1.011 | 0.819 ± 0.044 | 22.114 ± 3.039 | 7.047 ± 6.557 | 13.143 ± 8.746 |
 | | | | | 1.223 ± 0.206 | 0.903 ± 0.209 | 27.710 ± 1.539 | 1.575 ± 0.440 | 7.031 ± 7.549 |
 | | | | | 1.201 ± 0.208 | 0.906 ± 0.028 | 27.700 ± 1.485 | 0.848 ± 0.233 | 1.138 ± 0.563 |
 | | | | | 1.190 ± 0.227 | 0.900 ± 0.028 | 28.118 ± 1.699 | 0.698 ± 0.131 | 0.922 ± 0.296 |
Table 2: Quantitative comparison of DoseDiff with different distance information inputs.
Input | MAE (Gy) | SSIM | PSNR (dB) | Dose score (Gy) | Volume score (%) |
Mask | 1.536 ± 0.244 | 0.859 ± 0.031 | 25.632 ± 1.271 | 0.856 ± 0.238 | 0.921 ± 0.372 |
TSBDM | 1.287 ± 0.219 | 0.859 ± 0.029 | 28.099 ± 1.673 | 0.788 ± 0.238 | 0.928 ± 0.304 |
ISDM | 1.248 ± 0.214 | 0.878 ± 0.028 | 28.042 ± 1.632 | 0.705 ± 0.181 | 0.928 ± 0.300 |
PSDM | 1.190 ± 0.227 | 0.900 ± 0.028 | 28.118 ± 1.699 | 0.698 ± 0.131 | 0.922 ± 0.296 |
3.2 Evaluation Metrics
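To evaluate the model’s performance quantitatively, we used both common image similarity metrics and dosimetry-related metrics. The image similarity metrics included the mean absolute error (MAE), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR):
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|\hat{y}_i - y_i\right| \tag{14}$$
$$\mathrm{SSIM} = \frac{1}{M} \sum_{j=1}^{M} \frac{\left(2\mu_{\hat{Y}}\mu_{Y} + C_1\right)\left(2\sigma_{\hat{Y}Y} + C_2\right)}{\left(\mu_{\hat{Y}}^2 + \mu_{Y}^2 + C_1\right)\left(\sigma_{\hat{Y}}^2 + \sigma_{Y}^2 + C_2\right)} \tag{15}$$
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{L^2}{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}\right) \tag{16}$$
where $N$ is the total number of voxels; $\hat{y}_i$ and $y_i$ are the voxel values in volumes $\hat{Y}$ and $Y$, respectively. $M$ is the number of local windows in the image. The symbols $\mu_{\hat{Y}}$, $\mu_{Y}$, $\sigma_{\hat{Y}}^2$, $\sigma_{Y}^2$, and $\sigma_{\hat{Y}Y}$ denote the local means, variances, and covariance of the $j$-th local window in volumes $\hat{Y}$ and $Y$. $L$ denotes the distance between the minimum and maximum possible values of the input images, which is set to 3000 in our experiments. Following [57], $C_1$ and $C_2$ are constants to stabilize the division. To eliminate the influence of the background, the mean values over pixels contained within the body masks were computed for these three metrics.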
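Restricting the image similarity metrics to the body mask can be sketched as follows. The example covers MAE and PSNR only (SSIM needs local windows and is left out), treats volumes as flat lists, and uses hypothetical function names, toy values, and a toy `data_range`; it is an illustration of the masking idea rather than the evaluation code used in the experiments.

```python
import math


def masked_mae(pred, gt, mask):
    """Mean absolute error over voxels where mask is non-zero."""
    errors = [abs(p - g) for p, g, m in zip(pred, gt, mask) if m]
    return sum(errors) / len(errors)


def masked_psnr(pred, gt, mask, data_range):
    """PSNR over voxels where mask is non-zero; data_range is max minus min possible value."""
    squared = [(p - g) ** 2 for p, g, m in zip(pred, gt, mask) if m]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(data_range ** 2 / mse)


pred = [1.0, 2.0, 3.0, 100.0]
gt = [1.0, 3.0, 5.0, 0.0]
body = [1, 1, 1, 0]  # the last voxel is background and is ignored

mae = masked_mae(pred, gt, body)
psnr = masked_psnr(pred, gt, body, data_range=10.0)
```

Without the mask, the large background error in the last voxel would dominate both metrics, which is why the averages are taken inside the body only.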
The dosimetry-related metrics included $\Delta D_x$ and $\Delta V_x$. The symbol $\Delta$ denotes the absolute error of the dosimetric metrics between the predicted and ground-truth dose distribution maps. $D_x$ denotes the dose received by $x\%$ of the volume of the ROIs. $D_{min}$, $D_{mean}$, and $D_{max}$, indicating the minimum, average, and maximum doses in the ROI, respectively, were also considered to be members of $D_x$. For the treatment targets, $V_x$ was the volume percentage of the target region receiving $x\%$ of the prescribed dose; for OARs, $V_x$ was the volume percentage of the ROI receiving $x$-Gy dose. In clinical practice, different indicators are considered for different ROIs. Therefore, we assigned appropriate evaluation metrics for different ROIs in the two datasets. In the breast cancer dataset, we used and for TB and CTV; and for heart; and for ipsilateral lung; and for spinal cord. In the NPC dataset, we used and for GTV and PTV; for eyes, optic nerve, temporal lobe, brain stem, and spinal cord; and for parotid gland; and for mandible. To facilitate model comparison and selection, we defined the mean values of $\Delta D_x$ and $\Delta V_x$ for all ROIs as the “dose score” and “volume score”, respectively. The DVH was used for a comprehensive comparison of the differences between the predicted results obtained with different methods and the real dose distribution.
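The dosimetric quantities described above can be computed from the list of dose values inside one ROI. The sketch below assumes equal-volume voxels and a nearest-rank convention for the dose-at-volume metric; the function names, the toy doses, and the indices used in the demo (95% volume, 50 Gy threshold) are hypothetical and are not the indicators assigned to any ROI in this study.

```python
import math


def dose_at_volume(doses, x_percent):
    """Minimum dose received by the hottest x% of ROI voxels (nearest-rank convention)."""
    ordered = sorted(doses, reverse=True)
    k = max(1, math.ceil(len(ordered) * x_percent / 100.0))
    return ordered[k - 1]


def volume_at_dose(doses, threshold_gy):
    """Percentage of ROI voxels receiving at least threshold_gy."""
    return 100.0 * sum(1 for d in doses if d >= threshold_gy) / len(doses)


def abs_error(metric, pred_doses, gt_doses, arg):
    """Absolute error of a dosimetric metric between prediction and ground truth."""
    return abs(metric(pred_doses, arg) - metric(gt_doses, arg))


gt = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
pred = [12.0, 18.0, 33.0, 41.0, 47.0, 62.0, 70.0, 79.0, 95.0, 100.0]

delta_d95 = abs_error(dose_at_volume, pred, gt, 95)
delta_v50 = abs_error(volume_at_dose, pred, gt, 50.0)
```

Averaging such absolute errors over all dose-type indicators and over all volume-type indicators gives single numbers of the kind reported as the dose score and volume score.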
The dose score and DVH score were employed to evaluate the results on the OpenKBP dataset, using the evaluation code (https://github.com/ababier/open-kbp) provided directly by the challenge organizers.
3.3 Ablation studies for DoseDiff
We conducted ablation studies to quantitatively validate the contribution of each component of DoseDiff. Specifically, we sequentially added the PSDM input, multi-scale fusion (MS), and fusionFormer module (FF) to the baseline conditional diffusion model, which initially only used CT images as input. Moreover, we added the proposed elements to the baseline UNet model to further verify their effectiveness. As shown in Table 1, the introduction of PSDM, MS, and FF each resulted in a performance improvement, with PSDM contributing the largest gain. Although multi-scale fusion led to limited improvement in the common image similarity metrics, it significantly enhanced the dosimetric metrics. Compared with the conditional diffusion model baseline, DoseDiff led to improvements of 1.747 Gy in MAE, 0.093 in SSIM, 7.848 dB in PSNR, 9.693 Gy in dose score, and 18.491% in volume score. Furthermore, the experimental results demonstrate that the proposed elements were effective in both the UNet and conditional diffusion model frameworks, with the conditional diffusion model generally outperforming UNet. The results of the ablation experiments indicate that incorporating distance information and enhancing the fusion strategy are conducive to accurate dose distribution prediction with deep-learning models.
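The reported improvements can be checked directly against the mean values in Table 1. The short sketch below takes the conditional diffusion baseline (MAE 2.937 Gy) and the full DoseDiff configuration (MAE 1.190 Gy) and computes each gain as a difference in the direction in which the metric improves; the dictionary keys are illustrative names.

```python
# Mean values from Table 1: diffusion baseline (CT input only) and full DoseDiff.
baseline = {"mae": 2.937, "ssim": 0.807, "psnr": 20.270, "dose_score": 10.391, "volume_score": 19.413}
dosediff = {"mae": 1.190, "ssim": 0.900, "psnr": 28.118, "dose_score": 0.698, "volume_score": 0.922}

# SSIM and PSNR improve when they increase; the other metrics improve when they decrease.
higher_is_better = {"ssim", "psnr"}

gains = {}
for name in baseline:
    if name in higher_is_better:
        gains[name] = round(dosediff[name] - baseline[name], 3)
    else:
        gains[name] = round(baseline[name] - dosediff[name], 3)
```

The resulting values match the improvements quoted in the text, which also confirms which two rows of the table are being compared.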
3.4 Performance of PSDM
To verify the benefit of the proposed PSDM, we compared DoseDiff implementations with different distance information inputs, including mask, transformed signed boundary distance map (TSBDM) [23], ISDM, and PSDM. Note that we shrank all ISDM values by a factor of 100 to stabilize the training. Table 2 shows the quantitative comparison results, demonstrating that PSDM achieved the best performance. A visual comparison of the mask, TSBDM, ISDM, and PSDM on the ROIs of breast cancer is presented in Fig. 3. As shown, the distance maps provided more distinct and effective distance information than the mask images. However, the voxel-wise transformation in TSBDM quickly zeroed out the pixels far from the contour, rendering many background pixels ineffective. In addition, the dynamic change in distance was not obvious in TSBDM. PSDM converts the image distance to the real-world distance, which enables the model to extract the distance map information in different volumes under a consistent measure. PSDM is also more practical than ISDM in the clinic; for example, ISDM equates the in-plane and inter-slice pixel distances, but the slice thickness of CT images is usually larger than the in-plane spacing.
Table 3: Quantitative comparison with SOTA methods on the breast cancer dataset.
ROI | Metrics | DeepLabV3+ | DoseNet | DoseGAN | MCGAN | DGDL | C3D | DiffDP | DoseDiff (Ours) |
Body | MAE (Gy) | 1.602 ± 0.307 | 1.385 ± 0.256 | 1.706 ± 0.268 | 1.510 ± 0.268 | 1.330 ± 0.242 | 1.195 ± 0.256 | 1.455 ± 0.289 | 1.076 ± 0.232 |
 | SSIM | 0.851 ± 0.034 | 0.857 ± 0.030 | 0.821 ± 0.031 | 0.858 ± 0.026 | 0.880 ± 0.027 | 0.913 ± 0.027 | 0.875 ± 0.027 | 0.913 ± 0.024 |
 | PSNR (dB) | 25.861 ± 1.367 | 28.555 ± 1.452 | 26.121 ± 1.312 | 25.287 ± 1.305 | 28.425 ± 1.376 | 27.981 ± 1.476 | 25.411 ± 1.411 | 28.887 ± 1.481 |
TB | (Gy) | 0.794 ± 0.737 | 0.623 ± 0.416 | 0.854 ± 0.759 | 1.204 ± 0.669 | 1.029 ± 0.657 | 0.567 ± 0.404 | 1.183 ± 0.665 | 1.035 ± 0.602 |
 | (%) | 0.098 ± 0.254 | 0.029 ± 0.074 | 0.503 ± 0.528 | 0.037 ± 0.061 | 0.185 ± 0.327 | 0.031 ± 0.074 | 0.107 ± 0.235 | 0.084 ± 0.194 |
CTV | (Gy) | 0.713 ± 0.387 | 1.115 ± 0.420 | 0.620 ± 0.299 | 0.844 ± 0.332 | 1.038 ± 0.385 | 0.975 ± 0.404 | 0.378 ± 0.313 | 0.328 ± 0.261 |
 | (%) | 0.377 ± 0.301 | 0.562 ± 0.463 | 0.495 ± 0.448 | 0.593 ± 0.462 | 0.582 ± 0.456 | 0.557 ± 0.466 | 0.290 ± 0.411 | 0.201 ± 0.214 |
Heart | (Gy) | 1.289 ± 1.408 | 1.120 ± 1.454 | 1.503 ± 1.802 | 0.996 ± 1.330 | 1.293 ± 1.596 | 1.191 ± 1.537 | 1.044 ± 1.179 | 0.938 ± 1.061 |
 | (%) | 1.679 ± 2.541 | 2.021 ± 3.549 | 2.094 ± 3.775 | 1.081 ± 1.810 | 2.026 ± 3.645 | 2.231 ± 3.819 | 1.337 ± 2.091 | 1.315 ± 2.026 |
Ipsilateral lung | (Gy) | 1.179 ± 0.986 | 0.910 ± 0.704 | 1.491 ± 1.069 | 1.071 ± 0.787 | 1.105 ± 0.927 | 1.011 ± 0.741 | 0.818 ± 0.717 | 0.872 ± 0.737 |
 | (%) | 1.566 ± 1.465 | 1.709 ± 1.269 | 1.772 ± 1.695 | 2.092 ± 1.325 | 1.868 ± 1.246 | 1.508 ± 1.185 | 1.229 ± 1.034 | 1.345 ± 1.021 |
Spinal cord | (Gy) | 1.037 ± 1.050 | 1.376 ± 1.317 | 1.557 ± 1.273 | 1.183 ± 1.018 | 1.238 ± 1.284 | 1.308 ± 1.307 | 1.044 ± 1.016 | 0.877 ± 0.812 |
3.5 Trade-off between performance and inference speed
In diffusion models, the generative process is defined as the reverse of a particular Markovian diffusion process. Consequently, the generation of the image necessitates a complete sequence of sampling, which is time-consuming. The DDIM converts the Markovian diffusion process in the diffusion model into a non-Markovian diffusion process that allows us to trade computation for sample quality. Based on the DDIM, we conducted a comparison of the performance and inference time of DoseDiff across varying generation steps. As shown in Fig. 4, the performance of the model improved with more generation steps but stabilized after eight steps. The inference time was approximately proportional to the number of generation steps. To trade off performance and inference time, we suggest setting the generation step to 8 in the DDIM reverse process, resulting in an inference time of 4.7s per volume; our subsequent experiments all adopt this setting.
Table 4: Quantitative comparison with SOTA methods on the NPC dataset.
ROI | Metrics | DeepLabV3+ | DoseNet | DoseGAN | MCGAN | DGDL | C3D | DiffDP | DoseDiff (Ours) |
Body | MAE (Gy) | 2.075 ± 0.309 | 1.923 ± 0.346 | 2.495 ± 0.435 | 1.973 ± 0.359 | 1.926 ± 0.346 | 1.851 ± 0.378 | 1.965 ± 0.424 | 1.676 ± 0.387 |
 | SSIM | 0.827 ± 0.042 | 0.839 ± 0.034 | 0.776 ± 0.082 | 0.856 ± 0.030 | 0.837 ± 0.044 | 0.860 ± 0.031 | 0.867 ± 0.037 | 0.905 ± 0.018 |
 | PSNR (dB) | 25.880 ± 1.048 | 26.430 ± 1.304 | 24.592 ± 1.044 | 25.580 ± 1.071 | 26.445 ± 1.191 | 26.256 ± 1.341 | 25.526 ± 1.254 | 26.397 ± 1.249 |
GTV | (Gy) | 0.758 ± 0.585 | 0.596 ± 0.475 | 0.858 ± 0.684 | 0.756 ± 0.518 | 0.637 ± 0.374 | 1.831 ± 0.553 | 1.777 ± 1.282 | 1.243 ± 0.677 |
 | (%) | 0.494 ± 0.488 | 0.500 ± 0.548 | 0.889 ± 0.975 | 0.430 ± 0.560 | 0.516 ± 0.504 | 0.585 ± 0.691 | 0.996 ± 1.177 | 0.697 ± 0.842 |
PTV | (Gy) | 8.695 ± 9.358 | 7.896 ± 7.309 | 7.822 ± 9.289 | 3.072 ± 3.389 | 7.475 ± 7.005 | 6.666 ± 6.334 | 2.977 ± 2.220 | 2.623 ± 2.322 |
 | (%) | 4.285 ± 3.283 | 3.740 ± 2.199 | 6.595 ± 4.172 | 5.534 ± 3.660 | 3.669 ± 2.282 | 2.441 ± 1.84 | 5.032 ± 3.284 | 4.247 ± 2.811 |
Parotid gland | (Gy) | 2.229 ± 1.492 | 1.804 ± 1.497 | 2.312 ± 1.196 | 1.581 ± 1.237 | 1.962 ± 1.361 | 1.538 ± 1.006 | 2.144 ± 1.308 | 1.944 ± 1.118 |
 | (%) | 8.684 ± 5.510 | 5.898 ± 4.276 | 4.814 ± 3.722 | 4.808 ± 3.878 | 5.736 ± 3.420 | 6.779 ± 5.191 | 5.404 ± 3.640 | 5.112 ± 3.300 |
Eyes | (Gy) | 6.106 ± 5.561 | 5.801 ± 5.836 | 5.601 ± 3.816 | 6.442 ± 6.387 | 6.163 ± 4.553 | 5.087 ± 4.245 | 7.926 ± 4.247 | 6.449 ± 4.271 |
Optic nerve | (Gy) | 5.753 ± 4.525 | 2.808 ± 2.637 | 2.947 ± 3.031 | 6.164 ± 6.851 | 4.244 ± 3.479 | 2.871 ± 2.976 | 7.879 ± 7.959 | 6.324 ± 6.922 |
Temporal lobe | (Gy) | 3.130 ± 1.593 | 4.446 ± 2.241 | 3.739 ± 2.863 | 2.659 ± 1.615 | 3.743 ± 2.528 | 3.361 ± 2.530 | 3.948 ± 2.849 | 4.165 ± 2.531 |
Brain stem | (Gy) | 4.030 ± 1.863 | 4.033 ± 1.770 | 4.960 ± 1.865 | 3.842 ± 1.675 | 4.334 ± 1.927 | 3.623 ± 1.810 | 4.118 ± 2.649 | 3.616 ± 2.276 |
Mandible | (Gy) | 7.206 ± 5.779 | 7.006 ± 6.568 | 5.466 ± 5.957 | 5.226 ± 5.801 | 8.627 ± 6.366 | 6.639 ± 5.807 | 4.674 ± 4.500 | 4.483 ± 4.798 |
Spinal cord | (Gy) | 2.760 ± 1.543 | 1.602 ± 0.914 | 1.421 ± 0.913 | 4.855 ± 0.868 | 2.405 ± 0.888 | 2.558 ± 1.418 | 2.302 ± 1.252 | 1.992 ± 1.293 |
3.6 Comparison with SOTA methods
To demonstrate the superiority of our method, we compared DoseDiff with several SOTA dose prediction methods, including the 2D models DeepLabV3+ [6], MCGAN [5], and DiffDP [45], and the 3D models DoseNet [17], DoseGAN [18], distance-guided deep learning (DGDL) [23], and cascade 3D U-Net (C3D) [58]. DGDL differs from the other methods in that it uses TSBDM distance maps as input, whereas the other methods use mask images. We conducted both quantitative and visual comparisons on the breast cancer and NPC datasets.
Comparison on breast cancer dataset: Table 3 and Figure 5 present the common image similarity metrics and detailed dosimetry-related metrics for all methods compared on the breast cancer dataset. As shown in Table 3, our method achieved stable top performance on most evaluation metrics. Although C3D and DoseNet exhibit excellent predictive performance on the TB, their performance on other ROIs is relatively poor. We speculate that they may be overfitting to the dose distribution in the TB, which is a region with relatively uniform and fixed cumulative dose values (e.g., 63.8 Gy for our dataset). Figure 6 shows the dose distribution maps predicted by the different methods, as well as the differences between these maps and the ground truths. The dose distribution predicted by our method was closest to the real dose distribution. Specifically, our method learns clear ray paths that reflect the fundamental property of straight-line propagation, which helps medical physicists obtain precise parameters, such as the number of radiation fields and the incident angles, from the predicted dose distribution map. Furthermore, Fig. 7 presents the DVHs of the dose distribution maps predicted by the compared methods, allowing for a comprehensive comparison of the dose distributions. The DVH curve of the dose distribution map predicted by our method was the most similar to the ground-truth curve.
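For readers less familiar with DVHs, the following minimal sketch shows how a cumulative DVH curve of the kind compared here can be computed: for each dose level, it reports the percentage of an ROI's volume that receives at least that dose. The function name `cumulative_dvh`, the flat-list inputs, and the 1 Gy default bin width are assumptions made for illustration, and all voxels are assumed to have equal volume; this is not the evaluation code used in this work.

```python
def cumulative_dvh(dose, roi_mask, bin_width=1.0):
    """Cumulative dose-volume histogram for one ROI.

    dose      flat list of voxel doses in Gy
    roi_mask  flat list of truthy/falsy flags, one per voxel
    Returns (dose_levels, volume_percent), where volume_percent[i] is the
    percentage of ROI voxels receiving at least dose_levels[i].
    """
    roi_dose = [d for d, inside in zip(dose, roi_mask) if inside]
    if not roi_dose:
        raise ValueError("ROI mask selects no voxels")
    # One extra bin past the maximum so that the curve ends at 0 %.
    n_bins = int(max(roi_dose) // bin_width) + 2
    levels = [i * bin_width for i in range(n_bins)]
    volume = [
        100.0 * sum(d >= level for d in roi_dose) / len(roi_dose)
        for level in levels
    ]
    return levels, volume


# Toy example: four ROI voxels plus one voxel outside the ROI.
levels, volume = cumulative_dvh(
    dose=[0.0, 10.0, 20.0, 30.0, 99.0],
    roi_mask=[1, 1, 1, 1, 0],
    bin_width=10.0,
)
```

Computing this curve for the predicted and ground-truth dose maps of the same ROI and overlaying the two gives the kind of comparison shown in the DVH figures.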
Methods | Dose score | DVH score |
DeepLabV3+ [6] | 3.551 ± 1.126 | 2.081 ± 2.169 |
DoseNet [17] | 2.861 ± 1.107 | 1.647 ± 1.926 |
DoseGAN [18] | 2.873 ± 1.125 | 1.695 ± 1.973 |
MCGAN [5] | 3.442 ± 1.081 | 1.796 ± 2.210 |
DGDL [23] | 3.026 ± 1.085 | 1.748 ± 2.033 |
C3D [58] | 2.429 ± 1.101 | 1.478 ± 1.931 |
DiffDP [45] | 3.173 ± 0.966 | 1.602 ± 1.889 |
DoseDiff (Ours) | 2.382 ± 0.925 | 1.476 ± 1.645 |
Comparison on NPC dataset: Table 4 and Figure 8 present the results of a quantitative comparison between our method and the SOTA approaches on the NPC dataset. Our method achieved competitive results in terms of both image similarity and dosimetry-related metrics. The transverse and coronal plane samples of dose distribution maps predicted by all the compared methods are shown in Fig. 9. Consistent with the performance observed on the breast cancer dataset, the dose distribution maps predicted by previous methods were overly smooth, whereas our method’s predictions retained more realistic ray path characteristics. Furthermore, we could observe from the coronal images that our method maintained a high level of prediction accuracy for regions that were far from the ROIs, owing to the distance information provided by the PSDM. The DVH curves of the dose distribution maps predicted by the compared methods on the NPC dataset are presented in Fig. 10. Compared to other methods, the DVH curves of the target and OARs predicted by our method are closer to the ground truth.
Dataset | MAE (Gy) | SSIM | PSNR (dB) | Dose score (Gy) | Volume score (%) |
Breast cancer | 0.0011 | 0.0002 | 0.0099 | 0.0058 | 0.0029 |
NPC | 0.0012 | 0.0001 | 0.0076 | 0.0226 | 0.0153 |
Comparison on OpenKBP dataset: The results of a quantitative comparison between our method and the SOTA methods on the OpenKBP dataset are shown in Table 5. The images in the OpenKBP dataset have larger in-plane spacings and smaller slice thickness than those in our in-house datasets, implying that dose prediction on this dataset relies more heavily on 3D information. Consequently, 2D models generally underperform 3D models on the OpenKBP dataset. Although DoseDiff is also a 2D model, the 3D distance information introduced by PSDM enables it to achieve competitive results. Figure 11 shows the transverse and coronal plane samples of dose distribution maps predicted by all the compared methods. The visualization results demonstrate that our method exhibits minimal error between predicted values and ground truth.
3.7 The influence of non-deterministic prediction
Each dose prediction in our DoseDiff begins with a randomly generated Gaussian noise image. Moreover, each step in the DDIM-based reverse process involves Gaussian sampling, as in Eqs. 7 and 8. The randomness introduced by these two factors makes the dose prediction of our model non-deterministic. To investigate the influence of this non-determinism on the prediction results, we conducted 50 dose predictions for both the breast cancer and NPC datasets using different random seeds. The variance of each mean metric for our DoseDiff on the two datasets is shown in Table 6, and the results indicate a small degree of fluctuation for each metric.
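The quantity reported here, the variance of a mean metric across random seeds, can be sketched as below. This is an illustrative reading of the procedure rather than the evaluation script used in this work: `variance_of_mean_metric` and `fake_evaluate` are hypothetical names, the synthetic scores stand in for real per-patient metrics, and the use of the sample variance (rather than the population variance) is an assumption.

```python
import random
import statistics


def variance_of_mean_metric(evaluate_once, seeds):
    """Variance, across seeds, of the test-set mean of one metric.

    evaluate_once(seed) is a hypothetical callable that runs one full
    prediction pass with the given seed and returns per-patient scores.
    """
    means = [statistics.mean(evaluate_once(seed)) for seed in seeds]
    return statistics.variance(means)  # sample variance (an assumption)


def fake_evaluate(seed, n_patients=20):
    """Synthetic stand-in: per-patient MAE values jittered by the seed."""
    rng = random.Random(seed)
    return [1.7 + rng.gauss(0.0, 0.05) for _ in range(n_patients)]


var_mae = variance_of_mean_metric(fake_evaluate, seeds=range(50))
```

A small value indicates that re-running the sampler with a different seed changes the test-set average only slightly, which is the sense in which the fluctuation is described as small.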
4 Discussion
In this study, we aimed to develop a model to accurately predict dose distribution for individual RT patients, enabling medical physicists to achieve acceptable RT plans with less trial and error. Clinically, both CT images and ROI masks are essential for RT planning, so most previous studies have used deep learning to directly learn mappings from them to dose distribution maps. However, the mask image prevents the deep learning model from effectively extracting distance information between surrounding tissues and targets or OARs. Some studies have replaced masks with distance maps and observed improved dose prediction accuracy as a result [19, 23], because each voxel of the distance map explicitly provides distance information relative to the ROI contour. Here, inspired by the success of diffusion models, we propose a distance-aware conditional diffusion model called DoseDiff for dose prediction. Using distance maps as input, we also explore a fusion strategy involving distance maps and CT images. We incorporate a multi-scale fusion strategy and a fusionFormer module in the proposed MMFNet to achieve effective fusion of complex information.
Diffusion models naturally possess a strong ability to model data distributions, as they were specifically developed for this purpose. Therefore, one advantage of introducing a conditional diffusion model into image-to-image translation tasks is that, as well as learning the mapping from conditions to output, it can also effectively capture the distribution characteristics of the generated image itself. By contrast, previous methods have only learned the mapping between input and output images. As shown in Figs. 6 and 9, the dose distribution maps generated by our method are more realistic than those obtained with previous methods, in which the ray path characteristics are unclear and distorted. Dose distribution maps of the latter kind may not exist in the real world and may therefore be of limited use in helping medical physicists obtain the relevant parameters for treatment planning.
Unlike previous single-step methods, our conditional diffusion model predicts dose distribution through a multi-step denoising process. This paradigm progressively recovers the prediction from Gaussian noise in a recursive manner, which improves the reliability and realism of the results. As shown in Table 1, the conditional diffusion model outperformed the single-step UNet model on various dose prediction metrics. However, the multi-step paradigm inevitably increases the time required for model training and inference. We explored the relationship between the performance and inference time of DoseDiff for different numbers of generation steps in Section 3.5. Our experimental results demonstrate that DoseDiff, using DDIM sampling, is able to generate an accurate dose distribution map for a new patient in only eight steps.
Proper use of distance maps can improve the performance of dose prediction models. The unit distance in the vanilla SDM (i.e., ISDM) is one voxel, so the magnitude of the values in ISDM is usually comparable to the volume size. Such large values are usually not conducive to the stable training of neural networks. Therefore, Yue et al. [23] proposed a voxel-wise transformation to shrink the values in the SDM. The transformation keeps values inside the ROI large, while gradually decreasing the values outside the ROI so that they eventually approach zero; this makes the numerical distribution trend in the SDM similar to that of the dose distribution. However, this transformation is too drastic, as it causes large numbers of voxels outside the ROI to become zero, rendering them unable to provide effective distance information. Our experiments demonstrate that ISDM with simple division by a normalization constant achieves better performance than TSBDM, as shown in Table 2. The PSDM can be regarded as a globally normalized version of ISDM that additionally incorporates spacing information. The transformation in TSBDM is still an instance-wise normalization, whereas our proposed PSDM normalizes the values in all SDMs to a unified distance unit (e.g., a decimeter). Therefore, the model can extract and analyze the information in PSDM more efficiently. In addition, PSDM is calculated on the 3D volume, so every voxel in PSDM contains some 3D information. Although DoseDiff is a 2D model, with the support of PSDM, its performance is not inferior to that of SOTA 3D models (i.e., DoseNet, DoseGAN, and DGDL).
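The difference between a voxel-unit SDM and a spacing-aware one can be illustrated with a small brute-force sketch. It is a simplified illustration under explicit assumptions, not the implementation used in this work: the function name `physical_signed_distance_map` is hypothetical, the distance to the ROI outline is approximated by the distance to the nearest voxel of the opposite class, the sign convention (positive inside, negative outside) is assumed, and the exhaustive search is only practical for tiny volumes. In practice, a Euclidean distance transform routine that accepts per-axis spacing would be used instead.

```python
import math
from itertools import product


def physical_signed_distance_map(mask, spacing_mm, unit_mm=100.0):
    """Signed distance map in physical units (default: decimeters).

    mask        nested list mask[z][y][x], truthy inside the ROI
    spacing_mm  (z, y, x) voxel spacing in millimeters
    unit_mm     length of one output unit in millimeters (100 mm = 1 dm)
    """
    nz, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    sz, sy, sx = spacing_mm
    coords = list(product(range(nz), range(ny), range(nx)))
    inside = [c for c in coords if mask[c[0]][c[1]][c[2]]]
    outside = [c for c in coords if not mask[c[0]][c[1]][c[2]]]
    if not inside or not outside:
        raise ValueError("mask must contain both ROI and background voxels")

    def distance_mm(a, b):
        # Scale each index difference by the spacing along that axis.
        return math.sqrt(
            ((a[0] - b[0]) * sz) ** 2
            + ((a[1] - b[1]) * sy) ** 2
            + ((a[2] - b[2]) * sx) ** 2
        )

    sdm = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z, y, x in coords:
        is_inside = bool(mask[z][y][x])
        # Approximate the contour by the nearest voxel of the opposite class.
        others = outside if is_inside else inside
        d = min(distance_mm((z, y, x), o) for o in others)
        # Assumed sign convention: positive inside the ROI, negative outside.
        sdm[z][y][x] = (d if is_inside else -d) / unit_mm
    return sdm


# A single ROI voxel in a 1 x 1 x 5 volume with 2 mm spacing along x.
example = physical_signed_distance_map([[[0, 0, 1, 0, 0]]], spacing_mm=(3.0, 1.0, 2.0))
```

Because every volume is expressed in the same physical unit regardless of its voxel spacing, equal values in two different patients' maps correspond to equal physical distances, which is the property that distinguishes this normalization from an instance-wise one.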
Existing dose prediction methods usually adopt early fusion for the CT image and mask/SDM, i.e., concatenating them at the input level. This fusion strategy is easy to implement, but it is often suboptimal. Therefore, we propose MMFNet to extract more valuable interactive information between the CT image and PSDM. MMFNet uses multi-scale feature-level fusion for the CT image and PSDM. Moreover, a fusionFormer module is used to capture long-range dependencies and fuse global information. Long-range perception is important for dose prediction because the dose of a voxel is related to all the tissues through which the ray passes before reaching that voxel. Transformers have better long-range perception capability than CNNs, so we designed a transformer-based fusion module. The experimental results in Table 1 show the effectiveness of our fusion strategy.
Our work has several limitations. First, the 3D information provided by PSDM is relatively limited, and our method is unable to use 3D structural information from CT images, which would be beneficial to dose prediction. Therefore, our future work will focus on extending DoseDiff into a 3D model. Second, DoseDiff has no advantage in inference speed compared with one-step dose prediction models, as the diffusion model paradigm inevitably incurs higher inference costs. Fortunately, approaches such as DDIM have been proposed to reduce the inference time of diffusion models while preserving their performance. Third, while non-deterministic predictions have limited influence on model performance, as shown in Table 6, they remain an unresolved issue that requires attention. This is a common challenge when applying conditional diffusion models to image-to-image tasks.
5 Conclusion
In this paper, we propose DoseDiff, a distance-aware conditional diffusion model, for predicting dose distribution maps. DoseDiff uses CT images and SDMs as conditions, and dose distribution map prediction is defined as a sequence of denoising steps guided by these conditions. Using PSDM as a model input results in better performance than using a mask, as it can effectively provide distance information. Moreover, MMFNet is proposed to effectively extract and fuse features from CT images and SDMs. We used ablation studies to evaluate the contribution of each element of DoseDiff. Comparison with other methods on two datasets showed that DoseDiff achieves SOTA performance in dose prediction.
References
- [1] B. Sahiner, A. Pezeshk, L. M. Hadjiiski, X. Wang, K. Drukker, K. H. Cha, R. M. Summers, and M. L. Giger, “Deep learning in medical imaging and radiation therapy,” Medical Physics, vol. 46, no. 1, pp. 1–36, 2019.
- [2] G. Delaney, S. Jacob, C. Featherstone, and M. Barton, “The role of radiotherapy in cancer treatment: estimating optimal utilization from a review of evidence-based clinical guidelines,” Cancer: Interdisciplinary International Journal of the American Cancer Society, vol. 104, no. 6, pp. 1129–1137, 2005.
- [3] D. L. Craft, T. S. Hong, H. A. Shih, and T. R. Bortfeld, “Improved planning time and plan quality through multicriteria optimization for intensity-modulated radiotherapy,” International Journal of Radiation Oncology* Biology* Physics, vol. 82, no. 1, pp. 83–90, 2012.
- [4] L. J. Schreiner, “On the quality assurance and verification of modern radiation therapy treatment,” Journal of Medical Physics/Association of Medical Physicists of India, vol. 36, no. 4, p. 189, 2011.
- [5] B. Zhan, J. Xiao, C. Cao, X. Peng, C. Zu, J. Zhou, and Y. Wang, “Multi-constraint generative adversarial network for dose prediction in radiotherapy,” Medical Image Analysis, vol. 77, p. 102339, 2022.
- [6] Y. Song, J. Hu, Y. Liu, H. Hu, Y. Huang, S. Bai, and Z. Yi, “Dose prediction using a deep neural network for accelerated planning of rectal cancer radiotherapy,” Radiotherapy and Oncology, vol. 149, pp. 111–116, 2020.
- [7] S. Shiraishi, J. Tan, L. A. Olsen, and K. L. Moore, “Knowledge-based prediction of plan quality metrics in intracranial stereotactic radiosurgery,” Medical Physics, vol. 42, no. 2, pp. 908–917, 2015.
- [8] O. Nwankwo, H. Mekdash, D. S. K. Sihono, F. Wenz, and G. Glatting, “Knowledge-based radiation therapy (kbrt) treatment planning versus planning by experts: validation of a kbrt algorithm for prostate cancer treatment planning,” Radiation Oncology, vol. 10, no. 1, pp. 1–5, 2015.
- [9] B. Wu, F. Ricchetti, G. Sanguineti, M. Kazhdan, P. Simari, R. Jacques, R. Taylor, and T. McNutt, “Data-driven approach to generating achievable dose–volume histogram objectives in intensity-modulated radiotherapy planning,” International Journal of Radiation Oncology* Biology* Physics, vol. 79, no. 4, pp. 1241–1247, 2011.
- [10] T. Song, D. Staub, M. Chen, W. Lu, Z. Tian, X. Jia, Y. Li, L. Zhou, S. B. Jiang, and X. Gu, “Patient-specific dosimetric endpoints based treatment plan quality control in radiotherapy,” Physics in Medicine & Biology, vol. 60, no. 21, p. 8213, 2015.
- [11] R. R. Deshpande, J. DeMarco, J. W. Sayre, and B. J. Liu, “Knowledge-driven decision support for assessing dose distributions in radiation therapy of head and neck cancer,” International Journal of Computer Assisted Radiology and Surgery, vol. 11, pp. 2071–2083, 2016.
- [12] G. Valdes, C. B. Simone II, J. Chen, A. Lin, S. S. Yom, A. J. Pattison, C. M. Carpenter, and T. D. Solberg, “Clinical decision support of radiotherapy treatment planning: A data-driven machine learning strategy for patient-specific dosimetric decision making,” Radiotherapy and Oncology, vol. 125, no. 3, pp. 392–397, 2017.
- [13] C. McIntosh and T. G. Purdie, “Contextual atlas regression forests: multiple-atlas-based automated dose prediction in radiation therapy,” IEEE Transactions on Medical Imaging, vol. 35, no. 4, pp. 1000–1012, 2015.
- [14] O. Morin, M. Vallières, A. Jochems, H. C. Woodruff, G. Valdes, S. E. Braunstein, J. E. Wildberger, J. E. Villanueva-Meyer, V. Kearney, S. S. Yom et al., “A deep look into the future of quantitative imaging in oncology: a statement of working principles and proposal for change,” International Journal of Radiation Oncology* Biology* Physics, vol. 102, no. 4, pp. 1074–1082, 2018.
- [15] P. Dong and L. Xing, “Deep dosenet: a deep neural network for accurate dosimetric transformation between different spatial resolutions and/or different dose calculation algorithms for precision radiation therapy,” Physics in Medicine & Biology, vol. 65, no. 3, p. 035010, 2020.
- [16] J. Fan, J. Wang, Z. Chen, C. Hu, Z. Zhang, and W. Hu, “Automatic treatment planning based on three-dimensional dose distribution predicted from deep learning technique,” Medical Physics, vol. 46, no. 1, pp. 370–381, 2019.
- [17] V. Kearney, J. W. Chan, S. Haaf, M. Descovich, and T. D. Solberg, “Dosenet: a volumetric dose prediction algorithm using 3d fully-convolutional neural networks,” Physics in Medicine & Biology, vol. 63, no. 23, p. 235022, 2018.
- [18] V. Kearney, J. W. Chan, T. Wang, A. Perry, M. Descovich, O. Morin, S. S. Yom, and T. D. Solberg, “Dosegan: a generative adversarial network for synthetic dose prediction using attention-gated discrimination and generation,” Scientific Reports, vol. 10, no. 1, p. 11073, 2020.
- [19] C. Kontaxis, G. Bol, J. Lagendijk, and B. Raaymakers, “Deepdose: towards a fast dose calculation engine for radiation therapy using deep learning,” Physics in Medicine & Biology, vol. 65, no. 7, p. 075013, 2020.
- [20] D. Nguyen, X. Jia, D. Sher, M.-H. Lin, Z. Iqbal, H. Liu, and S. Jiang, “3d radiotherapy dose prediction on head and neck cancer patients with a hierarchically densely connected u-net deep learning architecture,” Physics in Medicine & Biology, vol. 64, no. 6, p. 065020, 2019.
- [21] D. Nguyen, T. Long, X. Jia, W. Lu, X. Gu, Z. Iqbal, and S. Jiang, “A feasibility study for predicting optimal radiation therapy dose distributions of prostate cancer patients from patient anatomy using deep learning,” Scientific Reports, vol. 9, no. 1, p. 1076, 2019.
- [22] I. Sumida, T. Magome, I. J. Das, H. Yamaguchi, H. Kizaki, K. Aboshi, H. Yamaguchi, Y. Seo, F. Isohashi, and K. Ogawa, “A convolution neural network for higher resolution dose prediction in prostate volumetric modulated arc therapy,” Medical Physics, vol. 72, pp. 88–95, 2020.
- [23] M. Yue, X. Xue, Z. Wang, R. L. Lambo, W. Zhao, Y. Xie, J. Cai, and W. Qin, “Dose prediction via distance-guided deep learning: Initial development for nasopharyngeal carcinoma radiotherapy,” Radiotherapy and Oncology, vol. 170, pp. 198–204, 2022.
- [24] J. Hu, Y. Song, Q. Wang, S. Bai, and Z. Yi, “Incorporating historical sub-optimal deep neural networks for dose prediction in radiotherapy,” Medical Image Analysis, vol. 67, p. 101886, 2021.
- [25] L. Yuan, Y. Ge, W. R. Lee, F. F. Yin, J. P. Kirkpatrick, and Q. J. Wu, “Quantitative analysis of the factors which affect the interpatient organ-at-risk dose sparing variation in imrt plans,” Medical Physics, vol. 39, no. 11, pp. 6868–6878, 2012.
- [26] Z. Jiao, X. Peng, Y. Wang, J. Xiao, D. Nie, X. Wu, X. Wang, J. Zhou, and D. Shen, “Transdose: Transformer-based radiotherapy dose prediction from ct images guided by super-pixel-level gcn classification,” Medical Image Analysis, vol. 89, p. 102902, 2023.
- [27] F. Xiao, J. Cai, X. Zhou, L. Zhou, T. Song, and Y. Li, “Transdose: a transformer-based unet model for fast and accurate dose calculation for mr-linacs,” Physics in Medicine & Biology, vol. 67, no. 12, p. 125013, 2022.
- [28] J. Zhang, S. Liu, H. Yan, T. Li, R. Mao, and J. Liu, “Predicting voxel-level dose distributions for esophageal radiotherapy using densely connected network with dilated convolutions,” Physics in Medicine & Biology, vol. 65, no. 20, p. 205013, 2020.
- [29] T. Zhou, S. Ruan, and S. Canu, “A review: Deep learning for medical image segmentation using multi-modality fusion,” Array, vol. 3, p. 100004, 2019.
- [30] T. Zhou, H. Fu, G. Chen, J. Shen, and L. Shao, “Hi-net: hybrid-fusion network for multi-modal mr image synthesis,” IEEE Transactions on Medical Imaging, vol. 39, no. 9, pp. 2772–2781, 2020.
- [31] Z. Meng, Y. Zhu, W. Pang, J. Tian, F. Nie, and K. Wang, “Msmfn: An ultrasound based multi-step modality fusion network for identifying the histologic subtypes of metastatic cervical lymphadenopathy,” IEEE Transactions on Medical Imaging, vol. 42, no. 4, pp. 996–1008, 2023.
- [32] B. Wang, L. Teng, L. Mei, Z. Cui, X. Xu, Q. Feng, and D. Shen, “Deep learning-based head and neck radiotherapy planning dose prediction via beam-wise dose decomposition,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 575–584.
- [33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [34] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
- [35] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
- [36] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
- [37] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021.
- [38] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
- [39] S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 428–22 437.
- [40] S. Gao, X. Liu, B. Zeng, S. Xu, Y. Li, X. Luo, J. Liu, X. Zhen, and B. Zhang, “Implicit diffusion models for continuous super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 021–10 030.
- [41] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713–4726, 2022.
- [42] Q. Lyu and G. Wang, “Conversion between ct and mri images using diffusion and score-matching models,” arXiv preprint arXiv:2209.12104, 2022.
- [43] L. Cai, H. Gao, and S. Ji, “Multi-stage variational auto-encoders for coarse-to-fine image generation,” in Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 2019, pp. 630–638.
- [44] Y. Ma, X. Liu, S. Bai, L. Wang, D. He, and A. Liu, “Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation.” in IJCAI, 2019, pp. 3123–3129.
- [45] Z. Feng, L. Wen, P. Wang, B. Yan, X. Wu, J. Zhou, and Y. Wang, “Diffdp: Radiotherapy dose prediction via a diffusion model,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 191–201.
- [46] M. A. Wajid and A. Zafar, “Multimodal fusion: A review, taxonomy, open challenges, research roadmap and future directions,” Neutrosophic Sets and Systems, vol. 45, no. 1, p. 8, 2021.
- [47] H. Hermessi, O. Mourali, and E. Zagrouba, “Multimodal medical image fusion review: Theoretical background and recent advances,” Signal Processing, vol. 183, p. 108036, 2021.
- [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [49] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [50] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
- [51] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
- [52] Y. Wu and K. He, “Group normalization,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
- [53] S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” in International Conference on Machine Learning. PMLR, 2021, pp. 2286–2296.
- [54] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, “Incorporating convolution designs into visual transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 579–588.
- [55] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [56] A. Babier, B. Zhang, R. Mahmood, K. L. Moore, T. G. Purdie, A. L. McNiven, and T. C. Chan, “Openkbp: the open-access knowledge-based planning grand challenge and dataset,” Medical Physics, vol. 48, no. 9, pp. 5549–5561, 2021.
- [57] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
- [58] S. Liu, J. Zhang, T. Li, H. Yan, and J. Liu, “A cascade 3d u-net for dose prediction in radiotherapy,” Medical Physics, vol. 48, no. 9, pp. 5574–5582, 2021.