License: arXiv.org perpetual non-exclusive license
arXiv:2404.01102v2 [eess.IV] 09 Apr 2024

Diffusion based Zero-shot Medical Image-to-Image Translation for Cross Modality Segmentation

Zihao Wang
Athinoula A. Martinos Center for Biomedical Imaging
Massachusetts General Hospital and Harvard University
02129, Boston, US

Yingyu Yang
Inria centre at Université Côte d’Azur
06902 Valbonne, France

Yuzhou Chen
Department of Computer and Information Sciences at Temple University
19122, Philadelphia, US

Tinting Yuan
Institute of Computer Science at Georg-August-University of Göttingen
37073 Göttingen, Germany

Maxime Sermesant
Inria centre at Université Côte d’Azur
06902 Valbonne, France

Hervé Delingette
Inria centre at Université Côte d’Azur
06902 Valbonne, France

Ona Wu
Athinoula A. Martinos Center for Biomedical Imaging
Massachusetts General Hospital and Harvard University
02129, Boston, US

Cross-modality image segmentation aims to segment the target modalities using a method designed in the source modality. Deep generative models can translate the target modality images into the source modality, thus enabling cross-modality segmentation. However, a vast body of existing cross-modality image translation methods relies on supervised learning. In this work, we aim to address the challenge of zero-shot learning-based image translation tasks (the extreme scenario in which the target modality is unseen in the training phase). To leverage generative learning for zero-shot cross-modality image segmentation, we propose a novel unsupervised image translation method. The framework learns to translate the unseen source image to the target modality for image segmentation by leveraging the inherent statistical consistency between different modalities for diffusion guidance. Our framework captures identical cross-modality features in the statistical domain, offering diffusion guidance without relying on direct mappings between the source and target domains. This advantage allows our method to adapt to changing source domains without the need for retraining, making it highly practical when sufficient labeled source domain data is not available. The proposed framework is validated in zero-shot cross-modality image segmentation tasks through empirical comparisons with influential generative models, including adversarial-based and diffusion-based models.

1 Introduction

Leveraging existing tools to handle data across multiple modalities presents both practical benefits and inherent challenges. For example, in the medical imaging field, numerous resources are readily available for image segmentation within certain modalities, such as 3D T1-weighted MRI (referred to as T1w). However, for other modalities like Proton Density-weighted MRI (PDw), comparable tools are harder to find. It would be more resource-efficient to employ segmentation tools initially designed for T1w images on PDw images, instead of hunting for extra segmentation tools crafted specifically for the PDw modality. One promising avenue to navigate this issue is zero-shot cross-modality image translation.

Several methods grounded on generative adversarial networks (GAN) 8709985 ; gan1 ; gan2 ; gan3 ; gan4 ; gan5 ; cmtgan7 have been put forth to tackle the image translation challenge. In these techniques, the generative model typically models the destination modality straightaway, thereby ensuring translation authenticity gan8 ; gan7 . Notably, designs based on GAN often incorporate intricate adversarial structures and demand the crafting of modality-specific loss functions tailored to diverse translation endeavors sdeit ; WANG2021101990 . While GAN-based translations can function without matching datasets during the training phase, they still hinge on data from the original domain. This dependence can pose difficulties, especially when there’s a scarcity of source domain data, complicating the equilibrium of cycle consistency training 9612090 ; 9195151 . This issue becomes even more pronounced when only a scant amount of samples is present for the target modality.

Recent studies dhariwal2021diffusion ; diff_cond1 ; diff_con2 ; nonuni have shown that score-based generative models can outperform GAN-based models. Muzaffer et al. GANDIFF introduced SynDiff, a framework with cycle consistency and dual diffusion for enhanced semantic alignment. Nonetheless, this method doubles the computational overhead, requires pre-training a generator to estimate the associated source images, and requires source domain data, which deviates from the zero-shot learning paradigm. Meanwhile, Meng et al. sdeit proposed using a diffusion model (termed SDEdit) songsde to perform image translation via zero-shot learning. Compared with GAN-based models, SDEdit has a simpler model configuration and a more straightforward loss function. However, a limitation of SDEdit emerges in the cross-modality translation context. It is predicated on perturbation-based guidance, which inherently assumes that both the original and destination modalities are affected uniformly by noise. Such an assumption often proves invalid in cross-modality image translation, for instance, when translating from PDw to T1w MRI.

To address the hurdles posed by zero-shot learning in cross-modality translation, our approach leverages statistical feature-wise homogeneity to condition a diffusion model tailored for zero-shot cross-modality image translation. This strategy exploits a connection between the source and the target modalities, capitalizing on their localized statistical features to accomplish cross-modality image translation via the diffusion paradigm. Fig. 1 illustrates our statistical feature-driven diffusion model ($LMI$: Local-wise Mutual Information), which performs zero-shot image translation for cross-modality image segmentation. Our proposal is an unsupervised zero-shot learning framework capable of translating between previously unseen modalities.

Figure 1: Schematic diagram shows the LMI-guided diffusion for zero-shot cross-modal segmentation. The blue and orange contours are source and target distributions. The blue dot in the orange contour represents the target datapoint of the source datapoint (orange dot in the blue contour) in the source distribution. LMIDiffusion uses explicit statistical features (LMI) to navigate the next step (yellow dot), providing continuous guidance (yellow dot) from start to finish. In the end, the translated image can be segmented using arbitrary segmentation methods that were trained only on the target modality.

2 Method

2.1 Diffusion Model for Cross-modality Image Translation

Cross-modality image translation can be tackled by score-matching frameworks that generate target data (denoted $F$) from source data $G$. A perturbed source image $G$ is then used to condition the step-by-step diffusion process sdeit ; GANDIFF ; ozdenizci2022 ; EnergyGuided ; kawar2022denoising ; nonuni . In zero-shot image translation, however, data from the source domain is inaccessible during the learning phase. We posit instead that the local statistical attributes of the source and target modalities are similar. Maximizing Mutual Information (MI) has been validated as a potent strategy for equipping neural networks with the capability to model non-linear mappings hjelm2018learning . To capture those shared representations for image translation, we propose using MI to model the local statistical features in the denoising process.

2.2 Local-wise Mutual Information

To distill semantic information from the dataset for conditioning, it is necessary to translate the raw data into statistical features as encapsulated by MI. Given an image $X$, for a point $x_i \in X$ at position $i$, the local-wise statistical information at $i$ can be captured through the probability density function (PDF) $p_{\delta_{x_i}}(\cdot)$ of the neighborhood area $\delta_{x_i}$ of $x_i$.

Concerning other data points, represented as $x_j$, that reside within the neighborhood $\delta_{x_i}$ of $x_i$, the statistical features can be ascertained via the PDF $p_{\delta_{x_j}}(\cdot)$, with $j \in \delta_i$. The local-wise MI ($LMI$) from image $X$ to image $Y$ at point $x_i$ is defined through:

$$
LMI_{\delta}(x_i, y_j) = \sup \iint p_{\delta}(x,y) \log \frac{p_{\delta}(x,y)}{p_{\delta_{x_i}}(x)\, p_{\delta_{y_j}}(y)} \, dx\, dy, \quad \forall y_j \in \delta_{x_i}. \tag{1}
$$

We employ Eq. 1 as guidance to condition every training step of diffusion. This is realized by evaluating $LMI(x_0, y_t),\ t \in [0, T]$ throughout the training steps of the score network.

Property 1.

The upper bound of the $LMI$ from $X$ to $Y$ at location $i$ is: $LMI_{\delta}(x_i, y_i) \leq LMI_{\delta}(x_i, x_i)$, which is the optimum informative match between $X$ and $Y$ at point $x_i$.

Property 2.

The translation error for the LMIDiffusion generation satisfies $\lim_{\Delta LMI \to 0} \Delta \mathbb{E}\hat{F} = 0$.

In the appendix, we prove Properties 1 and 2. Property 1 underscores that $LMI$ attains statistical congruence under the condition $p_{\delta}(X) = p_{\delta}(Y),\ \forall \delta \in X$. Consequently, throughout the training iterations, the $LMI$ consistently peaks at the same location in $X_0$ and $X_t$. Yet, when $p_{\delta}(X) \neq p_{\delta}(Y)$, the $LMI$ achieves its local maximum at $j$, which is located in the neighborhood $\delta_{y_i}$ of $y_i$. The MI computation can be expedited by converting iterative mutual information computations into tensor-based operations. Such tensor operations can be accelerated further through techniques like memory duplication and parallel reduction when executed on a GPU cudac .
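To make the quantity in Eq. 1 concrete, the following is a minimal sketch of a histogram-based (plug-in) estimate of the mutual information between two neighborhoods centered at the same location. It is an illustration under stated assumptions, not the authors' implementation: the function names, the square neighborhood, the number of bins, and the intensity range are all hypothetical choices, and the supremum over $y_j$ in Eq. 1 is omitted.

```python
import math
from collections import Counter

def quantize(values, bins, lo, hi):
    """Map intensities in [lo, hi] to integer bin indices 0..bins-1."""
    width = (hi - lo) / bins
    return [min(bins - 1, max(0, int((v - lo) / width))) for v in values]

def patch(image, row, col, radius):
    """Flattened square neighborhood around (row, col), clipped at the border."""
    rows = range(max(0, row - radius), min(len(image), row + radius + 1))
    cols = range(max(0, col - radius), min(len(image[0]), col + radius + 1))
    return [image[r][c] for r in rows for c in cols]

def mutual_information(a, b):
    """Plug-in estimate of I(A;B) in nats from paired discrete samples."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    # p(u,v) * log(p(u,v) / (p(u) p(v))) with counts substituted for probabilities
    return sum((c / n) * math.log(c * n / (pa[u] * pb[v]))
               for (u, v), c in joint.items())

def local_mi(x, y, row, col, radius=1, bins=4, lo=0.0, hi=1.0):
    """Local MI between the neighborhoods of images x and y at (row, col)."""
    px = quantize(patch(x, row, col, radius), bins, lo, hi)
    py = quantize(patch(y, row, col, radius), bins, lo, hi)
    return mutual_information(px, py)
```

In this sketch, an image compared with a contrast-inverted copy of itself yields the same local MI as the image compared with itself, while a constant image yields zero, which matches the intuition that MI captures statistical dependence rather than raw intensity agreement and is consistent with the bound in Property 1.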

2.3 Conditioning the Diffusion through the LMI

During the training phase, we can guide the noise perturbation procedure by incorporating the $LMI$ between a data point, denoted as $F$, and its perturbed counterpart, $F_t$, into the score network, symbolized as $s_{\theta}$. This is realized by adapting the training objective of score matching as:

$$
\arg\min_{\theta} \mathbb{E}_{t \sim \mathcal{U}(0,T)} \Big\{ \mathbb{E}_{x_0 \sim p_0(x)} \mathbb{E}_{x_t \sim q_{\sigma_t}(x_t, x_0)} \Big[ \frac{1}{2} \Big| s_{\theta}(\hat{F}_t, LMI(F; F, F_t), t) - \frac{\partial \log p_{\sigma_t}(x_t | x_0)}{\partial x_t} \Big|^2 \Big] \Big\} \tag{2}
$$
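As a reading aid, the regression target in Eq. 2 has a closed form when the perturbation kernel is Gaussian. The block below assumes a variance-exploding kernel with noise scale $\sigma_t$; this is an assumption suggested by the form of the sampling SDE, not something the text states explicitly.

```latex
p_{\sigma_t}(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, x_0,\, \sigma_t^2 I\right)
\;\Longrightarrow\;
\frac{\partial \log p_{\sigma_t}(x_t \mid x_0)}{\partial x_t}
= -\frac{x_t - x_0}{\sigma_t^2}
= -\frac{\epsilon}{\sigma_t},
\qquad x_t = x_0 + \sigma_t \epsilon,\quad \epsilon \sim \mathcal{N}(0, I).
```

Under that assumption, the score network is trained to predict the scaled negative noise, with the $LMI$ term supplied as an additional input.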

During the sampling phase, we can integrate the suggested conditioning operator into the SDE:

$$
d\hat{F}_t = -\frac{d\sigma_t^2}{dt} s_{\theta}(\hat{F}_t, LMI(G; G, \hat{F}_t), t)\, dt + \sqrt{\frac{d\sigma_t^2}{dt}}\, d\mathbf{w}_t, \qquad t_{T \rightarrow 0} \in [0, T] \tag{3}
$$

and use a naive Euler-Maruyama solver to integrate the SDE. The sampled images can be segmented directly by the tool designed solely for the target domain image segmentation.
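The sampling loop can be sketched as follows for a single scalar value. This is a minimal illustration, assuming a geometric noise schedule $\sigma_t = \sigma_{\min}(\sigma_{\max}/\sigma_{\min})^t$ on $t \in [0, 1]$; the schedule, the step count, and the names `score_fn` and `guidance_fn` are hypothetical. `score_fn` stands in for the trained network $s_{\theta}$ and `guidance_fn` for the conditioning term $LMI(G; G, \hat{F}_t)$, which is recomputed at every step.

```python
import math
import random

def sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Geometric noise schedule on t in [0, 1] (an assumed choice)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def euler_maruyama_sample(score_fn, guidance_fn, n_steps=300, rng=None,
                          sigma_min=0.01, sigma_max=10.0):
    """Integrate the reverse-time SDE from t = 1 down to t = 0 for one scalar."""
    rng = rng or random.Random()
    dt = 1.0 / n_steps
    log_ratio = math.log(sigma_max / sigma_min)
    x = rng.gauss(0.0, sigma_max)            # start from the approximate prior
    for k in range(n_steps, 0, -1):
        s_t = sigma(k * dt, sigma_min, sigma_max)
        g2 = 2.0 * s_t ** 2 * log_ratio      # d(sigma_t^2)/dt for this schedule
        cond = guidance_fn(x)                # placeholder for LMI(G; G, F_hat_t)
        noise = rng.gauss(0.0, 1.0)
        # One reverse-time Euler-Maruyama step: drift along the score, plus diffusion
        x = x + g2 * score_fn(x, cond, s_t) * dt + math.sqrt(g2 * dt) * noise
    return x
```

With an analytic score for Gaussian data in place of the network, the sampler recovers the data mean and spread, which is a convenient sanity check before plugging in a learned model.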

3 Experiment and Result

Figure 2: Qualitative evaluation of different models’ translation results. The first two rows show target and original modality images, with close-ups of ROIs, followed by transformations from CycleGAN, StyleGAN, SDEdit, and LMIDiffusion. The subsequent row displays binarized segmentation results in the ROIs using a 3 clusters K-Means method for segmentation, trained solely on the target modality.

Dataset

We utilize the IXI dataset marcoixi to demonstrate the efficacy of the proposed cross-modality method for image segmentation. This dataset comprises 600 pre-aligned multi-modality images from healthy subjects. We undertake translation tasks between PD and T1w modalities. From a subset of the IXI dataset, a training set of 300 slices (drawn from 100 subjects) and a testing set of 75 slices (drawn from 25 subjects) were formed.

Experiment

We compare our LMIDiffusion (unsupervised) with a few-shot learning-based CycleGAN (supervised) translation model CycleGAN2017 , a GAN inversion-based approach styleGAN2 (unsupervised) and a diffusion-based method SDEdit sdeit (zero-shot supervised learning with SDE-based perturbing guidance). The CycleGAN (supervised) in this experiment is allowed to see both the full target domain dataset and a small group (about $11\%$ of the IXI dataset) of the source domain dataset. The second baseline is a GAN inversion-based approach (unsupervised). A StyleGAN2-ADA styleGAN2 is allowed to see the target domain training data. The out-domain guided generation is performed through 5000 steps of inversion optimization in the latent space of the trained StyleGAN2-ADA ganinversion . An extra network is introduced in the generation step for inversion. The SDEdit sdeit (unsupervised) uses a distribution perturbation guidance for diffusion. Our score network structure is the same as the UNet-like score matching network sdeit , which was optimized with an Adam optimizer at a learning rate of $3e-4$. The model was trained on an NVIDIA DGX Station with four Tesla V100 GPUs, over 300K iterations. For the segmentation backend, we employed the K-Means (5 clusters) algorithm, which was trained exclusively on the target modality. Subsequent segmentation was performed on the translation results obtained from the various methodologies mentioned above.
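The segmentation backend can be illustrated with a small sketch: cluster centers are fitted on target-modality intensities only, then reused unchanged to label a translated image. This is a plain-Python Lloyd's algorithm on scalar intensities, written for illustration; the function names and the evenly spread initialization are assumptions, and the experiments may well have used a library implementation.

```python
def fit_kmeans_1d(values, k, n_iter=50):
    """Lloyd's algorithm on scalar intensities; returns sorted cluster centers."""
    lo, hi = min(values), max(values)
    # Evenly spread initial centers over the intensity range (assumed init)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a cluster receives no points
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

def predict(centers, values):
    """Label each intensity with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values]

# Fit on target-modality intensities, then segment a translated image.
target_intensities = [0.0, 0.05, 0.1, 0.5, 0.55, 0.9, 0.95, 1.0]
centers = fit_kmeans_1d(target_intensities, k=3)
labels = predict(centers, [0.02, 0.6, 0.97])
```

Because the centers never see source-modality data, the quality of the labels on a translated image depends entirely on how closely the translation matches the target intensity distribution.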

Result

Fig. 2 shows both the translation and segmentation outcomes from various models. The bottom rows provide a clear illustration of segmentation results based on translated images. Notably, the segmentation derived from LMIDiffusion translations closely mirrors the segmentation ground truth. While direct segmentation using a method trained on the target image is not feasible for the source domain image, our translation facilitates this, yielding commendable segmentation results on the translated image. Our approach not only offers maximal resemblance to the translation target (PDw) but also retains the superior anatomical features of the original modality (T1w). Although the CycleGAN is trained with supervised (few-shot) data, it fails to produce high-quality images. This shortcoming can be attributed to GAN-based models’ difficulty in discerning relationships between source and target modalities when training data from both domains is limited. On the other hand, while SDEdit does produce images in the PDw domain, it compromises the anatomical features of the original domain, rendering it less effective for zero-shot image segmentation.

We present the Dice score, PSNR and SSIM values computed for the different models in Table 1. LMIDiffusion achieves the leading average Dice score of $0.88 \pm 0.05$. This is closely followed by the CycleGAN-based method with a score of $0.85 \pm 0.07$. The SDEdit technique demonstrates suboptimal performance with a Dice score of $0.82$. In contrast, the GAN-inversion method (StyleGAN) yields a result of $0.66$, emphasizing its challenges in cross-modality image translation. In terms of translation quality metrics, LMIDiffusion also leads, with a PSNR of $20.22$ and an SSIM of $0.69$. CycleGAN follows with a PSNR of $18.88$ and an SSIM of $0.27$, while SDEdit reports a PSNR of $15.96$ and an SSIM of $0.50$. The StyleGAN method, consistent with its lower Dice score, presents the lowest PSNR of $10.15$ and SSIM of $0.06$, further evidencing the limitation of the GAN-inversion approach for the zero-shot image translation task.
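For reference, two of the reported metrics have short standard definitions, sketched below on flattened inputs. The function names and the `data_range` default are illustrative assumptions; SSIM is omitted because it requires windowed statistics that do not fit a short example.

```python
import math

def dice(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 2.0 * inter / total if total else 1.0  # two empty masks agree fully

def psnr(image, reference, data_range=1.0):
    """Peak signal-to-noise ratio in dB for flat intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(data_range ** 2 / mse)
```

Dice measures overlap between the segmentation of the translated image and the reference segmentation, while PSNR compares the translated intensities with the reference image directly.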

Table 1: Quantitative evaluation of the translation performance for different methods.
| PDw → T1 | CycleGAN CycleGAN2017 | StyleGAN styleGAN2 | SDEdit sdeit | LMIDiffusion |
|---|---|---|---|---|
| Dice Score | 0.85 ± 0.07 | 0.66 ± 0.10 | 0.82 ± 0.06 | 0.88 ± 0.05 |
| PSNR | 18.88 ± 1.26 | 10.15 ± 1.49 | 15.96 ± 1.54 | 20.22 ± 1.43 |
| SSIM | 0.27 ± 0.04 | 0.06 ± 0.03 | 0.50 ± 0.06 | 0.69 ± 0.06 |

4 Conclusion

We propose a novel method for zero-shot cross-modality image translation with application in cross-modality image segmentation. In the challenging zero-shot learning setting, our method outperforms the compared GAN-based and diffusion-based methods in both translation quality and zero-shot segmentation performance. The results show that the proposed LMI-guided diffusion is a promising approach for zero-shot cross-modality image translation-based segmentation. The performance of the model can potentially be improved by introducing one-shot or few-shot learning to refine the LMI computation for diffusion guidance.

Appendix A

Property 1.
$$
\begin{align}
LMI_{\delta}(x_i, y_j) &= \sup \iint p_{\delta}(x,y) \log \frac{p_{\delta}(x,y)}{p_{\delta_{x_i}}(x)\, p_{\delta_{y_j}}(y)} \, dx\, dy \tag{4} \\
&= \sup \Biggl( \iint p_{\delta}(x,y) \log \frac{p_{\delta}(x,y)}{p_{\delta_{y_j}}(y)} \, dx\, dy - \iint p_{\delta}(x,y) \log p_{\delta_{x_i}}(x) \, dx\, dy \Biggr), \quad \forall y_j \in \delta_{x_i} \tag{5}
\end{align}
$$

Applying Bayes’ theorem we obtain:

$$
\begin{align}
&= \sup \Biggl( \int p_{\delta_{y_j}}(y) \int p_{\delta_{x_i | y_j}}(x|y) \log p_{\delta_{x_i | y_j}}(x|y) \, dx\, dy \tag{6} \\
&\qquad - \iint p_{\delta}(x,y) \log p_{\delta_{x_i}}(x) \, dx\, dy \Biggr), \quad \forall y_j \in \delta_{x_i} \tag{7} \\
&= \sup \bigl( h_{\delta_{x_i}}(x) - h_{\delta_{x_i | y_j}}(x|y) \bigr), \quad \forall y_j \in \delta_{x_i} \tag{8} \\
&\leq \sup \bigl( h_{\delta_{x_i}}(x) \bigr), \quad \forall y_j \in \delta_{x_i} \tag{9}
\end{align}
$$

where $h$ is the entropy, which measures the information content of the random variables. The following corollaries hold:

LMI bound of Forward SDE: If and only if $x_i = y_j$: $LMI_{\delta}(x_i, x_i) = h_{\delta_{x_i}}(x) - h_{\delta_{x_i | x_i}}(x|x) = h_{\delta_{x_i}}(x)$, which is the upper bound; therefore for the forward SDE: $LMI_{\delta}(x_i, y_j) \leq LMI_{\delta}(x_i, x_i),\ \forall y_j \in \delta_{x_i}$.

LMI bound of Reverse SDE: If $x_{i}\neq y_{j}$ and $x_{i}\neq z_{j}$: $LMI_{\delta}(x_{i},y_{j})=h_{\delta_{x_{i}}}(x)-h_{\delta_{x_{i}|y_{j}}}(x|y)$, then $LMI_{\delta}(x_{i},y_{j})$ is the supremum in the $\delta$ neighborhood, bounded by $LMI_{\delta}(x_{i},x_{i})$:

$$LMI_{\delta}(x_{i},z_{j})\leq LMI_{\delta}(x_{i},y_{j})<LMI_{\delta}(x_{i},x_{i}),\quad\forall y_{j},z_{j}\in\delta_{x_{i}}$$
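The two bounds above can be checked numerically with a simple stand-in estimator. The sketch below is not the LMI estimator of this work: it assumes a plug-in histogram estimate of $h(x)-h(x|y)$ on a single hypothetical $9\times 9$ patch, and the names `patch_mutual_information`, `bins`, and the noise level are illustrative choices. Under these assumptions, the self-information equals the patch entropy, and any perturbed neighbour scores lower.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution, skipping empty bins."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def patch_mutual_information(x, y, bins=16, value_range=(0.0, 1.0)):
    """Plug-in estimate of h(x) - h(x|y) from a joint intensity histogram.

    Returns the estimate together with the marginal entropy h(x).
    This is an illustrative stand-in, not the paper's LMI estimator.
    """
    joint, _, _ = np.histogram2d(
        x.ravel(), y.ravel(), bins=bins, range=[value_range, value_range]
    )
    joint = joint / joint.sum()
    h_x = entropy(joint.sum(axis=1))     # marginal entropy of x
    h_y = entropy(joint.sum(axis=0))     # marginal entropy of y
    h_xy = entropy(joint.ravel())        # joint entropy
    h_x_given_y = h_xy - h_y             # conditional entropy h(x|y)
    return h_x - h_x_given_y, h_x

rng = np.random.default_rng(0)
x = rng.random((9, 9))                                         # hypothetical patch around x_i
y = np.clip(x + 0.2 * rng.standard_normal(x.shape), 0.0, 1.0)  # perturbed neighbour y_j

mi_xx, h_x = patch_mutual_information(x, x)
mi_xy, _ = patch_mutual_information(x, y)
print(f"LMI(x, x) = {mi_xx:.4f}, h(x) = {h_x:.4f}, LMI(x, y) = {mi_xy:.4f}")
```

For the plug-in estimator the inequality holds by construction, since the estimate is the exact mutual information of the empirical joint distribution; it only illustrates the ordering and says nothing about estimator bias.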

Property 2.

For cross-modality data translation, the reverse SDE for $t_{T\rightarrow 0}\in[0,T]$ is:

$$d\hat{F}_{t}=-\frac{d\sigma_{t}^{2}}{dt}s_{\theta}(\hat{F}_{t},LMI(G;G,\hat{F}_{t}),t)\,dt+\sqrt{\frac{d\sigma_{t}^{2}}{dt}}\,d\mathbf{w}_{t}\tag{10}$$

Its solution is:

$$\hat{F}_{t}=\hat{F}_{0}-\int_{0}^{t}\frac{d\sigma_{s}^{2}}{ds}s_{\theta}(\hat{F}_{s},LMI(G;G,\hat{F}_{s}),s)\,ds+\int_{0}^{t}\sqrt{\frac{d\sigma_{s}^{2}}{ds}}\,d\mathbf{w}_{s}\tag{11}$$
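In practice, an SDE of the form of Eq. 10 is integrated numerically. The following is a minimal Euler–Maruyama sketch under several assumptions that are not taken from this work: a geometric noise schedule $\sigma_{t}=\sigma_{\min}(\sigma_{\max}/\sigma_{\min})^{t}$ on $[0,1]$, scalar samples, and a hypothetical closed-form `toy_score` that replaces the trained $s_{\theta}$ and treats the guidance as the mean of a Gaussian data distribution.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 10.0   # assumed noise range, not from the paper

def sigma2(t):
    """sigma_t^2 for the assumed geometric schedule."""
    return (SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t) ** 2

def dsigma2_dt(t):
    """Time derivative of sigma_t^2 for the assumed schedule."""
    return 2.0 * sigma2(t) * np.log(SIGMA_MAX / SIGMA_MIN)

def toy_score(f, guidance, t, data_std=0.5):
    """Hypothetical stand-in for s_theta(F_t, guidance, t).

    Exact score of N(guidance, data_std^2) after adding noise of variance sigma_t^2.
    """
    return -(f - guidance) / (data_std ** 2 + sigma2(t))

def reverse_sde_sample(guidance, n_samples=2000, n_steps=500, seed=0):
    """Euler-Maruyama integration of Eq. 10 from t = 1 down to t = 0."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    f = SIGMA_MAX * rng.standard_normal(n_samples)   # F_T drawn from the wide prior
    for k in range(n_steps, 0, -1):
        t = k * dt
        g2 = dsigma2_dt(t)
        noise = rng.standard_normal(n_samples)
        # Time runs backwards (dt < 0 in Eq. 10), so the drift enters with a plus sign.
        f = f + g2 * toy_score(f, guidance, t) * dt + np.sqrt(g2 * dt) * noise
    return f

samples = reverse_sde_sample(guidance=2.0)
print(f"mean = {samples.mean():.3f}, std = {samples.std():.3f}")
```

With this toy score the samples concentrate around the guidance value, which is the behaviour the guided reverse process is meant to have; a learned score network and image-valued $\hat{F}_{t}$ would replace the scalar quantities used here.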

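Taking the expectation of both sides of Eq. 11, where the stochastic integral has zero mean and $\hat{F}_{0}$ is treated as a fixed initial value: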

$$\begin{aligned}\mathbb{E}\hat{F}_{t}&=\mathbb{E}\hat{F}_{0}-\mathbb{E}\int_{0}^{t}\frac{d\sigma_{s}^{2}}{ds}s_{\theta}(\hat{F}_{s},LMI(G;G,\hat{F}_{s}),s)\,ds+\mathbb{E}\int_{0}^{t}\sqrt{\frac{d\sigma_{s}^{2}}{ds}}\,d\mathbf{w}_{s}\\&=\hat{F}_{0}-\int_{0}^{t}\mathbb{E}\frac{d\sigma_{s}^{2}}{ds}s_{\theta}(\hat{F}_{s},LMI(G;G,\hat{F}_{s}),s)\,ds\end{aligned}\tag{12}$$
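The step from the first to the second line of Eq. 12 relies on the stochastic integral $\int_{0}^{t}\sqrt{d\sigma_{s}^{2}/ds}\,d\mathbf{w}_{s}$ having zero mean. A short Monte Carlo check is sketched below, again assuming a geometric noise schedule (an illustrative choice, with hypothetical values for `SIGMA_MIN` and `SIGMA_MAX`). By the Itô isometry, the variance of the same integral should be close to $\sigma_{t}^{2}-\sigma_{0}^{2}$.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.1, 1.0     # assumed noise range, not from the paper

def dsigma2_dt(t):
    """Time derivative of sigma_t^2 for sigma_t = SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t."""
    sigma2 = (SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t) ** 2
    return 2.0 * sigma2 * np.log(SIGMA_MAX / SIGMA_MIN)

rng = np.random.default_rng(0)
n_paths, n_steps, t_end = 10_000, 200, 1.0
dt = t_end / n_steps
s = np.arange(n_steps) * dt                                   # left endpoints of each step
dw = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))    # Brownian increments

# Discretised stochastic integral of sqrt(d sigma_s^2 / ds) against dw, one value per path.
ito = (np.sqrt(dsigma2_dt(s)) * dw).sum(axis=1)

empirical_mean = ito.mean()
empirical_var = ito.var()
expected_var = SIGMA_MAX ** 2 - SIGMA_MIN ** 2                # sigma_t^2 - sigma_0^2 at t = 1
print(f"mean = {empirical_mean:.4f}, var = {empirical_var:.4f}, expected var = {expected_var:.4f}")
```

The empirical mean fluctuates around zero at the Monte Carlo rate, which is why only the drift term survives in the expectation.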

Thus, the expectation of the generation error is:

$$\Delta\mathbb{E}\hat{F}_{t}=-\int_{0}^{t}\frac{d\sigma_{s}^{2}}{ds}\mathbb{E}\Bigl[s_{\theta}(\hat{F}_{s},LMI(G;G,\hat{F}_{s}),s)-s_{\theta}(\hat{F}_{s},LMI(F;F,\hat{F}_{s}),s)\Bigr]ds\tag{13}$$

Eq. 13 shows that the expectation of the generation error is the accumulated expectation of the conditioning error of the score model $s_{\theta}$ between the training and testing steps.

Assume, without loss of generality, that the score model $s_{\theta}$ satisfies additivity and homogeneity with respect to the guidance $LMI$:

$$s_{\theta}(\hat{F}_{t},LMI(\cdot,t),t)=\alpha(\hat{F}_{t},t)\,LMI(\cdot,t)+\beta(\hat{F}_{t},t)\tag{14}$$

Substituting Eq. 14 into Eq. 13:

\begin{align}
\Delta\mathbb{E}\hat{F}_{t}&=-\int_{0}^{t}\frac{d\sigma_{s}^{2}}{ds}\mathbb{E}\Bigl[\alpha(\hat{F}_{s},s)LMI(G;G,\hat{F}_{s})+\beta(\hat{F}_{s},s)-\tag{15}\\
&\qquad\alpha(\hat{F}_{s},s)LMI(F;F,\hat{F}_{s})-\beta(\hat{F}_{s},s)\Bigr]ds\tag{16}\\
&=-\int_{0}^{t}\frac{d\sigma_{s}^{2}}{ds}\alpha(\hat{F}_{s},s)\mathbb{E}\Bigl[LMI(G;G,\hat{F}_{s})-LMI(F;F,\hat{F}_{s})\Bigr]ds\tag{17}
\end{align}

Denoting $\Delta LMI=LMI(G;G,\hat{F}_{s})-LMI(F;F,\hat{F}_{s})$, the error in the expectation of the $LMI$-guided generation is:

$$\Delta\mathbb{E}\hat{F}_{t}=-\int_{0}^{t}\frac{d\sigma_{s}^{2}}{ds}\alpha(\hat{F}_{s},s)\,\mathbb{E}\bigl[\Delta LMI\bigr]\,ds\tag{18}$$

Thus, $\underset{\Delta LMI\to 0}{\lim}\Delta\mathbb{E}\hat{F}_{t}=0$.
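The limit above can be illustrated numerically. The sketch below evaluates the integral of Eq. 18 with the trapezoidal rule under simplifying assumptions that are not part of the derivation: $\Delta LMI$ is held constant along the trajectory, $\alpha$ depends on time only (the hypothetical `alpha` function), and the noise schedule is geometric. Under these assumptions the expected error is linear in $\Delta LMI$ and vanishes with it.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.1, 1.0     # assumed noise range, not from the paper

def dsigma2_dt(s):
    """Time derivative of sigma_s^2 for sigma_s = SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** s."""
    sigma2 = (SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** s) ** 2
    return 2.0 * sigma2 * np.log(SIGMA_MAX / SIGMA_MIN)

def alpha(s):
    """Hypothetical time-only gain; in Eq. 14 alpha also depends on F_s."""
    return 1.0 + 0.5 * s

def expected_error(delta_lmi, t=1.0, n_points=2001):
    """Trapezoidal evaluation of Eq. 18 for a constant Delta LMI."""
    s = np.linspace(0.0, t, n_points)
    integrand = dsigma2_dt(s) * alpha(s) * delta_lmi
    ds = s[1] - s[0]
    integral = ds * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return -integral

errors = {d: expected_error(d) for d in (1.0, 0.1, 0.01, 0.0)}
for d, e in errors.items():
    print(f"Delta LMI = {d:<5} -> expected error = {e:.6f}")
```

Shrinking $\Delta LMI$ by a factor of ten shrinks the expected error by the same factor, so matching the guidance statistics of the unseen modality to those seen in training directly controls the generation bias in this simplified setting.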
