Segmentation-Guided Knee Radiograph Generation using Conditional Diffusion Models

Siyuan Mei, Fuxin Fan, Fabian Wagner, Mareike Thies, Mingxuan Gu, Yipeng Sun, and Andreas Maier

All authors are with the Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany. Contact E-mail: [email protected].

Abstract

Deep learning-based medical image processing algorithms require representative data during development. In particular, surgical data might be difficult to obtain, and high-quality public datasets are limited. To overcome this limitation and augment datasets, a widely adopted solution is the generation of synthetic images. In this work, we employ conditional diffusion models to generate knee radiographs from contour and bone segmentations. Two distinct strategies are presented that incorporate the segmentation as a condition into the sampling process and the training process, respectively: conditional sampling and conditional training. The results demonstrate that both methods can generate realistic images while adhering to the conditioning segmentation. The conditional training method outperforms the conditional sampling method and the conventional U-Net.

Index Terms:
Radiograph synthesis, diffusion models, conditional image generation.

I Introduction

Radiography is one of the most commonly used medical imaging techniques for diagnosis and surgical interventions, capturing 2D projection images of patients using X-rays. The advent of deep learning-based techniques has enabled automated and precise processing of X-ray images, including organ segmentation [1], motion compensation [2], and denoising [3]. However, many of these methods rely heavily on extensive training data, which is challenging to collect at scale [4]. In particular, recent research demands atypical data, such as weight-bearing imaging of knees [5, 6].

To address this challenge, recent works proposed synthesizing simulated data as a substitute for clinical data [7]. Traditional forward projection methods create digitally reconstructed radiographs (DRRs) from 3D computed tomography (CT) volumes using Radon transform [8], which guarantees geometrical accuracy. However, such methods require original volumetric CT scans, and sophisticated forward models are needed to capture all properties of realistic-looking X-rays (e.g., energy-dependent effects, noise, scatter, etc.). In contrast, deep generative models leverage 2D X-ray datasets to generate radiographs. For instance, Weber et al. [9] employed generative adversarial network (GAN)-based models to augment chest X-ray images.

Recently, diffusion models have emerged as a powerful technique for data generation, demonstrating competitive performance compared to GANs [10]. In addition, diffusion models are also successfully applied in conditional radiograph generation, e.g., projection inpainting conditioned on masked projection [11] and class-conditional chest radiograph synthesis [12]. Despite achieving convincing performance, previous work has mainly focused on generating radiographs under specific conditions. In this work, we focus on knee imaging and extend the conditional generation of knee radiographs to include more general conditions, such as simple segmentation with contour and bone information. To learn the conditional distribution given the segmentation, we propose two distinct pipelines of conditional diffusion models that incorporate conditional images into the sampling and training processes, respectively.

II Methods and Materials

II-A Method of Conditional Sampling

Figure 1: Overview of the sampling processes. (A) illustrates the sampling process of the conditional sampling method; (B) illustrates the sampling process of the conditional training method.

Diffusion models employ a forward diffusion process and a reverse diffusion process for image generation. In the forward process, data points gradually diffuse into random Gaussian noise over time, implying the transition from a complex to a simple data distribution. Conversely, the reverse diffusion process generates new data samples by progressively removing noise, starting from a Gaussian prior. Elegantly, the forward perturbation can be modeled as a stochastic differential equation (SDE), which is tractable throughout the reverse process [13].

Let $\mathbf{x}_0 \in \mathbb{R}^d$ denote the data sample, $\mathbf{x}_t$ denote the perturbed data at time point $t \in (0, 1]$, and $p(\mathbf{x}_t)$ denote the corresponding probability density function of $\mathbf{x}_t$. The forward SDE is defined by

$${\rm d}\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)\,{\rm d}t + g(t)\,\mathbf{z}_t, \quad t: 0 \rightarrow 1, \tag{1}$$

where $\mathbf{f}(\mathbf{x}_t, t) \in \mathbb{R}^d$ and $g(t) \in \mathbb{R}$ are the drift and diffusion coefficients of $\mathbf{x}_t$, and $\mathbf{z}_t \sim \mathcal{N}(0, \mathbf{I})$ defines a random variable drawn from a standard normal distribution approximating the standard Wiener process. The corresponding reverse-time SDE has the form

$${\rm d}\overline{\mathbf{x}}_t = \left[\mathbf{f}(\mathbf{x}_t, t) - g(t)^2 \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)\right]{\rm d}t + g(t)\,\mathbf{z}_t, \quad t: 1 \rightarrow 0, \tag{2}$$

where $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$ represents the gradient of the logarithmic probability density, also known as the score function.

In this paper, we adopt the variance exploding (VE) SDE setting [13], defining the growing standard deviation of the noise as

$$\sigma_t = \sigma_{min}\left(\frac{\sigma_{max}}{\sigma_{min}}\right)^{t}, \tag{3}$$

which determines the drift and diffusion coefficients in (1) and (2) as

$$\mathbf{f}(\mathbf{x}_t, t) = 0, \quad g(t) = \sigma_t \sqrt{2 \log \frac{\sigma_{max}}{\sigma_{min}}}. \tag{4}$$
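The schedule in (3) and the diffusion coefficient in (4) can be evaluated directly. The following is a minimal sketch, not the authors' implementation; the function names `sigma` and `diffusion_coeff` are hypothetical, and the values of $\sigma_{min}$ and $\sigma_{max}$ are the ones reported in Section II-D. The last lines check numerically that $g(t)^2$ equals the time derivative of $\sigma_t^2$, which is the relation that links (3) and (4).

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of Eq. (3) at time t in [0, 1]."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** np.asarray(t, dtype=float)


def diffusion_coeff(t):
    """Diffusion coefficient g(t) of Eq. (4)."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


# Numerical check at an arbitrary time point: g(t)^2 should match d(sigma_t^2)/dt.
t, h = 0.4, 1e-6
finite_difference = (sigma(t + h) ** 2 - sigma(t - h) ** 2) / (2.0 * h)
g_squared = diffusion_coeff(t) ** 2
```

By construction, `sigma(0.0)` returns $\sigma_{min}$ and `sigma(1.0)` returns $\sigma_{max}$, so the noise level grows geometrically over the forward process.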

Crucially, the last unknown term $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$ in (2) can be estimated by training a time-conditional neural network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_t, t)$. The optimal parameter $\bm{\theta}^{\ast}$ is obtained by minimizing denoising score matching [14]

$$\begin{aligned}
\bm{\theta}^{\ast} &= \mathop{\rm argmin}_{\bm{\theta}} \mathbb{E}_{t \sim U[0,1]}\left[\lVert \mathbf{s}_{\bm{\theta}}(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) \rVert_2^2\right] \\
&\approx \mathop{\rm argmin}_{\bm{\theta}} \mathbb{E}_{t \sim U[0,1]}\left[\lVert \mathbf{s}_{\bm{\theta}}(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | \mathbf{x}_0) \rVert_2^2\right],
\end{aligned} \tag{5}$$

where $p(\mathbf{x}_t | \mathbf{x}_0)$ is the perturbation kernel of the VE SDE.
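To make the objective in (5) concrete, the sketch below estimates it by Monte Carlo sampling for an arbitrary score function. It is an illustration under stated assumptions rather than the training code of this work: the perturbation kernel is taken as $\mathcal{N}(\mathbf{x}_0, \sigma_t^2\mathbf{I})$, which is consistent with (6) and whose score is $-(\mathbf{x}_t - \mathbf{x}_0)/\sigma_t^2$; the lower bound $10^{-5}$ on $t$ and the names `dsm_loss` and `score_fn` are hypothetical. Practical implementations often weight the loss by $\sigma_t^2$, which is omitted here to stay close to (5).

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def dsm_loss(score_fn, x0, rng):
    """Monte Carlo estimate of the denoising score matching objective in Eq. (5).

    score_fn(x_t, t) must return an array shaped like x_t; t has shape (batch,).
    x0 has shape (batch, ...).
    """
    batch = x0.shape[0]
    t = rng.uniform(1e-5, 1.0, size=batch)           # assumed lower bound avoids t = 0
    std = sigma(t).reshape((batch,) + (1,) * (x0.ndim - 1))
    z = rng.standard_normal(x0.shape)
    x_t = x0 + std * z                                # sample from the assumed kernel
    target = -(x_t - x0) / std ** 2                   # score of N(x0, std^2 I) at x_t
    residual = score_fn(x_t, t) - target
    return float(np.mean(np.sum(residual.reshape(batch, -1) ** 2, axis=1)))
```

For a data set that consists of a single image, the kernel score coincides with the true score, so a perfect score function drives this estimate to zero; for real data the minimum is positive but is still attained by the true score [14].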

With segmentation as known information (denoted as $\mathbf{y}$), the conditional sampling method (CSM) follows the SDEdit algorithm [15] to synthesize knee radiographs. As depicted in Fig. 1 (A), CSM commences with a perturbed leg contour segmentation, and realistic details are generated through iterative denoising while retaining the desired shape. Notably, the initial perturbing noise is appropriately reduced to preserve conditional information by setting the starting time point $t_0$ of the reverse diffusion process smaller than 1. Therefore, the prior sample becomes

$$\mathbf{y}_{t_0} \sim \mathcal{N}(\mathbf{y}, \sigma_{t_0}^2 \mathbf{I}). \tag{6}$$

The pipeline of CSM is described in Algorithm 1.

Algorithm 1 Sampling algorithm of CSM
1: $N$ (number of sampling steps), $\mathbf{y}$ (segmentation guide), $t_0$ (starting time of the reverse diffusion process)
2: $\mathbf{x}_N \sim \mathbf{y} + \mathcal{N}(0, \sigma_{t_0}^2 \mathbf{I})$
3: for $n = N-1$ to $0$ do
4:     $\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})$
5:     $t_n \leftarrow \frac{n}{N} t_0$
6:     $\Delta\overline{\mathbf{x}}_n \leftarrow g(t_n)^2\, \mathbf{s}_{\bm{\theta}^{\ast}}(\mathbf{x}_n, t_n) + g(t_n)\,\mathbf{z}$
7:     $\mathbf{x}_n \leftarrow \mathbf{x}_{n+1} + \Delta\overline{\mathbf{x}}_n$
8: end for
9: return $\mathbf{x}_0$
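A possible Python rendering of Algorithm 1 is sketched below. It is not the authors' code and rests on several assumptions that are not part of the algorithm as printed: an explicit Euler-Maruyama step size $\Delta t = t_0/N$ scales the drift by $\Delta t$ and the noise by $\sqrt{\Delta t}$, the score is evaluated at the current iterate, and the trained network is replaced by a toy analytic score whose data distribution is a single constant image. The names `csm_sample`, `toy_score`, and `target` are hypothetical.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def g(t):
    """Diffusion coefficient of Eq. (4)."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


def csm_sample(score_fn, y, t0=0.4, n_steps=500, rng=None):
    """Reverse diffusion started from the perturbed segmentation y, Eq. (6)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = t0 / n_steps                                   # assumed step size
    x = y + sigma(t0) * rng.standard_normal(y.shape)    # prior sample at time t0
    for n in range(n_steps - 1, -1, -1):
        t_n = n / n_steps * t0
        z = rng.standard_normal(y.shape)
        # Euler-Maruyama update of the reverse-time SDE (2) with f = 0.
        x = x + g(t_n) ** 2 * score_fn(x, t_n) * dt + g(t_n) * np.sqrt(dt) * z
    return x


# Toy stand-in for the trained score network: the data distribution is a
# single constant image, so the score is available in closed form.
target = np.full((8, 8), 0.7)
toy_score = lambda x, t: -(x - target) / sigma(t) ** 2

y = np.full((8, 8), 0.5)  # stand-in "segmentation"
sample = csm_sample(toy_score, y, rng=np.random.default_rng(0))
```

In this toy setting the sample is pulled from the perturbed segmentation towards the single data image. With a trained network, the choice of $t_0$ trades off faithfulness to $\mathbf{y}$ against realism, as in SDEdit [15].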

II-B Method of Conditional Training

An alternative to the CSM is integrating the conditions into the training process, thereby directly estimating the score function of the conditional distribution [16]. We concatenate the condition and the perturbed image along the channel dimension as network input. The structure of the score-based network is detailed in section II-D. Surprisingly, this conditional score network can be trained following the same form of denoising score matching [17], which is

$$\begin{aligned}
\bm{\theta}^{\ast} &= \mathop{\rm argmin}_{\bm{\theta}} \mathbb{E}_{t \sim U[0,1]}\left[\lVert \mathbf{s}_{\bm{\theta}}(\mathbf{x}_t, \mathbf{y}, t) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | \mathbf{y}) \rVert_2^2\right] \\
&\approx \mathop{\rm argmin}_{\bm{\theta}} \mathbb{E}_{t \sim U[0,1]}\left[\lVert \mathbf{s}_{\bm{\theta}}(\mathbf{x}_t, \mathbf{y}, t) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | \mathbf{x}_0) \rVert_2^2\right].
\end{aligned} \tag{7}$$

After training the conditioned network, the score function in (2) can be directly substituted by $\mathbf{s}_{\bm{\theta}^{\ast}}(\mathbf{x}_t, \mathbf{y}, t)$ for the provided conditional image in each sampling step. As shown in Fig. 1 (B), the generated sample is controlled even though the sampling process starts from random Gaussian noise. Algorithm 2 outlines the sampling process of the conditional training method (CTM), where the modifications with respect to Algorithm 1 are the prior sample (line 2), the time point (line 5), and the conditioned score network (line 6).

Algorithm 2 Sampling algorithm of CTM
1: $N$ (number of sampling steps), $\mathbf{y}$ (segmentation guide)
2: $\mathbf{x}_N \sim \mathcal{N}(0, \sigma_{max}^2 \mathbf{I})$
3: for $n = N-1$ to $0$ do
4:     $\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})$
5:     $t_n \leftarrow \frac{n}{N}$
6:     $\Delta\overline{\mathbf{x}}_n \leftarrow g(t_n)^2\, \mathbf{s}_{\bm{\theta}^{\ast}}(\mathbf{x}_n, \mathbf{y}, t_n) + g(t_n)\,\mathbf{z}$
7:     $\mathbf{x}_n \leftarrow \mathbf{x}_{n+1} + \Delta\overline{\mathbf{x}}_n$
8: end for
9: return $\mathbf{x}_0$
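The sketch below illustrates how Algorithm 2 could look in Python, with emphasis on the channel-wise concatenation of the perturbed image and the condition. As before, this is an illustration under assumptions, not the implementation used in this work: the step size $\Delta t = 1/N$ and its square root are made explicit, the score is evaluated at the current iterate, and `toy_score_net` is a hypothetical closed-form stand-in in which the "radiograph" that belongs to a condition is the condition itself.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def g(t):
    """Diffusion coefficient of Eq. (4)."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


def ctm_sample(score_net, y, n_steps=500, rng=None):
    """Reverse diffusion from pure noise, guided by y at every step."""
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / n_steps                                 # assumed step size
    x = SIGMA_MAX * rng.standard_normal(y.shape)       # prior N(0, sigma_max^2 I)
    for n in range(n_steps - 1, -1, -1):
        t_n = n / n_steps
        z = rng.standard_normal(y.shape)
        net_in = np.stack([x, y], axis=0)              # channels: (perturbed image, condition)
        x = x + g(t_n) ** 2 * score_net(net_in, t_n) * dt + g(t_n) * np.sqrt(dt) * z
    return x


def toy_score_net(net_in, t):
    """Closed-form conditional score for a toy model where the image equals y."""
    x, y = net_in                                      # split the two input channels
    return -(x - y) / sigma(t) ** 2


y = np.zeros((8, 8))
y[2:6, 2:6] = 0.5                                      # stand-in contour segmentation
sample = ctm_sample(toy_score_net, y, rng=np.random.default_rng(0))
```

Unlike in the CSM sketch, the condition enters every score evaluation, so the sample follows $\mathbf{y}$ even though the prior carries no information about it.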

II-C Dataset Preparation

Figure 2: Structure of noise-conditional score network. (A) represents the network input of CSM; (B) illustrates the network input of CTM.

To obtain knee radiographs, we selected 55 leg CT volumes from the public SICAS medical image repository [18]. Each CT volume was forward projected over $360^{\circ}$ with an angular increment of $6^{\circ}$ onto a detector of size $256 \times 256$ using CONRAD [19], resulting in 60 DRRs per volume. All projections were normalized to the range $[0, 1]$.

In addition, two different segmentations for each DRR were automatically generated as follows. The first segmentation extracted the leg contour using a threshold of 0.1, having a value of 0.5 for the contour and 0 for the background (refer to (a0)-(c0) in Fig. 3). The second segmentation extracted bones by thresholding the original CT volume, followed by forward projecting the bone. The bone projection was set to 0.5, and adding it to the contour segmentation resulted in the second segmentation, with bones having a value of 1 and contour a value of 0.5 (refer to (d0)-(f0) in Fig. 3). In total, 3300 radiographs were generated for each type of segmentation. They were randomly split into training, validation, and test sets at a ratio of 9:1:1.
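The two conditions described above can be sketched in a few lines. This is an illustrative reconstruction, not the preprocessing code of this work: the strict comparison `>` against the threshold, the binarization of the bone projection by `> 0`, and the function names are assumptions. The sketch also reproduces the data set arithmetic ($360^{\circ}/6^{\circ} = 60$ views, $55 \times 60 = 3300$ images) and the split sizes implied by the 9:1:1 ratio.

```python
import numpy as np


def contour_segmentation(drr, threshold=0.1):
    """Leg contour mask: 0.5 inside the leg, 0 in the background."""
    return np.where(drr > threshold, 0.5, 0.0)


def contour_bone_segmentation(drr, bone_projection, threshold=0.1):
    """Contour mask plus binarized bone projection (0.5), so bone pixels become 1."""
    bone = np.where(bone_projection > 0, 0.5, 0.0)     # assumed binarization rule
    return contour_segmentation(drr, threshold) + bone


# Tiny stand-in DRR (normalized to [0, 1]) and forward-projected bone volume.
drr = np.array([[0.0, 0.05, 0.3],
                [0.0, 0.60, 0.9],
                [0.0, 0.40, 0.2]])
bone_projection = np.array([[0.0, 0.0, 0.0],
                            [0.0, 1.2, 2.0],
                            [0.0, 0.0, 0.0]])

contour = contour_segmentation(drr)
contour_bone = contour_bone_segmentation(drr, bone_projection)

# Data set arithmetic from the text.
views_per_volume = 360 // 6                            # 60 DRRs per volume
total_images = 55 * views_per_volume                   # 3300 images
n_train, n_val, n_test = (total_images * r // 11 for r in (9, 1, 1))
```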

II-D Network Structure and Hyperparameters

The backbone of the neural network is the noise-conditional score network++ (NCSNpp) [13]. As illustrated in Fig. 2, we configure six resolution levels of $(256, 128, 64, 32, 16, 8)$ with corresponding channel counts of $(64, 128, 128, 128, 128, 256)$. Furthermore, the time-dependent noise scale $\sigma_t$ is encoded with random Gaussian Fourier features [20] and embedded into all residual blocks. Importantly, for CSM only the perturbed X-ray images are input to the network. For CTM, the segmentations are additionally concatenated to the input. To compare the diffusion models with naive image-to-image models, the noise-encoding module is removed to form an improved U-Net model, which is then trained using the $L_1$ loss.
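A random Gaussian Fourier feature encoding of the noise scale, in the spirit of [20], might look as follows. The embedding dimension of 128, the frequency scale of 16, the use of $\log \sigma_t$ as the encoded quantity, and the class name are assumptions for illustration; they are not specified in this work.

```python
import numpy as np


class GaussianFourierEmbedding:
    """Maps a scalar noise level to sin/cos features with fixed random frequencies."""

    def __init__(self, embed_dim=128, scale=16.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frequencies are drawn once and kept fixed (they are not trained).
        self.freqs = rng.standard_normal(embed_dim // 2) * scale

    def __call__(self, sigma_t):
        log_sigma = np.log(np.atleast_1d(sigma_t))[:, None]   # shape (batch, 1)
        proj = 2.0 * np.pi * log_sigma * self.freqs[None, :]  # shape (batch, embed_dim / 2)
        return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)


embed = GaussianFourierEmbedding()
features = embed(np.array([0.01, 1.0, 128.0]))                # shape (3, 128)
```

The resulting feature vector would typically be passed through a small MLP and added to the activations of each residual block, which is how the network can distinguish noise levels.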

In our experiments, the parameters $\sigma_{min}$ and $\sigma_{max}$ are set to 0.01 and 128, respectively. We use a batch size of 16 and the Adam optimizer with a learning rate of $2 \times 10^{-4}$ for training. During the sampling process, a Langevin dynamics corrector with a signal-to-noise ratio of 0.4 is applied after the reverse SDE update at each sampling step, and the number of sampling steps $N$ is chosen as 500 to improve sampling speed. Moreover, the hyperparameter $t_0$ for CSM is set to 0.4. All models are trained on four Nvidia A100 GPUs on a single cluster node with a cap of 300 epochs.
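The role of the signal-to-noise ratio in the corrector can be illustrated with the step-size rule of the predictor-corrector sampler in [13] for the VE SDE, where the step size is chosen such that the norm of the drift term equals the signal-to-noise ratio times the norm of the injected noise. The sketch below assumes this rule; whether the present work uses exactly this variant is not stated, and the function names and the toy score are hypothetical.

```python
import numpy as np


def corrector_step_size(score, z, snr=0.4):
    """Step size for which norm(eps * score) / norm(sqrt(2 * eps) * z) equals snr."""
    return 2.0 * (snr * np.linalg.norm(z) / np.linalg.norm(score)) ** 2


def langevin_corrector(x, score_fn, t, snr=0.4, rng=None):
    """One Langevin dynamics corrector step at a fixed time point t."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(x.shape)
    score = score_fn(x, t)
    eps = corrector_step_size(score, z, snr)
    return x + eps * score + np.sqrt(2.0 * eps) * z


# Toy usage with a closed-form score of N(target, sigma_t^2 I).
sigma_t = 0.5
target = np.zeros((16, 16))
toy_score = lambda x, t: -(x - target) / sigma_t ** 2

x = target + sigma_t * np.random.default_rng(0).standard_normal((16, 16))
x_corrected = langevin_corrector(x, toy_score, t=0.4, rng=np.random.default_rng(1))
```

A larger signal-to-noise ratio yields larger corrector steps along the score direction relative to the injected noise.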

Figure 3: The generated samples under different conditions. Columns (0)-(4): condition, U-Net, CSM, CTM, and label. Rows (a)-(c): contour segmentation. Rows (d)-(f): contour and bone segmentation.

III Results and Discussion

In this section, we provide qualitative and quantitative results of the two proposed diffusion-based methods and compare them with the baseline U-Net model. The first column of Fig. 3 showcases six randomly selected conditions: (a0)-(c0) show contour segmentations, and (d0)-(f0) show segmentations containing contour and bones. In Fig. 3 (a1)-(f1), the images generated by the U-Net maintain the given shape but contain blurred fine details in locations where bones overlap, as highlighted by the red circle. In contrast, the results from CSM appear more realistic than those of the U-Net. However, their quality decreases when additional constraints are introduced, as indicated by the red arrow in Fig. 3 (d2) and (f2). The results from CTM not only achieve nearly the same level of detail as the labels but also agree with the given conditions, as illustrated in the fourth column.

| Model | Metric | Condition: Contour | Condition: Contour+bone |
|-------|--------|--------------------|-------------------------|
| U-Net | MAE | $0.0209 \pm 0.007$ | $0.0188 \pm 0.006$ |
| U-Net | PSNR (dB) | $29.188 \pm 2.22$ | $30.304 \pm 2.45$ |
| CSM | MAE | $0.0395 \pm 0.010$ | $0.0507 \pm 0.010$ |
| CSM | PSNR (dB) | $22.911 \pm 1.89$ | $21.350 \pm 1.49$ |
| CTM | MAE | $\mathbf{0.0193 \pm 0.005}$ | $\mathbf{0.0152 \pm 0.007}$ |
| CTM | PSNR (dB) | $\mathbf{29.498 \pm 1.91}$ | $\mathbf{31.680 \pm 1.76}$ |

TABLE I: Quantitative model comparison.

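Table I summarizes the quantitative results averaged across all testing data. The evaluation metrics are the mean absolute error (MAE) and the peak signal-to-noise ratio (PSNR). CTM achieves the best MAE and PSNR under both segmentation-based conditions. Its margin over CSM is large (e.g., 31.680 dB versus 21.350 dB PSNR for the contour+bone condition), whereas its margin over the U-Net is smaller and lies within the reported spread. CSM performs worse than the U-Net on both metrics.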
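For reference, the two metrics can be computed as in the following sketch. It assumes images normalized to $[0, 1]$, so that the peak value in the PSNR is 1; the function names are hypothetical, and the exact evaluation code of this work is not given.

```python
import numpy as np


def mae(pred, label):
    """Mean absolute error between a generated image and its label."""
    return float(np.mean(np.abs(pred - label)))


def psnr(pred, label, data_range=1.0):
    """Peak signal-to-noise ratio in dB; undefined (infinite) for identical images."""
    mse = np.mean((pred - label) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Toy example: a constant error of 0.1 on a [0, 1] image.
label = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
example_mae = mae(pred, label)
example_psnr = psnr(pred, label)
```

Both metrics compare a single generated image with a single label pixel by pixel. They therefore reward agreement with the label and do not measure the realism of fine details.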

Unlike the U-Net, which learns a mapping function between input and output, diffusion models implicitly capture the underlying data distribution from the training data and then sample from it, preventing the loss of fine details at the pixel level. However, in CSM, the condition is incorporated only at the first sampling step and only in perturbed form, which results in imprecise conditional information. In contrast, CTM provides an estimated score function of the conditional distribution at each sampling step, accommodating both reliability and realism. Nonetheless, the generated X-ray images currently encompass only independent 2D conditional information, which may introduce geometric inconsistencies between a set of projections. Future research will focus on modeling 3D probabilistic distributions with the provided 2D conditions to enable CT reconstruction from the generated projections. In addition, clinical datasets will also be incorporated.

IV Conclusion

In this work, we explored two different pipelines of diffusion models to generate segmentation-conditioned knee X-ray data. The results demonstrate that both methods can generate realistic radiographs under the given conditions, with the method of conditional training achieving more stable performance. Ultimately, these high-quality synthetic medical images have the potential to benefit the development of data-driven research and educational applications in the medical field.

Acknowledgment

This work was supported by the European Research Council (ERC Grant No. 810316). The authors gratefully acknowledge the HPC resources provided by NHR@FAU using hardware funded by the German Research Foundation (DFG).

References

  • [1] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert et al., “nnu-net: Self-adapting framework for u-net-based medical image segmentation,” arXiv preprint arXiv:1809.10486, 2018.
  • [2] M. Thies, F. Wagner, N. Maul, L. Folle, M. Meier, M. Rohleder, L.-S. Schneider, L. Pfaff, M. Gu, J. Utz et al., “Gradient-based geometry learning for fan-beam ct reconstruction,” Physics in Medicine & Biology, vol. 68, no. 20, p. 205004, 2023.
  • [3] F. Wagner, M. Thies, M. Gu, Y. Huang, S. Pechmann, M. Patwari, S. Ploner, O. Aust, S. Uderhardt, G. Schett et al., “Ultralow-parameter denoising: Trainable bilateral filter layers in computed tomography,” Medical Physics, vol. 49, no. 8, pp. 5107–5120, 2022.
  • [4] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
  • [5] B. Bier, K. Aschoff, C. Syben, M. Unberath, M. Levenston, G. Gold, R. Fahrig, and A. Maier, “Detecting anatomical landmarks for motion estimation in weight-bearing imaging of knees,” in Machine Learning for Medical Image Reconstruction: First International Workshop, MLMIR 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 1.   Springer, 2018, pp. 83–90.
  • [6] J.-H. Choi, R. Fahrig, A. Keil, T. F. Besier, S. Pal, E. J. McWalter, G. S. Beaupré, and A. Maier, “Fiducial marker-based correction for involuntary motion in weight-bearing c-arm ct scanning of knees. part i. numerical model-based optimization,” Medical physics, vol. 40, no. 9, p. 091905, 2013.
  • [7] C. Gao, B. D. Killeen, Y. Hu, R. B. Grupp, R. H. Taylor, M. Armand, and M. Unberath, “Synthetic data accelerates the development of generalizable learning-based algorithms for x-ray image analysis,” Nature Machine Intelligence, vol. 5, no. 3, pp. 294–308, 2023.
  • [8] A. C. Kak and M. Slaney, Principles of computerized tomographic imaging.   SIAM, 2001.
  • [9] T. Weber, M. Ingrisch, B. Bischl, and D. Rügamer, “Implicit embeddings via gan inversion for high resolution chest radiographs,” in MICCAI Workshop on Medical Applications with Disentanglements.   Springer, 2022, pp. 22–32.
  • [10] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
  • [11] S. Mei, F. Fan, and A. Maier, “Metal inpainting in cbct projections using score-based generative model,” in 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI).   IEEE, 2023, pp. 1–5.
  • [12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [13] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
  • [14] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011.
  • [15] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” arXiv preprint arXiv:2108.01073, 2021.
  • [16] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713–4726, 2022.
  • [17] G. Batzolis, J. Stanczuk, C.-B. Schönlieb, and C. Etmann, “Conditional image generation with score-based diffusion models,” arXiv preprint arXiv:2111.13606, 2021.
  • [18] M. Kistler, S. Bonaretti, M. Pfahrer, R. Niklaus, and P. Büchler, “The virtual skeleton database: an open access repository for biomedical research and collaboration,” Journal of medical Internet research, vol. 15, no. 11, p. e245, 2013.
  • [19] A. Maier, H. G. Hofmann, M. Berger, P. Fischer, C. Schwemmer, H. Wu, K. Müller, J. Hornegger, J.-H. Choi, C. Riess et al., “Conrad—a software framework for cone-beam imaging in radiology,” Medical physics, vol. 40, no. 11, p. 111914, 2013.
  • [20] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” Advances in Neural Information Processing Systems, vol. 33, pp. 7537–7547, 2020.