Segmentation-Guided Knee Radiograph Generation using Conditional Diffusion Models

Siyuan Mei, Fuxin Fan, Fabian Wagner, Mareike Thies, Mingxuan Gu, Yipeng Sun, and Andreas Maier

All authors are with the Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany. Contact E-mail: [email protected].

Abstract

Deep learning-based medical image processing algorithms require representative data during development. In particular, surgical data might be difficult to obtain, and high-quality public datasets are limited. To overcome this limitation and augment datasets, a widely adopted solution is the generation of synthetic images. In this work, we employ conditional diffusion models to generate knee radiographs from contour and bone segmentations. We present two distinct strategies that incorporate the segmentation as a condition into either the sampling or the training process, namely conditional sampling and conditional training. The results demonstrate that both methods can generate realistic images while adhering to the conditioning segmentation. The conditional training method outperforms the conditional sampling method and the conventional U-Net.

Index Terms:

Radiograph synthesis, diffusion models, conditional image generation.

I Introduction

Radiography is one of the most commonly used medical imaging techniques for diagnosis and surgical interventions, capturing 2D projection images of patients using X-rays. The advent of deep learning-based techniques has enabled automated and precise processing of X-ray images, including organ segmentation [1], motion compensation [2], and denoising [3]. However, many of these methods rely heavily on extensive training data, which is challenging to collect at scale [4]. In particular, recent research demands atypical data, such as weight-bearing imaging of knees [5, 6].

To address this challenge, recent works proposed synthesizing simulated data as a substitute for clinical data [7]. Traditional forward projection methods create digitally reconstructed radiographs (DRRs) from 3D computed tomography (CT) volumes using Radon transform [8], which guarantees geometrical accuracy. However, such methods require original volumetric CT scans, and sophisticated forward models are needed to capture all properties of realistic-looking X-rays (e.g., energy-dependent effects, noise, scatter, etc.). In contrast, deep generative models leverage 2D X-ray datasets to generate radiographs. For instance, Weber et al. [9] employed generative adversarial network (GAN)-based models to augment chest X-ray images.

Recently, diffusion models have emerged as a powerful technique for data generation, demonstrating competitive performance compared to GANs [10]. In addition, diffusion models have been successfully applied to conditional radiograph generation, e.g., projection inpainting conditioned on masked projections [11] and class-conditional chest radiograph synthesis [12]. Despite achieving convincing performance, previous work has mainly focused on generating radiographs under specific conditions. In this work, we focus on knee imaging and extend the conditional generation of knee radiographs to more general conditions, such as simple segmentations with contour and bone information. To learn the conditional distribution given the segmentation, we propose two distinct pipelines of conditional diffusion models that incorporate conditional images into the sampling and training processes, respectively.

II Methods and Materials

II-A Method of Conditional Sampling

Figure 1: Overview of the sampling processes. (A) illustrates the sampling process of the conditional sampling method; (B) illustrates the sampling process of the conditional training method.

Diffusion models employ a forward diffusion process and a reverse diffusion process for image generation. In the forward process, data points gradually diffuse into random Gaussian noise over time, implying the transition from a complex to a simple data distribution. Conversely, the reverse diffusion process generates new data samples by progressively removing noise, starting from a Gaussian prior. Elegantly, the forward perturbation can be modeled as a stochastic differential equation (SDE), which is tractable throughout the reverse process [13].

Let $\mathbf{x}_{0}\in\mathbb{R}^{d}$ denote the data sample, $\mathbf{x}_{t}$ denote the perturbed data at time point $t\in(0,1]$ , and $p(\mathbf{x}_{t})$ denote the corresponding probability density function of $\mathbf{x}_{t}$ . The forward SDE is defined by

$$
{\rm d}\mathbf{x}_{t}=\mathbf{f}(\mathbf{x}_{t},t){\rm d}t+g(t)\mathbf{z}_{t},\quad t:0\rightarrow 1, \tag{1}
$$

where $\mathbf{f}(\mathbf{x}_{t},t)\in\mathbb{R}^{d}$ and $g(t)\in\mathbb{R}$ are the drift and diffusion coefficients of $\mathbf{x}_{t}$ , and $\mathbf{z}_{t}\sim\mathcal{N}(0,\mathbf{I})$ defines a random variable drawn from a standard normal distribution approximating the standard Wiener process. The corresponding reverse-time SDE has the form

$$
{\rm d}\overline{\mathbf{x}}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t})]{\rm d}t+g(t)\mathbf{z}_{t},\quad t:1\rightarrow 0, \tag{2}
$$

where $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t})$ represents the gradient of the logarithmic probability density, also known as the score function.

In this paper, we adopt the variance exploding (VE) SDE setting [13], defining the growing standard deviation of the noise as

$$
\sigma_{t}=\sigma_{min}\left(\frac{\sigma_{max}}{\sigma_{min}}\right)^{t}, \tag{3}
$$

which determines the drift and diffusion coefficients in (1) and (2) as

$$
\mathbf{f}(\mathbf{x}_{t},t)=0,\quad g(t)=\sigma_{t}\sqrt{2\log\frac{\sigma_{max}}{\sigma_{min}}}. \tag{4}
$$
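As a quick numerical illustration of (3) and (4), the following minimal sketch evaluates the noise schedule and the diffusion coefficient. The values of $\sigma_{min}$ and $\sigma_{max}$ are the ones reported in Section II-D; the function names `sigma` and `diffusion_coeff` are hypothetical and not taken from the authors' code.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of the VE SDE, Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def diffusion_coeff(t):
    """Diffusion coefficient g(t) of Eq. (4); the drift f is zero."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


# Print the schedule at a few time points between 0 and 1.
for t in np.linspace(0.0, 1.0, 5):
    print(f"t={t:.2f}  sigma_t={sigma(t):.4f}  g(t)={diffusion_coeff(t):.4f}")
```

With this choice, $g(t)^{2}$ equals the time derivative of $\sigma_{t}^{2}$, so the noise variance grows geometrically from $\sigma_{min}^{2}$ at $t=0$ to $\sigma_{max}^{2}$ at $t=1$.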

Crucially, the last unknown term $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t})$ in (2) can be estimated by training a time-conditional neural network $\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)$ . The optimal parameter ${\bm{\theta}}^{\ast}$ is obtained by minimizing the denoising score matching objective [14]

$$
\begin{aligned}
{\bm{\theta}}^{\ast}&=\mathop{{\rm argmin}}_{{\bm{\theta}}}\mathbb{E}_{t\sim U[0,1]}\left[\lVert\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t})\rVert_{2}^{2}\right]\\
&\approx\mathop{{\rm argmin}}_{{\bm{\theta}}}\mathbb{E}_{t\sim U[0,1]}\left[\lVert\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}|\mathbf{x}_{0})\rVert_{2}^{2}\right],
\end{aligned} \tag{5}
$$

where $p(\mathbf{x}_{t}|\mathbf{x}_{0})$ is the perturbation kernel of the VE SDE.
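To make the objective in (5) concrete, the sketch below computes a single-sample Monte Carlo estimate of the denoising score matching loss. It assumes a Gaussian perturbation kernel $\mathcal{N}(\mathbf{x}_{0},\sigma_{t}^{2}\mathbf{I})$, whose score is $-(\mathbf{x}_{t}-\mathbf{x}_{0})/\sigma_{t}^{2}$. The names `dsm_loss`, `zero_score`, and `exact_score` are hypothetical, and the score functions are simple stand-ins for a trained network.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of the VE SDE, Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def dsm_loss(score_fn, x0, rng):
    """Single-sample Monte Carlo estimate of the objective in Eq. (5)."""
    t = rng.uniform(0.0, 1.0)            # t ~ U[0, 1]
    z = rng.standard_normal(x0.shape)    # z ~ N(0, I)
    xt = x0 + sigma(t) * z               # sample from the assumed kernel p(x_t | x_0)
    target = -(xt - x0) / sigma(t) ** 2  # score of the Gaussian kernel
    return float(np.sum((score_fn(xt, t) - target) ** 2))


rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(8, 8))  # stand-in for a normalized radiograph


def zero_score(xt, t):
    """Untrained stand-in that always predicts a zero score."""
    return np.zeros_like(xt)


def exact_score(xt, t):
    """Exact kernel score; only valid because the toy dataset is the single image x0."""
    return -(xt - x0) / sigma(t) ** 2


print(dsm_loss(zero_score, x0, rng), dsm_loss(exact_score, x0, rng))
```

In practice the expectation is approximated by averaging this quantity over mini-batches of images, time points, and noise draws.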

With the segmentation as known information (denoted as $\mathbf{y}$ ), the conditional sampling method (CSM) follows the SDEdit algorithm [15] to synthesize knee radiographs. As depicted in Fig. 1 (A), CSM starts from a perturbed leg contour segmentation, and realistic details are generated through iterative denoising while retaining the desired shape. Notably, the initial perturbing noise is reduced to preserve conditional information by setting the starting time point $t_{0}$ of the reverse diffusion process to a value smaller than 1. Therefore, the prior sample becomes

\mathbf{y}_{t_{0}}\sim\mathcal{N}(\mathbf{y},\sigma_{t_{0}}^{2}\mathbf{I}).

(6)

The pipeline of CSM is described in Algorithm 1.

Algorithm 1 Sampling algorithm of CSM

1: Require: $N$ (number of sampling steps), $\mathbf{y}$ (segmentation guide), $t_{0}$ (starting time of the reverse diffusion process)
2: $\mathbf{x}_{N}\sim\mathbf{y}+\mathcal{N}(0,\sigma_{t_{0}}^{2}\mathbf{I})$
3: for $n=N-1$ to $0$ do
4:     $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$
5:     $t_{n}\leftarrow\frac{n}{N}t_{0}$
6:     $\Delta\overline{\mathbf{x}}_{n}\leftarrow g(t_{n})^{2}\mathbf{s}_{{\bm{\theta}}^{\ast}}(\mathbf{x}_{n},t_{n})+g(t_{n})\mathbf{z}$
7:     $\mathbf{x}_{n}\leftarrow\mathbf{x}_{n+1}+\Delta\overline{\mathbf{x}}_{n}$
8: end for
9: return $\mathbf{x}_{0}$
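The CSM sampling loop can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the score is evaluated at the current iterate, an explicit Euler-Maruyama step size $\Delta t=t_{0}/N$ is introduced (the listing above leaves the step size implicit), the Langevin corrector is omitted, and the trained network is replaced by a toy analytic score for a data distribution that consists of a single image `mu`. The names `csm_sample`, `toy_score`, and `mu` are hypothetical.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of the VE SDE, Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def g(t):
    """Diffusion coefficient of the VE SDE, Eq. (4)."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


def csm_sample(score_fn, y, t0=0.4, n_steps=500, seed=0):
    """Reverse-diffusion sampling started from a perturbed segmentation y."""
    rng = np.random.default_rng(seed)
    dt = t0 / n_steps                                 # assumed uniform step size
    x = y + sigma(t0) * rng.standard_normal(y.shape)  # prior sample, Eq. (6)
    for n in range(n_steps - 1, -1, -1):
        t_n = n / n_steps * t0
        z = rng.standard_normal(y.shape)
        # Euler-Maruyama step of the reverse SDE (2) with zero drift.
        x = x + g(t_n) ** 2 * score_fn(x, t_n) * dt + g(t_n) * np.sqrt(dt) * z
    return x


# Toy stand-in for the trained network: the exact score when the data
# distribution is a single image mu, i.e. p(x_t) = N(mu, sigma_t^2 I).
mu = np.full((16, 16), 0.3)


def toy_score(x, t):
    return -(x - mu) / sigma(t) ** 2


y = np.zeros((16, 16))
y[4:12, 4:12] = 0.5  # hypothetical contour mask
sample = csm_sample(toy_score, y)
print(float(np.abs(sample - mu).max()))
```

With this toy score the sample collapses onto `mu` up to residual noise of roughly $\sigma_{min}$. With a trained network, the guide instead selects which region of the learned distribution the sample ends up in.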

II-B Method of Conditional Training

An alternative to the CSM is to integrate the conditions into the training process, thereby directly estimating the score function of the conditional distribution [16]. We concatenate the condition and the perturbed image along the channel dimension as network input. The structure of the score-based network is detailed in Section II-D. Surprisingly, this conditional score network can be trained following the same form of denoising score matching [17], which is

$$
\begin{aligned}
{\bm{\theta}}^{\ast}&=\mathop{{\rm argmin}}_{{\bm{\theta}}}\mathbb{E}_{t\sim U[0,1]}\left[\lVert\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\mathbf{y},t)-\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}|\mathbf{y})\rVert_{2}^{2}\right]\\
&\approx\mathop{{\rm argmin}}_{{\bm{\theta}}}\mathbb{E}_{t\sim U[0,1]}\left[\lVert\mathbf{s}_{\bm{\theta}}(\mathbf{x}_{t},\mathbf{y},t)-\nabla_{\mathbf{x}_{t}}\log p(\mathbf{x}_{t}|\mathbf{x}_{0})\rVert_{2}^{2}\right].
\end{aligned} \tag{7}
$$
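The following sketch illustrates how one training example for (7) could be assembled: only the radiograph is perturbed, the clean segmentation is stacked as a second input channel, and the regression target is the score of the perturbation kernel. The Gaussian kernel $\mathcal{N}(\mathbf{x}_{0},\sigma_{t}^{2}\mathbf{I})$, the channel order, and the helper name `make_ctm_training_pair` are assumptions made for illustration.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of the VE SDE, Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def make_ctm_training_pair(x0, y, rng):
    """Build one (network input, time, regression target) triple for Eq. (7)."""
    t = rng.uniform(0.0, 1.0)
    z = rng.standard_normal(x0.shape)
    xt = x0 + sigma(t) * z                 # only the radiograph is perturbed
    net_input = np.stack([xt, y], axis=0)  # channels: [perturbed image, condition]
    target = -z / sigma(t)                 # equals -(x_t - x_0) / sigma_t^2
    return net_input, t, target


rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(256, 256))  # stand-in radiograph in [0, 1]
y = np.where(x0 > 0.5, 0.5, 0.0)             # stand-in segmentation
net_input, t, target = make_ctm_training_pair(x0, y, rng)
print(net_input.shape, target.shape)
```

Because the condition enters only through the network input, the loss has the same form as in the unconditional case.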

After training the conditioned network, the score function in (2) can be directly substituted by $\mathbf{s}_{\bm{\theta}^{*}}(\mathbf{x}_{t},\mathbf{y},t)$ for the provided conditional image in each sampling step. As shown in Fig. 1 (B), the generated sample is controlled even though the sampling process starts from random Gaussian noise. Algorithm 2 outlines the sampling process of the conditional training method (CTM), which differs from Algorithm 1 in the prior sample, the time schedule, and the use of the conditional score network.

Algorithm 2 Sampling algorithm of CTM

1: Require: $N$ (number of sampling steps), $\mathbf{y}$ (segmentation guide)
2: $\mathbf{x}_{N}\sim\mathcal{N}(0,\sigma_{max}^{2}\mathbf{I})$
3: for $n=N-1$ to $0$ do
4:     $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$
5:     $t_{n}\leftarrow\frac{n}{N}$
6:     $\Delta\overline{\mathbf{x}}_{n}\leftarrow g(t_{n})^{2}\mathbf{s}_{{\bm{\theta}}^{\ast}}(\mathbf{x}_{n},\mathbf{y},t_{n})+g(t_{n})\mathbf{z}$
7:     $\mathbf{x}_{n}\leftarrow\mathbf{x}_{n+1}+\Delta\overline{\mathbf{x}}_{n}$
8: end for
9: return $\mathbf{x}_{0}$
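A corresponding NumPy sketch of the CTM sampling loop is given below. As before, this is an illustration under assumptions rather than the authors' code: an explicit Euler-Maruyama step size $\Delta t=1/N$ is used, the score is evaluated at the current iterate, the Langevin corrector is omitted, and the trained conditional network is replaced by a toy score that pretends the conditional distribution is concentrated on the image $0.8\,\mathbf{y}$. The names `ctm_sample` and `toy_cond_score` are hypothetical.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 128.0  # values reported in Section II-D


def sigma(t):
    """Noise standard deviation of the VE SDE, Eq. (3)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def g(t):
    """Diffusion coefficient of the VE SDE, Eq. (4)."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


def ctm_sample(cond_score_fn, y, n_steps=500, seed=0):
    """Reverse-diffusion sampling from pure noise with a conditional score."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps                            # assumed uniform step size
    x = SIGMA_MAX * rng.standard_normal(y.shape)  # prior sample N(0, sigma_max^2 I)
    for n in range(n_steps - 1, -1, -1):
        t_n = n / n_steps
        z = rng.standard_normal(y.shape)
        # The condition y is passed to the score function in every step.
        x = x + g(t_n) ** 2 * cond_score_fn(x, y, t_n) * dt + g(t_n) * np.sqrt(dt) * z
    return x


def toy_cond_score(x, y, t):
    """Toy stand-in: exact score if p(x_0 | y) were a point mass at 0.8 * y."""
    return -(x - 0.8 * y) / sigma(t) ** 2


y = np.zeros((16, 16))
y[4:12, 4:12] = 0.5  # hypothetical contour mask
sample = ctm_sample(toy_cond_score, y)
print(float(np.abs(sample - 0.8 * y).max()))
```

Unlike the CSM sketch, the condition influences every update here, which matches the motivation given for CTM in the text.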

II-C Dataset Preparation

To obtain knee radiographs, we selected 55 leg CT volumes from the public SICAS medical image repository [18]. Each CT volume was forward projected over $360^{\circ}$ with an angular increment of $6^{\circ}$ onto a detector of size $256\times 256$ using CONRAD [19], resulting in 60 DRRs per volume. All projections were normalized to the range [0,1].

In addition, two different segmentations for each DRR were automatically generated as follows. The first segmentation extracted the leg contour using a threshold of 0.1, having a value of 0.5 for the contour and 0 for the background (refer to (a0)-(c0) in Fig. 3). The second segmentation extracted bones by thresholding the original CT volume, followed by forward projecting the bone. The bone projection was set to 0.5, and adding it to the contour segmentation resulted in the second segmentation, with bones having a value of 1 and contour a value of 0.5 (refer to (d0)-(f0) in Fig. 3). In total, 3300 radiographs were generated for each type of segmentation. They were randomly split in a 9:1:1 ratio for training, validation, and testing.
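The two segmentation types and the dataset arithmetic can be sketched as follows. The threshold of 0.1 and the label values 0.5 and 1 come from the text; the strict inequality, the binarization of the bone projection at zero, and the synthetic stand-in images are assumptions, and the function names are hypothetical.

```python
import numpy as np


def contour_segmentation(drr, threshold=0.1):
    """Leg contour: 0.5 inside the leg, 0 in the background."""
    return np.where(drr > threshold, 0.5, 0.0)


def contour_bone_segmentation(drr, bone_projection, threshold=0.1):
    """Contour (0.5) plus bone (another 0.5), so bone pixels equal 1."""
    bone = np.where(bone_projection > 0, 0.5, 0.0)  # assumed binarization
    return contour_segmentation(drr, threshold) + bone


# Synthetic stand-ins for a normalized DRR and a forward-projected bone mask.
drr = np.zeros((256, 256))
drr[:, 64:192] = 0.4    # soft tissue
drr[:, 112:144] = 0.9   # bone region
bone_projection = np.zeros((256, 256))
bone_projection[:, 112:144] = 1.0

seg = contour_bone_segmentation(drr, bone_projection)
print(np.unique(seg))

# Dataset bookkeeping from the numbers quoted in the text.
n_volumes, views = 55, 360 // 6
n_images = n_volumes * views
part = n_images // (9 + 1 + 1)
print(views, n_images, (9 * part, part, part))
```

The bookkeeping reproduces the 60 views per volume and 3300 images stated above; a 9:1:1 split of 3300 images corresponds to 2700, 300, and 300 images.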

II-D Network Structure and Hyperparameters

The backbone of the neural network is the noise-conditional score network++ (NCSNpp) [13]. As illustrated in Fig. 2, we configure six resolution levels of $(256,128,64,32,16,8)$ with a corresponding number of channels of $(64,128,128,128,128,256)$ . Furthermore, the time-conditional noise scale $\sigma_{t}$ is encoded with random Gaussian Fourier features [20] and embedded into all residual blocks. Importantly, for CSM only the perturbed X-ray images are input to the network. For CTM, the segmentation is additionally concatenated to the input. To compare the diffusion models with naive image-to-image models, the noise-encoding module is removed to form an improved U-Net model, which is then trained using the $L_{1}$ loss.

In our experiments, the parameters $\sigma_{min}$ and $\sigma_{max}$ are set to 0.01 and 128, respectively. We use a batch size of 16 and the Adam optimizer with a learning rate of $2\times 10^{-4}$ for training. During the sampling process, a Langevin dynamics corrector with a signal-to-noise ratio of 0.4 is applied after the reverse SDE step at each sampling step, and the number of sampling steps $N$ is set to 500 to improve sampling speed. Moreover, the hyperparameter $t_{0}$ for CSM is set to 0.4. All models are trained on four Nvidia A100 GPUs on a single cluster node for at most 300 epochs.
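To get a feeling for these settings, the short sketch below evaluates (3) at the CSM starting time. With $t_{0}=0.4$, the initial noise level is roughly 0.44, which is of the same order as the segmentation values 0.5 and 1, whereas CTM starts from $\sigma_{max}=128$. The uniform splitting of the time interval into $N$ steps is an assumption.

```python
sigma_min, sigma_max = 0.01, 128.0  # reported noise scale limits
t0, n_steps = 0.4, 500              # reported CSM start time and step count


def sigma(t):
    """Noise standard deviation of the VE SDE, Eq. (3)."""
    return sigma_min * (sigma_max / sigma_min) ** t


sigma_t0 = sigma(t0)

# Per-step time increments if the intervals are split uniformly (assumption).
dt_csm, dt_ctm = t0 / n_steps, 1.0 / n_steps

print(f"sigma at t0 (CSM start): {sigma_t0:.3f}")
print(f"sigma at t=1 (CTM start): {sigma(1.0):.1f}")
print(f"time step CSM: {dt_csm:.4f}, CTM: {dt_ctm:.4f}")
```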

III Results and Discussion

In this section, we provide qualitative and quantitative results of the two proposed diffusion-based methods and compare them with the baseline U-Net model. The first column of Fig. 3 showcases six randomly selected conditions: (a0)-(c0) show contour segmentations, and (d0)-(f0) show segmentations containing contour and bones. In Fig. 3 (a1)-(f1), the images generated by the U-Net maintain the given shape but are blurred in locations where bones overlap, as highlighted by the red circle. In contrast, the results from CSM appear more realistic than those of the U-Net. However, their quality decreases when additional constraints are introduced, as indicated by the red arrow in Fig. 3 (d2) and (f2). The results from CTM not only achieve nearly the same level of detail as the labels but also agree with the given conditions, as illustrated in the fourth column.

Method	Metric	Contour condition	Contour+bone condition
U-Net	MAE	$0.0209{\pm 0.007}$	$0.0188{\pm 0.006}$
U-Net	PSNR (dB)	$29.188{\pm 2.22}$	$30.304{\pm 2.45}$
CSM	MAE	$0.0395{\pm 0.010}$	$0.0507{\pm 0.010}$
CSM	PSNR (dB)	$22.911{\pm 1.89}$	$21.350{\pm 1.49}$
CTM	MAE	$\mathbf{0.0193{\pm 0.005}}$	$\mathbf{0.0152{\pm 0.007}}$
CTM	PSNR (dB)	$\mathbf{29.498{\pm 1.91}}$	$\mathbf{31.680{\pm 1.76}}$

TABLE I: Quantitative model comparison.

Table I summarizes the quantitative results averaged across all testing data. The evaluation metrics include mean absolute error (MAE) and peak signal-to-noise ratio (PSNR). CTM achieves the lowest MAE and the highest PSNR under both segmentation-based conditions. Its margin over the U-Net is modest (e.g., an MAE of 0.0193 versus 0.0209 for the contour condition), whereas CSM performs clearly worse than the U-Net.
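For reference, both metrics can be computed as in the minimal sketch below. It assumes that images are normalized to [0,1], so that the PSNR peak value is 1; the text does not state the peak value that was used, and the function names are hypothetical.

```python
import numpy as np


def mae(pred, target):
    """Mean absolute error between two images."""
    return float(np.mean(np.abs(pred - target)))


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range = 1 is an assumption."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))


target = np.zeros((256, 256))
pred = np.full((256, 256), 0.1)  # constant error of 0.1 in every pixel
print(mae(pred, target), psnr(pred, target))
```

For this constant error of 0.1, the MAE is 0.1 and the mean squared error is 0.01, which corresponds to a PSNR of 20 dB.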

Unlike the U-Net, which learns a mapping function between input and output, the diffusion models can implicitly capture the underlying data distribution from the training data and then sample from it, preventing the loss of fine details on the pixel level. However, in CSM, conditions are incorporated only at the first sampling step while being perturbed, which results in imprecise conditional information. Instead, CTM provides an estimated score function of the conditional distribution for each sampling step, accommodating both reliability and realism. Nonetheless, the generated X-ray images presently only encompass independent 2D conditional information, which may introduce geometric inconsistencies between a set of projections. Future research will focus on modeling 3D probabilistic distributions with the provided 2D conditions to enable CT reconstruction from the generated projections. In addition, clinical datasets will also be incorporated.

IV Conclusion

In this work, we explored two different pipelines of diffusion models to generate segmentation-conditioned knee X-ray data. The results demonstrate that both methods can generate realistic radiographs under the given conditions, with the method of conditional training achieving more stable performance. Ultimately, these high-quality synthetic medical images have the potential to benefit the development of data-driven research and educational applications in the medical field.

Acknowledgment

This work was supported by the European Research Council (ERC Grant No. 810316). The authors gratefully acknowledge the HPC resources provided by NHR@FAU using hardware funded by the German Research Foundation (DFG).

References

[1] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert et al., “nnu-net: Self-adapting framework for u-net-based medical image segmentation,” arXiv preprint arXiv:1809.10486, 2018.
[2] M. Thies, F. Wagner, N. Maul, L. Folle, M. Meier, M. Rohleder, L.-S. Schneider, L. Pfaff, M. Gu, J. Utz et al., “Gradient-based geometry learning for fan-beam ct reconstruction,” Physics in Medicine & Biology, vol. 68, no. 20, p. 205004, 2023.
[3] F. Wagner, M. Thies, M. Gu, Y. Huang, S. Pechmann, M. Patwari, S. Ploner, O. Aust, S. Uderhardt, G. Schett et al., “Ultralow-parameter denoising: Trainable bilateral filter layers in computed tomography,” Medical Physics, vol. 49, no. 8, pp. 5107–5120, 2022.
[4] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
[5] B. Bier, K. Aschoff, C. Syben, M. Unberath, M. Levenston, G. Gold, R. Fahrig, and A. Maier, “Detecting anatomical landmarks for motion estimation in weight-bearing imaging of knees,” in Machine Learning for Medical Image Reconstruction: First International Workshop, MLMIR 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 1. Springer, 2018, pp. 83–90.
[6] J.-H. Choi, R. Fahrig, A. Keil, T. F. Besier, S. Pal, E. J. McWalter, G. S. Beaupré, and A. Maier, “Fiducial marker-based correction for involuntary motion in weight-bearing c-arm ct scanning of knees. part i. numerical model-based optimization,” Medical physics, vol. 40, no. 9, p. 091905, 2013.
[7] C. Gao, B. D. Killeen, Y. Hu, R. B. Grupp, R. H. Taylor, M. Armand, and M. Unberath, “Synthetic data accelerates the development of generalizable learning-based algorithms for x-ray image analysis,” Nature Machine Intelligence, vol. 5, no. 3, pp. 294–308, 2023.
[8] A. C. Kak and M. Slaney, Principles of computerized tomographic imaging. SIAM, 2001.
[9] T. Weber, M. Ingrisch, B. Bischl, and D. Rügamer, “Implicit embeddings via gan inversion for high resolution chest radiographs,” in MICCAI Workshop on Medical Applications with Disentanglements. Springer, 2022, pp. 22–32.
[10] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[11] S. Mei, F. Fan, and A. Maier, “Metal inpainting in cbct projections using score-based generative model,” in 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). IEEE, 2023, pp. 1–5.
[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[13] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
[14] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011.
[15] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” arXiv preprint arXiv:2108.01073, 2021.
[16] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713–4726, 2022.
[17] G. Batzolis, J. Stanczuk, C.-B. Schönlieb, and C. Etmann, “Conditional image generation with score-based diffusion models,” arXiv preprint arXiv:2111.13606, 2021.
[18] M. Kistler, S. Bonaretti, M. Pfahrer, R. Niklaus, and P. Büchler, “The virtual skeleton database: an open access repository for biomedical research and collaboration,” Journal of medical Internet research, vol. 15, no. 11, p. e245, 2013.
[19] A. Maier, H. G. Hofmann, M. Berger, P. Fischer, C. Schwemmer, H. Wu, K. Müller, J. Hornegger, J.-H. Choi, C. Riess et al., “Conrad—a software framework for cone-beam imaging in radiology,” Medical physics, vol. 40, no. 11, p. 111914, 2013.
[20] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” Advances in Neural Information Processing Systems, vol. 33, pp. 7537–7547, 2020.