Repeat and Concatenate: 2D to 3D Image Translation
with 3D to 3D Generative Modeling

Abril Corona-Figueroa, Hubert P. H. Shum, Chris G. Willcocks
Department of Computer Science, Durham University, Durham, UK
https://github.com/abrilcf/3D-3D_repeat-concatenate
Abstract

This paper investigates a 2D to 3D image translation method with a straightforward technique, enabling correlated 2D X-ray to 3D CT-like reconstruction. We observe that existing approaches, which integrate information across multiple 2D views in the latent space, lose valuable signal information during latent encoding. Instead, we simply repeat and concatenate the 2D views into higher-channel 3D volumes and approach the 3D reconstruction challenge as a straightforward 3D to 3D generative modeling problem, sidestepping several complex modeling issues. This method enables the reconstructed 3D volume to retain valuable information from the 2D inputs, which are passed between channel states in a Swin UNETR backbone. Our approach applies neural optimal transport, which is fast and stable to train, effectively integrating signal information across multiple views without the requirement for precise alignment; it produces non-collapsed reconstructions that are highly faithful to the 2D views, even after limited training. We demonstrate correlated results, both qualitatively and quantitatively, having trained our model on a single dataset and evaluated its generalization ability across six datasets, including out-of-distribution samples.

1 Introduction

2D to 3D image translation is a class of computer vision problems where the goal is to learn the mapping between one or more 2D images and a corresponding 3D volumetric image. Years of research in this topic have given rise to many image and graphics applications, such as augmented reality [7, 27], sensor fusion [33, 25], scene rendering [5, 19], and multimodal translation [17, 38, 50]. The latter has attracted attention in the medical domain, including 2D X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Ultrasound, among others, due to the potential to translate from cheap, low-quality and available medical imaging modalities to those that are more expensive, have long waiting times, or involve harmful ionizing radiation. While translating between image modalities of common dimensionality (i.e., 2D to 2D, 3D to 3D) has achieved notable results [50] due to its seemingly straightforward adaptation of specific state-of-the-art (SOTA) deep learning models, translating between data of distinct dimensionality poses further challenges. In particular, converting an image from a lower-dimensional to a higher-dimensional representation (e.g., 2D to 3D) is considered a reconstructive generative modelling problem [45, 4].

Figure 1: (a) Previous approaches focus on 2D to 3D mapping, often employing asymmetric architectures and compressed latent encoding. (b) In contrast, we propose 3D to 3D mapping from repeated and concatenated inputs, enabling faster training with highly correlated outputs without latent compression, even with small datasets (a few hundred images).

One highly relevant application in the medical field involves obtaining 3D CT representations from 2D X-ray projections [30]. However, unlike image super-resolution approaches, bridging the 2D to 3D data dimensionality gap presents a unique modeling challenge to estimate spatial missing details [45]. Moreover, machine learning models achieve remarkable results thanks to abundant natural image data, but medical models often struggle due to the limited size of medical datasets. Furthermore, real-world medical 3D datasets involve volumetric features with varying density structures, making hallucinated data a significant concern that can prove counterproductive [34].

Existing methods approach 2D to 3D medical image translation through asymmetrical architectures and/or incorporate various regularization techniques to enforce plausible reconstructions [49, 35] (Fig. 1). While these techniques may generate high-quality outputs, even with high-frequency details, they often lack correlation to the input data. This means that the model becomes overly reliant on the latent prior, potentially ignoring the original 2D signal. The situation is further exacerbated if the training data is limited (a few hundred images): when the model is presented with inputs significantly different from the training dataset, such as out-of-distribution samples, it is likely to perform poorly, making it unusable for practical scenarios.

In this paper, we propose a simple 2D to 3D translation framework that addresses the correlation issue, ensuring highly correlated outputs even with limited datasets. We achieve this through a preprocessing step that preserves 2D information content throughout the network transformations without relying on other priors. First, we repeat the various 2D input projections to match the output 3D target depth (Fig. 1). These 3D volumes are then concatenated into a single higher-channel 3D volume. Then we propose a 3D to 3D conditional generative modeling approach that applies neural optimal transport with the de-biased Sinkhorn divergence. Such an approach effectively allows mass splitting; we find this particularly suitable for our task, as the original 2D inputs contain valuable information content, requiring each area of the 2D inputs to contribute to entire 3D regions of details in the synthesized 3D image (Fig. 2).

The approach was found to give significant improvement in terms of generalization, while being stable to train with only a few hundred images (ideal for medical/real-world datasets). Our evaluation showed that the method generates plausible reconstructions after only 2,000 training optimization steps, which can generalize to out-of-distribution inputs after approximately 28 hours of training.

Figure 2: Proposed 2D to 3D image translation approach. (a) We learn the mapping between 2D inputs and their corresponding 3D representation, adapting a Swin UNEt-TRansformer $g_\varphi$ [16] for image translation. (b) Optimization is based on the dual optimal transport regularization between networks $g_\varphi$ and $d_\phi$, comparing data points from the two image spaces using the Sinkhorn divergence $\mathrm{S}_\varepsilon$ on the activations of a feature extractor $f$.

To summarize, our contributions are

  1. An alternative and simple 2D to 3D image translation framework that effectively reconstructs a CT volumetric representation given only one or any number of X-ray projections, with potential application in clinical use (e.g., cutting down costs and radiation to patients).

  2. A processing pipeline that retains all 2D information throughout the 3D transformation network without losing information content in the latent encoding.

  3. Demonstrable generalization with very limited data (a few hundred images) trained in less than 28 hours. The processing step and approach can be directly applied to other generative modelling approaches and applications.

2 Related Work

2D to 3D image translation:

Several methods have attempted to obtain 3D representations from 2D images for medical applications, such as CT reconstruction from X-rays. Conditional approaches, generating 3D images from one or two 2D images, are often based on Generative Adversarial Networks (GANs) [13] and Variational Autoencoders (VAEs) [22]. These methods transform noise sampled from a Gaussian prior, which is concatenated with a latent representation of the 2D data; this may lose or hallucinate important 2D signal information not captured in the latent encoding. Strategies to mitigate this include incorporating 3D priors from real data [18] or simply generating 2D slices as output, which are stacked into a 3D representation [52].

Another line of work involves asymmetric autoencoder architectures, where a 2D encoder extracts spatial features, which are then expanded to fit into a 3D decoder [32, 24, 40, 15, 26]. To mitigate information loss at the bottleneck, additional objectives are learned, enforcing a more interpretable latent space or adding an additional 2D encoder [49]. Recent approaches explore implicit neural representations [5, 28, 39] or two-stage vector-quantized approaches combined with diffusion [6]. In contrast, we explore a simpler approach that generalizes from small datasets ($\sim$1k images) in a 3D to 3D setting without relying on more complex asymmetric architectures.

Regularization and prior knowledge:

Generalizing beyond the examples in the training set is a fundamental aspect of machine learning models for medical applications. Key problems are mode collapse and overfitting, which arise due to the limited availability and real-world complexity of medical datasets. This is further exacerbated by data noise and heterogeneity among different imaging modalities. Fortunately, regularization techniques such as parameter constraints or augmentation can mitigate overfitting [31, 41] and help alleviate data shortages.

Data augmentation, in particular, is an established solution to mitigate overfitting [21, 11] by introducing semantic-preserving data transformations such as rotation, translation, and contrast adjustments. However, selecting and combining augmentations is a delicate process, especially to prevent undesirable outcomes such as the model learning to generate the augmented distribution and deviating from its original objective [20]. In this context, incorporating prior knowledge from initial assumptions about the data distribution can guide learning for improved generalization, such as additional conditional information in generative modeling or via additional terms to the main objective functions [46, 44, 10]. In our approach, we know that 2D X-rays capture the underlying 3D objects over a range of depths, acting as a prior that motivates our repeat and concatenate mapping strategy. We find this facilitates improved generalization for both in- and out-of-distribution inputs, even when trained on small datasets.

Deep generative modeling:

2D to 3D image translation involves modeling the probability of unobserved 3D regions and is therefore a generative modeling problem. Deep generative modeling (DGM) is a very large field, from which mainstream approaches include GANs, VAEs, normalizing flows, autoregressive models and probabilistic diffusion models [4]. These methods classically balance an empirically observed trilemma of modeling quality, mode coverage and the number of ‘steps’ taken during sampling [47]. VAEs and normalizing flows are well-known for lower quality sampling, while GANs suffer from mode collapse, capturing only part of the distribution during training. Various strategies, such as weight clipping, gradient penalties, and spectral normalization, have been proposed to mitigate these issues, but they often fail to address fundamental challenges rooted in optimal transport (OT) theory.

Neural optimal transport:

In contrast, neural optimal transport offers a more nuanced approach for mapping probability distributions, which is especially relevant for handling the intricacies of medical imaging data. Traditional methods like $f$-divergences, commonly employed in GANs, perform poorly with smaller, high-dimensional datasets, failing to provide meaningful gradients for effective training [42]. Capturing variability in limited medical data is crucial, and leveraging regularity in human datasets is one approach. For example, organs and bones consistently occupy locations in the human chest, allowing neural networks to learn the mapping between X-ray and 3D CT volumes. OT is relevant in this setting by providing a rigorous distance metric that respects the geometric properties of the distributions. It computes the minimal cost required to transform one distribution into another, offering a coherent and robust measure of similarity that is both intuitive and stable [8]. This geometrically-aware approach enables a more direct and meaningful comparison of medical images across different modalities, potentially facilitating the development of more accurate and reliable diagnostic tools.

3 Method

We first detail our 2D to 3D processing approach (Fig. 1), and then discuss our 3D to 3D generative modeling regularized with optimal transport (Fig. 2). We also justify how the 3D to 3D mapping approach improves correlation by analyzing the information content through the transformation strategies.

3.1 Processing pipeline

At a high level, we wish to investigate whether multi-view 2D information can be combined for accurate 3D synthesis without latent encoding. Previous approaches integrate multi-view information in the latent space, as it typically captures a degree of spatial and rotational invariance. However, the latents, even with information-rich modeling, fail to retain the full signal from the input 2D views. While, through modern deep generative modeling, they may synthesize high-quality signals with high-frequency details, these are typically prone to over-hallucination [6], which can potentially prove fatal in real-world medical applications. In this application, we would rather accept blurriness and uncertainty in exchange for outputs that are highly correlated to their corresponding 2D inputs.

To achieve this objective, we propose increasing the dimensionality at the 2D inputs to ensure the signal can influence an entire region of the 3D volume, and concatenating these inputs ensuring that information content loss is reduced within 3D to 3D residual architectures.

More formally, we outline our processing pipeline for $N$ input X-ray projections. Let $I \in \mathbb{R}^{C \times H \times W}$ be an input 2D X-ray, representing a single view with height $H$, width $W$ and channels $C$. For a set of $N$ views, we have $\{I^i\}_{i=1}^{N}$, each aiming to contribute to the reconstruction of a 3D CT volume $V \in \mathbb{R}^{C \times H \times W \times D}$, where $D$ is the depth.

We ‘stretch’ the 2D inputs to match the dimensions of the target 3D volume: we repeat each view $D$ times by $\Gamma : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{C \times H \times W \times D}$ and coarsely align them by transposing views that differ by 90 degrees (leaving the others unchanged; see our experiments section on Views alignment for further discussion). This is applied to each of the $N$ views, which are concatenated across the channel axis (denoted by $\bigoplus$), yielding a composite volume $V_{\text{cat}} \in \mathbb{R}^{NC \times H \times W \times D}$, where

$$V_{\text{cat}} = \bigoplus_{i=1}^{N} \Gamma(I^i), \tag{1}$$

and the reconstructed 3D volume is modeled by a U-Net based architecture $g_\varphi^{\text{unet}} : \mathbb{R}^{NC \times H \times W \times D} \to \mathbb{R}^{C \times H \times W \times D}$.
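The preprocessing in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `repeat_view` and `repeat_and_concatenate`, the toy shapes, and in particular the `swap_axes` alignment rule (exchanging the width and depth axes for a view taken at 90 degrees) are assumptions made for the example.

```python
import numpy as np


def repeat_view(view, depth, swap_axes=False):
    """Gamma: copy a (C, H, W) view `depth` times to obtain (C, H, W, D)."""
    volume = np.repeat(view[..., np.newaxis], depth, axis=-1)
    if swap_axes:
        # Assumed coarse alignment for a view taken at 90 degrees: exchange
        # the width and depth axes (requires W == D to keep the shape).
        volume = np.swapaxes(volume, -1, -2)
    return volume


def repeat_and_concatenate(views, depth, swap_flags=None):
    """Concatenate N repeated views along the channel axis: (N*C, H, W, D)."""
    if swap_flags is None:
        swap_flags = [False] * len(views)
    volumes = [repeat_view(v, depth, s) for v, s in zip(views, swap_flags)]
    return np.concatenate(volumes, axis=0)


# Toy example with two single-channel 8x8 views (random stand-ins for X-rays).
rng = np.random.default_rng(0)
frontal = rng.random((1, 8, 8))
lateral = rng.random((1, 8, 8))
v_cat = repeat_and_concatenate([frontal, lateral], depth=8,
                               swap_flags=[False, True])
print(v_cat.shape)
```

Every depth slice of the first channel is a copy of the frontal view, so no pixel of the input is discarded before the volume reaches the network; the second channel carries the lateral view with its horizontal axis laid along depth.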

3.2 Mapping network and generative modeling

For our 3D to 3D mapping network $g_\varphi$ with parameters $\varphi$, we considered a residual U-Net [36] equipped with self-attention following the Swin UNEt-TRansformer (Swin UNETR) model [16], applied for image translation instead of image segmentation. Based on experimenting with a variety of off-the-shelf U-Net backbones, we settled on Swin UNETR as it empirically generated higher-quality outputs across all metrics.

3.2.1 Modeling with neural optimal transport

For the generative modeling task, we applied the dual regularized optimal transport ($\mathrm{OT}$) learning approach [14] using the geometric de-biased Sinkhorn divergence $\mathrm{S}_\varepsilon(\cdot)$ [9] as the cost function. The Sinkhorn divergence is suitable for this task as it is an efficient yet positive and definite approximation of OT that interpolates between Maximum Mean Discrepancies (MMD) and OT. Specifically, we incorporate a feature extractor $f$ that reduces the dimensionality of $g(V_{\text{cat}})$ and $V$ through a mapping into a reduced dimensional vector as suggested in [12]. We parametrized $f$ as a neural network and calculated the cost function in the latent space,

$$\mathrm{S}_\varepsilon(\alpha, \beta) := \mathrm{OT}_\varepsilon(\alpha, \beta) - \frac{1}{2}\mathrm{OT}_\varepsilon(\alpha, \alpha) - \frac{1}{2}\mathrm{OT}_\varepsilon(\beta, \beta), \tag{2}$$

where $\alpha$ represents the distribution of predicted features and $\beta$ the distribution of ground truth 3D CT features extracted from $f$. While calculating the cost in the data space might suffice in some cases, we find that incorporating the feature extractor $f$ is beneficial with 3D data as it reduces its dimensionality and makes our mapping network less susceptible to mode collapse.
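To make Eq. (2) concrete, the following NumPy sketch evaluates the de-biased Sinkhorn divergence between two small sets of feature vectors using log-domain Sinkhorn iterations on the dual potentials. It is an illustrative sketch under stated assumptions, not the training code: the uniform sample weights, the ground cost (half the squared Euclidean distance), the value of `eps`, the iteration count, and the random stand-in features are all choices made for the example.

```python
import numpy as np


def _softmin(eps, cost, log_weights, potential):
    """Row-wise -eps * log sum_j exp(log_weights_j + (potential_j - cost_ij) / eps)."""
    z = log_weights[None, :] + (potential[None, :] - cost) / eps
    z_max = z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    return -eps * (z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1)))


def ot_eps(x, y, eps=0.5, n_iters=500):
    """Entropy-regularized OT between two uniformly weighted point clouds."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    # Assumed ground cost: half the squared Euclidean distance between features.
    cost = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    f = np.zeros(len(x))
    g = np.zeros(len(y))
    for _ in range(n_iters):  # alternating updates of the dual potentials
        f = _softmin(eps, cost, np.log(b), g)
        g = _softmin(eps, cost.T, np.log(a), f)
    return float(a @ f + b @ g)  # value of the dual objective


def sinkhorn_divergence(x, y, eps=0.5, n_iters=500):
    """De-biased divergence of Eq. (2) between feature sets x and y."""
    return (ot_eps(x, y, eps, n_iters)
            - 0.5 * ot_eps(x, x, eps, n_iters)
            - 0.5 * ot_eps(y, y, eps, n_iters))


rng = np.random.default_rng(0)
features_pred = rng.random((32, 2))                   # stand-in for f(g(V_cat))
features_true = features_pred + np.array([1.0, 0.0])  # stand-in for f(V)
s_same = sinkhorn_divergence(features_pred, features_pred)
s_shift = sinkhorn_divergence(features_pred, features_true)
print(s_same, s_shift)
```

The two self-transport terms are what remove the entropic bias: the divergence of a feature set with itself is zero, and for a pure translation by $t$ under this cost the converged value should approach $\frac{1}{2}\lVert t \rVert^2$ regardless of $\varepsilon$.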

This modeling approach is similar to the min-max objective in GANs, where the Swin UNETR mapping network $g_\varphi$ (Fig. 2a) replaces the generator and a residual discriminator $d_\phi$ assigns transportation costs to the produced samples $g_\varphi(V_{\text{cat}})$. We trained our networks using:

$$\mathcal{L}_g = \mathbb{E}_{(V_{\text{cat}}, V) \sim p_{\text{data}}}\left[\mathrm{S}_\varepsilon(g(V_{\text{cat}}), V) - \lambda\, d(g(V_{\text{cat}}))\right], \tag{3}$$
$$\mathcal{L}_d = \mathbb{E}_{(V_{\text{cat}}, V) \sim p_{\text{data}}}\left[d(g(V_{\text{cat}})) - d(V)\right]. \tag{4}$$

The optimization of the network $d$ involves finding a function that yields the minimal cost associated with transporting mass from each point in $V_{\text{cat}}$ to each point in $V$. Overall this allows us to align the distributions of features from ground truth CT samples with our samples from $g_\varphi$ attained from the concatenated volumes from X-ray views $V_{\text{cat}}$, through regularized dual OT (Fig. 2b).
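The sign conventions of Eqs. (3) and (4) are easy to misread, so the sketch below writes the two batch losses out explicitly. It assumes the Sinkhorn term has already been computed for the batch and that the critic returns one scalar per volume; the function names, the value of `lam`, and the toy critic scores are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np


def generator_loss(sinkhorn_value, critic_on_generated, lam=1.0):
    """Eq. (3): Sinkhorn term minus lam times the mean critic score on g(V_cat)."""
    return float(sinkhorn_value - lam * np.mean(critic_on_generated))


def critic_loss(critic_on_generated, critic_on_real):
    """Eq. (4): mean critic score on g(V_cat) minus mean critic score on real V."""
    return float(np.mean(critic_on_generated) - np.mean(critic_on_real))


# Toy batch of scalar critic outputs d(g(V_cat)) and d(V); values are made up.
d_fake = np.array([0.2, 0.4])
d_real = np.array([0.9, 0.7])
loss_g = generator_loss(sinkhorn_value=0.5, critic_on_generated=d_fake, lam=0.1)
loss_d = critic_loss(d_fake, d_real)
print(loss_g, loss_d)
```

Minimizing the critic loss pushes scores on real volumes up and scores on generated volumes down, while the generator is rewarded both for a small Sinkhorn divergence and for a high critic score.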

Even though we can use the L2 distance to approximate the OT plan, it does not capture the underlying structure of the distributions or how to transport mass from one to another. In contrast, the Sinkhorn divergence induces a scaling process that ensures, at each step, that the resulting transportation plan satisfies probability distribution constraints, which helps avoid common instability issues in adversarial training. Furthermore, our model relies solely on 2D views, promoting deterministic mapping and reducing hallucination by avoiding sampling from additional distributions, such as Gaussian, typical of multimodal image translation. We find that OT regularization greatly reduces overfitting but may introduce blur in the generated outputs due to the unconstrained nature of 2D-3D translation.

3.3 Justification

We justify our processing pipeline by examining information loss through different approaches. In previous works, encoded features $\mathbf{z} \in \mathbb{R}^{M}$ are typically obtained through an encoding function $\mathbf{z} = f^{\text{enc}}(I)$, where typically $M < CHW$, indicating a reduction in dimensionality. The decoding function $V = f^{\text{dec}}(\mathbf{z})$ similarly tries to reconstruct a volume given the latent.

Information content $\mathrm{I}(\cdot)$

Let $\mathrm{I}(\cdot)$ quantify the informational content of an image or volume, indicating that $\mathrm{I}(V)$ is maximized when $V$ contains the full detail and structure inherent to the original 3D object.

We show that the repetition and concatenation approach retains information when integrated via U-Net over multiple views, compared to classical encoder-decoder approaches:

$$g^{\text{unet}}\left(V_{\text{cat}}\right) \succ f^{\text{dec}}\left(\bigoplus_{i=1}^{N} f^{\text{enc}}(I^i)\right),$$

where $\succ$ denotes higher fidelity (closeness to original) in 3D reconstruction, directly correlated with informational content $\mathrm{I}(\cdot)$.

Information loss in encoding and decoding

Dimensionality reduction through encoding implies $M < CHW$, highlighting a reduction in the capacity to represent the full informational content of the original volume. This reduction leads to inherent information loss, which the decoding process cannot fully recover as

$$M < CHW \implies \mathrm{I}(f^{\text{dec}}(\mathbf{z})) < \mathrm{I}(I), \tag{5}$$

due to the lossy nature of compression in $f^{\text{enc}}$ and the limited ability of $f^{\text{dec}}$ to fully recover original information.

Information retention through skip-connections

The skip connections in $g^{\text{unet}}$ enable direct transfer of features across layers, preserving and refining informational content where

$$\mathrm{I}(g^{\text{unet}}(V_{\text{cat}})) \approx \mathrm{I}(V_{\text{cat}}) = \mathrm{I}(I), \tag{6}$$

especially when $V_{\text{cat}}$ is constructed from repeating $I$ across the depth dimension (without information loss), thereby maintaining high fidelity from input to output.
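A toy numerical check can illustrate the contrast between the bottleneck in Eq. (5) and the lossless repetition assumed in Eq. (6). The sketch below is only an analogy under strong assumptions: it uses a random linear encoder with its pseudo-inverse as decoder (not the learned networks discussed here), and the sizes `C, H, W, D, M` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D, M = 1, 8, 8, 8, 16      # M < C*H*W, i.e. a compressing bottleneck
image = rng.random((C, H, W))       # stand-in for a 2D X-ray view

# (a) Repetition across depth is exactly invertible: any depth slice is the input.
volume = np.repeat(image[..., np.newaxis], D, axis=-1)
repeat_error = float(np.abs(volume[..., 0] - image).max())

# (b) A linear bottleneck with M < C*H*W cannot be inverted, even by the
#     best linear decoder (the Moore-Penrose pseudo-inverse).
encoder = rng.normal(size=(M, C * H * W))
decoder = np.linalg.pinv(encoder)
latent = encoder @ image.ravel()
decoded = (decoder @ latent).reshape(C, H, W)
bottleneck_error = float(np.abs(decoded - image).max())

print(repeat_error, bottleneck_error)
```

The repeated volume reproduces the view exactly, whereas the decoded image only recovers the component of the input lying in the $M$-dimensional subspace kept by the encoder.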

Information retention across views

The concatenation of repeated views tends to preserve informational content, while the concatenation of encoded views may lead to information loss due to aggregation:

$$\mathrm{I}\left(\bigoplus_{i=1}^{N} \Gamma(I^i)\right) \approx \sum_{i=1}^{N} \mathrm{I}(\Gamma(I^i)), \tag{7}$$
$$\mathrm{I}\left(\bigoplus_{i=1}^{N} f^{\text{enc}}(I^i)\right) \leq \sum_{i=1}^{N} \mathrm{I}(I^i). \tag{8}$$

Therefore, the combination of multiple views through concatenation and subsequent processing with $g^{\text{unet}}$ tends to retain information content, whereas concatenating the encodings leads to significant loss in the correlation on aggregate, leading to

$$\mathrm{I}\left(g^{\text{unet}}\left(\bigoplus_{i=1}^{N} \Gamma(I^i)\right)\right) > \mathrm{I}\left(f^{\text{dec}}\left(\bigoplus_{i=1}^{N} f^{\text{enc}}(I^i)\right)\right), \tag{9}$$

indicating the potential of enhanced fidelity of 3D reconstruction achieved through a U-Net based architecture with repeated concatenation and skip connections over latent integration frameworks.

LIDC-IDRI dataset [1] (in-distribution inputs) using two views

| Experiment | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|
| 2D-3D AE | 0.1545 | 3.804 | 995.2582 | 5.81 |
| 2D-3D AE + L2 Norm. | 0.2086 | 10.833 | 197.2649 | 2.5103 |
| 3D-3D U-Net | 0.4787 | 22.063 | 58.7546 | 0.9264 |

Figure 3: Effect of reformulating 2D-3D mapping into 3D-3D.

Figure 4: Example projections from generated 3D CT volumes from inputs using one, two, four and eight X-rays; obtained from our 3D-3D translation approach with Swin UNETR backbone. We use testing instances from LIDC-IDRI dataset [1].

4 Experiments

We trained our models on the LIDC chest dataset [1], which consists of only 916 training CT scans, and tested them on both in-distribution and out-of-distribution inputs from six lung datasets. We created paired datasets by generating digitally reconstructed radiographs as 2D inputs from the CT scans using the open-source software Plastimatch, following previous work [49, 35, 5, 6]. Our model’s convergence plateaued after 2,000 iterations, taking only an average of 28 hours to reach 5,000 steps (the selected weights). The training runs were performed on NVIDIA TITAN RTX with a batch size of eight using PyTorch.

In-distribution inputs (LIDC-IDRI dataset [1])

| No. input views | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|
| $1 \oplus z \sim \mathcal{N}(0, \mathbf{I})$ | 0.2337 | 20.2827 | 0.0403 | 0.1342 |
| 1 | 0.4891 | 23.2198 | 0.0214 | 0.0751 |
| 2 | 0.5272 | 24.3192 | 0.0180 | 0.0665 |
| 4 | 0.5129 | 24.3358 | 0.0167 | 0.0666 |
| 8 | 0.5402 | 24.6352 | 0.0155 | 0.0613 |

Out-of-distribution inputs (MIDRC-1b dataset [43])

| No. input views | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|
| $1 \oplus z \sim \mathcal{N}(0, \mathbf{I})$ | 0.1353 | 15.3223 | 0.0800 | 0.2334 |
| 1 | 0.4043 | 20.0800 | 0.0526 | 0.1510 |
| 2 | 0.4048 | 20.3636 | 0.0505 | 0.1474 |
| 4 | 0.3779 | 21.2115 | 0.0439 | 0.1337 |
| 8 | 0.2029 | 17.7972 | 0.0777 | 0.2029 |

Table 1: Primary results of our mapping approach with varying input views. Each model was trained on LIDC dataset [1] and tested for both in- and out-of-distribution inputs from MIDRC-1b dataset [43]. We computed metrics five times with different random seeds and report their average.
(a) In-distribution inputs

| Dataset | Method | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| LIDC-IDRI [1] | X2CT-GAN [49] | 0.321 | 19.68 | 0.045 | 0.151 |
| | CCX-rayNet [35] | 0.386 | 22.66 | 0.032 | 0.108 |
| | Ours | 0.527 | 24.35 | 0.018 | 0.066 |

(b) Out-of-distribution inputs

| Dataset | Method | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| COVID-19-NY-SBU [37] | X2CT-GAN [49] | 0.236 | 16.74 | 0.089 | 0.199 |
| | CCX-rayNet [35] | 0.205 | 19.03 | 0.054 | 0.144 |
| | Ours | 0.400 | 22.78 | 0.022 | 0.077 |
| SPIE-AAPM-NCI [2] | X2CT-GAN [49] | 0.174 | 15.63 | 0.115 | 0.245 |
| | CCX-rayNet [35] | 0.130 | 17.87 | 0.080 | 0.196 |
| | Ours | 0.399 | 22.04 | 0.026 | 0.087 |
| MIDRC-1b [43] | X2CT-GAN [49] | 0.220 | 14.70 | 0.156 | 0.360 |
| | CCX-rayNet [35] | 0.114 | 18.06 | 0.076 | 0.228 |
| | Ours | 0.404 | 20.36 | 0.050 | 0.147 |
| ANTI-PD-1 [29] | X2CT-GAN [49] | 0.286 | 18.04 | 0.072 | 0.164 |
| | CCX-rayNet [35] | 0.205 | 18.15 | 0.067 | 0.167 |
| | Ours | 0.349 | 20.93 | 0.040 | 0.112 |
| LCTSC [48] | X2CT-GAN [49] | 0.326 | 19.20 | 0.052 | 0.125 |
| | CCX-rayNet [35] | 0.206 | 19.75 | 0.062 | 0.156 |
| | Ours | 0.331 | 20.38 | 0.040 | 0.109 |
| NSCLC [3] | X2CT-GAN [49] | 0.280 | 18.13 | 0.072 | 0.163 |
| | CCX-rayNet [35] | 0.122 | 16.62 | 0.093 | 0.216 |
| | Ours | 0.315 | 19.85 | 0.047 | 0.126 |

Table 2: Quantitative results on both in-distribution and several out-of-distribution datasets using only two input X-rays. We used our model weights after 5,000 training optimization steps, trained with the Swin UNETR backbone. Other models' weights are from 100 epochs ($\sim$90k iterations).
Mapping reformulation

In Figure 3, we present results from initial experiments comparing an asymmetrical 2D to 3D strategy with our reformulation as a 3D to 3D mapping. The 2D to 3D approach aggregates the individually encoded 2D view features, which serve as input to a 3D decoder. However, this results in highly blurred outputs due to the information lost in the latent encoding. Despite the simplicity of such an approach in terms of feature alignment, important fine-grained details are missing. In contrast, our approach reformulates the problem as a 3D to 3D mapping in a simple architecture, achieving high-fidelity image translations even when using a single input view. We ablate our ‘repeat and concatenate’ preprocessing pipeline by instead concatenating noise sampled from a normal distribution. When the vector $\boldsymbol{z}$ is fixed for each patient, the model overfits, while sampling a new $\boldsymbol{z}$ at each iteration results in mode collapse, where the model produces the same output irrespective of the input.
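The ‘repeat and concatenate’ preprocessing described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `repeat_and_concatenate`, the channel-first layout `(K, D, H, W)`, and the choice to repeat every view along the same new depth axis are all hypothetical.

```python
import numpy as np

def repeat_and_concatenate(views, depth):
    """Turn K 2D views of shape (H, W) into one (K, depth, H, W) volume.

    Assumption: each view is copied `depth` times along a new axis, and the
    resulting single-channel volumes are concatenated along the channel axis.
    """
    volumes = []
    for view in views:
        view = np.asarray(view, dtype=np.float32)
        # Repeat the 2D image along a new leading depth axis: (depth, H, W).
        vol = np.repeat(view[np.newaxis, :, :], depth, axis=0)
        # Add a channel axis so the volumes can be concatenated: (1, depth, H, W).
        volumes.append(vol[np.newaxis])
    return np.concatenate(volumes, axis=0)

rng = np.random.default_rng(0)
frontal = rng.random((64, 64))   # stand-in for a frontal X-ray
lateral = rng.random((64, 64))   # stand-in for a lateral X-ray
x = repeat_and_concatenate([frontal, lateral], depth=64)
print(x.shape)
```

Under these assumptions, adding views only changes the channel count of the input volume, so the same 3D to 3D backbone can consume one, two, four or eight views once its input channel count is set to match.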

Ablations & comparisons

Our approach allows the use of multiple views as input without modifying our model’s architecture. We observe that the model benefits from additional views when tested on in-distribution inputs (Fig. 4). However, this correlation does not strictly hold for out-of-distribution inputs. This suggests that when there are disparities in view alignment or style-domain features compared to the training distribution, the model predominantly relies on views that closely resemble the learned patterns. Investigating whether certain projections contain more valuable information than others is a worthwhile direction for future work. Overall, we find the combination of two or four views to be empirically effective (Table 1).

In Table 2 we compare our approach to paired alternative methods in terms of quality of the 3D outputs for both in- and out-of-distribution inputs. Despite variations in datasets originating from different imaging systems, resolutions, and patients’ health conditions, our model demonstrates superior performance across all metrics and maintains consistency across all out-of-distribution datasets. Refer to Figures 5 and 9 for qualitative results.
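For readers who want to compute comparable error metrics, the following is a minimal NumPy sketch of MSE, MAE and PSNR on a pair of volumes. The `data_range` default of 1.0 is an assumption (the intensity normalization used for the tables is not stated in this section), so this sketch is not expected to reproduce the reported numbers; SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all voxels."""
    return float(np.mean((pred - target) ** 2))

def mae(pred, target):
    """Mean absolute error over all voxels."""
    return float(np.mean(np.abs(pred - target)))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; `data_range` is an assumed intensity span."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

def average_over_runs(values):
    """Average a metric over repeated runs, e.g. different random seeds."""
    return float(np.mean(values))

target = np.zeros((4, 4, 4))
pred = np.full((4, 4, 4), 0.1)
print(mse(pred, target), mae(pred, target), psnr(pred, target))
```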

Figure 5: CT projections from generated 3D volumes on various out-of-distribution lung datasets. Model weights were selected from iteration 5,000 on the LIDC-IDRI dataset [1]. GT projections are displayed in odd rows, while our model’s outputs are shown in even rows.
(a) In-distribution inputs (LIDC-IDRI dataset [1])

| Oblique views ablation | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|
| no alignment | 0.5001 | 24.0513 | 0.0178 | 0.0682 |
| coarse alignment | 0.4248 | 22.5182 | 0.0259 | 0.0872 |

(b) Out-of-distribution inputs (COVID-19-NY-SBU dataset [37])

| Oblique views ablation | SSIM ↑ | PSNR ↑ | MSE ↓ | MAE ↓ |
|---|---|---|---|---|
| no alignment | 0.3581 | 21.9844 | 0.0272 | 0.0881 |
| coarse alignment | 0.3365 | 21.0851 | 0.0340 | 0.1026 |

Figure 6: Results on oblique views with coarse alignment.
Figure 7: Comparison between only transposing perpendicular views (indicated by a star marker) and a coarse alignment (triangle marker), using two, four, and eight input views. Smoother colors represent results for out-of-distrib. inputs (MIDRC dataset [43]).
Figure 8: Gradient magnitudes with respect to the input views (X-rays) of the proposed image translation model. The upper row displays the variability within slices extracted from the 3D input volume of the same patient, while the bottom row shows the attribution among slices from different patients.
Figure 9: Examples of the correlated 3D projections from our proposed approach compared to alternative supervised methods X2CT-GAN [49] and CCX-rayNet [35].
Gradients & patient-specific learning

In Figure 8, we visualize the magnitudes of the gradients of our model with respect to a varying number of input views. We observe higher variability across different patients, indicated by the color variations for one, two, four, and eight views independently. On the other hand, if we analyze intra-patient gradients, we notice lower variability, suggesting consistent patterns captured in views coming from the same patient. With our approach, the generative model implicitly learns patient-specific representations, where outputs reflect differences in anatomical structures and/or pathological conditions present in the input data from each patient.
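As a toy illustration of what a gradient magnitude with respect to the input means, the sketch below approximates it by central finite differences. It is not the procedure used for Figure 8, which would normally rely on automatic differentiation of the trained network; `toy_model`, the weights `w`, and the choice to sum the outputs into a scalar are all hypothetical.

```python
import numpy as np

def input_gradient_magnitude(model, x, eps=1e-3):
    """Approximate |d sum(model(x)) / dx| for every input element."""
    x = np.asarray(x, dtype=np.float64)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += eps
        x_minus[idx] -= eps
        # Central difference of the summed output with respect to one input element.
        grad[idx] = (np.sum(model(x_plus)) - np.sum(model(x_minus))) / (2.0 * eps)
    return np.abs(grad)

# Hypothetical toy "model": an element-wise weighted square.
w = np.array([[1.0, 2.0], [3.0, 4.0]])

def toy_model(v):
    return w * v ** 2

x_in = np.array([[0.5, -1.0], [2.0, 0.0]])
attribution = input_gradient_magnitude(toy_model, x_in)
print(attribution)
```

Comparing such maps across slices from one patient, and then across patients, is the kind of intra- versus inter-patient variability the paragraph above describes.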

View alignment

We study the effect of aligning the input views during the preprocessing stage prior to the repeat and concatenate operations. Initially, we experimented without any alignment, which resulted in outputs containing checkerboard-like artifacts that diminished after an additional 1k training iterations. However, a simple transposition of perpendicular views from the coronal and sagittal planes alleviated this. To test our model’s capacity on views outside of such geometry, we tested on four oblique projections, for which we find that our model performs equally well without relying on any sort of alignment (Fig. 7). We also investigated a coarse alignment by approximating their locations through 3D rotations using data transformations from the Kornia library (Figs. 6, 7). However, we observe that this might require a precise and potentially non-affine transformation, which is non-trivial. Overall, we find that the model learns some invariance to this, where a simple transpose consistently yields improved results without apparent artifacts (Fig. 7). While precise projection alignment in 3D space could potentially enhance results, we consider there may be a tradeoff between alignment robustness (e.g., due to patient movement) and generation accuracy. In the future, an iterative 3D alignment approach would be worth investigating as a secondary objective.
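One plausible reading of the ‘simple transposition of perpendicular views’ is that the lateral view is repeated like the frontal one and then has two axes swapped, so that each view is constant along its own projection direction. The sketch below illustrates that reading only; the axis convention `(Z, Y, X)`, the mean-intensity projections, and the function names are assumptions rather than details taken from the paper.

```python
import numpy as np

def repeat_frontal(frontal, size_y):
    """Frontal view of shape (Z, X) repeated along Y, giving (Z, Y, X)."""
    return np.repeat(frontal[:, np.newaxis, :], size_y, axis=1)

def repeat_lateral(lateral, size_x):
    """Lateral view of shape (Z, Y) repeated, then transposed to (Z, Y, X)."""
    repeated = np.repeat(lateral[:, np.newaxis, :], size_x, axis=1)  # (Z, X, Y)
    return np.transpose(repeated, (0, 2, 1))                          # (Z, Y, X)

rng = np.random.default_rng(0)
ct = rng.random((4, 5, 6))        # hypothetical CT volume with axes (Z, Y, X)
frontal = ct.mean(axis=1)         # toy projection along Y, shape (Z, X)
lateral = ct.mean(axis=2)         # toy projection along X, shape (Z, Y)

frontal_vol = repeat_frontal(frontal, size_y=ct.shape[1])
lateral_vol = repeat_lateral(lateral, size_x=ct.shape[2])
stacked = np.stack([frontal_vol, lateral_vol], axis=0)   # (2, Z, Y, X)
print(stacked.shape)
```

With this layout, re-projecting each channel along its own viewing axis returns the original view, which is one way to check that the two channels are geometrically consistent with the target volume.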

4.1 Implementation Details

Our training algorithm is based on the neural optimal transport approach by Korotin et al. [23]. In this approach, the mapping network $g_{\varphi}$ undergoes $k=10$ training iterations while the parameters of the potential network $d_{\phi}$ remain frozen. Then, a single training step is performed for $d$, unfreezing its parameters while freezing those of $g$. Our feature extractor $f$ is parameterized by a 3D convolutional neural network with Kaiming weight initialization. We set $\lambda=0.1$ to control the degree of regularization of the reconstructed outputs by $d$. Generation quality could be improved by scaling to higher resolutions and adjusting the $\lambda$ parameter. Our networks are trained using the AdamW optimizer with parameters $\beta_{1}=0.5$, $\beta_{2}=0.999$, $\epsilon=0.001$, and a learning rate of $10^{-5}$ with weight decay. We apply a combination of differentiable augmentations [51], including random contrast, rotation, and horizontal flip, to the input 2D X-rays in all experiments. 3D reconstructions are saved in both NumPy and .mha formats for visualization using medical image software such as 3D Slicer.
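The alternating update schedule described above can be sketched as a plain Python loop. This shows only the control flow; `g_step` and `d_step` are hypothetical callables standing in for one optimizer step on the mapping network and the potential network respectively, and the actual losses are not reproduced here.

```python
def train_alternating(num_rounds, g_step, d_step, k=10):
    """Run `num_rounds` rounds of k mapping-network steps, then one potential step."""
    for _ in range(num_rounds):
        for _ in range(k):
            g_step()  # update g while d is held fixed
        d_step()      # update d while g is held fixed

# Toy usage: count how often each (hypothetical) step function is called.
calls = {"g": 0, "d": 0}

def toy_g_step():
    calls["g"] += 1

def toy_d_step():
    calls["d"] += 1

train_alternating(num_rounds=3, g_step=toy_g_step, d_step=toy_d_step, k=10)
print(calls)
```

In a real training script, each step function would also toggle which parameters receive gradients, matching the freeze and unfreeze behavior described in the text.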

5 Limitations and future directions

The approach, while simple and intuitive, has several clear limitations. Firstly, our current architecture attempts the non-linear transformation in a ‘single’ transformative step, resulting in uncertainty and therefore blurriness. We expect improved performance through an iterative multi-step transformation, such as a modern probabilistic diffusion generative model. Secondly, while we found some invariance to the input alignment, we would like to investigate whether the concatenated 3D inputs could be better aligned to match the target 3D locations and therefore support better modeling of some high-frequency details near their corresponding specific 3D slices. For example, we could potentially optimize the input affine transformations, in particular their roto-translations, according to a secondary objective, repeated at inference as part of the modeling.

6 Conclusion

In conclusion, we found that simply repeating the 2D inputs into concatenated 3D volumes, then treating the 2D to 3D translation problem as a 3D to 3D translation task, leads to improved correlation between the synthesized outputs over more sophisticated multi-view latent integration approaches. While we expect other off-the-shelf 3D to 3D conditional generative modeling approaches to be immediately applicable within our framework, we found particular success by directly applying 3D to 3D neural optimal transport to attain highly correlated 3D synthesized outputs with their corresponding 2D inputs. The overall proposed approach is fast to train, data efficient, stable, and generalizes well to new out-of-distribution views, making it applicable in a real-world clinical setting. However, it does exhibit blurriness where there is uncertainty in the outputs; in the future it would be worth investigating iterative alignment optimization for the repeated 3D inputs as a secondary objective to the generative modeling, repeated during inference, potentially mitigating this uncertainty.

Acknowledgments

This work was supported by CONAHCyT and Durham University.

References

  • Armato III et al. [2011] S. G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical physics, 38(2):915–931, 2011.
  • Armato III et al. [2015] Samuel G Armato III, Lubomir Hadjiiski, Georgia D Tourassi, Karen Drukker, Maryellen L Giger, Feng Li, George Redmond, Keyvan Farahani, Justin S Kirby, and Laurence P Clarke. SPIE-AAPM-NCI Lung Nodule Classification Challenge Dataset. TCIA., 2015.
  • Bakr et al. [2018] Shaimaa Bakr, Olivier Gevaert, Sebastian Echegaray, Kelsey Ayers, Mu Zhou, Majid Shafiq, Hong Zheng, Jalen Anthony Benson, Weiruo Zhang, Ann NC Leung, et al. A radiogenomic dataset of non-small cell lung cancer. Scientific data, 5(1):1–9, 2018.
  • Bond-Taylor et al. [2021] Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. IEEE TPAMI, 44(11):7327–7347, 2021.
  • Corona-Figueroa et al. [2022] Abril Corona-Figueroa, Jonathan Frawley, Sam Bond-Taylor, Sarath Bethapudi, Hubert PH Shum, and Chris G Willcocks. MedNeRF: Medical Neural Radiance Fields for Reconstructing 3D-aware CT-Projections from a Single X-ray. In 2022 44th Annual Int. Conf. of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 3843–3848. IEEE, 2022.
  • Corona-Figueroa et al. [2023] Abril Corona-Figueroa, Sam Bond-Taylor, Neelanjan Bhowmik, Yona Falinie A. Gaus, Toby P. Breckon, Hubert P. H. Shum, and Chris G. Willcocks. Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers. In Proc. of the IEEE/CVF Int. Conf. on Comput. Vis. (ICCV), pages 14585–14594, 2023.
  • Feng et al. [2023] Qi Feng, Hubert PH Shum, and Shigeo Morishima. Enhancing Perception and Immersion in Pre-Captured Environments through Learning-Based Eye Height Adaptation. In 2023 IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR), pages 405–414. IEEE, 2023.
  • Feydy [2020] Jean Feydy. Geometric data analysis, beyond convolutions. Applied Mathematics, 2020.
  • Feydy et al. [2019] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. In The 22nd Int. Conf. on Artificial Intelligence and Statistics, pages 2681–2690. PMLR, 2019.
  • Fortuin [2022] Vincent Fortuin. Priors in Bayesian Deep Learning: A Review. Int. Statistical Review, 90(3):563–591, 2022.
  • Garcea et al. [2023] Fabio Garcea, Alessio Serra, Fabrizio Lamberti, and Lia Morra. Data augmentation for medical imaging: A systematic literature review. Computers in Biology and Medicine, 152:106391, 2023.
  • Genevay et al. [2018] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. In Int. Conf. on Artificial Intelligence and Statistics, pages 1608–1617. PMLR, 2018.
  • Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Gozlan et al. [2017] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327–3405, 2017.
  • Gunduzalp et al. [2021] Doga Gunduzalp, Batuhan Cengiz, Mehmet Ozan Unal, and Isa Yildirim. 3D U-NetR: Low Dose Computed Tomography Reconstruction via Deep Learning and 3 Dimensional Convolutions. arXiv preprint arXiv:2105.14130, 2021.
  • Hatamizadeh et al. [2021] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. In Int. MICCAI Brainlesion Workshop, pages 272–284. Springer, 2021.
  • Isaac-Medina et al. [2022] Brian KS Isaac-Medina, Neelanjan Bhowmik, Chris G Willcocks, and Toby P Breckon. Cross-modal Image Synthesis within Dual-Energy X-ray Security Imagery. In Proc. of the IEEE/CVF Conf. on Comput. Vis. and Pattern Recog., pages 333–341, 2022.
  • Jiang et al. [2021] Ling Jiang, Mengxi Zhang, Ran Wei, Bo Liu, Xiangzhi Bai, and Fugen Zhou. Reconstruction of 3D CT from A Single X-ray Projection View Using CVAE-GAN. In 2021 IEEE Int. Conf. on Medical Imaging Physics and Engineering (ICMIPE), pages 1–6, 2021.
  • Jin et al. [2023] Hao Jin, Minghui Lian, Shicheng Qiu, Xuxu Han, Xizhi Zhao, Long Yang, Zhiyi Zhang, Haoran Xie, Kouichi Konno, and Shaojun Hu. A Semi-automatic Oriental Ink Painting Framework for Robotic Drawing from 3D Models. IEEE Robotics and Automation Letters, 2023.
  • Karras et al. [2020] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Adv. Neural Inform. Process. Syst., 33:12104–12114, 2020.
  • Kebaili et al. [2023] Aghiles Kebaili, Jérôme Lapuyade-Lahorgue, and Su Ruan. Deep Learning Approaches for Data Augmentation in Medical Imaging: A Review. J. of Imaging, 9(4):81, 2023.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Korotin et al. [2022] Alexander Korotin, Daniil Selikhanovych, and Evgeny Burnaev. Neural Optimal Transport. In Int. Conf. on Learn. Represent., 2022.
  • Kyung et al. [2023] Daeun Kyung, Kyungmin Jo, Jaegul Choo, Joonseok Lee, and Edward Choi. Perspective Projection-Based 3d CT Reconstruction from Biplanar X-Rays. In ICASSP 2023-2023 IEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), pages 1–5. IEEE, 2023.
  • Lee et al. [2023] Alex Junho Lee, Seungwon Song, Hyungtae Lim, Woojoo Lee, and Hyun Myung. (LC)2: LiDAR-Camera Loop Constraints For Cross-Modal Place Recognition. IEEE Robotics and Automation Letters, 2023.
  • Liu et al. [2023] Yanbin Liu, Girish Dwivedi, Farid Boussaid, Frank Sanfilippo, Makoto Yamada, and Mohammed Bennamoun. Inflating 2D convolution weights for efficient generation of 3D medical images. Comput. Methods and Programs in Biomedicine, 240:107685, 2023.
  • Liu et al. [2024] Zhenyu Liu, Qide Wang, Daxin Liu, and Jianrong Tan. PA-Pose: Partial point cloud fusion based on reliable alignment for 6D pose tracking. Pattern Recog., 148:110151, 2024.
  • Maas et al. [2023] Kirsten WH Maas, Nicola Pezzotti, Amy JE Vermeer, Danny Ruijters, and Anna Vilanova. Nerf for 3d reconstruction from x-ray angiography: Possibilities and limitations. In VCBM 2023: Eurographics Workshop on Visual Computing for Biology and Medicine, pages 29–40. Eurographics Association, 2023.
  • Madhavi et al. [2019] P Madhavi, S Patel, and AS Tsao. Data from Anti-PD-1 Immunotherapy Lung [Data set]. TCIA. TCIA., 10, 2019.
  • Maken and Gupta [2023] Payal Maken and Abhishek Gupta. 2D-to-3D: a review for computational 3D image reconstruction from X-ray images. Archives of Computational Methods in Engineering, 30(1):85–114, 2023.
  • Moradi et al. [2020] Reza Moradi, Reza Berangi, and Behrouz Minaei. A survey of regularization strategies for deep models. Artificial Intelligence Review, 53:3947–3986, 2020.
  • Nashed et al. [2021] Youssef SG Nashed, Frederic Poitevin, Harshit Gupta, Geoffrey Woollard, Michael Kagan, Chuck Yoon, and Daniel Ratner. End-to-End Simultaneous Learning of Single-particle Orientation and 3D Map Reconstruction from Cryo-electron Microscopy Data. arXiv preprint arXiv:2107.02958, 2021.
  • Oh et al. [2023] Sang Hyeon Oh, Kwak Dong Hwan, and Hyun Tek Lim. 2.5 D SLAM Algorithm with Novel Data Fusion Method Between 2D-Lidar and Camera. In 2023 23rd Int. Conf. on Control, Automation and Syst. (ICCAS), pages 1904–1907. IEEE, 2023.
  • Paavilainen et al. [2021] Pauliina Paavilainen, Saad Ullah Akram, and Juho Kannala. Bridging the gap between paired and unpaired medical image translation. In MICCAI Workshop on Deep Generative Models, pages 35–44. Springer, 2021.
  • Ratul et al. [2021] Md Aminur Rab Ratul, Kun Yuan, and WonSook Lee. CCX-rayNet: a class conditioned convolutional neural network for biplanar X-rays to CT volume. In 2021 IEEE 18th Int. Symposium on Biomedical Imaging (ISBI), pages 1655–1659. IEEE, 2021.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Comput.-Assisted Intervention–MICCAI 2015: 18th Int. Conf., Munich, Germany, October 5-9, 2015, Proc., Part III 18, pages 234–241. Springer, 2015.
  • Saltz et al. [2021] Joel Saltz, Mary Saltz, Prateek Prasanna, Richard Moffitt, Janos Hajagos, Erich Bremer, Joseph Balsamo, and Tahsin Kurc. Stony Brook University COVID-19 Positive Cases [Data set]. TCIA. BBAG-2923, 2021.
  • Sasaki et al. [2021] Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon. UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2104.05358, 2021.
  • Song et al. [2023] Bowen Song, Liyue Shen, and Lei Xing. PINER: Prior-informed Implicit Neural Representation Learning for Test-time Adaptation in Sparse-view CT Reconstruction. In Proc. of the IEEE/CVF Winter Conf. on Applications of Comput. Vis., pages 1928–1938, 2023.
  • Stojanovski et al. [2022] David Stojanovski, Uxio Hermida, Marica Muffoletto, Pablo Lamata, Arian Beqiri, and Alberto Gomez. Efficient Pix2Vox++ for 3D Cardiac Reconstruction from 2D echo views. In Int. Workshop on Advances in Simplifying Medical Ultrasound, pages 86–95. Springer, 2022.
  • Tian and Zhang [2022] Yingjie Tian and Yuqi Zhang. A comprehensive survey on regularization strategies in machine learning. Inform. Fusion, 80:146–166, 2022.
  • Tolstikhin et al. [2018] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In Int. Conf. on Learn. Represent., 2018.
  • Tsai et al. [2020] E Tsai, S Simpson, MP Lungren, M Hershman, L Roshkovan, E Colak, BJ Erickson, G Shih, A Stein, J Kalpathy-Cramer, et al. Medical Imaging Data Resource Center (MIDRC)-RSNA Int. COVID-19 Open Radiology Database (RICORD) Release 1b-Chest CT Covid-(MIDRC-RICORD-1B)[dataset]. TCIA., 202010, 2020.
  • Ulyanov et al. [2018] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proc. of the IEEE Conf. on Comput. Vis. and pattern recognition, pages 9446–9454, 2018.
  • Wang et al. [2020] Ge Wang, Jong Chul Ye, and Bruno De Man. Deep learning for tomographic image reconstruction. Nature machine intelligence, 2(12):737–748, 2020.
  • Weinberger et al. [2020] Ethan Weinberger, Joseph Janizek, and Su-In Lee. Learning deep attribution priors based on prior knowledge. Advances in Neural Inform. Process. Syst., 33:14034–14045, 2020.
  • Xiao et al. [2022] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In Int. Conf. on Learn. Represent., 2022.
  • Yang et al. [2017] Jinzhong Yang, Greg Sharp, Harini Veeraraghavan, Wouter Van Elmpt, Andre Dekker, Tim Lustberg, and Mark Gooding. Data from lung CT segmentation challenge. TCIA., 2017.
  • Ying et al. [2019] Xingde Ying, Heng Guo, Kai Ma, Jian Wu, Zhengxin Weng, and Yefeng Zheng. X2CT-GAN: reconstructing CT from biplanar X-rays with generative adversarial networks. In Proc. of the IEEE/CVF Conf. on Comput. Vis. and Pattern Recog., pages 10619–10628, 2019.
  • Zhan et al. [2023] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing. Multimodal image synthesis and editing: A survey and taxonomy. IEEE TPAMI, 2023.
  • Zhao et al. [2020] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient gan training. Adv. Neural Inform. Process. Syst., 33:7559–7570, 2020.
  • Zuo [2021] Jingyi Zuo. 2D to 3D Neurovascular Reconstruction from Biplane View via Deep Learning. In Int. Conf. on Comput. and Data Sci., pages 383–387, 2021.