1 Nuffield Department of Population Health, University of Oxford, UK
2 Big Data Institute, University of Oxford, UK
3 Oxford Radiology Research Unit, Oxford University Hospitals NHS Foundation Trust, UK
Email: {jiahua.li@ndph, bartlomiej.papiez@bdi}.ox.ac.uk

Multimodal Deformable Image Registration for Long-COVID Analysis Based on Progressive Alignment and Multi-perspective Loss

Jiahua Li (ORCID 0009-0009-7903-9875) 1,2, James T. Grist (ORCID 0000-0001-7223-4031) 3, Fergus V. Gleeson (ORCID 0000-0002-5121-3917) 3, Bartłomiej W. Papież (ORCID 0000-0002-8432-2511) 1,2
Abstract

Long COVID is characterized by persistent symptoms, particularly pulmonary impairment, which necessitates advanced imaging for accurate diagnosis. Hyperpolarised Xenon-129 MRI (XeMRI) offers a promising avenue by visualising lung ventilation, perfusion, and gas transfer. Integrating functional data from XeMRI with structural data from Computed Tomography (CT) is crucial for comprehensive analysis and effective treatment strategies in long COVID, and it requires precise alignment of data from these complementary imaging modalities. To this end, CT-MRI registration is an essential intermediate step, given the significant challenges posed by the direct alignment of CT and XeMRI. Therefore, we propose an end-to-end multimodal deformable image registration method that achieves superior performance for aligning long-COVID lung CT and proton density MRI (pMRI) data. Moreover, our method incorporates a novel Multi-perspective Loss (MPL) function, enhancing state-of-the-art deep learning methods for monomodal registration by making them adaptable for multimodal tasks. The registration results achieve a Dice coefficient score of 0.913, indicating a substantial improvement over the state-of-the-art multimodal image registration techniques. Since the XeMRI and pMRI images are acquired in the same sessions and can be roughly aligned, our results facilitate subsequent registration between XeMRI and CT, thereby potentially enhancing clinical decision-making for long COVID management.

Keywords: Medical image registration, Multimodal image registration, Progressive learning

1 Introduction

With over 651 million documented COVID-19 cases worldwide, current conservative estimates suggest that around 65 million people are suffering from long COVID [5]. Moreover, a number of patients with long COVID present no findings on Computed Tomography (CT), and more advanced imaging techniques such as hyperpolarized Xenon MRI (XeMRI) have to be utilised to detect lung abnormalities [9]. While XeMRI provides insight into lung function, it needs to be analysed with respect to the underlying anatomy (shown e.g. in CT) to be utilised in clinical decision-making, which in turn requires multimodal image registration.

Monomodal deformable image registration (DIR) is regarded as a non-trivial task, due to patient motion [20, 16, 6, 2] (for longitudinal studies) or subject variability [8] (for cross-sectional studies). The complexity of DIR increases further in multimodal scenarios because of differences in intensities between images acquired to visualise diverse physical phenomena, e.g. CT or MRI, where each modality relies upon different physical properties of tissue to create images. Since multimodal DIR gives clinicians more comprehensive insights into a patient’s condition, benefiting diagnostic accuracy and personalised treatment plans, efficient multimodal DIR is critical, and many methods have been suggested [23]. However, statistical and information theory-based methods suffer from computational complexity and slow convergence [26, 18, 13], while descriptor-based methods prove sensitive to initial conditions and require effective pre-alignment to handle extensive translations [11, 12]; their reliance on hand-crafted features calls for domain expertise for fine-tuning and restricts their adaptability. More recently, Convolutional Neural Networks (CNNs) have been utilised to learn a standard representation for DIR by optimising a similarity metric [14, 15, 10]. However, selecting appropriate similarity metrics proves challenging since multimodal images can exhibit differences in intrinsic intensity distribution and resolution, so the effectiveness of learning-based methods has been limited mainly to monomodal scenarios [4, 27, 29, 28]. Alternatively, multimodal DIR can be transformed into a less complex monomodal task utilising image-to-image (I2I) translation [21]. Nonetheless, such translation can potentially result in shape inconsistency and produce artificial anatomical features, further deteriorating the performance of the DIR.

The focus of this work is on the DIR between CT and proton MRI (pMRI), a process of significance to the analysis of XeMRI. Owing to its non-ionising characteristics, XeMRI has gained considerable interest for long COVID, primarily because it captures images related to lung ventilation, perfusion, and gas transfer in the lungs [1, 19, 25, 9]. Since XeMRI does not provide anatomical information, the alignment of XeMRI images with pMRI and CT is essential. pMRI is typically acquired in the same imaging session as XeMRI, albeit not within the same breath-hold, while CT is taken a couple of days prior. This poses a challenge when attempting to fuse XeMRI with CT, thus necessitating DIR between pMRI and CT.

Contributions of our work are as follows. To overcome the aforementioned limitations, we propose a multimodal, end-to-end method based on a progressive alignment architecture which can tackle significant deformations (Sec. 2.2). Moreover, we introduce a novel Multi-perspective Loss (MPL) function, applicable to any existing monomodal DIR architecture, extending its application to multimodal image registration (Sec. 2.3). Lastly, our method was evaluated on a challenging long-COVID lung CT and pMRI dataset, where it achieved a Dice coefficient (DSC) of 0.91, outperforming the state-of-the-art models for multimodal DIR (Sec. 3). To the best of our knowledge, this is the first effort to automate multimodal deformable image registration for long-COVID CT and pMRI.

2 Methodology

2.1 Overview

As seen in Fig. 1, DIR aims to estimate a non-linear voxel-to-voxel correspondence between a fixed image $F$ and a moving image $M$, in which the estimated transformation is parameterized with $\phi$:

$$\phi = f_{\theta}(M,F) \tag{1}$$

with $f$ and $\theta$ corresponding to the utilised neural networks and the networks’ learning parameters, respectively. Our method uses two 3D images as input: the pMRI image (the moving image) and the CT image serving as a reference (the fixed image). These are introduced into a cascading sequence of 3D CNNs (described in Sec. 2.2) to extract distinctive feature maps from both input images. Furthermore, we use a novel loss function (described in Sec. 2.3) that combines Mutual Information (MI) and Gaussian Pyramid labels to capture both global and local intensity information. In this section, the workflow of our methodology is outlined, with detailed description in the subsequent subsections.

2.2 Progressive Alignment Architecture

As a consequence of the significant deformation observed across diverse modalities, estimation of the displacement field in one attempt proves to be challenging. Thus, the model is implemented iteratively to ensure progressive refinement (see Fig. 1). The first iteration aims to establish the coarse transformation, while subsequent refinements estimate finer transformations. Specifically, the suggested model is initiated by a network that predicts an affine transformation matrix with 12 degrees of freedom (denoted by $\phi_{\text{affine}}$ in Fig. 1). The network for affine transformation has four downsampling residual-network blocks (ResBlock). The final convolutional layer employs a fully-connected matrix, subject to learning, to create a linear projection, producing a vector encompassing 12 parameters for affine transformation. Following the network for the affine alignment, cascades of registration networks (sharing weights) predicting dense displacement fields (DDF) are employed to estimate local (non-rigid) deformation $\phi_{n}$. Similarly, each cascaded network has a Voxelmorph-style architecture [4], replacing the encoder component with four downsampling ResBlocks. The affine transformation and the DDFs are recursively estimated by the Spatial Transformer Network (STN) [17] to produce the final DDF $\phi$. Considering the $n$-th cascade, the output will be estimated according to the DDF $\phi_{n-1}$ from the $(n-1)$-th cascade:

$$f_{\theta}(M,F) = (\phi_{n-1}\circ\phi_{n}) + \phi_{n} \tag{2}$$

where $\circ$ corresponds to the warping operation facilitated by a trilinear image resampler. Theoretically, this recursive process can be infinitely applied. Hence, the input image $M$ becomes warped by the final DDF $\phi$ estimated according to its affine transformation and multiple cascades of deformable transformation, resulting in the registered image $M^{\prime}$, represented as:

$$M^{\prime} = \phi\circ M \tag{3}$$
Figure 1: Illustration of our proposed architecture: The network initially predicts the affine transformation $\phi_{\text{affine}}$. Subsequently, cascaded networks iteratively compute the displacement fields. At each cascade $n$, the network predicts the displacement field $\phi_{n}$ using the input image $M^{(n)}$ warped by the displacement field $\phi_{n-1}$ from the previous cascade $n-1$.
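The recursive update of Eq. (2) can be illustrated with a minimal 1D sketch. This is not the authors' implementation: the function names `warp_1d` and `compose_1d` are hypothetical, linear interpolation stands in for the trilinear resampler, and clamping at the borders is an assumption made only to keep the example self-contained.

```python
def warp_1d(values, disp):
    """Resample `values` at positions x + disp[x] with linear interpolation.

    Positions are clamped to the valid index range (an assumption of this sketch).
    """
    n = len(values)
    out = []
    for x in range(n):
        pos = min(max(x + disp[x], 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append((1.0 - w) * values[lo] + w * values[hi])
    return out


def compose_1d(prev_ddf, new_ddf):
    """Eq. (2)-style update: warp the previous field by the new one, then add the new field."""
    warped_prev = warp_1d(prev_ddf, new_ddf)
    return [a + b for a, b in zip(warped_prev, new_ddf)]


# Two constant shifts accumulate into a single larger shift.
total = compose_1d([1.0] * 8, [2.0] * 8)
```

In 3D the same pattern applies per displacement component, with the resampler replaced by a trilinear one.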

2.3 Multi-perspective Loss

The deformations observed in the thorax and the complementary information captured by pMRI and CT require the model to pay attention not only to local information (like edges, textures, and corners) but also to long-range (dis)similarities. To address this challenge, we propose a novel loss function: the multi-perspective loss (MPL), which includes the Mutual Information (MI) loss and the Gaussian Pyramid label (GPL) loss. The MI loss,

$$MI_{\text{loss}}(M,F) = -\sum_{m\in M}\sum_{f\in F} p(m,f)\log\left(\frac{p(m,f)}{p(m)p(f)}\right), \tag{4}$$

quantifying the statistical discrepancy between two images, focuses on global alignment. Simultaneously, to overcome its limitations for local alignment, the GPL loss is employed, which uses Gaussian filters,

$$G(x,y,z) = \frac{1}{(2\pi\sigma^{2})^{3/2}}\exp\left(-\frac{x^{2}+y^{2}+z^{2}}{2\sigma^{2}}\right), \tag{5}$$

to derive feature pyramids across various scales, thereby facilitating the capture of local correspondences between images. Specifically, segmentation labels from MRI $M_{label}$ and CT $F_{label}$ images were filtered by 3D Gaussian kernels, operating at six separate standard deviation scales, $\sigma\in\{0,1,2,4,8,16\}$. The higher scales encourage the model to focus on the entire lung cavity, while the lower scales target more local features, such as edges and corners. This dual focus enables the alignment of both large-scale structural features and smaller, more intricate details, thus addressing the MI loss’s limitation of neglecting anatomical information. As such, the resulting loss function is denoted as follows:

$$
\begin{aligned}
L(M,F,\phi) &= \alpha L_{MI}(M,F,\phi) \\
&+ \beta L_{GPL}(M_{label},F_{label},\phi) \\
&+ \lambda L_{reg}(\phi)
\end{aligned}
\tag{6}
$$

where $L_{MI}$ and $L_{GPL}$ represent the MI and GPL losses. The function $L_{reg}$ is a regularisation term using a weighted bending energy [22] to penalise local spatial variations in $\phi$, ensuring a smooth displacement field. The parameters $\alpha$, $\beta$ and $\lambda$ serve as the weighting coefficients, modulating the contribution of every corresponding term in the loss function, respectively.
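To make the structure of Eqs. (4) to (6) concrete, the following is a simplified 1D sketch in plain Python. Several choices are assumptions not stated in the text: the probabilities in Eq. (4) are estimated with a fixed-width joint histogram, the per-scale label similarity is a soft Dice overlap averaged over scales, the Gaussian kernel is truncated at three standard deviations with replicated borders, and the regularisation term is passed in as a precomputed number. All function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(moving, fixed, bins=4, lo=0.0, hi=1.0):
    """Histogram estimate of MI (Eq. 4 without the sign) for two intensity lists."""
    def bin_of(v):
        idx = int((v - lo) / (hi - lo) * bins)
        return min(max(idx, 0), bins - 1)

    pairs = [(bin_of(m), bin_of(f)) for m, f in zip(moving, fixed)]
    n = len(pairs)
    joint = Counter(pairs)
    p_m = Counter(i for i, _ in pairs)
    p_f = Counter(j for _, j in pairs)
    return sum(
        (c / n) * math.log((c / n) / ((p_m[i] / n) * (p_f[j] / n)))
        for (i, j), c in joint.items()
    )


def gaussian_kernel(sigma):
    """Normalised 1D Gaussian weights, truncated at 3 sigma (assumption)."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(label, sigma):
    """Gaussian-filter a label; sigma = 0 returns the unfiltered label."""
    if sigma == 0:
        return [float(v) for v in label]
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(label)
    return [
        sum(w * label[min(max(x + k - radius, 0), n - 1)] for k, w in enumerate(kernel))
        for x in range(n)
    ]


def soft_dice(a, b, eps=1e-6):
    """Soft Dice overlap with squared terms in the denominator."""
    inter = sum(x * y for x, y in zip(a, b))
    return (2.0 * inter + eps) / (sum(x * x for x in a) + sum(y * y for y in b) + eps)


def gpl_loss(moving_label, fixed_label, sigmas=(0, 1, 2, 4, 8, 16)):
    """Average (1 - soft Dice) over the six Gaussian scales named in the text."""
    per_scale = [1.0 - soft_dice(smooth(moving_label, s), smooth(fixed_label, s)) for s in sigmas]
    return sum(per_scale) / len(per_scale)


def multi_perspective_loss(moving, fixed, moving_label, fixed_label, reg_term,
                           alpha=1.0, beta=1.0, lam=2.0):
    """Weighted sum in the form of Eq. (6); `reg_term` is supplied by the caller."""
    mi_term = -mutual_information(moving, fixed)
    return alpha * mi_term + beta * gpl_loss(moving_label, fixed_label) + lam * reg_term
```

The MI term is unchanged when the intensities of one image are inverted, which is why it suits CT-pMRI pairs, while the label pyramid supplies the anatomical signal that MI alone lacks.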

3 The Experiment

3.1 Dataset

We conducted an assessment of the proposed method using an in-house Post-COVID Assessment Clinic dataset, including 46 pairs of CT, pMRI and XeMRI images. Specifically, MRI was performed at 3 T (GE Healthcare, Premier) using a phased array thoracic imaging coil (30 channels). Proton imaging consisted of a 3D spoiled gradient echo sequence, characterized by a Repetition Time (TR) of 3.1 ms, Echo Time (TE) of 1 ms, Field of View (FOV) of 400 mm, slice thickness of 5 mm, an acquisition matrix of $256\times 128$, a reconstruction matrix of $256\times 256$, number of slices = 36, performed in a single breath-hold, with a bandwidth of 62.5 kHz and a flip angle of 20 degrees.

Subsequent to inhalation of 1L of polarized Xenon-129, XeMRI was acquired using a Transmit/Receive vest coil (PulseTeq, Cobham, UK) employing a 4-echo radial sequence with TR = 23 ms, an acquisition matrix of $16\times 16\times 16$, a reconstruction matrix of $32\times 32\times 32$, FOV of 400 mm, a flip angle of 40 degrees, and using Iterative Decomposition of Water and Fat with Shifted Echo Times and Least Squares Regression (IDEAL) Reconstruction.

CT was performed using a GE Healthcare system with a section thickness of 0.625 mm and a slice resolution of $512\times 512$ after an inhalation of 1L of room air.

All images were resampled as isotropic, with a spatial resolution of $5\times 5\times 5\,\text{mm}^{3}$. Subsequently, the images were cropped based on the lung region, followed by padding to the size of $128\times 128\times 128$. The dataset was then randomly split into 30 pairs for training, 6 for validation, and 10 for testing. All results reported in this study are derived from the testing dataset.
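The geometry of this preprocessing can be sketched as follows. The helper names are hypothetical, the in-plane pMRI spacing of 1.5625 mm is derived here under the assumption that the 400 mm FOV spans the 256-voxel reconstruction matrix, and the interpolation method, crop margins and random seed are not specified in the text.

```python
import random


def resampled_shape(shape, spacing_mm, target_mm=5.0):
    """Voxel counts after resampling each axis to an isotropic target spacing."""
    return tuple(round(n * s / target_mm) for n, s in zip(shape, spacing_mm))


def symmetric_padding(shape, target=128):
    """(before, after) padding per axis needed to reach a cubic target size."""
    pads = []
    for n in shape:
        if n > target:
            raise ValueError("volume exceeds the target size along one axis")
        before = (target - n) // 2
        pads.append((before, target - n - before))
    return pads


def split_cases(case_ids, n_train=30, n_val=6, n_test=10, seed=0):
    """Random split into training, validation and testing subsets."""
    ids = list(case_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of cases")
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# pMRI before cropping: 256 x 256 x 36 voxels at an assumed 1.5625 x 1.5625 x 5 mm.
pmri_shape = resampled_shape((256, 256, 36), (1.5625, 1.5625, 5.0))
pmri_pads = symmetric_padding(pmri_shape)
train_ids, val_ids, test_ids = split_cases(range(46))
```

Under these assumptions the uncropped pMRI volume already fits inside the $128\times 128\times 128$ grid at 5 mm spacing, so padding rather than further downsampling brings it to the network input size.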

3.2 Implementation Details

Our method was implemented using PyTorch on an NVIDIA RTX6000 GPU. All models were trained for 300 epochs, with a batch size of 1, and evaluated with five-fold cross-validation. We set the number of cascades in our method to 5, which gave the best results among the configurations evaluated. The Adam optimiser was utilised, with a learning rate of $1\times 10^{-5}$. Lastly, the hyperparameters of our loss function, $\alpha$, $\beta$ and $\lambda$, are set to 1.0, 1.0 and 2.0. These values were tuned to balance training stability, registration accuracy, and transformation invertibility.
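For reference, the reported settings can be gathered into a single configuration object. This is only an illustrative sketch: the class and field names are hypothetical and do not correspond to any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in Sec. 3.2 (field names are hypothetical)."""
    epochs: int = 300
    batch_size: int = 1
    num_cascades: int = 5
    optimiser: str = "adam"
    learning_rate: float = 1e-5
    alpha: float = 1.0  # weight of the MI loss in Eq. (6)
    beta: float = 1.0   # weight of the GPL loss in Eq. (6)
    lam: float = 2.0    # weight of the bending-energy regulariser in Eq. (6)


config = TrainingConfig()
```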

3.3 Comparison with the state-of-the-art methods

The proposed model was benchmarked against the state-of-the-art iterative DIR: SyN [3, 24], and deep learning based DIR methods: VXM [4], RCN [27], and CompositeNet (CompNet) [15]. First, SyN was implemented using ANTsPy. Next, VXM used a U-Net for non-iterative registration, while RCN employed an iterative approach with a Volume Tweening Network (VTN) configuration [27]. Initially, VXM and RCN adopted Normalised Cross Correlation (NCC) loss and $L2$ variation loss as a regularisation. However, since the NCC loss leads to poor registration results, we further substituted the NCC loss with the proposed MPL (see Eq. 6). This comparison was conducted to underscore the superior applicability of the suggested loss across the state-of-the-art models. Furthermore, CompNet is a popular multimodal DIR method consisting of the GlobalNet and the LocalNet. As such, it encourages both global and local alignment, calculating the loss by seven scales of the DSC and the weighted bending energy as the regularisation [15, 22]. Registration accuracy was evaluated by measuring the overlap between registered and fixed segmentation masks with the Dice Similarity Coefficient (DSC) [7]. The percentage of negative Jacobian determinants on the estimated displacement fields ($\%J_{\phi}$) allowed for a further assessment of the transformation invertibility, with a lower $\%J_{\phi}$ indicating smoother transformations. The traditional methods are evaluated on the same testing data, while all the state-of-the-art deep learning-based methods are trained and tested on the same splits of the dataset.
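The two evaluation metrics can be sketched as below. The function names are hypothetical, masks are flattened binary lists, and the Jacobian is computed for a 1D displacement field with forward differences as a simplification; in 3D the same idea uses the determinant of the $3\times 3$ Jacobian of the mapping $x + \phi(x)$ at each voxel.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A and B| / (|A| + |B|) for two flattened binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 2.0 * inter / total if total else 1.0


def percent_negative_jacobian_1d(displacement):
    """Percentage of locations where d(x + u(x))/dx is negative (folding)."""
    jac = [1.0 + (displacement[i + 1] - displacement[i]) for i in range(len(displacement) - 1)]
    negative = sum(1 for j in jac if j < 0)
    return 100.0 * negative / len(jac)


overlap = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
folding = percent_negative_jacobian_1d([0.0, 0.0, -3.0, 0.0])
```

In the second call the displaced positions are 0, 1, -1 and 3, so the mapping folds back on itself once, which is the kind of non-invertible behaviour that $\%J_{\phi}$ counts.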

Table 1: Quantitative evaluation results of the proposed and comparison methods.

| Methods | Loss Function | DSC | $\%J_{\phi}$ |
|---|---|---|---|
| Initial | - | 0.671 | - |
| SyN | MI | 0.693 | - |
| VXM | NCC + Dice Loss | 0.691 | 0.31% |
| VXM | MIND | 0.701 | 0.35% |
| VXM | MPL | 0.789 | 0.51% |
| RCN | NCC + Dice Loss | 0.695 | 0.49% |
| RCN | MPL | 0.895 | 1.89% |
| CompNet | Multi-scale Dice Loss | 0.848 | 0.76% |
| Ours | MPL | 0.913 | 0.89% |

4 Results and Discussion

4.1 Registration Results

Figure 2: Visualisation of CT-pMRI registration through a wide range of methods: The top row reflects the input pMRI image before registration and the results from state-of-the-art methods, whereas the bottom row shows the fixed image (CT) overlaid by edges extracted from pMRI image (in green). Red arrows point out areas of mis-registration.

Results are summarised in Tab. 1 and visualised in Fig. 2. Our model demonstrates superior performance compared to state-of-the-art methods, achieving the highest Dice Similarity Coefficient (DSC) of 0.913, in contrast to the best-performing existing method without the proposed loss (CompNet), which attained a DSC of 0.848. Even though $\%J_{\phi}$ is 0.89% for our method (compared to 0.31% for VXM with the NCC-based loss), it remains within an acceptable range, pointing to sufficient transformation invertibility. VXM and RCN with the NCC-based loss yield only a marginal improvement in registration performance over the initial alignment (DSC of 0.691 and 0.695 versus 0.671). Integrating the MPL considerably boosts their registration accuracy (to 0.789 and 0.895, respectively), emphasising the robustness of the proposed registration loss. Thus, our loss has the potential to enhance the performance of existing state-of-the-art models tailored to monomodal scenarios and enable them to address challenging multimodal image registration tasks such as the pMRI-CT registration outlined in this paper.

Table 2: Comparison of the proposed model with varied cascade configurations.

| Methods | DSC |
|---|---|
| Initial | 0.671 |
| 1 cascade | 0.871 |
| 2 cascades | 0.882 |
| 3 cascades | 0.896 |
| 4 cascades | 0.902 |
| 5 cascades | 0.913 |

To further assess the effectiveness of our method, we evaluated configurations with different numbers of cascades, as detailed in Tab. 2. The DSC increases with every additional cascade, and the architecture with five cascades achieves the highest registration accuracy within the evaluated range. Architectures with more than five cascades were not explored due to computational limits, so the five-cascade design is the best among the configurations tested.
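The contribution of each cascade can be read off Tab. 2 with a few lines of arithmetic. The snippet below only re-uses the DSC values from the table; the variable names are illustrative.

```python
# DSC from Tab. 2, keyed by number of cascades (0 = initial alignment).
dsc_by_cascades = {0: 0.671, 1: 0.871, 2: 0.882, 3: 0.896, 4: 0.902, 5: 0.913}

# DSC gained by adding the n-th cascade, rounded to the table's precision.
gains = {n: round(dsc_by_cascades[n] - dsc_by_cascades[n - 1], 3) for n in range(1, 6)}

# Overall improvement from the initial alignment to five cascades.
total_gain = round(dsc_by_cascades[5] - dsc_by_cascades[0], 3)
```

The first cascade accounts for 0.200 of the 0.242 total DSC gain, while each further cascade adds between 0.006 and 0.014, which is consistent with a coarse-to-fine refinement.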

4.2 Ablation Study

Our novel loss function combines the advantages of the MI and the multi-scale label loss, enhancing the accuracy of multimodal DIR. As seen in Tab. 3, an ablation study assesses the impact of each similarity measure in our loss function. In addition, we compare our loss to that of the most relevant method, i.e. CompNet by Hu et al. [15]. The results indicate that by combining global and local information, our method can efficiently register challenging multimodal images such as pMRI and CT.

Table 3: The ablation study results on: 1) Mutual Information (MI) loss 2) Gaussian-pyramid label loss

| Loss Functions | 1) | 2) | DSC |
|---|---|---|---|
| Initial | | | 0.671 |
| Hu et al. [15] * | × | ✓ | 0.879 |
| MI | ✓ | × | 0.760 |
| Gaussian-pyramid | × | ✓ | 0.890 |
| Multi-perspective | ✓ | ✓ | 0.913 |

* We employ the loss function proposed by Hu et al., applying it to our proposed network instead of CompNet.

4.3 CT-XeMRI Registration

The pMRI and XeMRI images are acquired within the same session, which provides an approximate initial alignment between them. Utilizing the transformations derived from the CT-pMRI registration via our proposed network, we can facilitate the CT-XeMRI registration, as illustrated in Fig. 3. This process aligns the structural and functional data, which is instrumental in clinical analyses that explore the relationship between anatomical and functional impairments. However, the acquisition of pMRI and XeMRI images during distinct breath-hold intervals introduces some degree of misalignment. Future research will aim at addressing this breath-hold variability to enhance the pMRI-XeMRI alignment, thereby improving the precision of CT-XeMRI registration.

Figure 3: Visualization of the CT-XeMRI results before and after registration.

5 Conclusion

This paper presented an end-to-end model based on progressive alignment for multimodal DIR. Our novel loss function enhances the performance of cutting-edge models formerly restricted to monomodal scenarios, promoting their utilisation in multimodal image registration. The proposed method outperformed existing ones when evaluated on challenging 3D lung images from CT and pMRI. Notably, this work can advance multimodal image analysis, offering a contribution that holds the potential to reshape our understanding of, and methods for, long-COVID research.

6 Compliance with Ethical Standards

This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the South Central - Oxford C Research Ethics Committee on 15 Dec 2021 (reference 21/SC/0398).

7 Acknowledgements

This study is funded by the National Institute for Health and Care Research (NIHR) (Long Covid grant, Ref: COV‐LT2‐0049). The views expressed in this publication are those of the authors and not necessarily those of NIHR or The Department of Health and Social Care.

References

  • [1] Albert, M., Cates, G., Driehuys, B., Happer, W., Saam, B., Springer Jr, C., Wishnia, A.: Biological magnetic resonance imaging using laser-polarized 129xe. Nature 370(6486), 199–201 (1994)
  • [2] Anas, E.R., Onsy, A., Matuszewski, B.J.: Ct scan registration with 3d dense motion field estimation using lsgan. In: Medical Image Understanding and Analysis: 24th Annual Conference, MIUA 2020, Oxford, UK, July 15-17, 2020, Proceedings 24. pp. 195–207. Springer (2020)
  • [3] Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12(1), 26–41 (2008)
  • [4] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: Voxelmorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019)
  • [5] Ballering, A.V., van Zon, S.K., Olde Hartman, T.C., Rosmalen, J.G.: Persistence of somatic symptoms after covid-19 in the netherlands: an observational cohort study. The Lancet 400(10350), 452–461 (2022)
  • [6] De Vos, B.D., Berendsen, F.F., Viergever, M.A., Sokooti, H., Staring, M., Išgum, I.: A deep learning framework for unsupervised affine and deformable image registration. Med. Image Anal. 52, 128–143 (2019)
  • [7] Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
  • [8] Ehrhardt, J., Werner, R., Schmidt-Richberg, A., Handels, H.: Statistical modeling of 4d respiratory lung motion using diffeomorphic image registration. IEEE Trans. Med. Imaging 30(2), 251–265 (2010)
  • [9] Grist, J.T., Collier, G.J., Walters, H., Kim, M., Chen, M., Abu Eid, G., Laws, A., Matthews, V., Jacob, K., Cross, S., et al.: Lung abnormalities detected with hyperpolarized 129xe mri in patients with long covid. Radiology 305(3), 709–717 (2022)
  • [10] Guo, C.K.: Multi-modal image registration with unsupervised deep learning. Ph.D. thesis, Massachusetts Institute of Technology (2019)
  • [11] Heinrich, M.P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, F.V., Brady, M., Schnabel, J.A.: MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Med. Image Anal. 16(7), 1423–1435 (2012)
  • [12] Heinrich, M.P., Jenkinson, M., Papież, B.W., Brady, S.M., Schnabel, J.A.: Towards realtime multimodal fusion for image-guided interventions using self-similarities. In: MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22-26, 2013, Proceedings, Part I 16. pp. 187–194. Springer (2013)
  • [13] Hermosillo, G., Chefd’Hotel, C., Faugeras, O.: Variational methods for multimodal image matching. International Journal of Computer Vision 50(3), 329–343 (2002)
  • [14] Hu, Y., Modat, M., Gibson, E., Ghavami, N., Bonmati, E., Moore, C.M., Emberton, M., Noble, J.A., Barratt, D.C., Vercauteren, T.: Label-driven weakly-supervised learning for multimodal deformable image registration. In: 15th ISBI. pp. 1070–1074. IEEE (2018)
  • [15] Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C.M., Emberton, M., et al.: Weakly-supervised convolutional neural networks for multimodal image registration. Med. Image Anal. 49, 1–13 (2018)
  • [16] Hua, R., Pozo, J.M., Taylor, Z.A., Frangi, A.F.: Multiresolution extended free-form deformations (xffd) for non-rigid registration with discontinuous transforms. Med. Image Anal. 36, 113–122 (2017)
  • [17] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. NIPS 28 (2015)
  • [18] Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 16(2), 187–198 (1997)
  • [19] Mugler III, J.P., Altes, T.A.: Hyperpolarized 129xe mri of the human lung. J. Magn. Reson. Imaging 37(2), 313–331 (2013)
  • [20] Papież, B.W., Heinrich, M.P., Fehrenbach, J., Risser, L., Schnabel, J.A.: An implicit sliding-motion preserving regularisation via bilateral filtering for deformable image registration. Med. Image Anal. 18(8), 1299–1311 (2014)
  • [21] Qin, C., Shi, B., Liao, R., Mansi, T., Rueckert, D., Kamen, A.: Unsupervised deformable registration for multi-modal images via disentangled representations. In: IPMI, Hong Kong, China, June 2–7, 2019, Proceedings 26. pp. 249–261. Springer (2019)
  • [22] Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: application to breast mr images. IEEE Trans. Med. Imaging 18(8), 712–721 (1999)
  • [23] Sotiras, A., Davatzikos, C., Paragios, N.: Deformable medical image registration: A survey. IEEE Trans. Med. Imaging 32(7), 1153–1190 (2013)
  • [24] Szmul, A., Matin, T., Gleeson, F.V., Schnabel, J.A., Grau, V., Papież, B.W.: XeMRI to CT lung image registration enhanced with personalized 4DCT-derived motion model. In: Image Analysis for Moving Organ, Breast, and Thoracic Images: Third International Workshop, RAMBO 2018, Fourth International Workshop, BIA 2018, and First International Workshop, TIA 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16 and 20, 2018, Proceedings 3. pp. 260–271. Springer (2018)
  • [25] Szmul, A., Matin, T., Gleeson, F.V., Schnabel, J.A., Grau, V., Papież, B.W.: Patch-based lung ventilation estimation using multi-layer supervoxels. Comput. Med. Imaging Graph. 74, 49–60 (2019)
  • [26] Wells III, W.M., Viola, P., Atsumi, H., Nakajima, S., Kikinis, R.: Multi-modal volume registration by maximization of mutual information. Med. Image Anal. 1(1), 35–51 (1996)
  • [27] Zhao, S., Dong, Y., Chang, E.I., Xu, Y., et al.: Recursive cascaded networks for unsupervised medical image registration. In: ICCV. pp. 10600–10610 (2019)
  • [28] Zheng, J.Q., Wang, Z., Huang, B., Lim, N.H., Papież, B.W.: Residual aligner-based network (RAN): Motion-separable structure for coarse-to-fine discontinuous deformable registration. Med. Image Anal. 91, 103038 (2024)
  • [29] Zheng, J.Q., Wang, Z., Huang, B., Vincent, T., Lim, N.H., Papież, B.W.: Recursive deformable image registration network with mutual attention. In: MIUA 2022, Cambridge, UK, July 27–29, 2022, Proceedings. pp. 75–86. Springer (2022)