2 Big Data Institute, University of Oxford, UK
3 Oxford Radiology Research Unit, Oxford University Hospitals NHS Foundation Trust, UK
Email: {jiahua.li@ndph, bartlomiej.papiez@bdi}.ox.ac.uk
Multimodal Deformable Image Registration for Long-COVID Analysis Based on Progressive Alignment and Multi-perspective Loss
Abstract
Long COVID is characterized by persistent symptoms, particularly pulmonary impairment, which necessitates advanced imaging for accurate diagnosis. Hyperpolarised Xenon-129 MRI (XeMRI) offers a promising avenue by visualising lung ventilation, perfusion, and gas transfer. Integrating functional data from XeMRI with structural data from Computed Tomography (CT) is crucial for comprehensive analysis and effective treatment strategies in long COVID, and requires precise alignment of data from these complementary imaging modalities. To this end, CT-MRI registration is an essential intermediate step, given the significant challenges posed by the direct alignment of CT and XeMRI. Therefore, we propose an end-to-end multimodal deformable image registration method that achieves superior performance for aligning long-COVID lung CT and proton density MRI (pMRI) data. Moreover, our method incorporates a novel Multi-perspective Loss (MPL) function, enhancing state-of-the-art deep learning methods for monomodal registration by making them adaptable for multimodal tasks. The registration results achieve a Dice coefficient score of 0.913, indicating a substantial improvement over state-of-the-art multimodal image registration techniques. Since the XeMRI and pMRI images are acquired in the same session and can be roughly aligned, our results facilitate subsequent registration between XeMRI and CT, thereby potentially enhancing clinical decision-making for long COVID management.
Keywords:
Medical image registration, Multimodal image registration, Progressive learning
1 Introduction
Of the more than 651 million documented COVID-19 cases worldwide, current conservative estimates suggest that around 65 million people are suffering from long COVID [5]. Moreover, a number of patients with long COVID present no findings on Computed Tomography (CT), and more advanced imaging techniques such as hyperpolarized Xenon MRI (XeMRI) have to be utilised to detect lung abnormalities [9]. While XeMRI provides insight into lung function, it needs to be analysed with respect to the underlying anatomy (shown e.g. in CT) to be utilised in clinical decision-making, which in turn requires multimodal image registration.
Monomodal deformable image registration (DIR) is regarded as a non-trivial task, due to patient motion [20, 16, 6, 2] (for longitudinal studies) or subject variability [8] (for cross-sectional studies). The complexity of DIR increases further in multimodal scenarios, because the images are acquired to visualise diverse physical phenomena, e.g. CT or MRI, each relying on different physical properties of tissue, and therefore differ in intensity. Since multimodal DIR gives clinicians more comprehensive insights into a patient’s condition, benefiting diagnostic accuracy and personalised treatment plans, efficient multimodal DIR is critical, and many methods have been suggested [23]. However, statistical and information theory-based methods suffer from computational complexity and slow convergence [26, 18, 13], while descriptor-based methods are sensitive to initial conditions and require effective pre-alignment to handle large translations [11, 12]; their reliance on hand-crafted features calls for domain expertise for fine-tuning and restricts their adaptability. More recently, Convolutional Neural Networks (CNNs) have been utilised to learn a standard representation for DIR by optimising a similarity metric [14, 15, 10]. However, selecting an appropriate similarity metric is challenging since multimodal images can differ in intrinsic intensity distribution and resolution, so the effectiveness of learning-based methods has been limited mainly to monomodal scenarios [4, 27, 29, 28]. Alternatively, multimodal DIR can be transformed into a less complex monomodal task using image-to-image (I2I) translation [21]. Nonetheless, such translation can result in shape inconsistency and produce artificial anatomical features, further deteriorating the performance of the DIR.
The focus of this work is on DIR between CT and proton MRI (pMRI), a step of significance to the analysis of XeMRI. Owing to its non-ionising nature, XeMRI has gained considerable interest for long COVID, primarily because it captures images related to lung ventilation, perfusion, and gas transfer [1, 19, 25, 9]. Since XeMRI does not provide anatomical information, the alignment of XeMRI images with pMRI and CT is essential. pMRI is typically acquired in the same imaging session as XeMRI, albeit not within the same breath-hold, while CT is taken a couple of days prior. This poses a challenge when attempting to fuse XeMRI with CT, thus necessitating DIR between pMRI and CT.
The contributions of our work are as follows. To overcome the aforementioned limitations, we propose a multimodal, end-to-end method based on a progressive alignment architecture which can tackle significant deformations (Sec. 2.2). Moreover, we introduce a novel Multi-perspective Loss (MPL) function, applicable to any existing monomodal DIR architecture, extending their application to multimodal image registration (Sec. 2.3). Lastly, our method was evaluated on a challenging long-COVID lung CT and pMRI dataset, achieving a Dice coefficient (DSC) of 0.913 and outperforming the state-of-the-art models for multimodal DIR (Sec. 3). To the best of our knowledge, this is the first effort to automate multimodal deformable image registration for long-COVID CT and pMRI.
2 Methodology
2.1 Overview
As seen in Fig. 1, DIR aims to estimate a non-linear voxel-to-voxel correspondence between a fixed image $I_f$ and a moving image $I_m$, in which the estimated transformation $\phi$ is parameterized with $\theta$:
$$\phi = F_{\theta}(I_f, I_m) \tag{1}$$
with $F$ and $\theta$ corresponding to the utilised neural networks and the networks’ learning parameters, respectively. Our method uses two 3D images as input: the pMRI image (the moving image) and the CT image serving as a reference (the fixed image). These are fed into a cascading sequence of 3D CNNs (described in Sec. 2.2) to extract distinctive feature maps from both input images. Furthermore, we use a novel loss function (described in Sec. 2.3) that combines Mutual Information (MI) and Gaussian Pyramid labels to capture both global and local intensity information. In this section, the workflow of our methodology is outlined, with detailed descriptions in the subsequent subsections.
2.2 Progressive Alignment Architecture
Because of the significant deformation observed between modalities, estimating the displacement field in a single pass is challenging. Thus, the model is applied iteratively to ensure progressive refinement (see Fig. 1). The first iteration establishes a coarse transformation, while subsequent refinements estimate finer transformations. Specifically, the model starts with a network that predicts an affine transformation $\phi_{\mathrm{aff}}$ with 12 degrees of freedom (shown in Fig. 1). The network for affine transformation has four downsampling residual-network blocks (ResBlock). Its final layer is a learnable fully-connected linear projection that produces a vector of the 12 affine parameters. Following the network for the affine alignment, $n$ cascades of registration networks (sharing weights) predicting dense displacement fields (DDFs) are employed to estimate the local (non-rigid) deformations $\phi_1, \dots, \phi_n$. Similarly to the affine network, each cascaded network has a VoxelMorph-style architecture [4] in which the encoder is replaced with four downsampling ResBlocks. The affine transformation and the DDFs are recursively combined through the Spatial Transformer Network (STN) [17] to produce the final DDF $\phi$. For the $k$-th cascade, the output $I_m^{(k)}$ is obtained by warping the output $I_m^{(k-1)}$ of the $(k-1)$-th cascade with the DDF $\phi_k$:
$$I_m^{(k)} = \phi_k \circ I_m^{(k-1)} \tag{2}$$
where $\circ$ corresponds to the warping operation performed by a trilinear image resampler, and $I_m^{(0)} = \phi_{\mathrm{aff}} \circ I_m$ is the affinely aligned moving image. In principle, this recursion can be applied any number of times. Hence, the moving image $I_m$ is warped by the final DDF, obtained from its affine transformation and multiple cascades of deformable transformation, resulting in the registered image $I_m^{(n)}$, represented as:
$$I_m^{(n)} = \phi_n \circ \phi_{n-1} \circ \cdots \circ \phi_1 \circ \phi_{\mathrm{aff}} \circ I_m \tag{3}$$
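The recursion described above can be illustrated with a minimal sketch. It is not the authors' implementation: it works on a 1D signal with linear interpolation as a stand-in for the 3D trilinear resampler, and the function names `warp_1d` and `progressive_align` as well as the hand-written displacement fields are hypothetical. In the actual method each field would be predicted by a network from the current warped image and the fixed image.

```python
from typing import List, Sequence


def warp_1d(signal: Sequence[float], disp: Sequence[float]) -> List[float]:
    """Resample `signal` at positions x + disp[x] using linear interpolation.

    1D stand-in for a trilinear resampler; sampling positions are clamped
    to the valid index range.
    """
    n = len(signal)
    out = []
    for x in range(n):
        pos = min(max(x + disp[x], 0.0), n - 1.0)
        lo = int(pos)              # left neighbour (floor, because pos >= 0)
        hi = min(lo + 1, n - 1)    # right neighbour, clamped at the border
        w = pos - lo               # interpolation weight of the right neighbour
        out.append((1.0 - w) * signal[lo] + w * signal[hi])
    return out


def progressive_align(moving: Sequence[float],
                      fields: Sequence[Sequence[float]]) -> List[float]:
    """Warp the moving signal with each displacement field in turn.

    The output of cascade k-1 is the input of cascade k.
    """
    warped = list(moving)
    for disp in fields:
        warped = warp_1d(warped, disp)
    return warped


# Hypothetical example: two cascades, each shifting the peak by one sample.
moving = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
fields = [[1.0] * 6, [1.0] * 6]
registered = progressive_align(moving, fields)
```

Each cascade only has to explain the residual misalignment left by the previous one, which is the motivation for splitting a large deformation into a coarse step followed by finer ones.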
2.3 Multi-perspective Loss
The deformations observed in the thorax and the complementary information captured by pMRI and CT require the model to attend not only to local information (such as edges, textures, and corners) but also to long-range (dis)similarities. To address this challenge, we propose a novel loss function, the multi-perspective loss (MPL), which comprises the Mutual Information (MI) loss and the Gaussian Pyramid label (GPL) loss. The MI loss,
$$\mathcal{L}_{MI} = -\sum_{a,b} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)} \tag{4}$$
with $p(a,b)$ the joint intensity distribution of the two images and $p(a)$, $p(b)$ its marginals, quantifies the statistical discrepancy between two images and focuses on global alignment. Simultaneously, to overcome its limitations for local alignment, the GPL loss is employed, which uses Gaussian filters,
$$G_{\sigma}(\mathbf{x}) = \frac{1}{(2\pi\sigma^{2})^{3/2}}\exp\left(-\frac{\lVert\mathbf{x}\rVert^{2}}{2\sigma^{2}}\right) \tag{5}$$
to derive feature pyramids across various scales, thereby facilitating the capture of local correspondences between images. Specifically, segmentation labels from the MRI and CT images are filtered by 3D Gaussian kernels operating at six separate standard deviation scales $\sigma$. The higher scales encourage the model to focus on the entire lung cavity, while the lower scales target more local features, such as edges and corners. This dual focus enables the alignment of both large-scale structural features and smaller, more intricate details, thus addressing the MI loss’s limitation of neglecting anatomical information. As such, the resulting loss function is denoted as follows:
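$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{MI} + \lambda_{2}\,\mathcal{L}_{GPL} + \lambda_{3}\,\mathcal{R}(\phi) \tag{6}$$
where $\mathcal{L}_{MI}$ and $\mathcal{L}_{GPL}$ represent the MI and GPL losses. The function $\mathcal{R}$ is a regularisation term using a weighted bending energy [22] to penalise local spatial variations in the displacement field $\phi$, ensuring a smooth displacement field. The parameters $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ serve as the weighting coefficients, modulating the contribution of the corresponding terms in the loss function, respectively.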
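To make the structure of the loss concrete, the following is a minimal 1D sketch under several stated assumptions; it is not the authors' implementation. MI is computed from discrete intensity bins, the label term is taken to be a mean squared difference between Gaussian-filtered labels (the text does not state the exact distance), the regulariser is a plain second-difference bending energy, the default weights assume the order MI, GPL, regulariser for the values 1.0, 1.0 and 2.0 reported in Sec. 3.2, and the standard deviations in the usage example are hypothetical. All function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """MI (in nats) between two equally long sequences of intensity bins."""
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(
        (c / n) * math.log((c / n) / ((pa[i] / n) * (pb[j] / n)))
        for (i, j), c in joint.items()
    )


def gaussian_smooth(label: Sequence[float], sigma: float) -> List[float]:
    """Filter a 1D label with a normalised, truncated Gaussian (zero padding)."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(label)
    return [
        sum(kernel[k + radius] * label[x + k]
            for k in range(-radius, radius + 1) if 0 <= x + k < n)
        for x in range(n)
    ]


def gpl_loss(label_fixed: Sequence[float], label_warped: Sequence[float],
             sigmas: Sequence[float]) -> float:
    """Assumed form: squared difference of filtered labels, averaged over scales."""
    loss = 0.0
    for sigma in sigmas:
        f = gaussian_smooth(label_fixed, sigma)
        w = gaussian_smooth(label_warped, sigma)
        loss += sum((p - q) ** 2 for p, q in zip(f, w)) / len(f)
    return loss / len(sigmas)


def bending_energy(disp: Sequence[float]) -> float:
    """Mean squared second difference of a 1D displacement field (length >= 3)."""
    second = [disp[i - 1] - 2.0 * disp[i] + disp[i + 1]
              for i in range(1, len(disp) - 1)]
    return sum(s * s for s in second) / len(second)


def multi_perspective_loss(img_f, img_w, lab_f, lab_w, disp, sigmas,
                           w_mi=1.0, w_gpl=1.0, w_reg=2.0) -> float:
    """Weighted sum of negative MI, the label-pyramid term and the regulariser.

    The weight order (MI, GPL, regulariser) is an assumption.
    """
    return (-w_mi * mutual_information(img_f, img_w)
            + w_gpl * gpl_loss(lab_f, lab_w, sigmas)
            + w_reg * bending_energy(disp))


# Hypothetical usage: perfectly aligned inputs give the minimum, -MI.
label = [0, 0, 1, 1, 1, 0, 0, 0]
loss = multi_perspective_loss([0, 0, 1, 1], [0, 0, 1, 1], label, label,
                              [0.0] * 8, sigmas=(0.5, 1.0, 2.0))
```

Large standard deviations blur the labels into broad blobs, so a mismatch in overall lung position still produces a gradient, while small standard deviations keep the term sensitive to boundary detail.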
3 The Experiment
3.1 Dataset
We conducted an assessment of the proposed method using an in-house Post-COVID Assessment Clinic dataset, including 46 pairs of CT, pMRI and XeMRI images. Specifically, MRI was performed at 3 T (GE Healthcare, Premier) using a phased array thoracic imaging coil (30 channels). Proton imaging consisted of a 3D spoiled gradient echo sequence, characterized by a Repetition Time (TR) of 3.1 ms, Echo Time (TE) of 1 ms, Field of View (FOV) of 400 mm, slice thickness of 5 mm, an acquisition matrix of , a reconstruction matrix of , number of slices = 36, performed in a single breath-hold, with a bandwidth of 62.5 kHz and a flip angle of 20 degrees.
Subsequent to inhalation of 1L of polarized Xenon-129, XeMRI was acquired using a Transmit/Receive vest coil (PulseTeq, Cobham, UK) employing a 4-echo radial sequence with TR = 23 ms, an acquisition matrix of , a reconstruction matrix of , FOV of 400 mm, a flip angle of 40 degrees, and using Iterative Decomposition of Water and Fat with Shifted Echo Times and Least Squares Regression (IDEAL) Reconstruction.
CT was performed using a GE Healthcare system with a section thickness of 0.625 mm and a slice resolution of after an inhalation of 1L of room air.
All images were resampled as isotropic, with a spatial resolution of . Subsequently, the images were cropped based on the lung region, followed by padding to the size of . The dataset was then randomly split into 30 pairs for training, 6 for validation, and 10 for testing. All reported results presented within this study are derived from the analysis conducted on the testing dataset.
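The cropping, padding and splitting steps can be sketched as follows. This is an illustration only: the helper names, the random seed, the symmetric zero padding and the 1D setting are assumptions, since the text specifies only the 30/6/10 split of the 46 pairs and that images were cropped to the lung region and then padded.

```python
import random
from typing import Dict, List, Sequence


def crop_and_pad_1d(values: Sequence[float], mask: Sequence[int],
                    size: int, fill: float = 0.0) -> List[float]:
    """Crop to the bounding box of the non-zero mask, then pad to `size`.

    1D stand-in for cropping a volume to the lung region; the padding is
    split as evenly as possible between both sides (an assumption).
    """
    idx = [i for i, m in enumerate(mask) if m]
    if not idx:
        raise ValueError("mask is empty")
    cropped = list(values[idx[0]: idx[-1] + 1])
    if len(cropped) > size:
        raise ValueError("cropped region is larger than the target size")
    before = (size - len(cropped)) // 2
    after = size - len(cropped) - before
    return [fill] * before + cropped + [fill] * after


def split_dataset(n_pairs: int = 46, n_train: int = 30, n_val: int = 6,
                  n_test: int = 10, seed: int = 0) -> Dict[str, List[int]]:
    """Randomly partition pair indices into train/validation/test subsets."""
    if n_train + n_val + n_test != n_pairs:
        raise ValueError("subset sizes must sum to the number of pairs")
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


splits = split_dataset()
```

Fixing the seed keeps the same subjects in the test subset across all compared methods, which is needed for the comparison in Sec. 3.3 to be like for like.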
3.2 Implementation Details
Our method was implemented using PyTorch on an NVIDIA RTX6000 GPU. All models were trained for 300 epochs with a batch size of 1, and the experiments used five-fold cross-validation. We set the number of cascades in our method to 5, which gave the best results (Tab. 2). The Adam optimiser was utilised, with a learning rate. Lastly, the three weighting coefficients of our loss function (Eq. 6) are set to 1.0, 1.0 and 2.0, respectively. These values were tuned to balance training stability, registration accuracy, and transformation invertibility.
3.3 Comparison with the state-of-the-art methods
The proposed model was benchmarked against a state-of-the-art iterative DIR method, SyN [3, 24], and deep learning-based DIR methods: VXM [4], RCN [27], and CompositeNet (CompNet) [15]. SyN was implemented using ANTsPy. VXM uses a U-Net for non-iterative registration, while RCN employs an iterative approach with a Volume Tweening Network (VTN) configuration [27]. Initially, VXM and RCN adopted the Normalised Cross Correlation (NCC) loss with a variation loss as regularisation. However, since the NCC loss led to poor registration results, we substituted it with the proposed MPL (see Eq. 6). This comparison was conducted to demonstrate the applicability of the proposed loss across state-of-the-art models. CompNet is a popular multimodal DIR method consisting of a GlobalNet and a LocalNet. As such, it encourages both global and local alignment, calculating the loss from seven scales of the DSC with the weighted bending energy as regularisation [15, 22]. Registration accuracy was evaluated by measuring the overlap between registered and fixed segmentation masks with the Dice Similarity Coefficient (DSC) [7]. The percentage of negative Jacobian determinants of the estimated displacement fields allowed for a further assessment of transformation invertibility, with a lower percentage indicating smoother transformations. The traditional method was evaluated on the same testing data, while all deep learning-based methods were trained and tested on the same splits of the dataset.
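The two evaluation measures can be sketched in a few lines. This is a simplified illustration, not the evaluation code used in the paper: masks are flattened lists, the displacement field is 2D rather than 3D, derivatives are approximated with forward differences, and the function names are hypothetical.

```python
from typing import Sequence


def dice(a: Sequence[int], b: Sequence[int]) -> float:
    """Dice similarity coefficient of two flattened binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    size = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2.0 * inter / size if size else 1.0


def negative_jacobian_percentage(ux: Sequence[Sequence[float]],
                                 uy: Sequence[Sequence[float]]) -> float:
    """Percentage of grid cells where det(I + grad u) < 0.

    ux[r][c] and uy[r][c] are displacements along x (columns) and y (rows);
    a negative determinant means the mapping folds locally.
    """
    rows, cols = len(ux), len(ux[0])
    folded = total = 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dux_dx = ux[r][c + 1] - ux[r][c]
            dux_dy = ux[r + 1][c] - ux[r][c]
            duy_dx = uy[r][c + 1] - uy[r][c]
            duy_dy = uy[r + 1][c] - uy[r][c]
            det = (1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx
            folded += det < 0
            total += 1
    return 100.0 * folded / total


# Hypothetical fields: the identity never folds, a strong compression does.
zeros = [[0.0] * 3 for _ in range(3)]
folding_ux = [[0.0, -2.0, -4.0] for _ in range(3)]
identity_pct = negative_jacobian_percentage(zeros, zeros)
folding_pct = negative_jacobian_percentage(folding_ux, zeros)
```

A Dice score measures only how well the masks overlap, so it is read together with the folding percentage: a field can reach a high overlap while being locally non-invertible.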
Methods | Loss Function | DSC | Negative Jacobian determinants |
---|---|---|---|
Initial | | 0.671 | |
SyN | MI | 0.693 | |
VXM | NCC + Dice Loss | 0.691 | 0.31% |
VXM | MIND | 0.701 | 0.35% |
VXM | MPL | 0.789 | 0.51% |
RCN | NCC + Dice Loss | 0.695 | 0.49% |
RCN | MPL | 0.895 | 1.89% |
CompNet | Multi-scale Dice Loss | 0.848 | 0.76% |
Ours | MPL | 0.913 | 0.89% |
4 Results and Discussion
4.1 Registration Results
Results are summarised in Tab. 1 and visualised in Fig. 2. Our model demonstrates superior performance compared to state-of-the-art methods, achieving the highest Dice Similarity Coefficient (DSC) of 0.913, in contrast to the best-performing existing method (CompNet), which attained a DSC of 0.848. Even though the percentage of negative Jacobian determinants is 0.89% for our method (compared to 0.31% for VXM), it remains within an acceptable range, pointing to sufficient transformation invertibility. With the NCC loss, VXM and RCN yield only a marginal improvement over the initial alignment (DSC of 0.691 and 0.695 versus 0.671). Integrating the MPL, however, considerably boosts their registration accuracy (to 0.789 and 0.895, respectively), emphasising the robustness of the proposed registration loss. Thus, our loss has the potential to enhance the performance of existing state-of-the-art models tailored to monomodal scenarios and enable them to address challenging multimodal image registration such as the pMRI and CT registration considered in this paper.
Methods | DSC |
---|---|
Initial | 0.671 |
1 cascade | 0.871 |
2 cascades | 0.882 |
3 cascades | 0.896 |
4 cascades | 0.902 |
5 cascades | 0.913 |
To further assess the effectiveness of our method, we evaluated configurations with different numbers of cascades, as detailed in Tab. 2. The results show that registration accuracy increases with every added cascade and that the architecture with five cascades achieves the highest accuracy within the evaluated range. Architectures with more than five cascades were not explored due to computational limits, but the current findings support the efficacy of the five-cascade design.
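The contribution of each additional cascade can be read off Tab. 2 with a short calculation. The sketch below simply differences the reported DSC values; treating the initial alignment as "0 cascades" and the helper name `incremental_gains` are conventions chosen here for illustration.

```python
from typing import Dict

# DSC values copied from Tab. 2; key 0 stands for the initial alignment.
DSC_BY_CASCADES: Dict[int, float] = {
    0: 0.671, 1: 0.871, 2: 0.882, 3: 0.896, 4: 0.902, 5: 0.913,
}


def incremental_gains(scores: Dict[int, float]) -> Dict[int, float]:
    """DSC gained by each configuration over the previous one (3 decimals)."""
    keys = sorted(scores)
    return {k: round(scores[k] - scores[p], 3) for p, k in zip(keys, keys[1:])}


gains = incremental_gains(DSC_BY_CASCADES)
# gains == {1: 0.2, 2: 0.011, 3: 0.014, 4: 0.006, 5: 0.011}
```

The first cascade accounts for a gain of 0.200, while each later cascade adds between 0.006 and 0.014, which is consistent with the later stages refining rather than re-estimating the alignment.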
4.2 Ablation Study
Our novel loss function combines the advantages of the MI and the multi-scale label loss, enhancing the accuracy of multimodal DIR. As seen in Tab. 3, an ablation study assesses the impact of each similarity measure in our loss function. In addition, we compare our loss to that of the most relevant method, i.e. CompNet by Hu et al. [15]. The results indicate that by combining global and local information, our method can efficiently register challenging multimodal images such as pMRI and CT.
Loss Functions | 1) | 2) | DSC |
---|---|---|---|
Initial | | | 0.671 |
Hu et al. [15] * | | | 0.879 |
MI | | | 0.760 |
Gaussian-pyramid | | | 0.890 |
Multi-perspective | | | 0.913 |
* We employ the loss function proposed by Hu et al., applying it to our proposed network instead of CompNet.
4.3 CT-XeMRI Registration
The pMRI and XeMRI images are acquired within the same session, which provides an approximate inherent alignment. Using the transformations derived from the CT-pMRI registration via our proposed network, we can facilitate the CT-XeMRI registration, as illustrated in Fig. 3. This process aligns the structural and functional data, which is instrumental in clinical analyses that explore the relationship between anatomical and functional impairments. However, because pMRI and XeMRI are acquired in distinct breath-holds, some degree of misalignment remains. Future research will aim at addressing this breath-hold variability to enhance the pMRI-XeMRI alignment, thereby improving the precision of CT-XeMRI registration.
5 Conclusion
This paper presented an end-to-end model based on progressive alignment for multimodal DIR. Our novel loss function enhances the performance of cutting-edge models formerly restricted to monomodal scenarios, promoting their use in multimodal image registration. The proposed method outperformed existing ones when evaluated on challenging 3D lung images from CT and pMRI. This work can thereby advance multimodal image analysis and has the potential to improve both the understanding of long COVID and the methods used to study it.
6 Compliance with Ethical Standards
This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the South Central - Oxford C Research Ethics Committee on 15 Dec 2021 (reference 21/SC/0398).
7 Acknowledgements
This study is funded by the National Institute for Health and Care Research (NIHR) (Long Covid grant, Ref: COV‐LT2‐0049). The views expressed in this publication are those of the authors and not necessarily those of NIHR or The Department of Health and Social Care.
References
- [1] Albert, M., Cates, G., Driehuys, B., Happer, W., Saam, B., Springer Jr, C., Wishnia, A.: Biological magnetic resonance imaging using laser-polarized 129Xe. Nature 370(6486), 199–201 (1994)
- [2] Anas, E.R., Onsy, A., Matuszewski, B.J.: CT scan registration with 3D dense motion field estimation using LSGAN. In: Medical Image Understanding and Analysis: 24th Annual Conference, MIUA 2020, Oxford, UK, July 15-17, 2020, Proceedings 24. pp. 195–207. Springer (2020)
- [3] Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12(1), 26–41 (2008)
- [4] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: Voxelmorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019)
- [5] Ballering, A.V., van Zon, S.K., Olde Hartman, T.C., Rosmalen, J.G.: Persistence of somatic symptoms after COVID-19 in the Netherlands: an observational cohort study. The Lancet 400(10350), 452–461 (2022)
- [6] De Vos, B.D., Berendsen, F.F., Viergever, M.A., Sokooti, H., Staring, M., Išgum, I.: A deep learning framework for unsupervised affine and deformable image registration. Med. Image Anal. 52, 128–143 (2019)
- [7] Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
- [8] Ehrhardt, J., Werner, R., Schmidt-Richberg, A., Handels, H.: Statistical modeling of 4D respiratory lung motion using diffeomorphic image registration. IEEE Trans. Med. Imaging 30(2), 251–265 (2010)
- [9] Grist, J.T., Collier, G.J., Walters, H., Kim, M., Chen, M., Abu Eid, G., Laws, A., Matthews, V., Jacob, K., Cross, S., et al.: Lung abnormalities detected with hyperpolarized 129Xe MRI in patients with long COVID. Radiology 305(3), 709–717 (2022)
- [10] Guo, C.K.: Multi-modal image registration with unsupervised deep learning. Ph.D. thesis, Massachusetts Institute of Technology (2019)
- [11] Heinrich, M.P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, F.V., Brady, M., Schnabel, J.A.: MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Med. Image Anal. 16(7), 1423–1435 (2012)
- [12] Heinrich, M.P., Jenkinson, M., Papież, B.W., Brady, S.M., Schnabel, J.A.: Towards realtime multimodal fusion for image-guided interventions using self-similarities. In: MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22-26, 2013, Proceedings, Part I 16. pp. 187–194. Springer (2013)
- [13] Hermosillo, G., Chefd’Hotel, C., Faugeras, O.: Variational methods for multimodal image matching. International Journal of Computer Vision 50(3), 329–343 (2002)
- [14] Hu, Y., Modat, M., Gibson, E., Ghavami, N., Bonmati, E., Moore, C.M., Emberton, M., Noble, J.A., Barratt, D.C., Vercauteren, T.: Label-driven weakly-supervised learning for multimodal deformable image registration. In: 15th ISBI. pp. 1070–1074. IEEE (2018)
- [15] Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C.M., Emberton, M., et al.: Weakly-supervised convolutional neural networks for multimodal image registration. Med. Image Anal. 49, 1–13 (2018)
- [16] Hua, R., Pozo, J.M., Taylor, Z.A., Frangi, A.F.: Multiresolution extended free-form deformations (XFFD) for non-rigid registration with discontinuous transforms. Med. Image Anal. 36, 113–122 (2017)
- [17] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. NIPS 28 (2015)
- [18] Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 16(2), 187–198 (1997)
- [19] Mugler III, J.P., Altes, T.A.: Hyperpolarized 129Xe MRI of the human lung. J. Magn. Reson. Imaging 37(2), 313–331 (2013)
- [20] Papież, B.W., Heinrich, M.P., Fehrenbach, J., Risser, L., Schnabel, J.A.: An implicit sliding-motion preserving regularisation via bilateral filtering for deformable image registration. Med. Image Anal. 18(8), 1299–1311 (2014)
- [21] Qin, C., Shi, B., Liao, R., Mansi, T., Rueckert, D., Kamen, A.: Unsupervised deformable registration for multi-modal images via disentangled representations. In: IPMI, Hong Kong, China, June 2–7, 2019, Proceedings 26. pp. 249–261. Springer (2019)
- [22] Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: application to breast mr images. IEEE Trans. Med. Imaging 18(8), 712–721 (1999)
- [23] Sotiras, A., Davatzikos, C., Paragios, N.: Deformable medical image registration: A survey. IEEE Trans. Med. Imaging 32(7), 1153–1190 (2013)
- [24] Szmul, A., Matin, T., Gleeson, F.V., Schnabel, J.A., Grau, V., Papież, B.W.: XeMRI to CT lung image registration enhanced with personalized 4DCT-derived motion model. In: Image Analysis for Moving Organ, Breast, and Thoracic Images: Third International Workshop, RAMBO 2018, Fourth International Workshop, BIA 2018, and First International Workshop, TIA 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16 and 20, 2018, Proceedings 3. pp. 260–271. Springer (2018)
- [25] Szmul, A., Matin, T., Gleeson, F.V., Schnabel, J.A., Grau, V., Papież, B.W.: Patch-based lung ventilation estimation using multi-layer supervoxels. Comput. Med. Imaging Graph. 74, 49–60 (2019)
- [26] Wells III, W.M., Viola, P., Atsumi, H., Nakajima, S., Kikinis, R.: Multi-modal volume registration by maximization of mutual information. Med. Image Anal. 1(1), 35–51 (1996)
- [27] Zhao, S., Dong, Y., Chang, E.I., Xu, Y., et al.: Recursive cascaded networks for unsupervised medical image registration. In: ICCV. pp. 10600–10610 (2019)
- [28] Zheng, J.Q., Wang, Z., Huang, B., Lim, N.H., Papież, B.W.: Residual aligner-based network (RAN): Motion-separable structure for coarse-to-fine discontinuous deformable registration. Med. Image Anal. 91, 103038 (2024)
- [29] Zheng, J.Q., Wang, Z., Huang, B., Vincent, T., Lim, N.H., Papież, B.W.: Recursive deformable image registration network with mutual attention. In: MIUA 2022, Cambridge, UK, July 27–29, 2022, Proceedings. pp. 75–86. Springer (2022)