MMR-Mamba: Multi-Contrast MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion
Abstract
Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic computational complexity or fail to capture long-range correlated features comprehensively. In this work, we propose MMR-Mamba, a novel framework that achieves comprehensive integration of multi-contrast features through Mamba and spatial-frequency information fusion. Firstly, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain, which maximally restores the target modality information by selectively absorbing useful information from the auxiliary modality. Secondly, leveraging global properties of the Fourier domain, we introduce the Selective Frequency Fusion (SFF) module to efficiently integrate global information in the frequency domain and recover high-frequency signals for the reconstruction of structure details. Additionally, we present the Adaptive Spatial-Frequency Fusion (ASFF) module, which enhances fused features by supplementing less informative features from one domain with corresponding features from the other domain. These innovative strategies ensure efficient feature fusion across spatial and frequency domains, avoiding the introduction of redundant information and facilitating the reconstruction of high-quality target images. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed MMR-Mamba over state-of-the-art MRI reconstruction methods.
Index Terms:
MRI reconstruction, Multi-contrast, State space models, Fourier domain, Spatial-frequency information fusion.
I Introduction
Magnetic resonance imaging (MRI) is an essential clinical imaging technology owing to its non-invasive, radiation-free nature and its capability to provide high-resolution morphological information with varying contrast [1]. However, due to the intrinsic physics of MR imaging systems [2], acquiring fully sampled k-space data can take tens of minutes, resulting in patient discomfort and increased operational costs. Therefore, accelerating MRI acquisition, particularly through the reconstruction of high-quality MR images from undersampled k-space data, has become a popular research topic [3].
In clinical practice, multi-contrast MR images are simultaneously acquired to enable more accurate disease diagnosis and treatment planning. These modalities provide complementary information, revealing detailed anatomical structures of the patient [4]. For instance, T1 and T2 weighted images (T1WIs and T2WIs), along with proton density and fat-suppressed proton density-weighted images (PDWIs and FS-PDWIs), are two pairs of images with complementary structures. Typically, T1WIs are acquired more quickly due to shorter repetition time (TR) and echo time (TE) requirements, whereas T2WI acquisition requires a longer TR, resulting in slower imaging speeds [5]. Similarly, FS-PDWIs necessitate longer scan times than PDWIs. In this context, leveraging readily obtainable modalities (i.e., auxiliary modalities such as T1WIs or PDWIs) as supplementary guidance for the reconstruction of target modalities (e.g., T2WIs or FS-PDWIs) with slower imaging speeds, known as multi-contrast MRI reconstruction, has been investigated and verified by previous works. For instance, compressed sensing (CS), Bayesian learning, dictionary learning, and graph representation theory have been utilized to accelerate multi-contrast MR imaging [6, 7, 8]. More recently, deep learning has significantly advanced MRI reconstruction due to its powerful feature representation capabilities [5, 9, 7, 10, 11].
In multi-contrast MR images, similar structural features are distributed across different regions within each modality, while complementary information is provided by different modalities. Consequently, the primary challenge of the multi-contrast MRI reconstruction task is to comprehensively explore long-range dependencies within each modality and effectively leverage complementary information from auxiliary modalities. CNN-based methods have been proposed to integrate multi-contrast information [5, 12], but they typically demonstrate local sensitivity and a lack of long-range dependency, thereby limiting their ability to capture global contextual information. In contrast, Transformer-based models [4, 13, 14], distinguished by their large receptive fields and global sensitivity, often surpass CNNs in capturing extensive contextual information. However, these models are burdened by substantial computational overhead due to the quadratic growth of resources with respect to sequence length. Therefore, developing an algorithm that comprehensively integrates complementary information from different modalities without significant computational overhead is crucial.
Recently, the improved structured state-space sequence model with a selective scanning mechanism, Mamba [15], has emerged as a compelling alternative to Transformers, due to its ability to model long-range sequence relationships with linear complexity. Mamba has shown superior performance compared to Transformers in tasks involving long-term dependency modeling, such as natural language processing [16, 17] and medical image segmentation [18, 19]. Investigating Mamba’s potential for long-range dependency modeling and complementary information fusion in multi-contrast MRI reconstruction is highly promising. On the other hand, each component in the frequency domain represents a combination of all the pixel values in the spatial domain, meaning that frequency features capture the overall patterns and structures, providing a global view of the entire image. Meanwhile, Fourier features are beneficial for recovering high-frequency signals that are crucial for addressing image degradation [20]. Therefore, comprehensive and efficient global feature integration across different modalities can be achieved by performing feature fusion in the frequency domain. The Mamba block and the frequency domain thus offer two promising solutions for efficient fusion of multi-contrast information.
Motivated by the above analysis, we propose a novel framework, MMR-Mamba, for multi-contrast MRI reconstruction. Built upon the Mamba architecture, our MMR-Mamba jointly explores the complementary information fusion in the spatial and frequency domains, implemented by the Target modality-guided Cross Mamba (TCM) module and Selective Frequency Fusion (SFF) module, respectively. Additionally, we introduce an Adaptive Spatial-Frequency Fusion (ASFF) module to enhance and integrate the fused features from both domains. Specifically, we first adapt Mamba blocks for feature extraction, leveraging their capability to effectively capture fine-grained relationships and long-range dependencies within MR images. Then we design TCM for spatial domain information fusion, where correlated features from the auxiliary modality are selectively supplemented to the target modality. In the SFF module of the frequency domain, we perform element-wise summation for the phase spectrum and selective integration for the amplitude spectrum, as the phase spectrum of both modalities primarily contains consistent structural information while the amplitude spectrum from different modalities holds incompatible style information. Finally, we adopt the ASFF module to enhance the fused features from both domains, where less informative channels from one domain are supplemented by incorporating the corresponding channel features from the other domain. The ASFF module enables the integration of relevant information and the suppression of redundant features. Our contributions can be summarised as follows:
• We propose MMR-Mamba, an efficient framework for multi-contrast MRI reconstruction. To the best of our knowledge, this is the first exploration of Mamba for integrating complementary information across multi-contrast MR images.
• We design the TCM module in the spatial domain for complementary feature fusion and the SFF module in the frequency domain for global structure information fusion.
• We introduce the ASFF module for spatial-frequency information fusion, enhancing task-relevant features while suppressing irrelevant features from the two domains.
• Extensive experiments on the BraTS and fastMRI datasets validate the effectiveness of our MMR-Mamba framework, demonstrating superior performance compared to existing methods.
II Related Works
II-A Spatial Domain MRI Reconstruction
Benefiting from the complementary information from multi-contrast images, many methods have been proposed to reconstruct high-quality images from low-quality images in the spatial domain, guided by the auxiliary modality images[21, 5, 4, 22]. Previously, Xiang et al.[5] proposed Dense-Unet to reconstruct the T2WIs from the T1WIs and under-sampled T2WIs, with the concatenation of under-sampled T2WIs and T1WIs serving as the input of the network. Xuan et al. [12] further enhanced the performance by introducing a spatial alignment network to compensate for spatial misalignment between multiple modality images.
Recently, regarding the Transformer’s ability in modeling long-range dependencies, several works have explored Transformer-based approaches for multi-contrast MRI reconstruction. MTrans [4] employed a cross-attention module for multi-scale features fusion of the target modality and the auxiliary modality. MD-GraphFormer [22] incorporated the physical constraint of MRI into the network architecture, and modeled the multiple contrasts as graph nodes for joint reconstruction of multi-contrast MR images over the graph. MCCA [13] introduced the hybrid convolutional transformer operation to enrich global and local spatial context representation. DCAMSR [14] proposed a dual cross-attention mechanism, where the features of the reference image and the upsampled input image are extracted and promoted with both spatial and channel attention at multiple resolutions. Despite the promising performance, the CNN and Transformer-based approaches either fall short of fully capturing long-range dependencies or suffer from quadratic complexity. In contrast, our model achieves comprehensive exploration of long-range dependencies without imposing a heavy computational burden.
II-B Dual-Domain MRI Reconstruction
Given that aliasing artifacts in the images are structural and non-local, relying solely on image domain restoration is insufficient to eliminate them and achieve faithful reconstruction results. Consequently, several studies have explored the utilization of both k-space information and spatial domain information for MRI reconstruction [23, 24, 25, 26]. For example, Yang et al. [23] incorporated frequency domain information as additional constraints, formulated as an extra loss term. Similarly, the Fourier-constrained diffusion bridge (FDB) was introduced for diffusion model-based MRI reconstruction [24]. Subsequently, DuDoRNet [25] proposed a novel paradigm that simultaneously recovers k-space and images to accelerate MR imaging through a Dilated Residual Dense Network (DRDNet). Building on DuDoRNet, DuDoCAF [27] adopted a recurrent transformer structure to fuse features from two modalities for cross-modality reconstruction. Liu et al. [28] proposed Faster Fourier Convolution (FasterFC) for 3D MRI reconstruction, directly restoring frequency domain information. Furthermore, FMTNet [29] separately repaired frequency information through a high-frequency learning branch and a low-frequency learning branch, with the two branches concatenated for the final results. McSTRA [30] also designed low-pass and high-pass reconstruction branches for frequency information restoration, and the whole reconstruction iterates between intermediate de-aliasing and data consistency via cascaded Swin-Transformers.
However, current dual-domain methods typically use similar networks for both k-space and image domains, which overlooks the specific characteristics inherent to each domain. Moreover, these methods often rely on straightforward parallel or sequential combinations of two domains, failing to fully exploit their unique features and complementary information.
II-C State Space Models
State Space Models (SSMs) offer a powerful framework for efficiently modeling long sequences and have garnered significant attention [31, 32]. Previously, the Structured State Space Sequence Model (S4) [33] was designed to capture long-range dependencies within sequences by introducing the High-Order Polynomial Projection Operator (HiPPO) [34]. S4 has demonstrated exceptional performance across various benchmarks while reducing computational complexity well below the quadratic complexity of Transformers. More recently, S6, i.e., Mamba [15], was introduced. It employs a selection mechanism for choosing relevant information based on the input and features an efficient hardware-aware algorithm grounded in selective scanning.
Mamba has significantly advanced natural language tasks, surpassing traditional Transformers. Subsequently, Mamba has been extended to various vision tasks. For instance, Vision Mamba (Vim) [35] proposed a pure Mamba-based backbone utilizing bidirectional Mamba blocks, modeling images in a sequence manner through positional embedding, akin to ViT [36]. The Visual State Space Model (VMamba) [37] introduces a cross-scanning mechanism for spatial traversal, converting non-causal visual images into ordered patch sequences. In the medical imaging domain, Mamba has been explored for tasks such as medical image segmentation [18, 19], registration [38], and classification [39, 40]. For MRI reconstruction, Huang et al. [41] first proposed MambaMIR with an Arbitrary-Mask mechanism for medical image reconstruction. However, the complementary information from multi-contrast images is neglected. In this work, we explore Mamba for the integration of features from different modalities, harnessing its capability in modeling long-range dependencies.
III Methodology
![Refer to caption](x1.png)
III-A Preliminaries
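State Space Models. SSMs are typically defined as linear, time-invariant systems that map an input sequence $x(t) \in \mathbb{R}$ to an output sequence $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^{N}$. These systems can be mathematically expressed as the following ordinary differential equation (ODE):
$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t), \tag{1}$$
where $A \in \mathbb{R}^{N \times N}$ denotes the state matrix, $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{1 \times N}$ represent the projection parameters, and $D \in \mathbb{R}$ is a skip connection.
To incorporate SSMs into deep learning algorithms, discretization is indispensable. The system in Eq. 1 is discretized through the zero-order hold (ZOH). After discretization, the system can be written as
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t + D x_t, \tag{2}$$
where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$ are discretized parameters, and $\Delta$ is the discretization step size, which can be seen as the resolution of the continuous input $x(t)$.
Furthermore, Eq. 2 can also be reformulated and computed as the following convolution:
$$\bar{K} = \left(C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B}\right), \qquad y = x * \bar{K} + D x, \tag{3}$$
where $L$ denotes the length of the input sequence and $\bar{K} \in \mathbb{R}^{L}$ is the SSM convolution kernel.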
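The equivalence between the recurrent and convolutional views of a discretized SSM can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a diagonal state matrix (so the matrix exponential reduces to an element-wise exponential) and a zero initial hidden state, and the function names `zoh_discretize`, `ssm_recurrent`, and `ssm_convolutional` are hypothetical.

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order hold for a diagonal state matrix (an assumption of this sketch)."""
    A_bar = np.exp(delta * A_diag)        # exp(delta * A), element-wise for diagonal A
    B_bar = (A_bar - 1.0) / A_diag * B    # (delta*A)^-1 (exp(delta*A) - I) * delta*B
    return A_bar, B_bar

def ssm_recurrent(x, A_bar, B_bar, C, D):
    """Step-by-step recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t + D x_t."""
    h = np.zeros_like(A_bar)              # zero initial state (assumption)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h + D * x_t
    return y

def ssm_convolutional(x, A_bar, B_bar, C, D):
    """Same output computed with the kernel K = (C B_bar, C A_bar B_bar, ...)."""
    L = len(x)
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    return np.convolve(x, K)[:L] + D * x  # causal convolution plus skip connection

rng = np.random.default_rng(0)
N, L = 4, 16
A_diag = -rng.uniform(0.5, 2.0, size=N)   # negative entries keep the system stable
B = rng.standard_normal(N)
C = rng.standard_normal(N)
D, delta = 0.5, 0.1
x = rng.standard_normal(L)

A_bar, B_bar = zoh_discretize(A_diag, B, delta)
y_rec = ssm_recurrent(x, A_bar, B_bar, C, D)
y_conv = ssm_convolutional(x, A_bar, B_bar, C, D)
print(np.allclose(y_rec, y_conv))
```

The recurrence costs one state update per step, while the convolutional form exposes the whole sequence at once; both produce the same output for a time-invariant system.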
Fourier Transform. We briefly review the Fourier transform of images for a better understanding of our work. The Fourier transform (FT) serves as a crucial technique in analyzing the frequency characteristics of an image. Transforming images from the spatial domain to the frequency domain through FT allows us to examine the images from a global perspective. Given an image $x \in \mathbb{R}^{H \times W}$, the FT can be expressed as follows:
$$\mathcal{F}(x)(u,v) = \frac{1}{\sqrt{HW}} \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} x(h,w)\, e^{-j2\pi\left(\frac{h}{H}u + \frac{w}{W}v\right)}, \tag{4}$$
where $u$ and $v$ are coordinates in the Fourier space. The frequency domain feature is represented as $\mathcal{F}(x) = \mathcal{R}(x) + j\mathcal{I}(x)$, where $\mathcal{R}(x)$ and $\mathcal{I}(x)$ denote the real and imaginary parts, respectively. Then the amplitude spectrum $\mathcal{A}(x)$ and phase spectrum $\mathcal{P}(x)$ are defined as:
$$\mathcal{A}(x)(u,v) = \sqrt{\mathcal{R}^2(x)(u,v) + \mathcal{I}^2(x)(u,v)}, \qquad \mathcal{P}(x)(u,v) = \arctan\left[\frac{\mathcal{I}(x)(u,v)}{\mathcal{R}(x)(u,v)}\right]. \tag{5}$$
According to Fourier theory, the amplitude spectrum encodes the style information of an image, whereas the phase spectrum characterizes the structure information [42, 43]. Furthermore, as highlighted in [44, 45], the phase spectrum maintains global structure well and the degradation of images is mainly manifested in the amplitude spectrum.
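As an illustration of the amplitude and phase decomposition reviewed above, the sketch below splits an image into the two spectra with NumPy, rebuilds it, and shows the global property of the Fourier domain: perturbing a single pixel changes every frequency component. The helper names are hypothetical, and `np.fft.fft2` uses its default (unnormalized) scaling, which only rescales the amplitude.

```python
import numpy as np

def amplitude_phase(img):
    """Return the amplitude and phase spectra of a 2D image."""
    spec = np.fft.fft2(img)
    # np.angle is the four-quadrant arctangent of (imag, real)
    return np.abs(spec), np.angle(spec)

def from_amplitude_phase(amp, pha):
    """Rebuild the spatial image from amplitude and phase spectra."""
    return np.fft.ifft2(amp * np.exp(1j * pha)).real

rng = np.random.default_rng(0)
img = rng.random((8, 8))
amp, pha = amplitude_phase(img)
recon = from_amplitude_phase(amp, pha)

# Perturb one pixel and measure how much each frequency component moves.
img2 = img.copy()
img2[3, 5] += 1.0
change = np.abs(np.fft.fft2(img2) - np.fft.fft2(img))
print(np.allclose(recon, img), change.min() > 0)
```

Because the transform of a unit impulse has unit magnitude at every frequency, the single-pixel perturbation shifts all components by the same magnitude.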
III-B Overall Architecture
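Let $x_T$ and $x_A$ be a pair of MR images from the target modality and auxiliary modality. Our objective is to reconstruct the high-quality target modality image $\hat{x}_T$ through a network $f_{\theta}$ that takes the under-sampled target modality image $\tilde{x}_T$ and fully-sampled auxiliary modality image $x_A$ as inputs. Here, $\tilde{x}_T = \mathcal{F}^{-1}(M \odot k_T)$, where $\mathcal{F}^{-1}$ represents the inverse Fourier transform (IFT), $M$ denotes the binary under-sampling mask, and $k_T$ represents the fully-sampled k-space data of the target modality. The multi-contrast reconstruction task can be formulated as the following optimization problem:
$$\hat{\theta} = \arg\min_{\theta}\; \ell\big(f_{\theta}(\tilde{x}_T, x_A),\, x_T\big), \tag{6}$$
where $\theta$ denotes the network parameters and $\ell$ is the reconstruction loss.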
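The under-sampling step that produces the network input can be sketched as follows. This is an illustrative NumPy example rather than the authors' pipeline: the `center_fraction` parameter (the share of always-kept low-frequency columns), the seed, and both function names are assumptions introduced here.

```python
import numpy as np

def cartesian_random_mask(shape, acceleration, center_fraction=0.08, seed=0):
    """1D Cartesian mask: whole k-space columns are either kept or dropped."""
    H, W = shape
    rng = np.random.default_rng(seed)
    num_center = max(1, int(round(W * center_fraction)))
    num_keep = max(num_center, W // acceleration)
    start = (W - num_center) // 2
    keep = np.zeros(W, dtype=bool)
    keep[start:start + num_center] = True            # always keep the central band
    remaining = np.flatnonzero(~keep)
    extra = rng.choice(remaining, size=num_keep - num_center, replace=False)
    keep[extra] = True                               # random columns elsewhere
    return np.broadcast_to(keep, (H, W)).copy()

def zero_filled(image, mask):
    """Mask the centered k-space and apply the inverse Fourier transform."""
    k_full = np.fft.fftshift(np.fft.fft2(image))
    k_under = mask * k_full
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k_under)))

image = np.random.default_rng(1).random((64, 64))
mask = cartesian_random_mask(image.shape, acceleration=4)
x_under = zero_filled(image, mask)
print(mask[0].sum(), x_under.shape)
```

With a 64-column image and 4× acceleration the mask keeps 16 columns; the zero-filled result is the aliased image that a reconstruction network would take as input together with the auxiliary image.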
To fully harness and fuse the information from different modalities, we design the MMR-Mamba network, depicted in Fig. 1. Initially, we adopt CNN blocks for low-level feature extraction and Mamba blocks for high-level feature extraction, producing modality-specific features $F_T$ and $F_A$ from the under-sampled target image $\tilde{x}_T$ and the auxiliary image $x_A$, respectively:
$$F_T = \mathrm{Mamba}(\mathrm{CNN}(\tilde{x}_T)), \qquad F_A = \mathrm{Mamba}(\mathrm{CNN}(x_A)). \tag{7}$$
Subsequently, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain and Selective Frequency Fusion (SFF) module in the frequency domain, generating domain-specific fused features $F_{spa}$ and $F_{fre}$:
$$F_{spa} = \mathrm{TCM}(F_T, F_A), \qquad F_{fre} = \mathrm{SFF}(F_T, F_A). \tag{8}$$
Lastly, the multi-contrast features in these two domains are further integrated through an Adaptive Spatial-Frequency Fusion (ASFF) module to generate the final fused feature, which is then fed into a CNN decoder to reconstruct the target modality image $\hat{x}_T$:
$$\hat{x}_T = \mathrm{Decoder}(\mathrm{ASFF}(F_{spa}, F_{fre})). \tag{9}$$
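The model is trained by minimizing the loss between the reconstructed image $\hat{x}_T$ and the ground-truth image $x_T$:
$$\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N} \ell\big(\hat{x}_T^{(n)},\, x_T^{(n)}\big), \tag{10}$$
where $N$ is the number of training samples and $\ell$ denotes the per-sample reconstruction loss.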
III-C Target-guided Cross Mamba
For the extracted target modality feature $F_T$ and auxiliary modality feature $F_A$, complementary information is distributed across different regions, making efficient integration crucial for reconstructing high-quality images from under-sampled data. Existing fusion methods based on CNNs and Transformers face challenges due to their restricted capability in handling long-range dependencies or their high computational requirements. To address this, we leverage state space models for target and auxiliary feature fusion, benefiting from their ability to model long-range dependencies. Inspired by the cross-modal Mamba in [46], we design the Target-guided Cross Mamba (TCM) module to enrich the target modality features with complementary information from auxiliary features.
Specifically, we derive $\bar{F}_T$ and $\bar{F}_A$ from the target modality feature $F_T$ and auxiliary modality feature $F_A$ using a normalization layer and a linear layer:
$$\bar{F}_T = \mathrm{Linear}(\mathrm{Norm}(F_T)), \qquad \bar{F}_A = \mathrm{Linear}(\mathrm{Norm}(F_A)). \tag{11}$$
Then $\bar{F}_T$ and $\bar{F}_A$ are projected into the hidden state space through a one-dimensional convolution with the SiLU activation function and an SSM without gating:
$$H_T = \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1d}(\bar{F}_T))), \qquad H_A = \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1d}(\bar{F}_A))). \tag{12}$$
After that, to maximally restore the target modality information and selectively integrate the auxiliary information, we adopt a gate $G_T$ derived from the target modality to modulate the hidden state features $H_T$ and $H_A$, and implement the hidden state feature fusion as
$$F_{spa} = G_T \odot H_T + G_T \odot H_A, \tag{13}$$
where $\odot$ represents the element-wise product. In this way, we build the TCM module in a hidden state space based on the gating mechanism. The gating parameters are derived from the target modality, ensuring that the target modality predominates in the fusion process and selectively incorporates complementary information from the auxiliary modality.
In our framework, we stack four TCM modules, incorporating residual connections to preserve the integrity of target features. The fused feature $F_{spa}$ is subsequently forwarded to the spatial-frequency fusion module to facilitate the reconstruction of target images.
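The target-guided gating idea can be illustrated in a few lines of NumPy. This is only a sketch under stated assumptions: the hidden features are stand-in arrays rather than outputs of a real SSM, the SiLU activation on the gate follows the standard Mamba block and is an assumption here, and `target_gated_fusion` is a hypothetical name.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def target_gated_fusion(h_t, h_a, gate_src):
    """Modulate both hidden features with one gate computed from the target branch."""
    gate = silu(gate_src)                 # gate form is an assumption of this sketch
    return gate * h_t + gate * h_a        # element-wise products, then summation

rng = np.random.default_rng(0)
L, C = 16, 8                              # sequence length and channels (illustrative)
h_t = rng.standard_normal((L, C))         # stand-in for the target hidden feature
h_a = rng.standard_normal((L, C))         # stand-in for the auxiliary hidden feature
g = rng.standard_normal((L, C))           # stand-in for the target-derived projection

fused = target_gated_fusion(h_t, h_a, g)
closed = target_gated_fusion(h_t, h_a, np.zeros((L, C)))
print(fused.shape, np.allclose(closed, 0.0))
```

Because the same target-derived gate multiplies both branches, positions where the gate is closed pass neither target nor auxiliary information, so the auxiliary modality can never contribute where the target branch suppresses it.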
III-D Selective Frequency Fusion
To further fuse the complementary information from the extracted features in a broader view, we resort to the frequency domain (i.e., Fourier domain), where each frequency component corresponds to all the pixels in the spatial domain, inherently capturing global properties. Additionally, as highlighted in [44, 45], the phase spectrum in the Fourier domain preserves global structure well, while the image degradation mainly manifests in the amplitude spectrum. To comprehensively integrate the complementary information and to restore the degraded features in the amplitude spectrum, we propose Selective Frequency Fusion (SFF) module within the frequency domain, illustrated in Fig. 2.
Given the extracted features $F_T$ and $F_A$ from previous modules, we first transform them to the frequency domain through the Fourier transform and obtain their amplitude spectra and phase spectra:
$$\mathcal{A}_T = \mathcal{A}(F_T),\;\; \mathcal{P}_T = \mathcal{P}(F_T), \qquad \mathcal{A}_A = \mathcal{A}(F_A),\;\; \mathcal{P}_A = \mathcal{P}(F_A), \tag{14}$$
where $\mathcal{A}(\cdot)$ and $\mathcal{P}(\cdot)$ denote the amplitude and phase of the Fourier transform of a feature.
For the phase spectrum fusion, we perform element-wise addition on $\mathcal{P}_T$ and $\mathcal{P}_A$, as both contain crucial and consistent structure information [47, 48]. Regarding the amplitude spectrum, it encapsulates style information, which varies significantly across different modalities. Moreover, the amplitude of the under-sampled low-quality images contains interference information that can negatively impact the final reconstruction. Direct concatenation of the amplitude spectra from these modalities would introduce incompatible and interference information, compromising the quality of the reconstructed image [47, 48].
![Refer to caption](x2.png)
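To mitigate this issue, we adopt a selective strategy to fuse the amplitude spectrum. Specifically, we design a Selective Amplitude Harmonization Module (SAHM), wherein the amplitude spectrum across modalities is dynamically adjusted based on global statistics. At first, we conduct element-wise addition of the target and auxiliary amplitude spectra $\mathcal{A}_T$ and $\mathcal{A}_A$ to produce an intermediate feature $U \in \mathbb{R}^{C \times H \times W}$. Next, we adopt global average pooling to generate channel-wise statistics $s \in \mathbb{R}^{C}$, embedding global information. Specifically, the $c$-th element of $s$ is calculated by shrinking $U$ through spatial dimensions $H \times W$. For computational efficiency, a fully connected (FC) layer is utilized to create a compact feature $z \in \mathbb{R}^{d}$, which is further used to guide the adaptive selection. The operation can be expressed as follows:
$$z = \delta(\mathcal{B}(\mathbf{W}s)), \tag{15}$$
where $\delta$ is the ReLU activation function, $\mathcal{B}$ represents batch normalization, and $\mathbf{W} \in \mathbb{R}^{d \times C}$, with $d$ being the number of channels after dimension reduction.
To dynamically select different scales of amplitude spectrum from the two modalities, the compact feature descriptor $z$ is used to compute a soft attention score across channels. To achieve this, softmax is performed on the channel-wise digits:
$$a_c = \frac{e^{\mathbf{V}_{T,c}\, z}}{e^{\mathbf{V}_{T,c}\, z} + e^{\mathbf{V}_{A,c}\, z}}, \qquad b_c = \frac{e^{\mathbf{V}_{A,c}\, z}}{e^{\mathbf{V}_{T,c}\, z} + e^{\mathbf{V}_{A,c}\, z}}, \tag{16}$$
where $a$ and $b$ represent the soft attention vectors for $\mathcal{A}_T$ and $\mathcal{A}_A$, and $\mathbf{V}_{T,c}, \mathbf{V}_{A,c} \in \mathbb{R}^{1 \times d}$ are the $c$-th rows of the corresponding FC weights. At last, the fused amplitude spectrum $\mathcal{A}_{fuse}$ is obtained by a weighted sum of $\mathcal{A}_T$ and $\mathcal{A}_A$:
$$\mathcal{A}_{fuse,c} = a_c \cdot \mathcal{A}_{T,c} + b_c \cdot \mathcal{A}_{A,c}, \qquad a_c + b_c = 1. \tag{17}$$
To further enhance the fused amplitude spectrum $\mathcal{A}_{fuse}$ and the fused phase spectrum $\mathcal{P}_{fuse}$ (the element-wise sum of the target and auxiliary phase spectra), we employ two groups of independent operations $\phi_{\mathcal{A}}$ and $\phi_{\mathcal{P}}$. Each group consists of convolution layers and a ReLU activation function. The final restored amplitude spectrum $\mathcal{A}'$ and phase spectrum $\mathcal{P}'$ are obtained by
$$\mathcal{A}' = \phi_{\mathcal{A}}(\mathcal{A}_{fuse}), \qquad \mathcal{P}' = \phi_{\mathcal{P}}(\mathcal{P}_{fuse}). \tag{18}$$
Finally, the restored amplitude spectrum and phase spectrum are converted to the spatial domain through the inverse Fourier transform:
$$F_{fre} = \mathcal{F}^{-1}\big(\mathcal{A}' e^{j\mathcal{P}'}\big), \tag{19}$$
where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. Through the FFT and selective fusion in the Fourier domain, the SFF module efficiently achieves global feature fusion and recovers high-frequency signals for structure details.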
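The frequency-domain fusion steps described in this section (phase addition, pooled channel statistics, softmax selection between the two amplitude spectra, inverse transform) can be sketched in NumPy as below. This is a simplified illustration, not the authors' module: batch normalization and the convolutional refinement of the fused spectra are omitted, the weight matrices are random stand-ins for learned FC layers, and all names are hypothetical.

```python
import numpy as np

def selective_frequency_fusion(f_t, f_a, W_reduce, V_t, V_a):
    """f_t, f_a: (C, H, W) real features. W_reduce: (d, C). V_t, V_a: (C, d)."""
    spec_t, spec_a = np.fft.fft2(f_t), np.fft.fft2(f_a)   # FFT over the last two axes
    amp_t, pha_t = np.abs(spec_t), np.angle(spec_t)
    amp_a, pha_a = np.abs(spec_a), np.angle(spec_a)

    pha = pha_t + pha_a                           # phase: element-wise addition
    s = (amp_t + amp_a).mean(axis=(1, 2))         # global average pooling -> (C,)
    z = np.maximum(W_reduce @ s, 0.0)             # FC + ReLU (batch norm omitted)

    logits = np.stack([V_t @ z, V_a @ z])         # (2, C): one logit per modality
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)             # softmax across the two modalities
    a, b = w[0], w[1]

    amp = a[:, None, None] * amp_t + b[:, None, None] * amp_a
    fused = np.fft.ifft2(amp * np.exp(1j * pha)).real     # back to the spatial domain
    return fused, a, b

rng = np.random.default_rng(0)
C, H, W, d = 4, 8, 8, 2
f_t = rng.standard_normal((C, H, W))
f_a = rng.standard_normal((C, H, W))
fused, a, b = selective_frequency_fusion(
    f_t, f_a,
    rng.standard_normal((d, C)),
    rng.standard_normal((C, d)),
    rng.standard_normal((C, d)),
)
print(fused.shape, np.allclose(a + b, 1.0))
```

The per-channel weights always sum to one, so each channel of the fused amplitude is a convex combination of the two modalities rather than a concatenation of both.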
![Refer to caption](x3.png)
III-E Adaptive Spatial-Frequency Fusion
To further enhance the fused spatial domain feature $F_{spa}$ and fused frequency domain feature $F_{fre}$, and to facilitate the efficient integration of these features, we introduce the Adaptive Spatial-Frequency Fusion (ASFF) module, as illustrated in Fig. 3. The ASFF module allows the two domains to mutually complement each other through channel-wise adaptive integration according to the informativeness of the channel, which is measured by the Batch-Normalization (BN) layer. Previous studies [49, 50] have shown that the BN scaling factor indicates channel importance, with smaller values signifying lesser relevance to the final outcomes. Consequently, we propose that channels with diminished significance in one domain can be reinforced by incorporating corresponding channel information from the other domain.
Firstly, we measure the channel-wise informativeness of the fused spatial domain feature and frequency feature through the BN layer. Denoting the $c$-th channel of the features as $F_{spa,c}$ and $F_{fre,c}$, the BN is computed as follows:
$$F'_{spa,c} = \gamma_{spa,c}\,\frac{F_{spa,c} - \mu_{spa,c}}{\sqrt{\sigma_{spa,c}^{2} + \epsilon}} + \beta_{spa,c}, \qquad F'_{fre,c} = \gamma_{fre,c}\,\frac{F_{fre,c} - \mu_{fre,c}}{\sqrt{\sigma_{fre,c}^{2} + \epsilon}} + \beta_{fre,c}, \tag{20}$$
where $\gamma_{spa,c}$ and $\gamma_{fre,c}$ are trainable scaling factors; $\beta_{spa,c}$ and $\beta_{fre,c}$ are trainable offsets; $\mu$ and $\sigma^{2}$ are the mean and variance of the corresponding channel; and $\epsilon$ is a small constant to avoid division by zero.
The factors $\gamma_{spa,c}$ and $\gamma_{fre,c}$ evaluate the correlation between the input and the output after normalization during training. If $\gamma_{spa,c}$ or $\gamma_{fre,c}$ approaches 0, the gradient of the loss with respect to $F_{spa,c}$ or $F_{fre,c}$ will be close to 0. This implies that $F_{spa,c}$ or $F_{fre,c}$ will have minimal influence on the final results. Given this insight, we propose to enhance the channels with small scaling factors in one domain by incorporating the corresponding channels from the other domain. The incorporation is performed as follows:
(21) |
(22) |
where $\odot$ denotes element-wise multiplication, and $\theta_{spa}$ and $\theta_{fre}$ are the thresholds, which we set according to the maximum and the minimum of the scaling factors as follows:
(23) | |||
where $\lambda$ is a constant, which we set to 0.1 in our experiments.
Through the ASFF module, less informative features from one domain are enhanced by the corresponding features from the other, while redundant information is effectively eliminated. This process ensures that each channel contributes optimally to the final fused features. By leveraging the most relevant information from spatial and frequency domains, the ASFF module produces feature representations that are most conducive to generating high-quality target images. After the channel-wise incorporation, the enhanced features $\hat{F}_{spa}$ and $\hat{F}_{fre}$ are concatenated and fed to the CNN decoder to recover the reconstructed target modality image $\hat{x}_T$:
$$\hat{x}_T = \mathrm{Decoder}\big(\mathrm{Concat}(\hat{F}_{spa}, \hat{F}_{fre})\big). \tag{24}$$
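The reason a small BN scaling factor marks a channel as uninformative can be seen directly from the BN formula: when the factor is zero, the channel output collapses to its offset regardless of the input. The NumPy sketch below demonstrates this with training-mode statistics; the function name and the example values are illustrative assumptions, and the channel incorporation step itself is not reproduced here.

```python
import numpy as np

def batch_norm_channel(x, gamma, beta, eps=1e-5):
    """x: (B, C, H, W). Per-channel statistics over batch and spatial axes."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
gamma = np.array([1.0, 0.0, 0.5])     # channel 1 has a zero scaling factor
beta = np.array([0.0, 0.2, -0.1])
y = batch_norm_channel(x, gamma, beta)

# Channel 1 carries no input-dependent information: it is constant at beta.
print(np.allclose(y[:, 1], 0.2), round(float(y[:, 0].std()), 3))
```

A channel whose output does not depend on its input contributes nothing to later layers, which is why supplementing it with the corresponding channel from the other domain is a reasonable use of that capacity.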
IV Experiments and Results
IV-A Dataset Description
In the study, two datasets with different anatomical structures and protocols are adopted for evaluation, including the BraTS dataset [51] and fastMRI knee dataset [52]. The BraTS Dataset contains both T1WIs and T2WIs scans of the brain. We extract 2D images uniformly from 100 3D MRI volumes in the BraTS dataset. The dataset is split subject-wise with a 3:1 ratio, yielding a total of 3,621 images for training and 1,088 images for testing. The 2D image size is . In our experiments, we adopt T1WI as the auxiliary modality for the reconstruction of the T2WI modality. The fastMRI Dataset is the largest public MRI dataset with raw k-space data. Following [53], 227 and 45 pairs of single-coil PDWI and FS-PDWI knee volumes are selected for training and testing, respectively, resulting in a total of 8,332 pairs of 2D images for training and 1,665 images for testing. The 2D image size is . In our experiments, we adopt PDWI as the auxiliary modality for the reconstruction of the FS-PDWI modality.
IV-B Experimental Setup
Comparison Methods. To demonstrate the effectiveness of our proposed MMR-Mamba, we compare it against seven multi-contrast reconstruction methods: MDUNet [5], MINet [54], MCCA [13], MTrans [4], DCAMSR [14], SwinIR [55], and Pan-Mamba [46]. For a fair comparison, we run all the models for iterations with a batch size of 4. Other settings of the compared methods followed the implementations provided in their original papers.
Performance Metrics. For quantitative evaluation, we assess the image reconstruction results using peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and normalized mean squared error (NMSE). For qualitative evaluation, we visualize the reconstruction results with error maps.
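For reference, PSNR and NMSE can be computed as in the sketch below, using their standard definitions. The choice of the target maximum as the PSNR data range is an assumption (evaluation code bases differ on this), the function names are hypothetical, and SSIM is left out because it needs a windowed implementation.

```python
import numpy as np

def psnr(recon, target, data_range=None):
    """Peak signal-to-noise ratio in dB."""
    if data_range is None:
        data_range = target.max()     # assumption: peak value taken from the target
    mse = np.mean((recon - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def nmse(recon, target):
    """Squared error normalized by the energy of the target."""
    return np.sum((recon - target) ** 2) / np.sum(target ** 2)

target = np.ones((4, 4))
recon = target + 0.1                  # uniform error of 0.1 on a unit-valued image
print(round(float(psnr(recon, target)), 2), round(float(nmse(recon, target)), 4))
```

Higher PSNR and lower NMSE both indicate a reconstruction closer to the ground truth; the two differ in that PSNR is referenced to the peak intensity while NMSE is referenced to the total signal energy.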
Implementation Details. The proposed framework is developed with PyTorch, and the training and testing processes are executed on one NVIDIA RTX A6000 GPU (48GB). We employ an AdamW optimizer with an initial learning rate of and a weight decay of for the training. We run iterations with a batch size of 4 for both datasets. For both datasets, the undersampled images are obtained by applying a 1D Cartesian random under-sampling mask with 4× and 8× acceleration.
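Table I: PSNR, SSIM, and NMSE results on the BraTS and fastMRI datasets under 4× and 8× acceleration.

BraTS

Method | Year | PSNR (4×) | SSIM (4×) | NMSE (4×) | PSNR (8×) | SSIM (8×) | NMSE (8×) |
---|---|---|---|---|---|---|---|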
Zero-filling [56] | JMRI’01 | 30.11±1.53 | 0.767±0.038 | 5.659±0.018 | 26.58±1.49 | 0.673±0.036 | 12.469±0.033 |
MDUNet [5] | TBME’18 | 37.94±1.66 | 0.975±0.006 | 0.905±0.003 | 35.19±1.64 | 0.960±0.009 | 1.690±0.005 |
MINet [54] | MICCAI’21 | 38.26±1.74 | 0.976±0.006 | 0.847±0.003 | 35.23±1.72 | 0.961±0.009 | 1.697±0.005 |
SwinIR [55] | CVPR’21 | 37.87±1.73 | 0.974±0.006 | 0.926±0.003 | 34.95±1.72 | 0.960±0.009 | 1.805±0.005 |
MTrans [4] | TMI’22 | 36.02±1.67 | 0.962±0.007 | 1.429±0.005 | 34.81±1.57 | 0.957±0.009 | 1.868±0.005 |
MCCA [13] | JBHI’23 | 38.03±1.68 | 0.975±0.006 | 0.903±0.003 | 35.37±1.66 | 0.962±0.009 | 1.633±0.005 |
DCAMSR [14] | MICCAI’23 | 38.60±1.75 | 0.978±0.006 | 1.193±0.004 | 35.99±1.74 | 0.965±0.009 | 1.417±0.005 |
Pan-Mamba [46] | Arxiv’24 | 38.84±1.79 | 0.978±0.006 | 0.739±0.273 | 36.18±1.77 | 0.966±0.009 | 1.363±0.466 |
MMR-Mamba (Ours) | — | 40.98±1.88 | 0.985±0.005 | 0.454±0.190 | 37.75±1.85 | 0.974±0.008 | 0.955±0.358 |

fastMRI

Method | Year | PSNR (4×) | SSIM (4×) | NMSE (4×) | PSNR (8×) | SSIM (8×) | NMSE (8×) |
---|---|---|---|---|---|---|---|
Zero-filling [56] | JMRI’01 | 27.68±1.75 | 0.571±0.061 | 5.020±0.013 | 25.64±1.64 | 0.454±0.070 | 7.719±0.02 |
MDUNet [5] | TBME’18 | 28.60±1.00 | 0.600±0.050 | 4.000±0.010 | 27.90±0.86 | 0.544±0.050 | 4.600±0.02 |
MINet [54] | MICCAI’21 | 29.47±1.88 | 0.639±0.069 | 3.334±0.009 | 28.17±1.74 | 0.563±0.081 | 4.327±0.01 |
SwinIR [55] | CVPR’21 | 29.42±1.87 | 0.636±0.069 | 3.382±0.009 | 28.09±1.74 | 0.560±0.081 | 4.407±0.01 |
MTrans [4] | TMI’22 | 29.00±1.79 | 0.619±0.068 | 3.699±0.009 | 27.31±1.68 | 0.526±0.081 | 5.254±0.01 |
MCCA [13] | JBHI’23 | 29.46±1.87 | 0.637±0.069 | 3.346±0.009 | 28.23±1.75 | 0.562±0.081 | 4.275±0.01 |
DCAMSR [14] | MICCAI’23 | 29.45±1.87 | 0.637±0.068 | 3.349±0.009 | 28.42±1.79 | 0.569±0.081 | 4.098±0.01 |
Pan-Mamba [46] | Arxiv’24 | 29.59±1.87 | 0.645±0.068 | 3.248±0.928 | 28.36±1.76 | 0.570±0.081 | 4.159±1.034 |
MMR-Mamba (Ours) | — | 29.66±1.88 | 0.647±0.068 | 3.201±0.927 | 28.44±1.76 | 0.572±0.081 | 4.084±1.036 |
IV-C Experimental Results
In this section, we evaluate the proposed method against state-of-the-art techniques under various experimental settings.
Quantitative Results. In Table I, we report the PSNR, SSIM, and NMSE results on both datasets for 4× and 8× acceleration. Firstly, focusing on the BraTS dataset in the upper part of the table, our method achieves the best reconstruction results, with a PSNR of 40.98 dB and SSIM of 0.985 under 4× acceleration, and a PSNR of 37.75 dB and SSIM of 0.974 under 8× acceleration. These results highlight our model’s efficacy in fusing information from multi-contrast images for reconstructing target images. Additionally, our method outperforms the second-best method, Pan-Mamba, by 2.14 dB in PSNR under 4× acceleration and by 1.57 dB in PSNR under 8× acceleration. The fact that Pan-Mamba achieves the second-best results further highlights the effectiveness of Mamba in MRI reconstruction. Similarly, as shown in the lower part of the table, our method achieves the best reconstruction results on the fastMRI knee dataset, with a PSNR of 29.66 dB and SSIM of 0.647 under 4× acceleration and a PSNR of 28.44 dB and SSIM of 0.572 under 8× acceleration. This further demonstrates the effectiveness and robustness of our model in reconstructing high-quality MR images.
Table II: Ablation study of the proposed modules on the BraTS dataset under 4× and 8× acceleration.

Model | TCM | SFF | ASFF | PSNR (4×) | SSIM (4×) | NMSE (4×) | PSNR (8×) | SSIM (8×) | NMSE (8×) |
---|---|---|---|---|---|---|---|---|---|
a | ✗ | ✗ | ✗ | 38.45±1.75 | 0.977±0.006 | 0.809±0.294 | 35.72±1.72 | 0.964±0.009 | 1.512±0.499 |
b | ✓ | ✗ | ✗ | 39.05±1.79 | 0.979±0.006 | 0.706±0.266 | 36.17±1.80 | 0.967±0.009 | 1.368±0.476 |
c | ✗ | ✓ | ✗ | 40.49±1.84 | 0.984±0.005 | 0.509±0.205 | 37.22±1.79 | 0.971±0.008 | 1.078±0.392 |
d | ✓ | ✓ | ✗ | 40.66±1.86 | 0.985±0.005 | 0.489±0.201 | 37.50±1.84 | 0.973±0.008 | 1.012±0.374 |
e | ✓ | ✓ | ✓ | 40.98±1.88 | 0.985±0.005 | 0.454±0.190 | 37.75±1.85 | 0.974±0.008 | 0.955±0.358 |

Qualitative Results. To better evaluate the reconstruction quality, we visualize the outputs of different methods on the BraTS and fastMRI datasets under 4× and 8× acceleration in Fig. 4, along with their corresponding error maps. These maps illustrate the discrepancies between the reconstructed and ground truth images, with blue indicating minimal error and red indicating higher error. The error maps reveal that zero-filled reconstructions exhibit pronounced artifacts, with error levels increasing as the acceleration factor rises from 4× to 8×. Notably, our method consistently shows the lowest reconstruction error across both datasets and acceleration factors. This observation indicates that our method better preserves essential anatomical details and consistently produces high-quality reconstructed images.
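The zero-filled baseline and the error maps discussed above can be illustrated with a short sketch. This is only an illustration under stated assumptions: `cartesian_mask` is a hypothetical equispaced column mask with a small fully-sampled center band, which may differ from the sampling patterns actually used in the experiments, and the error map is taken as a plain absolute difference.

```python
import numpy as np


def cartesian_mask(shape, acceleration, center_fraction=0.08):
    """Hypothetical 1D Cartesian mask: every `acceleration`-th column
    plus a fully-sampled band around the k-space center."""
    height, width = shape
    mask = np.zeros(shape)
    mask[:, ::acceleration] = 1.0
    n_center = max(1, int(round(width * center_fraction)))
    start = (width - n_center) // 2
    mask[:, start:start + n_center] = 1.0
    return mask


def zero_filled_reconstruction(image, mask):
    """Under-sample the centered k-space of `image` with `mask`,
    leave the missing samples at zero, and invert the transform."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    undersampled = kspace * mask
    return np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))


def error_map(reference, reconstruction):
    """Pixel-wise absolute error between reference and reconstruction."""
    return np.abs(reference - reconstruction)
```

A typical use is `error_map(image, zero_filled_reconstruction(image, cartesian_mask(image.shape, 4)))`, which can then be rendered with a blue-to-red colormap to obtain a visualization comparable in spirit to the error maps in the figure.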
![Refer to caption](x4.png)
IV-D Ablation Studies
In this section, we first conduct an ablation study to assess the efficacy of the proposed modules. Then we analyze different fusion strategies for both spatial and frequency domains.
IV-D1 Effectiveness of the Proposed Modules
To verify the validity of the proposed modules, we conducted ablation experiments on the BraTS dataset under 4× and 8× acceleration. The configurations tested are denoted as follows: (a) a baseline model with all proposed modules removed; (b) the baseline model enhanced with the TCM; (c) the baseline model enhanced with the SFF; (d) the model augmented with both the TCM and SFF; and (e) the complete model incorporating all proposed modules. The results of these experiments are reported in Table II. As shown, incorporating the TCM improves the PSNR from 38.45 dB to 39.05 dB under 4× acceleration, while incorporating the SFF improves the PSNR to 40.49 dB. Combining the TCM and SFF modules further boosts the PSNR to 40.66 dB. Finally, including the ASFF raises the PSNR to 40.98 dB. Similar trends are observed under 8× acceleration. Visual results on the BraTS dataset, including error maps, are shown in Fig. 5. The TCM module alone reduces the overall error, yet some structural details remain unrecovered, as indicated by the yellow box. Conversely, the SFF module effectively restores structural details but exhibits larger errors in the lateral ventricle areas, highlighted by the red box. When both the TCM and SFF modules are employed, most regions and structures are adequately restored. Ultimately, integrating all proposed modules yields the best results. The quantitative and visual results of our ablation experiments verify the effectiveness of each proposed module in enhancing the performance of multi-contrast MRI reconstruction.
![Refer to caption](x5.png)
IV-D2 Analysis on Spatial Domain Fusion
To evaluate the effectiveness of our proposed spatial domain fusion module, we design three ablation experiments: (1) fusion through element-wise addition, (2) fusion through cross-attention, and (3) fusion through our proposed TCM module, denoted as ‘Sum’, ‘CA’, and ‘TCM’ in Table III, respectively. The results indicate that element-wise addition of features from the two modalities leads to the worst results. In contrast, our TCM module achieves the best performance, outperforming ‘Sum’ and ‘CA’ under both 4× and 8× acceleration while using the fewest parameters.
Acceleration | Method | PSNR | SSIM | NMSE | Param (M) |
---|---|---|---|---|---|
4× | Sum | 38.56±1.74 | 0.977±0.006 | 0.789±0.285 | 0.90 |
4× | CA | 38.76±1.76 | 0.978±0.006 | 0.754±0.278 | 110.42 |
4× | TCM | 39.05±1.79 | 0.979±0.006 | 0.706±0.266 | 0.86 |
8× | Sum | 35.82±1.74 | 0.965±0.009 | 1.481±0.492 | 0.90 |
8× | CA | 35.92±1.74 | 0.965±0.009 | 1.447±0.492 | 110.42 |
8× | TCM | 36.17±1.80 | 0.967±0.009 | 1.368±0.476 | 0.86 |
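To make the two baseline fusion strategies concrete, the sketch below contrasts element-wise addition (‘Sum’) with a single-head cross-attention fusion (‘CA’) on flattened feature tokens. It is a simplified illustration, not the configuration used in the experiments: the token layout, the single attention head, the residual connection, and the projection matrices `w_q`, `w_k`, and `w_v` are all assumptions. The TCM module itself is not sketched here.

```python
import numpy as np


def sum_fusion(target_feat, aux_feat):
    """'Sum' baseline: element-wise addition of the two modalities."""
    return target_feat + aux_feat


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fusion(target_feat, aux_feat, w_q, w_k, w_v):
    """Single-head cross-attention sketch on (N, C) token arrays.

    Queries come from the target modality; keys and values come from
    the auxiliary modality. The (N, N) score matrix is what makes the
    cost grow quadratically with the number of tokens N.
    """
    queries = target_feat @ w_q
    keys = aux_feat @ w_k
    values = aux_feat @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # shape (N, N)
    attention = softmax(scores, axis=-1)
    # Residual connection (an assumption) keeps the target features intact.
    return target_feat + attention @ values
```

Element-wise addition treats every auxiliary feature as equally useful, whereas cross-attention weights auxiliary tokens per target token at the price of the quadratic score matrix.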
IV-D3 Analysis on Frequency Domain Fusion
To verify the effectiveness of the proposed selective frequency fusion module, we design the following experiments: (1) fusion through element-wise addition and (2) fusion through our proposed SFF module, denoted as ‘Sum’ and ‘SFF’ in Table IV, respectively. The results show that element-wise addition of features from the two modalities yields suboptimal performance. In contrast, our proposed SFF module enhances performance, achieving a 0.43 dB improvement in PSNR under 4× acceleration and a 0.21 dB improvement under 8× acceleration.
V Conclusion
This study explores the comprehensive and efficient integration of complementary information across modalities for multi-contrast MRI reconstruction. We present the MMR-Mamba framework, which integrates information through the TCM in the spatial domain and the SFF in the frequency domain, and fuses the resulting spatial-frequency features through the ASFF module. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed method in reconstructing MR images under different acceleration factors. This work contributes valuable insights for advancing research in multi-contrast MRI reconstruction, particularly in optimizing information fusion across spatial and frequency domains.
Acceleration | Method | PSNR | SSIM | NMSE |
---|---|---|---|---|
4× | Sum | 40.06±1.79 | 0.982±0.005 | 0.560±0.002 |
4× | SFF | 40.49±1.84 | 0.984±0.005 | 0.509±0.002 |
8× | Sum | 37.01±1.77 | 0.970±0.008 | 1.128±0.004 |
8× | SFF | 37.22±1.79 | 0.971±0.008 | 1.078±0.004 |
References
- [1] E. Stoja, S. Konstandin, D. Philipp, R. N. Wilke, D. Betancourt, T. Bertuch, J. Jenne, R. Umathum, and M. Günther, “Improving magnetic resonance imaging with smart and thin metasurfaces,” Scientific reports, vol. 11, no. 1, p. 16179, 2021.
- [2] E. Plenge, D. H. Poot, M. Bernsen, G. Kotek, G. Houston, P. Wielopolski, L. van der Weerd, W. J. Niessen, and E. Meijering, “Super-resolution methods in mri: can they improve the trade-off between resolution, signal-to-noise ratio, and acquisition time?” Magnetic resonance in medicine, vol. 68, no. 6, pp. 1983–1993, 2012.
- [3] P. Guo, Y. Mei, J. Zhou, S. Jiang, and V. M. Patel, “Reconformer: Accelerated mri reconstruction using recurrent transformer,” IEEE Trans. Med. Imaging, 2023.
- [4] C.-M. Feng, Y. Yan, G. Chen, Y. Xu, Y. Hu, L. Shao, and H. Fu, “Multi-modal transformer for accelerated mr imaging,” IEEE Trans. Med. Imaging, 2022.
- [5] L. Xiang, Y. Chen, W. Chang, Y. Zhan, W. Lin, Q. Wang, and D. Shen, “Deep-learning-based multi-modal fusion for fast mr reconstruction,” IEEE Trans. Biomed. Eng., vol. 66, no. 7, pp. 2105–2114, 2018.
- [6] B. Bilgic, V. K. Goyal, and E. Adalsteinsson, “Multi-contrast reconstruction with bayesian compressed sensing,” Magnetic resonance in medicine, vol. 66, no. 6, pp. 1601–1615, 2011.
- [7] P. Song, L. Weizman, J. F. Mota, Y. C. Eldar, and M. R. Rodrigues, “Coupled dictionary learning for multi-contrast mri reconstruction,” IEEE Trans. Med. Imaging, vol. 39, no. 3, pp. 621–633, 2019.
- [8] Z. Lai, X. Qu, H. Lu, X. Peng, D. Guo, Y. Yang, G. Guo, and Z. Chen, “Sparse mri reconstruction using multi-contrast image guided graph representation,” Magnetic resonance imaging, vol. 43, pp. 95–104, 2017.
- [9] L. Sun, Z. Fan, X. Fu, Y. Huang, X. Ding, and J. Paisley, “A deep information sharing network for multi-contrast compressed sensing mri reconstruction,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 6141–6153, 2019.
- [10] J. Lyu, G. Li, C. Wang, C. Qin, S. Wang, Q. Dou, and J. Qin, “Region-focused multi-view transformer-based generative adversarial network for cardiac cine mri reconstruction,” Med. Image Anal., vol. 85, p. 102760, 2023.
- [11] G. Li, J. Lv, Y. Tian, Q. Dou, C. Wang, C. Xu, and J. Qin, “Transformer-empowered multi-scale contextual matching and aggregation for multi-contrast mri super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 20 636–20 645.
- [12] K. Xuan, L. Xiang, X. Huang, L. Zhang, S. Liao, D. Shen, and Q. Wang, “Multimodal mri reconstruction assisted with spatial alignment network,” IEEE Trans. Med. Imaging, vol. 41, no. 9, pp. 2499–2509, 2022.
- [13] B. Li, W. Hu, C.-M. Feng, Y. Li, Z. Liu, and Y. Xu, “Multi-contrast complementary learning for accelerated mr imaging,” IEEE J. Biomed. Health. Inf., 2023.
- [14] S. Huang, J. Li, L. Mei, T. Zhang, Z. Chen, Y. Dong, L. Dong, S. Liu, and M. Lyu, “Accurate multi-contrast mri super-resolution via a dual cross-attention transformer network,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention. Springer, 2023, pp. 313–322.
- [15] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
- [16] A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 22 982–22 994, 2022.
- [17] Z. Qin, X. Han, W. Sun, B. He, D. Li, D. Li, Y. Dai, L. Kong, and Y. Zhong, “Toeplitz neural network for sequence modeling,” arXiv preprint arXiv:2305.04749, 2023.
- [18] Z. Xing, T. Ye, Y. Yang, G. Liu, and L. Zhu, “Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation,” arXiv preprint arXiv:2401.13560, 2024.
- [19] J. Ma, F. Li, and B. Wang, “U-mamba: Enhancing long-range dependency for biomedical image segmentation,” arXiv preprint arXiv:2401.04722, 2024.
- [20] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” Adv. Neural Inf. Process. Syst., vol. 33, pp. 7537–7547, 2020.
- [21] B. Zhou, N. Dey, J. Schlemper, S. S. M. Salehi, C. Liu, J. S. Duncan, and M. Sofka, “Dsformer: A dual-domain self-supervised transformer for accelerated multi-contrast mri reconstruction,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2023, pp. 4966–4975.
- [22] J. Wang, Y. Yang, H. Yang, C. Lian, Z. Xu, and J. Sun, “Md-graphformer: A model-driven graph transformer for fast multi-contrast mr imaging,” IEEE Trans. Comput. Imaging, 2023.
- [23] G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo, et al., “Dagan: deep de-aliasing generative adversarial networks for fast compressed sensing mri reconstruction,” IEEE Trans. Med. Imaging, vol. 37, no. 6, pp. 1310–1321, 2017.
- [24] M. U. Mirza, O. Dalmaz, H. A. Bedel, G. Elmas, Y. Korkmaz, A. Gungor, S. U. Dar, and T. Çukur, “Learning fourier-constrained diffusion bridges for mri reconstruction,” arXiv preprint arXiv:2308.01096, 2023.
- [25] B. Zhou and S. K. Zhou, “Dudornet: learning a dual-domain recurrent network for fast mri reconstruction with deep t1 prior,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4273–4282.
- [26] Y. Wang, Y. Pang, and C. Tong, “Dsmenet: Detail and structure mutually enhancing network for under-sampled mri reconstruction,” Comput. Biol. Med., vol. 154, p. 106204, 2023.
- [27] J. Lyu, B. Sui, C. Wang, Y. Tian, Q. Dou, and J. Qin, “Dudocaf: Dual-domain cross-attention fusion with recurrent transformer for fast multi-contrast mr imaging,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention. Springer, 2022, pp. 474–484.
- [28] X. Liu, Y. Pang, X. Sun, Y. Liu, Y. Hou, Z. Wang, and X. Li, “Image reconstruction for accelerated mr scan with faster fourier convolutional neural networks,” IEEE Trans. Image Process., 2024.
- [29] Q. Yi, F. Fang, G. Zhang, and T. Zeng, “Frequency learning via multi-scale fourier transformer for mri reconstruction,” IEEE J. Biomed. Health. Inf., 2023.
- [30] M. Ekanayake, K. Pawar, M. Harandi, G. Egan, and Z. Chen, “Mcstra: A multi-branch cascaded swin transformer for point spread function-guided robust mri reconstruction,” Comput. Biol. Med., vol. 168, p. 107775, 2024.
- [31] R. Hasani, M. Lechner, T.-H. Wang, M. Chahine, A. Amini, and D. Rus, “Liquid structural state-space models,” arXiv preprint arXiv:2209.12951, 2022.
- [32] A. Gu, I. Johnson, A. Timalsina, A. Rudra, and C. Ré, “How to train your hippo: State space models with generalized orthogonal basis projections,” arXiv preprint arXiv:2206.12037, 2022.
- [33] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
- [34] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Adv. Neural Inf. Process. Syst., vol. 33, pp. 1474–1487, 2020.
- [35] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.
- [36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [37] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, and Y. Liu, “Vmamba: Visual state space model,” arXiv preprint arXiv:2401.10166, 2024.
- [38] T. Guo, Y. Wang, and C. Meng, “Mambamorph: a mamba-based backbone with contrastive feature learning for deformable mr-ct registration,” arXiv preprint arXiv:2401.13934, 2024.
- [39] Y. Yue and Z. Li, “Medmamba: Vision mamba for medical image classification,” arXiv preprint arXiv:2403.03849, 2024.
- [40] G. Yang, K. Du, Z. Yang, Y. Du, Y. Zheng, and S. Wang, “Cmvim: Contrastive masked vim autoencoder for 3d multi-modal representation learning for ad classification,” arXiv preprint arXiv:2403.16520, 2024.
- [41] J. Huang, L. Yang, F. Wang, Y. Wu, Y. Nan, A. I. Aviles-Rivero, C.-B. Schönlieb, D. Zhang, and G. Yang, “Mambamir: An arbitrary-masked mamba for joint medical image reconstruction and uncertainty estimation,” arXiv preprint arXiv:2402.18451, 2024.
- [42] Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian, “A fourier-based framework for domain generalization,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14 383–14 392.
- [43] Y. Yang and S. Soatto, “Fda: Fourier domain adaptation for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4085–4095.
- [44] H. Yu, N. Zheng, M. Zhou, J. Huang, Z. Xiao, and F. Zhao, “Frequency and spatial dual guidance for image dehazing,” in Eur. Conf. Comput. Vis. Springer, 2022, pp. 181–198.
- [45] M. Zhou, J. Huang, K. Yan, H. Yu, X. Fu, A. Liu, X. Wei, and F. Zhao, “Spatial-frequency domain information integration for pan-sharpening,” in Eur. Conf. Comput. Vis. Springer, 2022, pp. 274–291.
- [46] X. He, K. Cao, K. Yan, R. Li, C. Xie, J. Zhang, and M. Zhou, “Pan-mamba: Effective pan-sharpening with state space model,” arXiv preprint arXiv:2402.12192, 2024.
- [47] N. Skarbnik, Y. Y. Zeevi, and C. Sagiv, The importance of phase in image processing. Technion-Israel Institute of Technology, Faculty of Electrical Engineering, 2009.
- [48] Q. Liu, C. Chen, J. Qin, Q. Dou, and P.-A. Heng, “Feddg: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1013–1023.
- [49] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2736–2744.
- [50] J. Ye, X. Lu, Z. Lin, and J. Z. Wang, “Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers,” arXiv preprint arXiv:1802.00124, 2018.
- [51] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al., “The multimodal brain tumor image segmentation benchmark (brats),” IEEE Trans. Med. Imaging, vol. 34, no. 10, pp. 1993–2024, 2014.
- [52] F. Knoll, J. Zbontar, A. Sriram, M. J. Muckley, M. Bruno, A. Defazio, M. Parente, K. J. Geras, J. Katsnelson, H. Chandarana, et al., “fastmri: A publicly available raw k-space and dicom dataset of knee images for accelerated mr image reconstruction using machine learning,” Radiology: Artificial Intelligence, vol. 2, no. 1, p. e190007, 2020.
- [53] K. Xuan, S. Sun, Z. Xue, Q. Wang, and S. Liao, “Learning mri k-space subsampling pattern using progressive weight pruning,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention. Springer, 2020, pp. 178–187.
- [54] C.-M. Feng, H. Fu, S. Yuan, and Y. Xu, “Multi-contrast mri super-resolution via a multi-stage integration network,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention. Springer, 2021, pp. 140–149.
- [55] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 1833–1844.
- [56] M. A. Bernstein, S. B. Fain, and S. J. Riederer, “Effect of windowing and zero-filled reconstruction of mri data on spatial resolution and acquisition strategy,” J. Magn. Reson. Imaging, vol. 14, no. 3, pp. 270–280, 2001.