MMR-Mamba: Multi-Contrast
MRI Reconstruction with Mamba and
Spatial-Frequency Information Fusion

**g Zou, Lanqing Liu, Qi Chen, Shujun Wang, Xiaohan Xing, and **g Qin

**g Zou, Lanqing Liu and **g Qin are with the Center for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong SAR, China (email: [email protected], [email protected], [email protected]). Qi Chen is with the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Anhui, China (email: [email protected]). Shujun Wang is with the Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China (email: [email protected]). Xiaohan Xing is with the Department of Radiation Oncology, Stanford University, Stanford, CA 94305 USA (email: [email protected]). Corresponding author: Xiaohan Xing ([email protected]).

Abstract

Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic computational complexity or fail to capture long-range correlated features comprehensively. In this work, we propose MMR-Mamba, a novel framework that achieves comprehensive integration of multi-contrast features through Mamba and spatial-frequency information fusion. Firstly, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain, which maximally restores the target modality information by selectively absorbing useful information from the auxiliary modality. Secondly, leveraging global properties of the Fourier domain, we introduce the Selective Frequency Fusion (SFF) module to efficiently integrate global information in the frequency domain and recover high-frequency signals for the reconstruction of structure details. Additionally, we present the Adaptive Spatial-Frequency Fusion (ASFF) module, which enhances fused features by supplementing less informative features from one domain with corresponding features from the other domain. These innovative strategies ensure efficient feature fusion across spatial and frequency domains, avoiding the introduction of redundant information and facilitating the reconstruction of high-quality target images. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed MMR-Mamba over state-of-the-art MRI reconstruction methods.

Index Terms:

MRI reconstruction, Multi-contrast, State space models, Fourier domain, Spatial-frequency information fusion.

I Introduction

Magnetic resonance imaging (MRI) is an essential clinical imaging technology owing to its non-invasive, radiation-free nature and its capability to provide high-resolution morphological information with varying contrast [1]. However, due to the intrinsic physics of MR imaging systems [2], acquiring fully sampled k-space data can take tens of minutes, resulting in patient discomfort and increased operational costs. Therefore, accelerating MRI acquisition, particularly through the reconstruction of high-quality MR images from undersampled k-space data, has become a popular research topic [3].

In clinical practice, multi-contrast MR images are simultaneously acquired to enable more accurate disease diagnosis and treatment planning. These modalities provide complementary information, revealing detailed anatomical structures of the patient [4]. For instance, T1 and T2 weighted images (T1WIs and T2WIs), along with proton density and fat-suppressed proton density-weighted images (PDWIs and FS-PDWIs), are two pairs of images with complementary structures. Typically, T1WIs are acquired more quickly due to shorter repetition time (TR) and echo time (TE) requirements, whereas T2WI acquisition requires a longer TR, resulting in slower imaging speeds [5]. Similarly, FS-PDWIs necessitate longer scan times than PDWIs. In this context, leveraging readily obtainable modalities (i.e., auxiliary modalities such as T1WIs or PDWIs) as supplementary guidance for the reconstruction of target modalities with slower imaging speeds (e.g., T2WIs or FS-PDWIs), known as multi-contrast MRI reconstruction, has been investigated and verified by previous works. For instance, compressed sensing (CS), Bayesian learning, dictionary learning, and graph representation theory have been utilized to accelerate multi-contrast MR imaging [6, 7, 8]. More recently, deep learning has significantly advanced MRI reconstruction due to its powerful feature representation capabilities [5, 9, 7, 10, 11].

In multi-contrast MR images, similar structural features are distributed across different regions within each modality, while complementary information is provided by different modalities. Consequently, the primary challenge of the multi-contrast MRI reconstruction task is to comprehensively explore long-range dependencies within each modality and effectively leverage complementary information from auxiliary modalities. CNN-based methods have been proposed to integrate multi-contrast information [5, 12], but they typically demonstrate local sensitivity and a lack of long-range dependency, thereby limiting their ability to capture global contextual information. In contrast, Transformer-based models [4, 13, 14], distinguished by their large receptive fields and global sensitivity, often surpass CNNs in capturing extensive contextual information. However, these models are burdened by substantial computational overhead due to the quadratic growth of resources with respect to sequence length. Therefore, developing an algorithm that comprehensively integrates complementary information from different modalities without significant computational overhead is crucial.

Recently, the improved structured state-space sequence model with a selective scanning mechanism, Mamba [15], has emerged as a compelling alternative to Transformer, due to its ability to model long-range sequence relationships with linear complexity. Mamba has shown superior performance compared to Transformers in tasks involving long-term dependency modeling, such as natural language processing [16, 17] and medical image segmentation [18, 19]. Investigating Mamba’s potential for long-range dependency modeling and complementary information fusion in multi-contrast MRI reconstruction is highly promising. On the other hand, each component in the frequency domain represents a combination of all the pixel values in the spatial domain, meaning that frequency features capture the overall patterns and structures, providing a global view of the entire image. Meanwhile, Fourier features are beneficial for recovering high-frequency signals that are crucial for addressing image degradation[20]. Therefore, comprehensive and efficient global feature integration across different modalities can be achieved by performing feature fusion in the frequency domain. The Mamba block and frequency domain thus offer two promising solutions for efficient fusion of the multi-contrast information.

Motivated by the above analysis, we propose a novel framework, MMR-Mamba, for multi-contrast MRI reconstruction. Built upon the Mamba architecture, our MMR-Mamba jointly explores complementary information fusion in the spatial and frequency domains, implemented by the Target modality-guided Cross Mamba (TCM) module and the Selective Frequency Fusion (SFF) module, respectively. Additionally, we introduce an Adaptive Spatial-Frequency Fusion (ASFF) module to enhance and integrate the fused features from both domains. Specifically, we first adapt Mamba blocks for feature extraction, leveraging their capability to effectively capture fine-grained relationships and long-range dependencies within MR images. Then we design TCM for spatial domain information fusion, where correlated features from the auxiliary modality are selectively supplemented to the target modality. In the SFF module of the frequency domain, we perform element-wise summation for the phase spectrum and selective integration for the amplitude spectrum, as the phase spectrum of both modalities primarily contains consistent structural information while the amplitude spectrum from different modalities holds incompatible style information. Finally, we adopt the ASFF module to enhance the fused features from both domains, where less informative channels from one domain are supplemented by incorporating the corresponding channel features from the other domain. The ASFF module enables the integration of relevant information and the suppression of redundant features. Our contributions can be summarised as follows:

• We propose MMR-Mamba, an efficient framework for multi-contrast MRI reconstruction. To the best of our knowledge, this is the first exploration of Mamba for integrating complementary information across multi-contrast MR images.
• We design the TCM module in the spatial domain for complementary feature fusion and the SFF module in the frequency domain for global structure information fusion.
• We introduce the ASFF module for spatial-frequency information fusion, enhancing task-relevant features while suppressing irrelevant features from the two domains.
• Extensive experiments on the BraTS and fastMRI datasets validate the effectiveness of our MMR-Mamba framework, demonstrating superior performance compared to existing methods.

II Related Works

II-A Spatial Domain MRI Reconstruction

Benefiting from the complementary information from multi-contrast images, many methods have been proposed to reconstruct high-quality images from low-quality images in the spatial domain, guided by the auxiliary modality images[21, 5, 4, 22]. Previously, Xiang et al.[5] proposed Dense-Unet to reconstruct the T2WIs from the T1WIs and under-sampled T2WIs, with the concatenation of under-sampled T2WIs and T1WIs serving as the input of the network. Xuan et al. [12] further enhanced the performance by introducing a spatial alignment network to compensate for spatial misalignment between multiple modality images.

Recently, given the Transformer’s ability to model long-range dependencies, several works have explored Transformer-based approaches for multi-contrast MRI reconstruction. MTrans [4] employed a cross-attention module for multi-scale feature fusion of the target modality and the auxiliary modality. MD-GraphFormer [22] incorporated the physical constraint of MRI into the network architecture, and modeled the multiple contrasts as graph nodes for joint reconstruction of multi-contrast MR images over the graph. MCCA [13] introduced the hybrid convolutional transformer operation to enrich global and local spatial context representation. DCAMSR [14] proposed a dual cross-attention mechanism, where the features of the reference image and the upsampled input image are extracted and promoted with both spatial and channel attention at multiple resolutions. Despite the promising performance, the CNN and Transformer-based approaches either fall short of fully capturing long-range dependencies or suffer from quadratic complexity. In contrast, our model achieves comprehensive exploration of long-range dependencies without imposing a heavy computational burden.

II-B Dual-Domain MRI Reconstruction

Given that aliasing artifacts in the images are structural and non-local, relying solely on image domain restoration is insufficient to eliminate them and achieve faithful reconstruction results. Consequently, several studies have explored the utilization of both k-space information and spatial domain information for MRI reconstruction [23, 24, 25, 26]. For example, Yang et al. [23] incorporated frequency domain information as additional constraints, formulated as an extra loss term. Similarly, the Fourier-constrained diffusion bridge (FDB) was introduced for diffusion model-based MRI reconstruction [24]. Subsequently, DuDoRNet [25] proposed a novel paradigm that simultaneously recovers k-space and images to accelerate MR imaging through a Dilated Residual Dense Network (DRDNet). Building on DuDoRNet, DuDoCAF [27] adopted a recurrent transformer structure to fuse features from two modalities for cross-modality reconstruction. Liu et al. [28] proposed Faster Fourier Convolution (FasterFC) for 3D MRI reconstruction, directly restoring frequency domain information. Furthermore, FMTNet [29] separately repaired frequency information through a high-frequency learning branch and a low-frequency learning branch, whose outputs are concatenated to produce the final result. McSTRA [30] also designed low-pass and high-pass reconstruction branches for frequency information restoration, and the whole reconstruction iterates between intermediate de-aliasing and data consistency via cascaded Swin-Transformers.

However, current dual-domain methods typically use similar networks for both k-space and image domains, which overlooks the specific characteristics inherent to each domain. Moreover, these methods often rely on straightforward parallel or sequential combinations of two domains, failing to fully exploit their unique features and complementary information.

II-C State Space Models

State Space Models (SSMs) offer a powerful framework for efficiently modeling long sequences and have garnered significant attention [31, 32]. Previously, the Structured State Space Sequence model (S4) [33] was designed to capture long-range dependencies within sequences by introducing the High-order Polynomial Projection Operator (HiPPO) [34]. S4 has demonstrated exceptional performance across various benchmarks while reducing computational complexity to $O(N\log N)$, significantly lower than the quadratic computational complexity $O(N^{2})$ of Transformers. More recently, S6, i.e., Mamba [15], was introduced. It employs a selection mechanism for choosing relevant information based on the input and features an efficient hardware-aware algorithm grounded in selective scanning.

Mamba has significantly advanced natural language tasks, surpassing traditional Transformers. Subsequently, Mamba has been extended to various vision tasks. For instance, Vision Mamba (Vim) [35] proposed a pure Mamba-based backbone utilizing bidirectional Mamba blocks, modeling images in a sequence manner through positional embedding, akin to ViT [36]. The Visual State Space Model (VMamba) [37] introduced a cross-scanning mechanism for spatial traversal, converting non-causal visual images into ordered patch sequences. In the medical imaging domain, Mamba has been explored for tasks such as medical image segmentation [18, 19], registration [38], and classification [39, 40]. For MRI reconstruction, Huang et al. [41] first proposed MambaMIR with an Arbitrary-Mask mechanism for medical image reconstruction. However, the complementary information from multi-contrast images is neglected. In this work, we explore Mamba for the integration of features from different modalities, harnessing its capability in modeling long-range dependencies.

III Methodology

Figure 1: Overview of the proposed MMR-Mamba framework (left). It contains Mamba blocks for feature extraction, TCM for spatial domain fusion, SFF for frequency domain fusion, and ASFF for spatial-frequency information integration. Structure of the Mamba block and TCM (right).

III-A Preliminaries

State Space Models. SSMs are typically defined as linear, time-invariant systems that map an input sequence $x(t)\in\mathbb{R}^{L}$ to an output sequence $y(t)\in\mathbb{R}^{L}$ through a hidden state $h(t)\in\mathbb{R}^{N}$ . These systems can be mathematically expressed as the following ordinary differential equation (ODE):

$$\begin{aligned}h^{\prime}(t)&=Ah(t)+Bx(t),\\ y(t)&=Ch(t)+Dx(t),\end{aligned}\tag{1}$$

where $A\in\mathbb{R}^{N\times N}$ denotes the state matrix, $B\in\mathbb{R}^{N\times 1}$ and $C\in\mathbb{R}^{1\times N}$ represent the projection parameters, and $D\in\mathbb{R}^{1}$ is a skip connection.

To incorporate SSMs into deep learning algorithms, discretization is indispensable. The system in Eq. 1 is discretized through the zero-order hold (ZOH) rule. After discretization, the system can be written as

$$\begin{aligned}h_{k}&=\overline{A}h_{k-1}+\overline{B}x_{k},\\ y_{k}&=\overline{C}h_{k}+\overline{D}x_{k},\end{aligned}\tag{2}$$

where $\overline{A}=\exp(\Delta A)$, $\overline{B}=(\Delta A)^{-1}(\exp(\Delta A)-I)\cdot\Delta B$, $\overline{C}=C$, and $\overline{D}=D$ are the discretized parameters, and $\Delta$ is the discretization step size, which can be seen as the resolution of the continuous input $x(t)$.

Furthermore, Eq. 2 can also be reformulated and computed as the following convolution:

$$\begin{aligned}\overline{K}&=(C\overline{B},C\overline{A}\,\overline{B},\dots,C\overline{A}^{L-1}\overline{B}),\\ y&=x*\overline{K},\end{aligned}\tag{3}$$

where $L$ denotes the length of the input sequence $x$ and $\overline{K}\in\mathbb{R}^{L}$ is the SSM convolution kernel.
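The equivalence between the recurrent form (Eq. 2) and the convolutional form (Eq. 3) can be checked numerically. The following is a minimal NumPy sketch, not the selective-scan implementation used by Mamba. It assumes a diagonal state matrix $A$ so that the matrix exponential reduces to an element-wise exponential, and it omits the skip term $D$, which does not appear in Eq. 3. The function names `zoh_discretize`, `ssm_recurrent`, and `ssm_convolutional` are illustrative only.

```python
import numpy as np

def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal A (assumption made for this sketch)."""
    a_bar = np.exp(delta * a_diag)                           # exp(ΔA), element-wise
    b_bar = (a_bar - 1.0) / (delta * a_diag) * (delta * b)   # (ΔA)^{-1}(exp(ΔA) - I) ΔB
    return a_bar, b_bar

def ssm_recurrent(a_bar, b_bar, c, x):
    """h_k = Ā h_{k-1} + B̄ x_k,  y_k = C h_k  (skip term D omitted)."""
    h = np.zeros_like(a_bar)
    y = np.empty(len(x))
    for k, x_k in enumerate(x):
        h = a_bar * h + b_bar * x_k
        y[k] = c @ h
    return y

def ssm_convolutional(a_bar, b_bar, c, x):
    """y = x * K̄ with K̄ = (C B̄, C Ā B̄, ..., C Ā^{L-1} B̄)."""
    length = len(x)
    kernel = np.array([c @ (a_bar ** i * b_bar) for i in range(length)])
    return np.convolve(x, kernel)[:length]   # keep the causal part only

rng = np.random.default_rng(0)
a_diag = -np.arange(1.0, 5.0)      # N = 4, negative entries give a stable system
b = rng.standard_normal(4)
c = rng.standard_normal(4)
x = rng.standard_normal(16)        # L = 16

a_bar, b_bar = zoh_discretize(a_diag, b, delta=0.1)
y_rec = ssm_recurrent(a_bar, b_bar, c, x)
y_conv = ssm_convolutional(a_bar, b_bar, c, x)
```

Both routes yield the same output sequence. The recurrent form costs time linear in $L$, while the convolutional form allows parallel training when the parameters do not depend on the input.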

Fourier Transform. We here briefly review the meaning of the Fourier transformation of images for a better understanding of our work. The Fourier transform (FT) serves as a crucial technique in analyzing the frequency characteristics of an image. Transforming images from the spatial domain to the frequency domain through FT allows us to examine the images from a global perspective. Given an image $x$ , the FT can be expressed as follows:

$$\mathcal{F}(x)(u,v)=\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}x(h,w)e^{-j2\pi(\frac{h}{H}u+\frac{w}{W}v)},\tag{4}$$

where $u$ and $v$ are coordinates in the Fourier space. The frequency domain feature $\mathcal{F}(x)$ is represented as $\mathcal{F}(x)=\mathcal{R}(x)+j\mathcal{I}(x)$, where $\mathcal{R}(x)$ and $\mathcal{I}(x)$ denote the real and imaginary parts, respectively. The amplitude spectrum $\mathcal{A}(x)(u,v)$ and phase spectrum $\mathcal{P}(x)(u,v)$ are then defined as:

$$\begin{aligned}\mathcal{A}(x)(u,v)&={\left[\mathcal{R}^{2}(x)(u,v)+\mathcal{I}^{2}(x)(u,v)\right]}^{\frac{1}{2}},\\ \mathcal{P}(x)(u,v)&=\arctan\left[\frac{\mathcal{I}(x)(u,v)}{\mathcal{R}(x)(u,v)}\right].\end{aligned}\tag{5}$$

According to Fourier theory, the amplitude spectrum $\mathcal{A}$ encodes the style information of an image, whereas the phase spectrum $\mathcal{P}$ characterizes the structure information [42, 43]. Furthermore, as highlighted in [44, 45], the phase spectrum maintains global structure well, and the degradation of images is mainly manifested in the amplitude spectrum.
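To make the decomposition in Eq. 4 and Eq. 5 concrete, the sketch below splits an image into amplitude and phase with NumPy's FFT and then recombines them. It is an illustration only: `np.angle` uses the four-quadrant arctangent rather than the plain $\arctan$ of Eq. 5, and the helper names `amplitude_phase` and `from_amplitude_phase` are hypothetical.

```python
import numpy as np

def amplitude_phase(x):
    """Return the amplitude and phase spectra of a 2D array."""
    f = np.fft.fft2(x)
    return np.abs(f), np.angle(f)   # sqrt(R^2 + I^2) and four-quadrant arctan(I / R)

def from_amplitude_phase(amp, pha):
    """Recombine amplitude and phase, then return to the spatial domain."""
    f = amp * np.exp(1j * pha)
    return np.fft.ifft2(f).real

rng = np.random.default_rng(0)
img = rng.random((8, 8))            # stand-in for an MR image or feature map
amp, pha = amplitude_phase(img)
rec = from_amplitude_phase(amp, pha)
```

The zero-frequency amplitude `amp[0, 0]` equals the sum of all pixel values here, a small instance of the fact that every frequency component aggregates the whole spatial extent.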

III-B Overall Architecture

Let ${I}_{tar}$ and ${I}_{aux}\in\mathbb{R}^{H\times W}$ be a pair of MR images from the target modality and auxiliary modality. Our objective is to reconstruct the high-quality target modality image $I_{tar}$ through a network that takes the under-sampled target modality image $\widetilde{I}_{tar}$ and fully-sampled auxiliary modality image ${I}_{aux}$ as inputs. Here, $\widetilde{I}_{tar}=\mathcal{F}^{-1}(\mathcal{M}\odot k_{tar})$ , where $\mathcal{F}^{-1}$ represents the inverse Fourier transform (IFT), $\mathcal{M}$ denotes the binary under-sample mask, and $k_{tar}$ represents the fully-sampled k-space data of the target modality. The multi-contrast reconstruction task can be formulated as the following optimization problem:

$$\hat{I}_{tar}=\operatorname*{argmin}_{f(\theta)}\|I_{tar}-f(\mathcal{F}^{-1}(\mathcal{M}\odot k_{tar}),I_{aux};\theta)\|_{1}.\tag{6}$$
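The network input $\widetilde{I}_{tar}=\mathcal{F}^{-1}(\mathcal{M}\odot k_{tar})$ can be simulated as follows. This is a rough sketch under stated assumptions: the mask keeps a fully sampled low-frequency band plus randomly chosen columns, and the `center_fraction` value, the seed, and the helper names `cartesian_mask` and `undersample` are illustrative choices that are not specified in the text.

```python
import numpy as np

def cartesian_mask(height, width, acceleration, center_fraction=0.08, seed=0):
    """Binary mask that keeps whole k-space columns (1D Cartesian sampling).

    center_fraction is a hypothetical setting, not a value from the paper.
    """
    rng = np.random.default_rng(seed)
    num_total = width // acceleration                # columns kept overall
    num_center = int(round(width * center_fraction)) # always-kept central columns
    cols = np.zeros(width, dtype=bool)
    start = (width - num_center) // 2
    cols[start:start + num_center] = True
    remaining = np.flatnonzero(~cols)
    extra = max(num_total - num_center, 0)
    cols[rng.choice(remaining, size=extra, replace=False)] = True
    return np.broadcast_to(cols, (height, width)).copy()

def undersample(image, mask):
    """Zero-filled reconstruction F^{-1}(M ⊙ k) with centered k-space."""
    k = np.fft.fftshift(np.fft.fft2(image))
    return np.fft.ifft2(np.fft.ifftshift(mask * k))

rng = np.random.default_rng(1)
image = rng.random((240, 240))                       # stand-in for a target slice
mask = cartesian_mask(240, 240, acceleration=4)
zero_filled = np.abs(undersample(image, mask))
```

With $4\times$ acceleration on a $240\times 240$ slice, 60 of the 240 columns are retained, and an all-ones mask returns the original image.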

To fully harness and fuse the information from different modalities, we design the MMR-Mamba network, depicted in Fig. 1. Initially, we adopt CNN blocks $\psi$ for low-level feature extraction and Mamba blocks $\phi$ for high-level feature extraction, producing modality-specific features $F_{tar}$ and $F_{aux}$ for the target modality and auxiliary modality, respectively.

$$F_{tar}=\phi_{tar}(\psi_{tar}(\widetilde{I}_{tar})),\quad F_{aux}=\phi_{aux}(\psi_{aux}({I}_{aux})).\tag{7}$$

Subsequently, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain and Selective Frequency Fusion (SFF) module in the frequency domain, generating domain-specific fused features $F_{spa}$ and $F_{fre}$ :

$$F_{spa}=TCM(F_{tar},F_{aux}),\quad F_{fre}=SFF(F_{tar},F_{aux}).\tag{8}$$

Lastly, the multi-contrast features in these two domains are further integrated through an Adaptive Spatial-Frequency Fusion (ASFF) module to generate the final fused feature, which is then fed into a CNN decoder to reconstruct the target modality image:

$$\hat{I}_{tar}=Decoder(ASFF(F_{spa},F_{fre})).\tag{9}$$

The model is trained by minimizing the $L_{1}$ loss between the reconstructed image and ground-truth image:

$$L=\frac{1}{N}\sum_{n=1}^{N}\|\hat{I}_{tar}-I_{tar}\|_{1},\tag{10}$$

where $N$ is the number of training samples.

III-C Target-guided Cross Mamba

For the extracted target modality feature $F_{tar}$ and auxiliary modality feature $F_{aux}$, complementary information is distributed across different regions, making efficient integration crucial for reconstructing high-quality images from under-sampled data. Existing fusion methods based on CNNs and Transformers face challenges due to their restricted capability in handling long-range dependencies or their high computational requirements. To address this, we leverage state space models for target and auxiliary feature fusion, benefiting from their ability to model long-range dependencies. Inspired by the cross-modal Mamba in [46], we design the Target-guided Cross Mamba (TCM) module to enrich the target modality features with complementary information from auxiliary features.

Specifically, we derive $z_{tar}$ and $z_{aux}$ from the target modality feature $F_{tar}$ and auxiliary modality feature $F_{aux}$ using a normalization layer followed by a linear layer:

$$\begin{aligned}z_{tar}&=Linear(Norm(F_{tar})),\\ z_{aux}&=Linear(Norm(F_{aux})).\end{aligned}\tag{11}$$

Then $z_{tar}$ and $z_{aux}$ are projected into the hidden state space through one-dimensional convolution with the SiLU activation function and SSM without gating as

$$\begin{aligned}H_{tar}&=SSM(SiLU(Conv1d(z_{tar}))),\\ H_{aux}&=SSM(SiLU(Conv1d(z_{aux}))).\end{aligned}\tag{12}$$

After that, to maximally restore the target modality information and selectively integrate the auxiliary information, we adopt $z_{tar}$ as the gating to modulate hidden state features $H_{tar}$ and $H_{aux}$ , and implement the hidden state feature fusion as

$$F_{spa}=H_{tar}\otimes SiLU(z_{tar})+H_{aux}\otimes SiLU(z_{tar}),\tag{13}$$

where $\otimes$ represents element-wise multiplication. In this way, we build the TCM module in a hidden state space based on the gating mechanism. The gating parameters are derived from the target modality, ensuring that the target modality predominates in the fusion process and selectively incorporates complementary information from the auxiliary modality.
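The gated fusion of Eq. 13 can be sketched in a few lines. The snippet below is only an illustration of the gating step: the hidden-state features $H_{tar}$ and $H_{aux}$ are replaced by random arrays rather than produced by an SSM, and the names `silu` and `tcm_fuse` as well as the (tokens, channels) layout are assumptions.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def tcm_fuse(h_tar, h_aux, z_tar):
    """Eq. 13: both hidden states are modulated by the same target-derived gate."""
    gate = silu(z_tar)
    return h_tar * gate + h_aux * gate

rng = np.random.default_rng(0)
tokens, channels = 6, 4                          # hypothetical sequence layout
h_tar = rng.standard_normal((tokens, channels))  # stand-in for SSM output of the target branch
h_aux = rng.standard_normal((tokens, channels))  # stand-in for SSM output of the auxiliary branch
z_tar = rng.standard_normal((tokens, channels))
f_spa = tcm_fuse(h_tar, h_aux, z_tar)
```

Because the gate is shared, Eq. 13 is algebraically the same as $(H_{tar}+H_{aux})\otimes SiLU(z_{tar})$: wherever the target-derived gate is close to zero, both the target and the auxiliary contributions are suppressed together.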

In our framework, we stack four TCM modules, incorporating residual connections to preserve the integrity of target features. The fused feature $F_{spa}$ is subsequently forwarded to the spatial-frequency fusion module to facilitate the reconstruction of target images.

III-D Selective Frequency Fusion

To further fuse the complementary information from the extracted features in a broader view, we resort to the frequency domain (i.e., Fourier domain), where each frequency component corresponds to all the pixels in the spatial domain, inherently capturing global properties. Additionally, as highlighted in [44, 45], the phase spectrum in the Fourier domain preserves global structure well, while the image degradation mainly manifests in the amplitude spectrum. To comprehensively integrate the complementary information and to restore the degraded features in the amplitude spectrum, we propose the Selective Frequency Fusion (SFF) module within the frequency domain, illustrated in Fig. 2.

Given the extracted features $F_{tar}$ and $F_{aux}$ from previous modules, we first transform them to frequency domain through Fourier transform and obtain their amplitude spectrum and phase spectrum:

$$\begin{aligned}\mathcal{A}_{tar},\mathcal{P}_{tar}&=\mathcal{F}(F_{tar}),\\ \mathcal{A}_{aux},\mathcal{P}_{aux}&=\mathcal{F}(F_{aux}).\end{aligned}\tag{14}$$

For the phase spectrum fusion, we perform element-wise addition on $\mathcal{P}_{aux}$ and $\mathcal{P}_{tar}$ , as both contain crucial and consistent structure information [47, 48]. Regarding the amplitude spectrum, it encapsulates style information, which varies significantly across different modalities. Moreover, the amplitude of the under-sampled low-quality images contains interference information that can negatively impact the final reconstruction. Direct concatenation of the amplitude spectrum from these modalities would introduce incompatible and interference information, compromising the quality of the reconstructed image [47, 48].

To mitigate this issue, we adopt a selective strategy to fuse the amplitude spectrum. Specifically, we design a Selective Amplitude Harmonization Module (SAHM), wherein the amplitude spectrum across modalities is dynamically adjusted based on global statistics. First, we conduct element-wise addition of $\mathcal{A}_{aux}$ and $\mathcal{A}_{tar}$ to produce an intermediate feature $\mathcal{A}$. Next, we adopt global average pooling to generate channel-wise statistics $s\in\mathbb{R}^{C}$, embedding global information. Specifically, the $c$-th element of $s$ is calculated by averaging the $c$-th channel of $\mathcal{A}$ over the spatial dimensions $H\times W$. For computational efficiency, a fully connected (FC) layer is utilized to create a compact feature $z\in\mathbb{R}^{L\times 1}$, which is further used to guide the adaptive selection. The operation can be expressed as follows:

$$z=f_{fc}(s)=\sigma(\mathcal{B}(\mathbf{W}s)),\tag{15}$$

where $\sigma$ is the ReLU activation function, $\mathcal{B}$ represents batch normalization, and $\mathbf{W}\in\mathbb{R}^{L\times C}$, where $L$ is the number of channels after dimension reduction.

To dynamically select different scales of amplitude spectrum from the two modalities, the compact feature descriptor $z\in\mathbb{R}^{L\times 1}$ is used to compute a soft attention score across channels. To achieve this, softmax is performed on the channel-wise digits:

$$a_{c}=\frac{e^{A_{c}z}}{e^{A_{c}z}+e^{B_{c}z}},\quad b_{c}=\frac{e^{B_{c}z}}{e^{A_{c}z}+e^{B_{c}z}},\tag{16}$$

where $A,B\in\mathbb{R}^{C\times L}$, $A_{c}$ and $B_{c}$ denote the $c$-th rows of $A$ and $B$, and $a$ and $b$ represent the soft attention vectors for $\mathcal{A}_{aux}$ and $\mathcal{A}_{tar}$, respectively. Finally, the fused amplitude spectrum $\overline{\mathcal{A}}$ is obtained as the weighted sum of $\mathcal{A}_{aux}$ and $\mathcal{A}_{tar}$:

$$\overline{\mathcal{A}}_{c}=a_{c}\cdot\mathcal{A}_{aux,c}+b_{c}\cdot\mathcal{A}_{tar,c}.\tag{17}$$
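The selective amplitude fusion of Eq. 15 to Eq. 17 can be sketched as below. This is a simplified illustration, not the SAHM implementation: batch normalization is omitted, all weights are random rather than learned, and the function name `sahm` and the tensor sizes are hypothetical.

```python
import numpy as np

def sahm(amp_aux, amp_tar, w, a_mat, b_mat):
    """amp_*: (C, H, W) amplitude spectra; w: (L, C); a_mat, b_mat: (C, L)."""
    s = (amp_aux + amp_tar).mean(axis=(1, 2))       # global average pooling, shape (C,)
    z = np.maximum(w @ s, 0.0)                      # FC + ReLU; batch norm omitted here
    logit_a, logit_b = a_mat @ z, b_mat @ z         # A_c z and B_c z for every channel c
    shift = np.maximum(logit_a, logit_b)            # subtract the max for numerical stability
    e_a, e_b = np.exp(logit_a - shift), np.exp(logit_b - shift)
    a, b = e_a / (e_a + e_b), e_b / (e_a + e_b)     # two-way softmax per channel (Eq. 16)
    fused = a[:, None, None] * amp_aux + b[:, None, None] * amp_tar   # Eq. 17
    return fused, a, b

rng = np.random.default_rng(0)
C, L, H, W = 8, 4, 5, 5                             # hypothetical sizes
amp_aux = rng.random((C, H, W))
amp_tar = rng.random((C, H, W))
w = rng.standard_normal((L, C))
a_mat = rng.standard_normal((C, L))
b_mat = rng.standard_normal((C, L))
fused, a, b = sahm(amp_aux, amp_tar, w, a_mat, b_mat)
```

Since $a_{c}+b_{c}=1$ for every channel, the fused amplitude is a per-channel convex combination of the two modalities, so it never exceeds the larger of the two inputs at any frequency.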

To further enhance the fused amplitude spectrum and phase spectrum, we employ two groups of independent operations $ConvA(\cdot)$ and $ConvP(\cdot)$. Each group consists of convolution layers with a kernel size of $1\times 1$ and a ReLU activation function. The final restored amplitude spectrum $\mathcal{A}^{\prime}$ and phase spectrum $\mathcal{P}^{\prime}$ are obtained by

$$\begin{aligned}\mathcal{A}^{\prime}&=ConvA(\overline{\mathcal{A}}),\\ \mathcal{P}^{\prime}&=ConvP(\mathcal{P}_{aux}+\mathcal{P}_{tar}).\end{aligned}\tag{18}$$

Finally, the restored amplitude spectrum $\mathcal{A}^{\prime}$ and phase spectrum $\mathcal{P}^{\prime}$ are converted to the spatial domain through the inverse Fourier transform:

$$F_{fre}=\mathcal{F}^{-1}(\mathcal{A}^{\prime},\mathcal{P}^{\prime}),\tag{19}$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. Through the FFT and selective fusion in the Fourier domain, the SFF module efficiently achieves global feature fusion and recovers high-frequency signals for structure details.

III-E Adaptive Spatial-Frequency Fusion

To further enhance the fused spatial domain feature $F_{spa}$ and fused frequency domain feature $F_{fre}$, and to facilitate the efficient integration of these features, we introduce the Adaptive Spatial-Frequency Fusion (ASFF) module, as illustrated in Fig. 3. The ASFF module allows the two domains to mutually complement each other through channel-wise adaptive integration according to the informativeness of each channel, which is measured by the Batch-Normalization (BN) layer. Previous studies [49, 50] have shown that the BN scaling factor indicates channel importance, with smaller factors signifying lesser relevance to the final outcomes. Consequently, we propose that channels with diminished significance in one domain can be reinforced by incorporating corresponding channel information from the other domain.

Firstly, we measure the channel-wise informativeness of the fused spatial domain feature $F_{spa}$ and frequency feature $F_{fre}$ through the BN layer. Denoting the $c$-th channel of the features as $F_{spa,c}$ and $F_{fre,c}$, the BN is computed as follows:

$$\begin{aligned}F_{spa,c}^{\prime}&=\omega_{spa,c}\frac{F_{spa,c}-\mu_{spa,c}}{\sqrt{\sigma_{spa,c}^{2}+\epsilon}}+\beta_{spa,c},\\ F_{fre,c}^{\prime}&=\omega_{fre,c}\frac{F_{fre,c}-\mu_{fre,c}}{\sqrt{\sigma_{fre,c}^{2}+\epsilon}}+\beta_{fre,c},\end{aligned}\tag{20}$$

where $\omega_{spa,c}$ and $\omega_{fre,c}$ are trainable scaling factors, $\beta_{spa,c}$ and $\beta_{fre,c}$ are trainable offsets, and $\epsilon$ is a small constant to avoid division by zero.

The factors $\omega_{spa,c}$ and $\omega_{fre,c}$ evaluate the correlation between the input and the output after normalization during training. If $\omega_{spa,c}$ or $\omega_{fre,c}$ approaches 0, the gradient of the loss with respect to $F_{spa,c}$ or $F_{fre,c}$ will also be close to 0. This implies that $F_{spa,c}^{\prime}$ or $F_{fre,c}^{\prime}$ will have minimal influence on the final results. Given this insight, we propose to enhance the channels with small scaling factors in one domain by incorporating the corresponding channels from the other domain. The incorporation is performed as follows:

$$F_{spa,c}^{\prime}=\begin{cases}F_{spa,c},&\text{if }\omega_{spa,c}\geq\tau_{spa};\\ F_{spa,c}\otimes F_{fre,c},&\text{if }\omega_{spa,c}<\tau_{spa};\end{cases}\tag{21}$$

$$F_{fre,c}^{\prime}=\begin{cases}F_{fre,c},&\text{if }\omega_{fre,c}\geq\tau_{fre};\\ F_{fre,c}\otimes F_{spa,c},&\text{if }\omega_{fre,c}<\tau_{fre};\end{cases}\tag{22}$$

where $\otimes$ denotes element-wise multiplication, and $\tau_{spa}$ and $\tau_{fre}$ are thresholds set according to the maximum and minimum of the scaling factors as follows:

$$\begin{aligned}\tau_{spa}&=\omega_{spa}^{min}+\alpha(\omega_{spa}^{max}-\omega_{spa}^{min}),\\ \tau_{fre}&=\omega_{fre}^{min}+\alpha(\omega_{fre}^{max}-\omega_{fre}^{min}),\end{aligned}\tag{23}$$

where $\alpha$ is a hyperparameter, which we set to 0.1 in our experiments.
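The thresholding and channel-wise incorporation of Eq. 21 to Eq. 23 can be sketched as follows. The BN scaling factors are hand-picked toy values rather than trained parameters, and the names `threshold` and `asff_exchange` are illustrative assumptions.

```python
import numpy as np

def threshold(omega, alpha=0.1):
    """Eq. 23: tau = omega_min + alpha * (omega_max - omega_min)."""
    return omega.min() + alpha * (omega.max() - omega.min())

def asff_exchange(f_spa, f_fre, omega_spa, omega_fre, alpha=0.1):
    """f_*: (C, H, W) features; omega_*: (C,) BN scaling factors."""
    weak_spa = omega_spa < threshold(omega_spa, alpha)   # less informative spatial channels
    weak_fre = omega_fre < threshold(omega_fre, alpha)   # less informative frequency channels
    product = f_spa * f_fre                              # element-wise multiplication
    out_spa = np.where(weak_spa[:, None, None], product, f_spa)   # Eq. 21
    out_fre = np.where(weak_fre[:, None, None], product, f_fre)   # Eq. 22
    return out_spa, out_fre, weak_spa, weak_fre

rng = np.random.default_rng(0)
f_spa = rng.standard_normal((4, 3, 3))
f_fre = rng.standard_normal((4, 3, 3))
omega_spa = np.array([0.90, 0.02, 0.50, 1.00])   # toy scaling factors
omega_fre = np.array([0.05, 0.80, 0.70, 0.60])   # toy scaling factors
out_spa, out_fre, weak_spa, weak_fre = asff_exchange(f_spa, f_fre, omega_spa, omega_fre)
```

In this toy case $\tau_{spa}=0.118$ and $\tau_{fre}=0.125$, so only spatial channel 1 and frequency channel 0 are replaced by the cross-domain product, while all other channels pass through unchanged.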

Through the ASFF module, less informative features from one domain are enhanced by the corresponding features from the other, while redundant information is suppressed. By leveraging the most relevant information from the spatial and frequency domains, the ASFF module produces feature representations that are conducive to generating high-quality target images. After the channel-wise incorporation, the features $F_{spa}^{\prime}$ and $F_{fre}^{\prime}$ are concatenated and fed to the CNN decoder to recover the reconstructed target modality image $\hat{I}_{tar}$:

$$\hat{I}_{tar}=Decoder(Cat\left[F_{spa}^{\prime},F_{fre}^{\prime}\right]).\tag{24}$$

IV Experiments and Results

IV-A Dataset Description

In this study, two datasets with different anatomical structures and protocols are adopted for evaluation: the BraTS dataset [51] and the fastMRI knee dataset [52]. The BraTS dataset contains both T1WI and T2WI scans of the brain. We extract 2D images uniformly from 100 3D MRI volumes in the BraTS dataset. The dataset is split subject-wise with a 3:1 ratio, yielding a total of 3,621 images for training and 1,088 images for testing. The 2D image size is $240\times 240$. In our experiments, we adopt T1WI as the auxiliary modality for the reconstruction of the T2WI modality. The fastMRI dataset is the largest public MRI dataset with raw k-space data. Following [53], 227 and 45 pairs of single-coil PDWI and FS-PDWI knee volumes are selected for training and testing, respectively, resulting in a total of 8,332 pairs of 2D images for training and 1,665 pairs for testing. The 2D image size is $320\times 320$. In our experiments, we adopt PDWI as the auxiliary modality for the reconstruction of the FS-PDWI modality.
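A subject-wise split keeps all slices of a volume in the same partition, so no subject appears in both training and testing. The sketch below shows one way to do this with a 3:1 ratio. The subject identifiers, the shuffling, and the seed are hypothetical, since the text specifies only the ratio and that the split is subject-wise.

```python
import random

def split_subjects(subject_ids, train_ratio=0.75, seed=0):
    """Split subject identifiers into disjoint train/test lists."""
    ids = sorted(subject_ids)            # fixed order before shuffling
    random.Random(seed).shuffle(ids)     # reproducible shuffle (assumed)
    n_train = int(round(len(ids) * train_ratio))
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers for 100 volumes; a 3:1 ratio gives 75 / 25 subjects.
subjects = [f"subject_{i:03d}" for i in range(100)]
train_ids, test_ids = split_subjects(subjects)
```

The 2D slices extracted from each volume are then assigned to whichever partition their subject belongs to.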

IV-B Experimental Setup

Comparison Methods. To demonstrate the effectiveness of our proposed MMR-Mamba, we compare it against seven multi-contrast reconstruction methods: MDUNet [5], MINet [54], MCCA [13], MTrans [4], DCAMSR [14], SwinIR [55], and Pan-Mamba [46]. For a fair comparison, we train all models for $100,000$ iterations with a batch size of 4. Other settings of the compared methods follow the implementations provided in their original papers.

Performance Metrics. For quantitative evaluation, we assess the image reconstruction results using peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and normalized mean squared error (NMSE). For qualitative evaluation, we visualize the reconstruction results with error maps.
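For reference, PSNR and NMSE can be computed as in the following NumPy sketch. The exact conventions used in the experiments are not stated here, so this is an assumption: it follows a common convention in which the PSNR data range defaults to the maximum of the ground-truth image and NMSE is normalized by the energy of the ground truth.

```python
import numpy as np

def nmse(gt, pred):
    """Normalized mean squared error: ||gt - pred||^2 / ||gt||^2."""
    return float(np.sum((gt - pred) ** 2) / np.sum(gt ** 2))

def psnr(gt, pred, data_range=None):
    """Peak signal-to-noise ratio in dB.

    data_range defaults to the maximum of the ground truth (assumed convention).
    """
    if data_range is None:
        data_range = float(gt.max())
    mse = float(np.mean((gt - pred) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

# Toy usage: a constant image and a reconstruction that is 10% too dark.
gt = np.ones((4, 4))
pred = 0.9 * gt
psnr_value = psnr(gt, pred)
nmse_value = nmse(gt, pred)
```

SSIM involves local windowed statistics and is usually taken from an existing library, for example `structural_similarity` in scikit-image; which implementation was used here is not specified.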

Implementation Details. The proposed framework is implemented in PyTorch, and training and testing are executed on one NVIDIA RTX A6000 GPU (48GB). We employ the AdamW optimizer with an initial learning rate of $0.001$ and a weight decay of $1\times 10^{-4}$. We train for $100,000$ iterations with a batch size of 4 on both datasets. The undersampled images are obtained by applying a 1D Cartesian random undersampling mask with $4\times$ and $8\times$ acceleration.
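The undersampling step can be sketched as follows: a 1D Cartesian random mask selects a subset of k-space columns, and the zero-filled image is the inverse Fourier transform of the masked k-space. This is an illustrative sketch only. The fully sampled central band (`center_fraction`), the choice of masking columns rather than rows, and the seed are assumptions that are not given in the text.

```python
import numpy as np

def cartesian_random_mask(width, acceleration, center_fraction=0.08, seed=0):
    """Boolean mask over k-space columns keeping about width / acceleration lines."""
    rng = np.random.default_rng(seed)
    num_keep = int(round(width / acceleration))
    # Always keep a low-frequency band around the k-space center (assumed).
    num_center = min(int(round(width * center_fraction)), num_keep)
    mask = np.zeros(width, dtype=bool)
    start = (width - num_center) // 2
    mask[start:start + num_center] = True
    # Fill the remaining budget with randomly chosen columns.
    remaining = np.flatnonzero(~mask)
    extra = rng.choice(remaining, size=num_keep - num_center, replace=False)
    mask[extra] = True
    return mask

def zero_filled(image, mask):
    """Zero-filled reconstruction of a 2D image under a column mask."""
    kspace = np.fft.fftshift(np.fft.fft2(image))   # center the DC component
    kspace = kspace * mask[None, :]                # drop unsampled columns
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))

# Toy usage on a random 240x240 "image" at 4x acceleration.
image = np.random.default_rng(1).random((240, 240))
mask_4x = cartesian_random_mask(240, acceleration=4)
recon_4x = zero_filled(image, mask_4x)
```

With a mask that keeps every column, `zero_filled` returns the input magnitude unchanged, which is a quick sanity check for the FFT shifts.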

TABLE I: Quantitative results on the BraTS and fastMRI datasets with different acceleration factors. We report mean±std for the PSNR, SSIM, and NMSE metrics. The best results are highlighted in red.

BraTS		$4\times$			$8\times$
Method	Year	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$
Zero-filling [56]	JMRI’01	30.11±1.53	0.767±0.038	5.659±0.018	26.58±1.49	0.673±0.036	12.469±0.033
MDUNet [5]	TBME’18	37.94±1.66	0.975±0.006	0.905±0.003	35.19±1.64	0.960±0.009	1.690±0.005
MINet [54]	MICCAI’21	38.26±1.74	0.976±0.006	0.847±0.003	35.23±1.72	0.961±0.009	1.697±0.005
SwinIR [55]	CVPR’21	37.87±1.73	0.974±0.006	0.926±0.003	34.95±1.72	0.960±0.009	1.805±0.005
MTrans [4]	TMI’22	36.02±1.67	0.962±0.007	1.429±0.005	34.81±1.57	0.957±0.009	1.868±0.005
MCCA [13]	JBHI’23	38.03±1.68	0.975±0.006	0.903±0.003	35.37±1.66	0.962±0.009	1.633±0.005
DCAMSR [14]	MICCAI’23	38.60±1.75	0.978±0.006	1.193±0.004	35.99±1.74	0.965±0.009	1.417±0.005
Pan-Mamba [46]	Arxiv’24	38.84±1.79	0.978±0.006	0.739±0.273	36.18±1.77	0.966±0.009	1.363±0.466
MMR-Mamba (Ours)	—	40.98±1.88	0.985±0.005	0.454±0.190	37.75±1.85	0.974±0.008	0.955±0.358
fastMRI		$4\times$			$8\times$
Method	Year	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$
Zero-filling [56]	JMRI’01	27.68±1.75	0.571±0.061	5.020±0.013	25.64±1.64	0.454±0.070	7.719±0.02
MDUNet [5]	TBME’18	28.60±1.00	0.600±0.050	4.000±0.010	27.90±0.86	0.544±0.050	4.600±0.02
MINet [54]	MICCAI’21	29.47±1.88	0.639±0.069	3.334±0.009	28.17±1.74	0.563±0.081	4.327±0.01
SwinIR [55]	CVPR’21	29.42±1.87	0.636±0.069	3.382±0.009	28.09±1.74	0.560±0.081	4.407±0.01
MTrans [4]	TMI’22	29.00±1.79	0.619±0.068	3.699±0.009	27.31±1.68	0.526±0.081	5.254±0.01
MCCA [13]	JBHI’23	29.46±1.87	0.637±0.069	3.346±0.009	28.23±1.75	0.562±0.081	4.275±0.01
DCAMSR [14]	MICCAI’23	29.45±1.87	0.637±0.068	3.349±0.009	28.42±1.79	0.569±0.081	4.098±0.01
Pan-Mamba [46]	Arxiv’24	29.59±1.87	0.645±0.068	3.248±0.928	28.36±1.76	0.570±0.081	4.159±1.034
MMR-Mamba (Ours)	—	29.66±1.88	0.647±0.068	3.201±0.927	28.44±1.76	0.572±0.081	4.084±1.036

IV-C Experimental Results

In this section, we evaluate the proposed method against state-of-the-art techniques under various experimental settings.

Quantitative Results. In Table I, we report the PSNR, SSIM, and NMSE results on both datasets for 4 $\times$ and 8 $\times$ acceleration. Focusing first on the BraTS dataset in the upper part of the table, our method achieves the best reconstruction results, with a PSNR of 40.98 dB and SSIM of 0.985 under 4 $\times$ acceleration, and a PSNR of 37.75 dB and SSIM of 0.974 under 8 $\times$ acceleration. These results highlight our model’s efficacy in fusing information from multi-contrast images for reconstructing target images. Additionally, our method outperforms the second-best method, Pan-Mamba, by 2.14 dB in PSNR under 4 $\times$ acceleration and by 1.57 dB in PSNR under 8 $\times$ acceleration. The fact that Pan-Mamba achieves the second-best results further highlights the effectiveness of Mamba in MRI reconstruction. Similarly, as shown in the lower part of the table, our method achieves the best reconstruction results on the fastMRI knee dataset, with a PSNR of 29.66 dB and SSIM of 0.647 under 4 $\times$ acceleration and a PSNR of 28.44 dB and SSIM of 0.572 under 8 $\times$ acceleration. This further demonstrates the effectiveness and robustness of our model in reconstructing high-quality MR images.

TABLE II: Ablation Study of the proposed modules on BraTS dataset. We report mean±std for the PSNR, SSIM, and NMSE metrics.

Model	TCM	SFF	ASFF	4 $\times$			8 $\times$
Model	TCM	SFF	ASFF	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$
a	✗	✗	✗	38.45±1.75	0.977±0.006	0.809±0.294	35.72±1.72	0.964±0.009	1.512±0.499
b	✓	✗	✗	39.05±1.79	0.979±0.006	0.706±0.266	36.17±1.80	0.967±0.009	1.368±0.476
c	✗	✓	✗	40.49±1.84	0.984±0.005	0.509±0.205	37.22±1.79	0.971±0.008	1.078±0.392
d	✓	✓	✗	40.66±1.86	0.985±0.005	0.489±0.201	37.50±1.84	0.973±0.008	1.012±0.374
e	✓	✓	✓	40.98±1.88	0.985±0.005	0.454±0.190	37.75±1.85	0.974±0.008	0.955±0.358

Qualitative Results. To better evaluate the reconstruction quality, we visualize the outputs from different methods for the BraTS and fastMRI datasets under 4 $\times$ and 8 $\times$ acceleration in Fig. 4, along with their corresponding error maps. These maps illustrate the discrepancies between the reconstructed and the ground truth images, with blue indicating minimal error and red indicating higher error. The error maps reveal that zero-filled reconstructions exhibit pronounced artifacts, with errors growing as the acceleration factor increases from 4 $\times$ to 8 $\times$. Notably, our method consistently shows the lowest reconstruction error across both datasets and acceleration factors. This observation indicates that our method better preserves essential anatomical details and consistently produces high-quality reconstructed images.

IV-D Ablation Studies

In this section, we first conduct an ablation study to assess the efficacy of the proposed modules. Then we analyze different fusion strategies for both spatial and frequency domains.

IV-D1 Effectiveness of the Proposed Modules

To verify the validity of the proposed modules, we conduct ablation experiments on the BraTS dataset under 4 $\times$ and 8 $\times$ acceleration. The tested configurations are denoted as follows: (a) a baseline model with all proposed modules removed; (b) the baseline model enhanced with the TCM; (c) the baseline model enhanced with the SFF; (d) the model augmented with both the TCM and SFF; and (e) the complete model incorporating all proposed modules. The results of these experiments are reported in Table II. As shown, incorporating the TCM improves the PSNR from 38.45 dB to 39.05 dB under 4 $\times$ acceleration, while incorporating the SFF improves the PSNR to 40.49 dB. Combining the TCM and SFF modules further boosts the PSNR to 40.66 dB. Finally, including the ASFF raises the PSNR to 40.98 dB. Similar trends are observed under 8 $\times$ acceleration. Visual results on the BraTS dataset, including error maps, are shown in Fig. 5. As shown in the figure, the TCM module alone reduces the overall error, yet some structural details remain unrecovered, as indicated in the yellow box. Conversely, the SFF module effectively restores structural details but exhibits larger errors in the lateral ventricle areas, highlighted in the red box. When both the TCM and SFF modules are employed, most regions and structures are adequately restored. Ultimately, integrating all proposed modules yields the best results. The quantitative and visual results from our ablation experiments verify the effectiveness of each proposed module in enhancing the performance of multi-contrast MRI reconstruction.

IV-D2 Analysis on Spatial Domain Fusion

To evaluate the effectiveness of our proposed spatial domain fusion module, we design three ablation experiments: (1) fusion through element-wise addition, (2) fusion through cross-attention, and (3) fusion through our proposed TCM module, denoted as ‘Sum’, ‘CA’, and ‘TCM’ in Table III. The results indicate that element-wise addition of features from the two modalities leads to the worst results. In contrast, our TCM module outperforms both ‘Sum’ and ‘CA’ under 4 $\times$ and 8 $\times$ acceleration while using the fewest parameters (0.86 M, versus 0.90 M for ‘Sum’ and 110.42 M for ‘CA’).

TABLE III: Ablation Study on spatial domain fusion on the BraTS dataset under $4\times$ and $8\times$ acceleration.

Acceleration	Method	PSNR $\uparrow$	SSIM $\uparrow$	NMSE $(10^{-2})$ $\downarrow$	Param(M)
$4\times$	Sum	38.56±1.74	0.977±0.006	0.789±0.285	0.90
$4\times$	CA	38.76±1.76	0.978±0.006	0.754±0.278	110.42
$4\times$	TCM	39.05±1.79	0.979±0.006	0.706±0.266	0.86
$8\times$	Sum	35.82±1.74	0.965±0.009	1.481±0.492	0.90
$8\times$	CA	35.92±1.74	0.965±0.009	1.447±0.492	110.42
$8\times$	TCM	36.17±1.80	0.967±0.009	1.368±0.476	0.86

IV-D3 Analysis on Frequency Domain Fusion

To verify the effectiveness of the proposed selective frequency fusion module, we design the following experiments: (1) fusion through element-wise addition and (2) fusion through our proposed SFF module, denoted as ‘Sum’ and ‘SFF’ in Table IV, respectively. The results presented in the table show that element-wise addition of features from the two modalities results in suboptimal performance. In contrast, our proposed SFF module achieves a 0.43 dB improvement in PSNR under 4 $\times$ acceleration and a 0.21 dB improvement under 8 $\times$ acceleration.

V Conclusion

This study explores the comprehensive and efficient integration of complementary information across modalities for multi-contrast MRI reconstruction. We present the MMR-Mamba framework, which integrates information through the TCM in the spatial domain and the SFF in the frequency domain, and fuses the resulting spatial and frequency features through the ASFF module. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed method in reconstructing MR images under different acceleration factors. This work contributes insights for advancing research in multi-contrast MRI reconstruction, particularly in optimizing information fusion across spatial and frequency domains.

TABLE IV: Ablation Study on Frequency Domain Fusion. We report mean±std for the PSNR, SSIM, and NMSE metrics on the BraTS dataset under $4\times$ and $8\times$