MMR-Mamba: Multi-Contrast
MRI Reconstruction with Mamba and
Spatial-Frequency Information Fusion

**g Zou,  Lanqing Liu,  Qi Chen,  Shujun Wang, 
Xiaohan Xing, and **g Qin
**g Zou, Lanqing Liu and **g Qin are with the Center for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong SAR, China (email: [email protected], [email protected], [email protected]). Qi Chen is with the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Anhui, China (email: [email protected]). Shujun Wang is with the Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China (email: [email protected]). Xiaohan Xing is with the Department of Radiation Oncology, Stanford University, Stanford, CA 94305 USA (email: [email protected]). Corresponding author: Xiaohan Xing ([email protected]).
Abstract

Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic computational complexity or fail to capture long-range correlated features comprehensively. In this work, we propose MMR-Mamba, a novel framework that achieves comprehensive integration of multi-contrast features through Mamba and spatial-frequency information fusion. Firstly, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain, which maximally restores the target modality information by selectively absorbing useful information from the auxiliary modality. Secondly, leveraging global properties of the Fourier domain, we introduce the Selective Frequency Fusion (SFF) module to efficiently integrate global information in the frequency domain and recover high-frequency signals for the reconstruction of structure details. Additionally, we present the Adaptive Spatial-Frequency Fusion (ASFF) module, which enhances fused features by supplementing less informative features from one domain with corresponding features from the other domain. These innovative strategies ensure efficient feature fusion across spatial and frequency domains, avoiding the introduction of redundant information and facilitating the reconstruction of high-quality target images. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed MMR-Mamba over state-of-the-art MRI reconstruction methods.

Index Terms:
MRI reconstruction, Multi-contrast, State space models, Fourier domain, Spatial-frequency information fusion.

I Introduction

Magnetic resonance imaging (MRI) is an essential clinical imaging technology owing to its non-invasive, radiation-free characteristics and its capability to provide high-resolution morphological information with varying contrast [1]. However, due to the intrinsic physics of MR imaging systems [2], acquiring fully sampled k-space data can take tens of minutes, resulting in patient discomfort and increased operational costs. Therefore, accelerating MRI acquisition, particularly through the reconstruction of high-quality MR images from undersampled k-space data, has become a popular research topic [3].

In clinical practice, multi-contrast MR images are simultaneously acquired to enable more accurate disease diagnosis and treatment planning. These modalities provide complementary information, revealing detailed anatomical structures of the patient [4]. For instance, T1- and T2-weighted images (T1WIs and T2WIs), along with proton density and fat-suppressed proton density-weighted images (PDWIs and FS-PDWIs), are two pairs of images with complementary structures. Typically, T1WIs are acquired more quickly due to shorter repetition time (TR) and echo time (TE) requirements, whereas T2WI acquisition requires a longer TR, resulting in slower imaging speeds [5]. Similarly, FS-PDWIs necessitate longer scan times than PDWIs. In this context, leveraging readily obtainable modalities (i.e., auxiliary modalities such as T1WIs or PDWIs) as supplementary guidance for the reconstruction of target modalities (e.g., T2WIs or FS-PDWIs) with slower imaging speeds, known as multi-contrast MRI reconstruction, has been investigated and verified by previous works. For instance, compressed sensing (CS), Bayesian learning, dictionary learning, and graph representation theory have been utilized to accelerate multi-contrast MR imaging [6, 7, 8]. More recently, deep learning has significantly advanced MRI reconstruction due to its powerful feature representation capabilities [5, 9, 7, 10, 11].

In multi-contrast MR images, similar structural features are distributed across different regions within each modality, while complementary information is provided by different modalities. Consequently, the primary challenge of the multi-contrast MRI reconstruction task is to comprehensively explore long-range dependencies within each modality and effectively leverage complementary information from auxiliary modalities. CNN-based methods have been proposed to integrate multi-contrast information [5, 12], but they typically demonstrate local sensitivity and a lack of long-range dependency, thereby limiting their ability to capture global contextual information. In contrast, Transformer-based models [4, 13, 14], distinguished by their large receptive fields and global sensitivity, often surpass CNNs in capturing extensive contextual information. However, these models are burdened by substantial computational overhead due to the quadratic growth of resources with respect to sequence length. Therefore, developing an algorithm that comprehensively integrates complementary information from different modalities without significant computational overhead is crucial.

Recently, the improved structured state-space sequence model with a selective scanning mechanism, Mamba [15], has emerged as a compelling alternative to the Transformer, due to its ability to model long-range sequence relationships with linear complexity. Mamba has shown superior performance compared to Transformers in tasks involving long-term dependency modeling, such as natural language processing [16, 17] and medical image segmentation [18, 19]. Investigating Mamba’s potential for long-range dependency modeling and complementary information fusion in multi-contrast MRI reconstruction is highly promising. On the other hand, each component in the frequency domain represents a combination of all the pixel values in the spatial domain, meaning that frequency features capture the overall patterns and structures, providing a global view of the entire image. Meanwhile, Fourier features are beneficial for recovering high-frequency signals that are crucial for addressing image degradation [20]. Therefore, comprehensive and efficient global feature integration across different modalities can be achieved by performing feature fusion in the frequency domain. The Mamba block and the frequency domain thus offer two promising solutions for efficient fusion of multi-contrast information.

Motivated by the above analysis, we propose a novel framework, MMR-Mamba, for multi-contrast MRI reconstruction. Built upon the Mamba architecture, our MMR-Mamba jointly explores the complementary information fusion in the spatial and frequency domains, implemented by the Target modality-guided Cross Mamba (TCM) module and Selective Frequency Fusion (SFF) module, respectively. Additionally, we introduce an Adaptive Spatial-Frequency Fusion (ASFF) module to enhance and integrate the fused features from both domains. Specifically, we first adapt Mamba blocks for feature extraction, leveraging their capability to effectively capture fine-grained relationships and long-range dependencies within MR images. Then we design TCM for spatial domain information fusion, where correlated features from the auxiliary modality are selectively supplemented to the target modality. In the SFF module of the frequency domain, we perform element-wise summation for the phase spectrum and selective integration for the amplitude spectrum, as the phase spectrum of both modalities primarily contains consistent structural information while the amplitude spectrum from different modalities holds incompatible style information. Finally, we adopt the ASFF module to enhance the fused features from both domains, where less informative channels from one domain are supplemented by incorporating the corresponding channel features from the other domain. The ASFF module enables the integration of relevant information and the suppression of redundant features. Our contributions can be summarised as follows:

  • We propose MMR-Mamba, an efficient framework for multi-contrast MRI reconstruction. To the best of our knowledge, this is the first exploration of Mamba for integrating complementary information across multi-contrast MR images.

  • We design the TCM module in the spatial domain for complementary feature fusion and the SFF module in the frequency domain for global structure information fusion.

  • We introduce the ASFF module for spatial-frequency information fusion, enhancing task-relevant features while suppressing irrelevant features from the two domains.

  • Extensive experiments on the BraTS and fastMRI datasets validate the effectiveness of our MMR-Mamba framework, demonstrating superior performance compared to existing methods.

II Related Works

II-A Spatial Domain MRI Reconstruction

Benefiting from the complementary information in multi-contrast images, many methods have been proposed to reconstruct high-quality images from low-quality images in the spatial domain, guided by the auxiliary modality images [21, 5, 4, 22]. Previously, Xiang et al. [5] proposed Dense-Unet to reconstruct the T2WIs from the T1WIs and under-sampled T2WIs, with the concatenation of under-sampled T2WIs and T1WIs serving as the input of the network. Xuan et al. [12] further enhanced the performance by introducing a spatial alignment network to compensate for spatial misalignment between multiple modality images.

Recently, given the Transformer’s ability to model long-range dependencies, several works have explored Transformer-based approaches for multi-contrast MRI reconstruction. MTrans [4] employed a cross-attention module for multi-scale feature fusion of the target modality and the auxiliary modality. MD-GraphFormer [22] incorporated the physical constraint of MRI into the network architecture, and modeled the multiple contrasts as graph nodes for joint reconstruction of multi-contrast MR images over the graph. MCCA [13] introduced the hybrid convolutional transformer operation to enrich global and local spatial context representation. DCAMSR [14] proposed a dual cross-attention mechanism, where the features of the reference image and the upsampled input image are extracted and promoted with both spatial and channel attention at multiple resolutions. Despite their promising performance, the CNN and Transformer-based approaches either fall short of fully capturing long-range dependencies or suffer from quadratic complexity. In contrast, our model achieves comprehensive exploration of long-range dependencies without imposing a heavy computational burden.

II-B Dual-Domain MRI Reconstruction

Given that aliasing artifacts in the images are structural and non-local, relying solely on image domain restoration is insufficient to eliminate them and achieve faithful reconstruction results. Consequently, several studies have explored the utilization of both k-space information and spatial domain information for MRI reconstruction [23, 24, 25, 26]. For example, Yang et al. [23] incorporated frequency domain information as additional constraints, formulated as an extra loss term. Similarly, the Fourier-constrained diffusion bridge (FDB) was introduced for diffusion model-based MRI reconstruction [24]. Subsequently, DuDoRNet [25] proposed a novel paradigm that simultaneously recovers k-space and images to accelerate MR imaging through a Dilated Residual Dense Network (DRDNet). Building on DuDoRNet, DuDoCAF [27] adopted a recurrent transformer structure to fuse features from two modalities for cross-modality reconstruction. Liu et al. [28] proposed Faster Fourier Convolution (FasterFC) for 3D MRI reconstruction, directly restoring frequency domain information. Furthermore, FMTNet [29] separately repaired frequency information through a high-frequency learning branch and a low-frequency learning branch, whose outputs are concatenated to produce the final result. McSTRA [30] also designed low-pass and high-pass reconstruction branches for frequency information restoration, and the whole reconstruction iterates between intermediate de-aliasing and data consistency via cascaded Swin-Transformers.

However, current dual-domain methods typically use similar networks for both k-space and image domains, which overlooks the specific characteristics inherent to each domain. Moreover, these methods often rely on straightforward parallel or sequential combinations of two domains, failing to fully exploit their unique features and complementary information.

II-C State Space Models

State Space Models (SSMs) offer a powerful framework for efficiently modeling long sequences and have garnered significant attention [31, 32]. Previously, the Structured State Space Sequence model (S4) [33] was designed to capture long-range dependencies within sequences by introducing the High-Order Polynomial Projection Operator (HiPPO) [34]. S4 has demonstrated exceptional performance across various benchmarks while reducing computational complexity to $O(N\log(N))$, significantly superior to the quadratic computational complexity $O(N^{2})$ of Transformers. More recently, S6, i.e., Mamba [15], was introduced. It employs a selection mechanism for choosing relevant information based on the input and features an efficient hardware-aware algorithm grounded in selective scanning.

Mamba has significantly advanced natural language tasks, surpassing traditional Transformers. Subsequently, Mamba has been extended to various vision tasks. For instance, Vision Mamba (Vim) [35] proposed a pure Mamba-based backbone utilizing bidirectional Mamba blocks, modeling images in a sequence manner through positional embedding, akin to ViT [36]. Visual State Space Model (VMamba) [37] introduced a cross-scanning mechanism for spatial traversal, converting non-causal visual images into ordered patch sequences. In the medical imaging domain, Mamba has been explored for tasks such as medical image segmentation [18, 19], registration [38], and classification [39, 40]. For MRI reconstruction, Huang et al. [41] first proposed MambaMIR with an Arbitrary-Mask mechanism for medical image reconstruction. However, the complementary information from multi-contrast images is neglected. In this work, we explore Mamba for the integration of features from different modalities, harnessing its capability in modeling long-range dependencies.

III Methodology

Figure 1: Overview of the proposed MMR-Mamba framework (left). It contains Mamba blocks for feature extraction, TCM for spatial domain fusion, SFF for frequency domain fusion, and ASFF for spatial-frequency information integration. Structure of Mamba block and TCM (right).

III-A Preliminaries

State Space Models. SSMs are typically defined as linear, time-invariant systems that map an input sequence $x(t)\in\mathbb{R}^{L}$ to an output sequence $y(t)\in\mathbb{R}^{L}$ through a hidden state $h(t)\in\mathbb{R}^{N}$. These systems can be mathematically expressed as the following ordinary differential equation (ODE):

$$
\begin{aligned}
h'(t) &= Ah(t) + Bx(t), \\
y(t) &= Ch(t) + Dx(t),
\end{aligned}
\tag{1}
$$

where $A\in\mathbb{R}^{N\times N}$ denotes the state matrix, $B\in\mathbb{R}^{N\times 1}$ and $C\in\mathbb{R}^{1\times N}$ represent the projection parameters, and $D\in\mathbb{R}^{1}$ is a skip connection.

To incorporate SSMs into deep learning algorithms, discretization is indispensable. The system in Eq. 1 is discretized through the zero-order hold (ZOH). After discretization, the system can be written as

$$
\begin{aligned}
h_{k} &= \overline{A}h_{k-1} + \overline{B}x_{k}, \\
y_{k} &= \overline{C}h_{k} + \overline{D}x_{k},
\end{aligned}
\tag{2}
$$

where $\overline{A}=\exp(\Delta A)$, $\overline{B}=(\Delta A)^{-1}(\exp(\Delta A)-I)\cdot\Delta B$, $\overline{C}=C$, and $\overline{D}=D$ are the discretized parameters, and $\Delta$ is the discretization step size, which can be seen as the resolution of the continuous input $x(t)$.
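As a minimal sketch of Eq. 2, the ZOH discretization and the resulting recurrence can be written in NumPy as below. The sketch assumes a diagonal state matrix $A$ with nonzero entries, stored as a 1-D vector, so that the matrix exponential and inverse reduce to element-wise operations; this restriction, the function names, and the zero initial state are illustrative assumptions rather than part of the formulation above.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """Zero-order hold for a diagonal A (1-D vector of nonzero entries).

    A_bar = exp(delta * A)
    B_bar = (delta * A)^{-1} (exp(delta * A) - I) * delta * B
    """
    dA = delta * A
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, D, x):
    """Run h_k = A_bar h_{k-1} + B_bar x_k and y_k = C h_k + D x_k, with h_0 = 0."""
    x = np.asarray(x, dtype=float)
    h = np.zeros_like(A_bar)          # assumed zero initial hidden state
    y = np.empty_like(x)
    for k, x_k in enumerate(x):
        h = A_bar * h + B_bar * x_k   # element-wise product because A is diagonal
        y[k] = C @ h + D * x_k
    return y
```

With negative entries in `A`, the discretized `A_bar` lies in (0, 1), so the hidden state decays instead of growing as the sequence gets longer.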

Furthermore, Eq. 2 can also be reformulated and computed as the following convolution:

$$
\begin{aligned}
\overline{K} &= (C\overline{B},\, C\overline{A}\,\overline{B},\, \dots,\, C\overline{A}^{L-1}\overline{B}), \\
y &= x * \overline{K},
\end{aligned}
\tag{3}
$$

where $L$ denotes the length of the input sequence $x$ and $\overline{K}\in\mathbb{R}^{L}$ is the SSM convolution kernel.
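The equivalence between the recurrent and convolutional forms can be checked numerically. The sketch below assumes a diagonal $\overline{A}$ (stored as a vector) and drops the skip term $D$, which does not appear in Eq. 3; all function names are illustrative.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """K_bar[k] = C A_bar^k B_bar for k = 0, ..., L-1 (diagonal A_bar)."""
    powers = A_bar[None, :] ** np.arange(L)[:, None]   # shape (L, N)
    return (powers * B_bar) @ C                         # shape (L,)

def ssm_convolution(K_bar, x):
    """Causal convolution y = x * K_bar, truncated to the input length."""
    return np.convolve(x, K_bar)[: len(x)]

def ssm_recurrence(A_bar, B_bar, C, x):
    """Reference recurrence h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros_like(A_bar)
    y = np.empty(len(x))
    for k in range(len(x)):
        h = A_bar * h + B_bar * x[k]
        y[k] = C @ h
    return y
```

The two forms give the same output up to floating-point error. The recurrence costs one state update per step, while the convolution can be evaluated in parallel over the whole sequence.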

Fourier Transform. We briefly review the Fourier transform of images for a better understanding of our work. The Fourier transform (FT) serves as a crucial technique in analyzing the frequency characteristics of an image. Transforming images from the spatial domain to the frequency domain through FT allows us to examine the images from a global perspective. Given an image $x$, the FT can be expressed as follows:

$$
\mathcal{F}(x)(u,v)=\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}x(h,w)\,e^{-j2\pi\left(\frac{h}{H}u+\frac{w}{W}v\right)},
\tag{4}
$$

where $u$ and $v$ are coordinates in the Fourier space. The frequency domain feature $\mathcal{F}(x)$ is represented as $\mathcal{F}(x)=\mathcal{R}(x)+j\mathcal{I}(x)$, where $\mathcal{R}(x)$ and $\mathcal{I}(x)$ denote the real and imaginary parts, respectively. Then the amplitude spectrum $\mathcal{A}(x)(u,v)$ and phase spectrum $\mathcal{P}(x)(u,v)$ are defined as:

$$
\begin{aligned}
\mathcal{A}(x)(u,v) &= \left[\mathcal{R}^{2}(x)(u,v)+\mathcal{I}^{2}(x)(u,v)\right]^{\frac{1}{2}}, \\
\mathcal{P}(x)(u,v) &= \arctan\left[\frac{\mathcal{I}(x)(u,v)}{\mathcal{R}(x)(u,v)}\right].
\end{aligned}
\tag{5}
$$

According to Fourier theory, the amplitude spectrum $\mathcal{A}$ encodes the style information of an image, whereas the phase spectrum $\mathcal{P}$ characterizes the structure information [42, 43]. Furthermore, as highlighted in [44, 45], the phase spectrum maintains global structure well and the degradation of images is mainly manifested in the amplitude spectrum.
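For illustration, the decomposition in Eq. 5 can be computed with NumPy's FFT routines, whose forward transform follows the same unnormalized convention as Eq. 4. The sketch uses `np.angle`, which evaluates a quadrant-aware arctangent of the imaginary over the real part rather than the plain arctangent; the function names are hypothetical.

```python
import numpy as np

def amplitude_phase(x):
    """Return the amplitude and phase spectra of a 2-D image (Eqs. 4 and 5)."""
    F = np.fft.fft2(x)
    amplitude = np.abs(F)      # sqrt(R^2 + I^2)
    phase = np.angle(F)        # arctan2(I, R), in (-pi, pi]
    return amplitude, phase

def from_amplitude_phase(amplitude, phase):
    """Recombine the two spectra and return the real part of the inverse FT."""
    return np.fft.ifft2(amplitude * np.exp(1j * phase)).real
```

Because the two spectra together determine $\mathcal{F}(x)$, recombining them recovers the input image. This is what allows the amplitude and phase to be processed separately and then merged back into a spatial feature.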

III-B Overall Architecture

Let $I_{tar}$ and $I_{aux}\in\mathbb{R}^{H\times W}$ be a pair of MR images from the target modality and auxiliary modality. Our objective is to reconstruct the high-quality target modality image $I_{tar}$ through a network that takes the under-sampled target modality image $\widetilde{I}_{tar}$ and fully-sampled auxiliary modality image $I_{aux}$ as inputs. Here, $\widetilde{I}_{tar}=\mathcal{F}^{-1}(\mathcal{M}\odot k_{tar})$, where $\mathcal{F}^{-1}$ represents the inverse Fourier transform (IFT), $\mathcal{M}$ denotes the binary under-sample mask, and $k_{tar}$ represents the fully-sampled k-space data of the target modality. The multi-contrast reconstruction task can be formulated as the following optimization problem:

$$
\hat{I}_{tar}=\operatorname*{argmin}_{f(\theta)}\left\|I_{tar}-f\left(\mathcal{F}^{-1}(\mathcal{M}\odot k_{tar}),I_{aux};\theta\right)\right\|_{1}.
\tag{6}
$$
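The following hedged sketch shows one way the under-sampled input $\widetilde{I}_{tar}=\mathcal{F}^{-1}(\mathcal{M}\odot k_{tar})$ could be simulated from a fully-sampled image. The mask design (random Cartesian columns plus a fully sampled central band) and its parameters `acceleration` and `center_fraction` are assumptions for illustration only; the text above specifies only that $\mathcal{M}$ is binary.

```python
import numpy as np

def cartesian_mask(shape, acceleration=4, center_fraction=0.08, seed=0):
    """Hypothetical 1-D Cartesian mask: central band plus random columns."""
    H, W = shape
    rng = np.random.default_rng(seed)
    num_center = max(1, int(round(W * center_fraction)))
    remaining = W - num_center
    target = W / acceleration - num_center       # columns still to be drawn
    keep_prob = min(1.0, max(0.0, target / remaining)) if remaining > 0 else 0.0
    cols = rng.random(W) < keep_prob
    start = (W - num_center) // 2
    cols[start:start + num_center] = True        # always keep low frequencies
    return np.tile(cols, (H, 1))

def undersample(image, mask):
    """Zero-filled reconstruction F^{-1}(M * k) with centred k-space."""
    k = np.fft.fftshift(np.fft.fft2(image))      # low frequencies at the centre
    k_under = mask * k                           # element-wise masking
    return np.fft.ifft2(np.fft.ifftshift(k_under))
```

The result is complex-valued in general. Taking its magnitude with `np.abs` is a common way to obtain a real-valued network input, although that step is also an assumption here.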

To fully harness and fuse the information from different modalities, we design the MMR-Mamba network, depicted in Fig. 1. Initially, we adopt CNN blocks $\psi$ for low-level feature extraction and Mamba blocks $\phi$ for high-level feature extraction, producing modality-specific features $F_{tar}$ and $F_{aux}$ for the target modality and auxiliary modality, respectively:

$$
F_{tar}=\phi_{tar}(\psi_{tar}(\widetilde{I}_{tar})),\quad F_{aux}=\phi_{aux}(\psi_{aux}(I_{aux})).
\tag{7}
$$

Subsequently, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain and Selective Frequency Fusion (SFF) module in the frequency domain, generating domain-specific fused features $F_{spa}$ and $F_{fre}$:

$$
F_{spa}=TCM(F_{tar},F_{aux}),\quad F_{fre}=SFF(F_{tar},F_{aux}).
\tag{8}
$$

Lastly, the multi-contrast features in these two domains are further integrated through an Adaptive Spatial-Frequency Fusion (ASFF) module to generate the final fused feature, which is then fed into a CNN Decoder to reconstruct the target modality image:

$$
\hat{I}_{tar}=Decoder(ASFF(F_{spa},F_{fre})).
\tag{9}
$$

The model is trained by minimizing the $L_{1}$ loss between the reconstructed image and ground-truth image:

$$
L=\frac{1}{N}\sum_{n=1}^{N}\|\hat{I}_{tar}-I_{tar}\|_{1},
\tag{10}
$$

where $N$ is the number of training samples.
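Read literally, Eq. 10 averages the per-image $L_{1}$ norm of the residual over the $N$ samples. A minimal NumPy sketch under that reading is given below; the function name and the batch-first array layout are assumptions.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean over N samples of the L1 norm of (pred - target).

    Both inputs are assumed to have shape (N, H, W).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_sample = np.abs(pred - target).reshape(pred.shape[0], -1).sum(axis=1)
    return per_sample.mean()
```

Deep learning frameworks often average over pixels as well; for example, `torch.nn.L1Loss` does so with its default `reduction='mean'`. That variant differs from this one only by the constant factor $HW$.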

III-C Target-guided Cross Mamba

For the extracted target modality feature $F_{tar}$ and auxiliary modality feature $F_{aux}$, complementary information is distributed across different regions, making efficient integration crucial for reconstructing high-quality images from under-sampled data. Existing fusion methods based on CNNs and Transformers face challenges due to their restricted capability in handling long-range dependencies or their high computational requirements. To address this, we leverage state space models for target and auxiliary feature fusion, benefiting from their ability to model long-range dependencies. Inspired by the cross model Mamba in [46], we design the Target-guided Cross Mamba (TCM) module to enrich the target modality features with complementary information from auxiliary features.

Specifically, we derive ztarsubscript𝑧𝑡𝑎𝑟z_{tar}italic_z start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT and zauxsubscript𝑧𝑎𝑢𝑥z_{aux}italic_z start_POSTSUBSCRIPT italic_a italic_u italic_x end_POSTSUBSCRIPT from the target modality feature Ftarsubscript𝐹𝑡𝑎𝑟F_{tar}italic_F start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT and auxiliary modality feature Fauxsubscript𝐹𝑎𝑢𝑥F_{aux}italic_F start_POSTSUBSCRIPT italic_a italic_u italic_x end_POSTSUBSCRIPT using normalization layer and linear layer:

ztar=Linear(Norm(Ftar)),subscript𝑧𝑡𝑎𝑟𝐿𝑖𝑛𝑒𝑎𝑟𝑁𝑜𝑟𝑚subscript𝐹𝑡𝑎𝑟\displaystyle z_{tar}=Linear(Norm(F_{tar})),italic_z start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT = italic_L italic_i italic_n italic_e italic_a italic_r ( italic_N italic_o italic_r italic_m ( italic_F start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT ) ) , (11)
zaux=Linear(Norm(Faux)).subscript𝑧𝑎𝑢𝑥𝐿𝑖𝑛𝑒𝑎𝑟𝑁𝑜𝑟𝑚subscript𝐹𝑎𝑢𝑥\displaystyle z_{aux}=Linear(Norm(F_{aux})).italic_z start_POSTSUBSCRIPT italic_a italic_u italic_x end_POSTSUBSCRIPT = italic_L italic_i italic_n italic_e italic_a italic_r ( italic_N italic_o italic_r italic_m ( italic_F start_POSTSUBSCRIPT italic_a italic_u italic_x end_POSTSUBSCRIPT ) ) .

Then $z_{tar}$ and $z_{aux}$ are projected into the hidden state space through a one-dimensional convolution with the SiLU activation function, followed by an SSM without gating:

$$
\begin{aligned}
H_{tar} &= \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1d}(z_{tar}))),\\
H_{aux} &= \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1d}(z_{aux}))).
\end{aligned}
\tag{12}
$$

After that, to maximally restore the target modality information and selectively integrate the auxiliary information, we adopt $z_{tar}$ as the gate to modulate the hidden state features $H_{tar}$ and $H_{aux}$, and implement the hidden state feature fusion as

$$
F_{spa} = H_{tar} \otimes \mathrm{SiLU}(z_{tar}) + H_{aux} \otimes \mathrm{SiLU}(z_{tar}),
\tag{13}
$$

where $\otimes$ represents element-wise multiplication. In this way, we build the TCM module in a hidden state space based on the gating mechanism. The gating parameters are derived from the target modality, ensuring that the target modality predominates in the fusion process and selectively incorporates complementary information from the auxiliary modality.
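As a minimal sketch of the gated fusion in Eq. (13), the NumPy snippet below treats the SSM outputs $H_{tar}$ and $H_{aux}$ as given arrays. The function name `target_gated_fusion`, the toy shapes, and the random inputs are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def target_gated_fusion(h_tar, h_aux, z_tar):
    """Eq. (13): modulate both hidden-state features with the gate SiLU(z_tar)."""
    gate = silu(z_tar)
    return h_tar * gate + h_aux * gate


# Hypothetical toy arrays of shape (sequence_length, channels). In the paper,
# h_tar and h_aux would come from the Conv1d -> SiLU -> SSM branch of Eq. (12).
rng = np.random.default_rng(0)
h_tar = rng.standard_normal((6, 4))
h_aux = rng.standard_normal((6, 4))
z_tar = rng.standard_normal((6, 4))

f_spa = target_gated_fusion(h_tar, h_aux, z_tar)
```

Because the same gate multiplies both terms, the fusion is equivalent to $(H_{tar} + H_{aux}) \otimes \mathrm{SiLU}(z_{tar})$, which makes explicit that only the target branch decides how much of the summed hidden state passes through.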

In our framework, we stack four TCM modules with residual connections to preserve the integrity of target features. The fused feature $F_{spa}$ is subsequently forwarded to the spatial-frequency fusion module to facilitate the reconstruction of target images.

III-D Selective Frequency Fusion

To further fuse the complementary information from the extracted features in a broader view, we resort to the frequency domain (i.e., the Fourier domain), where each frequency component depends on all the pixels in the spatial domain and thus inherently captures global properties. Additionally, as highlighted in [44, 45], the phase spectrum in the Fourier domain preserves global structure well, while image degradation mainly manifests in the amplitude spectrum. To comprehensively integrate the complementary information and to restore the degraded features in the amplitude spectrum, we propose the Selective Frequency Fusion (SFF) module within the frequency domain, illustrated in Fig. 2.

Given the extracted features $F_{tar}$ and $F_{aux}$ from previous modules, we first transform them to the frequency domain through the Fourier transform $\mathcal{F}$ and obtain their amplitude and phase spectra:

$$
\begin{aligned}
\mathcal{A}_{tar}, \mathcal{P}_{tar} &= \mathcal{F}(F_{tar}),\\
\mathcal{A}_{aux}, \mathcal{P}_{aux} &= \mathcal{F}(F_{aux}).
\end{aligned}
\tag{14}
$$

For the phase spectrum fusion, we perform element-wise addition on $\mathcal{P}_{aux}$ and $\mathcal{P}_{tar}$, as both contain crucial and consistent structure information [47, 48]. The amplitude spectrum, in contrast, encapsulates style information, which varies significantly across modalities. Moreover, the amplitude of the under-sampled low-quality images contains interference information that can negatively impact the final reconstruction. Directly concatenating the amplitude spectra of the two modalities would therefore introduce incompatible and interfering information, compromising the quality of the reconstructed image [47, 48].
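The amplitude and phase decomposition of Eq. (14) can be illustrated with a short NumPy sketch. The helper names `to_amplitude_phase` and `from_amplitude_phase`, the toy feature shape, and the choice of keeping only the real part after the inverse transform are assumptions made for illustration.

```python
import numpy as np


def to_amplitude_phase(feature):
    """2D FFT over the last two axes, split into amplitude and phase."""
    spectrum = np.fft.fft2(feature, axes=(-2, -1))
    return np.abs(spectrum), np.angle(spectrum)


def from_amplitude_phase(amplitude, phase):
    """Recombine amplitude and phase and return to the spatial domain.

    Keeping only the real part is an assumption of this sketch: once amplitude
    and phase are edited independently, the spectrum need not stay conjugate
    symmetric, so the inverse transform may have a small imaginary component.
    """
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real


# Hypothetical real-valued feature map with shape (channels, height, width).
rng = np.random.default_rng(0)
feature = rng.standard_normal((2, 8, 8))

amplitude, phase = to_amplitude_phase(feature)
recovered = from_amplitude_phase(amplitude, phase)
```

With unmodified amplitude and phase, the round trip returns the input up to floating-point error, so any change in the output of a frequency-domain module comes from how the two spectra are fused.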

Figure 2: Illustration of Selective Frequency Fusion (SFF) module.

To mitigate this issue, we adopt a selective strategy to fuse the amplitude spectrum. Specifically, we design a Selective Amplitude Harmonization Module (SAHM), wherein the amplitude spectrum across modalities is dynamically adjusted based on global statistics. First, we conduct element-wise addition of $\mathcal{A}_{aux}$ and $\mathcal{A}_{tar}$ to produce an intermediate feature $\mathcal{A}$. Next, we adopt global average pooling to generate channel-wise statistics $s \in \mathbb{R}^{C}$, embedding global information. Specifically, the $c$-th element of $s$ is calculated by averaging $\mathcal{A}$ over the spatial dimensions $H \times W$, i.e., $s_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{A}_c(i,j)$. For computational efficiency, a fully connected (FC) layer is utilized to create a compact feature $z \in \mathbb{R}^{L \times 1}$, which is further used to guide the adaptive selection. The operation can be expressed as follows:

$$
z = f_{fc}(s) = \sigma(\mathcal{B}(\mathbf{W}s)),
\tag{15}
$$

where $\sigma$ is the ReLU activation function, $\mathcal{B}$ represents batch normalization, and $\mathbf{W} \in \mathbb{R}^{L \times C}$, with $L$ denoting the number of channels after dimension reduction.

To dynamically select different scales of the amplitude spectrum from the two modalities, the compact feature descriptor $z \in \mathbb{R}^{L \times 1}$ is used to compute a soft attention score across channels. To achieve this, a softmax is applied to the channel-wise values:

$$
a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z}}, \quad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z}},
\tag{16}
$$

where $A, B \in \mathbb{R}^{C \times L}$, $A_c$ and $B_c$ denote their $c$-th rows, and $a$, $b$ represent the soft attention vectors for $\mathcal{A}_{aux}$ and $\mathcal{A}_{tar}$, respectively. Finally, the fused amplitude spectrum $\overline{\mathcal{A}}$ is obtained as the weighted sum of $\mathcal{A}_{aux}$ and $\mathcal{A}_{tar}$:

$$
\overline{\mathcal{A}_c} = a_c \cdot \mathcal{A}_{aux} + b_c \cdot \mathcal{A}_{tar}.
\tag{17}
$$

To further enhance the fused amplitude spectrum and phase spectrum, we employ two groups of independent operations $\mathrm{ConvA}(\cdot)$ and $\mathrm{ConvP}(\cdot)$. Each group consists of convolution layers with a kernel size of $1 \times 1$ and a ReLU activation function. The final restored amplitude spectrum $\mathcal{A}'$ and phase spectrum $\mathcal{P}'$ are obtained by

$$
\begin{aligned}
\mathcal{A}' &= \mathrm{ConvA}(\overline{\mathcal{A}}),\\
\mathcal{P}' &= \mathrm{ConvP}(\mathcal{P}_{aux} + \mathcal{P}_{tar}).
\end{aligned}
\tag{18}
$$

Finally, the restored amplitude spectrum $\mathcal{A}'$ and phase spectrum $\mathcal{P}'$ are converted back to the spatial domain through the inverse Fourier transform:

$$
F_{fre} = \mathcal{F}^{-1}(\mathcal{A}', \mathcal{P}'),
\tag{19}
$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. Through the FFT and selective fusion in the Fourier domain, the SFF module efficiently achieves global feature fusion and recovers high-frequency signals for structure details.
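The SFF pipeline of Eqs. (14) to (19) can be sketched end to end in NumPy. This is an illustrative simplification, not the authors' implementation: the batch normalization in Eq. (15) and the $1 \times 1$ convolutions of Eq. (18) are omitted, the matrices `w_fc`, `sel_a`, and `sel_b` are randomly initialized stand-ins for learned parameters, and all names and shapes are hypothetical.

```python
import numpy as np


def selective_frequency_fusion(f_tar, f_aux, w_fc, sel_a, sel_b):
    """Simplified SFF for real features of shape (C, H, W).

    w_fc: (L, C) reduction matrix; sel_a, sel_b: (C, L) selection matrices.
    Batch normalization (Eq. 15) and ConvA/ConvP (Eq. 18) are left out.
    """
    # Eq. (14): amplitude and phase of both modalities.
    spec_tar = np.fft.fft2(f_tar, axes=(-2, -1))
    spec_aux = np.fft.fft2(f_aux, axes=(-2, -1))
    amp_tar, pha_tar = np.abs(spec_tar), np.angle(spec_tar)
    amp_aux, pha_aux = np.abs(spec_aux), np.angle(spec_aux)

    # Eq. (15): global average pooling, then FC + ReLU.
    s = (amp_aux + amp_tar).mean(axis=(1, 2))   # shape (C,)
    z = np.maximum(w_fc @ s, 0.0)               # shape (L,)

    # Eq. (16): two-way softmax per channel (max-shifted for stability).
    logits = np.stack([sel_a @ z, sel_b @ z])   # shape (2, C)
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    a, b = weights[0], weights[1]

    # Eq. (17): weighted amplitude fusion; phases are added element-wise.
    amp = a[:, None, None] * amp_aux + b[:, None, None] * amp_tar
    pha = pha_aux + pha_tar

    # Eq. (19): back to the spatial domain (real part kept by assumption).
    f_fre = np.fft.ifft2(amp * np.exp(1j * pha), axes=(-2, -1)).real
    return f_fre, a, b


# Hypothetical sizes: C = 8 channels, L = 4 reduced channels, 6 x 6 maps.
rng = np.random.default_rng(0)
f_tar = rng.standard_normal((8, 6, 6))
f_aux = rng.standard_normal((8, 6, 6))
w_fc = rng.standard_normal((4, 8))
sel_a = rng.standard_normal((8, 4))
sel_b = rng.standard_normal((8, 4))

f_fre, a, b = selective_frequency_fusion(f_tar, f_aux, w_fc, sel_a, sel_b)
```

Since $a_c + b_c = 1$ for every channel, the fused amplitude is a per-channel convex combination of the two input amplitudes, which is what lets the module down-weight the modality whose amplitude is less reliable.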

Figure 3: Illustration of the Adaptive Spatial-Frequency Fusion module.

III-E Adaptive Spatial-Frequency Fusion

To further enhance the fused spatial domain feature $F_{spa}$ and the fused frequency domain feature $F_{fre}$, and to facilitate their efficient integration, we introduce the Adaptive Spatial-Frequency Fusion (ASFF) module, as illustrated in Fig. 3. The ASFF module allows the two domains to complement each other through channel-wise adaptive integration according to the informativeness of each channel, which is measured by the Batch-Normalization (BN) layer. Previous studies [49, 50] have shown that the BN scaling factor indicates channel importance, with smaller values signifying lesser relevance to the final outcomes. Consequently, we propose that channels with diminished significance in one domain can be reinforced by incorporating corresponding channel information from the other domain.

Firstly, we measure the channel-wise informativeness of the fused spatial domain feature $F_{spa}$ and the frequency feature $F_{fre}$ through the BN layer. Denoting the $c$-th channel of the features as $F_{spa,c}$ and $F_{fre,c}$, the BN is computed as follows:

$$
\begin{aligned}
F_{spa,c}' &= \omega_{spa,c}\frac{F_{spa,c} - \mu_{spa,c}}{\sqrt{\sigma_{spa,c}^{2} + \epsilon}} + \beta_{spa,c},\\
F_{fre,c}' &= \omega_{fre,c}\frac{F_{fre,c} - \mu_{fre,c}}{\sqrt{\sigma_{fre,c}^{2} + \epsilon}} + \beta_{fre,c},
\end{aligned}
\tag{20}
$$

where $\mu$ and $\sigma^{2}$ are the channel-wise mean and variance; $\omega_{spa,c}$ and $\omega_{fre,c}$ are trainable scaling factors; $\beta_{spa,c}$ and $\beta_{fre,c}$ are trainable offsets; and $\epsilon$ is a small constant to avoid division by zero.

The factors $\omega_{spa,c}$ and $\omega_{fre,c}$ evaluate the correlation between the input and the output after normalization during training. If $\omega_{spa,c}$ or $\omega_{fre,c}$ approaches 0, the gradient of the loss with respect to $F_{spa,c}$ or $F_{fre,c}$ will also be close to 0. This implies that $F_{spa,c}'$ or $F_{fre,c}'$ will have minimal influence on the final results. Given this insight, we propose to enhance the channels with small scaling factors in one domain by incorporating the corresponding channels from the other domain. The incorporation is performed as follows:

$$
F_{spa,c}' =
\begin{cases}
F_{spa,c}, & \text{if } \omega_{spa,c} \geq \tau_{spa};\\
F_{spa,c} \otimes F_{fre,c}, & \text{if } \omega_{spa,c} < \tau_{spa};
\end{cases}
\tag{21}
$$

$$
F_{fre,c}' =
\begin{cases}
F_{fre,c}, & \text{if } \omega_{fre,c} \geq \tau_{fre};\\
F_{fre,c} \otimes F_{spa,c}, & \text{if } \omega_{fre,c} < \tau_{fre};
\end{cases}
\tag{22}
$$

where $\otimes$ denotes element-wise multiplication, and $\tau_{spa}$ and $\tau_{fre}$ are thresholds that we set according to the maximum and minimum of the scaling factors:

$$
\begin{aligned}
\tau_{spa} &= \omega_{spa}^{min} + \alpha(\omega_{spa}^{max} - \omega_{spa}^{min}),\\
\tau_{fre} &= \omega_{fre}^{min} + \alpha(\omega_{fre}^{max} - \omega_{fre}^{min}),
\end{aligned}
\tag{23}
$$

where $\alpha$ is a hyperparameter, which we set to 0.1 in our experiments.
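The thresholding and channel-wise incorporation of Eqs. (21) to (23) can be sketched as follows. The function names, the toy scaling factors, and the feature shapes are hypothetical; in a trained network the scaling factors would be read from the BN layers rather than written by hand.

```python
import numpy as np


def threshold(omega, alpha=0.1):
    """Eq. (23): tau = min + alpha * (max - min) over the channel scaling factors."""
    return omega.min() + alpha * (omega.max() - omega.min())


def adaptive_exchange(f_spa, f_fre, omega_spa, omega_fre, alpha=0.1):
    """Eqs. (21)-(22) for features of shape (C, H, W) and BN scales of shape (C,)."""
    weak_spa = omega_spa < threshold(omega_spa, alpha)
    weak_fre = omega_fre < threshold(omega_fre, alpha)
    product = f_spa * f_fre  # element-wise product used for weak channels
    out_spa = np.where(weak_spa[:, None, None], product, f_spa)
    out_fre = np.where(weak_fre[:, None, None], product, f_fre)
    return out_spa, out_fre, weak_spa, weak_fre


# Hypothetical BN scaling factors for C = 4 channels.
omega_spa = np.array([0.02, 0.50, 0.90, 1.00])  # channel 0 falls below tau_spa
omega_fre = np.array([0.80, 0.05, 0.70, 0.60])  # channel 1 falls below tau_fre

rng = np.random.default_rng(0)
f_spa = rng.standard_normal((4, 5, 5))
f_fre = rng.standard_normal((4, 5, 5))

out_spa, out_fre, weak_spa, weak_fre = adaptive_exchange(
    f_spa, f_fre, omega_spa, omega_fre
)
```

With $\alpha = 0.1$, only channels whose scaling factor lies in the lowest tenth of the observed range are modified, so most channels pass through unchanged.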

Through the ASFF module, less informative features from one domain are enhanced by the corresponding features from the other, while redundant information is effectively eliminated. This process ensures that each channel contributes optimally to the final fused features. By leveraging the most relevant information from spatial and frequency domains, the ASFF module produces feature representations that are most conducive to generating high-quality target images. After the channel-wise incorporation, the features $F_{spa}'$ and $F_{fre}'$ are concatenated and fed to a CNN decoder to recover the reconstructed target modality image $\hat{I}_{tar}$:

$$
\hat{I}_{tar} = \mathrm{Decoder}(\mathrm{Cat}[F_{spa}', F_{fre}']).
\tag{24}
$$

IV Experiments and Results

IV-A Dataset Description

In this study, two datasets with different anatomical structures and protocols are adopted for evaluation: the BraTS dataset [51] and the fastMRI knee dataset [52]. The BraTS dataset contains both T1WI and T2WI scans of the brain. We extract 2D images uniformly from 100 3D MRI volumes in the BraTS dataset. The dataset is split subject-wise with a 3:1 ratio, yielding a total of 3,621 images for training and 1,088 images for testing. The 2D image size is $240 \times 240$. In our experiments, we adopt T1WI as the auxiliary modality for the reconstruction of the T2WI modality. The fastMRI dataset is the largest public MRI dataset with raw k-space data. Following [53], 227 and 45 pairs of single-coil PDWI and FS-PDWI knee volumes are selected for training and testing, respectively, resulting in a total of 8,332 pairs of 2D images for training and 1,665 images for testing. The 2D image size is $320 \times 320$. In our experiments, we adopt PDWI as the auxiliary modality for the reconstruction of the FS-PDWI modality.

IV-B Experimental Setup

Comparison Methods. To demonstrate the effectiveness of our proposed MMR-Mamba, we compare it against seven multi-contrast reconstruction methods: MDUNet [5], MINet [54], MCCA [13], MTrans [4], DCAMSR [14], SwinIR [55], and Pan-Mamba [46]. For a fair comparison, we run all the models for 100,000 iterations with a batch size of 4. Other settings of the compared methods follow the implementations provided in their original papers.

Performance Metrics. For quantitative evaluation, we assess the image reconstruction results using peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and normalized mean squared error (NMSE). For qualitative evaluation, we visualize the reconstruction results with error maps.
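For reference, PSNR and NMSE can be computed as in the sketch below. The paper does not state the exact formulas, so the conventions used here (NMSE as squared error normalized by the energy of the ground truth, and PSNR with the ground-truth maximum as the peak value) are assumptions borrowed from common MRI reconstruction practice. SSIM is omitted because it requires a windowed computation that is better taken from an established library.

```python
import numpy as np


def nmse(gt, pred):
    """Normalized MSE: ||gt - pred||^2 / ||gt||^2 (assumed convention)."""
    return np.sum((gt - pred) ** 2) / np.sum(gt ** 2)


def psnr(gt, pred, data_range=None):
    """PSNR in dB; the peak defaults to the ground-truth maximum (assumption)."""
    if data_range is None:
        data_range = gt.max()
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)


# Toy example: a constant image with a uniform error of 0.1 per pixel.
gt = np.full((4, 4), 1.0)
pred = gt + 0.1

toy_nmse = nmse(gt, pred)   # about 0.01
toy_psnr = psnr(gt, pred)   # about 20 dB
```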

Implementation Details. The proposed framework is developed with PyTorch, and the training and testing processes are executed on one NVIDIA RTX A6000 GPU (48GB). We employ an AdamW optimizer with an initial learning rate of 0.001 and a weight decay of 1e-4 for training. We run 100,000 iterations with a batch size of 4 for both datasets. For both datasets, the under-sampled images are obtained by applying a 1D Cartesian random under-sampling mask with 4× and 8× acceleration.
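The retrospective under-sampling step can be sketched as below. The paper only specifies a 1D Cartesian random mask with 4× and 8× acceleration; the fully sampled central band (`center_fraction=0.08`), the uniform random choice of the remaining columns, and the function names are assumptions of this sketch rather than details from the paper.

```python
import numpy as np


def cartesian_random_mask(num_cols, acceleration, center_fraction=0.08, seed=0):
    """Boolean 1D mask over k-space columns keeping num_cols // acceleration lines.

    A central low-frequency band is always kept (assumed), and the remaining
    lines are drawn uniformly at random without replacement.
    """
    rng = np.random.default_rng(seed)
    num_center = int(round(num_cols * center_fraction))
    mask = np.zeros(num_cols, dtype=bool)
    start = (num_cols - num_center) // 2
    mask[start:start + num_center] = True

    target = num_cols // acceleration
    extra = max(target - num_center, 0)
    remaining = np.flatnonzero(~mask)
    mask[rng.choice(remaining, size=extra, replace=False)] = True
    return mask


def zero_filled_reconstruction(image, mask):
    """Apply the column mask in centered k-space and return the magnitude image."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    kspace = kspace * mask[None, :]
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))


# Hypothetical 240 x 240 non-negative image standing in for a BraTS slice.
rng = np.random.default_rng(1)
image = rng.random((240, 240))

mask_4x = cartesian_random_mask(240, acceleration=4)
mask_8x = cartesian_random_mask(240, acceleration=8)
zero_filled = zero_filled_reconstruction(image, mask_4x)
```

The output of `zero_filled_reconstruction` corresponds to the zero-filling baseline, i.e., the aliased image that the network receives as the target-modality input.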

TABLE I: Quantitative results on the BraTS and fastMRI datasets with different acceleration factors. We report mean±std for the PSNR, SSIM, and NMSE metrics. The best results are highlighted in red.
BraTS

| Method | Year | PSNR↑ (4×) | SSIM↑ (4×) | NMSE ($10^{-2}$)↓ (4×) | PSNR↑ (8×) | SSIM↑ (8×) | NMSE ($10^{-2}$)↓ (8×) |
|---|---|---|---|---|---|---|---|
| Zero-filling [56] | JMRI’01 | 30.11±1.53 | 0.767±0.038 | 5.659±0.018 | 26.58±1.49 | 0.673±0.036 | 12.469±0.033 |
| MDUNet [5] | TBME’18 | 37.94±1.66 | 0.975±0.006 | 0.905±0.003 | 35.19±1.64 | 0.960±0.009 | 1.690±0.005 |
| MINet [54] | MICCAI’21 | 38.26±1.74 | 0.976±0.006 | 0.847±0.003 | 35.23±1.72 | 0.961±0.009 | 1.697±0.005 |
| SwinIR [55] | CVPR’21 | 37.87±1.73 | 0.974±0.006 | 0.926±0.003 | 34.95±1.72 | 0.960±0.009 | 1.805±0.005 |
| MTrans [4] | TMI’22 | 36.02±1.67 | 0.962±0.007 | 1.429±0.005 | 34.81±1.57 | 0.957±0.009 | 1.868±0.005 |
| MCCA [13] | JBHI’23 | 38.03±1.68 | 0.975±0.006 | 0.903±0.003 | 35.37±1.66 | 0.962±0.009 | 1.633±0.005 |
| DCAMSR [14] | MICCAI’23 | 38.60±1.75 | 0.978±0.006 | 1.193±0.004 | 35.99±1.74 | 0.965±0.009 | 1.417±0.005 |
| Pan-Mamba [46] | Arxiv’24 | 38.84±1.79 | 0.978±0.006 | 0.739±0.273 | 36.18±1.77 | 0.966±0.009 | 1.363±0.466 |
| MMR-Mamba (Ours) | | 40.98±1.88 | 0.985±0.005 | 0.454±0.190 | 37.75±1.85 | 0.974±0.008 | 0.955±0.358 |

fastMRI

| Method | Year | PSNR↑ (4×) | SSIM↑ (4×) | NMSE ($10^{-2}$)↓ (4×) | PSNR↑ (8×) | SSIM↑ (8×) | NMSE ($10^{-2}$)↓ (8×) |
|---|---|---|---|---|---|---|---|
| Zero-filling [56] | JMRI’01 | 27.68±1.75 | 0.571±0.061 | 5.020±0.013 | 25.64±1.64 | 0.454±0.070 | 7.719±0.02 |
| MDUNet [5] | TBME’18 | 28.60±1.00 | 0.600±0.050 | 4.000±0.010 | 27.90±0.86 | 0.544±0.050 | 4.600±0.02 |
| MINet [54] | MICCAI’21 | 29.47±1.88 | 0.639±0.069 | 3.334±0.009 | 28.17±1.74 | 0.563±0.081 | 4.327±0.01 |
| SwinIR [55] | CVPR’21 | 29.42±1.87 | 0.636±0.069 | 3.382±0.009 | 28.09±1.74 | 0.560±0.081 | 4.407±0.01 |
| MTrans [4] | TMI’22 | 29.00±1.79 | 0.619±0.068 | 3.699±0.009 | 27.31±1.68 | 0.526±0.081 | 5.254±0.01 |
| MCCA [13] | JBHI’23 | 29.46±1.87 | 0.637±0.069 | 3.346±0.009 | 28.23±1.75 | 0.562±0.081 | 4.275±0.01 |
| DCAMSR [14] | MICCAI’23 | 29.45±1.87 | 0.637±0.068 | 3.349±0.009 | 28.42±1.79 | 0.569±0.081 | 4.098±0.01 |
| Pan-Mamba [46] | Arxiv’24 | 29.59±1.87 | 0.645±0.068 | 3.248±0.928 | 28.36±1.76 | 0.570±0.081 | 4.159±1.034 |
| MMR-Mamba (Ours) | | 29.66±1.88 | 0.647±0.068 | 3.201±0.927 | 28.44±1.76 | 0.572±0.081 | 4.084±1.036 |

IV-C Experimental Results

In this section, we evaluate the proposed method against state-of-the-art techniques under various experimental settings.

Quantitative Results. In Table I, we report the PSNR, SSIM, and NMSE results on both datasets for 4× and 8× acceleration. On the BraTS dataset (upper part of the table), our method achieves the best reconstruction results, with a PSNR of 40.98 dB and SSIM of 0.985 under 4× acceleration, and a PSNR of 37.75 dB and SSIM of 0.974 under 8× acceleration. These results highlight our model’s efficacy in fusing information from multi-contrast images for reconstructing target images. Additionally, our method outperforms the second-best method, Pan-Mamba, by 2.14 dB in PSNR under 4× acceleration and by 1.57 dB in PSNR under 8× acceleration. The fact that Pan-Mamba achieves the second-best results further highlights the effectiveness of Mamba in MRI reconstruction. Similarly, as shown in the lower part of the table, our method achieves the best reconstruction results on the fastMRI knee dataset, with a PSNR of 29.66 dB and SSIM of 0.647 under 4× acceleration and a PSNR of 28.44 dB and SSIM of 0.572 under 8× acceleration. This further demonstrates the effectiveness and robustness of our model in reconstructing high-quality MR images.
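The reported PSNR margins on BraTS can be checked directly from the Table I entries, and a rough interpretation follows from the definition of PSNR. Assuming a single image and a fixed peak value, a gain of $g$ dB corresponds to the MSE being scaled by $10^{-g/10}$. This conversion is only indicative here, since the table reports PSNR averaged over images.

```python
# Mean PSNR values (dB) copied from Table I, BraTS dataset.
ours = {"4x": 40.98, "8x": 37.75}
pan_mamba = {"4x": 38.84, "8x": 36.18}

# PSNR gain of MMR-Mamba over Pan-Mamba at each acceleration factor.
gains_db = {k: round(ours[k] - pan_mamba[k], 2) for k in ours}

# Indicative MSE ratio for a gain of g dB at a fixed peak value: 10 ** (-g / 10).
mse_ratio = {k: 10 ** (-g / 10) for k, g in gains_db.items()}
```

Under these assumptions, the 2.14 dB gain at 4× acceleration corresponds to an MSE ratio of roughly 0.61, and the 1.57 dB gain at 8× to roughly 0.70.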

TABLE II: Ablation Study of the proposed modules on BraTS dataset. We report mean±std for the PSNR, SSIM, and NMSE metrics.
| Model | TCM | SFF | ASFF | PSNR↑ (4×) | SSIM↑ (4×) | NMSE ($10^{-2}$)↓ (4×) | PSNR↑ (8×) | SSIM↑ (8×) | NMSE ($10^{-2}$)↓ (8×) |
|---|---|---|---|---|---|---|---|---|---|
| a | | | | 38.45±1.75 | 0.977±0.006 | 0.809±0.294 | 35.72±1.72 | 0.964±0.009 | 1.512±0.499 |
| b | ✓ | | | 39.05±1.79 | 0.979±0.006 | 0.706±0.266 | 36.17±1.80 | 0.967±0.009 | 1.368±0.476 |
| c | | ✓ | | 40.49±1.84 | 0.984±0.005 | 0.509±0.205 | 37.22±1.79 | 0.971±0.008 | 1.078±0.392 |
| d | ✓ | ✓ | | 40.66±1.86 | 0.985±0.005 | 0.489±0.201 | 37.50±1.84 | 0.973±0.008 | 1.012±0.374 |
| e | ✓ | ✓ | ✓ | 40.98±1.88 | 0.985±0.005 | 0.454±0.190 | 37.75±1.85 | 0.974±0.008 | 0.955±0.358 |

Qualitative Results. To better evaluate the reconstruction quality, we visualize the outputs from different methods for the BraTS and fastMRI datasets under 4× and 8× acceleration in Fig. 4, along with their corresponding error maps. These maps illustrate the discrepancies between the reconstructed and the ground truth images, with blue indicating minimal error and red indicating higher error. The error maps reveal that zero-filled reconstructions exhibit pronounced artifacts, with error levels escalating as the acceleration factor increases from 4× to 8×. Notably, our method consistently shows the lowest reconstruction error across both datasets and acceleration factors. This observation underscores that our method ensures superior preservation of essential anatomical details and consistently produces high-quality reconstructed images.

Figure 4: Qualitative evaluation of reconstruction results from different methods on the BraTS dataset and the fastMRI knee dataset under 4× and 8× acceleration. For every group, the first row shows the reconstructed images and the second row displays the error maps between the results and the ground truth. More color in the error map indicates worse reconstruction results.
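The zero-filled reconstructions and error maps discussed above can be illustrated with a small one-dimensional sketch. It is not the paper's pipeline: the naive `dft` helper, the toy rectangular signal, and the two low-frequency sampling sets `kept_4x` and `kept_8x` are all hypothetical choices made so the example runs with the Python standard library only. Unsampled k-space entries are set to zero before the inverse transform, and the error map is the magnitude of the difference from the ground truth.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def zero_filled(x, kept):
    """Keep only the k-space samples whose index is in `kept`, zero the rest, invert."""
    k_space = dft(x)
    masked = [v if k in kept else 0j for k, v in enumerate(k_space)]
    return dft(masked, inverse=True)


def error_map(recon, target):
    """Pointwise magnitude of the reconstruction error."""
    return [abs(r - t) for r, t in zip(recon, target)]


# Toy ground truth: a rectangular pulse of length 16 (hypothetical data).
signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4

# Hypothetical sampling sets: 4 of 16 samples (4x) and 2 of 16 samples (8x).
kept_4x = {14, 15, 0, 1}
kept_8x = {0, 1}

error_map_4x = error_map(zero_filled(signal, kept_4x), signal)
error_map_8x = error_map(zero_filled(signal, kept_8x), signal)

# Total error energy; it cannot shrink when the 8x set is a subset of the 4x set.
energy_4x = sum(e ** 2 for e in error_map_4x)
energy_8x = sum(e ** 2 for e in error_map_8x)
print(f"error energy at 4x: {energy_4x:.4f}, at 8x: {energy_8x:.4f}")
```

Because the 8× set here is a subset of the 4× set, Parseval's relation guarantees that the error energy does not decrease as more samples are discarded, which mirrors the trend visible in the zero-filled error maps.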

IV-D Ablation Studies

In this section, we first conduct an ablation study to assess the efficacy of the proposed modules. Then we analyze different fusion strategies for both spatial and frequency domains.

IV-D1 Effectiveness of the Proposed Modules

To verify the validity of the proposed modules, we conducted ablation experiments on the BraTS dataset under 4× and 8× acceleration. The configurations tested are denoted as follows: (a) a baseline model with all proposed modules removed; (b) the baseline model enhanced with the TCM; (c) the baseline model enhanced with the SFF; (d) the model augmented with both the TCM and SFF; and (e) the complete model incorporating all proposed modules. The results of these experiments are reported in Table II. As shown, incorporating the TCM improves the PSNR from 38.45 dB to 39.05 dB under 4× acceleration, while incorporating the SFF improves the PSNR to 40.49 dB. Combining the TCM and SFF modules further boosts the PSNR to 40.66 dB. Finally, the inclusion of the ASFF leads to an additional improvement, raising the PSNR to 40.98 dB. Similar trends are observed under 8× acceleration. Visual results on the BraTS dataset, including error maps, are shown in Fig. 5. As shown in the figure, the TCM module alone reduces the overall error, yet some structural details remain unrecovered, as indicated by the yellow box. Conversely, the SFF module effectively restores structural details but exhibits larger errors in the lateral ventricle areas, highlighted by the red box. When both the TCM and SFF modules are employed, most regions and structures are adequately restored. Ultimately, integrating all proposed modules yields the best results. The quantitative and visual results from our ablation experiments verify the effectiveness of each proposed module in enhancing the performance of multi-contrast MRI reconstruction.

Figure 5: Visualization of the results from the ablation study of our proposed modules on the BraTS dataset under 8× acceleration.
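To make the module-wise gains easier to compare, the short sketch below tabulates the PSNR improvement of each configuration over baseline (a), using only the mean PSNR values listed in Table II. The dictionary layout and the helper name `gains_over_baseline` are illustrative choices, not part of the paper.

```python
# Mean PSNR (dB) per configuration, copied from Table II.
psnr_table = {
    "4x": {"a": 38.45, "b": 39.05, "c": 40.49, "d": 40.66, "e": 40.98},
    "8x": {"a": 35.72, "b": 36.17, "c": 37.22, "d": 37.50, "e": 37.75},
}


def gains_over_baseline(row):
    """PSNR gain (dB) of each configuration relative to baseline 'a'."""
    base = row["a"]
    return {name: round(value - base, 2) for name, value in row.items() if name != "a"}


gains = {accel: gains_over_baseline(row) for accel, row in psnr_table.items()}
for accel, row_gains in gains.items():
    print(accel, row_gains)
```

In this tabulation, the individual gains of (b) and (c) under 4× acceleration (0.60 dB and 2.04 dB) add up to more than the 2.21 dB gain of (d), which suggests that the contributions of the TCM and SFF partly overlap rather than being purely additive.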

IV-D2 Analysis on Spatial Domain Fusion

To evaluate the effectiveness of our proposed spatial domain fusion module, we design three ablation experiments: (1) fusion through element-wise addition, (2) fusion through cross-attention, and (3) fusion through our proposed TCM module, denoted as ‘Sum’, ‘CA’, and ‘TCM’ in Table III, respectively. The results indicate that element-wise addition of features from the two modalities leads to the worst results. In contrast, our TCM module outperforms both ‘Sum’ and ‘CA’ under 4× and 8× acceleration while using the fewest parameters (0.86 M, compared with 0.90 M for ‘Sum’ and 110.42 M for ‘CA’).

TABLE III: Ablation study on spatial domain fusion on the BraTS dataset under 4× and 8× acceleration.
Accel.  Method  PSNR↑       SSIM↑        NMSE (10^{-2})↓  Param (M)
4×      Sum     38.56±1.74  0.977±0.006  0.789±0.285      0.90
4×      CA      38.76±1.76  0.978±0.006  0.754±0.278      110.42
4×      TCM     39.05±1.79  0.979±0.006  0.706±0.266      0.86
8×      Sum     35.82±1.74  0.965±0.009  1.481±0.492      0.90
8×      CA      35.92±1.74  0.965±0.009  1.447±0.492      110.42
8×      TCM     36.17±1.80  0.967±0.009  1.368±0.476      0.86
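For readers unfamiliar with the two baseline fusion strategies, the following sketch contrasts element-wise addition with a bare-bones cross-attention over token sequences. It is only a conceptual illustration: the exact ‘Sum’ and ‘CA’ layers used in the experiments are not specified in this section, and the sketch assumes a single head, no learned projections, and a residual connection. The function names and toy inputs are hypothetical, and the TCM module itself is not reproduced here.

```python
import math


def sum_fusion(target, aux):
    """Element-wise addition of two token sequences of identical shape."""
    return [[t + a for t, a in zip(t_row, a_row)] for t_row, a_row in zip(target, aux)]


def cross_attention_fusion(target, aux):
    """Target tokens act as queries over auxiliary tokens (keys and values).

    ASSUMPTIONS: single head, no learned projections, residual connection.
    The nested loops evaluate len(target) * len(aux) attention scores,
    i.e. a quadratic number when both sequences have the same length.
    """
    dim = len(target[0])
    fused = []
    for query in target:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in aux
        ]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended = [sum(w * value[c] for w, value in zip(weights, aux)) for c in range(dim)]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy features (hypothetical): 3 tokens with 2 channels each.
target_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
aux_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

print(sum_fusion(target_tokens, aux_tokens))
print(cross_attention_fusion(target_tokens, aux_tokens))
```

Addition mixes features only at matching positions, whereas attention lets every target token draw on every auxiliary token at the cost of a score matrix that grows with the product of the two sequence lengths.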

IV-D3 Analysis on Frequency Domain Fusion

To verify the effectiveness of the proposed selective frequency fusion module, we design the following experiments: (1) fusion through element-wise addition and (2) fusion through our proposed SFF module, denoted as ‘Sum’ and ‘SFF’ in Table IV, respectively. The results in the table show that element-wise addition of features from the two modalities yields suboptimal performance. In contrast, our proposed SFF module improves the PSNR by 0.43 dB under 4× acceleration and by 0.21 dB under 8× acceleration.

V Conclusion

This study explores the comprehensive and efficient integration of complementary information across modalities for multi-contrast MRI reconstruction. We present the MMR-Mamba framework, which integrates information through the TCM in the spatial domain and the SFF in the frequency domain, and fuses the resulting spatial and frequency features through the ASFF module. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed method in reconstructing MR images under different acceleration factors. This work contributes valuable insights for advancing research in multi-contrast MRI reconstruction, particularly in optimizing information fusion across spatial and frequency domains.

TABLE IV: Ablation study on frequency domain fusion. We report mean±std for the PSNR, SSIM, and NMSE metrics on the BraTS dataset under 4× and 8× acceleration.
Accel.  Method  PSNR↑       SSIM↑        NMSE (10^{-2})↓
4×      Sum     40.06±1.79  0.982±0.005  0.560±0.002
4×      SFF     40.49±1.84  0.984±0.005  0.509±0.002
8×      Sum     37.01±1.77  0.970±0.008  1.128±0.004
8×      SFF     37.22±1.79  0.971±0.008  1.078±0.004

References

  • [1] E. Stoja, S. Konstandin, D. Philipp, R. N. Wilke, D. Betancourt, T. Bertuch, J. Jenne, R. Umathum, and M. Günther, “Improving magnetic resonance imaging with smart and thin metasurfaces,” Scientific reports, vol. 11, no. 1, p. 16179, 2021.
  • [2] E. Plenge, D. H. Poot, M. Bernsen, G. Kotek, G. Houston, P. Wielopolski, L. van der Weerd, W. J. Niessen, and E. Meijering, “Super-resolution methods in mri: can they improve the trade-off between resolution, signal-to-noise ratio, and acquisition time?” Magnetic resonance in medicine, vol. 68, no. 6, pp. 1983–1993, 2012.
  • [3] P. Guo, Y. Mei, J. Zhou, S. Jiang, and V. M. Patel, “Reconformer: Accelerated mri reconstruction using recurrent transformer,” IEEE Trans. Med. Imaging, 2023.
  • [4] C.-M. Feng, Y. Yan, G. Chen, Y. Xu, Y. Hu, L. Shao, and H. Fu, “Multi-modal transformer for accelerated mr imaging,” IEEE Trans. Med. Imaging, 2022.
  • [5] L. Xiang, Y. Chen, W. Chang, Y. Zhan, W. Lin, Q. Wang, and D. Shen, “Deep-learning-based multi-modal fusion for fast mr reconstruction,” IEEE Trans. Biomed. Eng., vol. 66, no. 7, pp. 2105–2114, 2018.
  • [6] B. Bilgic, V. K. Goyal, and E. Adalsteinsson, “Multi-contrast reconstruction with bayesian compressed sensing,” Magnetic resonance in medicine, vol. 66, no. 6, pp. 1601–1615, 2011.
  • [7] P. Song, L. Weizman, J. F. Mota, Y. C. Eldar, and M. R. Rodrigues, “Coupled dictionary learning for multi-contrast mri reconstruction,” IEEE Trans. Med. Imaging, vol. 39, no. 3, pp. 621–633, 2019.
  • [8] Z. Lai, X. Qu, H. Lu, X. Peng, D. Guo, Y. Yang, G. Guo, and Z. Chen, “Sparse mri reconstruction using multi-contrast image guided graph representation,” Magnetic resonance imaging, vol. 43, pp. 95–104, 2017.
  • [9] L. Sun, Z. Fan, X. Fu, Y. Huang, X. Ding, and J. Paisley, “A deep information sharing network for multi-contrast compressed sensing mri reconstruction,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 6141–6153, 2019.
  • [10] J. Lyu, G. Li, C. Wang, C. Qin, S. Wang, Q. Dou, and J. Qin, “Region-focused multi-view transformer-based generative adversarial network for cardiac cine mri reconstruction,” Med. Image Anal., vol. 85, p. 102760, 2023.
  • [11] G. Li, J. Lv, Y. Tian, Q. Dou, C. Wang, C. Xu, and J. Qin, “Transformer-empowered multi-scale contextual matching and aggregation for multi-contrast mri super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 20 636–20 645.
  • [12] K. Xuan, L. Xiang, X. Huang, L. Zhang, S. Liao, D. Shen, and Q. Wang, “Multimodal mri reconstruction assisted with spatial alignment network,” IEEE Trans. Med. Imaging, vol. 41, no. 9, pp. 2499–2509, 2022.
  • [13] B. Li, W. Hu, C.-M. Feng, Y. Li, Z. Liu, and Y. Xu, “Multi-contrast complementary learning for accelerated mr imaging,” IEEE J. Biomed. Health. Inf., 2023.
  • [14] S. Huang, J. Li, L. Mei, T. Zhang, Z. Chen, Y. Dong, L. Dong, S. Liu, and M. Lyu, “Accurate multi-contrast mri super-resolution via a dual cross-attention transformer network,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention.   Springer, 2023, pp. 313–322.
  • [15] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [16] A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 22 982–22 994, 2022.
  • [17] Z. Qin, X. Han, W. Sun, B. He, D. Li, D. Li, Y. Dai, L. Kong, and Y. Zhong, “Toeplitz neural network for sequence modeling,” arXiv preprint arXiv:2305.04749, 2023.
  • [18] Z. Xing, T. Ye, Y. Yang, G. Liu, and L. Zhu, “Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation,” arXiv preprint arXiv:2401.13560, 2024.
  • [19] J. Ma, F. Li, and B. Wang, “U-mamba: Enhancing long-range dependency for biomedical image segmentation,” arXiv preprint arXiv:2401.04722, 2024.
  • [20] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” Adv. Neural Inf. Process. Syst., vol. 33, pp. 7537–7547, 2020.
  • [21] B. Zhou, N. Dey, J. Schlemper, S. S. M. Salehi, C. Liu, J. S. Duncan, and M. Sofka, “Dsformer: A dual-domain self-supervised transformer for accelerated multi-contrast mri reconstruction,” in Proc. IEEE winter Conf. on applications of Comput. Vis., 2023, pp. 4966–4975.
  • [22] J. Wang, Y. Yang, H. Yang, C. Lian, Z. Xu, and J. Sun, “Md-graphformer: A model-driven graph transformer for fast multi-contrast mr imaging,” IEEE Trans. Comput. Imaging, 2023.
  • [23] G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo, et al., “Dagan: deep de-aliasing generative adversarial networks for fast compressed sensing mri reconstruction,” IEEE Trans. Med. Imaging, vol. 37, no. 6, pp. 1310–1321, 2017.
  • [24] M. U. Mirza, O. Dalmaz, H. A. Bedel, G. Elmas, Y. Korkmaz, A. Gungor, S. U. Dar, and T. Çukur, “Learning fourier-constrained diffusion bridges for mri reconstruction,” arXiv preprint arXiv:2308.01096, 2023.
  • [25] B. Zhou and S. K. Zhou, “Dudornet: learning a dual-domain recurrent network for fast mri reconstruction with deep t1 prior,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4273–4282.
  • [26] Y. Wang, Y. Pang, and C. Tong, “Dsmenet: Detail and structure mutually enhancing network for under-sampled mri reconstruction,” Comput. Biol. Med., vol. 154, p. 106204, 2023.
  • [27] J. Lyu, B. Sui, C. Wang, Y. Tian, Q. Dou, and J. Qin, “Dudocaf: Dual-domain cross-attention fusion with recurrent transformer for fast multi-contrast mr imaging,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention.   Springer, 2022, pp. 474–484.
  • [28] X. Liu, Y. Pang, X. Sun, Y. Liu, Y. Hou, Z. Wang, and X. Li, “Image reconstruction for accelerated mr scan with faster fourier convolutional neural networks,” IEEE Trans. Image Process., 2024.
  • [29] Q. Yi, F. Fang, G. Zhang, and T. Zeng, “Frequency learning via multi-scale fourier transformer for mri reconstruction,” IEEE J. Biomed. Health. Inf., 2023.
  • [30] M. Ekanayake, K. Pawar, M. Harandi, G. Egan, and Z. Chen, “Mcstra: A multi-branch cascaded swin transformer for point spread function-guided robust mri reconstruction,” Comput. Biol. Med., vol. 168, p. 107775, 2024.
  • [31] R. Hasani, M. Lechner, T.-H. Wang, M. Chahine, A. Amini, and D. Rus, “Liquid structural state-space models,” arXiv preprint arXiv:2209.12951, 2022.
  • [32] A. Gu, I. Johnson, A. Timalsina, A. Rudra, and C. Ré, “How to train your hippo: State space models with generalized orthogonal basis projections,” arXiv preprint arXiv:2206.12037, 2022.
  • [33] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
  • [34] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Adv. Neural Inf. Process. Syst., vol. 33, pp. 1474–1487, 2020.
  • [35] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.
  • [36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [37] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, and Y. Liu, “Vmamba: Visual state space model,” arXiv preprint arXiv:2401.10166, 2024.
  • [38] T. Guo, Y. Wang, and C. Meng, “Mambamorph: a mamba-based backbone with contrastive feature learning for deformable mr-ct registration,” arXiv preprint arXiv:2401.13934, 2024.
  • [39] Y. Yue and Z. Li, “Medmamba: Vision mamba for medical image classification,” arXiv preprint arXiv:2403.03849, 2024.
  • [40] G. Yang, K. Du, Z. Yang, Y. Du, Y. Zheng, and S. Wang, “Cmvim: Contrastive masked vim autoencoder for 3d multi-modal representation learning for ad classification,” arXiv preprint arXiv:2403.16520, 2024.
  • [41] J. Huang, L. Yang, F. Wang, Y. Wu, Y. Nan, A. I. Aviles-Rivero, C.-B. Schönlieb, D. Zhang, and G. Yang, “Mambamir: An arbitrary-masked mamba for joint medical image reconstruction and uncertainty estimation,” arXiv preprint arXiv:2402.18451, 2024.
  • [42] Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian, “A fourier-based framework for domain generalization,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14 383–14 392.
  • [43] Y. Yang and S. Soatto, “Fda: Fourier domain adaptation for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4085–4095.
  • [44] H. Yu, N. Zheng, M. Zhou, J. Huang, Z. Xiao, and F. Zhao, “Frequency and spatial dual guidance for image dehazing,” in Eur. Conf. Comput. Vis.   Springer, 2022, pp. 181–198.
  • [45] M. Zhou, J. Huang, K. Yan, H. Yu, X. Fu, A. Liu, X. Wei, and F. Zhao, “Spatial-frequency domain information integration for pan-sharpening,” in Eur. Conf. Comput. Vis.   Springer, 2022, pp. 274–291.
  • [46] X. He, K. Cao, K. Yan, R. Li, C. Xie, J. Zhang, and M. Zhou, “Pan-mamba: Effective pan-sharpening with state space model,” arXiv preprint arXiv:2402.12192, 2024.
  • [47] N. Skarbnik, Y. Y. Zeevi, and C. Sagiv, The importance of phase in image processing.   Technion-Israel Institute of Technology, Faculty of Electrical Engineering, 2009.
  • [48] Q. Liu, C. Chen, J. Qin, Q. Dou, and P.-A. Heng, “Feddg: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1013–1023.
  • [49] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2736–2744.
  • [50] J. Ye, X. Lu, Z. Lin, and J. Z. Wang, “Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers,” arXiv preprint arXiv:1802.00124, 2018.
  • [51] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al., “The multimodal brain tumor image segmentation benchmark (brats),” IEEE Trans. Med. Imaging, vol. 34, no. 10, pp. 1993–2024, 2014.
  • [52] F. Knoll, J. Zbontar, A. Sriram, M. J. Muckley, M. Bruno, A. Defazio, M. Parente, K. J. Geras, J. Katsnelson, H. Chandarana, et al., “fastmri: A publicly available raw k-space and dicom dataset of knee images for accelerated mr image reconstruction using machine learning,” Radiology: Artificial Intelligence, vol. 2, no. 1, p. e190007, 2020.
  • [53] K. Xuan, S. Sun, Z. Xue, Q. Wang, and S. Liao, “Learning mri k-space subsampling pattern using progressive weight pruning,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention.   Springer, 2020, pp. 178–187.
  • [54] C.-M. Feng, H. Fu, S. Yuan, and Y. Xu, “Multi-contrast mri super-resolution via a multi-stage integration network,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention.   Springer, 2021, pp. 140–149.
  • [55] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 1833–1844.
  • [56] M. A. Bernstein, S. B. Fain, and S. J. Riederer, “Effect of windowing and zero-filled reconstruction of mri data on spatial resolution and acquisition strategy,” J. Magn. Reson. Imaging, vol. 14, no. 3, pp. 270–280, 2001.