
¹ School of Biological Science and Medical Engineering, State Key Laboratory of Software Development Environment, Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing 100191, China
Email: [email protected]
² Department of Biomedical Engineering, Tsinghua University, Beijing 100084, China
³ Department of Gynecology Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
⁴ ByteDance Inc., Beijing 100098, China

All-In-One Medical Image Restoration via Task-Adaptive Routing

Zhiwen Yang¹, Haowei Chen¹, Ziniu Qian¹, Yang Yi¹, Hui Zhang², Dan Zhao³, Bingzheng Wei⁴, and Yan Xu¹ (✉, corresponding author)
Abstract

Although single-task medical image restoration (MedIR) has achieved remarkable success, the limited generalizability of these methods is a substantial obstacle to wider application. In this paper, we focus on all-in-one medical image restoration, which aims to address multiple distinct MedIR tasks with a single universal model. However, because MedIR tasks differ substantially from one another, training a universal model often suffers from task interference: tasks that share parameters may conflict with each other in their gradient update directions. This interference causes the model's update direction to deviate from the optimal path and thereby degrades performance. To tackle this issue, we propose a task-adaptive routing strategy that allows conflicting tasks to select different network paths in the spatial and channel dimensions, thereby mitigating task interference. Experimental results demonstrate that our proposed All-in-one Medical Image Restoration (AMIR) network achieves state-of-the-art performance on three MedIR tasks (MRI super-resolution, CT denoising, and PET synthesis) in both single-task and all-in-one settings. The code and data will be available at https://github.com/Yaziwel/AMIR.

Keywords: All-in-one · Medical image restoration · Task routing.

1 Introduction

Medical image restoration (MedIR) refers to the process of transforming low-quality (LQ) medical images into high-quality (HQ) images. MedIR has achieved remarkable success in individual tasks such as MRI super-resolution [1, 2], CT denoising [3, 4, 5, 6], and PET synthesis [7, 8, 9, 10, 11, 12]. Within single-task MedIR, some studies also address variations in input degradation, such as varying noise levels [8], imaging protocols [12], and imaging centers [12]. The well-defined settings of these MedIR tasks allow researchers to design models tailored to the unique characteristics of each individual task.

However, these single-task models designed for particular MedIR tasks often encounter significant performance drops when applied to other MedIR tasks. In complex application scenarios such as multimodal imaging (e.g., PET/CT and PET/MRI), multiple MedIR tasks coexist simultaneously. Single-task models struggle to address the differences between modalities and tasks, and utilizing separate models for each task may result in inefficiencies in both usage and maintenance. Apart from practical limitations, single-task models remain inherently constrained by their task-specific nature, hindering the evolution from specificity to a more generalized intelligence in the field of medical image restoration. Therefore, there is a pressing need to develop a single universal model capable of simultaneously handling multiple MedIR tasks.

Figure 1: (a) Examples of LQ/HQ pairs for three different MedIR tasks. (b) The interference metric [13] of task $j$ on task $i$ at the second and last blocks of Restormer [2]. Red values indicate that task $j$ negatively impacts task $i$, while green values indicate a positive impact.

Recently, all-in-one image restoration [14, 15, 16, 17] (also known as multi-task image restoration) has gained prominence for natural images; it attempts to address multiple restoration tasks with a single universal model and therefore also holds potential for MedIR. However, given the substantial disparities between medical and natural image restoration tasks, directly applying all-in-one natural image restoration methods to MedIR tasks is not advisable. In natural image restoration, task distinctions arise primarily from different input degradations, while the ground truths are presumed to follow a common data distribution. In MedIR, by contrast, not only do the input degradations differ, but the ground truths of different tasks also follow markedly different data distributions because of modality differences (as shown in Fig. 1 (a)). These large differences between MedIR tasks can cause task interference, a common issue in multi-task learning [18, 13] in which the gradient update directions of different tasks are inconsistent or even opposite. As shown in Fig. 1 (b), we compute the interference metric defined in [13] among different MedIR tasks and observe significant task interference. Such interference leads to an uncertain update direction that deviates from the optimal one, resulting in suboptimal model performance. Although addressing task interference is essential for handling multiple MedIR tasks, current all-in-one methods rarely tackle it explicitly.
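To make "inconsistent or even opposite" gradient directions concrete, the minimal sketch below measures the cosine similarity between two tasks' gradients with respect to the same shared parameters; a negative value means that a step which helps one task tends to hurt the other. This is a simplified proxy chosen for illustration, not the interference metric of [13], and the gradient vectors are hypothetical.

```python
import math
from typing import Sequence


def gradient_cosine(g_i: Sequence[float], g_j: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradients of shared parameters."""
    if len(g_i) != len(g_j):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_i, g_j))
    norm_i = math.sqrt(sum(a * a for a in g_i))
    norm_j = math.sqrt(sum(b * b for b in g_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # a zero gradient carries no direction
    return dot / (norm_i * norm_j)


# Hypothetical gradients of two tasks w.r.t. the same three shared weights.
grad_ct = [0.8, -0.2, 0.1]
grad_mri = [-0.7, 0.3, 0.0]
print(round(gradient_cosine(grad_ct, grad_mri), 3))  # negative: conflicting directions
```

A value near +1 means the two tasks push the shared weights the same way; values below 0 are the conflicting case that motivates routing tasks to separate parameters.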

In this paper, we introduce an All-in-one Medical Image Restoration (AMIR) network capable of handling multiple MedIR tasks with a single universal model. The key idea behind AMIR is a task-adaptive routing strategy that dynamically directs inputs from conflicting tasks to different network paths, explicitly mitigating interference between tasks. Specifically, the proposed task-adaptive routing consists of routing instruction learning, spatial routing, and channel routing. Routing instruction learning adaptively learns task-relevant instructions from the input images, while spatial routing and channel routing use these learned instructions to route network features at the spatial and channel levels, respectively, thereby alleviating potential interference. Extensive experiments demonstrate that AMIR achieves state-of-the-art performance on three MedIR tasks (MRI super-resolution, CT denoising, and PET synthesis) in both single-task and all-in-one settings. Our contributions are threefold:

  • We propose a novel All-in-one Medical Image Restoration (AMIR) network that handles multiple different MedIR tasks with a single unified model. To the best of our knowledge, AMIR is among the first methods to handle multiple MedIR tasks in an all-in-one fashion.

  • We propose a novel task-adaptive routing strategy that mitigates interference between tasks by assigning conflicting tasks to different network paths.

  • Extensive experiments show that AMIR achieves state-of-the-art performance in both single-task and all-in-one MedIR settings.

Figure 2: Overview of the proposed all-in-one medical image restoration (AMIR) network.

2 Method

2.1 Network Architecture

The architecture of the proposed All-In-One Medical Image Restoration (AMIR) network is shown in Fig. 2 (a). AMIR adopts Restormer [2], a U-Net-style network, as the baseline model for medical image restoration. The key difference is that AMIR adds a Routing Instruction Network (RIN), together with Spatial Routing Modules (SRMs; Fig. 2 (b)) and Channel Routing Modules (CRMs; Fig. 2 (c)), for network path routing. Specifically, given an input LQ medical image $I^{LQ}\in\mathbb{R}^{H\times W\times 1}$, AMIR first extracts shallow features $I^{SF}\in\mathbb{R}^{H\times W\times C}$ by applying a $3\times3$ convolution, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. $I^{SF}$ then passes through a hierarchical encoder-bottleneck-decoder structure that transforms it into deep features $I^{DF}$, with multiple Restormer Transformer blocks [2] used at each level for feature extraction. To mitigate task interference, AMIR employs a task-adaptive routing strategy, placing SRMs before each encoder level and CRMs before the bottleneck and each decoder level. SRMs and CRMs adaptively select propagation paths for the features of different tasks based on the task-relevant instructions from the RIN, thus reducing potential interference.
Finally, a $3\times3$ convolution layer is applied to the deep features $I^{DF}$ to generate a residual image $I^{R}\in\mathbb{R}^{H\times W\times 1}$, which is added to the input LQ image to obtain the restored image $\hat{I}^{HQ}=I^{LQ}+I^{R}$. Next, we will describe the task-adaptive routing strategy and its key components in detail.
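The placement of the routing modules described above can be summarized as an ordered list of stages. The sketch below is illustrative only: it assumes a Restormer-style symmetric U-Net in which each decoder level reuses the block count of the matching encoder level and a refinement stage follows the last decoder level, and it takes the block counts from Sec. 3.2. The stage names are hypothetical labels, not identifiers from the AMIR code. With three encoder levels it yields the three SRMs mentioned in the caption of Fig. 4.

```python
from typing import List, Tuple

Stage = Tuple[str, int]  # (stage name, number of Transformer blocks; 0 for non-Transformer stages)


def amir_layout(enc_blocks: List[int], bottleneck_blocks: int,
                refinement_blocks: int) -> List[Stage]:
    """Ordered stages of an AMIR-like U-Net: an SRM before every encoder level,
    a CRM before the bottleneck and before every decoder level."""
    stages: List[Stage] = [("conv3x3_shallow", 0)]
    for level, n in enumerate(enc_blocks, start=1):
        stages.append((f"SRM_enc{level}", 0))
        stages.append((f"encoder{level}", n))
    stages.append(("CRM_bottleneck", 0))
    stages.append(("bottleneck", bottleneck_blocks))
    # Assumption: decoder levels mirror the encoder levels in reverse order.
    for level in range(len(enc_blocks), 0, -1):
        stages.append((f"CRM_dec{level}", 0))
        stages.append((f"decoder{level}", enc_blocks[level - 1]))
    stages.append(("refinement", refinement_blocks))
    stages.append(("conv3x3_residual", 0))  # produces I^R
    return stages


def add_residual(i_lq: List[List[float]], i_r: List[List[float]]) -> List[List[float]]:
    """Global residual connection: restored image = LQ input + predicted residual I^R."""
    return [[a + b for a, b in zip(row_lq, row_r)] for row_lq, row_r in zip(i_lq, i_r)]


layout = amir_layout(enc_blocks=[5, 7, 7], bottleneck_blocks=9, refinement_blocks=4)
for name, n in layout:
    print(name, n)
```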

2.2 Task-Adaptive Routing

To address the interference that arises when multiple tasks share the same parameters but have different optimization directions, we propose a task-adaptive routing strategy. This strategy lets different tasks select distinct paths for customized processing, thereby reducing interference between tasks. We describe it in terms of three components: routing instruction learning, spatial routing, and channel routing.

Routing Instruction Learning. Instructions are task-relevant representations that help the network better understand the current task and adjust its restoration direction accordingly. Previous all-in-one natural image restoration methods often use representations from contrastive learning [19, 14] as instructions. However, contrastive learning differs significantly from image restoration, which makes the two objectives difficult to balance and can introduce undesirable task interference. To address this, we propose a Routing Instruction Network (RIN; Fig. 2 (a)) that adaptively generates task-relevant instructions from the input image without additional supervision. The RIN can be formulated as follows:

$$I^{IR}=\sum_{i=1}^{N}\alpha_{i}D_{i},\quad \alpha_{i}=\operatorname{Softmax}\left(\operatorname{GAP}\left(\operatorname{E}(I)\right)\right), \tag{1}$$

where $\operatorname{E}(\cdot)$ is a five-layer CNN encoder and $\operatorname{GAP}(\cdot)$ denotes global average pooling. $D=[D_{1},D_{2},\dots,D_{N}]$ is an instruction dictionary whose $N$ instructions $D_{i}\in\mathbb{R}^{256}$ are learnable parameters. In this process, the RIN dynamically predicts the weights $\alpha_{i}$ from the input image $I$ and applies them to the instruction dictionary $D$ to generate an input-conditioned instruction $I^{IR}$. Although this process requires no supervision, we show in Fig. 4 (a) that the learned instructions $I^{IR}$ are task-relevant.
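Eq. (1) amounts to a softmax-weighted average of learnable dictionary entries. The minimal sketch below assumes that the encoder output has already been pooled to one logit per dictionary entry (the CNN encoder $\operatorname{E}$ itself is not modeled), and it uses plain Python lists in place of tensors; the sizes are toy values rather than $N=16$ and 256-dimensional instructions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_instruction(pooled_logits: Sequence[float],
                        dictionary: Sequence[Sequence[float]]) -> List[float]:
    """I^IR = sum_i alpha_i * D_i with alpha = Softmax(GAP(E(I))).

    `pooled_logits` stands in for GAP(E(I)) and must have one entry per
    dictionary instruction D_i (an assumption of this sketch).
    """
    if len(pooled_logits) != len(dictionary):
        raise ValueError("need one logit per dictionary entry")
    alpha = softmax(pooled_logits)
    dim = len(dictionary[0])
    instruction = [0.0] * dim
    for a, d_i in zip(alpha, dictionary):
        for k in range(dim):
            instruction[k] += a * d_i[k]
    return instruction


# Toy dictionary: N = 3 instructions of dimension 4.
dictionary = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]
print(routing_instruction([2.0, 0.0, -1.0], dictionary))
```

Because the weights are a softmax, the instruction always lies in the convex hull of the dictionary entries, so inputs from different tasks can only be distinguished through which entries they emphasize.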

Spatial Routing. Task interference arises when different tasks share network parameters but have different update directions. A fundamental solution is to assign conflicting tasks to separate parameters. Mixture-of-Experts (MoE) [20] offers a potential solution by learning to dynamically route inputs to different expert networks. However, vanilla MoE selects experts based solely on the representation of each input token, neglecting crucial global, task-relevant information. Hence, we enhance the routing strategy of vanilla MoE by incorporating the learned global task-relevant instruction $I^{IR}$. Given a local spatial token $x_{i}\in\mathbb{R}^{C^{\prime}}$ of the input feature $X\in\mathbb{R}^{H^{\prime}\times W^{\prime}\times C^{\prime}}$, together with the global instruction $I^{IR}$ and an expert bank $\operatorname{E}=[\operatorname{E}_{1},\operatorname{E}_{2},\dots,\operatorname{E}_{M}]$ comprising $M$ expert networks, our spatial routing module (SRM; Fig. 2 (b)) can be formulated as:

$$\operatorname{G}(x_{i},I^{IR})=\operatorname{Top\text{-}K}\left(\operatorname{Softmax}\left(\operatorname{FC}\left([x_{i},\operatorname{FC}(I^{IR})]\right)\right)\right), \tag{2}$$
$$x_{i}^{\prime}=\sum_{e=1}^{M}\operatorname{G}(x_{i},I^{IR})_{e}\,\operatorname{E}_{e}(x_{i}), \tag{3}$$

where $\operatorname{FC}(\cdot)$ denotes a fully connected layer and $[\cdot,\cdot]$ denotes concatenation. The $\operatorname{Top\text{-}K}(\cdot)$ operator sets all values to zero except the $K$ largest ones. $\operatorname{G}(\cdot)$ is the routing function that produces sparse weights for the experts. $\operatorname{E}_{e}$ is the $e$-th expert in the expert bank, each expert being a multi-layer perceptron (MLP), and $\operatorname{G}(x_{i},I^{IR})_{e}$ determines how much the $e$-th expert contributes to the output. In this process, the SRM routes each spatial token of the input feature $X$ to its top-$K$ selected experts for separate processing under the guidance of the global instruction $I^{IR}$, and then combines the experts' outputs through a weighted sum. Notably, very dissimilar tokens tend to be routed to distinct experts, which avoids potential interference. We place an SRM before each encoder level of the U-Net to mitigate task interference during encoding.
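A minimal per-token sketch of Eqs. (2) and (3) follows, using toy dimensions and plain Python lists. Each expert is a hypothetical scaling function standing in for the MLP experts, and all weights are hand-picked for illustration rather than learned. Following the text, Top-K zeroes all but the $K$ largest gate values without renormalizing them, and experts with a zero gate are not evaluated.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weight: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> Vector:
    """Fully connected layer y = W x + b, with W given row by row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def top_k(weights: Sequence[float], k: int) -> Vector:
    """Keep the k largest values and zero the rest (no renormalization)."""
    keep = set(sorted(range(len(weights)), key=lambda e: weights[e], reverse=True)[:k])
    return [w if e in keep else 0.0 for e, w in enumerate(weights)]


def spatial_routing(x_i: Sequence[float], instruction: Sequence[float],
                    w_instr, b_instr, w_gate, b_gate,
                    experts: Sequence[Expert], k: int) -> Tuple[Vector, Vector]:
    """Eqs. (2)-(3) for one spatial token x_i; returns (x_i', gate weights)."""
    projected = linear(w_instr, b_instr, instruction)          # FC(I^IR)
    logits = linear(w_gate, b_gate, list(x_i) + projected)     # FC([x_i, FC(I^IR)])
    gate = top_k(softmax(logits), k)                           # G(x_i, I^IR)
    out = [0.0] * len(x_i)
    for g_e, expert in zip(gate, experts):
        if g_e == 0.0:
            continue  # unselected experts are skipped entirely
        out = [o + g_e * y for o, y in zip(out, expert(x_i))]
    return out, gate


def make_expert(scale: float) -> Expert:
    """Hypothetical stand-in for an MLP expert: it just rescales the token."""
    return lambda v: [scale * t for t in v]


# Toy sizes: C' = 2, instruction dim 2, M = 4 experts, K = 2.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_gate = [[1.0, 0.0, 0.0, 0.0],   # gate logit of expert 1 uses x_i[0]
          [0.0, 1.0, 0.0, 0.0],   # gate logit of expert 2 uses x_i[1]
          [0.0, 0.0, 0.0, 2.0],   # gate logit of expert 3 uses the projected instruction
          [0.0, 0.0, 1.0, 0.0]]   # gate logit of expert 4 uses the projected instruction
x_out, gate = spatial_routing([1.0, 0.5], [0.0, 1.0], identity, [0.0, 0.0],
                              w_gate, [0.0] * 4, experts, k=2)
print(gate)
print(x_out)
```

Changing only the instruction vector changes the gate logits of experts 3 and 4, which is how the global task instruction can redirect a token that would otherwise be routed by its local content alone.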

Channel Routing. Although the SRM effectively mitigates task interference and offers strong interpretability, it has two drawbacks. First, routing different spatial tokens to different experts disrupts the spatial continuity of features. Second, the SRM introduces multiple expert networks, so its parameter count grows linearly with the number of experts. To address these issues, we propose a more efficient Channel Routing Module (CRM; Fig. 2 (c)). Guided by the task instruction $I^{IR}$, the CRM dynamically routes the data flow of different tasks to different channels without introducing many additional parameters, while preserving the spatial continuity of features. Given an input feature $X\in\mathbb{R}^{H^{\prime}\times W^{\prime}\times C^{\prime}}$, the CRM can be described as follows:
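The parameter argument above can be checked by simple counting. The sketch assumes each expert is a two-layer MLP with biases and a hypothetical hidden expansion ratio of 2, and that the CRM's FC layer maps the 256-dimensional instruction to $C'$ channel gates with a bias; neither layer shape is specified in the text, and the SRM gating layers are ignored.

```python
def mlp_expert_params(c: int, hidden_ratio: int = 2) -> int:
    """Weights + biases of a hypothetical two-layer MLP expert C' -> r*C' -> C'."""
    hidden = hidden_ratio * c
    return (c * hidden + hidden) + (hidden * c + c)


def srm_expert_params(c: int, num_experts: int, hidden_ratio: int = 2) -> int:
    """Expert bank of M identical MLPs: cost is linear in M (gating FCs excluded)."""
    return num_experts * mlp_expert_params(c, hidden_ratio)


def crm_params(c: int, instr_dim: int = 256) -> int:
    """CRM mask branch: one FC layer mapping the 256-d instruction to C' gates."""
    return instr_dim * c + c


channels = 42  # C' at the first level, chosen for illustration only
for m in (2, 4, 8):
    print(f"SRM experts, M={m}: {srm_expert_params(channels, m)} params")
print(f"CRM: {crm_params(channels)} params (independent of M)")
```

Under these assumptions, doubling $M$ doubles the expert-bank cost, whereas the CRM cost depends only on $C'$ and the instruction length.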

$$X^{\prime}=X\odot m,\quad m=\operatorname{Sigmoid}\left(\operatorname{FC}(I^{IR})\right), \tag{4}$$

where $m\in\mathbb{R}^{C^{\prime}}$ is a channel-wise soft binary mask estimated from the instruction $I^{IR}$ and broadcast over all spatial positions. In this way, the CRM performs channel-wise task routing by masking features according to the task instruction. We place a CRM before the bottleneck and before each decoder level of the U-Net to address task interference during decoding.
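Eq. (4) is a per-channel gate shared by every spatial position, which is why it leaves spatial continuity intact. The sketch below assumes nested lists for an $H'\times W'\times C'$ feature and hand-picked FC weights; in the network the FC layer is learned.

```python
import math
from typing import List, Sequence

Feature = List[List[List[float]]]  # H' x W' x C' as nested lists


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def channel_mask(instruction: Sequence[float], weight: Sequence[Sequence[float]],
                 bias: Sequence[float]) -> List[float]:
    """m = Sigmoid(FC(I^IR)): one soft gate in (0, 1) per channel."""
    return [sigmoid(sum(w * v for w, v in zip(row, instruction)) + b)
            for row, b in zip(weight, bias)]


def channel_routing(x: Feature, mask: Sequence[float]) -> Feature:
    """X' = X ⊙ m, with the same mask m applied at every spatial position."""
    return [[[v * g for v, g in zip(pixel, mask)] for pixel in row] for row in x]


# Toy sizes: instruction dim 2, C' = 3, a 2 x 2 spatial grid.
mask = channel_mask([1.0, 0.0], weight=[[10.0, 0.0], [0.0, 0.0], [-10.0, 0.0]],
                    bias=[0.0, 0.0, 0.0])
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]]
x_routed = channel_routing(x, mask)
print([round(g, 4) for g in mask])
```

For this instruction the first channel is almost fully passed, the second is halved, and the third is almost fully suppressed; a different task instruction would open a different subset of channels.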

2.3 Loss Function

The overall loss can be summarized as follows:

$$L=L_{1}+\gamma L_{Balance}, \tag{5}$$

where $L_{1}$ measures the L1 difference between the restored image $\hat{I}^{HQ}$ and the high-quality image $I^{HQ}$, $L_{Balance}$ [20] is an MoE regularization term that prevents static routing in which the same few experts are always selected, and $\gamma$ is a weighting parameter.
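The sketch below shows how Eq. (5) could be assembled. The text does not give the exact form of $L_{Balance}$ or the value of $\gamma$, so this sketch assumes one common balancing term from the sparsely gated MoE literature, the squared coefficient of variation of per-expert importance (the sum of each expert's gate values over the tokens), and uses a hypothetical $\gamma=0.01$.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between restored and high-quality pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cv_squared_balance(gates: Sequence[Sequence[float]], eps: float = 1e-10) -> float:
    """Squared coefficient of variation of per-expert importance (assumed form).

    `gates` holds one row of sparse gate values G(x_i, I^IR) per token.
    Equal importance across experts gives 0; routing that always favors the
    same experts gives a large value.
    """
    num_experts = len(gates[0])
    importance = [sum(row[e] for row in gates) for e in range(num_experts)]
    mean = sum(importance) / num_experts
    var = sum((v - mean) ** 2 for v in importance) / num_experts
    return var / (mean ** 2 + eps)


def total_loss(pred: Sequence[float], target: Sequence[float],
               gates: Sequence[Sequence[float]], gamma: float) -> float:
    """L = L_1 + gamma * L_Balance."""
    return l1_loss(pred, target) + gamma * cv_squared_balance(gates)


restored = [0.10, 0.52, 0.31]
target = [0.12, 0.50, 0.30]
gates = [[0.6, 0.4, 0.0, 0.0],   # token 1 routed to experts 1 and 2 (K = 2)
         [0.0, 0.0, 0.7, 0.3]]   # token 2 routed to experts 3 and 4
print(total_loss(restored, target, gates, gamma=0.01))  # gamma is hypothetical
```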

3 Experiments and Results

3.1 Dataset

Our all-in-one medical image restoration experiments cover three tasks: MRI super-resolution, CT denoising, and PET synthesis. Below, we introduce the dataset used for each task.

MRI Super-Resolution. We use the publicly available IXI MRI dataset [21], which comprises 578 HQ T2-weighted MRI volumes. Each 3D MRI volume has a size of $256\times256\times n$, from which we extract the central 100 2D slices of size $256\times256$ to exclude peripheral slices. The LQ images are obtained by cropping the $k$-space with a downsampling factor of $4\times$ (retaining the central 6.25% of the data points). The dataset is divided into 405 volumes for training, 59 for validation, and 114 for testing.
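The 6.25% figure follows from keeping the central quarter of k-space along both in-plane axes, since $(1/4)^2=1/16=6.25\%$. The sketch below assumes this two-axis interpretation and a centered (zero frequency in the middle) k-space array; it only computes the retained region and does not perform the Fourier transforms or specify how the cropped k-space is mapped back to image space.

```python
from typing import List, Tuple


def central_kspace_crop(size: int, factor: int) -> Tuple[int, float]:
    """Side length kept per axis and the fraction of k-space samples retained
    when only the central 1/factor of each of the two in-plane axes is kept."""
    kept = size // factor
    return kept, (kept * kept) / (size * size)


def central_kspace_mask(size: int, factor: int) -> List[List[bool]]:
    """Boolean mask of the retained central k-space region
    (assumes the zero frequency has been shifted to the array center)."""
    kept, _ = central_kspace_crop(size, factor)
    start = (size - kept) // 2
    return [[start <= r < start + kept and start <= c < start + kept
             for c in range(size)] for r in range(size)]


print(central_kspace_crop(256, 4))  # (64, 0.0625)
mask = central_kspace_mask(256, 4)
print(sum(row.count(True) for row in mask))  # 4096
```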

CT Denoising. We employ the dataset from the 2016 NIH AAPM-Mayo Clinic Low-Dose CT Grand Challenge [22], which comprises paired standard-dose HQ CT images and quarter-dose LQ CT images, each of size $512\times512$. These images come from 10 patients, of whom 8 are used for training, 1 for validation, and 1 for testing.

PET Synthesis. We acquire 159 HQ PET images using the PolarStar m660 PET/CT system in list mode, with an injected dose of 293 MBq of $^{18}$F-FDG. LQ PET images are generated by random list-mode decimation with a dose reduction factor of 12. Both HQ and LQ PET images are reconstructed using the standard OSEM method [23]. Each 3D PET image has a shape of $192\times192\times400$ with a voxel size of $3.15\,\mathrm{mm}\times3.15\,\mathrm{mm}\times1.87\,\mathrm{mm}$ and is divided into 192 2D slices of size $192\times400$. Slices containing only air are excluded. The patients are partitioned into 120 for training, 10 for validation, and 29 for testing.
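One simple way to realize random list-mode decimation with a dose reduction factor of 12 is to keep each recorded event independently with probability $1/12$. The sketch below is written under that assumption; the actual decimation tool and event format used for the PolarStar m660 data are not described in the text, and the integer event IDs are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def decimate_list_mode(events: Sequence[T], reduction_factor: int, seed: int = 0) -> List[T]:
    """Keep each list-mode event independently with probability 1 / reduction_factor,
    preserving the original event order."""
    rng = random.Random(seed)
    p_keep = 1.0 / reduction_factor
    return [ev for ev in events if rng.random() < p_keep]


events = list(range(120_000))  # hypothetical event IDs from one full-dose acquisition
low_dose = decimate_list_mode(events, reduction_factor=12, seed=42)
print(len(low_dose) / len(events))  # close to 1/12
```

The decimated stream would then be reconstructed with the same OSEM settings as the full-dose stream, so that the two images differ mainly in count statistics.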

3.2 Implementation

In our AMIR architecture (Fig. 2), the numbers of Transformer blocks are set to $L_{1}=5$, $L_{2}=L_{3}=7$, $L_{4}=9$, and $L_{refinement}=4$. The base channel number (the channel number $C$ of the shallow features $I^{SF}$) is set to $C=42$, and the length of the instruction dictionary is set to $N=16$. In each SRM, the number of experts is $M=4$ and the number of selected experts is $K=2$. During training, we use patches of size $128\times128$ with a batch size of 8. The model is trained with the Adam optimizer for $2\times10^{5}$ iterations, with an initial learning rate of $2\times10^{-4}$ that gradually decreases to $1\times10^{-6}$ via cosine annealing. All experiments are conducted in PyTorch on an NVIDIA A100 GPU with 40 GB of memory.
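The learning-rate schedule can be written in closed form. The sketch below evaluates standard cosine annealing from $2\times10^{-4}$ to $1\times10^{-6}$ over $2\times10^{5}$ iterations; this closed form matches the one documented for PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` equal to the number of iterations and `eta_min=1e-6`, but whether AMIR steps such a scheduler per iteration is an assumption.

```python
import math


def cosine_lr(step: int, total_steps: int = 200_000,
              lr_max: float = 2e-4, lr_min: float = 1e-6) -> float:
    """Cosine annealing from lr_max at step 0 to lr_min at total_steps (held afterwards)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50_000, 100_000, 150_000, 200_000):
    print(step, f"{cosine_lr(step):.3e}")
```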

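Table 1: Single-task medical image restoration results. The best results are in bold, and the second-best results are \ul{underlined}.
\begin{tabular}{lccc|lccc|lccc}
\hline
Method & \multicolumn{3}{c|}{MRI Super-Resolution} & Method & \multicolumn{3}{c|}{CT Denoising} & Method & \multicolumn{3}{c}{PET Synthesis} \\
 & PSNR↑ & SSIM↑ & RMSE↓ & & PSNR↑ & SSIM↑ & RMSE↓ & & PSNR↑ & SSIM↑ & RMSE↓ \\
\hline
SRCNN [24] & 28.8067 & 0.8919 & 41.3488 & CNN [3] & 32.7600 & 0.9075 & 9.3928 & Xiang's [7] & 35.9268 & 0.9167 & 0.0980 \\
VDSR [25] & 30.0446 & 0.9140 & 36.0508 & REDCNN [4] & 33.1889 & 0.9113 & 8.9427 & DCNN [8] & 36.2710 & 0.9243 & 0.0954 \\
SwinIR [26] & 31.5549 & 0.9334 & 30.5788 & Eformer [5] & \ul{33.3496} & \ul{0.9175} & \ul{8.8030} & ARGAN [9] & 36.7272 & 0.9406 & 0.0902 \\
Restormer [2] & \ul{31.8474} & \ul{0.9378} & \ul{29.7005} & CTformer [6] & 33.2506 & 0.9134 & 8.8974 & SpachTransformer [10] & \ul{37.1371} & \ul{0.9456} & \ul{0.0871} \\
AMIR & \textbf{31.9923} & \textbf{0.9393} & \textbf{29.2095} & AMIR & \textbf{33.6738} & \textbf{0.9183} & \textbf{8.4773} & AMIR & \textbf{37.2121} & \textbf{0.9473} & \textbf{0.0863} \\
\hline
\end{tabular}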
Figure 3: Visual comparison of different methods on all-in-one medical image restoration.
Table 2: All-in-one medical image restoration results.
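\begin{tabular}{lccc|ccc|ccc|ccc}
\hline
Method & \multicolumn{3}{c|}{MRI Super-Resolution} & \multicolumn{3}{c|}{CT Denoising} & \multicolumn{3}{c|}{PET Synthesis} & \multicolumn{3}{c}{Average} \\
 & PSNR↑ & SSIM↑ & RMSE↓ & PSNR↑ & SSIM↑ & RMSE↓ & PSNR↑ & SSIM↑ & RMSE↓ & PSNR↑ & SSIM↑ & RMSE↓ \\
\hline
Restormer [2] & \ul{31.7177} & \ul{0.9362} & \ul{30.0549} & 33.6142 & \ul{0.9177} & 8.5329 & \ul{37.1368} & \ul{0.9473} & \ul{0.0872} & \ul{34.1562} & \ul{0.9337} & \ul{12.8917} \\
Eformer [5] & 29.1922 & 0.8728 & 39.0983 & 32.4438 & 0.9078 & 9.7565 & 35.1096 & 0.9091 & 0.1085 & 32.2485 & 0.8966 & 16.3211 \\
Spach Transformer [10] & 31.1799 & 0.9290 & 31.8342 & 33.4740 & 0.9155 & 8.6677 & 37.0547 & 0.9445 & 0.0874 & 33.9029 & 0.9297 & 13.5298 \\
DRMC [12] & 29.5466 & 0.9032 & 38.1691 & 33.2770 & 0.9153 & 8.8674 & 36.1909 & 0.9376 & 0.0960 & 33.0048 & 0.9187 & 15.7108 \\
AirNet [14] & 31.3921 & 0.9316 & 31.1141 & \ul{33.6222} & 0.9176 & \ul{8.5226} & \textbf{37.1721} & 0.9451 & \textbf{0.0864} & 34.0621 & 0.9314 & 13.2410 \\
AMIR & \textbf{32.0262} & \textbf{0.9396} & \textbf{29.0988} & \textbf{33.7011} & \textbf{0.9182} & \textbf{8.4520} & 37.1193 & \textbf{0.9475} & 0.0876 & \textbf{34.2822} & \textbf{0.9351} & \textbf{12.5461} \\
\hline
\end{tabular}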

3.3 Comparative Experiment

To validate AMIR, we conduct evaluations on three tasks: MRI super-resolution, CT denoising, and PET synthesis. The baselines are SRCNN [24], VDSR [25], SwinIR [26], and Restormer [2] for MRI super-resolution; CNN [3], REDCNN [4], Eformer [5], and CTformer [6] for CT denoising; and Xiang's method [7], DCNN [8], ARGAN [9], and Spach Transformer [10] for PET synthesis. We conduct experiments under two settings: single-task and all-in-one. In the single-task setting, we train a separate model for each MedIR task and compare it with the corresponding baselines. In the all-in-one setting, we train one universal model to address all tasks simultaneously, and we retrain the best-performing baseline of each task in this setting for comparison. Additionally, we compare with two universal models: DRMC [12], originally developed for multi-center PET synthesis, and AirNet [14], originally designed for all-in-one natural image restoration. PSNR, SSIM, and RMSE are used to assess restoration performance.
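For reference, RMSE and PSNR can be computed as in the sketch below. The intensity range used for PSNR is not stated in the text and evidently differs between modalities (compare the RMSE scales in Table 1), so `data_range` is left as a parameter here. SSIM is omitted; scikit-image provides `skimage.metrics.structural_similarity` and `skimage.metrics.peak_signal_noise_ratio` as reference implementations.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error over all (flattened) pixels."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equally sized")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def psnr(pred: Sequence[float], target: Sequence[float], data_range: float) -> float:
    """Peak signal-to-noise ratio in dB: 20 * log10(data_range / RMSE)."""
    err = rmse(pred, target)
    if err == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(data_range / err)


restored = [0.2, 0.4, 0.6, 0.8]
reference = [0.25, 0.35, 0.65, 0.75]
print(rmse(restored, reference), psnr(restored, reference, data_range=1.0))
```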

Single-Task Medical Image Restoration. The results of single-task MedIR are shown in Table 1: AMIR surpasses the best-performing baselines (Restormer [2], Eformer [5], and Spach Transformer [10]) in MRI super-resolution, CT denoising, and PET synthesis, respectively. Although AMIR is not specifically designed for single-task MedIR, its strong performance suggests that the proposed routing strategy also helps handle sample differences within a single task.

All-In-One Medical Image Restoration. The results of all-in-one MedIR are presented in Table 2, where AMIR achieves state-of-the-art performance averaged across the three tasks. Specifically, in MRI super-resolution and CT denoising, AMIR outperforms all compared methods in PSNR, SSIM, and RMSE. In PET synthesis, AMIR does not achieve the best PSNR or RMSE, both of which are obtained by AirNet [14], but it achieves the best SSIM. Visual comparisons in Fig. 3 show that AMIR better restores image structures and details across all three tasks. We attribute AMIR's advantage over other methods to its more effective mitigation of task interference, which preserves the specificity of each task.
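The Average columns in Table 2 are consistent with an unweighted mean over the three tasks. The check below recomputes AMIR's averages from its per-task scores in Table 2.

```python
from statistics import fmean

# AMIR's all-in-one scores from Table 2: (PSNR, SSIM, RMSE) per task.
amir_all_in_one = {
    "MRI": (32.0262, 0.9396, 29.0988),
    "CT": (33.7011, 0.9182, 8.4520),
    "PET": (37.1193, 0.9475, 0.0876),
}
averages = [round(fmean(scores[k] for scores in amir_all_in_one.values()), 4)
            for k in range(3)]
print(averages)  # [34.2822, 0.9351, 12.5461]
```

Because RMSE is reported on very different intensity scales across tasks, the averaged RMSE is dominated by the MRI term.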

3.4 Ablation Study

We conduct ablation studies on different combinations of training tasks to analyze their impact on AMIR, and on the task-adaptive routing strategy to understand the specific roles of its components.

Ablation Study on Task Combinations. We train AMIR on various task combinations and list the results in Table 3. Notably, AMIR maintains its performance without significant degradation as the number of tasks increases. We attribute this to the proposed task-adaptive routing strategy, which enables multiple tasks to share a universal model with minimal interference. Furthermore, Table 3 shows that certain task combinations yield better results than single-task models, likely because universal models benefit from more training data and from task synergy.

Table 3: Ablation study on combinations of training tasks. \checkmark{} denotes that AMIR is trained with the corresponding task, and \textendash{} indicates unavailable results.
\begin{tabular}{ccc|ccc|ccc|ccc}
\hline
\multicolumn{3}{c|}{Training Task} & \multicolumn{9}{c}{Testing Task} \\
MRI Super-Resolution & CT Denoising & PET Synthesis & \multicolumn{3}{c|}{MRI Super-Resolution} & \multicolumn{3}{c|}{CT Denoising} & \multicolumn{3}{c}{PET Synthesis} \\
 & & & PSNR↑ & SSIM↑ & RMSE↓ & PSNR↑ & SSIM↑ & RMSE↓ & PSNR↑ & SSIM↑ & RMSE↓ \\
\hline
\checkmark & & & 31.9923 & 0.9393 & 29.2095 & \textendash & \textendash & \textendash & \textendash & \textendash & \textendash \\
 & \checkmark & & \textendash & \textendash & \textendash & 33.6738 & \ul{0.9183} & 8.4773 & \textendash & \textendash & \textendash \\
 & & \checkmark & \textendash & \textendash & \textendash & \textendash & \textendash & \textendash & 37.2121 & \ul{0.9473} & 0.0863 \\
\checkmark & \checkmark & & 32.0683 & 0.9404 & 29.0296 & 33.6934 & \ul{0.9183} & 8.4592 & \textendash & \textendash & \textendash \\
\checkmark & & \checkmark & \ul{32.0404} & \ul{0.9399} & \ul{29.0864} & \textendash & \textendash & \textendash & 37.1054 & \ul{0.9473} & \ul{0.0875} \\
 & \checkmark & \checkmark & \textendash & \textendash & \textendash & 33.7537 & 0.9189 & 8.4027 & 37.0738 & \ul{0.9473} & 0.0880 \\
\checkmark & \checkmark & \checkmark & 32.0262 & 0.9396 & 29.0988 & \ul{33.7011} & 0.9182 & \ul{8.4520} & \ul{37.1193} & 0.9475 & 0.0876 \\
\hline
\end{tabular}

Ablation Study on the Task-Adaptive Routing Strategy. We conduct ablations on instruction learning and the routing modules to analyze the components of the task-adaptive routing strategy. For instruction learning, we remove the dictionary $D$ and, in a further variant, additionally adopt contrastive learning [19] for guidance, similar to AirNet [14]. Table 4 demonstrates the superior effectiveness of adaptive instruction learning. The t-SNE visualization of the instruction representation $I^{IR}$ in Fig. 4 (a) reveals a clear separation between tasks, demonstrating its task relevance. For the routing modules SRM and CRM, we assess their effectiveness by removing each of them individually. Table 4 indicates that both modules improve the model's performance. Additionally, Fig. 4 (b) shows that different tasks select distinct expert networks within the SRMs, supporting the interpretability of spatial routing. Thanks to the task-adaptive routing strategy, AMIR achieves better overall results than the baseline Restormer [2] with fewer parameters (23.54 M vs. 26.12 M): although the routing modules introduce additional parameters, AMIR uses fewer channels.

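Table 4: Ablation study on the task-adaptive routing strategy.
\begin{tabular}{llcccccccccc}
\hline
\multicolumn{2}{l}{Method} & Params & \multicolumn{3}{c}{MRI Super-Resolution} & \multicolumn{3}{c}{CT Denoising} & \multicolumn{3}{c}{PET Synthesis} \\
 & & & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ \\
\hline
\multicolumn{2}{l}{Restormer [2] (Baseline)} & 26.12 M & 31.7177 & 0.9362 & 30.0549 & 33.6142 & 0.9177 & 8.5329 & 37.1368 & 0.9473 & \ul{0.0872} \\
Instruction & w/o $D$ & 23.53 M & 31.9088 & 0.9382 & 29.4575 & 33.6800 & \ul{0.9181} & \ul{8.4712} & 37.1233 & 0.9474 & 0.0876 \\
 & w/o $D$ and w/ Contrastive Learning [19] & 23.53 M & 31.9545 & 0.9388 & 29.3359 & 33.5987 & 0.9161 & 8.5495 & 37.1388 & 0.9465 & \ul{0.0872} \\
Routing Module & w/o SRM & 22.74 M & 31.8678 & 0.9379 & 29.5851 & 33.6556 & 0.9171 & 8.4942 & 37.1639 & 0.9470 & 0.0871 \\
 & w/o CRM & 23.37 M & \ul{31.9802} & \ul{0.9391} & \ul{29.2251} & \ul{33.6814} & \ul{0.9181} & \ul{8.4712} & \ul{37.1443} & 0.9476 & 0.0874 \\
\multicolumn{2}{l}{AMIR} & 23.54 M & 32.0262 & 0.9396 & 29.0988 & 33.7011 & 0.9182 & 8.4520 & 37.1193 & \ul{0.9475} & 0.0876 \\
\hline
\end{tabular}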
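The comparisons drawn in the ablation paragraph can be checked against Table 4 with simple arithmetic. The Python sketch below copies the PSNR and parameter values from Table 4 (the dictionary and function names are our own) and computes AMIR's per-task PSNR gain over each variant and its relative parameter reduction versus Restormer. Running it shows positive gains on MRI super-resolution and CT denoising for every variant, a marginally lower PET synthesis PSNR than Restormer, and a parameter reduction of roughly 9.9%.

```python
# PSNR (dB) per task, copied from Table 4: (MRI SR, CT denoising, PET synthesis).
PSNR = {
    "Restormer (Baseline)": (31.7177, 33.6142, 37.1368),
    "w/o D": (31.9088, 33.6800, 37.1233),
    "w/o D, w/ Contrastive Learning": (31.9545, 33.5987, 37.1388),
    "w/o SRM": (31.8678, 33.6556, 37.1639),
    "w/o CRM": (31.9802, 33.6814, 37.1443),
    "AMIR": (32.0262, 33.7011, 37.1193),
}
# Parameter counts in millions, also from Table 4.
PARAMS_M = {"Restormer (Baseline)": 26.12, "AMIR": 23.54}
TASKS = ("MRI SR", "CT denoising", "PET synthesis")


def psnr_gain(variant: str, reference: str = "AMIR") -> dict:
    """PSNR of `reference` minus PSNR of `variant` per task (positive: reference is better)."""
    return {
        task: round(ref - var, 4)
        for task, ref, var in zip(TASKS, PSNR[reference], PSNR[variant])
    }


def param_reduction(model: str = "AMIR", baseline: str = "Restormer (Baseline)") -> float:
    """Fraction of the baseline's parameters saved by `model`."""
    return (PARAMS_M[baseline] - PARAMS_M[model]) / PARAMS_M[baseline]


if __name__ == "__main__":
    for name in PSNR:
        if name != "AMIR":
            print(f"AMIR vs {name}: {psnr_gain(name)}")
    print(f"Parameter reduction vs Restormer: {param_reduction():.1%}")
```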
Figure 4: (a) t-SNE visualization of $I^{IR}$ from different tasks, showing clear clusters for inputs of different tasks. (b) Top-1 selected expert in each spatial routing module (SRM). In our AMIR network, there are 3 SRMs, each incorporating a mixture of experts (MoE) with 4 experts. For each task, the top-1 selected expert is identified in each of the 3 SRMs. The top-1 experts across the SRMs form a distinct path for each task.
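Fig. 4(b) reports, for each task, the top-1 expert in each of the 3 SRMs, which together form a path through the network. The caption does not state how per-token router decisions are aggregated, so the Python sketch below assumes a majority vote: each token's top-1 expert is the argmax of hypothetical gate scores, and the expert chosen most often over a task's tokens is reported for that SRM. The gate scores, function names, and tie-breaking rule are all illustrative assumptions, not the AMIR implementation.

```python
from collections import Counter
from typing import List, Sequence, Tuple

NUM_EXPERTS = 4  # experts per SRM, as stated in the Fig. 4 caption


def top1_expert(gate_logits: Sequence[Sequence[float]]) -> int:
    """Most frequently selected expert over all tokens of one SRM.

    gate_logits[t][e] is a hypothetical router score of expert e for token t.
    Ties in the vote are broken in favor of the smaller expert index.
    """
    # Per-token top-1 choice (argmax over experts).
    picks = [max(range(len(scores)), key=scores.__getitem__) for scores in gate_logits]
    counts = Counter(picks)
    return min(counts, key=lambda e: (-counts[e], e))


def expert_path(srm_logits: List[Sequence[Sequence[float]]]) -> Tuple[int, ...]:
    """Top-1 expert of each SRM, listed in network order (one entry per SRM)."""
    return tuple(top1_expert(logits) for logits in srm_logits)


if __name__ == "__main__":
    # Toy gate scores for one task: 3 SRMs x 2 tokens x NUM_EXPERTS experts.
    toy_task = [
        [[2.0, 0.1, 0.0, 0.3], [1.5, 0.2, 0.1, 0.0]],
        [[0.0, 0.0, 3.0, 0.1], [0.1, 0.2, 2.5, 0.0]],
        [[0.0, 1.0, 0.0, 0.2], [0.3, 1.2, 0.0, 0.1]],
    ]
    print(expert_path(toy_task))
```

Comparing the tuples returned for different tasks is one way to check whether the tasks follow distinct routes, as the figure suggests.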

4 Conclusion

In this paper, we propose an all-in-one medical image restoration (AMIR) network that handles multiple MedIR tasks with a single universal model. To mitigate task interference, we introduce a task-adaptive routing strategy that dynamically routes different tasks to distinct network paths in the spatial and channel dimensions. Experiments demonstrate that AMIR achieves state-of-the-art performance in both single-task and all-in-one MedIR settings. In future work, we will evaluate the effectiveness of AMIR as more MedIR tasks are involved.

5 Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grants 62371016, U23B2063, 62022010, and 62176267; the Beijing Natural Science Foundation Haidian District Joint Fund in China under Grant L222032; the Beijing Hope Run Special Fund of Cancer Foundation of China under Grant LC2018L02; the Fundamental Research Funds for the Central Universities of China from the State Key Laboratory of Software Development Environment at Beihang University; the 111 Project in China under Grant B13003; the SinoUnion Healthcare Inc. under the eHealth program; and the high-performance computing (HPC) resources at Beihang University.

References

  • [1] Chen, Y., Shi, F., Christodoulou, A.G., Xie, Y., Zhou, Z., Li, D.: Efficient and accurate MRI super-resolution using a generative adversarial network and 3D multi-level densely connected network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 91–99. Springer (2018)
  • [2] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5728–5739 (2022)
  • [3] Chen, H., Zhang, Y., Zhang, W., Liao, P., Li, K., Zhou, J., Wang, G.: Low-dose CT denoising with convolutional neural network. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). pp. 143–146. IEEE (2017)
  • [4] Chen, H., Zhang, Y., Kalra, M.K., Lin, F., Chen, Y., Liao, P., Zhou, J., Wang, G.: Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging 36(12), 2524–2535 (2017)
  • [5] Luthra, A., Sulakhe, H., Mittal, T., Iyer, A., Yadav, S.: Eformer: Edge enhancement based transformer for medical image denoising. arXiv preprint arXiv:2109.08044 (2021)
  • [6] Wang, D., Fan, F., Wu, Z., Liu, R., Wang, F., Yu, H.: CTformer: Convolution-free token2token dilated vision transformer for low-dose CT denoising. Physics in Medicine & Biology 68(6), 065012 (2023)
  • [7] Xiang, L., Qiao, Y., Nie, D., An, L., Lin, W., Wang, Q., Shen, D.: Deep auto-context convolutional neural networks for standard-dose PET image estimation from low-dose PET/MRI. Neurocomputing 267, 406–416 (2017)
  • [8] Chan, C., Zhou, J., Yang, L., Qi, W., Kolthammer, J., Asma, E.: Noise adaptive deep convolutional neural network for whole-body PET denoising. In: 2018 IEEE Nuclear Science Symposium and Medical Imaging Conference Proceedings (NSS/MIC). pp. 1–4. IEEE (2018)
  • [9] Luo, Y., Zhou, L., Zhan, B., Fei, Y., Zhou, J., Wang, Y., Shen, D.: Adaptive rectification based adversarial network with spectrum constraint for high-quality PET image synthesis. Medical Image Analysis 77, 102335 (2022)
  • [10] Jang, S.I., Pan, T., Li, Y., Heidari, P., Chen, J., Li, Q., Gong, K.: SPACH Transformer: Spatial and channel-wise transformer based on local and global self-attentions for PET image denoising. IEEE Transactions on Medical Imaging (2023)
  • [11] Zhou, Y., Yang, Z., Zhang, H., Chang, E.I.C., Fan, Y., Xu, Y.: 3D segmentation guided style-based generative adversarial networks for PET synthesis. IEEE Transactions on Medical Imaging 41(8), 2092–2104 (2022)
  • [12] Yang, Z., Zhou, Y., Zhang, H., Wei, B., Fan, Y., Xu, Y.: DRMC: A generalist model with dynamic routing for multi-center PET image synthesis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 36–46. Springer (2023)
  • [13] Zhu, J., Zhu, X., Wang, W., Wang, X., Li, H., Wang, X., Dai, J.: Uni-Perceiver-MoE: Learning sparse generalist models with conditional MoEs. Advances in Neural Information Processing Systems 35, 2664–2678 (2022)
  • [14] Li, B., Liu, X., Hu, P., Wu, Z., Lv, J., Peng, X.: All-in-one image restoration for unknown corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17452–17462 (2022)
  • [15] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: PromptIR: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
  • [16] Park, D., Lee, B.H., Chun, S.Y.: All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5815–5824. IEEE (2023)
  • [17] Kong, X., Dong, C., Zhang, L.: Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv preprint arXiv:2401.03379 (2024)
  • [18] Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33, 5824–5836 (2020)
  • [19] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738 (2020)
  • [20] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)
  • [21] LLC, M.: IXI dataset, https://brain-development.org/ixi-dataset/, accessed: 2024-01-15
  • [22] McCollough, C.H., Bartley, A.C., Carter, R.E., Chen, B., Drees, T.A., Edwards, P., Holmes III, D.R., Huang, A.E., Khan, F., Leng, S., et al.: Low-dose CT for the detection and classification of metastatic liver lesions: Results of the 2016 Low Dose CT Grand Challenge. Medical Physics 44(10), e339–e352 (2017)
  • [23] Hudson, H.M., Larkin, R.S.: Accelerated image reconstruction using ordered subsets of projection data. IEEE Transactions on Medical Imaging 13(4), 601–609 (1994)
  • [24] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2), 295–307 (2015)
  • [25] Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1646–1654 (2016)
  • [26] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: SwinIR: Image restoration using Swin Transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1833–1844 (2021)