Rethinking Attention Gated with Hybrid Dual Pyramid Transformer-CNN for Generalized Segmentation in Medical Imaging

Fares BOUGOURZI
Junia, UMR 8520, CNRS, Centrale Lille, University of Polytechnique Hauts-de-France, 59000 Lille, France
[email protected]; [email protected]

Fadi DORNAIKA
University of the Basque Country UPV/EHU,
San Sebastian, SPAIN; IKERBASQUE, Basque
Foundation for Science, Bilbao, SPAIN
[email protected]

Abdelmalik Taleb-Ahmed
Université Polytechnique Hauts-de-France, Université de Lille,
CNRS, Valenciennes, 59313, Hauts-de-France, France
[email protected]

Vinh Truong Hoang
Ho Chi Minh City Open University, Viet Nam,
[email protected]

Abstract

Inspired by the success of Transformers in computer vision, Transformers have been widely investigated for medical imaging segmentation. However, most of these architectures use recent Transformer designs either as the encoder or as an encoder running in parallel with a CNN encoder. In this paper, we introduce a novel hybrid CNN-Transformer segmentation architecture (PAG-TransYnet) designed to efficiently build a strong CNN-Transformer encoder. Our approach exploits attention gates within a Dual Pyramid hybrid encoder. The contributions of this methodology can be summarized in three key aspects: (i) the utilization of a pyramid input for highlighting the prominent features at different scales, (ii) the incorporation of a PVT transformer to capture long-range dependencies across various resolutions, and (iii) the implementation of a Dual-Attention Gate mechanism for effectively fusing prominent features from both CNN and Transformer branches. We conduct a comprehensive evaluation across different segmentation tasks, including abdominal multi-organ segmentation, infection segmentation (Covid-19 and Bone Metastasis), and microscopic tissue segmentation (Gland and Nucleus). The proposed approach demonstrates state-of-the-art performance and exhibits remarkable generalization capabilities. This research represents a significant advancement towards addressing the pressing need for efficient and adaptable segmentation solutions in medical imaging applications.

Keywords Transformer $\cdot$ Convolutional Neural Network $\cdot$ Deep Learning $\cdot$ Medical Imaging $\cdot$ Segmentation $\cdot$ Unet $\cdot$ Synapse $\cdot$ (Gland and Nucleus) $\cdot$ Covid-19.

1 Introduction

Medical imaging segmentation plays a crucial role in diagnosing, assessing severity, and monitoring progress in various medical conditions [1]. Despite significant advancements in utilizing machine learning for medical imaging segmentation, several challenges persist in developing efficient segmentation approaches. These challenges include the limited availability of labeled data, since annotation is a laborious and error-prone task [1, 2]. The ultimate goal remains to devise a generalized approach for different medical segmentation tasks. However, achieving efficiency across various medical imaging segmentation tasks remains challenging due to the high variability among diseases, ranging from single-class to multi-class problems, and from disease to organ segmentation. Consequently, many approaches are tailored to specific tasks, limiting their applicability to other tasks.

In the last decade, Convolutional Neural Networks (CNNs) have emerged as the primary approach for medical imaging segmentation [1, 2, 3, 4]. However, CNNs are predominantly adept at extracting local features, thereby overlooking long-range dependencies, which are crucial for modeling global contextual features. Transformers have demonstrated high capability in encoding long-range dependencies, leading to their integration into segmentation architectures either as pure architectures or hybrid ones combined with CNNs [1, 5, 3, 4, 6]. However, existing architectures often utilize transformers as single or parallel encoders alongside CNN encoders [4, 6, 7, 8, 9, 10, 11], indicating limitations in efficiently combining transformer and CNN features.

To address this, we propose revisiting attention gates to build a stronger encoder, introducing our Dual-Attention Gate. Conventional attention gates were originally designed to select prominent encoder features during decoding [12]. In contrast, our Dual-Attention Gate operates inside the encoder: it uses the CNN features of an input pyramid to select prominent features in the main CNN feature path, and uses the main CNN feature path to select prominent features from the Transformer branch. This results in a more compact main path.

The paper introduces a novel approach called PAG-TransYnet, which combines Transformer and CNN architectures using Dual-Attention Gates. These gates aim to extract significant feature regions and merge features from both CNN and Transformer models. The encoder structure of PAG-TransYnet consists of three branches. The first branch undergoes contraction through four pyramid levels using convolutional blocks, producing features that act as a gating signal for highlighting prominent features in the second branch. The second branch, termed the main branch, focuses on extracting features from the input data. Simultaneously, the features from the main branch are used to highlight important features in the third branch, which utilizes Transformer architecture. The attention features from both branches are concatenated to form the new main branch features for the subsequent level. Overall, the proposed approach aims to capture both local and global features through attention mechanisms, resulting in a comprehensive representation of the input data.

In summary, the main contributions of this work are:

• Introduction of a novel hybrid architecture for medical imaging segmentation, which seamlessly integrates CNN, Transformers, and a fusion branch encoder.
• Enhancement of the Att-Unet attention gate through our proposed Dual-Attention Gate. This refinement involves redesigning its structure, repositioning it within the encoder, and optimizing its functionality within the fusion objective.
• Demonstration of the remarkable capability of our approach to achieve state-of-the-art performance across a diverse range of medical imaging segmentation tasks, including organ scan segmentation, infection detection, and microscopic tissue segmentation (Fig. 1 shows examples of the considered segmentation tasks).
• The code for PAG-TransYnet is made publicly available at https://github.com/faresbougourzi/PAGTransYnet.

This paper is organised as follows: Section 2 reviews the related works. Section 3 describes the proposed approach. Section 4 presents the datasets and tasks. Section 5 reports and analyzes the obtained results. Finally, Section 6 concludes this paper.

2 Related Works

In recent years, Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in medical image segmentation, particularly following the introduction of the U-Net architecture by Ronneberger et al. in 2015 [13]. Since then, numerous variants such as Attention U-Net (Att-UNet) [12], U-Net++ [14], and ResU-Net [15] have emerged, each aiming to enhance segmentation performance. The U-Net architecture, characterized by an encoder-decoder structure with skip connections, has proven effective in preserving fine-grained details through feature concatenation. Attention mechanisms have also been widely investigated for medical imaging segmentation. One of the best-known is the Attention Gate (AG), proposed by Oktay et al. in 2018 [12], which integrates attention into U-Net after the skip connection, producing a variant known as Att-UNet. The main objective of Att-UNet is to highlight salient regions in encoder features using decoder features. However, the efficacy of attention gates can vary, prompting the introduction of our approach: the Dual-Attention Gate, integrated into the encoding phase, leveraging Pyramid features, CNN features, and Transformer features to enhance feature extraction and emphasize prominent regions. Despite the great success of CNNs in medical imaging segmentation, their main shortcoming lies in their weakness in capturing long-range dependencies, as CNNs are primarily focused on extracting local features [1, 7, 8, 16]. In contrast, Transformers, renowned for their ability to capture long-range dependencies in sequences, have shown promising performance in medical imaging tasks, including classification, detection, and segmentation [1, 9, 10, 17, 11]. In segmentation, both 2D and 3D transformer-based approaches, such as Fat-Net and U-Transformer, have shown promising performance by fusing CNN and Transformer components to enhance segmentation accuracy [9, 18, 7].

The integration of CNN and Transformer blocks into single architectures has been a focal point, particularly in the encoding phase [7, 8, 9, 10, 17, 11]. Various encoder configurations have been proposed, including solely Transformer-based encoders [7, 8], parallel CNN and Transformer encoders with subsequent fusion [9, 10], and CNN encoders followed by Transformer blocks [17, 11]. However, many existing approaches lack robust connectivity between Transformer and CNN features, indicating a gap in feature integration. To address this, our approach introduces a novel encoder architecture incorporating Pyramid features, CNN features, Transformer features, and Dual-Attention Gates, aiming to significantly enhance feature fusion and improve segmentation performance.

Figure 1: Examples of medical imaging segmentation. The first, second, and third rows show the input image, the ground truth, and the prediction of our approach, respectively. The first, second, third, fourth, and fifth columns depict abdominal multi-organ, Covid-19, Bone Metastasis, Gland, and Nucleus segmentation, respectively.

3 Proposed Approach

Our proposed Pyramid Dual-Attention Gate Transformer-Ynet (PAG-TransYnet) has three encoder branches and a Unet-like decoder as shown in Figure 2. The detailed architecture is illustrated in Figure 3.

As shown in Figure 2, the encoder of our proposed architecture consists of four main components: (i) a Pyramid Vision Transformer (PVT-v2), (ii) a pyramid representing the input image with four levels, each level followed by convolution blocks ( $P_{1}ConvB$ , $P_{2}ConvB$ , $P_{3}ConvB$ and $P_{4}ConvB$ ), (iii) a main encoder path that merges the PVT features and the main encoder features using Dual-Attention Gates, and (iv) a classic Transformer (Base ViT) serving as the final stage of encoding.

3.1 Pyramid Encoder

The pyramid encoder branch provides convolutional features at four levels of the input image pyramid, which are subsequently utilized in the spatial gate attention mechanism. The input image ( $I$ ) is resized separately for each level, yielding four pyramid inputs ( $P_{1}$ , $P_{2}$ , $P_{3}$ , and $P_{4}$ ). These inputs are converted into pyramid feature maps ( $P_{f_{1}}$ , $P_{f_{2}}$ , $P_{f_{3}}$ , and $P_{f_{4}}$ ) by pyramid convolutional blocks (PConvB), which consist of double convolutional blocks (DConvB). Notably, the first pyramid level contains one DConvB, whereas the fourth level incorporates a cascade of four DConvBs, as shown in Figure 4. Additionally, as depicted in Figure 4.a, the DConvB comprises two $3\times 3$ convolutional blocks and a residual skip connection that uses a $1\times 1$ convolutional kernel to match the input number of channels $C_{in}$ to $C_{out}$ . The output of the two $3\times 3$ convolutions is summed with the features of the skip connection.
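
As a notational sketch only, the residual block described above can be written compactly as follows. The operator names are shorthand introduced here for illustration. Normalization and activation layers inside each convolutional block are not specified in the text and are left implicit.

$$\mathrm{DConvB}(x) = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{3\times 3}(x)\big) + \mathrm{Conv}_{1\times 1}(x), \qquad x \text{ with } C_{in} \text{ channels},\ \mathrm{DConvB}(x) \text{ with } C_{out} \text{ channels}$$

$$P_{f_{i}} = \mathrm{PConvB}_{i}(P_{i}),\quad i = 1,\dots,4, \qquad \mathrm{PConvB}_{1} = \mathrm{DConvB}, \qquad \mathrm{PConvB}_{4} = \underbrace{\mathrm{DConvB}\circ\cdots\circ\mathrm{DConvB}}_{4\ \text{blocks}}$$

The cascade depth at levels 2 and 3 is given in Figure 4 and is not restated here.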

These pyramid feature maps play a crucial role in maintaining spatial attention awareness across all main encoder layers. They serve as gating signals for the main encoder path, facilitating the integration of spatial attention information throughout the encoding process.

3.2 Main Encoder: Attention Fusion

As shown in Figure 3, the input image is fed to both the Transformer and the main encoder branch. For the Transformer branch, we utilize PVT-v2-Li [19], which is built around a progressive shrinking pyramid and spatial-reduction attention. This makes the PVT flexible for learning multi-scale and high-level features, similar to the CNN encoder design. The main branch starts with a double convolution module, as depicted in Figure 4.a. From this point, it hierarchically merges the current features of the main branch and the Transformer features using a dual-gate attention mechanism (explained in the next section). The first level of the main branch is fused with the first-stage Transformer features through the proposed Dual-Attention Gate. This attention fusion process is performed in the main branch for four levels. At each level, the corresponding features of the Transformer stage, the pyramid level, and the previous main branch level are combined.

Upon the completion of the fourth fusion, the resulting features ( $x_{4}$ ) are concatenated with the features from the previous level ( $x_{3}$ ) in the main encoder. These are then fed into a classic ViT (ViT Base) with a spatial resolution of $14\times 14$ , corresponding to 196 tokens. The output features from the ViT ( $x_{6}$ ) are subsequently passed through a dual convolution module. The resulting features ( $x_{7}$ ) are then forwarded to the decoder. Finally, the decoder of the proposed PAG-TransYnet consists of four stages and follows a conventional architecture, with skips provided by two levels of features from the main encoder.

3.3 Dual-Attention Gate

The Dual-Attention Gates provide the fusion mechanism between the Transformer features at four different stages and the features of the main encoder branch, with attention guided by the pyramid-level convolutional features. As depicted in Figure 5, this module has three inputs: the Transformer features, the previous main branch level features, and the features associated with the corresponding pyramid level. The module consists of two classical Attention Gates (AG). The first AG takes the Transformer features as the signal and the current main-branch features as the gate, while the second AG takes the current main-branch features as the signal and the pyramid-level features as the gate. The outputs of both AGs are then concatenated to form the signal in the main encoder, which is used in the skip connections to the decoder levels. Each module incorporates max pooling and upsampling to match the spatial resolution of all three input features.
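
As a rough illustration of the gating described above, the following pure-Python sketch applies an additive attention gate in the style of [12] at a single spatial position and concatenates two gated outputs. The helper names, the weight matrices `w_x` and `w_g`, the vector `psi`, and the pairing of inputs (Transformer features gated by main-branch features, main-branch features gated by pyramid features) are assumptions made for illustration. Biases, normalization, and the max pooling and upsampling used to match resolutions are omitted, so this should not be read as the released implementation.

```python
import math


def linear(weights, vec):
    """Matrix-vector product: what a 1x1 convolution computes at one pixel."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate at one spatial position (hypothetical helper).

    x: features to be gated; g: gating features.
    Returns (alpha * x, alpha), where alpha is a scalar in (0, 1).
    """
    # Project both inputs to a shared intermediate space, add, apply ReLU.
    hidden = [max(0.0, a + b) for a, b in zip(linear(w_x, x), linear(w_g, g))]
    # Collapse to a single attention logit, then squash with a sigmoid.
    logit = sum(p * h for p, h in zip(psi, hidden))
    alpha = 1.0 / (1.0 + math.exp(-logit))
    return [alpha * v for v in x], alpha


def dual_attention_gate(t, x, p, params_t, params_x):
    """Concatenate two gated outputs along the channel axis (assumed pairing)."""
    gated_t, _ = attention_gate(t, x, *params_t)  # Transformer gated by main
    gated_x, _ = attention_gate(x, p, *params_x)  # main gated by pyramid
    return gated_t + gated_x  # list concatenation acts as channel concatenation


if __name__ == "__main__":
    t, x, p = [1.0, 2.0], [0.5, -0.5], [2.0, 0.0]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    params = (identity, identity, [1.0, 1.0])
    print(dual_attention_gate(t, x, p, params, params))
```

Because the attention coefficient is a scalar per position, each gate rescales its input features without changing their channel count, so the concatenated output has as many channels as the two gated inputs combined.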

In summary, the main encoder branch receives the output of the convolution block applied to the input image of resolution $H\times W$ , and then four Dual-Attention Gates are utilized to obtain the encoded features, which are passed through a DConvB to extract the next-level features of the main branch. The Dual-Attention Gate and the DConvB are used to fuse and then extract higher-level features, respectively, constructing a strong encoder for medical imaging segmentation.

4 Datasets and tasks

For abdominal organ segmentation, we utilized the Synapse multi-organ segmentation dataset, which has emerged as a benchmark dataset for evaluating the performance of medical imaging segmentation approaches in recent years. Following the precedent set by many state-of-the-art works, we adopted the training and validation splits introduced in the TransUnet paper [4]. In summary, the Synapse dataset consists of 30 abdominal CT scans first introduced in the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge, with pixel-level annotations of 8 abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach).

For infection segmentation tasks, we focused on the multi-class segmentation of Covid-19, specifically Ground Glass Opacity (GGO) and Consolidation, along with Bone Metastasis (BM) segmentation. These tasks present significant challenges due to the variability in infection shape, position, intensity, and type. For Covid-19 segmentation, we followed the methodology outlined in [2, 20] and utilized two datasets from [21]. In total, we used 879 slices for training and 50 slices for testing. Among the training slices, 345 and 272 slices contained GGO and Consolidation infection types, respectively. The remaining slices without infection were included to enable the models to learn more features about healthy tissues. For BM segmentation, we utilized the BM-Seg dataset [22], employing the same data splits as in [22]. The BM-Seg dataset comprises 23 CT-scans, each covering one of multiple organs depending on the spread and primary cancer (e.g., lung, breast). In total, the dataset contains 1517 slices, and we employed a five-fold cross-validation strategy to evaluate the performance of the segmentation models.

For Gland and Nucleus segmentation tasks, we utilized two distinct datasets: the Gland segmentation dataset (GlaS) [23] and the MoNuSeg dataset [24], respectively. The GlaS dataset comprises 165 images specifically designed for gland segmentation tasks. The MoNuSeg dataset consists of 44 images tailored for nucleus segmentation tasks. Following the evaluation scheme proposed in [5], we conducted three iterations of five-fold cross-validation for each task. This approach ensures robust evaluation by splitting the dataset into five subsets, using each subset as a validation set once while training on the remaining four subsets. The reported results are the mean and standard deviation over the three runs, where each run result corresponds to the five-fold cross-validation results.
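
The repeated cross-validation protocol above can be sketched as follows. This is a minimal illustration: the function names, the shuffling scheme, the use of the run index as a seed, and the choice of the sample standard deviation across runs are assumptions, and `evaluate` is a placeholder for training a model on the training indices and scoring it on the validation indices.

```python
import random
import statistics


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, val_indices) pairs; each sample validates once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def repeated_cross_validation(n_samples, evaluate, n_runs=3, n_folds=5):
    """Return (mean, std) over runs; each run score is the mean of its folds."""
    run_scores = []
    for run in range(n_runs):
        fold_scores = [
            evaluate(train, val)
            for train, val in k_fold_splits(n_samples, n_folds, seed=run)
        ]
        run_scores.append(statistics.mean(fold_scores))
    # Sample standard deviation across runs (an assumption of this sketch).
    return statistics.mean(run_scores), statistics.stdev(run_scores)


if __name__ == "__main__":
    # Placeholder scorer standing in for "train on `train`, score on `val`".
    def dummy_evaluate(train, val):
        return len(val) / (len(train) + len(val))

    print(repeated_cross_validation(100, dummy_evaluate))
```

Reshuffling with a different seed per run means the three runs use different fold assignments, which is one plausible way to obtain a non-trivial standard deviation across runs.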

Table 1: Comparison on Abdominal Multi-Organs Segmentation. DSC and HD95 are the average dice score and 95% Hausdorff distance of the 8 classes, respectively. The fourth column to the last show the Dice-score (DSC) for each class.

Architecture	Average DSC $\uparrow$	Average HD95 $\downarrow$	Aorta	Gallbladder	Kidney (L)	Kidney (R)	Liver	Pancreas	Spleen	Stomach
Unet [12]	74.68	36.87	84.18	62.84	79.19	71.29	93.35	48.23	84.41	73.92
Att-Unet [12]	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95
V-Net [12]	68.81	-	75.34	51.87	77.10	80.75	87.84	40.05	80.56	56.98
TransUnet [4]	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
MTUnet [11]	78.59	26.59	87.92	64.99	81.47	77.29	93.06	59.46	87.75	76.81
UCTransNet [5]	78.23	26.75	-	-	-	-	-	-	-	-
TransClaw U-Net [5]	78.09	26.38	85.87	61.38	84.83	79.36	94.28	57.65	87.74	73.55
ST-Unet [25]	78.86	20.37	85.68	69.05	85.81	73.04	95.13	60.23	89.15	72.78
Swin-Unet [26]	77.58	27.32	81.76	65.95	82.32	79.22	93.73	53.81	88.04	75.79
VM-UNet [27]	81.08	19.21	86.40	69.41	86.16	82.76	94.17	58.80	89.51	81.40
TransCeption [28]	82.24	20.89	87.60	71.82	86.23	80.29	95.01	65.27	91.68	80.02
Ours	83.43	15.82	89.67	68.89	86.74	84.88	95.87	68.75	92.01	80.66

5 Experiments and Results

5.1 Experimental Setup

For our experiments, we mainly used the PyTorch [29] library for deep learning. Each architecture is trained for 100 epochs with an initial learning rate of 0.1 and the Adam optimizer. The batch size is set to 16 images. The machine used has an NVIDIA RTX A5000 GPU with 24 GB of memory, an 11th Gen Intel(R) Core(TM) i9-11900KF (3.50GHz) CPU, and 64 GB of RAM. Three types of active data augmentation are used: random rotation with an angle between $-35^{\circ}$ and $35^{\circ}$ with a probability of 10%, and random horizontal and vertical flipping with a probability of 20% each.
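
The augmentation policy above can be sketched in plain Python as follows. This is an illustrative sketch, not the training code: the function names and the dictionary layout are hypothetical, images and masks are represented as nested lists, and the rotation itself is left to an imaging library (only its angle is sampled here).

```python
import random


def sample_augmentation(rng):
    """Draw one augmentation plan with the probabilities stated in the text."""
    # Rotation: applied with probability 0.10, angle uniform in [-35, 35] degrees.
    angle = rng.uniform(-35.0, 35.0) if rng.random() < 0.10 else None
    # Horizontal and vertical flips: independent, probability 0.20 each.
    hflip = rng.random() < 0.20
    vflip = rng.random() < 0.20
    return {"angle": angle, "hflip": hflip, "vflip": vflip}


def apply_flips(image, mask, plan):
    """Apply the sampled flips identically to an image and its mask.

    Both inputs are 2D lists (rows of pixels). Rotation by plan["angle"] is
    not implemented in this sketch.
    """
    if plan["hflip"]:
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    if plan["vflip"]:
        image = image[::-1]
        mask = mask[::-1]
    return image, mask


if __name__ == "__main__":
    rng = random.Random(0)
    plan = sample_augmentation(rng)
    image = [[1, 2], [3, 4]]
    mask = [[0, 1], [0, 0]]
    print(plan, apply_flips(image, mask, plan))
```

Applying the same plan to the image and its mask keeps the annotation aligned with the pixels, which is the main design constraint for segmentation augmentation.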

5.2 Results

Tables 1, 2, 3, and 4 summarize the comparison with the state-of-the-art architectures on the Synapse, BM-Seg, Covid-19, and GlaS and MoNuSeg datasets, respectively. These results show the superiority of our approach over the state-of-the-art architectures.

For the Synapse dataset results (Table 1), we selected comparison approaches that followed the same evaluation splits as [4]. Comparing with the TransUnet architecture, considered as the baseline for the Synapse dataset, our architecture demonstrated superior performance with improvements of 5.95% and 15.87 for Dice-Score and HD95, respectively. This indicates the efficacy of our approach in leveraging both Transformer and CNN features through the proposed Dual-Attention Gate.

Furthermore, our architecture surpassed the compared state-of-the-art methods in both average Dice-Score and HD95, and achieved the highest Dice-Score in six of the eight segmented classes. The two exceptions are the Gallbladder class, where TransCeption [28] performs best (71.82 vs. 68.89), and the Stomach class, where VM-UNet [27] performs best (81.40 vs. 80.66).
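
The per-class comparison can be cross-checked directly against Table 1. The short script below copies the per-class Dice scores from the table (UCTransNet is omitted because no per-class values are listed) and reports the best method for each class. The variable and function names are illustrative.

```python
# Per-class DSC values copied from Table 1.
CLASSES = ["Aorta", "Gallbladder", "Kidney (L)", "Kidney (R)",
           "Liver", "Pancreas", "Spleen", "Stomach"]
TABLE_1 = {
    "Unet": [84.18, 62.84, 79.19, 71.29, 93.35, 48.23, 84.41, 73.92],
    "Att-Unet": [55.92, 63.91, 79.20, 72.71, 93.56, 49.37, 87.19, 74.95],
    "V-Net": [75.34, 51.87, 77.10, 80.75, 87.84, 40.05, 80.56, 56.98],
    "TransUnet": [87.23, 63.13, 81.87, 77.02, 94.08, 55.86, 85.08, 75.62],
    "MTUnet": [87.92, 64.99, 81.47, 77.29, 93.06, 59.46, 87.75, 76.81],
    "TransClaw U-Net": [85.87, 61.38, 84.83, 79.36, 94.28, 57.65, 87.74, 73.55],
    "ST-Unet": [85.68, 69.05, 85.81, 73.04, 95.13, 60.23, 89.15, 72.78],
    "Swin-Unet": [81.76, 65.95, 82.32, 79.22, 93.73, 53.81, 88.04, 75.79],
    "VM-UNet": [86.40, 69.41, 86.16, 82.76, 94.17, 58.80, 89.51, 81.40],
    "TransCeption": [87.60, 71.82, 86.23, 80.29, 95.01, 65.27, 91.68, 80.02],
    "Ours": [89.67, 68.89, 86.74, 84.88, 95.87, 68.75, 92.01, 80.66],
}


def best_per_class(table, classes):
    """Map each class name to the (method, score) pair with the highest DSC."""
    best = {}
    for i, name in enumerate(classes):
        candidates = [(method, scores[i]) for method, scores in table.items()]
        best[name] = max(candidates, key=lambda pair: pair[1])
    return best


if __name__ == "__main__":
    for name, (method, score) in best_per_class(TABLE_1, CLASSES).items():
        print(f"{name}: {method} ({score})")
    mean_ours = sum(TABLE_1["Ours"]) / len(CLASSES)
    print(f"Mean DSC of 'Ours' over the 8 classes: {mean_ours:.2f}")
```

The last line also confirms that the average DSC column is the unweighted mean of the eight per-class scores.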

Table 2: Comparison on Bone Metastasis Segmentation. F1-S, DSC and IoU are F1-score, Dice-Score and Intersection over Union, respectively.

Model	F1-S $\uparrow$	DSC $\uparrow$	IoU $\uparrow$
U-Net [22]	79.46	72.26	65.93
AttUnet [22]	79.41	71.76	65.86
Unet++ [22]	79.74	71.99	66.31
AttUnet++ [22]	80.28	72.36	67.06
SwinUnet [30]	61.09	39.17	44.01
MTUnet [11]	58.59	44.30	41.45
MISSFormer [31]	81.44	70.42	68.73
UCTransNet [5]	83.62	73.88	71.85
Hybrid-AttUnet++ [22]	82.27	75.70	69.89
EDAUnet++ [22]	83.67	77.05	71.92
Ours	85.01	79.70	73.92

In the BM segmentation comparison, we present the results obtained by comparing our method with the competing approaches outlined in the dataset paper [22] and four recent transformer-based architectures: SwinUnet [30], MTUnet [11], MISSFormer [31], and UCTransNet [5]. Our proposed approach showcased superiority over these architectures (see Table 2).
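
For reference, the overlap metrics reported in Table 2 can be computed for a single pair of binary masks as in the sketch below. The function name and the convention for two empty masks are assumptions. How scores are aggregated over slices and folds is not specified in this section and is outside the scope of the sketch.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated as a perfect match (a convention, assumed here).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(dice_and_iou(prediction, ground_truth))
```

For a single mask pair the two metrics are tied by IoU = DSC / (2 - DSC), so they rank individual predictions identically. They can still diverge once averaged over many images.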

Moreover, the lower performance of Transformer-based architectures, such as SwinUnet and MTUnet, raises concerns about their ability to generalize across different tasks, especially for infection segmentation tasks. Infection segmentation tasks involve high variability in shape, type, position, and intensity of infections, which may cover only a small portion compared to the background.

In contrast, our approach exhibits a high ability to segment infection regions due to the rich features extracted and combined during the encoding phase. Additionally, the proposed Dual-Attention Gate effectively highlights prominent parts through multi-scale feature maps, making it well-suited for detecting infection regions.

Table 3 provides a comprehensive summary of the results achieved by our proposed approach and its comparison with three baseline CNN architectures (U-Net, Att-Unet, and Unet++), four state-of-the-art approaches for Covid-19 segmentation (CopleNet [32], AnamNet [33], SCOATNET [34], and EMB-TrAttUnet [2]), and four recent Transformer-based medical imaging segmentation approaches (SwinUnet [30], MTUnet [11], MISSFormer [31], and UCTransNet [5]).

Our analysis revealed that Transformer-based approaches exhibit limited generalization ability, achieving performance close to that of baseline CNN architectures. Additionally, a significant performance gap was observed between the segmentation of the two classes, primarily due to the minor presence of Consolidation compared to GGO, both in appearance and distribution within the lung. Remarkably, our proposed approach achieved the best performance, effectively reducing the gap in segmenting both classes compared to the comparison approaches. This highlights our method’s exceptional capability to accurately highlight infection regions throughout all encoding blocks, leveraging the proposed Dual-Attention Gates.

Table 3: Comparison on Multi-classes Covid-19 Segmentation. F1-S, DSC and HD95 are F1-score, Dice-Score and 95% Hausdorff distance, respectively. GGO and Con are the two types of Covid-19 infection known as Ground-Glass Opacity and Consolidation.

Architecture	Average F1-S $\uparrow$	Average DSC $\uparrow$	Average HD95 $\downarrow$	F1-S (GGO)	F1-S (Con)	DSC (GGO)	DSC (Con)
U-Net [12]	48.58	32.79	35.69	65.81 $\pm$ 1.26	31.35 $\pm$ 12.96	50.13 $\pm$ 1.31	15.45 $\pm$ 5.66
Att-Unet [12]	51.92	34.85	35.84	64.81 $\pm$ 1.89	39.04 $\pm$ 6.81	50.44 $\pm$ 1.35	19.26 $\pm$ 3.55
Unet++ [14]	48.51	41.48	44.06	65.69 $\pm$ 1.29	31.31 $\pm$ 6.67	51.65 $\pm$ 4.12	31.31 $\pm$ 6.67
CopleNet [32]	45.07	31.355	39.04	60.44 $\pm$ 1.54	29.70 $\pm$ 10.29	46.25 $\pm$ 3.13	16.46 $\pm$ 4.76
AnamNet [33]	48.53	34.875	34.78	65.10 $\pm$ 3.56	31.97 $\pm$ 6.12	51.69 $\pm$ 4.8	18.06 $\pm$ 4.61
SCOATNET [34]	54.64	37.06	30.99	65.77 $\pm$ 3.28	43.52 $\pm$ 1.67	50.80 $\pm$ 4.63	23.32 $\pm$ 2.07
SwinUnet [30]	47.47	31.11	39.42	62.74 $\pm$ 2.63	32.2 $\pm$ 6.68	42.46 $\pm$ 2.61	19.77 $\pm$ 3.87
MTUnet [11]	42.30	30.60	37.50	57.83 $\pm$ 2.57	26.78 $\pm$ 7.39	42.97 $\pm$ 2.78	18.24 $\pm$ 4.56
MISSFormer [31]	56.70	39.79	42.08	65.66 $\pm$ 3.06	47.75 $\pm$ 4.77	51.57 $\pm$ 4.01	28.02 $\pm$ 2.72
UCTransNet [5]	58.33	41.41	34.67	67.46 $\pm$ 2.97	49.21 $\pm$ 4.27	53.42 $\pm$ 4.24	29.41 $\pm$ 3.48
EMB-TrAttUnet [2]	65.16	48.18	27.47	70.06 $\pm$ 0.03	60.26 $\pm$ 0.92	59.14 $\pm$ 0.87	37.23 $\pm$ 0.97
Ours	68.71	51.03	24.22	73.12 $\pm$ 0.37	64.30 $\pm$ 0.90	60.38 $\pm$ 0.94	41.68 $\pm$ 0.98

Table 4: Comparison on Glas and MoNuSeg Segmentation datasets.

Ex	Architecture	GlaS DSC	GlaS IoU	MoNuSeg DSC	MoNuSeg IoU
1	U-Net [5]	85.45 $\pm$ 1.3	74.78 $\pm$ 1.7	76.45 $\pm$ 2.6	62.86 $\pm$ 3.0
2	Unet++ [5]	87.56 $\pm$ 1.2	79.13 $\pm$ 1.7	77.01 $\pm$ 2.1	63.04 $\pm$ 2.5
3	AttUNet [5]	88.80 $\pm$ 1.1	80.69 $\pm$ 1.7	76.67 $\pm$ 1.1	63.47 $\pm$ 1.2
4	MRUNet [5]	88.73 $\pm$ 1.2	80.89 $\pm$ 1.7	78.22 $\pm$ 2.5	64.83 $\pm$ 2.9
5	TransUNet [5]	88.40 $\pm$ 0.7	80.40 $\pm$ 1.0	78.53 $\pm$ 1.1	65.05 $\pm$ 1.3
6	MedT [5]	85.92 $\pm$ 2.9	75.47 $\pm$ 3.5	77.46 $\pm$ 2.4	63.37 $\pm$ 3.1
7	Swin-Unet [5]	89.58 $\pm$ 0.6	82.06 $\pm$ 0.7	77.69 $\pm$ 0.9	63.77 $\pm$ 1.2
8	UCTransNet [5]	90.18 $\pm$ 0.7	82.96 $\pm$ 1.1	79.08 $\pm$ 0.7	65.50 $\pm$ 0.9
9	Ours	94.20 $\pm$ 0.55	89.29 $\pm$ 0.91	79.62 $\pm$ 0.7	66.31 $\pm$ 0.6

Table 5: Ablation study on the Synapse and Covid-19 datasets. The importance of the following elements is studied: CNN Pyramid path (Pyr), PVT path (PVT), and the ViT Transformer (ViT). Mean Dice-Score (DSC) and 95% Hausdorff distance (HD95) are used for both tasks, plus F1-Score (F1-S) for the Covid-19 task.

Architecture	Pyr	PVT	ViT	Synapse DSC $\uparrow$	Synapse HD95 $\downarrow$	Covid-19 F1-S $\uparrow$	Covid-19 DSC $\uparrow$	Covid-19 HD95 $\downarrow$
(1) No Pyramid Path	✗	✓	✓	82.32	21.45	67.84	51.07	23.23
(2) No PVT	✓	✗	✓	79.44	22.92	65.98	50.25	24.05
(3) No ViT	✓	✓	✗	82.39	17.67	68.92	51.69	21.53
(4) PAG-TransYnet	✓	✓	✓	83.43	15.82	68.71	51.03	24.22

Following the evaluation protocol and comparing the performance with the results obtained in [5], Table 4 presents a comprehensive comparison of our approach with the state-of-the-art methods for microscopic segmentation tasks, specifically Gland and Nucleus segmentation. From these results, it is evident that our proposed architecture outperforms the state-of-the-art methods, achieving the best performance in both Gland and Nucleus segmentation tasks. This further confirms the efficiency and versatility of our approach in various medical imaging segmentation tasks.

5.3 Ablation Study

The aim of this section is to investigate the significance of the proposed encoding elements within our approach. We examine the importance of the following components: CNN Pyramid path (Pyr), PVT path (PVT), and the ViT Transformer (ViT), considering multi-organ abdominal segmentation (Synapse) and infection segmentation (Covid-19). The results are summarized in Table 5. In the first ablation experiment, it is evident that the Pyramid path plays a crucial role in Synapse segmentation, as removing it leads to a decrease in performance by 1.11% and 5.63 for DSC and HD95, respectively. Conversely, the results for Covid-19 segmentation show stable performance despite removing the Pyramid path.

In the second ablation study, it becomes apparent that Transformer features are vital for both tasks. Removing the PVT path results in a significant decrease in performance on the Synapse dataset, with a degradation of about 4% and 7.1 for Dice-score and HD95, respectively. Similarly, for Covid-19 segmentation, the performance decreases by 2.73% and 0.78% for F1-score and Dice-score, respectively. Regarding the ViT block, the experiments demonstrate its importance in Synapse segmentation (removing it lowers the Dice-score from 83.43 to 82.39 and raises HD95 from 15.82 to 17.67), likely because Synapse is a more complex task with more classes than Covid-19 segmentation. In contrast, the relatively small size of the Covid-19 dataset makes it challenging to train the ViT (base variant), leading to potential overfitting. Even so, including the ViT causes only a minor decrease on Covid-19 (F1-score of 68.71 with the ViT vs. 68.92 without it). Overall, these findings underscore the contribution of the encoder components to the performance on Synapse and Covid-19 segmentation, with particular emphasis on the Transformer features in enhancing segmentation accuracy.
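
The differences quoted in this section follow directly from Table 5. The script below, with illustrative names, recomputes them as degradations relative to the full model: positive values mean the ablated variant is worse, taking into account that lower HD95 is better.

```python
# Values copied from Table 5, in the order given by METRICS.
METRICS = ["Synapse DSC", "Synapse HD95",
           "Covid-19 F1-S", "Covid-19 DSC", "Covid-19 HD95"]
LOWER_IS_BETTER = {"Synapse HD95", "Covid-19 HD95"}
FULL = [83.43, 15.82, 68.71, 51.03, 24.22]
ABLATIONS = {
    "No Pyramid Path": [82.32, 21.45, 67.84, 51.07, 23.23],
    "No PVT": [79.44, 22.92, 65.98, 50.25, 24.05],
    "No ViT": [82.39, 17.67, 68.92, 51.69, 21.53],
}


def degradation(full, variant):
    """Positive values mean the ablated variant is worse than the full model."""
    result = {}
    for name, full_value, variant_value in zip(METRICS, full, variant):
        if name in LOWER_IS_BETTER:
            delta = variant_value - full_value
        else:
            delta = full_value - variant_value
        result[name] = round(delta, 2)
    return result


if __name__ == "__main__":
    for label, values in ABLATIONS.items():
        print(label, degradation(FULL, values))
```

Negative entries in the output mark the cases where the ablated variant scores slightly better than the full model, which occurs on some Covid-19 metrics.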

6 Conclusion

In this paper, we introduce a novel hybrid architecture, termed PAG-TransYnet, designed for medical imaging segmentation. By seamlessly integrating Convolutional Neural Networks (CNNs), Transformers, and a fusion branch encoder, we aim to address the limitations of existing approaches and improve segmentation accuracy. Our key innovation lies in enhancing the Att-Unet attention gate with our proposed Dual-Attention Gate mechanism. This mechanism facilitates the extraction of prominent features from multiple encoder branches, thereby capturing both local and global contextual information more effectively.

Through comprehensive evaluation across various segmentation tasks, including abdominal multi-organs segmentation, infection detection (Covid-19 and Bone Metastasis), and microscopic tissue segmentation (Gland and Nucleus), our proposed approach demonstrates state-of-the-art performance and remarkable generalization capabilities. The utilization of the Dual-Attention Gate mechanism enables efficient fusion of features from different encoder branches, leading to enhanced segmentation accuracy and robustness across diverse medical imaging datasets.

The contributions of this work extend beyond the development of a novel segmentation architecture. We present a significant advancement towards addressing the pressing need for efficient and adaptable segmentation solutions in medical imaging applications. By seamlessly integrating CNNs and Transformers, our approach provides a versatile framework capable of handling the high variability among diseases and segmentation tasks. Furthermore, our methodology lays the foundation for future research endeavors aimed at advancing medical imaging segmentation techniques and facilitating clinical decision-making processes.

References

[1] Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, and Huazhu Fu, “Transformers in Medical Imaging: A Survey,” Jan. 2022, arXiv:2201.09873 [cs, eess].
[2] Fares Bougourzi, Fadi Dornaika, Amir Nakib, and Abdelmalik Taleb-Ahmed, “Emb-trattunet: a novel edge loss function and transformer-CNN architecture for multi-classes pneumonia infection segmentation in low annotation regimes,” Artificial Intelligence Review, vol. 57, no. 4, pp. 90, Mar. 2024.
[3] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah, “Transformers in vision: A survey,” ACM Computing Surveys (CSUR), 2021, Publisher: ACM New York, NY.
[4] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
[5] Haonan Wang, Peng Cao, Jiaqi Wang, and Osmar R Zaiane, “Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer,” in Proceedings of the AAAI conference on artificial intelligence, 2022, vol. 36, pp. 2441–2449.
[6] Wenxuan Wang, Chen Chen, Meng Ding, Hong Yu, Sen Zha, and Jiangyun Li, “TransBTS: Multimodal Brain Tumor Segmentation Using Transformer,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Marleen de Bruijne, Philippe C. Cattin, Stéphane Cotin, Nicolas Padoy, Stefanie Speidel, Yefeng Zheng, and Caroline Essert, Eds., Cham, 2021, pp. 109–119, Springer International Publishing.
[7] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, and Daguang Xu, “UNETR: Transformers for 3D Medical Image Segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 574–584.
[8] Zhiqin Zhu, Xianyu He, Guanqiu Qi, Yuanyuan Li, Baisen Cong, and Yu Liu, “Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal mri,” Information Fusion, vol. 91, pp. 376–387, 2023.
[9] Huisi Wu, Shihuai Chen, Guilian Chen, Wei Wang, Baiying Lei, and Zhenkun Wen, “FAT-Net: Feature adaptive transformers for automated skin lesion segmentation,” Medical Image Analysis, vol. 76, pp. 102327, Feb. 2022.
[10] Xianyu He, Guanqiu Qi, Zhiqin Zhu, Yuanyuan Li, Baisen Cong, and Litao Bai, “Medical image segmentation method based on multi-feature interaction and fusion over cloud computing,” Simulation Modelling Practice and Theory, vol. 126, pp. 102769, 2023.
[11] Hongyi Wang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xian-Hua Han, Yen-Wei Chen, and Ruofeng Tong, “Mixed transformer u-net for medical image segmentation,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 2390–2394.
[12] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, et al., “Attention U-Net: Learning Where to Look for the Pancreas,” arXiv preprint arXiv:1804.03999, May 2018.
[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Cham, 2015, pp. 234–241, Springer International Publishing.
[14] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang, “UNet++: A Nested U-Net Architecture for Medical Image Segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Danail Stoyanov, Zeike Taylor, Gustavo Carneiro, et al., Eds., Cham, 2018, pp. 3–11, Springer International Publishing.
[15] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang, “Road Extraction by Deep Residual U-Net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, May 2018.
[16] Fares Bougourzi, Cosimo Distante, Fadi Dornaika, and Abdelmalik Taleb-Ahmed, “Pdatt-unet: Pyramid dual-decoder attention unet for covid-19 infection segmentation from ct-scans,” Medical Image Analysis, vol. 86, pp. 102797, 2023.
[17] Wenxuan Wang, Chen Chen, Meng Ding, Hong Yu, Sen Zha, and Jiangyun Li, “Transbts: Multimodal brain tumor segmentation using transformer,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 109–119.
[18] Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler, “U-Net Transformer: Self and Cross Attention for Medical Image Segmentation,” in Machine Learning in Medical Imaging, Chunfeng Lian, Xiaohuan Cao, Islem Rekik, Xuanang Xu, and Pingkun Yan, Eds., Cham, 2021, pp. 267–276, Springer International Publishing.
[19] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao, “Pvt v2: Improved baselines with pyramid vision transformer,” Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022.
[20] Fares Bougourzi, Cosimo Distante, Fadi Dornaika, and Abdelmalik Taleb-Ahmed, “D-trattunet: dual-decoder transformer-based attention unet architecture for binary and multi-classes covid-19 infection segmentation,” arXiv preprint arXiv:2303.15576, 2023.
[21] RADIOLOGISTS, “COVID-19 CT-scans segmentation datasets, available at: http://medicalsegmentation.com/covid19/,” 2019, Last visited: 18-08-2021.
[22] Marwa Afnouch, Olfa Gaddour, Yosr Hentati, Fares Bougourzi, Mohamed Abid, Ihsen Alouani, and Abdelmalik Taleb Ahmed, “Bm-seg: A new bone metastases segmentation dataset and ensemble of cnn-based segmentation approach,” Expert Systems with Applications, vol. 228, pp. 120376, 2023.
[23] Korsuk Sirinukunwattana, Josien PW Pluim, Hao Chen, Xiaojuan Qi, Pheng-Ann Heng, Yun Bo Guo, Li Yang Wang, Bogdan J Matuszewski, Elia Bruni, Urko Sanchez, et al., “Gland segmentation in colon histology images the glas challenge contest,” Medical image analysis, vol. 35, pp. 489–502, 2017.
[24] Neeraj Kumar, Ruchika Verma, Deepak Anand, Yanning Zhou, Omer Fahri Onder, Efstratios Tsougenis, Hao Chen, Pheng-Ann Heng, Jiahui Li, Zhiqiang Hu, et al., “A multi-organ nucleus segmentation challenge,” IEEE transactions on medical imaging, vol. 39, no. 5, pp. 1380–1391, 2019.
[25] Jing Zhang, Qiuge Qin, Qi Ye, and Tong Ruan, “St-unet: Swin transformer boosted u-net with cross-layer feature enhancement for medical image segmentation,” Computers in Biology and Medicine, vol. 153, pp. 106516, 2023.
[26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[27] Jiacheng Ruan and Suncheng Xiang, “Vm-unet: Vision mamba unet for medical image segmentation,” arXiv preprint arXiv:2402.02491, 2024.
[28] Reza Azad, Yiwei Jia, Ehsan Khodapanah Aghdam, Julien Cohen-Adad, and Dorit Merhof, “Enhancing medical image segmentation with transception: a multi-scale feature fusion approach,” arXiv preprint arXiv:2301.10847, 2023.
[29] Adam Paszke, Sam Gross, Francisco Massa, et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in neural information processing systems, 2019, pp. 8026–8037.
[30] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in European conference on computer vision. Springer, 2022, pp. 205–218.
[31] Xiaohong Huang, Zhifang Deng, Dandan Li, Xueguang Yuan, and Ying Fu, “Missformer: An effective transformer for 2d medical image segmentation,” IEEE Transactions on Medical Imaging, 2022.
[32] Guotai Wang, ** Li, Zhiyong Xu, et al., “A Noise-Robust Framework for Automatic Segmentation of COVID-19 Pneumonia Lesions From CT Images,” IEEE Transactions on Medical Imaging, vol. 39, no. 8, pp. 2653–2663, Aug. 2020.
[33] Naveen Paluru, Aveen Dayal, Håvard Bjørke Jenssen, et al., “Anam-Net: Anamorphic Depth Embedding-Based Lightweight CNN for Segmentation of Anomalies in COVID-19 Chest CT Images,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 3, pp. 932–946, Mar. 2021.
[34] Shixuan Zhao, Zhidan Li, Yang Chen, et al., “Scoat-net: A novel network for segmenting covid-19 lung opacification from ct images,” Pattern Recognition, p. 108109, 2021.