RTA-Former: Reverse Transformer Attention for Polyp Segmentation

Zhikai Li,¹ Murong Yi,² Ali Uneri,² Sihan Niu,¹ and Craig Jones¹,³,⁴,∗

¹ Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
² Department of Biomedical Engineering, Johns Hopkins University, Baltimore MD, USA
³ Department of Radiology and Radiological Science, Johns Hopkins University, School of Medicine, Baltimore, USA
⁴ Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore MD, USA
∗ Corresponding author

Abstract

Polyp segmentation is a key aspect of colorectal cancer prevention, enabling early detection and guiding subsequent treatments. Intelligent diagnostic tools, including deep learning solutions, are widely explored to streamline and potentially automate this process. However, even with many powerful network architectures, producing accurate edge segmentation remains a problem. In this paper, we introduce a novel network, RTA-Former, that employs a transformer model as the encoder backbone and adapts Reverse Attention (RA) with a transformer stage in the decoder for enhanced edge segmentation. Experimental results show that RTA-Former achieves state-of-the-art (SOTA) performance on five polyp segmentation datasets. The strong capability of RTA-Former holds promise for improving the accuracy of Transformer-based polyp segmentation, potentially leading to better clinical decisions and patient outcomes. Our code is publicly available on GitHub.

I INTRODUCTION

Automatic polyp segmentation plays a pivotal role in colorectal cancer prevention by facilitating early detection and providing references for further intervention. However, even with the help of modern medical imaging techniques, qualitative interpretation often requires specialized expertise that leads to prolonged diagnosis times.

Polyp segmentation can be formulated as a pixel-wise classification task, to which many deep-learning approaches have been applied. One is the U-Net[1] segmentation network, which introduced a structure composed of an encoder and a decoder with skip connections in between. Subsequent convolutional neural network (CNN) solutions, such as UNet++[2], ResUNet++[3], and PraNet[4], inherited this structure. These networks all show promising results for medical image segmentation.

Recently, the Vision Transformer (ViT)[5] has shown extended capabilities. By applying self-attention over image patches, ViT relates disparate regions of the image and thereby captures global dependencies. Naturally, the ViT architecture was adapted for segmentation[6], such as in TransUNet[7]. These promising results inspired the Pyramid Vision Transformer (PVT)[8], a ViT variant that produces hierarchical features at lower computational cost. This in turn led many networks, such as Polyp-PVT[9], to adopt PVT as a replacement for the CNN backbone rather than as a complementary component.

Although deep learning has shown promising results in semantic segmentation, for polyp segmentation tasks where the polyp tissue and the background can be very similar, precisely localizing the edges is important and challenging. In this paper, we introduce RTA-Former, a novel architecture that uses PVT as an encoder and employs a Hierarchical Feature Synthesizer (HFS) composed of Reverse Transformer Attention (RTA) to improve its capability for edge segmentation. Our main contributions are threefold:

(1) Novel Network Architecture: We introduce the RTA-Former, which utilizes hierarchical features generated by the transformer encoder. By integrating the transformer structure into the reverse attention mechanism of the decoder, the network can focus on the difficult edge regions of the image.

(2) Flexible Backbone Network: Our proposed decoder architecture can be combined with PVT backbones of different sizes to meet different trade-offs between computation time and performance. This provides more structural options depending on the practical task.

(3) Performance Evaluation on Multiple Datasets: We evaluated RTA-Former on the polyp segmentation task on five datasets, assessing both its learning ability and its generalization. The results not only indicate that RTA-Former achieves state-of-the-art performance, but also demonstrate that the RTA-Former structure generalizes across datasets.

II METHOD

Figure 1: An overview of the RTA-Former architecture. The upper section showcases the overall architecture of the RTA-Former model, which is composed of an Encoder, a Hierarchical Feature Synthesizer, and a Decoder. The lower section offers an in-depth view of the internal structure of our Hierarchical Feature Synthesizer.

In this section, we first describe our feature fusion mechanism and then the backbone structure used for the encoder, followed by the hierarchical feature synthesizer module and its internal reverse transformer attention structure. Lastly, we describe the architecture and function of our decoder.

II-A Fast Feature Fusion Mechanism (FF)

Inspired by EfficientDet[10], our Fast Feature Fusion mechanism (FF) combines multi-scale features. Each feature is assigned a normalized, adaptive weight. The weighted sum of these features is passed through a Swish activation function, $f(x)=x\cdot\sigma(x)$, where $\sigma$ is the sigmoid function. For features $\{x_{1},x_{2},\ldots,x_{n}\}$, the fusion process is formalized as follows:

$$\mathrm{FF}=\mathrm{Swish}\left(\sum_{i=1}^{n}w_{i}\cdot x_{i}\right) \tag{1}$$

where $\mathrm{FF}$ is the fused feature and $\{w_{1},w_{2},\ldots,w_{n}\}$ are the normalized learnable weights.
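
The fusion in Eq. (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation. It assumes the weights are normalized as in EfficientDet's fast normalized fusion (a ReLU followed by division by the sum plus a small epsilon), which the text above does not spell out. The names `sigmoid`, `swish`, `fast_feature_fusion`, and `eps` are hypothetical, and features are flat lists of floats.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)

def fast_feature_fusion(features, raw_weights, eps=1e-4):
    """Fuse equally sized features with normalized, non-negative weights.

    Assumption (not stated in the paper): the weights are normalized as
    w_i = relu(r_i) / (eps + sum_j relu(r_j)), as in EfficientDet.
    """
    if len(features) != len(raw_weights):
        raise ValueError("one weight per feature is required")
    positive = [max(r, 0.0) for r in raw_weights]
    total = eps + sum(positive)
    weights = [p / total for p in positive]
    fused = []
    for values in zip(*features):  # iterate over positions across all features
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        fused.append(swish(weighted_sum))
    return fused

# Two features with equal raw weights: each contributes about one half.
print(fast_feature_fusion([[1.0, -2.0], [3.0, 0.0]], [1.0, 1.0]))
```

Because the normalization uses only a ReLU and a division, it avoids the exponentials of a softmax while still keeping every weight between 0 and 1.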

II-B Transformer Encoder

We adopt a transformer encoder as the model’s backbone to provide generality and multi-scale feature handling for polyp segmentation. Specifically, we use a pre-trained PVTv2. PVTv2 employs convolutional layers to replace the traditional patch embedding modules of a transformer, which allows spatial information and features to be captured continuously at various levels. To generate the final features for input to the Hierarchical Feature Synthesizer module, the output features are passed through a $3\times 3$ convolutional layer for channel dimension upscaling.
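
To make the multi-scale output concrete, the sketch below traces the spatial size of the four encoder stages for a 352×352 input. It assumes the standard PVTv2 stage strides of 4, 8, 16, and 32, which are not stated in this section. The helper name `pyramid_shapes` is hypothetical, and channel widths, which differ by backbone size, are left out.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return the (height, width) of each encoder stage output.

    Assumption: each stage downsamples the input by the given total stride.
    """
    shapes = []
    for s in strides:
        if height % s or width % s:
            raise ValueError(f"input size must be divisible by stride {s}")
        shapes.append((height // s, width // s))
    return shapes

print(pyramid_shapes(352, 352))  # [(88, 88), (44, 44), (22, 22), (11, 11)]
```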

II-C Hierarchical Feature Synthesizer (HFS)

The Hierarchical Feature Synthesizer (HFS) is a novel architecture tailored for advanced feature extraction. Built on the transformer-based encoder, it assimilates hierarchical feature representations. Specifically, this module takes as input the hierarchical features produced by the encoder described in the previous section, denoted $X_{1}$ to $X_{4}$. As illustrated in Figure 1, these features are then fed into the Reverse Transformer Attention module for feature extraction.

II-D Reverse Transformer Attention (RTA)

As illustrated in Figure 2, the Reverse Transformer Attention (RTA) module employs a reverse attention mechanism for feature refinement. This module integrates multiple transformer stages for feature extraction, capitalizing on their ability to discern global and local features. This approach ensures that the significant but often neglected regions, especially the edge, are accentuated, enhancing the model’s performance in capturing intricate details.

Feature Processing: Given two consecutive input features, $X_{i}$ and $X_{i+1}$, a convolutional layer is applied to align the feature dimensions with the input of the transformer stage, which is taken from the backbone. To maintain spatial consistency, the features are resized and subsequently passed through a bottleneck structure to increase the dimensionality. The reverse attention mechanism emphasizes typically overlooked regions by subtracting the attention map from one, which is formulated as follows:

$$X_{i+1,\mathrm{reverse}}=1-\mathrm{BottleNeck}_{1}\left(\mathrm{Resize}\left(\mathrm{Stage}_{i+1}\left(\mathrm{Conv}\left(X_{i+1}\right)\right)\right)\right) \tag{2}$$

Feature Enhancement and Output: The primary feature ${X}_{i}$ is modulated using the reverse attention map, followed by refinement with a bottleneck structure:

$$X_{i,output}=\mathrm{BottleNeck}_{2}\left(X_{i}\odot X_{i+1,\mathrm{reverse}}\right)+X_{i} \tag{3}$$

The output feature map $X_{i,output}$ encapsulates enhanced information due to the reverse attention mechanism, enabling RTA-Former to dynamically emphasize different feature map regions.
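
Equations (2) and (3) reduce to a complement followed by a gated residual. The sketch below shows only that arithmetic on small 2D maps in plain Python. The transformer stage, convolution, resize, and bottlenecks are replaced by precomputed inputs and a caller-supplied function (identity by default), so it illustrates the mechanism under those assumptions and is not the RTA module itself. The names `reverse_map` and `rta_refine` are hypothetical.

```python
def reverse_map(attention):
    """Eq. (2), arithmetic only: subtract each attention value from one."""
    return [[1.0 - a for a in row] for row in attention]

def rta_refine(x_i, attention_next, bottleneck=lambda m: m):
    """Eq. (3): gate X_i with the reversed map, refine, add the residual.

    `attention_next` stands in for the processed X_{i+1} and is assumed to
    already match X_i in size; `bottleneck` stands in for BottleNeck_2.
    """
    rev = reverse_map(attention_next)
    gated = [[x * r for x, r in zip(x_row, r_row)]
             for x_row, r_row in zip(x_i, rev)]
    refined = bottleneck(gated)
    return [[f + x for f, x in zip(f_row, x_row)]
            for f_row, x_row in zip(refined, x_i)]

x = [[2.0, 2.0], [2.0, 2.0]]
attn = [[1.0, 0.5], [0.0, 0.25]]
print(rta_refine(x, attn))  # [[2.0, 3.0], [4.0, 3.5]]
```

In this toy case, positions where the attention is already high (1.0) receive only the residual, while positions with low attention receive the largest boost, which is how the mechanism shifts emphasis toward regions the deeper feature overlooks.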

II-E Decoder

Starting with inputs $X_{1}$ to $X_{4}$ and four outputs $X_{1,output}$ to $X_{4,output}$ from the HFS, we perform the following operation:

$$X_{\mathrm{fused},i}=\mathrm{Resize}\left(\mathrm{FF}\left(X_{i},X_{i,output}\right)\right), \tag{4}$$

$$M=\mathrm{Up}\left(\mathrm{Conv}\left(\mathrm{Cat}\left(X_{\mathrm{fused},1},X_{\mathrm{fused},2},\ldots\right)\right)\right), \tag{5}$$

where $X_{\mathrm{fused},i}$ is the result of feature fusion at level $i$ and $\mathrm{FF}$ is the feature fusion mechanism of Section II-A. To obtain the final output $M$, the fused features are resized, concatenated, passed through a convolutional layer, and then upsampled.
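
The data flow of Eqs. (4) and (5) can be traced with a small plain-Python sketch. It makes several simplifying assumptions that are not from the paper: every fused feature is a single-channel 2D map, resizing and upsampling use nearest-neighbour interpolation, and the convolution is replaced by a per-channel weighted sum (the behaviour of a 1×1 convolution without bias). The names `resize_nearest` and `decode` are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D map given as a list of rows."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def decode(fused_maps, channel_weights, common_size, out_size):
    """Resize, concatenate, mix channels, then upsample (Eqs. 4 and 5)."""
    h, w = common_size
    # Resize every fused map to a common size; the list acts as the channel axis (Cat).
    stacked = [resize_nearest(m, h, w) for m in fused_maps]
    # Stand-in for Conv: one weight per channel, summed at each position.
    mixed = [[sum(wt * ch[r][c] for wt, ch in zip(channel_weights, stacked))
              for c in range(w)] for r in range(h)]
    # Up: bring the prediction to the output resolution.
    return resize_nearest(mixed, *out_size)

fused = [[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]
for row in decode(fused, [1.0, 0.5], common_size=(2, 2), out_size=(4, 4)):
    print(row)
```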

III EXPERIMENTS

III-A Dataset

We used five datasets for our experiments: CVC-ClinicDB[11], CVC-ColonDB[12], CVC-300[13], ETIS-LaribPolypDB[14], and Kvasir[15]. These datasets provide a diverse and representative sample of gastrointestinal polyp images for developing and evaluating medical image segmentation models. Following previous methods[9, 4], we trained our models on a merged set comprising 550 images from CVC-ClinicDB and 900 from Kvasir, totaling 1,450 images. For testing, we evaluated the model’s performance on all five datasets to assess both its learning capability and its generalization.

TABLE I: Number of parameters for the four sizes of our model.

Model Name	Backbone	Param (M)
RTA-Former-T	PVTv2-B0	8.4
RTA-Former-S	PVTv2-B2	56.2
RTA-Former-M	PVTv2-B4	192.6
RTA-Former-L	PVTv2-B5	250.8

III-B Evaluation Metrics

To quantitatively assess the network’s performance on polyp segmentation, we used the Dice Similarity Coefficient (DICE) and mean Intersection over Union (mIoU).
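
For binary masks $P$ (prediction) and $G$ (ground truth), Dice is $2|P\cap G|/(|P|+|G|)$ and IoU is $|P\cap G|/|P\cup G|$. The plain-Python sketch below computes both. It is an illustration, not the evaluation script used for the tables: the averaging over images in `mean_scores`, the smoothing term `eps`, and all function names are assumptions.

```python
def dice_and_iou(pred, target, eps=1e-8):
    """Dice and IoU for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    dice = (2 * inter + eps) / (pred_sum + target_sum + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou

def mean_scores(pairs):
    """Average Dice and IoU over (pred, target) pairs, one pair per image."""
    scores = [dice_and_iou(p, t) for p, t in pairs]
    n = len(scores)
    return sum(d for d, _ in scores) / n, sum(i for _, i in scores) / n

# One overlapping pixel out of two predicted and two true pixels.
print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

The example yields a Dice of about 0.5 and an IoU of about 0.333, which also shows that Dice is never lower than IoU for the same pair of masks.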

TABLE II: Comparison of the proposed method with prior methods on the five polyp segmentation datasets. Blue indicates the best result, and red the second-best.

Model	Kvasir-SEG		CVC-ClinicDB		CVC-300		CVC-ColonDB		ETIS
	DICE	mIoU	DICE	mIoU	DICE	mIoU	DICE	mIoU	DICE	mIoU
MICCAI’15 U-Net[1]	0.818	0.746	0.823	0.755	0.710	0.627	0.512	0.444	0.398	0.335
DLMIA’18 UNet++[2]	0.821	0.743	0.794	0.729	0.707	0.624	0.483	0.410	0.401	0.344
MICCAI’20 ACSNet[16]	0.898	0.838	0.882	0.826	0.856	0.788	0.716	0.649	0.578	0.509
arXiv’21 MSEG[17]	0.897	0.839	0.909	0.864	0.874	0.804	0.735	0.666	0.700	0.630
arXiv’21 DCRNet[18]	0.886	0.825	0.896	0.844	0.863	0.787	0.704	0.631	0.556	0.496
MICCAI’20 PraNet[4]	0.898	0.840	0.899	0.849	0.871	0.797	0.712	0.640	0.628	0.567
CRV’21 EU-Net[19]	0.908	0.854	0.902	0.846	0.837	0.765	0.756	0.681	0.687	0.609
MICCAI’21 SANet[20]	0.904	0.847	0.916	0.859	0.888	0.815	0.753	0.670	0.750	0.654
arXiv’21 Polyp-PVT[9]	0.917	0.864	0.937	0.889	0.900	0.833	0.808	0.727	0.787	0.706
IEEE TIM’23 APCNet[21]	0.913	0.859	0.934	0.886	0.893	0.827	0.758	0.682	0.726	0.648
PR’23 CFANet[22]	0.915	0.861	0.933	0.883	0.893	0.827	0.743	0.665	0.732	0.655
SPIE’23 CaraNet[23]	0.918	0.865	0.936	0.887	0.903	0.838	0.773	0.689	0.747	0.672
RTA-Former-T (Ours)	0.903	0.846	0.925	0.868	0.863	0.782	0.766	0.676	0.724	0.639
RTA-Former-S (Ours)	0.920	0.866	0.931	0.883	0.893	0.822	0.794	0.711	0.789	0.710
RTA-Former-M (Ours)	0.921	0.873	0.939	0.892	0.902	0.832	0.798	0.719	0.789	0.712
RTA-Former-L (Ours)	0.923	0.875	0.938	0.888	0.891	0.815	0.818	0.734	0.795	0.714

TABLE III: Ablation study of the proposed modules.

Components			Kvasir-SEG		CVC-ClinicDB		CVC-300		CVC-ColonDB		ETIS
HFS	RA	RTA	DICE	mIoU	DICE	mIoU	DICE	mIoU	DICE	mIoU	DICE	mIoU
			0.909	0.855	0.906	0.851	0.875	0.799	0.804	0.720	0.763	0.683
✓			0.914	0.861	0.912	0.859	0.889	0.806	0.806	0.723	0.779	0.696
✓	✓		0.915	0.863	0.922	0.869	0.890	0.814	0.811	0.726	0.788	0.704
✓		✓	0.923	0.875	0.938	0.888	0.891	0.815	0.818	0.734	0.795	0.714

III-C Implementation

All models are trained on a cluster with 8 NVIDIA RTX 6000 GPUs, each with 24GB memory, utilizing CUDA 12.2. We train for 100 epochs using the Adam optimizer with a learning rate of $1\times 10^{-4}$, a weight decay of $1\times 10^{-4}$, a batch size of 8, and the structure loss[4].

To ensure a fair evaluation of our method, we adhere to the image resolution settings previously used for each dataset. For the polyp segmentation task, we resize images to 352×352 as in previous methods[9, 4], using scale factors of 0.75, 1.0, and 1.25.
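
As a worked example of the three scale factors, the sketch below computes the training resolution for each one. Rounding the scaled size to the nearest multiple of 32 is an assumption borrowed from common practice in public training scripts for this benchmark and is not stated in this paper; `scaled_train_size` is a hypothetical helper.

```python
def scaled_train_size(base=352, rate=1.0, multiple=32):
    """Scale the base resolution and round to the nearest multiple.

    Assumption: sizes are kept divisible by 32 so that every encoder
    stage produces an integer-sized feature map.
    """
    return int(round(base * rate / multiple) * multiple)

sizes = [scaled_train_size(352, r) for r in (0.75, 1.0, 1.25)]
print(sizes)  # [256, 352, 448]
```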

In Table I, we present four distinct sizes of our model: tiny (T), small (S), medium (M), and large (L). These variations stem from our goal to understand how models with different numbers of parameters fare across diverse tasks. Balancing complexity is essential to prevent potential overfitting while ensuring the model effectively captures the intricacies of medical images. The four models, T, S, M, and L, are built upon the encoder backbones PVTv2-B0, B2, B4, and B5, respectively, allowing users to adapt to various scenarios and computational limits.

III-D Results

Table II reports the results in terms of the DICE and mIoU metrics. The Medium and Large RTA-Former models stand out in their learning capabilities, achieving the best scores on Kvasir-SEG and CVC-ClinicDB with DICE scores of up to 93.9% and mIoU scores of up to 89.2%. PVT-based methods, notably RTA-Former, consistently outperform many CNN-based approaches such as U-Net and UNet++. Moreover, RTA-Former’s performance on the CVC-300, CVC-ColonDB, and ETIS datasets underscores its generalization. While RTA-Former-M and CaraNet are among the strongest models on smaller datasets like CVC-300, many CNN-based models, particularly U-Net and UNet++, lag in generalization. Our RTA-Former effectively captures and analyzes polyp representations, outperforming the other 12 models on four of the five datasets, with RTA-Former-M and RTA-Former-L performing best.

III-E Ablation Study

Impacts of Our Modules. As shown in Table III, we used PVTv2-B5 as our base model. We then incorporated the hierarchical feature synthesizer (HFS) without the RTA module. Building upon the HFS, we added the conventional convolution-based reverse attention (RA) as used in CaraNet[23]. In the final step, we integrated our proposed reverse transformer attention (RTA) with the HFS. Introducing the HFS improves both DICE and mIoU on all five datasets, by up to about 1.6 percentage points. Adding the RA on top of the HFS yields only marginal improvement on Kvasir-SEG and CVC-300, while the other datasets gain between about 0.5 and 1 percentage point in DICE. In contrast, replacing the RA with the transformer-based RTA raises DICE by 1 to 2.6 percentage points over the HFS alone on several datasets, which highlights the adaptability and effectiveness of RTA compared with RA.
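
The per-dataset gains can be recomputed directly from the DICE columns of Table III. The short sketch below does this arithmetic. The dictionary keys and the helper name `gains` are hypothetical labels for the four table rows, and the values are copied from the table.

```python
DATASETS = ["Kvasir-SEG", "CVC-ClinicDB", "CVC-300", "CVC-ColonDB", "ETIS"]

# DICE scores from Table III, one list per ablation row.
DICE = {
    "baseline": [0.909, 0.906, 0.875, 0.804, 0.763],
    "HFS": [0.914, 0.912, 0.889, 0.806, 0.779],
    "HFS+RA": [0.915, 0.922, 0.890, 0.811, 0.788],
    "HFS+RTA": [0.923, 0.938, 0.891, 0.818, 0.795],
}

def gains(after, before):
    """DICE gain in percentage points per dataset, rounded to one decimal."""
    return [round((a - b) * 100, 1) for a, b in zip(DICE[after], DICE[before])]

print(dict(zip(DATASETS, gains("HFS", "baseline"))))
print(dict(zip(DATASETS, gains("HFS+RA", "HFS"))))
print(dict(zip(DATASETS, gains("HFS+RTA", "HFS"))))
```

The largest gain of RTA over the HFS alone is 2.6 points on CVC-ClinicDB, and the smallest is 0.2 points on CVC-300.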

III-F The Visualization Result of the Polyp Segmentation.

In Figure 3, we show the sensitivity and segmentation performance of different models for polyps of varied scales on test images from CVC-ClinicDB and ETIS. The results show that our model segments polyps of all sizes robustly and is more adaptable and precise than the other models. Additionally, our model is more sensitive to ambiguous or blurry polyp edges, giving more accurate segmentation even in challenging scenarios, as discussed next.

III-G The Visualization Result of Our Reverse Attention Module.

We employ the Grad-CAM[24] technique to illustrate the region of focus within our model. As shown in Figure 2, the output from Bottleneck 1 is reversed and subsequently fed into Bottleneck 2. As shown in Figure 4, within each bottleneck structure, Bottleneck 1.0, 1.1, and 1.2 denote the three convolution layers that sequentially process the image features. Similarly, Bottleneck 2.0, 2.1, and 2.2 correspond to the convolution layers within the second bottleneck unit. This process reveals that the original output of Bottleneck 1 is concentrated on the polyp itself, whereas the inverted feature map predominantly highlights the periphery of the polyp region. This distinction underscores the efficacy of our methodology in accurately delineating the lesion’s boundary, thereby facilitating enhanced segmentation.
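
Grad-CAM weights each channel of a chosen layer by the spatial mean of the gradient of the target score with respect to that channel, sums the weighted channels, and applies a ReLU. The plain-Python sketch below shows that computation. It assumes the activations and gradients of the layer have already been captured (for example, with framework hooks), and `grad_cam` is a hypothetical helper rather than the visualization code used for Figure 4.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from per-channel activations and gradients.

    Both arguments are lists of K channels, each an H x W list of rows.
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("need the same non-zero number of channels in both inputs")
    cam = None
    for act, grad in zip(activations, gradients):
        n = sum(len(row) for row in grad)
        weight = sum(sum(row) for row in grad) / n  # spatial mean of the gradient
        scaled = [[weight * a for a in row] for row in act]
        if cam is None:
            cam = scaled
        else:
            cam = [[c + s for c, s in zip(c_row, s_row)]
                   for c_row, s_row in zip(cam, scaled)]
    # ReLU keeps only positions with positive evidence for the target.
    return [[max(v, 0.0) for v in row] for row in cam]

acts = [[[4.0, 1.0]], [[2.0, 4.0]]]    # two channels, each 1 x 2
grads = [[[1.0, 1.0]], [[-1.0, 0.0]]]  # channel weights: 1.0 and -0.5
print(grad_cam(acts, grads))  # [[3.0, 0.0]]
```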

IV CONCLUSIONS

This study introduces a novel approach termed RTA-Former, in which the encoder employs PVT and the decoder incorporates the RTA module to enrich the reverse attention mechanism. The model shows strong capability and generalization on various datasets. The network can employ different transformer sizes according to the complexity of the task. In particular, RTA-Former-L and RTA-Former-M exhibit the highest performance on Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, and ETIS. These results suggest that RTA-Former is a promising approach for polyp segmentation.

References

[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[2] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer, 2018, pp. 3–11.
[3] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Dag Johansen, Thomas De Lange, Pål Halvorsen, and Håvard D Johansen, “Resunet++: An advanced architecture for medical image segmentation,” in 2019 IEEE International Symposium on Multimedia (ISM). IEEE, 2019, pp. 225–2255.
[4] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in International conference on medical image computing and computer-assisted intervention. Springer, 2020, pp. 263–273.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[6] Wen Zheng, Murong Yi, Guiqun Cao, Zhuyu Zhou, and Jian Cheng, “Deep learning-based fetal corpus callosum segmentation in ultrasonic images,” International Journal of Computer Theory and Engineering, vol. 14, no. 3, pp. 104–108, 2022.
[7] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
[8] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.
[9] Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, and Ling Shao, “Polyp-pvt: Polyp segmentation with pyramid vision transformers,” arXiv preprint arXiv:2108.06932, 2021.
[10] Mingxing Tan, Ruoming Pang, and Quoc V. Le, “Efficientdet: Scalable and efficient object detection,” 2020.
[11] Jorge Bernal, Nima Tajkbaksh, Francisco Javier Sanchez, Bogdan J Matuszewski, Hao Chen, Lequan Yu, Quentin Angermann, Olivier Romain, Bjørn Rustad, Ilangko Balasingham, et al., “Comparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge,” IEEE transactions on medical imaging, vol. 36, no. 6, pp. 1231–1249, 2017.
[12] J Bernal, J Sánchez, and F Vilariño, “Cvc-colondb: A database for assessment of polyp detection,” Database, 2012.
[13] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville, “A benchmark for endoluminal scene segmentation of colonoscopy images,” Journal of healthcare engineering, vol. 2017, 2017.
[14] Kun Yang, Shilong Chang, Zhaoxing Tian, Cong Gao, Yu Du, Xiongfeng Zhang, Kun Liu, Jie Meng, and Linyan Xue, “Automatic polyp detection and segmentation using shuffle efficient channel attention network,” Alexandria Engineering Journal, vol. 61, no. 1, pp. 917–926, 2022.
[15] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen, “Kvasir-seg: A segmented polyp dataset,” in MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26. Springer, 2020, pp. 451–462.
[16] Ruifei Zhang, Guanbin Li, Zhen Li, Shuguang Cui, Dahong Qian, and Yizhou Yu, “Adaptive context selection for polyp segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23. Springer, 2020, pp. 253–262.
[17] John Lambert, Zhuang Liu, Ozan Sener, James Hays, and Vladlen Koltun, “Mseg: A composite dataset for multi-domain semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2879–2888.
[18] Libo Qin, Wanxiang Che, Yangming Li, Mingheng Ni, and Ting Liu, “Dcr-net: A deep co-interactive relation network for joint dialog act recognition and sentiment classification,” in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, pp. 8665–8672.
[19] Krushi Patel, Andrés M Bur, and Guanghui Wang, “Enhanced u-net: A feature enhancement network for polyp segmentation,” in 2021 18th Conference on Robots and Vision (CRV). IEEE, 2021, pp. 181–188.
[20] Jun Wei, Yiwen Hu, Ruimao Zhang, Zhen Li, S Kevin Zhou, and Shuguang Cui, “Shallow attention network for polyp segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 699–708.
[21] Guanghui Yue, Siying Li, Runmin Cong, Tianwei Zhou, Baiying Lei, and Tianfu Wang, “Attention-guided pyramid context network for polyp segmentation in colonoscopy images,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023.
[22] Tao Zhou, Yi Zhou, Kelei He, Chen Gong, Jian Yang, Huazhu Fu, and Dinggang Shen, “Cross-level feature aggregation network for polyp segmentation,” Pattern Recognition, vol. 140, pp. 109555, 2023.
[23] Ange Lou, Shuyue Guan, Hanseok Ko, and Murray H Loew, “Caranet: context axial reverse attention network for segmentation of small medical objects,” in Medical Imaging 2022: Image Processing. SPIE, 2022, vol. 12032, pp. 81–92.
[24] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.