Medical Image Segmentation Using Directional Window Attention

Abstract

Accurate segmentation of medical images is crucial for diagnostic purposes, including cell segmentation, tumor identification, and organ localization. Traditional convolutional neural network (CNN)-based approaches struggle to achieve precise segmentation results due to their limited receptive fields, particularly in multi-organ segmentation where shapes and sizes vary. Transformer-based approaches address this limitation by leveraging a global receptive field, but they often struggle to capture the local information required for pixel-precise segmentation. In this work, we introduce DwinFormer, a hierarchical encoder-decoder architecture for medical image segmentation that comprises directional window (Dwin) attention and global self-attention (GSA) for feature encoding. The focus of our design is the Dwin block, which captures local and global information along the horizontal, vertical, and depthwise directions of the input feature map by performing attention separately in each of these directional volumes. To this end, the Dwin block introduces a nested Dwin attention (NDA) that progressively increases the receptive field in the horizontal, vertical, and depthwise directions, and a convolutional Dwin attention (CDA) that captures local contextual information for the attention computation. While the Dwin block captures local and global dependencies at the first two high-resolution stages of DwinFormer, the GSA block encodes global dependencies at the last two lower-resolution stages. Experiments on the challenging 3D Synapse Multi-organ dataset and the Cell HMS dataset demonstrate the benefits of DwinFormer over state-of-the-art approaches. Our source code will be publicly available at https://github.com/Daniyanaj/DWINFORMER.

1 Introduction

Medical image segmentation is a challenging task that requires pixel (voxel)-precise localization of cells, tumors, and human organs for diagnostic purposes [1] [2] [3]. It is generally addressed using a U-Net [4], where the encoder creates a low-dimensional representation of the input 3D image and the decoder maps it to an accurate segmentation mask. Previous CNN-based methods [4] [5] [1] struggled to achieve accurate segmentation results due to their limited receptive field. Various efforts have been made to handle long-range dependencies using dilated convolution [6], feature pyramids [7], and contextual attention [8, 9]. Nevertheless, these approaches remain limited by the locality of their receptive fields. This may lead to sub-optimal segmentation, especially where tissue structures vary in shape and scale. Recently, transformers have achieved state-of-the-art performance on 3D medical image segmentation tasks by employing a self-attention (SA) mechanism to capture long-range dependencies. Transformers are effective at capturing global interactions within images, leading to a larger receptive field and more precise predictions. Consequently, there has been a surge in transformer-based methods and hybrid models combining CNNs and transformers, which have substantially improved segmentation accuracy [10, 11].

The Swin transformer [12] has recently emerged as a promising solution for 3D segmentation tasks [11] due to its effective extraction of global and local dependencies. It utilizes non-overlapping window-based multi-head attention, whose complexity is linear in the number of tokens, in contrast to the quadratic complexity of ViTs [13]. However, the Swin transformer is still limited in explicitly encoding global interactions because its attention area is restricted to the inside of each window. Increasing the window size or applying attention at full resolution would lead to higher computational costs and a parameter-heavy model. To this end, we propose DwinFormer, which strives to effectively capture local and global information while limiting model complexity.
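The complexity argument above can be made concrete with a small count of query-key pairs. The sketch below is only illustrative: the token grid sizes and the cubic window side are assumptions chosen for the arithmetic, not values taken from this paper or from [12].

```python
def global_attention_pairs(h, w, d):
    """Query-key pairs when every token attends to every token."""
    n = h * w * d
    return n * n


def window_attention_pairs(h, w, d, win):
    """Query-key pairs when attention is restricted to non-overlapping
    cubic windows of side `win` (assumes `win` divides h, w and d)."""
    assert h % win == 0 and w % win == 0 and d % win == 0
    num_windows = (h // win) * (w // win) * (d // win)
    tokens_per_window = win ** 3
    return num_windows * tokens_per_window ** 2


# Hypothetical token grids: doubling every side multiplies the token count by 8.
small = (16, 16, 16)
large = (32, 32, 32)

global_growth = global_attention_pairs(*large) / global_attention_pairs(*small)
window_growth = (window_attention_pairs(*large, win=4)
                 / window_attention_pairs(*small, win=4))

print(global_growth)  # 64.0 -> quadratic in the number of tokens
print(window_growth)  # 8.0  -> linear in the number of tokens
```

With a fixed window, the cost grows in proportion to the number of tokens, but no token ever attends outside its own window, which is the limitation discussed above.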

Contribution: In this work, we propose a hierarchical encoder-decoder architecture, dubbed DwinFormer, for medical image segmentation tasks. DwinFormer comprises directional window (Dwin) attention and global self-attention (GSA) blocks. The objective of our novel Dwin attention block is to encode local and global representations across the horizontal, vertical, and depthwise directions. Specifically, we introduce two layers within the Dwin block: nested Dwin attention (NDA), which progressively enlarges the receptive field across the horizontal, vertical, and depthwise directions, and convolutional Dwin attention (CDA), which encodes local contextual information. Experimental results on the 3D multi-organ Synapse (human organs) and Cell HMS (microscopic zebrafish cells) segmentation datasets show that our method outperforms state-of-the-art methods.

Fig. 1: (a) Overall architecture of the proposed DwinFormer having a hierarchical encoder-decoder framework. In the encoder, the stem features are input to the directional window (Dwin) block to explicitly learn the local and global dependencies at high resolution in the initial two stages of the encoder, whereas the global self-attention (GSA) block is applied in the later two stages to capture the global information. In the decoder, the features are first upsampled and then added to the encoder features using a skip connection. The focus of our design is the introduction of (b) the Dwin block into DwinFormer, enabling the effective capturing of local and global information in multiple directions within the input feature map. The Dwin block consists of two components: (c) nested Dwin attention (NDA) that gradually expands the receptive field in the depthwise, horizontal and vertical directions, and (d) convolutional Dwin attention (CDA) that strives to capture local contextual information using depthwise convolution during the attention computation. (e) shows the qkv computation for attention in (i) Nested Dwin Attention (NDA) and (ii) Convolutional Dwin Attention (CDA). The NDA employs a linear layer to obtain qkv while the CDA additionally captures local information using a depthwise convolution.

2 Methods

Motivation: As previously mentioned, transformer-based and hybrid approaches employ self-attention operations, which incur high computational costs. An example of this limitation is the nnFormer model [11], which incorporates Swin transformer blocks. The Swin transformer has a restricted attention area, making it challenging to explicitly encode global interactions. Increasing the window size or employing attention at full resolution would result in increased computational costs and a model with a large number of parameters. In addition, accurately predicting target boundaries remains a challenge. We therefore emphasize the importance of learning boundary regions within both local and larger spatial contexts. Our approach focuses on explicitly capturing local and global dependencies using high-resolution features, while also incorporating global dependencies from lower-resolution features. By doing so, we aim to enhance the associations among volumetric feature representations, leading to improved predictions of boundary regions.

2.1 Overall Architecture

The overall architecture of the proposed method, dubbed DwinFormer, is shown in figure 1-a. Our model follows an encoder-decoder framework with varying resolutions at each stage for accurately capturing local and global dependencies. The 3D image is input to the stem layer to generate stem features, which are downsampled and input to the different stages of the encoder. Similarly, the low-resolution features are upsampled and input to the decoder stages, and finally to the expanding block, which outputs the final mask. The primary objective is to learn the diverse shapes of target regions by explicitly capturing both local and global dependencies. Specifically, the framework captures local and global feature dependencies using the directional window (Dwin) block at the first two encoder stages and the last two decoder stages, all of which have high feature resolutions. The remaining encoder and decoder stages, which have relatively lower feature resolutions, capture global feature dependencies using a global self-attention (GSA) block.

2.2 Directional Window (Dwin) Block

As mentioned previously, incorporating a self-attention mechanism is crucial in accurately segmenting regions with diverse shapes and sizes, as it enables the network to capture long-range dependencies. However, solely relying on self-attention may limit the ability to learn local contextual information, as global features become dominant and can result in increased computational costs. Therefore, we present a novel Dwin attention block that performs separate attention across all dimensional volumes to better learn the local and global representations with enhanced underlying attention areas.

Our proposed Dwin block comprises a nested Dwin attention (NDA) layer and a convolutional Dwin attention (CDA) layer, as shown in figure 1-b. In both Dwin attention layers, the input channels are divided into groups so that attention is performed along horizontal, vertical, and depthwise volumes. Let $\mathcal{F}\in\mathcal{R}^{H\times W\times D\times C}$ be the input, where $H\times W\times D$ is the size of the 3D input (volume) and $C$ denotes the number of channels. In each layer of the Dwin block, the horizontal volume attention is performed on windowed volumes of size ( $H\times sw\times sd$ ), the vertical volume attention is performed on windowed volumes of size ( $sh\times W\times sd$ ), and the depthwise volume attention is performed on windowed volumes of size ( $sh\times sw\times D$ ). Here $s$ refers to the number of windowed volumes along a particular direction, whereas $h$ , $w$ , and $d$ refer to the window dimensions, which can be adjusted accordingly. The Dwin attention block operations can be summarized as:

$$
\begin{aligned}
\hat{\mathcal{F}} &= \text{NDA}(\text{Norm}(\mathcal{F})) + \mathcal{F}, &\quad \bar{\mathcal{F}} &= \text{MLP}(\text{Norm}(\hat{\mathcal{F}})) + \hat{\mathcal{F}}, \\
\hat{\hat{\mathcal{F}}} &= \text{CDA}(\text{Norm}(\bar{\mathcal{F}})) + \bar{\mathcal{F}}, &\quad \bar{\bar{\mathcal{F}}} &= \text{MLP}(\text{Norm}(\hat{\hat{\mathcal{F}}})) + \hat{\hat{\mathcal{F}}},
\end{aligned}
\tag{1}
$$

where $\hat{\mathcal{F}}$ and $\bar{\mathcal{F}}$ represent the intermediate and final features of the NDA layer, whereas $\hat{\hat{\mathcal{F}}}$ and $\bar{\bar{\mathcal{F}}}$ denote the intermediate and final features of the CDA layer.
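The directional window shapes and the residual layout of Eq. (1) can be sketched in NumPy as follows. This is a minimal illustration and not the released implementation: the equal three-way channel split, the treatment of `sh`, `sw`, and `sd` as single window extents, and the helper names `partition`, `directional_windows`, and `dwin_block` are all assumptions, and the attention, MLP, and normalization layers are passed in as placeholder callables.

```python
import numpy as np


def partition(x, win):
    """Split x of shape (H, W, D, C) into non-overlapping windows of
    shape win = (wh, ww, wd); returns (num_windows, wh * ww * wd, C)."""
    H, W, D, C = x.shape
    wh, ww, wd = win
    assert H % wh == 0 and W % ww == 0 and D % wd == 0
    x = x.reshape(H // wh, wh, W // ww, ww, D // wd, wd, C)
    # Bring the window indices first, then the offsets inside each window.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wh * ww * wd, C)


def directional_windows(x, sh, sw, sd):
    """Assumed equal channel split; each group gets windows that span
    one full axis (H, W, or D) of the volume."""
    H, W, D, C = x.shape
    assert C % 3 == 0
    g_hor, g_ver, g_dep = np.split(x, 3, axis=-1)
    return {
        "horizontal": partition(g_hor, (H, sw, sd)),
        "vertical": partition(g_ver, (sh, W, sd)),
        "depthwise": partition(g_dep, (sh, sw, D)),
    }


def dwin_block(x, nda, cda, mlp1, mlp2, norm):
    """Residual layout of Eq. (1): every sub-layer receives the normalized
    features and its output is added back to the un-normalized features."""
    x = nda(norm(x)) + x
    x = mlp1(norm(x)) + x
    x = cda(norm(x)) + x
    x = mlp2(norm(x)) + x
    return x


# Hypothetical sizes, chosen only to keep the shapes easy to read.
x = np.arange(8 * 8 * 8 * 6, dtype=np.float64).reshape(8, 8, 8, 6)
windows = directional_windows(x, sh=2, sw=2, sd=2)
shapes = {name: w.shape for name, w in windows.items()}
print(shapes["horizontal"])  # (16, 32, 2): 16 windows of 8*2*2 tokens, 2 channels
```

Attention would then be computed independently inside each window of each channel group; the way NDA chains the three directions is described next and is not reproduced in this sketch.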

Nested Dwin Attention (NDA) Layer: The objective of the nested Dwin attention layer is to increase the receptive field while reducing the quadratic computational complexity of standard self-attention for volumetric input. In addition, the proposed NDA layer captures global dependencies for better segmentation. The fundamental attention operation of the NDA layer is shown in figure 1-c; it focuses on increasing the attention area and improving the representations by addressing the dependencies across the divided channel sets. In this attention mechanism, we consecutively multiply the attention maps from right to left and finally concatenate these attention maps, as shown in figure 1-c.

Convolution Dwin Attention (CDA) Layer: We also propose a CDA layer that encodes local contextual information, as shown in figure 1-d. To do so, we use a 5 $\times$ 5 depthwise convolution during the computation of the qkv features, as shown in figure 1-e-ii. The convolution encodes local contextual information and thereby addresses the issue of smaller patch sizes by providing an increased spatial context.
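The role of the depthwise convolution in the qkv computation can be illustrated with a short NumPy sketch. Several details here are assumptions rather than facts from the paper: the kernel is taken to have extent 5 along all three spatial axes, the convolution is applied before a single linear projection to q, k, and v, and the function names `depthwise_conv3d` and `cda_qkv` are hypothetical.

```python
import numpy as np


def depthwise_conv3d(x, kernels):
    """Depthwise (per-channel) 3D cross-correlation with zero padding.

    x:       (H, W, D, C) feature volume
    kernels: (k, k, k, C), one k*k*k filter per channel, k odd
    """
    H, W, D, C = x.shape
    k = kernels.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            for l in range(k):
                # Each channel is scaled by its own weight; channels never mix.
                out += xp[i:i + H, j:j + W, l:l + D, :] * kernels[i, j, l, :]
    return out


def cda_qkv(x, kernels, w_qkv):
    """Assumed ordering: local context first, then a linear map to q, k, v."""
    local = depthwise_conv3d(x, kernels)
    qkv = local @ w_qkv                      # (H, W, D, 3C)
    return np.split(qkv, 3, axis=-1)         # three arrays of shape (H, W, D, C)


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 6, 4))
identity = np.zeros((5, 5, 5, 4))
identity[2, 2, 2, :] = 1.0                   # centre tap only: output equals input
q, k, v = cda_qkv(x, identity, rng.standard_normal((4, 12)))
```

Because every voxel of the convolved volume already aggregates its neighbourhood, the queries, keys, and values carry local context before any attention is computed. An NDA-style projection would correspond to skipping `depthwise_conv3d` and applying only the linear map.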

2.3 Global Self-Attention Block

Our global self-attention block consists of two sequential standard self-attention (SA) layers, which are responsible for capturing global dependencies in the lower-resolution features. The 3D input (volume) tensor $\mathcal{F}\in\mathcal{R}^{H\times W\times D\times C}$ is first reshaped to a token sequence $f\in\mathcal{R}^{N\times C}$, where $N=H\cdot W\cdot D$ is the number of voxels, and input to the GSA block for global representation. The output of the GSA block is then reshaped back to the original 3D tensor.
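The reshape, attend, and reshape-back pattern can be sketched as follows. This is a simplified, single-head version of one SA layer without normalization, MLP, or residual connections; the projection matrices and tensor sizes are placeholders assumed for the example.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def global_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over all voxels of x with shape (H, W, D, C)."""
    H, W, D, C = x.shape
    f = x.reshape(H * W * D, C)              # flatten the volume to N tokens
    q, k, v = f @ w_q, f @ w_k, f @ w_v
    attn = softmax(q @ k.T / np.sqrt(C))     # (N, N): every token attends to all tokens
    return (attn @ v).reshape(H, W, D, C)    # restore the 3D layout


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 2, 8))        # hypothetical low-resolution stage
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = global_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 4, 2, 8)
```

The attention matrix has $N\times N$ entries, which is why this block is affordable only at the lower-resolution stages where $N$ is small.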

Method	DSC	HD95
U-Net [4]	76.85	-
ViT [13]+CUP [10]	67.86	36.11
TransUNet [10]	77.48	31.69
Swin-UNet [14]	79.13	21.55
LeVit-UNet-384s [15]	78.53	16.84
MissFormer [16]	81.96	18.20
UNETR [17]	79.56	22.97
Axial Deeplab [18]	85.37	9.16
nnFormer [11]	86.57	10.63
DwinFormer (Ours)	87.38	8.68

Table 1: Comparison with other state-of-the-art methods on the multi-organ Synapse dataset. We report the mean of our results over 3 runs. The best results are in bold.

Method	ARE	VOI_split	Avg JI	Avg DSC	JI>70%	DSC>70%	JI>50%	DSC>50%
UNet [5]	0.44	1.42	45.7	58.10	26.2	41.2	45.8	65.7
nnFormer [11]	0.41	1.17	52.5	64.01	39.3	53.5	56.09	73.6
Swin-UNet [14]	0.41	1.17	53.4	65.09	40.0	55.9	59.8	76.0
Ours	0.38	1.09	54.2	65.8	40.6	56.11	58.4	78.18

Table 2: State-of-the-art comparison on the Cell HMS dataset [19]. We report the results in terms of ARE, $VOI_{split}$, overall accuracy (JI and DSC), and cell count accuracy (JI/DSC greater than 50% or 70%) metrics. We report the mean over 3 runs. The best results are in bold.
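For reference, the two overlap metrics reported above can be computed for a pair of binary masks as in the sketch below. The helper names are hypothetical, and reading the ">70%" and ">50%" columns as the share of objects whose score exceeds the threshold is an assumption about the evaluation protocol of [19], not a statement of it.

```python
import numpy as np


def dice_and_jaccard(pred, target):
    """Dice similarity coefficient (DSC) and Jaccard index (JI) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:                # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    return 2.0 * inter / total, inter / union


def fraction_above(scores, threshold):
    """Share of objects whose score exceeds `threshold` (e.g. JI > 0.7)."""
    scores = np.asarray(scores, dtype=float)
    return float((scores > threshold).mean())


dsc, ji = dice_and_jaccard([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0])
print(round(float(dsc), 3), float(ji))           # 0.667 0.5
print(fraction_above([0.8, 0.6, 0.75, 0.4], 0.7))  # 0.5
```

For a single pair of masks the two scores are related by DSC = 2·JI / (1 + JI); averages over many objects, as reported in the table, do not satisfy this identity exactly.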

Method	Average		Aorta		Gall Bladder		Kidney(L)		Kidney(R)		Liver		Pancreas		Spleen		Stomach
Method	DSC	HD95	DSC	HD95	DSC	HD95	DSC	HD95	DSC	HD95	DSC	HD95	DSC	HD95	DSC	HD95	DSC	HD95
UNETR [17]	79.50	22.97	89.9	5.48	60.55	28.69	85.66	17.76	84.80	22.44	94.45	30.40	59.24	15.82	87.8	47.12	73.99	16.05
nnFormer [11]	86.57	10.63	92.04	11.38	70.17	11.55	86.57	18.09	86.25	12.76	96.84	2.00	83.35	3.72	90.51	16.92	86.83	8.58
Ours	87.38	8.68	92.63	4.56	73.17	6.92	87.64	13.93	87.41	9.10	96.45	2.94	83.11	4.23	92.17	19.5	86.49	8.24

Table 3: Organ-wise segmentation comparison on the multi-organ Synapse dataset. The best results are in bold.

3 Experiments

3.1 Datasets and Training Setup

Multi-organ Synapse dataset: This dataset includes 30 abdominal CT scans, split into 18 training and 12 validation scans. The segmentation task covers eight organs: the liver, right kidney, left kidney, pancreas, gall bladder, stomach, spleen, and aorta.

Cell HMS dataset: The HMS dataset was constructed at Harvard Medical School [19] from zebrafish cells and contains 3 target categories. It comprises 36 images with a resolution of 181×331×160, of which we used 88% for training and 12% for testing.

Training Setup: The method is implemented in PyTorch 1.8.0 and trained on an NVIDIA A100 GPU. For the multi-organ Synapse dataset, we adopt the pre-processing, augmentation strategies, and training recipe from nnFormer [11]. We set the batch size to 2 and the initial learning rate to 0.01, and use a poly decay strategy to adjust the learning rate. We use SGD as the default optimizer with momentum 0.99 and weight decay 3e-5. The network is trained for 1000 epochs with a combination of soft Dice loss and cross-entropy loss. For the Cell HMS dataset, we set the learning rate to 0.0001, the batch size to 5, and the crop size to $128\times 128\times 128$ . Following [19], we adopt the Adam optimizer and a weighted Dice loss for training.
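The Synapse schedule and loss described above can be sketched as follows. The initial learning rate and epoch count come from the text; the decay exponent of 0.9 and the equal weighting of the two loss terms are assumptions (commonly used defaults), and the function names are hypothetical.

```python
import numpy as np


def poly_lr(epoch, max_epochs=1000, initial_lr=0.01, exponent=0.9):
    """Poly decay: the rate falls from initial_lr to 0 over max_epochs.
    exponent=0.9 is an assumed value, not stated in the text."""
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent


def soft_dice_ce_loss(probs, onehot, eps=1e-5):
    """Soft Dice loss plus cross-entropy, equally weighted (assumption).

    probs, onehot: (num_voxels, num_classes); each row of probs sums to 1.
    """
    inter = (probs * onehot).sum(axis=0)
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = (2.0 * inter + eps) / (denom + eps)       # per-class soft Dice
    soft_dice = 1.0 - dice.mean()
    ce = -(onehot * np.log(probs + 1e-12)).sum(axis=1).mean()
    return soft_dice + ce


onehot = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
uniform = np.full((4, 2), 0.5)

print(poly_lr(0), poly_lr(1000))                 # 0.01 0.0
loss_perfect = soft_dice_ce_loss(onehot, onehot)    # close to 0
loss_uniform = soft_dice_ce_loss(uniform, onehot)   # about 0.5 + ln 2
```

The Dice term rewards region overlap regardless of class size, while the cross-entropy term provides a per-voxel gradient, which is a common motivation for combining the two.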

	Attention 1	Attention 2	Attention 3	DSC	HD95
1.	Vertical	Horizontal	Depthwise	87.33	8.71
2.	Depthwise	Vertical	Horizontal	87.09	8.82
3.	Horizontal	Depthwise	Vertical	87.14	9.19
4.	Horizontal	Vertical	Depthwise	87.21	9.01
5.	Depthwise	Horizontal	Vertical	87.38	8.68
6.	Vertical	Depthwise	Horizontal	87.06	9.12

Table 4: Ablation studies on the selection of the order of depthwise, vertical, and horizontal volumetric attention mechanisms within nested Dwin attention (NDA) layer. The best results are in bold.

3.2 Comparison with state-of-the-art methods

Multi-organ Synapse Dataset: Table 1 shows that our approach obtains the highest Dice score (87.38%) and the lowest HD95 (8.68) on the multi-organ Synapse dataset among the compared methods. Furthermore, in Table 3, we conduct a detailed performance analysis and show that our method improves the Dice and HD95 scores for most of the organs. The segmentation accuracies for organs such as the left kidney, right kidney, and gall bladder show considerable improvement, which indicates the effectiveness of our method in segmenting complex boundaries. The qualitative comparison in figure 2 shows that our approach performs better than other methods in avoiding false segmentations and preserving the organ boundaries.

Cell HMS Dataset: Table 2 shows that DwinFormer obtains improved performance compared to CNN-based and transformer-based methods on the microscopic cell dataset. The qualitative comparison of DwinFormer with the ground truth in figure 3 shows that the segmented output closely resembles the actual segments.

Sr. No.	Stage 1	Stage 2	Stage 3	Stage 4	DSC	HD95
1	GSA block	GSA block	GSA block	GSA block	81.90	16.02
2	Swin block	Swin block	Swin block	Swin block	86.57	10.63
3	CSwin block	CSwin block	CSwin block	CSwin block	86.80	9.63
4	Dwin block	Dwin block	Dwin block	Dwin block	86.98	9.11
5	Dwin block	Dwin block	GSA block	GSA block	87.38	8.68

Table 5: Comparison of various variants of context aggregator blocks incorporating standard self-attention [13] as GSA, Swin Transformer [12], CSwin Transformer [21] and the proposed Dwin block at different stages over multi-organ Synapse dataset.

3.3 Ablation Study

We perform an ablation study on the multi-organ Synapse dataset to validate the effectiveness of our method. First, we study the order of the horizontal, vertical, and depthwise attentions inside the nested Dwin attention (NDA) layer, as shown in Table 4. Although all combinations exhibit similar performance, applying depthwise, horizontal, and vertical attention consecutively in that order performs best, so we fix this arrangement as the default setting for the NDA layer. The small differences are likely due to the hierarchical structure of DwinFormer: deeper stages receive input features from the previous stage that have already been progressively attended in all directions. Next, we employ various context aggregators at different stages of our network, namely standard self-attention [13] as GSA, Swin Transformer [12], CSwin Transformer [21], and the proposed Dwin block, as shown in Table 5, and observe that our hybrid design yields the best Dice and HD95 scores.

4 Conclusion

We propose DwinFormer, which learns local and global dependencies using Dwin blocks and encodes global features using GSA blocks for better medical image segmentation. The focus of our design is an NDA that progressively increases the receptive field in the horizontal, vertical, and depthwise directions, together with a CDA that encodes local contextual information for the attention computation. Our experiments show that the approach provides favorable segmentation results on the multi-organ Synapse dataset and the Cell HMS dataset.

Compliance with Ethical Standards: This research study was conducted retrospectively using human subject data made available in open access by the Multi-Atlas Abdomen Labelling Challenge MICCAI 2015 [20] and the zebrafish Cell HMS dataset, an open-source dataset provided by the Department of Systems Biology at Harvard Medical School [19]. For both datasets, ethical approval was not required, as confirmed by the license attached to the open access data.

Acknowledgement: This work is partially supported by the MBZUAI-WIS Joint Program for AI Research (Project grant number: WIS P008).

References

[1] Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F. Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian Wirkert, and Klaus H. Maier-Hein, “nnu-net: Self-adapting framework for u-net-based medical image segmentation,” 2018.
[2] Mustansar Fiaz, Moein Heidari, Rao Muhammad Anwer, and Hisham Cholakkal, “Sa2-net: Scale-aware attention network for microscopic image segmentation,” 2023.
[3] Daniya Najiha Abdul Kareem, Mustansar Fiaz, Noa Novershtern, Jacob Hanna, and Hisham Cholakkal, “Improving 3d medical image segmentation at boundary regions using local self-attention and global volume mixing,” IEEE Transactions on Artificial Intelligence, pp. 1–12, 2023.
[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[5] Özgün Çiçek, Ahmed Abdulkadir, Soeren Lienkamp, Thomas Brox, and Olaf Ronneberger, “3d u-net: Learning dense volumetric segmentation from sparse annotation,” 2016.
[6] Jianpeng Zhang, Yutong Xie, Yan Wang, and Yong Xia, “Inter-slice context residual learning for 3d medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 40, no. 2, pp. 661–672, 2020.
[7] Mourad Gridach, “Pydinet: Pyramid dilated network for medical image segmentation,” Neural networks, vol. 140, pp. 274–281, 2021.
[8] Xudong Wang, Shizhong Han, Yunqiang Chen, Dashan Gao, and Nuno Vasconcelos, “Volumetric attention for 3d medical image segmentation and detection,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22. Springer, 2019, pp. 175–184.
[9] Wenhao Fang and Xian-hua Han, “Spatial and channel attention modulated network for medical image segmentation,” in Proceedings of the Asian Conference on Computer Vision, 2020.
[10] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” 2021.
[11] Hong-Yu Zhou, Jiansen Guo, Yinghao Zhang, Lequan Yu, Liansheng Wang, and Yizhou Yu, “nnformer: Interleaved transformer for volumetric segmentation,” 2021.
[12] Ze Liu, Yutong Lin, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[13] Alexey Dosovitskiy, Lucas Beyer, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[14] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” 2021.
[15] Guoping Xu, Xingrong Wu, Xuan Zhang, and Xinwei He, “Levit-unet: Make faster encoders with transformer for medical image segmentation,” 2021.
[16] Xiaohong Huang, Zhifang Deng, Dandan Li, and Xueguang Yuan, “Missformer: An effective medical image segmentation transformer,” 2021.
[17] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger Roth, and Daguang Xu, “Unetr: Transformers for 3d medical image segmentation,” 2021.
[18] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan L. Yuille, and Liang-Chieh Chen, “Axial-deeplab: Stand-alone axial-attention for panoptic segmentation,” CoRR, vol. abs/2003.07853, 2020.
[19] Andong Wang, Qi Zhang, Yang Han, Sean Megason, Sahand Hormoz, Kishore R Mosaliganti, Jacqueline CK Lam, and Victor OK Li, “A novel deep learning-based 3d cell segmentation framework for future image-based disease detection,” Scientific Reports, vol. 12, no. 1, pp. 1–15, 2022.
[20] Bennett Landman et al., “Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge,” in Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault–Workshop Challenge, 2015, vol. 5, p. 12.
[21] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo, “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” 2022.