Multi-view Remote Sensing Image Segmentation with SAM Priors

Abstract

Multi-view segmentation in Remote Sensing (RS) seeks to segment images from diverse perspectives within a scene. Recent methods leverage 3D information extracted from an Implicit Neural Field (INF), bolstering result consistency across multiple views while using a limited number of labels (as few as 3-5) to reduce annotation labour. Nonetheless, achieving superior performance within the constraints of limited-view labels remains challenging due to inadequate scene-wide supervision and insufficient semantic features within the INF. To address these issues, we propose to inject the prior of a visual foundation model, Segment Anything (SAM), into the INF to obtain better results with limited training data. Specifically, we compare SAM features between testing and training views to derive pseudo-labels for each testing view, augmenting scene-wide labeling information. Subsequently, we introduce SAM features via a transformer into the INF of the scene, supplementing the semantic information. The experimental results demonstrate that our method outperforms mainstream methods, confirming the efficacy of SAM as a supplement to the INF for this task.

Index Terms—  Multi-view segmentation, Implicit Neural Network, Transformer, Remote Sensing

1 Introduction

Remote Sensing (RS) 3D technology, encompassing scene reconstruction [1, 2], novel view synthesis [3], and more [4, 5, 6], has progressed rapidly in recent years. This paper concentrates on segmenting images captured from various views within a scene using very few annotations, referred to as multi-view segmentation. This task is important for comprehensively understanding target RS scenes.

Mainstream CNN-based segmentation methods [7, 8, 9, 10] often rely on extensive labeled training data, which makes them ill-suited to segmenting multi-view images of an RS scene for which only a limited number of views are captured. Following the development of the Implicit Neural Field (INF), the authors of [11] utilise the INF to encode the colour and density attributes of each spatial point in an RS scene. The colour attribute is then transformed into a semantic attribute through a limited number of supervisions (e.g., 3–5 semantic labels). Once this process is optimised, points are sampled along rays passing through the camera’s centre and each pixel in the input image, and the colour and semantic class (seg) of each pixel are computed using the colour rendering and semantic rendering functions. The density, colour, and semantic attributes of the INF encode the scene’s 3D, colour, and semantic information, respectively.

Nevertheless, the semantic attributes of certain sampled points might be under-fitted due to the limited coverage of the entire scene by a restricted number of semantic annotations.

Fig. 1: Our proposed method consists of two stages. Initially, we employ two MLPs to build the scene’s INF, encoding 3D information in the density attributes of each spatial point, supervised by all RGB images. Subsequently, we freeze the density MLP and incorporate SAM priors into the INF. This involves transferring the colour attribute by introducing pseudo-labels as additional supervision and injecting SAM features via a transformer.

In this paper, we propose to integrate the prior knowledge of a large foundation model into the INF to tackle the aforementioned challenges. We specifically select a visual foundation model, Segment Anything (SAM) [12], because its extensive training on diverse data enables it to segment arbitrary input images. Its encoder robustly extracts image features, even for RS images. Our proposed method comprises two stages: (1) constructing the scene’s INF, and (2) transferring the colour attribute into the semantic attribute. We introduce SAM priors into the INF using a transformer mechanism during the second stage to augment semantic information. Additionally, we compare SAM features between testing and training views to derive pseudo-labels for each testing view, promoting scene-wide supervision.

In Section 2, we give a detailed introduction to the proposed method. In Section 3, the experimental results are presented. Conclusions are drawn in Section 4.

2 Methodology

2.1 Overview

As depicted in Fig. 1, our proposed method consists of two stages. In the first stage, we extract the 3D information of the RS scene encoded in the density attribute of the INF, utilising all RGB images. The second stage involves transferring the colour attribute of each sampled point to the semantic attribute using a limited number of training labels. Additionally, we leverage SAM features to generate pseudo-labels for the testing views, thereby enhancing the semantic supervision of the scene. Furthermore, we introduce SAM features into the INF through a transformer mechanism to enrich the semantic information.

2.2 Constructing INF

In the first stage, we utilise the pixel colour from all input RGB images to supervise the construction of the scene’s INF. Specifically, we randomly select rays ($R$) originating from the camera centre and passing through pixels in the image plane, and sample points along these rays. Two MLPs are employed, taking the point position and ray direction as input to predict the colour and density attributes of each sampled point. Subsequently, we employ the following function to render the colour of each pixel:

$$\text{colour} = \sum_{i=1}^{N} \exp\Big(-\sum_{j=1}^{i-1} \alpha_j \sigma_j\Big) \Big(1 - \exp(-\alpha_i \sigma_i)\Big) c_i, \tag{1}$$

where $\sigma_i$ is the density attribute of the $i$-th sampling point, $c_i$ is its colour attribute, and $\alpha_i$ is the interval distance between the $i$-th and the $(i+1)$-th points. Then, we use the following loss to optimise the parameters of the two MLPs:

$$\text{loss}_{rgb} = \sum_{r \in R} \Big[ \| \text{colour}_r - \text{GT}_r \|_2^2 \Big]. \tag{2}$$
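As an illustration of Eqs. (1) and (2), the following is a minimal pure-Python sketch of the rendering weights and the RGB loss for a handful of samples on a single ray. The function names (`render_weights`, `render_colour`, `rgb_loss`) and the toy values are hypothetical and chosen only for this example; a real implementation would operate on batched tensors in PyTorch.

```python
import math


def render_weights(sigmas, alphas):
    """Per-sample weights exp(-sum_{j<i} alpha_j*sigma_j) * (1 - exp(-alpha_i*sigma_i))."""
    weights = []
    accumulated = 0.0  # running sum of alpha_j * sigma_j over samples before the current one
    for sigma, alpha in zip(sigmas, alphas):
        transmittance = math.exp(-accumulated)
        opacity = 1.0 - math.exp(-alpha * sigma)
        weights.append(transmittance * opacity)
        accumulated += alpha * sigma
    return weights


def render_colour(sigmas, alphas, colours):
    """Eq. (1): weighted sum of the per-sample RGB colours along one ray."""
    weights = render_weights(sigmas, alphas)
    return [sum(w * c[k] for w, c in zip(weights, colours)) for k in range(3)]


def rgb_loss(rendered, targets):
    """Eq. (2): sum over rays of the squared L2 distance to the ground-truth colour."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred, gt))
        for pred, gt in zip(rendered, targets)
    )


# Toy ray with three samples (all values are made up for illustration).
sigmas = [0.0, 5.0, 50.0]   # density attribute of each sample
alphas = [0.1, 0.1, 0.1]    # interval distance to the next sample
colours = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

pixel = render_colour(sigmas, alphas, colours)
loss = rgb_loss([pixel], [(0.0, 0.4, 0.6)])
```

In this toy ray the first sample has zero density and therefore contributes nothing to the pixel, while the dense third sample absorbs most of the remaining transmittance.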

2.3 Transferring semantic attribute

In the second stage, our proposal involves integrating SAM priors into the process of converting the colour attribute into the semantic attribute of each sampled point.

Generating pseudo-labels. While the INF mechanism effectively utilises semantic information from a limited training dataset, optimising semantic attributes for points beyond the coverage of the training views remains a challenge. To address this, we utilise SAM features to generate pseudo-labels for testing views. Initially, we compute SAM encoder features for the image of each view using the equation:

$$F_i = \text{SAM}(\text{image}_i). \tag{3}$$

Subsequently, we calculate the mean value of the features ($F$) associated with pixels from the training data belonging to category $c$, which serves as the centre of category $c$ ($c = 1 \dots L$, where $L$ represents the number of classes). Finally, we compare the feature of each pixel in the testing images with the computed centres and assign the category with the minimum distance. The results are taken as pseudo-labels, as shown in the bottom left of Fig. 1.
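The nearest-centre assignment described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that a feature vector is already available for every pixel, that classes are indexed from 0, and that the distance is Euclidean (the text only states "minimum distance"). The function names and the toy two-dimensional features are hypothetical.

```python
import math


def class_centres(features, labels, num_classes):
    """Mean feature vector of the labelled training pixels of each class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(features, labels):
        counts[label] += 1
        for k, value in enumerate(feat):
            sums[label][k] += value
    # A class without any labelled pixel gets no centre (None).
    return [
        [v / counts[c] for v in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def pseudo_labels(features, centres):
    """Assign each pixel feature to the class whose centre is nearest (Euclidean, assumed)."""
    candidates = [c for c, centre in enumerate(centres) if centre is not None]
    return [
        min(candidates, key=lambda c: math.dist(feat, centres[c]))
        for feat in features
    ]


# Toy 2-D "features" standing in for SAM encoder features (made-up values).
train_feats = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (1.0, 0.8)]
train_labels = [0, 0, 1, 1]
centres = class_centres(train_feats, train_labels, num_classes=2)

test_feats = [(0.1, 0.1), (0.9, 1.0)]
pseudo = pseudo_labels(test_feats, centres)
```

Because the centres are computed only from the few labelled training pixels, the quality of the pseudo-labels depends on how well those pixels represent each class.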

Integrating SAM features via Transformer. Converting the colour attributes directly into semantic attributes ignores texture information and yields poor results. We therefore use $F_i$, which contains rich and robust semantic information, as a local texture prompt:

$$\begin{aligned}
s_1, \dots, s_n &= T_e(b_1, \dots, b_n), \\
s_1^1, \dots, s_n^1 &= T_d(\{s_1, \dots, s_n\}, F_i^{x,y}), \\
s_1^2, \dots, s_n^2 &= s_1 + s_1^1, \dots, s_n + s_n^1.
\end{aligned} \tag{4}$$

Here, $b_i$ is the base feature of the $i$-th sampling point along the ray $r$, from which the density and colour attributes are generated; $s_i^2$ represents the semantic attribute (the superscript is an index, not a power); and $(x, y)$ denotes the coordinates of the intersection point of the ray $r$ with the image. $T_e$ refers to the transformer encoder, and $T_d$ to the transformer decoder. The encoder transforms the base features of the sampling points into semantic features, while the decoder integrates the SAM features with the semantic features. We generate the semantic label using the following function:

$$\text{seg} = \sum_{i=1}^{N} \exp\Big(-\sum_{j=1}^{i-1} \alpha_j \sigma_j\Big) \Big(1 - \exp(-\alpha_i \sigma_i)\Big) s_i^2, \tag{5}$$

where $s_i^2$ is the semantic attribute of the $i$-th sampling point. We then compute the loss with the following function:

$$\mathcal{L}_s = -\lambda \sum_{r \in R} \Big[ \sum_{l=1}^{L} seg_r^l \log GT_r^l \Big], \tag{6}$$

where $L$ is the number of classes. When the rays in $R$ belong to the training views, $\lambda$ is set to 1. When they belong to the testing views, we use the pseudo-labels as the ground truth and set $\lambda$ to 0.001.
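The weighting of Eq. (6) can be illustrated with a small sketch. It rests on several assumptions that the text does not state: the rendered semantic output of Eq. (5) is treated as unnormalised class scores and passed through a softmax, labels are given as class indices starting from 0, and the cross-entropy is written in its conventional form, with the logarithm applied to the predicted probability of the labelled class. The names `softmax` and `semantic_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_loss(seg_scores, labels, from_training_view):
    """Weighted cross-entropy over a batch of rays.

    seg_scores: rendered class scores per ray (assumed output of Eq. (5)).
    labels: class index per ray (ground truth or pseudo-label).
    from_training_view: True for labelled training rays, False for
        testing-view rays supervised by pseudo-labels.
    """
    lam = 1.0 if from_training_view else 0.001  # lambda values given in the text
    total = 0.0
    for scores, label in zip(seg_scores, labels):
        probs = softmax(scores)
        total -= math.log(probs[label])
    return lam * total


# Toy batch of two rays and three classes (made-up values).
seg_scores = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = [0, 1]

train_loss = semantic_loss(seg_scores, labels, from_training_view=True)
pseudo_loss = semantic_loss(seg_scores, labels, from_training_view=False)
```

With the small weight on testing-view rays, pseudo-labels act as a weak regulariser: they extend supervision across the scene without letting their noise dominate the few reliable training labels.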

3 Experiment

3.1 Experimental Setup

We perform experiments on the real sub-datasets introduced in [11]. The whole real sub-dataset includes 300 images, each sized at $512 \times 512$ pixels. Only 2% of the images in the training sets have corresponding labels. The INF model in our method is based on NeRF++ [13]; our model is implemented in PyTorch and trained on a single NVIDIA RTX 3090 GPU. The comparison methods fall into two categories, CNN-based and INF-based, and include Unet [8], SegNet [14], DANet [15], DeepLab [9], SETR [16] and Sem-NeRF [17]. All these methods were re-trained on the same dataset. We use the average mIoU across all sub-datasets as the evaluation metric.
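For reference, the metric can be sketched as below. This is a generic formulation of mIoU on flattened label maps, not the evaluation script used in the experiments; in particular, skipping classes that are absent from both the prediction and the ground truth is an assumption, and the function names are hypothetical.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection-over-union over the classes present in either label map."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == c and t == c)
        union = sum(1 for p, t in zip(predicted, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in the prediction or the ground truth")
    return sum(ious) / len(ious)


def average_miou(per_scene_miou):
    """Average of the per-sub-dataset mIoU values."""
    return sum(per_scene_miou) / len(per_scene_miou)


# Toy flattened label maps with three classes (made-up values).
predicted = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(predicted, target, num_classes=3)
```

For these toy maps the per-class IoU values are 1/3, 2/3 and 1/2, so `score` is 0.5.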

3.2 Experiment Results

Fig. 2: The images on the left showcase results in proximity to the training views, while those on the right depict regions distant from the training views.

Table 1: mIoU of different methods on the sub-datasets.

| Methods | SegNet | Unet | DANet | DeepLab | SETR | Sem-NeRF | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| real #1 | 8.67 | 64.94 | 35.71 | 39.35 | 36.26 | 11.64 | 62.29 |
| real #2 | 27.88 | 49.95 | 49.13 | 41.37 | 31.04 | 18.78 | 79.94 |
| real #3 | 18.84 | 68.33 | 34.30 | 52.37 | 26.07 | 19.76 | 61.93 |
| AVG | 18.46 | 61.07 | 39.71 | 44.36 | 31.12 | 16.72 | 67.95 |

Quantitative Results: Table 1 presents the quantitative comparison between our method and the other methods. Our method achieves the highest average mIoU among both the CNN-based and INF-based methods. The CNN-based methods lack spatial information, so their results are poorly consistent across views. Compared to Unet, our method improves the average mIoU by over 6.8 points. Sem-NeRF is not well suited to modelling remote sensing scenes because of the significant differences between distant and close views, and our method outperforms it by over 50 points. These results indicate that our proposed method is better suited to large-scale remote sensing scenes. We introduce SAM features into multi-view segmentation tasks for the first time, and the results verify their effectiveness.

Qualitative Results: Fig. 2 illustrates the multi-view segmentation results on the real #1 sub-dataset. The proposed method accurately segments images from the test views using annotations from only two training views, with strong inter-view consistency. Notably, our method also segments well in regions far from the training views, shown on the right of the figure. This is achieved by leveraging pseudo-labels and the discriminative features of SAM, demonstrating the effectiveness of our approach.

4 Conclusion

This paper focuses on multi-view segmentation within RS scenes, aiming to tackle challenges arising from insufficient scene-wide supervision and inadequate semantic features within the implicit neural field. To address these, we leverage SAM to derive pseudo-labels for each test view, thereby enhancing scene-wide semantic supervision. Additionally, we integrate SAM features into the INF using a transformer to bolster semantic information. Experimental results show that our method outperforms the CNN-based and INF-based methods, validating the efficacy of SAM as a valuable supplement to the INF for this task.

References

  • [1] Zipeng Qi, Zhengxia Zou, Hao Chen, and Zhenwei Shi, “3d reconstruction of remote sensing mountain areas with tsdf-based neural networks,” Remote Sensing, vol. 14, no. 17, pp. 4333, 2022.
  • [2] Mehmet Buyukdemircioglu, Sultan Kocaman, and Martin Kada, “Deep learning for 3d building reconstruction: A review,” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 43, pp. 359–366, 2022.
  • [3] Yongchang Wu, Zhengxia Zou, and Zhenwei Shi, “Remote sensing novel view synthesis with implicit multiplane representations,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [4] Kyle Gao, Yina Gao, Hongjie He, Dening Lu, Linlin Xu, and Jonathan Li, “Nerf: Neural radiance field in 3d vision, a comprehensive review,” arXiv preprint arXiv:2210.00379, 2022.
  • [5] Zipeng Qi, Zhengxia Zou, Hao Chen, and Zhenwei Shi, “Remote-sensing image segmentation based on implicit 3-d scene representation,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
  • [6] Chenyang Liu, Jiajun Yang, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi, “Progressive scale-aware network for remote sensing image change captioning,” in IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium, 2023, pp. 6668–6671.
  • [7] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in European conference on computer vision. Springer, 2022, pp. 205–218.
  • [8] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
  • [9] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [10] Yu Zhou, Rui Lu, Feng Xue, and Yuzhe Gao, “Occlusion relationship reasoning with a feature separation and interaction network,” Visual Intelligence, vol. 1, no. 1, pp. 23, 2023.
  • [11] Zipeng Qi, Hao Chen, Chenyang Liu, Zhenwei Shi, and Zhengxia Zou, “Implicit ray-transformers for multi-view remote sensing image segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [12] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [13] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun, “Nerf++: Analyzing and improving neural radiance fields,” arXiv preprint arXiv:2010.07492, 2020.
  • [14] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [15] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
  • [16] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 6881–6890.
  • [17] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J Davison, “In-place scene labelling and understanding with implicit scene representation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15838–15847.