High-Fidelity Lake Extraction via Two-Stage Prompt Enhancement:
Establishing a Novel Baseline and Benchmark

Abstract

Lake extraction from remote sensing imagery is a complex challenge due to the varied lake shapes and data noise. Current methods rely on multispectral image datasets, making it challenging to learn lake features accurately from pixel arrangements. This, in turn, affects model learning and the creation of accurate segmentation masks. This paper introduces a prompt-based dataset construction approach that provides approximate lake locations using point, box, and mask prompts. We also propose a two-stage prompt enhancement framework, LEPrompter, with prompt-based and prompt-free stages during training. The prompt-based stage employs a prompt encoder to extract prior information, integrating prompt tokens and image embedding through self- and cross-attention in the prompt decoder. Prompts are deactivated to ensure independence during inference, enabling automated lake extraction without introducing additional parameters and GFlops. Extensive experiments showcase performance improvements of our proposed approach compared to the previous state-of-the-art method. The source code is available at https://github.com/BastianChen/LEPrompter.

Index Terms— Lake Extraction, Semantic Segmentation, Prompt Learning, Vision Transformer

1 Introduction

Lakes are important environmental and climatic indicators and have received significant scientific attention [1]. Global observation technology and sensing devices have made remote sensing imagery a popular tool for extracting water bodies [2]. Automated lake extraction from remote sensing imagery is crucial in monitoring climate change [3].

Fig. 1: Our proposed two-stage prompt enhancement framework for lake extraction. During training, the prompt-based approach resembles a teacher guiding students through challenging problems, while the prompt-free approach lets students tackle problems independently. Inference uses only the prompt-free approach.

Lake extraction is a semantic segmentation task that researchers have recently tackled using deep learning techniques, such as UDGN [4], to enhance the spatial resolution of lake regions. MSLWENet [5] proposes an end-to-end multi-scale plateau lake extraction network model based on ResNet-101 [6] and depth-wise separable convolution [7]. LEFormer [8] utilizes a hybrid CNN-Transformer architecture, HA-Net [2] employs a mixed-scale attention model, and MSNANet [9] adopts an encoder-decoder framework to improve feature representation and extraction results. However, existing lake datasets [5] consist of multi-channel input images with single-channel ground truth, posing challenges for model learning due to the complex pixel information and potential noise. Models need to capture intricate lake details effectively and handle noise to improve extraction accuracy.

Recently, the prompt-based interactive semantic segmentation method SAM [10] has shown promising performance by refining predicted masks through prompt prior information. However, this method requires a pre-prompt (point, box, or mask) with the input image, making it interactive. Additionally, SAM’s segmentation results lack semantic information, which restricts their suitability for fully automated interpretation of remote sensing images. Furthermore, SAM utilizes extensive backbones, such as ViT-H [11], leading to numerous model parameters and increased computational complexity.

To address the limitations of existing datasets and methods, we employ morphological operations to create a prompt dataset based on the ground truth, which consists of point, box, and mask prompts. To leverage the prompt dataset effectively, we propose LEPrompter, a two-stage prompt enhancement framework for training lake extraction models. The framework includes prompt-based and prompt-free stages. Training proceeds in the prompt-based stage until a specified step threshold is reached. After that, we transition to the prompt-free stage, training the lake extraction model independently of prompts. In the prompt-based stage, the prompts are processed by a lightweight prompt encoder and decoder, which add only 1.23M parameters and 0.95G Flops. The resulting prompt tokens are fused with the image embedding obtained from the lake extraction model to generate output tokens for mask prediction. During inference, we employ only the prompt-free approach, which requires neither additional model parameters nor GFlops. The workflow is illustrated in Fig. 1. Experimental results demonstrate that our proposed approach consistently improves the performance of previous SOTA methods on the Surface Water dataset (SW dataset) and the Qinghai-Tibet Plateau Lake dataset (QTPL dataset) [5], achieving mIoU of 91.53% and 97.44%, respectively. The main contributions of our approach are summarized as follows:

• We develop a unified morphological method for generating various prompts and establish three types of prompt datasets (point, box, and mask), serving as a benchmark for lake extraction from remote sensing imagery.

• We propose a two-stage prompt enhancement framework for automated lake extraction. This framework employs prompts to guide model training while allowing independence from prompts during inference, serving as a baseline for lake extraction with prompt datasets.

• We evaluate the influence of prompt types on lake extraction and observe that a small amount of prompting positively guides model learning, whereas excessive prompting restricts performance improvement, much as a teacher's guidance affects students in the real world.

2 Related Work

2.1 Lake Extraction

Lake extraction aims to automatically delineate lake boundaries in remote sensing imagery and is a semantic segmentation task. Deep learning-based approaches for lake extraction have recently garnered significant attention from researchers. These approaches, such as UDGN [4], aim to enhance the spatial resolution of lake regions, but struggle with diverse lake types and spatial-spectral characteristics, leading to lost spatial information. To address these challenges, researchers explore integration strategies to optimize network structures and leverage multi-scale features. For instance, MSLWENet [5] proposes an end-to-end multi-scale plateau lake extraction network model based on ResNet-101 [6] and depth-wise separable convolution [7]. However, this model is susceptible to noise, particularly for lakes with intricate surface textures. Various approaches have been developed to overcome these challenges. LEFormer [8] utilizes a hybrid CNN-Transformer architecture, leveraging the CNN to extract local features and the Transformer to capture global features. HA-Net [2] employs a mixed-scale attention model, and MSNANet [9] adopts an encoder-decoder framework to enhance feature representation and achieve improved extraction results for lakes. However, the extraction of lakes remains challenging due to high interclass heterogeneity and complex background information.

2.2 Prompt Learning

Prompt learning aims to reduce semantic differences, bridge the pre-training and fine-tuning gap, and prevent overfitting. Since the introduction of GPT-3 [12], prompt learning has advanced from traditional discrete and continuous prompt construction to large-scale model-oriented in-context learning [13], instruction-tuning [14], and chain-of-thought approaches [15]. Current methods for constructing prompts mainly involve manual templates, heuristic-based templates, generation, word embedding fine-tuning, and pseudo tokens. Recently, prompt-based interactive semantic segmentation method SAM [10] has shown remarkable performance. However, its applicability for automatic image segmentation is limited due to the requirement of pre-prompt input and the use of an extensive backbone [11], resulting in increased model parameters and computational costs. In remote sensing imagery, RSPrompter [16] has successfully introduced prompt mechanisms. However, this method requires additional model structures to extract prompt information during inference, introducing additional parameters and computational costs.

In this work, we generate images containing prompt information based on the ground truth to assist in supervised training with the original dataset. We adopt a prompt-free approach during inference, using only the original lake extraction model without additional parameters or computational costs.

3 Prompt-based Lake Extraction Dataset

We evaluate our approach on two satellite remote sensing datasets for lake extraction: the SW and QTPL [5] datasets. Both datasets include annotated lake water bodies and consist of images with a bit depth of 24 bits per pixel, covering the R, G, and B spectral bands. The SW dataset contains 17,596 images of size 256 × 256, divided randomly into training and testing sets with a 4:1 ratio. The QTPL dataset includes 6,773 images of size 256 × 256, with a spatial resolution of 17 meters, divided randomly into training and testing sets with a 9:1 ratio. However, the rich spatial-spectral characteristics of lakes in these datasets can lead to the loss of crucial spatial information and introduce noise that impacts the model’s learning process. Moreover, these datasets present challenges of high interclass heterogeneity and complex background information, such as snow, glaciers, and mountains, introducing contextual ambiguity and additional extraction challenges.

Previous approaches have focused on improving the model, while our approach directly improves the dataset to reduce the model’s learning difficulty. We utilize the density-based clustering algorithm DBSCAN [17] and simple morphological operations, such as erosion and dilation, to refine the ground truth. This process creates our benchmark as a supplementary dataset, called the prompt dataset. In contrast to the interactive semantic segmentation model SAM [10], our benchmark is generated directly from the ground truth by simulating human prompt habits, including point, box, and mask prompts. Our benchmark comprises five prompt types, and the number of prompt images matches the number of images in the original dataset’s training set. For point prompts, we generate random points and center points, the latter concentrated in the lake’s central area, with a maximum of 9 points per type. Masks consist of filled masks with complete interiors and unfilled masks with partially unfilled interiors. The box prompt encompasses the entire lake. An example of our benchmark is shown in Fig. 2, and the workflow diagram for creating our benchmark is depicted in Fig. 3. The creation process for each prompt is summarized as follows:

• By applying DBSCAN clustering, we obtain the pixel map (a) of the lakes and randomly select 9 points as the random points prompt. We calculate the centroid of the pixels to generate an image (b) and identify the 9 points closest to the centroid, resulting in the image (c). Finally, a slight shift is applied to the positions of these 9 points to obtain the final center points prompt.

• We select the top-left and bottom-right points on the pixel map (a), generated through DBSCAN clustering, to determine the box prompt.

• We randomly select 0.8% of the lake pixel points from the ground truth, resulting in an image (d). Applying a slight random shift produces an image (e). Next, we generate the unfilled mask prompt using dilation and closing operations. To obtain the final filled mask prompt, we identify the contours of the unfilled mask, resulting in an image (f), and perform pixel-filling operations.
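
The point and box prompts described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a binary ground-truth mask that contains a single lake, so the DBSCAN clustering step is skipped, and the function names and the `max_shift` parameter are hypothetical.

```python
import numpy as np

def lake_pixels(mask):
    """Return the (row, col) coordinates of all lake pixels in a binary mask."""
    return np.argwhere(mask > 0)

def random_points(mask, k=9, rng=None):
    """Pick up to k lake pixels uniformly at random (random points prompt)."""
    rng = np.random.default_rng(0) if rng is None else rng
    coords = lake_pixels(mask)
    idx = rng.choice(len(coords), size=min(k, len(coords)), replace=False)
    return coords[idx]

def center_points(mask, k=9, max_shift=1, rng=None):
    """Pick the k lake pixels nearest the centroid, then shift them slightly."""
    rng = np.random.default_rng(0) if rng is None else rng
    coords = lake_pixels(mask)
    centroid = coords.mean(axis=0)
    dist = np.linalg.norm(coords - centroid, axis=1)
    nearest = coords[np.argsort(dist)[:k]]
    # max_shift is an assumed value; the paper only says "a slight shift".
    shift = rng.integers(-max_shift, max_shift + 1, size=nearest.shape)
    h, w = mask.shape
    return np.clip(nearest + shift, 0, [h - 1, w - 1])

def box_prompt(mask):
    """Return (top, left, bottom, right) of the box enclosing the lake."""
    coords = lake_pixels(mask)
    (r0, c0), (r1, c1) = coords.min(axis=0), coords.max(axis=0)
    return int(r0), int(c0), int(r1), int(c1)

# Toy example: a rectangular "lake" in a 16 x 16 image.
mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:10, 5:12] = 1
print(box_prompt(mask))           # (4, 5, 9, 11)
print(random_points(mask).shape)  # (9, 2)
```

For images with several lakes, each cluster would need its own centroid and box, which is where a clustering step such as DBSCAN comes in.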

Table 1: Ablation studies on the different combinations of prompts. CP, RP, FK, and UFK represent three center points, three random points, a filled mask, and an unfilled mask, respectively.

ID		baseline	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
Point	CP	-	✓	-	-	-	-	✓	✓	✓	-	-	-	-	-	✓	✓	-	-
Point	RP	-	-	✓	-	-	-	-	-	-	✓	✓	✓	-	-	-	-	✓	✓
Box		-	-	-	✓	-	-	✓	-	-	✓	-	-	✓	✓	✓	✓	✓	✓
Mask	FK	-	-	-	-	✓	-	-	✓	-	-	✓	-	✓	-	✓	-	✓	-
Mask	UFK	-	-	-	-	-	✓	-	-	✓	-	-	✓	-	✓	-	✓	-	✓
mIoU $\uparrow$		90.86	91.31	91.53	91.17	91.14	91.13	91.12	90.89	90.93	91.11	90.90	90.91	90.91	90.94	90.88	90.85	90.91	90.90

4 LEPrompter: Lake Extraction Prompter

Inspired by the success of SAM [10], we propose LEPrompter, a two-stage prompt enhancement framework that integrates the prior information in the prompt dataset into lake extraction models. As depicted in Fig. 1, training combines a prompt-based stage with a prompt-free stage, while inference is entirely prompt-free. LEPrompter consists of a lightweight prompt encoder and decoder with only 1.23M learnable parameters and 0.95G Flops. The architecture of LEPrompter is shown in Fig. 4 (a).
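
The stage switch can be illustrated with a small scheduling helper. This is a sketch under stated assumptions: the 50k-step threshold and 160K total iterations are taken from the experiments section, and the function names are hypothetical rather than part of the released code.

```python
def stage_for_step(step, prompt_steps=50_000, total_steps=160_000):
    """Return which training stage a given iteration belongs to."""
    if not 0 <= step < total_steps:
        raise ValueError("step is outside the training schedule")
    return "prompt-based" if step < prompt_steps else "prompt-free"

def select_prompts(step, prompts, training, prompt_steps=50_000):
    """Return the prompts to feed the model, or None when prompts are off.

    Prompts are used only while training and only before the threshold;
    inference never receives prompts.
    """
    if training and step < prompt_steps:
        return prompts
    return None

print(stage_for_step(0))       # prompt-based
print(stage_for_step(50_000))  # prompt-free
print(select_prompts(10, {"points": [(3, 4)]}, training=False))  # None
```

When `select_prompts` returns `None`, only the original lake extraction model runs, which is why inference incurs no extra parameters or computation.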

4.1 Prompt Encoder

The function of the prompt encoder is to extract prompt tokens to assist in training the lake extraction model. It processes points and a single box to obtain sparse prompt tokens (SPT), while the mask is processed to obtain a dense prompt token (DPT). A point is represented as the sum of its positional encoding [18] and a learned embedding. In contrast to SAM, which processes only 1 point per interaction, we extend the maximum number of points to 9. An embedding pair represents a box: the positional encoding of its top-left corner with a learned embedding for the “top-left corner” and a similar structure for the “bottom-right corner.” The DPT is achieved by downscaling using a depth-wise separable convolution [7], followed by a GELU activation function [19]. It corresponds spatially to the image embedding $\mathbf{F}\in\mathbb{R}^{C\times\frac{H}{4}\times\frac{W}{4}}$, which is obtained by upsampling the feature map from the last encoder layer of the vision image encoder (VIE). If there is no point, box or mask prompt, the prompt encoder outputs a learned embedding representing “no point”, “no box”, or “no mask.” The prompt encoder can be expressed mathematically as:

$$\langle\mathbf{O_{s}},\mathbf{O_{d}}\rangle=\mathrm{PE}(\mathbf{Q};\langle\mathbf{P_{p}},\mathbf{P_{b}},\mathbf{P_{m}}\rangle), \tag{1}$$

where $\mathbf{Q}$ denotes the learnable queries, and $\mathbf{P_{p}},\mathbf{P_{b}},\mathbf{P_{m}}\in\mathbb{R}^{1\times H\times W}$ represent the point, box, and mask prompts, respectively. $\mathbf{O_{s}}\in\mathbb{R}^{N_{tokens}\times C}$ and $\mathbf{O_{d}}\in\mathbb{R}^{C\times\frac{H}{4}\times\frac{W}{4}}$ represent the SPT and DPT, where $N_{tokens}\in[1,12]$. $\mathrm{PE}$ represents the prompt encoder.
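
A minimal NumPy sketch of how sparse prompt tokens could be assembled from points and a box is given below. It is an illustration only: random vectors stand in for the learned embeddings, the embedding width `C = 8` is arbitrary, the positional encoding is a generic random Fourier feature map in the spirit of [18], and the token bookkeeping does not attempt to reproduce the exact $N_{tokens}$ range of the paper. The dense (mask) branch is omitted.

```python
import numpy as np

C = 8  # illustrative embedding width (assumption)
rng = np.random.default_rng(0)
B = rng.normal(size=(2, C // 2))        # fixed random projection for Fourier features
point_embed = rng.normal(size=C)        # stand-in for the learned point embedding
corner_embed = rng.normal(size=(2, C))  # stand-ins for top-left / bottom-right embeddings
no_point = rng.normal(size=C)           # stand-in for the learned "no point" embedding
no_box = rng.normal(size=C)             # stand-in for the learned "no box" embedding

def pos_encoding(xy, size):
    """Random Fourier feature encoding of pixel coordinates, shape (N, 2) -> (N, C)."""
    xy = 2.0 * (np.asarray(xy, dtype=float) / size) - 1.0  # normalise to [-1, 1]
    proj = 2.0 * np.pi * xy @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

def sparse_tokens(points=None, box=None, size=256, max_points=9):
    """Build sparse prompt tokens from up to max_points points and one box."""
    tokens = []
    if points is not None and len(points) > 0:
        tokens.append(pos_encoding(points[:max_points], size) + point_embed)
    else:
        tokens.append(no_point[None, :])
    if box is not None:
        corners = np.asarray(box, dtype=float).reshape(2, 2)  # (x0, y0), (x1, y1)
        tokens.append(pos_encoding(corners, size) + corner_embed)
    else:
        tokens.append(no_box[None, :])
    return np.concatenate(tokens, axis=0)

# Three points plus a box give 3 + 2 = 5 tokens of width C.
print(sparse_tokens([(10, 20), (30, 40), (50, 60)], (5, 5, 200, 220)).shape)  # (5, 8)
```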

Table 2: Quantitative comparison of four methods w/o and w/ our proposed approach on the SW, QTPL, CVC-ClinicDB and ISIC2018 datasets. #P and #F denote parameters and GFlops with an image size of 256 × 256 in the prompt-based stage.

Method	Ours	#P $\downarrow$	#F $\downarrow$	SW			QTPL			CVC-ClinicDB			ISIC2018
Method	Ours	#P $\downarrow$	#F $\downarrow$	OA $\uparrow$	F1 $\uparrow$	mIoU $\uparrow$	OA $\uparrow$	F1 $\uparrow$	mIoU $\uparrow$	OA $\uparrow$	F1 $\uparrow$	mIoU $\uparrow$	OA $\uparrow$	F1 $\uparrow$	mIoU $\uparrow$
SegFormer [20]	w/o	3.72	1.59	95.58	93.65	90.75	98.66	98.62	97.27	98.98	97.02	94.33	95.65	93.47	87.98
SegFormer [20]	w/	5.76	3.27	95.91	95.49	91.40	98.69	98.64	97.33	99.12	97.43	95.06	96.15	94.30	89.39
PoolFormer [21]	w/o	15.64	7.67	94.65	94.09	88.90	98.59	98.54	97.13	98.83	96.56	93.50	95.59	93.38	87.83
PoolFormer [21]	w/	17.12	7.68	95.12	94.62	89.84	98.62	98.58	97.20	98.97	97.00	94.29	95.77	93.68	88.32
SegNeXt [22]	w/o	4.26	1.55	95.50	95.05	90.61	98.60	98.55	97.15	98.87	96.73	93.79	95.63	93.47	87.97
SegNeXt [22]	w/	6.13	1.78	95.59	95.14	90.77	98.65	98.60	97.24	99.05	97.22	94.69	95.89	93.81	88.55
LEFormer [8]	w/o	3.61	1.27	95.63	95.19	90.86	98.74	98.69	97.42	98.93	96.91	94.11	95.69	93.64	88.26
LEFormer [8]	w/	4.84	2.22	95.97	95.56	91.53	98.75	98.70	97.44	99.13	97.48	95.15	96.33	94.55	89.82

4.2 Prompt Decoder

We extend the prompt decoder of SAM to combine one to three types of prompt tokens from the prompt encoder with the image embedding from the VIE per interaction, generating the final mask by calculating self- and cross-attention. To reduce the complexity of the self- and cross-attention, we employ the sequence reduction process inspired by PVT [23]. Our prompt decoder design, shown in Fig. 4 (b), consists of two Image-Prompt Transformer Blocks (IPTB). Each IPTB performs four steps: (1) efficient self-attention on the tokens; (2) efficient cross-attention from tokens (as queries) to the image embedding; (3) a point-wise MLP updates each token; (4) efficient cross-attention from the image embedding (as queries) to tokens. This last step updates the image embedding with prompt information. Each operation has a residual connection [6] and a layer normalization. The next IPTB takes the updated $\mathbf{O_{s}}$ and $\mathbf{O_{d}}$ from the previous layer and outputs $\mathbf{O_{d}}$ as an output token, as follows:

$$\mathbf{O_{d}}=\mathrm{Conv}(\mathbf{F}+\mathbf{O_{d}}), \tag{2}$$

$$\langle\mathbf{O_{s}},\mathbf{O_{d}}\rangle=\mathrm{IPTB}(\mathbf{O_{s}},\mathbf{O_{d}}). \tag{3}$$

The output token is then upsampled and concatenated with other image embeddings from the VIE and then fed to the vision image decoder for the final prediction of the lake mask.
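
The four IPTB steps can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: attention is single-head with no learned query/key/value projections, sequence reduction is approximated by average pooling over groups of tokens (PVT uses a learned strided operation), the MLP uses ReLU, and all names and sizes are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def reduce_sequence(x, ratio):
    """Shorten a (N, C) sequence by averaging groups of `ratio` tokens."""
    n, c = x.shape
    if ratio <= 1 or n < ratio:
        return x
    groups = n // ratio  # trailing tokens that do not fill a group are dropped
    return x[: groups * ratio].reshape(groups, ratio, c).mean(axis=1)

def attention(queries, context, ratio=1):
    """Scaled dot-product attention over a (possibly reduced) context."""
    keys = reduce_sequence(context, ratio)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys

def iptb(tokens, image, w1, w2, ratio=4):
    """One Image-Prompt Transformer Block on tokens (N, C) and image (HW, C)."""
    tokens = layer_norm(tokens + attention(tokens, tokens))          # (1) self-attention
    tokens = layer_norm(tokens + attention(tokens, image, ratio))    # (2) tokens -> image
    tokens = layer_norm(tokens + np.maximum(tokens @ w1, 0.0) @ w2)  # (3) point-wise MLP
    image = layer_norm(image + attention(image, tokens))             # (4) image -> tokens
    return tokens, image

rng = np.random.default_rng(0)
C = 16
tokens = rng.normal(size=(5, C))        # e.g. three point tokens and two box-corner tokens
image = rng.normal(size=(64 * 64, C))   # flattened H/4 x W/4 embedding for a 256 x 256 input
w1 = rng.normal(size=(C, 2 * C)) * 0.1
w2 = rng.normal(size=(2 * C, C)) * 0.1
for _ in range(2):                      # the decoder stacks two IPTBs
    tokens, image = iptb(tokens, image, w1, w2)
print(tokens.shape, image.shape)        # (5, 16) (4096, 16)
```

In step (2) the reduction shrinks the 4096 image positions to 1024 keys, which is where most of the attention cost would otherwise be spent.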

5 Experiments

5.1 Experimental Settings

Implementation Details. In this work, we train all models on a single Tesla V100 GPU using the MMSegmentation [24] codebase for 160K iterations on the SW and QTPL datasets with our benchmark. To assess the generalization capability of our proposed approach, we conduct experiments on two additional binary medical image segmentation datasets: CVC-ClinicDB (CVC) [25] and ISIC2018 [26], with image sizes of 288 × 384 and 384 × 384, respectively. We randomly divide these datasets into training and testing sets with a 9:1 ratio. We apply data augmentation techniques, such as random resizing and horizontal flipping. Models are trained with the AdamW optimizer and the cross-entropy loss function with a batch size of 16. The initial learning rate and weight decay are set to $6\times 10^{-5}$ and 0.01, respectively.
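
For orientation, the stated hyperparameters could be written as a config fragment in the style of older (0.x) MMSegmentation configs. This is a sketch only: key names differ between MMSegmentation versions, and the authors' actual config files may set additional options that are not listed in the text.

```python
# Hyperparameters taken from the text; the key layout is an assumption.
optimizer = dict(type="AdamW", lr=6e-5, weight_decay=0.01)
runner = dict(type="IterBasedRunner", max_iters=160_000)
data = dict(samples_per_gpu=16)  # batch size 16 on a single GPU
```
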
Evaluation Metrics. We evaluate our approach against four methods [20, 21, 22, 8], using overall accuracy (OA), F1, and mean Intersection over Union (mIoU) to measure accuracy. We assess efficiency using the number of parameters (Params, M) and floating-point operations (Flops, G).
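
The three accuracy metrics can be computed from the confusion counts of a binary lake/background prediction, as in the sketch below. Averaging F1 and IoU over both classes is an assumption about how the reported numbers are aggregated, and the function assumes both classes occur in the data.

```python
def binary_metrics(pred, truth):
    """Compute OA, class-averaged F1, and mIoU for 0/1 label sequences."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    oa = (tp + tn) / len(truth)
    # Per-class scores: lake first, then background (roles of TP and TN swap).
    iou = [tp / (tp + fp + fn), tn / (tn + fp + fn)]
    f1 = [2 * tp / (2 * tp + fp + fn), 2 * tn / (2 * tn + fp + fn)]
    return {"OA": oa, "F1": sum(f1) / 2, "mIoU": sum(iou) / 2}

pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 1, 1, 0, 0, 0, 0, 0]
print(binary_metrics(pred, truth)["OA"])  # 0.75
```

Here TP = 2, FP = 1, FN = 1 and TN = 4, so the lake IoU is 2/4 and the background IoU is 4/6.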

5.2 Ablation Studies

We conduct our ablation studies on the SW dataset, as it presents more challenges due to higher pixel complexity, interclass heterogeneity and complex background information. All ablation experiments use LEFormer with our approach.
Influence of the prompt-based steps. Fig. 5 (a) shows that the model’s mIoU improves with increasing prompt-based training steps, peaking at 91.24% at 50k steps using a single center point prompt. Beyond this point, additional prompt-based steps may cause the model to over-rely on prompts that are unavailable during prompt-free inference, decreasing mIoU. Overall, a moderate amount of prompt-based training enhances accuracy. We find 50k steps optimal for prompt-based training and apply this setting in subsequent experiments.
Influence of the type and number of prompt points. We evaluate the impact of point prompt types (center vs random points) and prompt point numbers from 0 to 9, as shown in Fig. 5 (b). The model’s mIoU increases with more points, peaking at three points, then slightly decreases. At three points, random points yield the highest mIoU of 91.53%. Thus, we select three random points for subsequent experiments.
Influence of the combination of prompts. We examine five prompts with 17 combinations on our benchmark, as shown in Table 1. Results reveal that most prompt combinations positively affect accuracy, but their effectiveness diminishes with increasing prompt numbers, consistent with the previous experiment. Thus, we select only three random points as the optimal combination of prompts for subsequent experiments.

5.3 Comparison with State-of-the-Art Methods

We evaluate our approach against four methods [20, 21, 22, 8]. Quantitative results on SW, QTPL, CVC, and ISIC2018 datasets are shown in Table 2. Fig. 6 illustrates visualization results for the SW dataset.

Table 2 shows that our approach consistently improves accuracy. Applying our approach to LEFormer yields mIoU values of 91.53%, 97.44%, 95.15% and 89.82% on the SW, QTPL, CVC and ISIC2018 datasets, respectively, corresponding to improvements of 0.67, 0.02, 1.04 and 1.56 percentage points, while adding only 1.23M parameters and 0.95G Flops in the prompt-based stage. Because inference uses only the prompt-free stage, it requires no extra parameters or GFlops and therefore no additional hardware cost. Overall, our approach is an effective auxiliary framework for lake extraction tasks.
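
The reported gains and overheads follow directly from the LEFormer rows of Table 2, as this short check shows. The variable names are illustrative.

```python
# mIoU of LEFormer without / with the proposed approach, copied from Table 2.
leformer_miou = {
    "SW": (90.86, 91.53),
    "QTPL": (97.42, 97.44),
    "CVC-ClinicDB": (94.11, 95.15),
    "ISIC2018": (88.26, 89.82),
}
gains = {name: round(w - wo, 2) for name, (wo, w) in leformer_miou.items()}

# Training-time overhead of the prompt encoder and decoder (LEFormer rows).
extra_params_m = round(4.84 - 3.61, 2)  # parameters, in millions
extra_gflops = round(2.22 - 1.27, 2)    # GFlops at 256 x 256

print(gains)
print(extra_params_m, extra_gflops)
```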

6 Conclusion

In this work, we propose three types of prompt datasets (point, box, and mask) and a two-stage prompt enhancement framework called LEPrompter for lake extraction. These aim to improve accuracy without increasing learning difficulty or incorporating excessive noise information. Our benchmark is created by applying morphological operations to the ground truth in the original dataset. Our baseline consists of a lightweight prompt encoder and decoder, which can be easily integrated into existing lake extraction methods to improve accuracy. Experimental results show that our proposed approach significantly enhances the accuracy of automated lake extraction on two widely used datasets. We believe that this study will facilitate and inspire further research in the field of lake extraction.

7 Acknowledgements

This work is supported in part by the Natural Science Foundation of China under Grant No. 62222606 and 62076238.

References

[1] Jieyu Lu, Yubao Qiu, Xingxing Wang, Wenshan Liang, Pengfei Xie, Lijuan Shi, Massimo Menenti, et al., “Constructing dataset of classified drainage areas based on surface water-supply patterns in high mountain asia,” BIG EARTH DATA, vol. 4, no. 3, pp. 225–241, 2020.
[2] Zhaobin Wang, Xiong Gao, and Yaonan Zhang, “Ha-net: A lake water body extraction network based on hybrid-scale attention and transfer learning,” Remote Sensing, vol. 13, no. 20, pp. 4121, 2021.
[3] Zhihui Tian, Xiaoyu Guo, Xiaohui He, Panle Li, Xijie Cheng, and Guangsheng Zhou, “Mscanet: multiscale context information aggregation network for tibetan plateau lake extraction from remote sensing images,” Int. J. Digit. Earth, vol. 16, no. 1, pp. 1–30, 2023.
[4] Mengjiao Qin, Linshu Hu, Zhenhong Du, Yi Gao, Lianjie Qin, Feng Zhang, and Renyi Liu, “Achieving higher resolution lake area from remote sensing images through an unsupervised deep learning super-resolution method,” Remote Sensing, vol. 12, no. 12, pp. 1937, 2020.
[5] Zhaobin Wang, Xiong Gao, Yaonan Zhang, et al., “Mslwenet: A novel deep learning network for lake water body extraction of google remote sensing images,” Remote Sensing, vol. 12, no. 24, pp. 4140, 2020.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[7] François Chollet, “Xception: Deep learning with depthwise separable convolutions,” in CVPR, 2017, pp. 1251–1258.
[8] Ben Chen, Xuechao Zou, Yu Zhang, Jiayu Li, Kai Li, and Pin Tao, “Leformer: A hybrid cnn-transformer architecture for accurate lake extraction from remote sensing imagery,” arXiv:2308.04397, 2023.
[9] Xin Lyu, Yiwei Fang, Baogen Tong, Xin Li, and Tao Zeng, “Multiscale normalization attention network for water body extraction from remote sensing imagery,” Remote Sensing, vol. 14, no. 19, pp. 4983, 2022.
[10] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al., “Segment anything,” arXiv:2304.02643, 2023.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al., “Language models are few-shot learners,” NeurIPS, vol. 33, pp. 1877–1901, 2020.
[13] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al., “Flamingo: a visual language model for few-shot learning,” NeurIPS, vol. 35, pp. 23716–23736, 2022.
[14] Haotian Liu, Chunyuan Li, Qingyang Wu, et al., “Visual instruction tuning,” arXiv:2304.08485, 2023.
[15] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, et al., “Chain-of-thought prompting elicits reasoning in large language models,” NeurIPS, vol. 35, pp. 24824–24837, 2022.
[16] Keyan Chen, Chenyang Liu, Hao Chen, Haotian Zhang, Wenyuan Li, Zhengxia Zou, and Zhenwei Shi, “Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model,” arXiv:2306.16269, 2023.
[17] Martin Ester, Hans-Peter Kriegel, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, 1996, pp. 226–231.
[18] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, et al., “Fourier features let networks learn high frequency functions in low dimensional domains,” NeurIPS, vol. 33, pp. 7537–7547, 2020.
[19] Dan Hendrycks and Kevin Gimpel, “Bridging nonlinearities and stochastic regularizers with gaussian error linear units,” arXiv:1606.08415, 2016.
[20] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” NeurIPS, vol. 34, pp. 12077–12090, 2021.
[21] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan, “Metaformer is actually what you need for vision,” in CVPR, 2022, pp. 10819–10829.
[22] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu, “Segnext: Rethinking convolutional attention design for semantic segmentation,” NeurIPS, vol. 35, pp. 1140–1156, 2022.
[23] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, et al., “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in ICCV, 2021, pp. 548–558.
[24] MMSegmentation Contributors, “MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020.
[25] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.
[26] Philipp Tschandl et al., “The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,” Scientific Data, vol. 5, pp. 180161, 2018.
