License: CC BY 4.0
arXiv:2311.11319v2 [cs.CV] 30 Jan 2024

GeoSAM: Fine-tuning SAM with Sparse and Dense Visual Prompting for Automated Segmentation of Mobility Infrastructure

Rafi Ibn Sultan1, Chengyin Li1, Hui Zhu1, Prashant Khanduri1, Marco Brocanelli2, Dongxiao Zhu1
1Wayne State University, 2Ohio State University
1{hm4013, gv2145, hq2197, dzhu}@wayne.edu   2[email protected]
Abstract

The Segment Anything Model (SAM) has shown impressive performance when applied to natural image segmentation. However, it struggles with geographical images like aerial and satellite imagery, especially when segmenting mobility infrastructure including roads, sidewalks, and crosswalks. This inferior performance stems from the narrow features of these objects, their textures blending into the surroundings, and interference from objects like trees, buildings, vehicles, and pedestrians - all of which can disorient the model to produce inaccurate segmentation maps. To address these challenges, we propose Geographical SAM (GeoSAM), a novel SAM-based framework that implements a fine-tuning strategy using the dense visual prompt from zero-shot learning, and the sparse visual prompt from a pre-trained CNN segmentation model. The proposed GeoSAM outperforms existing approaches for geographical image segmentation, specifically by 26%, 7%, and 17% for road infrastructure, pedestrian infrastructure, and on average, respectively, representing a momentous leap in leveraging foundation models to segment mobility infrastructure including both road and pedestrian infrastructure in geographical images. The source code can be found on this GitHub repository.

Index Terms:
Semantic Segmentation, Geographical Imagery, SAM, Fine-Tuning, Prompt Generation, Multi-Class Mobility Infrastructure Segmentation

I Introduction

While a substantial amount of research has focused on road infrastructure segmentation from geographical imagery like aerial and satellite images, pedestrian infrastructure such as sidewalks and crosswalks has received comparatively little attention, despite its importance in daily life. Historically, research efforts have predominantly focused on assisting drivers in navigation rather than pedestrians [1, 2]. Therefore, accurately segmenting mobility infrastructure, including both road and pedestrian infrastructure, could provide invaluable information about accessible pedestrian routes and trip locations.

In the realm of pedestrian infrastructure research, rooted in historical context, current methodologies for mobility infrastructure segmentation have evolved to predominantly utilize traditional Convolutional Neural Networks (CNNs) [3, 4, 5] or Vision Transformer (ViT) models [6, 7]. These approaches typically rely on extensive collections of human-labeled data such as roads, sidewalks, and crosswalks for training [1, 8, 9, 10]. A fundamental challenge of these conventional segmentation models lies in their reliance on large datasets of high-quality labeled images, posing obstacles in terms of scalability and adaptability to varied tasks.

In response to the limitations of traditional segmentation models, the rise of vision foundation models [11, 12, 13] represents a big leap in scaling up segmentation models, allowing for powerful zero-shot or few-shot learning capabilities and flexible prompting. Without the need for re-training, these models can quickly adapt to a new downstream task. To tackle the limitations of conventional models, we turn to the Segment Anything Model (SAM) [11], one of the first vision foundation models for image segmentation. With the introduction of SAM, designed with the ambition of segmenting virtually anything in images, the field of image segmentation is rapidly transitioning from traditional CNN or ViT models to pre-trained foundation models.

Zero-shot or few-shot learning and fine-tuning using Parameter Efficient Fine-Tuning (PEFT) are the two primary approaches for leveraging the capabilities of foundation models. Zero-shot learning sets the initial groundwork for a downstream task, utilizing the model without specific contextual information [11, 12]. While this basic utilization has shown early success, zero-shot SAM often struggles to generalize effectively across various downstream tasks [14, 15]. To address this limitation, PEFT fine-tunes a subset (typically a lightweight module) of a large foundation model for a specific downstream task while keeping the majority of parameters frozen, optimizing task performance without full retraining [16, 17, 18, 19, 20].

Considering the task of mobility infrastructure segmentation employed in this work, zero-shot SAM faces significant challenges, particularly in distinguishing sidewalks from roads in aerial imagery. Sidewalks often have very thin borders alongside road borders (as can be seen in Fig. 5) and can have very similar textures to roads, presenting difficulties for zero-shot SAM in accurately differentiating between them. However, SAM has the potential to address this issue, as it has been trained extensively to distinguish between objects, requiring only a minor yet targeted calibration to optimize its inherent capabilities. To address these challenges, we introduce Geographical SAM (GeoSAM), an end-to-end model tailored for segmenting pedestrian infrastructure through binary segmentation of road and pedestrian infrastructure.

As illustrated in Fig. 1, our approach incorporates an automated prompt generation process. We employ sparse prompts generated by a domain-specific CNN encoder and complement them with dense prompts produced by zero-shot SAM to perform additional fine-tuning of SAM using PEFT techniques. Sparse prompts, which are essentially clicks on the image, provide the model with a context for segmentation by indicating where to focus. These sparse prompts are complemented by dense prompts (which are low-quality segmentation maps), which offer the model additional context about the objects to be segmented.

The fine-tuning process plays a crucial role in imparting domain-specific knowledge to SAM. Our approach, which employs zero-shot SAM to create dense prompts, draws inspiration from SAM’s own training methodology, where mask predictions from previous iterations are used as additional dense prompts for subsequent modeling [11]. Conversely, we employ a domain-specific CNN encoder for the segmentation of geographical objects, enabling us to capture precise location information for the generation of sparse prompts specific to mobility infrastructure in the geographical imagery domain.

GeoSAM sets itself apart from previous CNN-based approaches [1, 21, 22, 23], not only excelling in the segmentation of mobility infrastructure through improved accuracy but also showcasing the extended capabilities of foundation models in the field of geographical image analysis.

Our contributions are summarized below:

  • We pioneer the adaptation of the foundation model, SAM, for mobility infrastructure segmentation, a multi-class segmentation problem using geographical imagery, without any human intervention, overcoming the limitations of zero-shot SAM.

  • We develop the fine-tuning and prompting of SAM for geographical imagery, empowering SAM with domain-specific knowledge drawn from the utilization of both sparse and dense prompts.

  • We design and implement a novel automated pipeline to generate both dense prompts from zero-shot learning and sparse prompts from a pre-trained CNN encoder to enhance SAM’s effectiveness and efficiency on the under-performing mobility infrastructure segmentation task.

II Related Work

II-A CNN-based Geographical Image Segmentation

Prior to the emergence of foundation models, CNN-based models like UNet [5] and vision transformer-based [24] works, which follow an encoder-decoder architecture for semantic segmentation, were the standard choice for various geographical segmentation tasks. Simple UNet-based approaches like [21, 25] and more advanced encoder-decoder-based works like [22, 23, 26, 27] are developed to execute various geographical image segmentation tasks. Additionally, there are endeavors like [28, 29, 30], where researchers exploit multiple machine learning techniques to enhance the performance of these CNN-based segmentation models for geographical object segmentation.

Furthermore, Tile2Net [1], also based on CNN, is one of the most established works in mobility infrastructure segmentation in geographical imagery. Their primary focus is mapping sidewalk networks using aerial and satellite imagery, involving segmenting mobility infrastructure elements like roads, sidewalks, and crosswalks. For the semantic segmentation part of their network creation, they train a hierarchical multi-scale attention model [31] and HRNet-W48 [32] from scratch; [33] with object-contextual representations [34] as the backbone. While these diverse efforts contribute significantly to the field of geographical image segmentation, many of them share similar limitations, as they typically require a lot of supervised data for each different task and necessitate retraining. While accuracy improvements are valuable, they do not necessarily address some of the more fundamental issues inherent in geographical image segmentation, i.e., generalizability to new locations.

In addition to conventional CNN models trained from scratch, there have been notable transfer learning-based efforts, exemplified in works like [35, 36], where a model trained on the source task is used to reduce the computational demands for various related downstream tasks. However, in practice, these researchers often need to conduct further fine-tuning or retraining of these models to align them with the precise objectives of their respective tasks. This lack of generalizability leads to these source models being re-trained to achieve competitive performance in the downstream task. These transfer learning-based approaches remain task-specific, in contrast to foundation models, which are designed to be more general and not tied to specific tasks.

II-B SAM-based Geographical Image Segmentation

Vision foundation models, in their essence, aim to address the shortcomings of CNN-based segmentation models by being readily adaptable to segment previously unknown classes for various downstream tasks. SAM [11], a foundation model dedicated to segmentation tasks, has three main components: (i) a ViT-based [24] image encoder that has been trained with over 1 billion masks on 11 million images to compute the image embeddings, (ii) a prompt encoder that takes prompts from users (which tell the model where to focus) and encodes them into prompt embeddings, and (iii) a lightweight mask decoder that generates a segmentation map based on the received image embeddings and prompt embeddings. These prompts can take the form of sparse prompts (such as clicks, bounding boxes, or texts) or dense prompts (mask inputs).

While the application of SAM in the field of geographical imagery is not as extensive as in other domains, it has been utilized in several research endeavors. Considering the zero-shot SAM approaches, several works, including [37, 38, 39, 40], have employed SAM’s zero-shot capabilities for a range of downstream tasks, extending beyond segmentation. Additionally, in [41], a hybrid approach combining both zero-shot and one-shot learning is applied to SAM for segmenting geographical imagery. However, it’s crucial to note that these zero-shot-based approaches primarily target objects with well-defined boundaries and distinguishable physical contexts in their surroundings. In such cases, SAM doesn’t necessitate extensive domain-specific knowledge for accurate segmentation.

When zero-shot SAM encounters difficulties in specific domains, there have been attempts to fine-tune SAM using PEFT techniques. Works in geographical imagery such as [42, 43] explore fine-tuning using diverse PEFT techniques for a range of downstream tasks such as geo-localization and mapping. Beyond geographical imagery, [18, 44, 45] fine-tune a small number of parameters exploiting various PEFT techniques across a variety of natural imagery. However, to the best of our knowledge, no fine-tuning work specifically tailored to geographical images for mobility infrastructure segmentation has been done, a significant under-performing task that may generate tremendous social impact.

While most of these works are based on manual human prompting in the inference stage, there are also works focusing on automating prompt generation, mainly in the medical imaging segmentation domain. Works such as [46, 14] develop automatic prompt generation techniques by replacing the SAM prompt encoder with a trainable network, a process that demands substantial training data. However, auto-prompting in geographical image segmentation represents a non-trivial task due to the lack of curated geographic infrastructure data for prompt generation.

II-C Pre-training Geographical Image Segmentation Foundation Model

Very recently, researchers have also attempted to train domain-specific foundation models using large geographical imagery data sets. For instance, [47] uses plain ViT models with about 0.1 billion parameters to train large vision models tailored to remote sensing tasks and investigate how these large ViT models perform on object detection, scene classification, and semantic segmentation tasks. The work in [48] compares different visions of foundation models with CNN-based fine-tuned models for geographical images and they conclude that the foundation models fall short compared to the CNN-based fine-tuned models. Similar to SAM, [49, 50] develop their own foundation models for geographical imagery based on scaled versions of ViT and CLIP models [12] respectively. In addition to these studies, research such as [51, 52] employs SAM to leverage the capabilities of foundation models in performing various segmentation tasks on geographical imagery. Many of these works focus on road infrastructure segmentation tasks, which, unfortunately, do not specifically address the key issue of pedestrian infrastructure segmentation tasks, i.e., sidewalk/crosswalk segmentation. Additionally, given that they do not provide the source code, assessing their model’s effectiveness in the context of pedestrian infrastructure presents challenges.

III Method

Figure 1: Training GeoSAM, an automated mobility infrastructure segmentation pipeline. In Prompts Generation (orange arrows), the model generates the sparse and dense prompts with the help of a secondary CNN-based geographical image encoder. Sparse prompts are generated automatically from the output of the secondary model and the soft mask is generated by the SAM mask decoder which is used as the dense prompts. In Fine-tuning (blue arrows), the prompts generated from the previous stage are used in the tunable decoder to produce the mask.

This section is organized to provide an overview of SAM at first followed by a discussion of the training and inference phases of GeoSAM.

III-A SAM: Background

SAM comprises three elements: an image encoder (referred to as $\operatorname{Enc_I}$), a prompt encoder (referred to as $\operatorname{Enc_P}$), and a mask decoder (referred to as $\operatorname{Dec_M}$). SAM, designed as a model that can be prompted, accepts an image, denoted as $I$, and a collection of prompts, known as $P$. These prompts can represent a point, a box, or a dense mask. In its operation, SAM initially employs $\operatorname{Enc_I}$ to extract features from the input image. It then utilizes $\operatorname{Enc_P}$ to transform the human-provided prompts, which have a length of $k$, into prompt tokens. Specifically:

$$F_I = \operatorname{Enc_I}(I), \quad T_P = \operatorname{Enc_P}(P). \tag{1}$$

In Equation 1, $F_I$ is the feature embedding of the input image where $F_I \in \mathbb{R}^{h \times w \times c}$, $h$ and $w$ represent the resolution of the image feature map, and $c$ denotes the feature dimension. Likewise, $T_P$ is the feature embedding of the prompts where $T_P \in \mathbb{R}^{k \times c}$, and $k$ is the length of the prompts.

Following this, the encoded image and prompts are supplied to the decoder, called $\operatorname{Dec_M}$, which employs attention-based mechanisms for feature interaction. SAM prepares the input tokens for the decoder by merging several mask tokens, denoted as $T_M$, with the prompt tokens $T_P$. These mask tokens play a crucial role in generating the mask output, which is defined as:

$$S = \operatorname{Dec_M}\left(F_I, \operatorname{Concat}(T_M, T_P)\right), \tag{2}$$

where $S$ in Equation 2 represents the output segmentation mask predicted by SAM.
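To make the shapes in Equations 1 and 2 concrete, the following NumPy sketch traces the tensors through a stand-in pipeline. The encoders and the decoder are random placeholders, not SAM's actual networks, and the values of $h$, $w$, $c$, $k$, and the number of mask tokens are illustrative assumptions rather than values stated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c = 64, 64, 256   # assumed feature-map resolution and feature dimension
k, m = 5, 4             # assumed number of prompt tokens and mask tokens

F_I = rng.standard_normal((h, w, c))   # stand-in for Enc_I(I)
T_P = rng.standard_normal((k, c))      # stand-in for Enc_P(P)
T_M = rng.standard_normal((m, c))      # mask tokens

# Concat(T_M, T_P): the decoder's input token sequence, shape (m + k, c).
tokens = np.concatenate([T_M, T_P], axis=0)

def toy_decoder(features, tokens):
    """Placeholder for Dec_M: a dot product between the first mask token
    and every pixel feature. The real decoder uses attention layers."""
    return features @ tokens[0]        # (h, w) map of mask logits

S = toy_decoder(F_I, tokens)
print(tokens.shape, S.shape)
```

The point of the sketch is only the bookkeeping: prompts and mask tokens share the feature dimension $c$, so they can be concatenated along the token axis before interacting with the image features.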

III-B GeoSAM: Training Strategy

III-B1 Prompts Generation

In the proposed framework, GeoSAM has employed a system that automatically generates both sparse and dense prompts to offer contextual information for the segmentation task.

Figure 2: Sparse prompts generated based on segmentation maps created by the pre-trained CNN image encoder. Here, the foreground class is the sidewalk/crosswalk, blue and red circles represent foreground and background clicks respectively.

Sparse Prompts Generation In the sparse prompts generation process, GeoSAM implements a system that randomly creates click points, known as sparse prompts, on both the foreground and background of the target object. GeoSAM relies on a pre-trained CNN encoder from this domain to create these sparse prompts. Tile2Net [1], an established CNN-based model in this area, has been selected for this purpose. Tile2Net produces a multi-class segmentation map with four different classes, three for the foreground (roads, sidewalks, crosswalks) and one for the background. We refer to the segmentation map generated from the pre-trained CNN encoder as the "pseudo labels". The pseudo labels differ somewhat from our specific requirements, as our task involves dealing with two types of foreground classes (road and pedestrian infrastructure) in addition to the background. Therefore, GeoSAM makes certain modifications to the pseudo labels before employing them in the generation of sparse prompts.

GeoSAM, based on the Segment Anything Model (SAM), primarily a binary-class segmentation model, requires adjustments to suit the multi-class segmentation task. In training and evaluating, GeoSAM creates multi-channel segmentation maps for road and pedestrian segmentation respectively. Each channel utilizes pseudo labels generated from Tile2Net to provide the model with foreground and background information.

GeoSAM categorizes pixel values of these pseudo labels into foreground (such as pixels of roads from pseudo labels for road infrastructure, or sidewalks and crosswalks pixels for pedestrian infrastructures) and background pixels, aligning with the binary class segmentation approach of SAM. From these foreground and background sets of pixels, GeoSAM randomly selects a number of foreground and background pixels and obtains their coordinates to use as sparse prompts or clicks on the input image (see the ablation study for details in Table II). Fig. 2 visually illustrates the method of generating random sparse prompts using Tile2Net’s pseudo labels. This example focuses on the sparse prompts generation for pedestrian infrastructure. For the sake of clarity, the figure depicts fewer points than are actually used.
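The sampling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the integer encoding of the pseudo-label classes, the function name `sample_sparse_prompts`, and the number of points are all assumptions made for the example.

```python
import numpy as np

# Assumed label encoding for the pseudo labels (hypothetical, for illustration).
BACKGROUND, ROAD, SIDEWALK, CROSSWALK = 0, 1, 2, 3

def sample_sparse_prompts(pseudo_label, foreground_classes, n_points, rng):
    """Return (coords, labels): coords are (x, y) pixel positions,
    labels are 1 for foreground clicks and 0 for background clicks."""
    fg_mask = np.isin(pseudo_label, foreground_classes)
    coords, labels = [], []
    for mask, label in ((fg_mask, 1), (~fg_mask, 0)):
        rows, cols = np.nonzero(mask)
        if rows.size == 0:            # this group is absent from the tile
            continue
        n = min(n_points, rows.size)
        idx = rng.choice(rows.size, size=n, replace=False)
        coords.append(np.stack([cols[idx], rows[idx]], axis=1))  # (x, y) order
        labels.append(np.full(n, label))
    if not coords:
        return np.empty((0, 2), dtype=int), np.empty(0, dtype=int)
    return np.concatenate(coords), np.concatenate(labels)

# Toy pseudo label: a road strip, a sidewalk column, and a crosswalk.
rng = np.random.default_rng(0)
pseudo = np.zeros((8, 8), dtype=int)
pseudo[:, 2:4] = ROAD
pseudo[:, 5] = SIDEWALK
pseudo[3, 2:4] = CROSSWALK

# Pedestrian-infrastructure channel: sidewalks and crosswalks are foreground.
points, point_labels = sample_sparse_prompts(pseudo, [SIDEWALK, CROSSWALK], 3, rng)
```

Calling the same function with `[ROAD]` as the foreground set would produce the clicks for the road channel, which mirrors the per-channel binary treatment described in the text.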

Dense Prompts Generation GeoSAM includes an automated system for generating dense prompts using zero-shot SAM, complementing the sparse prompts for better context definition. These dense prompts act as a soft mask, essentially an unthresholded, lower-quality prediction map. The zero-shot SAM generates a soft mask with continuous values, reflecting the model’s confidence in pixel classification, and offering more detailed semantic information than a binary mask, which results from applying a threshold.

The generation of dense prompts begins by feeding the target image and its feature embeddings, obtained from the pre-trained CNN image encoder, into the model. These embeddings initially serve as the dense prompts for SAM’s prompt encoder, as depicted in Fig. 1. Typically, SAM’s prompt encoder accepts dense prompts in the format of ($64 \times 64 \times 256$), where 256 denotes the channel dimension, and ($64 \times 64$) represents the spatial dimensions (height and width). To adapt the feature embeddings to this format, GeoSAM resizes them to this dimension. This involves applying a $1 \times 1$ convolution to convert the channel dimensions of the embeddings to 256, followed by resizing their height and width to $64 \times 64$ using average pooling, as shown in Fig. 3. Once resized, these embeddings are ready to be input into the prompt encoder. Using the outputs produced by the image encoder and prompt encoder, the decoder of zero-shot SAM generates the output, i.e., the soft mask (shown in Fig. 1), effectively creating self-generated dense prompts for the model.
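The resizing step (a $1 \times 1$ convolution followed by average pooling) can be illustrated with plain NumPy. The sketch below assumes a channel-first layout, a CNN embedding of size $48 \times 128 \times 128$, and spatial dimensions that are multiples of 64; none of these are stated in the text, and the projection weights are random here whereas in practice they would be learned.

```python
import numpy as np

def resize_embedding(features, weight, out_hw=64):
    """features: (C_in, H, W) CNN embedding; weight: (256, C_in) kernel of a
    1x1 convolution. Returns an array of shape (256, out_hw, out_hw)."""
    c_in, H, W = features.shape
    if H % out_hw or W % out_hw:
        raise ValueError("this sketch assumes H and W are multiples of out_hw")
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    projected = np.einsum("oc,chw->ohw", weight, features)
    # Non-overlapping average pooling down to out_hw x out_hw.
    kh, kw = H // out_hw, W // out_hw
    return projected.reshape(-1, out_hw, kh, out_hw, kw).mean(axis=(2, 4))

rng = np.random.default_rng(0)
features = rng.standard_normal((48, 128, 128))   # assumed CNN embedding size
weight = rng.standard_normal((256, 48)) * 0.1    # random stand-in for learned weights
dense_prompt = resize_embedding(features, weight)
print(dense_prompt.shape)
```

Projecting channels before pooling, as the text describes, and pooling before projecting give the same result here because both operations are linear; the order only affects the amount of computation.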

III-B2 Fine-Tuning

Figure 3: Resizing the feature embeddings of the pre-trained CNN encoder to generate the dense prompts for SAM.

During this phase of training, the model is simultaneously fed with both dense prompts and sparse prompts, which act as inputs into the previously mentioned prompt encoder. Following this, the SAM mask decoder, using both the prompts and the input image, generates a multi-channel segmentation map that encompasses all the classes. At this point, the model calculates the loss by comparing the output segmentation maps to the multi-channel ground truth images, produced through one-hot encoding. GeoSAM then begins updating the decoder’s parameters in this phase while keeping the other parameters fixed, starting the fine-tuning process using PEFT. It’s important to highlight that GeoSAM doesn’t make any modifications to the model’s parameters before this point, following the standard PEFT strategy.
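The one-hot encoding of the ground truth mentioned above can be sketched as follows. The class indices (0 for background, 1 for road, 2 for pedestrian infrastructure) and the presence of a background channel are assumptions made for illustration; the text does not specify the channel layout.

```python
import numpy as np

def one_hot(label_map, num_classes):
    """(H, W) integer labels -> (num_classes, H, W) binary channels."""
    classes = np.arange(num_classes)[:, None, None]
    return (classes == label_map[None]).astype(np.float32)

# Toy ground truth: 0 = background, 1 = road, 2 = pedestrian (assumed encoding).
ground_truth = np.array([[0, 1, 1, 2],
                         [0, 1, 1, 2],
                         [0, 0, 0, 2]])
target = one_hot(ground_truth, num_classes=3)
print(target.shape)
```

Each pixel is active in exactly one channel, so the encoded target can be compared channel by channel with the multi-channel output of the decoder.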

For the loss function to fine-tune the decoder, we opt for a combination of Dice Loss [53] and Focal Loss [54]. The Dice Loss is based on the Dice Similarity Coefficient (DSC), a popular metric that uses the overlap between two regions to evaluate the accuracy of a segmentation algorithm. The Dice Loss is the complement of the Dice coefficient, so minimizing the Dice Loss during training is equivalent to maximizing the Dice coefficient. It can be expressed as

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_{c=1}^{C}\sum_{i=1}^{N} s_i^c g_i^c}{\sum_{c=1}^{C}\sum_{i=1}^{N} s_i^c + \sum_{c=1}^{C}\sum_{i=1}^{N} g_i^c}, \tag{3}$$

where $g_i^c$ represents the ground truth binary indicator of class label $c$ for the pixel $i$, and $s_i^c$ denotes the corresponding predicted segmentation probability.
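As a quick check of Equation 3, the Dice Loss can be written directly in NumPy. This is a minimal sketch: the small constant `eps` is added here only to avoid division by zero on empty inputs and is not part of the equation.

```python
import numpy as np

def dice_loss(s, g, eps=1e-8):
    """Equation 3. s: predicted probabilities, g: one-hot ground truth,
    both of shape (C, N) for C classes and N pixels."""
    intersection = (s * g).sum()
    return 1.0 - 2.0 * intersection / (s.sum() + g.sum() + eps)

g = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
perfect = dice_loss(g, g)          # prediction equals ground truth
worst = dice_loss(1.0 - g, g)      # prediction is the exact complement
print(perfect, worst)
```

A perfect prediction drives the loss to (approximately) 0 and a fully disjoint prediction gives 1, which matches the interpretation of the loss as the complement of the Dice coefficient.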

Focal Loss is a weighted Cross Entropy Loss designed to focus on hard-to-classify examples while down-weighting easy-to-classify examples. In addition to the previous notations, Focal Loss also introduces $\alpha$ and $\gamma$ as the balancing and focusing parameters, respectively. The balancing factor $\alpha$ assigns different weights to different classes to provide more importance to the minority class whereas the focusing parameter $\gamma$ affects how much the loss is focused on hard-to-classify examples [55]. Parameters $\alpha$ and $\gamma$ are combined with the basic Cross Entropy Loss to get the Equation of Focal Loss:

$$\mathcal{L}_{\text{Focal}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \alpha_c \left[\left(1 - s_i^c\right)^{\gamma} g_i^c \log(s_i^c)\right]. \tag{4}$$
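Equation 4 can likewise be sketched in NumPy. The default $\gamma = 2$ and the example class weights below are illustrative assumptions, not values reported in this section, and the clipping constant `eps` is added only to keep the logarithm finite.

```python
import numpy as np

def focal_loss(s, g, alpha, gamma=2.0, eps=1e-8):
    """Equation 4. s: predicted probabilities, g: one-hot ground truth,
    both of shape (C, N); alpha: per-class weights of shape (C,)."""
    n_pixels = s.shape[1]
    s = np.clip(s, eps, 1.0)                       # keep log(s) finite
    terms = alpha[:, None] * (1.0 - s) ** gamma * g * np.log(s)
    return -terms.sum() / n_pixels

s = np.array([[0.9, 0.2],
              [0.1, 0.8]])
g = np.array([[1.0, 0.0],
              [0.0, 1.0]])
alpha = np.array([1.0, 1.0])

plain_ce = focal_loss(s, g, alpha, gamma=0.0)   # gamma = 0 recovers cross entropy
focused = focal_loss(s, g, alpha, gamma=2.0)    # easy pixels are down-weighted
print(plain_ce, focused)
```

With $\gamma = 0$ and unit weights the expression reduces to the ordinary Cross Entropy Loss; increasing $\gamma$ shrinks the contribution of pixels that are already classified with high confidence.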

Finally, both losses are summed with equal weights to get the overall Dice Focal Loss used for fine-tuning GeoSAM:

$$\mathcal{L}_{\text{DiceFocal}} = \mathcal{L}_{\text{Dice}} + \mathcal{L}_{\text{Focal}}. \tag{5}$$

To assess the performance of geographical object segmentation, we employ two metrics: Intersection over Union (IoU) and average precision (AP). The IoU score is the ratio of the area of overlap between the predicted and ground truth regions to the area of their union. A value of 0 indicates no overlap, while a value of 1 indicates perfect overlap between the predicted and ground truth regions. Similarly, the AP measures the accuracy of the model in classifying each pixel, considering both the precision (the proportion of positive predictions that are correct) and the recall (the ability of the model to find all the relevant cases) across different threshold levels.
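Both metrics can be sketched for a single class with NumPy. The IoU follows the definition above directly. For AP, the text does not specify the exact computation, so the sketch uses one common choice, the area under the step-wise precision-recall curve obtained by sweeping the threshold over the predicted scores; this choice and the handling of empty masks are assumptions.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0    # convention chosen here: two empty masks match perfectly
    return np.logical_and(pred, gt).sum() / union

def average_precision(scores, gt):
    """Mean of the precision values at the rank of each positive pixel
    (tied scores are not grouped in this sketch)."""
    order = np.argsort(-scores.ravel(), kind="stable")
    hits = gt.ravel().astype(bool)[order]
    if hits.sum() == 0:
        return 0.0
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, hits.size + 1)
    return precision[hits].mean()

pred_mask = np.array([[1, 1], [0, 0]])
gt_mask = np.array([[1, 0], [1, 0]])
scores = np.array([[0.9, 0.8], [0.7, 0.1]])

print(iou(pred_mask, gt_mask), average_precision(scores, gt_mask))
```

In the toy example one pixel is shared and three pixels are covered by either mask, so the IoU is 1/3; the two positive pixels are ranked first and third by score, so the AP is the mean of 1 and 2/3.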

III-C The End-to-End Inference Pipeline

During the inference stage, GeoSAM utilizes the fine-tuned decoder obtained during training, along with sparse prompts and dense prompts, to automatically generate a multi-class segmentation map from input geographical images. This end-to-end pipeline mirrors the approach used during training, where the pre-trained CNN encoder provides sparse prompts, and zero-shot SAM contributes dense prompts. GeoSAM processes these prompts with the fine-tuned decoder to generate the final segmentation map output. The generated maps are then post-processed to refine them for more practical use (see Section IV-C1). This inference pipeline ensures that the model is capable of segmenting both road and pedestrian infrastructure simultaneously in geographical images without any human intervention.

IV Experiments

IV-A Datasets

We denote the training dataset as $D_{\text{train}} = \{S_{\text{train}}, G_{\text{train}}\}$, where $S_{\text{train}} = \{s_1, \ldots, s_n\}$ and $G_{\text{train}} = \{g_1, \ldots, g_n\}$ correspond to the $n$ sample images and their segmentation ground truth images (which contain masks of road and pedestrian infrastructure). Similarly, we denote the test dataset as $D_{\text{test}} = \{S_{\text{test}}, G_{\text{test}}\}$. We also have a separate dataset to test the model’s generalizability that we denote as $D_{\text{gen}} = \{S_{\text{gen}}, G_{\text{gen}}\}$.

For these datasets, we use high-resolution orthorectified imagery, which can be defined as a corrected form of raw aerial images, along with corresponding GIS (Geographic Information System) data to create annotation labels or ground truth images. These orthorectified images are sourced from the U.S. Geological Survey (USGS) [56], a body dedicated to the study and mapping of Earth’s natural resources and geological features.

We follow the methodology described in [1] to download orthorectified image tiles and create ground truth images. This involves using specific geographical coordinates to define the bounding box of an area and selecting appropriate zoom levels for the imagery (at zoom level 0, a single image tile represents the entire world). The ground truth images are generated using the same geographical coordinates and their corresponding GIS data. For this purpose, we employ publicly accessible planimetric GIS data from Washington DC [57, 58, 59] and Cambridge, MA [60, 61, 62], focusing on urban elements like roads, sidewalks, footpaths, and crosswalks. By utilizing these features, we can create ground truth images that distinctly categorize two types of mobility infrastructure classes: roads and pedestrian infrastructure.
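As an illustration of how a bounding box and a zoom level determine which tiles to download, the following is a minimal sketch that assumes the standard Web Mercator (XYZ) tiling scheme, in which zoom level 0 is a single tile covering the world and each additional level doubles the number of tiles per side. The function names `latlon_to_tile` and `tiles_in_bbox` are hypothetical and are not taken from [1]; the actual download tooling may differ.

```python
import math


def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) index of the XYZ tile containing a coordinate.

    Assumes the standard Web Mercator tiling scheme: 2**zoom tiles per side.
    """
    n = 2 ** zoom
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp to the valid index range (matters only at the map edges).
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def tiles_in_bbox(lat_a, lon_a, lat_b, lon_b, zoom):
    """List all tile indices covering a bounding box given by two corners."""
    # Tile y grows southward, so the northern latitude gives the smaller y.
    x0, y0 = latlon_to_tile(max(lat_a, lat_b), min(lon_a, lon_b), zoom)
    x1, y1 = latlon_to_tile(min(lat_a, lat_b), max(lon_a, lon_b), zoom)
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


# Example: tiles covering the training region's bounding box at zoom level 20.
train_tiles = tiles_in_bbox(38.905788, -77.045019, 38.90968, -77.019694, 20)
print(len(train_tiles), "tiles")
```

At zoom level 0 the same call returns the single tile (0, 0), which matches the statement that one tile represents the entire world.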

In training and evaluating the GeoSAM model, we use two separate regions within Washington DC. Further, to test the model’s ability to generalize, we introduce image tiles from a different city, Cambridge, that is not included in the training set. This approach demonstrates the model’s effectiveness in interpreting and processing new, unseen environments, supporting its robustness and applicability across diverse geographical locations. The input images for GeoSAM are prepared at a resolution of (1024, 1024), with a zoom level of 20. This requires stitching adjacent image tiles, with the base image size dictating the number of neighboring tiles to be merged. The details of these three datasets are consolidated in Table I.
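The stitching step can be sketched as follows, assuming the tiles of one input image are available as arrays in row-major order and that the number of tiles stitched is counted per side (so two 512-pixel tiles or four 256-pixel tiles per side both yield a 1024-pixel image). The function name `stitch_tiles` is hypothetical and only illustrates the idea.

```python
import numpy as np


def stitch_tiles(tiles, tiles_per_side):
    """Merge a square grid of equally sized tiles into one image.

    tiles: list of (H, W, C) arrays in row-major order (left to right,
    then top to bottom); tiles_per_side: grid size along each axis.
    """
    if len(tiles) != tiles_per_side ** 2:
        raise ValueError("expected tiles_per_side**2 tiles")
    rows = []
    for r in range(tiles_per_side):
        row_tiles = tiles[r * tiles_per_side:(r + 1) * tiles_per_side]
        rows.append(np.concatenate(row_tiles, axis=1))  # join left to right
    return np.concatenate(rows, axis=0)  # join top to bottom


# Example: four 512x512 tiles become one 1024x1024 input image.
demo_tiles = [np.full((512, 512, 3), i, dtype=np.uint8) for i in range(4)]
image = stitch_tiles(demo_tiles, 2)
print(image.shape)
```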

TABLE I: Summary of the three datasets used. The first two datasets are from regions of Washington DC (used for training and evaluation respectively), and the third region is from Cambridge, MA to test generalizability.
| Dataset | Geographical coordinates | Base image size | # of image tiles stitched | Area ($km^2$) | Total # of images |
|---|---|---|---|---|---|
| $D_{\text{train}}$ | 38.905788, -77.045019, 38.90968, -77.019694 | (512, 512) | 2 | 1.97 | 560 |
| $D_{\text{test}}$ | 38.8968333, -77.0074118, 38.906958, -76.988948 | (512, 512) | 2 | 1.04 | 296 |
| $D_{\text{gen}}$ | 42.360067, -71.144373, 42.395258, -71.051704 | (256, 256) | 4 | 30.18 | 2380 |

IV-B Implementation Details

Figure 4: Postprocessing operations on the two classes (viewed separately). Each of the columns represents a single randomly picked image from the testing dataset with two tasks: (a) road infrastructure segmentation, and (b) pedestrian infrastructure segmentation. Here, GT = ground truth.

Figure 5: Comparative Qualitative Segmentation Results: GeoSAM vs. Tile2Net and Zero-Shot SAM. Different colors indicate distinct classes in the multi-class output. Each row displays a randomly selected image from the test dataset. The final two columns illustrate GeoSAM’s performance in a single-task scenario.

As mentioned earlier, we treat the objective as a multi-class segmentation task in which the classes of interest are roads and sidewalks/crosswalks. We adopt ViT-H [24] as SAM's image encoder and initialize the model with the pre-trained SAM ViT-H weights. Following the settings of the original SAM paper [11], we use the AdamW optimizer [63] ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with an initial learning rate of $1\times10^{-5}$ and a weight decay of 0.1; no data augmentation is applied. After experimenting with various values, we use a uniform $\alpha$ for all classes and set $\gamma = 2$ in the loss function. To adapt the learning rate during training, we employ a cosine annealing scheduler under which the learning rate decays smoothly from its maximum to a minimum value of $1\times10^{-7}$ over the course of training. Tile2Net’s semantic segmentation component, the Hierarchical Multi-Scale Attention model [31], serves as the pre-trained image encoder and is initialized with the publicly available pre-trained weights released by the Tile2Net team [1].
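The learning rate schedule can be sketched with the standard cosine annealing formula, $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$. The maximum and minimum values below are the ones reported above; stepping the schedule once per epoch over 100 epochs is an assumption, since the text does not state the step granularity, and the function name `cosine_annealing_lr` is hypothetical.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-5, lr_min=1e-7):
    """Learning rate at `step` under cosine annealing from lr_max to lr_min."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


# Assumption: one scheduler step per epoch over 100 training epochs.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, cosine_annealing_lr(epoch, 100))
```

The rate starts at the maximum, passes through the midpoint of the two values halfway through training, and reaches the minimum at the final step.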

All experiments are run on an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, using Python 3.10.9. We train GeoSAM and the baseline models for 100 epochs each. When training GeoSAM, we follow parameter-efficient fine-tuning (PEFT) practice: the parameters of the encoders (both the image encoder and the prompt encoder) are kept frozen, and only the parameters of the decoder are updated.

To compare GeoSAM with popular benchmark semantic segmentation models (both CNN-based and ViT-based) in Table II, we train each model from scratch with its default settings on the training dataset $D_{\text{train}}$ (described in Section IV-A). Following GeoSAM’s training strategy, the benchmark models are trained with MONAI, an open-source Python framework [64], without any preprocessing or postprocessing for a fair comparison. At inference, we evaluate these models on the same test ($D_{\text{test}}$) and generalizability ($D_{\text{gen}}$) datasets (described in Section IV-A) to obtain the intersection over union (IoU) and average precision (AP) of each class, as well as the mean intersection over union (mIoU) and mean average precision (mAP).
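For reference, per-class IoU and mIoU on integer label maps can be computed as in the sketch below. The label convention (0 = background, 1 = road, 2 = pedestrian infrastructure) and the function names are assumptions made for illustration; whether scores are aggregated per image or over the whole dataset is not specified here, and AP is omitted because it additionally requires per-pixel confidence scores.

```python
import numpy as np


def class_iou(pred, target, class_id):
    """IoU of one class between predicted and ground-truth label maps."""
    p = np.asarray(pred) == class_id
    t = np.asarray(target) == class_id
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both maps
    return float(np.logical_and(p, t).sum() / union)


def mean_iou(pred, target, class_ids=(1, 2)):
    """Mean IoU over the given classes, ignoring classes absent from both maps."""
    return float(np.nanmean([class_iou(pred, target, c) for c in class_ids]))


# Toy example with assumed labels: 0 = background, 1 = road, 2 = pedestrian.
pred = np.array([[1, 1, 0], [2, 2, 0]])
target = np.array([[1, 0, 0], [2, 2, 2]])
print(class_iou(pred, target, 1), class_iou(pred, target, 2), mean_iou(pred, target))
```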

IV-C Results

IV-C1 Qualitative Results

TABLE II: GeoSAM Evaluation Results in IoU and AP Against Benchmark Models ("Ped." for Pedestrian, "Infras." for Infrastructure). Washington DC Used for Testing and Cambridge, MA Used for Evaluating Generalizability. Top Results in Bold.
| Method | DC IoU Road Infras. | DC IoU Ped. Infras. | DC mIoU | DC AP Road Infras. | DC AP Ped. Infras. | DC mAP | MA IoU Road Infras. | MA IoU Ped. Infras. | MA mIoU | MA AP Road Infras. | MA AP Ped. Infras. | MA mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeoSAM (Ours) | 0.76 | 0.45 | 0.60 | 0.67 | 0.45 | 0.56 | 0.61 | 0.27 | 0.44 | 0.55 | 0.25 | 0.40 |
| Zero-shot SAM [11] | 0.44 | 0.24 | 0.34 | 0.21 | 0.15 | 0.18 | 0.22 | 0.12 | 0.17 | 0.09 | 0.13 | 0.11 |
| Tile2Net* [1] | 0.60 | 0.42 | 0.51 | 0.54 | 0.38 | 0.46 | - | - | - | - | - | - |
| UNet [5] | 0.45 | 0.17 | 0.31 | 0.44 | 0.22 | 0.33 | 0.24 | 0.12 | 0.18 | 0.11 | 0.05 | 0.08 |
| AttUNet [4] | 0.47 | 0.21 | 0.34 | 0.43 | 0.21 | 0.32 | 0.25 | 0.11 | 0.18 | 0.12 | 0.06 | 0.09 |
| UNet++ [3] | 0.61 | 0.30 | 0.45 | 0.54 | 0.32 | 0.43 | 0.24 | 0.08 | 0.16 | 0.12 | 0.04 | 0.08 |
| UNETR [65] | 0.48 | 0.20 | 0.34 | 0.50 | 0.27 | 0.38 | 0.25 | 0.11 | 0.18 | 0.12 | 0.05 | 0.08 |
| Swin UNETR [66] | 0.63 | 0.26 | 0.44 | 0.57 | 0.29 | 0.43 | 0.13 | 0.09 | 0.11 | 0.09 | 0.04 | 0.06 |
DC = Washington, DC; MA = Cambridge, MA.
*Tile2Net values for Cambridge, MA are excluded because Cambridge is included in its training data, which would compromise the fairness of the generalizability assessment.

We apply postprocessing after the initial segmentation maps are generated by the SAM mask decoder, using morphological erosion and dilation [67] to further improve the results. In our task, the priority is path creation rather than the classification of individual pixels, so we place greater emphasis on identifying and delineating all available paths.

In the generated segmentation maps, which represent real-world pedestrian paths, the existence of scattered isolated regions indicates inaccuracies in the model’s performance. To rectify this issue, we employ an erosion technique to eliminate such isolated regions. Additionally, when the segmentation map displays abrupt interruptions or gaps in connected paths, it indicates failure in accurately segmenting the entire path. Therefore, we utilize dilation to address these issues by establishing connections between disjointed regions, bringing the map closer to the ground truth representation.

The equations for erosion and dilation are given below, where $(x,y)$ is a pixel in the image, $B(i,j)$ is the structuring element (the mask used for the operation), $(i,j)$ are the coordinates within the structuring element, $\bigcap$ is the intersection operation, and $\bigcup$ is the union operation:

$E(x,y)=\bigcap_{(i,j)\in B} I(x+i,\,y+j),$ (6)
$D(x,y)=\bigcup_{(i,j)\in B} I(x+i,\,y+j).$ (7)

To perform these operations, we use a (10×10) filter on the (1024×1024) segmentation map. The filter is passed over the whole segmentation map and performs erosion and dilation, respectively, in the regions it covers. We run these operations for a total of 10 iterations to obtain a better-refined map. Fig. 4 demonstrates the effect of postprocessing, with each class visualized separately to make the impact easier to see. For instance, isolated regions are removed and connections are established where a path should exist. We plot both the initial output and the postprocessed output to show the difference in quality.
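Eqs. (6) and (7) can be sketched for a binary mask as follows, with intersection implemented as a logical AND and union as a logical OR over a square $k \times k$ window. This is a minimal illustration rather than the implementation used here: the border handling (padding with foreground for erosion and background for dilation), the function names, and the ordering of erosion before dilation in the example are all assumptions.

```python
import numpy as np


def _morph(mask, k, erosion):
    """Apply one k x k erosion (AND) or dilation (OR) to a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    lo, hi = k // 2, k - 1 - k // 2
    # Assumption: pad so that the image border neither erodes nor dilates.
    padded = np.pad(mask, ((lo, hi), (lo, hi)), constant_values=erosion)
    out = np.full((h, w), erosion, dtype=bool)
    for di in range(k):
        for dj in range(k):
            window = padded[di:di + h, dj:dj + w]  # I(x + i, y + j)
            out = (out & window) if erosion else (out | window)
    return out


def erode(mask, k=10, iterations=1):
    """Eq. (6): a pixel survives only if the whole window is foreground."""
    for _ in range(iterations):
        mask = _morph(mask, k, erosion=True)
    return mask


def dilate(mask, k=10, iterations=1):
    """Eq. (7): a pixel is set if any pixel in the window is foreground."""
    for _ in range(iterations):
        mask = _morph(mask, k, erosion=False)
    return mask


# Toy example with a 3x3 window: erosion removes an isolated pixel,
# and the following dilation restores the surviving region.
toy = np.zeros((7, 7), dtype=bool)
toy[2:5, 2:5] = True   # a solid region
toy[0, 6] = True       # an isolated false positive
cleaned = dilate(erode(toy, k=3), k=3)
print(cleaned.astype(int))
```

The filter size and iteration count reported above correspond to `k=10` and `iterations=10`; the toy example uses a smaller window so that the effect is visible on a 7×7 mask.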

Fig. 5 shows the qualitative results achieved by GeoSAM on several randomly picked images from the test dataset. Because the objective is multi-class segmentation, both classes are present in the output of GeoSAM. The figure shows that GeoSAM’s performance is on par with Tile2Net and that GeoSAM significantly outperforms zero-shot SAM on the same images. This observation supports the earlier assertion in Section I that zero-shot SAM does not perform well on thin-boundary objects in geographical images. Additionally, the final two columns of the figure present GeoSAM’s output for individual class segmentation, showing its effectiveness in single-task scenarios.

TABLE III: An ablation study. The segmentation performance is examined across the varying number of sparse point prompts and various model components. DP = dense prompt, PE = pre-trained encoder, FT = fine-tuning. The white and gray sections represent two distinct types of ablation studies.
| Techniques (Zero-shot / DP / PE / FT) | For. Points | Back. Points | Ratio | Road Infras. IoU | Pedestrian Infras. IoU |
|---|---|---|---|---|---|
| | 0 | 0 | - | 0.01 | 0.01 |
| | 100 | 50 | 2:1 | 0.72 | 0.40 |
| | 2000 | 2000 | 1:1 | 0.72 | 0.43 |
| | 2000 | 1000 | 2:1 | 0.76 | 0.45 |
| | 2000 | 4000 | 1:2 | 0.74 | 0.39 |
| × × × | 2000 | 1000 | 2:1 | 0.44 | 0.24 |
| × × | 2000 | 1000 | 2:1 | 0.36 | 0.20 |
| × | 2000 | 1000 | 2:1 | 0.47 | 0.24 |
| × | 2000 | 1000 | 2:1 | 0.53 | 0.30 |

IV-C2 Quantitative Results

We next report quantitative results based on the evaluation metrics discussed in Section III-B2. Table II compares the detailed results of GeoSAM with those of Tile2Net, zero-shot SAM, and several popular semantic segmentation models (both CNN-based and ViT-based) used as benchmarks. We apply Tile2Net to our selected datasets, adjusting its outputs to produce multi-class segmentation maps that include only road and pedestrian infrastructure. For the comparison with zero-shot SAM, we use the pre-trained SAM components, supplemented with sparse prompts created through our automated sparse prompt generation technique. We pick UNet [5], AttUNet [4], and UNet++ [3] as CNN-based benchmark models. Among popular ViT-based benchmark models such as [65, 66, 68, 69], we select [65, 66] for comparison.

The results on the test dataset (Washington DC) in Table II show that GeoSAM outperforms Tile2Net by relative margins of 17% and 21% in mean IoU (mIoU) and mean AP (mAP), respectively. Compared to zero-shot SAM, GeoSAM performs much better across both classes (a 76% increase in mIoU and a 211% increase in mAP). GeoSAM also significantly outperforms the other benchmark segmentation models on both classes. It surpasses the top-performing CNN-based model (UNet++) in mIoU and mAP by 33% and 30%, respectively, and the best ViT-based model (Swin UNETR) by approximately 36% and 30%, respectively. Although [49] is relevant to our work, especially in using foundation models for geographical images, the unavailability of their source code prevents a direct comparison. Instead, we reference the highest IoU value for road segmentation from their results (0.59), which GeoSAM surpasses with an IoU of 0.76. Importantly, [49] does not report results on sidewalk/crosswalk segmentation, which is the primary focus of our work on mobility infrastructure segmentation.
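The percentages above are relative improvements, computed as (GeoSAM − baseline) / baseline. As a quick check, the short sketch below recomputes them from the Washington DC mIoU and mAP values in Table II; the helper name `relative_gain` is hypothetical, and the reported figures appear to be these values with the fractional part dropped.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (mIoU, mAP) on the Washington DC test set, taken from Table II.
geosam = (0.60, 0.56)
baselines = {
    "Tile2Net": (0.51, 0.46),
    "Zero-shot SAM": (0.34, 0.18),
    "UNet++": (0.45, 0.43),
    "Swin UNETR": (0.44, 0.43),
}
for name, (miou, m_ap) in baselines.items():
    print(name, relative_gain(geosam[0], miou), relative_gain(geosam[1], m_ap))
```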

When evaluating on the generalizability dataset from Cambridge, MA, we observe that the performance of all models decreases relative to Washington DC, which is expected given the data shift between different geographical regions. Nonetheless, GeoSAM still outperforms the other models significantly, with mIoU and mAP at least 140% and 260% higher, respectively. This supports our assertion that GeoSAM is scalable and generalizes better than traditional models. Note that we do not include Tile2Net’s results for this generalizability dataset. Since the dataset’s purpose is to assess model performance on unseen data in a completely different scenario, Tile2Net is not applicable here, as Cambridge, MA was part of its training data.

Table III reports the ablation studies we performed on the test dataset, split into two parts. In the first part, we examine the effect of the number of sparse prompts on the performance of GeoSAM. We try different ratios of foreground to background points and also run GeoSAM without any points for comparison. The results strongly support the assertion that sparse prompts play a critical role, as shown by GeoSAM’s sharp drop in performance when sparse prompts are omitted. Further, the results indicate that our initial choice of 2000 foreground and 1000 background points, a ratio of 2:1, gives the best performance in this setting. This configuration is therefore used as the default.
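To make the foreground-to-background point configuration concrete, the sketch below draws point prompts from a binary mask (for example, a per-class prediction of the pre-trained CNN). It assumes points are sampled uniformly at random without replacement, which is an illustrative choice and not necessarily the sampling strategy used here; the function name `sample_point_prompts` and the seed are likewise hypothetical. The output follows SAM's point prompt convention of (x, y) pixel coordinates with label 1 for foreground and 0 for background.

```python
import numpy as np


def sample_point_prompts(mask, n_fg=2000, n_bg=1000, seed=0):
    """Sample foreground/background point prompts from a binary mask.

    Returns (points, labels): points is an (N, 2) array of (x, y) pixel
    coordinates; labels is 1 for foreground and 0 for background.
    """
    rng = np.random.default_rng(seed)
    mask = np.asarray(mask, dtype=bool)

    def pick(coords, n):
        n = min(n, len(coords))  # cannot draw more points than pixels
        if n == 0:
            return coords[:0]
        idx = rng.choice(len(coords), size=n, replace=False)
        return coords[idx]

    fg = pick(np.argwhere(mask), n_fg)    # rows of (row, col)
    bg = pick(np.argwhere(~mask), n_bg)
    points = np.concatenate([fg, bg])[:, ::-1]  # (row, col) -> (x, y)
    labels = np.concatenate([np.ones(len(fg), dtype=int),
                             np.zeros(len(bg), dtype=int)])
    return points, labels


# Example: the default 2:1 configuration on a synthetic 1024x1024 mask.
demo_mask = np.zeros((1024, 1024), dtype=bool)
demo_mask[400:600, :] = True  # a horizontal "road" band
points, labels = sample_point_prompts(demo_mask)
print(points.shape, int(labels.sum()))
```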

In the second part of the ablation study, also shown in Table III, we assess the importance of several key techniques: the CNN encoder for generating sparse prompts, the dense prompts that complement the sparse prompts, and the fine-tuned decoder in place of the original decoder. Using the optimal settings established in the first part, we evaluate the contribution of each technique adopted in GeoSAM. The results in Table III indicate that each component integrated into GeoSAM contributes to the overall performance. Notably, the exclusion of FT (the fine-tuned decoder) in the last row results in the largest drop in performance, underscoring that the original SAM decoder is not well suited to this type of task. Additionally, without the other techniques, giving only automated sparse prompts to zero-shot SAM (first row of the second part) results in very poor performance, highlighting the importance of each technique we use.

V Conclusion

GeoSAM is a pioneering work that uses SAM’s capabilities to segment mobility infrastructure in geographical images without human intervention. We introduce an architecture that combines sparse and dense prompts with fine-tuning of the foundation segmentation model SAM. In addition, unlike existing work, the training and end-to-end inference pipeline for mobility infrastructure segmentation that we developed is reproducible, can be adapted to various segmentation tasks, and is transferable to other geo-locations by using a different domain-specific encoder and fine-tuning the decoder on a different dataset.

References

  • [1] M. Hosseini, A. Sevtsuk, F. Miranda, R. M. Cesar Jr, and C. T. Silva, “Mapping the walk: A scalable computer vision approach for generating sidewalk network datasets from aerial imagery,” Computers, Environment and Urban Systems, vol. 101, p. 101950, 2023.
  • [2] C. Li, Z. Dong, N. Fisher, and D. Zhu, “Coupling user preference with external rewards to enable driver-centered and resource-aware ev charging recommendation,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2022, pp. 3–19.
  • [3] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4.   Springer, 2018, pp. 3–11.
  • [4] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
  • [5] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.   Springer, 2015, pp. 234–241.
  • [6] Y. Qiang, C. Li, P. Khanduri, and D. Zhu, “Fairness-aware vision transformer via debiased self-attention,” arXiv preprint arXiv:2301.13803, 2023.
  • [7] ——, “Interpretability-aware vision transformer,” arXiv preprint arXiv:2309.08035, 2023.
  • [8] M. Hosseini, M. Saugstad, F. Miranda, A. Sevtsuk, C. T. Silva, and J. E. Froehlich, “Towards global-scale crowd+ ai techniques to map and assess sidewalks for people with disabilities,” arXiv preprint arXiv:2206.13677, 2022.
  • [9] L. Scheibenreif, J. Hanna, M. Mommert, and D. Borth, “Self-supervised vision transformers for land-cover segmentation and classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1422–1431.
  • [10] M. Saha, M. Saugstad, H. T. Maddali, A. Zeng, R. Holland, S. Bower, A. Dash, S. Chen, A. Li, K. Hara et al., “Project sidewalk: A web-based crowdsourcing tool for collecting sidewalk accessibility data at scale,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–14.
  • [11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [14] C. Li, P. Khanduri, Y. Qiang, R. I. Sultan, I. Chetty, and D. Zhu, “Auto-prompting sam for mobile friendly 3d medical image segmentation,” arXiv preprint arXiv:2308.14936, 2023.
  • [15] X. Hu, X. Xu, and Y. Shi, “How to efficiently adapt large segmentation model (sam) to medical images,” arXiv preprint arXiv:2306.13731, 2023.
  • [16] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691, 2021.
  • [17] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., “Parameter-efficient fine-tuning of large-scale pre-trained language models,” Nature Machine Intelligence, vol. 5, no. 3, pp. 220–235, 2023.
  • [18] T. Chen, L. Zhu, C. Ding, R. Cao, S. Zhang, Y. Wang, Z. Li, L. Sun, P. Mao, and Y. Zang, “Sam fails to segment anything?–sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, and more,” arXiv preprint arXiv:2304.09148, 2023.
  • [19] Z. Fu, H. Yang, A. M.-C. So, W. Lam, L. Bing, and N. Collier, “On the effectiveness of parameter-efficient fine-tuning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 12 799–12 807.
  • [20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [21] C. Henry, S. M. Azimi, and N. Merkle, “Road segmentation in sar satellite images with deep fully convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 12, pp. 1867–1871, 2018.
  • [22] L. Zhou, C. Zhang, and M. Wu, “D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 182–186.
  • [23] V. Badrinarayanan, A. Handa, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.
  • [24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [25] A. Saha, “Conducting semantic segmentation on landcover satellite imagery through u-net architectures,” in Proceedings of the Future Technologies Conference.   Springer, 2022, pp. 758–764.
  • [26] X. Chen, Q. Sun, W. Guo, C. Qiu, and A. Yu, “Ga-net: A geometry prior assisted neural network for road extraction,” International Journal of Applied Earth Observation and Geoinformation, vol. 114, p. 103004, 2022.
  • [27] P. Gudžius, O. Kurasova, V. Darulis, and E. Filatovas, “Deep learning-based object recognition in multispectral satellite imagery for real-time applications,” Machine Vision and Applications, vol. 32, no. 4, p. 98, 2021.
  • [28] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • [29] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, and S. Ermon, “Geography-aware self-supervised learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 181–10 190.
  • [30] P. Gudzius, O. Kurasova, V. Darulis, and E. Filatovas, “Automl-based neural architecture search for object recognition in satellite imagery,” Remote Sensing, vol. 15, no. 1, p. 91, 2022.
  • [31] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.
  • [32] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang, “High-resolution representations for labeling pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.
  • [33] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang et al., “Deep high-resolution representation learning for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3349–3364, 2020.
  • [34] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16.   Springer, 2020, pp. 173–190.
  • [35] J. H. Kim, S. Lee, J. R. Hipp, and D. Ki, “Decoding urban landscapes: Google street view and measurement sensitivity,” Computers, Environment and Urban Systems, vol. 88, p. 101626, 2021.
  • [36] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
  • [37] D. Wang, J. Zhang, B. Du, D. Tao, and L. Zhang, “Scaling-up remote sensing segmentation dataset with segment anything model,” arXiv preprint arXiv:2305.02034, 2023.
  • [38] I. Giannakis, A. Bhardwaj, L. Sam, and G. Leontidis, “Deep learning universal crater detection using segment anything model (sam),” arXiv preprint arXiv:2304.07764, 2023.
  • [39] Z. Wang, S. Sun, X. Que, and X. Ma, “Interactive segmentation in aerial images: a new benchmark and an open access web-based tool,” arXiv preprint arXiv:2308.13174, 2023.
  • [40] L. Yao, H. Zuo, G. Zheng, C. Fu, and J. Pan, “Sam-da: Uav tracks anything at night with sam-powered domain adaptation,” arXiv preprint arXiv:2307.01024, 2023.
  • [41] L. P. Osco, Q. Wu, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, J. Li, and J. M. Junior, “The segment anything model (sam) for remote sensing applications: From zero to one shot,” arXiv preprint arXiv:2306.16623, 2023.
  • [42] F. Deuser, K. Habel, and N. Oswald, “Sample4geo: Hard negative sampling for cross-view geo-localisation,” arXiv preprint arXiv:2303.11851, 2023.
  • [43] S. Julka and M. Granitzer, “Knowledge distillation with segment anything (sam) model for planetary geological mapping,” arXiv preprint arXiv:2305.07586, 2023.
  • [44] L. Ding, K. Zhu, D. Peng, H. Tang, and H. Guo, “Adapting segment anything model for change detection in hr remote sensing images,” arXiv preprint arXiv:2309.01429, 2023.
  • [45] Z. Peng, Z. Xu, Z. Zeng, X. Yang, and W. Shen, “Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction,” arXiv preprint arXiv:2308.14604, 2023.
  • [46] T. Shaharabany, A. Dahan, R. Giryes, and L. Wolf, “Autosam: Adapting sam to medical images by overloading the prompt encoder,” arXiv preprint arXiv:2306.06370, 2023.
  • [47] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang, “Advancing plain vision transformer toward remote sensing foundation model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022.
  • [48] G. Mai, W. Huang, J. Sun, S. Song, D. Mishra, N. Liu, S. Gao, T. Liu, G. Cong, Y. Hu et al., “On the opportunities and challenges of foundation models for geospatial artificial intelligence,” arXiv preprint arXiv:2304.06798, 2023.
  • [49] K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for remote sensing images,” arXiv preprint arXiv:2304.05215, 2023.
  • [50] F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, and J. Zhou, “Remoteclip: A vision language foundation model for remote sensing,” arXiv preprint arXiv:2306.11029, 2023.
  • [51] Z. Yan, J. Li, X. Li, R. Zhou, W. Zhang, Y. Feng, W. Diao, K. Fu, and X. Sun, “Ringmo-sam: A foundation model for segment anything in multimodal remote-sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
  • [52] X. Ma, Q. Wu, X. Zhao, X. Zhang, M.-O. Pun, and B. Huang, “Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints,” arXiv preprint arXiv:2312.02464, 2023.
  • [53] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [54] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3.   Springer, 2017, pp. 240–248.
  • [55] X. Li, X. Li, D. Pan, and D. Zhu, “On the learning property of logistic and softmax losses for deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4739–4746.
  • [56] US Geological Survey, “USGS EROS Archive - Aerial Photography - High Resolution Orthoimagery (HRO),” https://doi.org/10.5066/F73X84W6, 2018.
  • [57] DC GIS (2019a), “Roads 2019,” https://opendata.dc.gov/datasets/DCGIS::roads-2019, 2019.
  • [58] DC GIS (2019b), “Sidewalks 2019 (2019b),” https://opendata.dc.gov/datasets/sidewalks-2019, 2019.
  • [59] Cambridge GIS (2018a), “Cambridge sidewalk,” https://www.cambridgema.gov/GIS/gisdatadictionary/Basemap/BASEMAP_Sidewalks, 2018.
  • [60] Cambridge GIS (2018b), “Pavement markings,” https://www.cambridgema.gov/GIS/gisdatadictionary/Traffic/TRAFFIC_PavementMarkings, 2018.
  • [61] Cambridge GIS (2018c), “Public footpaths,” https://www.cambridgema.gov/GIS/gisdatadictionary/Basemap/BASEMAP_PublicFootpaths, 2018.
  • [62] Cambridge GIS (2018d), “Roads,” https://www.cambridgema.gov/GIS/gisdatadictionary/Basemap/BASEMAP_Roads, 2018.
  • [63] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [64] M. J. Cardoso, W. Li, R. Brown, N. Ma, E. Kerfoot, Y. Wang, B. Murrey, A. Myronenko, C. Zhao, D. Yang et al., “Monai: An open-source framework for deep learning in healthcare,” arXiv preprint arXiv:2211.02701, 2022.
  • [65] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 574–584.
  • [66] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. R. Roth, and D. Xu, “Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images,” in International MICCAI Brainlesion Workshop.   Springer, 2021, pp. 272–284.
  • [67] M. Pesaresi and J. A. Benediktsson, “A new approach for the morphological segmentation of high-resolution satellite imagery,” IEEE transactions on Geoscience and Remote Sensing, vol. 39, no. 2, pp. 309–320, 2001.
  • [68] C. Li, H. Bagher-Ebadian, V. Goddla, I. J. Chetty, and D. Zhu, “Focalunetr: A focal transformer for boundary-aware segmentation of ct images,” arXiv preprint arXiv:2210.03189, 2022.
  • [69] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.