GeoSAM: Fine-tuning SAM with Sparse and Dense Visual Prompting for Automated Segmentation of Mobility Infrastructure

Rafi Ibn Sultan¹, Chengyin Li¹, Hui Zhu¹, Prashant Khanduri¹, Marco Brocanelli², Dongxiao Zhu¹
¹Wayne State University, ²Ohio State University
¹{hm4013, gv2145, hq2197, dzhu}@wayne.edu ²[email protected]

Abstract

The Segment Anything Model (SAM) has shown impressive performance when applied to natural image segmentation. However, it struggles with geographical images like aerial and satellite imagery, especially when segmenting mobility infrastructure including roads, sidewalks, and crosswalks. This inferior performance stems from the narrow features of these objects, their textures blending into the surroundings, and interference from objects like trees, buildings, vehicles, and pedestrians - all of which can disorient the model to produce inaccurate segmentation maps. To address these challenges, we propose Geographical SAM (GeoSAM), a novel SAM-based framework that implements a fine-tuning strategy using the dense visual prompt from zero-shot learning, and the sparse visual prompt from a pre-trained CNN segmentation model. The proposed GeoSAM outperforms existing approaches for geographical image segmentation, specifically by 26%, 7%, and 17% for road infrastructure, pedestrian infrastructure, and on average, respectively, representing a momentous leap in leveraging foundation models to segment mobility infrastructure including both road and pedestrian infrastructure in geographical images. The source code can be found on this GitHub repository.

Index Terms:

Semantic Segmentation, Geographical Imagery, SAM, Fine-Tuning, Prompt Generation, Multi-Class Mobility Infrastructure Segmentation

I Introduction

While a substantial amount of research has focused on road infrastructure segmentation from geographical imagery like aerial and satellite images, pedestrian infrastructure such as sidewalks and crosswalks has received comparatively little attention, despite its importance in daily life. Historically, research efforts have predominantly focused on assisting drivers in navigation rather than pedestrians [1, 2]. Therefore, accurately segmenting mobility infrastructure, including both road and pedestrian infrastructure, could provide invaluable information about accessible pedestrian routes and trip locations.

Current methodologies for mobility infrastructure segmentation predominantly utilize traditional Convolutional Neural Networks (CNNs) [3, 4, 5] or Vision Transformer (ViT) models [6, 7]. These approaches typically rely on extensive collections of human-labeled data such as roads, sidewalks, and crosswalks for training [1, 8, 9, 10]. A fundamental challenge of these conventional segmentation models lies in their reliance on large datasets of high-quality labeled images, which limits their scalability and adaptability to varied tasks.

In response to the limitations of traditional segmentation models, the rise of vision foundation models [11, 12, 13] represents a big leap in scaling up segmentation models, allowing for powerful zero-shot or few-shot learning capabilities and flexible prompting. Without the need for re-training, these models can quickly adapt to a new downstream task. We therefore turn to the Segment Anything Model (SAM) [11], one of the first vision foundation models for image segmentation. With the introduction of SAM, designed with the ambition of segmenting virtually anything in images, the field of image segmentation is rapidly transitioning from traditional CNN or ViT models to pre-trained foundation models.

Zero-shot or few-shot learning and fine-tuning with Parameter Efficient Fine-Tuning (PEFT) are the two primary approaches for leveraging the capabilities of foundation models. Zero-shot learning sets the initial groundwork for a downstream task, utilizing the model without specific contextual information [11, 12]. While this basic utilization has shown early success, zero-shot SAM often struggles to generalize effectively across various downstream tasks [14, 15]. To address this limitation, PEFT fine-tunes a subset (typically a lightweight module) of a large foundation model for a specific downstream task while keeping the majority of parameters frozen, optimizing task performance without full retraining [16, 17, 18, 19, 20].

Considering the task of mobility infrastructure segmentation employed in this work, zero-shot SAM faces significant challenges, particularly in distinguishing sidewalks from roads in aerial imagery. Sidewalks often have very thin borders alongside road borders (as can be seen in Fig. 5) and can have very similar textures to roads, presenting difficulties for zero-shot SAM in accurately differentiating between them. However, SAM has the potential to address this issue, as it has been trained extensively to distinguish between objects, requiring only a minor yet targeted calibration to optimize its inherent capabilities. To address these challenges, we introduce Geographical SAM (GeoSAM), an end-to-end model tailored for segmenting pedestrian infrastructure through binary segmentation of road and pedestrian infrastructure.

As illustrated in Fig. 1, our approach incorporates an automated prompt generation process. We employ sparse prompts generated by a domain-specific CNN encoder and complement them with dense prompts produced by zero-shot SAM to perform additional fine-tuning of SAM using PEFT techniques. Sparse prompts, which are essentially clicks on the image, provide the model with a context for segmentation by indicating where to focus. Dense prompts, which are low-quality segmentation maps, offer the model additional context about the objects to be segmented.

The fine-tuning process plays a crucial role in imparting domain-specific knowledge to SAM. Our approach, which employs zero-shot SAM to create dense prompts, draws inspiration from SAM’s own training methodology, where mask predictions from previous iterations are used as additional dense prompts for subsequent modeling [11]. Conversely, we employ a domain-specific CNN encoder for the segmentation of geographical objects, enabling us to capture precise location information for the generation of sparse prompts specific to mobility infrastructure in the geographical imagery domain.

GeoSAM sets itself apart from previous CNN-based approaches [1, 21, 22, 23], not only excelling in the segmentation of mobility infrastructure through improved accuracy but also showcasing the extended capabilities of foundation models in the field of geographical image analysis.

Our contributions are summarized below:

• We pioneer the adaptation of the foundation model, SAM, for mobility infrastructure segmentation, a multi-class segmentation problem using geographical imagery, without any human intervention, overcoming the limitations of zero-shot SAM.
• We develop the fine-tuning and prompting of SAM for geographical imagery, empowering SAM with domain-specific knowledge drawn from the utilization of both sparse and dense prompts.
• We design and implement a novel automated pipeline to generate both dense prompts from zero-shot learning and sparse prompts from a pre-trained CNN encoder to enhance SAM’s effectiveness and efficiency on the under-performing mobility infrastructure segmentation task.

II Related Work

II-A CNN-based Geographical Image Segmentation

Prior to the emergence of foundation models, CNN-based models like UNet [5] and vision transformer-based [24] works, which follow an encoder-decoder architecture for semantic segmentation, were the standard choice for various geographical segmentation tasks. Simple UNet-based approaches like [21, 25] and more advanced encoder-decoder-based works like [22, 23, 26, 27] are developed to execute various geographical image segmentation tasks. Additionally, there are endeavors like [28, 29, 30], where researchers exploit multiple machine learning techniques to enhance the performance of these CNN-based segmentation models for geographical object segmentation.

Furthermore, Tile2Net [1], also based on CNN, is one of the most established works in mobility infrastructure segmentation in geographical imagery. Its primary focus is mapping sidewalk networks using aerial and satellite imagery, which involves segmenting mobility infrastructure elements like roads, sidewalks, and crosswalks. For the semantic segmentation part of their network creation, they train a hierarchical multi-scale attention model [31] and HRNet-W48 [32] from scratch; [33] with object-contextual representations [34] as the backbone. While these diverse efforts contribute significantly to the field of geographical image segmentation, many of them share similar limitations, as they typically require a large amount of supervised data for each task and necessitate retraining. While accuracy improvements are valuable, they do not necessarily address some of the more fundamental issues inherent in geographical image segmentation, such as generalizability to new locations.

In addition to conventional CNN models trained from scratch, there have been notable transfer learning-based efforts, exemplified in works like [35, 36], where a model trained on the source task is used to reduce the computational demands of various related downstream tasks. In practice, however, these models often need further fine-tuning or retraining to align them with the precise objectives of their respective tasks. This lack of generalizability means that the source models must be re-trained to achieve competitive performance on the downstream task. These transfer learning-based approaches remain task-specific, in contrast to foundation models, which are designed to be more general and not tied to specific tasks.

II-B SAM-based Geographical Image Segmentation

Vision foundation models, in their essence, aim to address the shortcomings of CNN-based segmentation models by being readily adaptable to segment previously unknown classes for various downstream tasks. SAM [11], a foundation model dedicated to segmentation tasks, has three main components: (i) a ViT-based [24] image encoder, trained with over 1 billion masks on 11 million images, that computes the image embeddings; (ii) a prompt encoder that takes prompts from users (which tell the model where to focus) and encodes them into embeddings; and (iii) a lightweight mask decoder that generates a segmentation map from the received image embedding and prompt embeddings. These prompts can take the form of sparse prompts (such as clicks, bounding boxes, or texts) or dense prompts (mask inputs).

While the application of SAM in the field of geographical imagery is not as extensive as in other domains, it has been utilized in several research endeavors. Considering the zero-shot SAM approaches, several works, including [37, 38, 39, 40], have employed SAM’s zero-shot capabilities for a range of downstream tasks, extending beyond segmentation. Additionally, in [41], a hybrid approach combining both zero-shot and one-shot learning is applied to SAM for segmenting geographical imagery. However, it’s crucial to note that these zero-shot-based approaches primarily target objects with well-defined boundaries and distinguishable physical contexts in their surroundings. In such cases, SAM doesn’t necessitate extensive domain-specific knowledge for accurate segmentation.

When zero-shot SAM encounters difficulties in specific domains, there have been attempts to fine-tune SAM using PEFT techniques. Works in geographical imagery like [42, 43] explore fine-tuning with diverse PEFT techniques for a range of downstream tasks such as geo-localization and mapping. Beyond geographical imagery, [18, 44, 45] fine-tune a small number of parameters using various PEFT techniques across a variety of natural imagery. However, to the best of our knowledge, no fine-tuning work has been specifically tailored to geographical images for mobility infrastructure segmentation, a significant under-performing task that may generate tremendous social impact.

While most of these works are based on manual human prompting in the inference stage, there are also works focusing on automating prompt generation, mainly in the medical imaging segmentation domain. Works such as [46, 14] develop automatic prompt generation techniques by replacing the SAM prompt encoder with a trainable network, a process that demands substantial training data. However, auto-prompting in geographical image segmentation is a non-trivial task due to the lack of curated geographic infrastructure data for prompt generation.

II-C Pre-training Geographical Image Segmentation Foundation Model

Very recently, researchers have also attempted to train domain-specific foundation models using large geographical imagery data sets. For instance, [47] uses plain ViT models with about 0.1 billion parameters to train large vision models tailored to remote sensing tasks and investigate how these large ViT models perform on object detection, scene classification, and semantic segmentation tasks. The work in [48] compares different visions of foundation models with CNN-based fine-tuned models for geographical images and they conclude that the foundation models fall short compared to the CNN-based fine-tuned models. Similar to SAM, [49, 50] develop their own foundation models for geographical imagery based on scaled versions of ViT and CLIP models [12] respectively. In addition to these studies, research such as [51, 52] employs SAM to leverage the capabilities of foundation models in performing various segmentation tasks on geographical imagery. Many of these works focus on road infrastructure segmentation tasks, which, unfortunately, do not specifically address the key issue of pedestrian infrastructure segmentation tasks, i.e., sidewalk/crosswalk segmentation. Additionally, given that they do not provide the source code, assessing their model’s effectiveness in the context of pedestrian infrastructure presents challenges.

III Method

Figure 1: Training GeoSAM, an automated mobility infrastructure segmentation pipeline. In Prompts Generation (orange arrows), the model generates the sparse and dense prompts with the help of a secondary CNN-based geographical image encoder. Sparse prompts are generated automatically from the output of the secondary model and the soft mask is generated by the SAM mask decoder which is used as the dense prompts. In Fine-tuning (blue arrows), the prompts generated from the previous stage are used in the tunable decoder to produce the mask.

This section first provides an overview of SAM and then discusses the training and inference phases of GeoSAM.

III-A SAM: Background

SAM comprises three elements: an image encoder (referred to as $\operatorname{Enc_{I}}$), a prompt encoder (referred to as $\operatorname{Enc_{P}}$), and a mask decoder (referred to as $\operatorname{Dec_{M}}$). SAM, designed as a model that can be prompted, accepts an image, denoted as $I$, and a collection of prompts, known as $P$. These prompts can represent a point, a box, or a dense mask. In its operation, SAM initially employs $\operatorname{Enc_{I}}$ to extract features from the input image. It then utilizes $\operatorname{Enc_{P}}$ to transform the human-provided prompts, which have a length of $k$, into prompt tokens. Specifically:

$$F_{I}=\operatorname{Enc_{I}}(I),\quad T_{P}=\operatorname{Enc_{P}}(P). \tag{1}$$

In Equation 1, $F_{I}$ is the feature embedding of the input image, where $F_{I}\in\mathbb{R}^{h\times w\times c}$, $h$ and $w$ represent the resolution of the image feature map, and $c$ denotes the feature dimension. Likewise, $T_{P}$ is the feature embedding of the prompts, where $T_{P}\in\mathbb{R}^{k\times c}$ and $k$ is the length of the prompts.

Following this, the encoded image and prompts are supplied to the decoder, called $\operatorname{Dec_{M}}$ , which employs attention-based mechanisms for feature interaction. SAM prepares the input tokens for the decoder by merging several mask tokens, denoted as $T_{M}$ , with the prompt tokens $T_{P}$ . These mask tokens play a crucial role in generating the mask output, which is defined as:

$$S=\operatorname{Dec_{M}}\left(F_{I},\operatorname{Concat}(T_{M},T_{P})\right), \tag{2}$$

where $S$ in Equation 2 represents the output segmentation mask predicted by SAM.

III-B GeoSAM: Training Strategy

III-B1 Prompts Generation

In the proposed framework, GeoSAM employs a system that automatically generates both sparse and dense prompts to offer contextual information for the segmentation task.

Sparse Prompts Generation In the sparse prompts generation process, GeoSAM implements a system that randomly creates click points, known as sparse prompts, on both the foreground and background of the target object. GeoSAM relies on a pre-trained CNN encoder from this domain to create these sparse prompts. Tile2Net [1], an established CNN-based model in this area, has been selected for this purpose. Tile2Net produces a multi-class segmentation map with four different classes, three for the foreground (roads, sidewalks, crosswalks) and one for the background. We refer to the segmentation map generated from the pre-trained CNN encoder as the "pseudo labels". The pseudo labels differ somewhat from our specific requirements, as our task involves dealing with two types of foreground classes (road and pedestrian infrastructure) in addition to the background. Therefore, GeoSAM makes certain modifications to the pseudo labels before employing them in the generation of sparse prompts.

GeoSAM, based on the Segment Anything Model (SAM), primarily a binary-class segmentation model, requires adjustments to suit the multi-class segmentation task. In training and evaluating, GeoSAM creates multi-channel segmentation maps for road and pedestrian segmentation respectively. Each channel utilizes pseudo labels generated from Tile2Net to provide the model with foreground and background information.

GeoSAM categorizes pixel values of these pseudo labels into foreground (such as pixels of roads from pseudo labels for road infrastructure, or sidewalk and crosswalk pixels for pedestrian infrastructure) and background pixels, aligning with the binary class segmentation approach of SAM. From these foreground and background sets of pixels, GeoSAM randomly selects a number of foreground and background pixels and obtains their coordinates to use as sparse prompts or clicks on the input image (see the ablation study for details in Table II). Fig. 2 visually illustrates the method of generating random sparse prompts using Tile2Net’s pseudo labels. This example focuses on the sparse prompts generation for pedestrian infrastructure. For the sake of clarity, the figure depicts fewer points than are actually used.
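The sampling step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the class ids, the function name `sample_sparse_prompts`, and the number of clicks are all assumptions, and the label convention (1 for foreground clicks, 0 for background clicks) follows the usual point-prompt convention for SAM.

```python
import random

# Hypothetical class ids for the pseudo labels; the real Tile2Net encoding may differ.
BACKGROUND, ROAD, SIDEWALK, CROSSWALK = 0, 1, 2, 3


def sample_sparse_prompts(pseudo_label, foreground_ids, n_fg, n_bg, seed=None):
    """Sample click prompts from a 2-D pseudo-label map (list of rows).

    Returns (points, labels): points are (x, y) pixel coordinates and
    labels are 1 for foreground clicks and 0 for background clicks.
    """
    rng = random.Random(seed)
    fg, bg = [], []
    for y, row in enumerate(pseudo_label):
        for x, cls in enumerate(row):
            (fg if cls in foreground_ids else bg).append((x, y))
    # Never request more clicks than there are candidate pixels.
    fg_pts = rng.sample(fg, min(n_fg, len(fg)))
    bg_pts = rng.sample(bg, min(n_bg, len(bg)))
    return fg_pts + bg_pts, [1] * len(fg_pts) + [0] * len(bg_pts)


pseudo = [
    [0, 0, 1, 1, 2, 0],
    [0, 0, 1, 1, 2, 0],
    [3, 3, 1, 1, 2, 0],
    [0, 0, 1, 1, 2, 0],
]
# Pedestrian infrastructure merges sidewalks and crosswalks into one foreground set;
# road pixels therefore count as background for this channel.
points, labels = sample_sparse_prompts(
    pseudo, {SIDEWALK, CROSSWALK}, n_fg=3, n_bg=3, seed=0
)
```

Calling the same function with `{ROAD}` as the foreground set would yield the prompts for the road channel, which is how a binary prompting scheme can serve a two-class task.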

Dense Prompts Generation GeoSAM includes an automated system for generating dense prompts using zero-shot SAM, complementing the sparse prompts for better context definition. These dense prompts act as a soft mask, essentially an unthresholded, lower-quality prediction map. The zero-shot SAM generates a soft mask with continuous values, reflecting the model’s confidence in pixel classification, and offering more detailed semantic information than a binary mask, which results from applying a threshold.

The generation of dense prompts begins by feeding the target image and its feature embeddings, obtained from the pre-trained CNN image encoder, into the model. These embeddings initially serve as the dense prompts for SAM’s prompt encoder, as depicted in Fig. 1. Typically, SAM’s prompt encoder accepts dense prompts in the format of (64 $\times$ 64 $\times$ 256), where 256 denotes the channel dimension, and (64 $\times$ 64) represents the spatial dimensions (height and width). To adapt the feature embeddings to this format, GeoSAM resizes them to this dimension. This involves applying a 1 $\times$ 1 convolution to convert the channel dimension of the embeddings to 256, followed by resizing their height and width to 64 $\times$ 64 using average pooling, as shown in Fig. 3. Once resized, these embeddings are ready to be input into the prompt encoder. Using the outputs produced by the image encoder and prompt encoder, the decoder of zero-shot SAM generates the output, i.e., the soft mask (see Fig. 1), effectively creating self-generated dense prompts for the model.
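To make the resizing step concrete, the sketch below applies a 1 $\times$ 1 convolution (a per-pixel linear map across channels) followed by non-overlapping average pooling, using plain Python lists and toy sizes. The function names, the channel-first layout, and the hand-picked weights are assumptions for illustration only; a real pipeline would use a tensor library and target 256 channels at 64 $\times$ 64.

```python
def conv1x1(feat, weight):
    """Apply a 1x1 convolution: feat is [C_in][H][W], weight is [C_out][C_in]."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [
        [[sum(wt[c] * feat[c][y][x] for c in range(c_in)) for x in range(w)]
         for y in range(h)]
        for wt in weight
    ]


def avg_pool(feat, out_h, out_w):
    """Average-pool [C][H][W] to [C][out_h][out_w] with non-overlapping windows."""
    h, w = len(feat[0]), len(feat[0][0])
    if h % out_h or w % out_w:
        raise ValueError("input size must be a multiple of the output size")
    kh, kw = h // out_h, w // out_w
    return [
        [[sum(ch[y * kh + dy][x * kw + dx]
              for dy in range(kh) for dx in range(kw)) / (kh * kw)
          for x in range(out_w)]
         for y in range(out_h)]
        for ch in feat
    ]


def to_dense_prompt(feat, weight, out_h, out_w):
    """Channel projection first, then spatial pooling, as in the text above."""
    return avg_pool(conv1x1(feat, weight), out_h, out_w)


# Toy sizes: 2 input channels on a 4x4 grid -> 3 channels on a 2x2 grid.
feat = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(2)]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative, not learned
dense = to_dense_prompt(feat, weight, 2, 2)
```

Because the 1 $\times$ 1 convolution mixes channels only, it changes the channel count without touching spatial resolution, which is why the pooling step is needed separately.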

III-B2 Fine-Tuning

During this phase of training, the model is simultaneously fed with both dense prompts and sparse prompts, which act as inputs into the previously mentioned prompt encoder. Following this, the SAM mask decoder, using both the prompts and the input image, generates a multi-channel segmentation map that encompasses all the classes. At this point, the model calculates the loss by comparing the output segmentation maps to the multi-channel ground truth images, produced through one-hot encoding. GeoSAM then begins updating the decoder’s parameters in this phase while keeping the other parameters fixed, starting the fine-tuning process using PEFT. It’s important to highlight that GeoSAM doesn’t make any modifications to the model’s parameters before this point, following the standard PEFT strategy.
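As a small illustration of the one-hot step mentioned above, the following sketch turns an integer label map into one binary channel per class. The class ids (0 for background, 1 for road, 2 for pedestrian infrastructure) and the function name are assumptions, not taken from the paper.

```python
def one_hot(label_map, num_classes):
    """Turn a 2-D map of integer class ids into num_classes binary channels."""
    return [
        [[1 if cls == c else 0 for cls in row] for row in label_map]
        for c in range(num_classes)
    ]


# Hypothetical ids: 0 = background, 1 = road, 2 = pedestrian infrastructure.
gt = [[0, 1, 2],
      [0, 1, 2]]
channels = one_hot(gt, 3)
# channels[1] is the road mask, channels[2] the pedestrian mask.
```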

For the loss function to fine-tune the decoder, we opt for a combination of Dice Loss [53] and Focal Loss [54]. The Dice Loss is based on the Dice Similarity Coefficient (DSC), a popular metric that uses the overlap between two regions to evaluate the accuracy of a segmentation algorithm. The Dice Loss is the complement of the Dice coefficient; therefore, minimizing the Dice Loss during training is equivalent to maximizing the Dice coefficient. It can be expressed as

$$\mathcal{L}_{\text{Dice}}=1-\frac{2\sum_{c=1}^{C}\sum_{i=1}^{N}s_{i}^{c}g_{i}^{c}}{\sum_{c=1}^{C}\sum_{i=1}^{N}s_{i}^{c}+\sum_{c=1}^{C}\sum_{i=1}^{N}g_{i}^{c}}, \tag{3}$$

where $g_{i}^{c}$ represents the ground truth binary indicator of class label $c$ for the pixel $i$, $s_{i}^{c}$ denotes the corresponding predicted segmentation probability, $C$ is the number of classes, and $N$ is the number of pixels.

Focal Loss is a weighted Cross Entropy Loss designed to focus on hard-to-classify examples while down-weighting easy-to-classify examples. In addition to the previous notations, Focal Loss also introduces $\alpha$ and $\gamma$ as the balancing and focusing parameters, respectively. The balancing factor $\alpha$ assigns different weights to different classes to provide more importance to the minority class whereas the focusing parameter $\gamma$ affects how much the loss is focused on hard-to-classify examples [55]. Parameters $\alpha$ and $\gamma$ are combined with the basic Cross Entropy Loss to get the Equation of Focal Loss:

$$\mathcal{L}_{\text{Focal}}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\alpha_{c}\left[\left(1-s_{i}^{c}\right)^{\gamma}g_{i}^{c}\log(s_{i}^{c})\right]. \tag{4}$$

Finally, both losses are combined with equal weight to get the overall Dice Focal Loss used for fine-tuning GeoSAM:

$$\mathcal{L}_{\text{DiceFocal}}=\mathcal{L}_{\text{Dice}}+\mathcal{L}_{\text{Focal}}. \tag{5}$$
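Equations (3) to (5) translate almost term by term into code. The sketch below works on flattened per-class lists (`pred[c][i]` for $s_{i}^{c}$ and `target[c][i]` for $g_{i}^{c}$); the function names, the toy values, and the small `eps` added for numerical stability are assumptions that do not appear in the equations.

```python
import math


def dice_loss(pred, target, eps=1e-8):
    """Eq. (3): pred[c][i] are probabilities, target[c][i] are 0/1 indicators."""
    inter = sum(s * g for sc, gc in zip(pred, target) for s, g in zip(sc, gc))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # eps (an assumption) guards against division by zero on empty inputs.
    return 1.0 - 2.0 * inter / (total + eps)


def focal_loss(pred, target, alpha, gamma=2.0, eps=1e-8):
    """Eq. (4): class-weighted cross entropy scaled by (1 - s)^gamma."""
    n = len(pred[0])  # number of pixels N
    acc = 0.0
    for a, sc, gc in zip(alpha, pred, target):
        for s, g in zip(sc, gc):
            # eps (an assumption) avoids log(0) for zero probabilities.
            acc += a * (1.0 - s) ** gamma * g * math.log(s + eps)
    return -acc / n


def dice_focal_loss(pred, target, alpha, gamma=2.0):
    """Eq. (5): unweighted sum of the two terms."""
    return dice_loss(pred, target) + focal_loss(pred, target, alpha, gamma)


# Two classes, three pixels; uniform alpha is an illustrative choice.
target = [[1, 0, 0], [0, 1, 1]]
alpha = [0.5, 0.5]
pred_good = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.9]]
pred_bad = [[0.1, 0.9, 0.9], [0.9, 0.1, 0.1]]
loss_good = dice_focal_loss(pred_good, target, alpha)
loss_bad = dice_focal_loss(pred_bad, target, alpha)
```

The confident, correct prediction receives a much smaller loss than the inverted one, since the $(1-s_{i}^{c})^{\gamma}$ factor shrinks the cross-entropy contribution of pixels that are already classified well.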

To assess the performance of geographical object segmentation, we employ two metrics: Intersection over Union (IoU) and average precision (AP). The IoU score is the ratio of the area of overlap between the predicted and ground truth regions to the area of their union. A value of 0 indicates no overlap, while a value of 1 indicates perfect overlap between the predicted and ground truth regions. The AP measures the accuracy of the model in classifying each pixel, considering both the precision (the proportion of positive predictions that are correct) and the recall (the ability of the model to find all the relevant cases) across different threshold levels.
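For reference, the IoU definition above can be computed on binary masks as follows. This is a minimal sketch with a hypothetical function name; treating two empty masks as a perfect match is a convention assumed here, not one stated in the text.

```python
def iou(pred_mask, gt_mask):
    """IoU between two binary masks given as lists of rows of 0/1."""
    inter = union = 0
    for pr, gr in zip(pred_mask, gt_mask):
        for p, g in zip(pr, gr):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
score = iou(pred, gt)  # 2 overlapping pixels out of 4 in the union
```

The mean IoU (mIoU) is then the average of the per-class scores, here over the road and pedestrian infrastructure classes.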

III-C The End-to-End Inference Pipeline

During the inference stage, GeoSAM utilizes the fine-tuned decoder obtained during training, along with sparse prompts and dense prompts, to automatically generate a multi-class segmentation map from input geographical images. This end-to-end pipeline mirrors the approach used during training, where the pre-trained CNN encoder provides sparse prompts, and zero-shot SAM contributes dense prompts. GeoSAM processes these prompts with the fine-tuned decoder to generate the final segmentation map output. The generated maps also undergo postprocessing to refine them for more practical use (see Section IV-C1). This inference pipeline ensures that the model is capable of segmenting both road and pedestrian infrastructure simultaneously in geographical images without any human intervention.

IV Experiments

IV-A Datasets

We denote the training dataset as $D_{\text{train}}=\{S_{\text{train}},G_{\text{train}}\}$, where $S_{\text{train}}=\{s_{1},...,s_{n}\}$ and $G_{\text{train}}=\{g_{1},...,g_{n}\}$ correspond to the $n$ sample images and their segmentation ground truth images (which contain masks of road and pedestrian infrastructure). Similarly, we denote the test dataset as $D_{\text{test}}=\{S_{\text{test}},G_{\text{test}}\}$. We also have a separate dataset to test the model’s generalizability that we denote as $D_{\text{gen}}=\{S_{\text{gen}},G_{\text{gen}}\}$.

For these datasets, we use high-resolution orthorectified imagery, which can be defined as a corrected form of raw aerial images, along with corresponding GIS (Geographic Information System) data to create annotation labels or ground truth images. These orthorectified images are sourced from the U.S. Geological Survey (USGS) [56], a body dedicated to the study and mapping of Earth’s natural resources and geological features.

We follow the methodology described in [1] to download orthorectified image tiles and create ground truth images. This involves using specific geographical coordinates to define the bounding box of an area and selecting appropriate zoom levels for the imagery (at zoom level 0, a single image tile represents the entire world). The ground truth images are generated using the same geographical coordinates and their corresponding GIS data. For this purpose, we employ publicly accessible planimetric GIS data from Washington DC [57, 58, 59] and Cambridge, MA [60, 61, 62], focusing on urban elements like roads, sidewalks, footpaths, and crosswalks. By utilizing these features, we can create ground truth images that distinctly categorize two types of mobility infrastructure classes: roads and pedestrian infrastructure.

In training and evaluating the GeoSAM model, we use two separate regions within Washington DC. Further, to test the model’s ability to generalize, we introduce image tiles from a different city, Cambridge, not included in the training set. This approach tests the model’s effectiveness in interpreting and processing new, unseen environments, and hence its robustness and applicability across diverse geographical locations. The input images for GeoSAM are prepared at a resolution of (1024, 1024), with a zoom level of 20. This process requires the stitching of adjacent image tiles, with the base image size dictating the number of neighboring tiles to be merged. The details of these three datasets are consolidated in Table I.

TABLE I: Summary of the three datasets used. The first two datasets are from regions of Washington DC (used for training and evaluation respectively), and the third region is from Cambridge, MA to test generalizability.

Datasets	Geographical Coordinates	Base Image Size	# of Image Tiles Stitched	Area ($\text{km}^{2}$)	Total # of Images
$D_{\text{train}}$	38.905788, -77.045019, 38.90968, -77.019694	(512, 512)	2	1.97	560
$D_{\text{test}}$	38.8968333, -77.0074118, 38.906958, -76.988948	(512, 512)	2	1.04	296
$D_{\text{gen}}$	42.360067, -71.144373, 42.395258, -71.051704	(256, 256)	4	30.18	2380

IV-B Implementation Details

As mentioned earlier, we treat the objective as a multi-class segmentation task, where the classes of interest are roads and sidewalks/crosswalks. We adopt ViT-H [24] as the encoder version of SAM and initialize the model with pre-trained weights from SAM’s ViT-H version. Following the original SAM paper settings [11], we use the AdamW optimizer [63] ($\beta_{1}=0.9$, $\beta_{2}=0.999$) with an initial learning rate of $1\times 10^{-5}$ and a weight decay of 0.1; no data augmentation techniques are applied. Following our experimentation with various values, we choose uniform values of $\alpha$ for all classes and set $\gamma$ to 2 for the loss function. To have an adaptable learning rate, a cosine annealing learning rate scheduler is employed, with the maximum learning rate decaying smoothly to a minimum value ($1\times 10^{-7}$) over the course of training. Tile2Net’s semantic segmentation component, specifically the Hierarchical Multi-Scale Attention model [31], is selected as the pre-trained image encoder and initialized with publicly available pre-trained weights released by the Tile2Net team [1].
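The learning-rate schedule can be illustrated with the standard closed-form cosine annealing rule, $\eta_{t}=\eta_{\min}+\tfrac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos(\pi t/T)\right)$. The sketch below plugs in the two learning rates stated above; the function name, the per-epoch stepping, and the absence of warm-up or restarts are assumptions, since the text does not specify them.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-5, lr_min=1e-7):
    """Closed-form cosine annealing from lr_max (epoch 0) to lr_min (last epoch)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Assumed: one scheduler step per epoch over a 100-epoch run.
schedule = [cosine_annealing_lr(e, 100) for e in range(101)]
```

The rate starts at $1\times 10^{-5}$, passes through the midpoint of the two bounds halfway through training, and reaches $1\times 10^{-7}$ at the final epoch.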

All experiments are run on an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, using Python 3.10.9. We train GeoSAM and the baseline models for 100 epochs. When training GeoSAM, we follow PEFT techniques: we keep the parameters of the encoders (both image and prompt encoder) frozen and update only the parameters of the decoder.

To compare GeoSAM with popular benchmark semantic segmentation models (both CNN-based and ViT-based) in Table II, we train each of the models from scratch in their default settings using the training dataset ($D_{\text{train}}$) described in Section IV-A. Following GeoSAM’s training strategy, the benchmark models are trained using MONAI, an open-source Python framework [64], without any preprocessing or postprocessing for a fair comparison. In the inference stage, we assess these models on the dedicated test ($D_{\text{test}}$) and generalizability ($D_{\text{gen}}$) datasets described in Section IV-A to obtain the intersection over union (IoU) and average precision (AP) values of each class, as well as the mean intersection over union (mIoU) and mean average precision (mAP) values.

IV-C Results

IV-C1 Qualitative Results

TABLE II: GeoSAM Evaluation Results in IoU and AP Against Benchmark Models ("Ped." for Pedestrian, "Infras." for Infrastructure). Washington DC Used for Testing and Cambridge, MA Used for Evaluating Generalizability. Top Results in Bold.

*Tile2Net values for Cambridge, MA are excluded because Cambridge is included in its training data, which would compromise the fairness of the generalizability assessment.
Method	Washington, DC						Cambridge, MA
	IoU (Road Infras.)	IoU (Ped. Infras.)	mIoU	AP (Road Infras.)	AP (Ped. Infras.)	mAP	IoU (Road Infras.)	IoU (Ped. Infras.)	mIoU	AP (Road Infras.)	AP (Ped. Infras.)	mAP
GeoSAM (Ours)	0.76	0.45	0.60	0.67	0.45	0.56	0.61	0.27	0.44	0.55	0.25	0.40
Zero-shot SAM [11]	0.44	0.24	0.34	0.21	0.15	0.18	0.22	0.12	0.17	0.09	0.13	0.11
Tile2Net* [1]	0.60	0.42	0.51	0.54	0.38	0.46	-	-	-	-	-	-
UNet [5]	0.45	0.17	0.31	0.44	0.22	0.33	0.24	0.12	0.18	0.11	0.05	0.08
AttUNet [4]	0.47	0.21	0.34	0.43	0.21	0.32	0.25	0.11	0.18	0.12	0.06	0.09
UNet++ [3]	0.61	0.30	0.45	0.54	0.32	0.43	0.24	0.08	0.16	0.12	0.04	0.08
UNETR [65]	0.48	0.20	0.34	0.50	0.27	0.38	0.25	0.11	0.18	0.12	0.05	0.08
Swin UNETR [66]	0.63	0.26	0.44	0.57	0.29	0.43	0.13	0.09	0.11	0.09	0.04	0.06
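The headline improvements quoted in the abstract (26%, 7%, and 17%) are not derived explicitly in the text. One reading that reproduces them is the relative IoU gain of GeoSAM over Tile2Net on the Washington, DC test set, truncated to whole percentages; this is an inference from the table values, not a calculation stated by the authors. The short check below uses only numbers from Table II.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Washington, DC IoU values copied from Table II.
geosam = {"road": 0.76, "ped": 0.45, "mean": 0.60}
tile2net = {"road": 0.60, "ped": 0.42, "mean": 0.51}

gains = {k: relative_gain(geosam[k], tile2net[k]) for k in geosam}
# Truncating (not rounding) each gain matches the figures in the abstract.
truncated = {k: int(v) for k, v in gains.items()}
```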

We apply postprocessing to the initial segmentation maps generated by the SAM mask decoder, using morphological erosion and dilation [67]. In our task, the priority is path creation rather than the classification of individual pixels, so we place greater emphasis on identifying and delineating all available paths.

In the generated segmentation maps, which represent real-world pedestrian paths, scattered isolated regions indicate inaccuracies in the model’s output. To rectify this issue, we employ erosion to eliminate such isolated regions. Conversely, abrupt interruptions or gaps in otherwise connected paths indicate a failure to segment the entire path. We therefore use dilation to connect the disjointed regions, bringing the map closer to the ground truth representation.

The equations for erosion and dilation are given below, where $I$ is the input segmentation map, $(x,y)$ is a pixel in the image, $B$ is the structuring element (the mask used for the operation), $(i,j)$ are the coordinates within the structuring element, $\bigcap$ is the intersection, and $\bigcup$ is the union operation:

$$E(x,y)=\bigcap_{(i,j)\in B}I(x+i,y+j), \tag{6}$$

$$D(x,y)=\bigcup_{(i,j)\in B}I(x+i,y+j). \tag{7}$$

To perform these operations, we use a (10 $\times$ 10) filter on a (1024 $\times$ 1024) resolution segmentation map. The filter is passed over the whole segmentation map and performs erosion and dilation, respectively, in the regions it covers. We run these operations for a total of 10 iterations to obtain a better-refined map. Fig. 4 demonstrates how the postprocessing techniques improve the output. To show their impact in more detail, we visualize each class separately and plot both the initial output and the postprocessed output. For instance, isolated regions are removed and a connection is established where a path should be.
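The two operators in Eqs. (6) and (7) can be sketched in a few lines of plain Python: on a binary map, the intersection over the window is a minimum and the union is a maximum. This is an illustrative sketch rather than the implementation used in the paper. The function names, the border handling (offsets outside the map are ignored), and the ordering inside `refine` (all erosion passes first, then all dilation passes) are assumptions, since the text does not specify them.

```python
from typing import Callable, List

Mask = List[List[int]]  # binary map: 1 = path pixel, 0 = background


def _apply(mask: Mask, size: int, reduce_window: Callable[[List[int]], int]) -> Mask:
    """Slide a size x size structuring element B over the map and reduce each window."""
    height, width = len(mask), len(mask[0])
    lo = -(size // 2)  # for an even size the window is off-centre by one pixel
    hi = lo + size
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Assumption: offsets that fall outside the map are skipped.
            window = [
                mask[y + j][x + i]
                for j in range(lo, hi)
                for i in range(lo, hi)
                if 0 <= y + j < height and 0 <= x + i < width
            ]
            out[y][x] = reduce_window(window)
    return out


def erode(mask: Mask, size: int) -> Mask:
    """Eq. (6): a pixel survives only if every pixel under B is set."""
    return _apply(mask, size, min)


def dilate(mask: Mask, size: int) -> Mask:
    """Eq. (7): a pixel is set if any pixel under B is set."""
    return _apply(mask, size, max)


def refine(mask: Mask, size: int = 10, iterations: int = 10) -> Mask:
    """Assumed ordering: all erosion passes, followed by all dilation passes."""
    for _ in range(iterations):
        mask = erode(mask, size)
    for _ in range(iterations):
        mask = dilate(mask, size)
    return mask


# Toy map: a 3 x 3 path segment plus one isolated pixel at the lower right.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = refine(noisy, size=3, iterations=1)  # the isolated pixel is removed
bridged = dilate([[1, 1, 0, 1, 1]], size=3)    # the one-pixel gap is closed
for row in cleaned:
    print(row)
print(bridged)
```

The defaults of `refine` mirror the 10 $\times$ 10 filter and 10 iterations stated above; the toy example uses a 3 $\times$ 3 element and a single pass so that the effect is visible on a small map. In practice a vectorized library routine would replace the explicit loops.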

Fig. 5 shows qualitative results of GeoSAM on several randomly selected images from the test dataset. Since the objective is multi-class segmentation, both classes are present in GeoSAM’s output. The figure shows that GeoSAM’s performance is on par with Tile2Net and that GeoSAM significantly outperforms zero-shot SAM on the same task. This supports the earlier assertion in Section I that zero-shot SAM does not perform well on thin-boundary objects in geographical images. Additionally, the final two columns of the figure present GeoSAM’s output for individual class segmentation, showcasing its effectiveness when focused on a single class.

TABLE III: An ablation study. The segmentation performance is examined across varying numbers of sparse point prompts and various model components. DP = dense prompt, PE = pre-trained encoder, FT = fine-tuning. The white and gray sections (here, the first five rows and the last four rows, respectively) represent two distinct types of ablation studies.

Techniques				Sparse Prompt			IoU	
Zero-shot	DP	PE	FT	Foreground Points	Background Points	Ratio	Road Infras.	Pedestrian Infras.
✓	✓	✓	✓	0	0	-	0.01	0.01
✓	✓	✓	✓	100	50	2:1	0.72	0.40
✓	✓	✓	✓	2000	2000	1:1	0.72	0.43
✓	✓	✓	✓	2000	1000	2:1	0.76	0.45
✓	✓	✓	✓	2000	4000	1:2	0.74	0.39
✓	×	×	×	2000	1000	2:1	0.44	0.24
✓	×	×	✓	2000	1000	2:1	0.36	0.20
✓	✓	✓	×	2000	1000	2:1	0.47	0.24
✓	✓	×	✓	2000	1000	2:1	0.53	0.30

IV-C2 Quantitative Results

Next, we report quantitative results based on the evaluation metrics discussed in Section III-B2. Table II compares the detailed results of GeoSAM with those of Tile2Net, zero-shot SAM, and several popular semantic segmentation benchmarks (both CNN-based and ViT-based). We apply Tile2Net to our selected datasets, adjusting the outputs to produce multi-class segmentation maps that include only road and pedestrian infrastructure. For comparison with zero-shot SAM, we use the pre-trained SAM components, supplemented with sparse prompts created through our automated sparse prompt generation technique. We pick UNet [5], AttUNet [4], and UNet++ [3] as CNN-based benchmark models. Among popular ViT-based benchmark models such as [65, 66, 68, 69], we select [65, 66] for comparison.

The results on the test dataset (Washington, DC) in Table II demonstrate that GeoSAM outperforms Tile2Net by relative margins of 17% and 21% in mean IoU (mIoU) and mean AP (mAP), respectively. Compared to zero-shot SAM, GeoSAM performs much better across both classes (a 76% increase in mIoU and a 211% increase in mAP). GeoSAM also significantly outperforms the other benchmark segmentation models on both classes: it surpasses the top-performing CNN-based model (UNet++) by 33% and 30% in mIoU and mAP, respectively, and the best ViT-based model (Swin UNETR) by approximately 36% and 30%. Although [49] is relevant to our work, especially in using foundation models for geographical images, the unavailability of their source code hinders a direct comparison. Instead, we reference the highest IoU value for road segmentation from their results (0.59), which GeoSAM surpasses with an IoU of 0.76. Importantly, [49] does not report results on sidewalk/crosswalk segmentation, which is central to our focus on mobility infrastructure segmentation.

When evaluating on the generalizability dataset from Cambridge, MA, we observe a performance decrease for all models compared to Washington, DC, which is anticipated due to the data shift arising from different geographical regions. Nonetheless, GeoSAM still outperforms the other models significantly, with at least 140% and 260% better results in mIoU and mAP, respectively. This supports our assertion that GeoSAM is scalable and generalizes better than traditional models. Note that we do not include Tile2Net’s results for this generalizability dataset. Since the dataset’s purpose is to assess model performance on unseen data from a completely different location, Tile2Net is not applicable here, as Cambridge, MA was originally part of its training data.
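The relative margins quoted for the two datasets can be re-derived from the mIoU and mAP columns of Table II as (GeoSAM − baseline) / baseline. The short sketch below does this arithmetic; the helper names are hypothetical, and truncating to whole percentages is an assumption that happens to match the figures in the text.

```python
from typing import Dict, Tuple

Scores = Dict[str, Tuple[float, float]]  # method -> (mIoU, mAP), copied from Table II

WASHINGTON_DC: Scores = {
    "GeoSAM": (0.60, 0.56),
    "Zero-shot SAM": (0.34, 0.18),
    "Tile2Net": (0.51, 0.46),
    "UNet": (0.31, 0.33),
    "AttUNet": (0.34, 0.32),
    "UNet++": (0.45, 0.43),
    "UNETR": (0.34, 0.38),
    "Swin UNETR": (0.44, 0.43),
}

CAMBRIDGE_MA: Scores = {  # Tile2Net is excluded, as in Table II
    "GeoSAM": (0.44, 0.40),
    "Zero-shot SAM": (0.17, 0.11),
    "UNet": (0.18, 0.08),
    "AttUNet": (0.18, 0.09),
    "UNet++": (0.16, 0.08),
    "UNETR": (0.18, 0.08),
    "Swin UNETR": (0.11, 0.06),
}


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def gains_over(table: Scores, baseline: str, ours: str = "GeoSAM") -> Tuple[int, int]:
    """(mIoU gain, mAP gain) over one named baseline, truncated to whole percent."""
    o, b = table[ours], table[baseline]
    return int(relative_gain(o[0], b[0])), int(relative_gain(o[1], b[1]))


def smallest_gains(table: Scores, ours: str = "GeoSAM") -> Tuple[int, int]:
    """Gain over the strongest baseline, taken separately for mIoU and mAP."""
    baselines = [scores for name, scores in table.items() if name != ours]
    best_miou = max(scores[0] for scores in baselines)
    best_map = max(scores[1] for scores in baselines)
    o = table[ours]
    return int(relative_gain(o[0], best_miou)), int(relative_gain(o[1], best_map))


for name in ("Tile2Net", "Zero-shot SAM", "UNet++", "Swin UNETR"):
    print(name, gains_over(WASHINGTON_DC, name))
print("Cambridge, MA (strongest baselines):", smallest_gains(CAMBRIDGE_MA))
```

For Cambridge, MA the strongest baseline differs per metric (0.18 mIoU from several models, 0.11 mAP from zero-shot SAM), which is why the "at least" margins are computed column by column.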

Table III reports the ablation studies we performed on the test dataset, split into two parts. In the first part, we examine the effect of the number of sparse prompts on the performance of GeoSAM. We try different ratios of foreground and background points and also run GeoSAM without any points for comparison. The results strongly support the assertion that sparse prompts play a critical role, as demonstrated by GeoSAM’s drop to an IoU of 0.01 on both classes when sparse prompts are omitted. Further, the results show that our initial choice of a 2:1 ratio with 2000 foreground and 1000 background points yields the best performance among the tested configurations. As a result, this configuration becomes the default setting.
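To make the foreground-to-background configurations in Table III concrete, the sketch below draws a fixed number of point prompts from a coarse binary mask. It is a hypothetical illustration only: it assumes points are sampled uniformly at random and labeled 1 for foreground and 0 for background (the point-label convention of SAM), whereas the actual prompt generation procedure is the one described earlier in the paper and may differ. The function name and the toy mask are not from the paper.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def sample_point_prompts(
    mask: Sequence[Sequence[int]],
    n_foreground: int,
    n_background: int,
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Sample foreground/background points from a binary mask (hypothetical helper)."""
    rng = random.Random(seed)
    foreground = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    background = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if not v]
    # Never request more points than the mask can supply.
    fg_points = rng.sample(foreground, min(n_foreground, len(foreground)))
    bg_points = rng.sample(background, min(n_background, len(background)))
    points = fg_points + bg_points
    labels = [1] * len(fg_points) + [0] * len(bg_points)
    return points, labels


# Toy coarse mask standing in for a predicted road or sidewalk region.
coarse_mask = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]

# A 2:1 foreground-to-background ratio, scaled down from the 2000:1000 default.
points, labels = sample_point_prompts(coarse_mask, n_foreground=4, n_background=2)
print(points, labels)
```

Fixing the seed keeps the sampled prompts reproducible between runs, which matters when comparing ratios as in the ablation.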

In the second part of the ablation study, we assess the importance of the key techniques listed in Table III: the pre-trained CNN encoder (PE) used for generating dense prompts, the role of dense prompts (DP) in enhancing sparse prompts, and fine-tuning (FT) the decoder instead of using the original decoder. Using the best sparse prompt setting established in the first part, we evaluate the contribution of each technique within GeoSAM. The results in Table III indicate that each component contributes to the overall performance. Notably, among the single-component removals, excluding FT (the row with DP and PE enabled but FT disabled) causes the largest drop, with road and pedestrian IoU falling from 0.76 and 0.45 to 0.47 and 0.24, underscoring that the original SAM decoder is not well-suited for this type of task. Additionally, without the other techniques, giving only automated sparse prompts to zero-shot SAM (the row with DP, PE, and FT all disabled) yields much weaker performance (0.44 and 0.24), highlighting the importance of each of the techniques we use.

V Conclusion

GeoSAM is a pioneering work that uses SAM’s capabilities to segment mobility infrastructure in geographical images without human intervention. We introduce an architecture that combines sparse prompts and dense prompts with fine-tuning of the foundation segmentation model SAM. In addition, unlike existing work, the training and end-to-end inference pipeline that we developed for mobility infrastructure segmentation is reproducible, can be adapted to various segmentation tasks, and is transferable to other geo-locations by using a different domain-specific encoder and fine-tuning the decoder on a different dataset.

References

[1] M. Hosseini, A. Sevtsuk, F. Miranda, R. M. Cesar Jr, and C. T. Silva, “Mapping the walk: A scalable computer vision approach for generating sidewalk network datasets from aerial imagery,” Computers, Environment and Urban Systems, vol. 101, p. 101950, 2023.
[2] C. Li, Z. Dong, N. Fisher, and D. Zhu, “Coupling user preference with external rewards to enable driver-centered and resource-aware ev charging recommendation,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2022, pp. 3–19.
[3] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer, 2018, pp. 3–11.
[4] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
[5] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[6] Y. Qiang, C. Li, P. Khanduri, and D. Zhu, “Fairness-aware vision transformer via debiased self-attention,” arXiv preprint arXiv:2301.13803, 2023.
[7] ——, “Interpretability-aware vision transformer,” arXiv preprint arXiv:2309.08035, 2023.
[8] M. Hosseini, M. Saugstad, F. Miranda, A. Sevtsuk, C. T. Silva, and J. E. Froehlich, “Towards global-scale crowd+ ai techniques to map and assess sidewalks for people with disabilities,” arXiv preprint arXiv:2206.13677, 2022.
[9] L. Scheibenreif, J. Hanna, M. Mommert, and D. Borth, “Self-supervised vision transformers for land-cover segmentation and classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1422–1431.
[10] M. Saha, M. Saugstad, H. T. Maddali, A. Zeng, R. Holland, S. Bower, A. Dash, S. Chen, A. Li, K. Hara et al., “Project sidewalk: A web-based crowdsourcing tool for collecting sidewalk accessibility data at scale,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–14.
[11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[14] C. Li, P. Khanduri, Y. Qiang, R. I. Sultan, I. Chetty, and D. Zhu, “Auto-prompting sam for mobile friendly 3d medical image segmentation,” arXiv preprint arXiv:2308.14936, 2023.
[15] X. Hu, X. Xu, and Y. Shi, “How to efficiently adapt large segmentation model (sam) to medical images,” arXiv preprint arXiv:2306.13731, 2023.
[16] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691, 2021.
[17] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., “Parameter-efficient fine-tuning of large-scale pre-trained language models,” Nature Machine Intelligence, vol. 5, no. 3, pp. 220–235, 2023.
[18] T. Chen, L. Zhu, C. Ding, R. Cao, S. Zhang, Y. Wang, Z. Li, L. Sun, P. Mao, and Y. Zang, “Sam fails to segment anything?–sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, and more,” arXiv preprint arXiv:2304.09148, 2023.
[19] Z. Fu, H. Yang, A. M.-C. So, W. Lam, L. Bing, and N. Collier, “On the effectiveness of parameter-efficient fine-tuning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 12 799–12 807.
[20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[21] C. Henry, S. M. Azimi, and N. Merkle, “Road segmentation in sar satellite images with deep fully convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 12, pp. 1867–1871, 2018.
[22] L. Zhou, C. Zhang, and M. Wu, “D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 182–186.
[23] V. Badrinarayanan, A. Handa, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[25] A. Saha, “Conducting semantic segmentation on landcover satellite imagery through u-net architectures,” in Proceedings of the Future Technologies Conference. Springer, 2022, pp. 758–764.
[26] X. Chen, Q. Sun, W. Guo, C. Qiu, and A. Yu, “Ga-net: A geometry prior assisted neural network for road extraction,” International Journal of Applied Earth Observation and Geoinformation, vol. 114, p. 103004, 2022.
[27] P. Gudžius, O. Kurasova, V. Darulis, and E. Filatovas, “Deep learning-based object recognition in multispectral satellite imagery for real-time applications,” Machine Vision and Applications, vol. 32, no. 4, p. 98, 2021.
[28] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[29] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, and S. Ermon, “Geography-aware self-supervised learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 181–10 190.
[30] P. Gudzius, O. Kurasova, V. Darulis, and E. Filatovas, “Automl-based neural architecture search for object recognition in satellite imagery,” Remote Sensing, vol. 15, no. 1, p. 91, 2022.
[31] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.
[32] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang, “High-resolution representations for labeling pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.
[33] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang et al., “Deep high-resolution representation learning for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3349–3364, 2020.
[34] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16. Springer, 2020, pp. 173–190.
[35] J. H. Kim, S. Lee, J. R. Hipp, and D. Ki, “Decoding urban landscapes: Google street view and measurement sensitivity,” Computers, Environment and Urban Systems, vol. 88, p. 101626, 2021.
[36] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
[37] D. Wang, J. Zhang, B. Du, D. Tao, and L. Zhang, “Scaling-up remote sensing segmentation dataset with segment anything model,” arXiv preprint arXiv:2305.02034, 2023.
[38] I. Giannakis, A. Bhardwaj, L. Sam, and G. Leontidis, “Deep learning universal crater detection using segment anything model (sam),” arXiv preprint arXiv:2304.07764, 2023.
[39] Z. Wang, S. Sun, X. Que, and X. Ma, “Interactive segmentation in aerial images: a new benchmark and an open access web-based tool,” arXiv preprint arXiv:2308.13174, 2023.
[40] L. Yao, H. Zuo, G. Zheng, C. Fu, and J. Pan, “Sam-da: Uav tracks anything at night with sam-powered domain adaptation,” arXiv preprint arXiv:2307.01024, 2023.
[41] L. P. Osco, Q. Wu, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, J. Li, and J. M. Junior, “The segment anything model (sam) for remote sensing applications: From zero to one shot,” arXiv preprint arXiv:2306.16623, 2023.
[42] F. Deuser, K. Habel, and N. Oswald, “Sample4geo: Hard negative sampling for cross-view geo-localisation,” arXiv preprint arXiv:2303.11851, 2023.
[43] S. Julka and M. Granitzer, “Knowledge distillation with segment anything (sam) model for planetary geological mapping,” arXiv preprint arXiv:2305.07586, 2023.
[44] L. Ding, K. Zhu, D. Peng, H. Tang, and H. Guo, “Adapting segment anything model for change detection in hr remote sensing images,” arXiv preprint arXiv:2309.01429, 2023.
[45] Z. Peng, Z. Xu, Z. Zeng, X. Yang, and W. Shen, “Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction,” arXiv preprint arXiv:2308.14604, 2023.
[46] T. Shaharabany, A. Dahan, R. Giryes, and L. Wolf, “Autosam: Adapting sam to medical images by overloading the prompt encoder,” arXiv preprint arXiv:2306.06370, 2023.
[47] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang, “Advancing plain vision transformer toward remote sensing foundation model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022.
[48] G. Mai, W. Huang, J. Sun, S. Song, D. Mishra, N. Liu, S. Gao, T. Liu, G. Cong, Y. Hu et al., “On the opportunities and challenges of foundation models for geospatial artificial intelligence,” arXiv preprint arXiv:2304.06798, 2023.
[49] K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for remote sensing images,” arXiv preprint arXiv:2304.05215, 2023.
[50] F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, and J. Zhou, “Remoteclip: A vision language foundation model for remote sensing,” arXiv preprint arXiv:2306.11029, 2023.
[51] Z. Yan, J. Li, X. Li, R. Zhou, W. Zhang, Y. Feng, W. Diao, K. Fu, and X. Sun, “Ringmo-sam: A foundation model for segment anything in multimodal remote-sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[52] X. Ma, Q. Wu, X. Zhao, X. Zhang, M.-O. Pun, and B. Huang, “Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints,” arXiv preprint arXiv:2312.02464, 2023.
[53] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[54] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3. Springer, 2017, pp. 240–248.
[55] X. Li, X. Li, D. Pan, and D. Zhu, “On the learning property of logistic and softmax losses for deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4739–4746.
[56] US Geological Survey, “USGS EROS Archive - Aerial Photography - High Resolution Orthoimagery (HRO),” https://doi.org/10.5066/F73X84W6, 2018.
[57] DC GIS (2019a), “Roads 2019,” https://opendata.dc.gov/datasets/DCGIS::roads-2019, 2019.
[58] DC GIS (2019b), “Sidewalks 2019 (2019b),” https://opendata.dc.gov/datasets/sidewalks-2019, 2019.
[59] Cambridge GIS (2018a), “Cambridge sidewalk,” https://www.cambridgema.gov/GIS/gisdatadictionary/Basemap/BASEMAP_Sidewalks, 2018.
[60] Cambridge GIS (2018b), “Pavement markings,” https://www.cambridgema.gov/GIS/gisdatadictionary/Traffic/TRAFFIC_PavementMarkings, 2018.
[61] Cambridge GIS (2018c), “Public footpaths,” https://www.cambridgema.gov/GIS/gisdatadictionary/Basemap/BASEMAP_PublicFootpaths, 2018.
[62] Cambridge GIS (2018d), “Roads,” https://www.cambridgema.gov/GIS/gisdatadictionary/Basemap/BASEMAP_Roads, 2018.
[63] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[64] M. J. Cardoso, W. Li, R. Brown, N. Ma, E. Kerfoot, Y. Wang, B. Murrey, A. Myronenko, C. Zhao, D. Yang et al., “Monai: An open-source framework for deep learning in healthcare,” arXiv preprint arXiv:2211.02701, 2022.
[65] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 574–584.
[66] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. R. Roth, and D. Xu, “Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images,” in International MICCAI Brainlesion Workshop. Springer, 2021, pp. 272–284.
[67] M. Pesaresi and J. A. Benediktsson, “A new approach for the morphological segmentation of high-resolution satellite imagery,” IEEE transactions on Geoscience and Remote Sensing, vol. 39, no. 2, pp. 309–320, 2001.
[68] C. Li, H. Bagher-Ebadian, V. Goddla, I. J. Chetty, and D. Zhu, “Focalunetr: A focal transformer for boundary-aware segmentation of ct images,” arXiv preprint arXiv:2210.03189, 2022.
[69] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.