License: CC BY 4.0
arXiv:2305.19787v2 [cs.CV] 05 Jan 2024

Deep Merge: Deep-learning-based Region Merging for Remote Sensing Image Segmentation

Xianwei Lv, Claudio Persello, Wangbin Li, Xiao Huang, Dongping Ming, Alfred Stein

School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China
Hebei Key Laboratory of Marine Perception Network and Data Processing, Northeastern University at Qinhuangdao 066004, China
Dept. of Earth Observation Science, Faculty ITC, University of Twente, 7500AE Enschede, the Netherlands
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
The Department of Geosciences, University of Arkansas, Fayetteville, AR, USA
School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, China

Abstract

Image segmentation aims to partition an image according to the objects in the scene and is a fundamental step in analysing very high spatial-resolution (VHR) remote sensing imagery. Current methods struggle to effectively consider land objects with diverse shapes and sizes. Additionally, segmentation scale parameters are frequently determined in a static and empirical way, which limits the segmentation of large-scale remote sensing images and yields algorithms with limited interpretability. To address the above challenges, we propose a deep-learning-based region merging method dubbed DeepMerge to handle the segmentation of complete objects in large VHR images by integrating deep learning and a region adjacency graph (RAG). This is the first method to use deep learning to learn the similarity and merge similar adjacent super-pixels in RAG. We propose a modified binary tree sampling method to generate shift-scale data, serving as inputs for transformer-based deep learning networks, a shift-scale attention with 3D relative position embedding to learn features across scales, and an embedding to fuse learned features with hand-crafted features. DeepMerge can achieve high segmentation accuracy in a supervised manner from large-scale remotely sensed images and provides an interpretable optimal scale parameter, which is validated using a remote sensing image of 0.55 m resolution covering an area of 5,660 km$^{2}$. The experimental results show that DeepMerge achieves the highest F value (0.9550) and the lowest total error TE (0.0895), correctly segmenting objects of different sizes and outperforming all competing segmentation methods.

keywords:
Image segmentation, region adjacency graph, deep learning, scale parameter interpretability.
journal: Elsevier

1 Introduction

Acquiring very high-spatial-resolution (VHR) remote sensing images over large areas has become easier than ever due to the advancement in satellite remote sensing technology. VHR images provide rich spatial details for characterizing objects on the ground. For this reason, they are widely applied in land-cover and land-use classification Lv et al. (2021), urban functional zone understanding Zhou et al. (2020), urban management Mundia and Aniya (2005), building roof modelling Zhao et al. (2021), large-scale terrain classification Na et al. (2021), and object extraction and monitoring Chen et al. (2021); Zhao et al. (2022). Objects in the land-cover classes tend to show high intra-class discrepancy and inter-class consistency, leading to challenges in image interpretation. Geographic Object-based Image Analysis (GeOBIA) has been proven to be an effective approach to address this issue by transitioning image interpretation from pixel-level to object-level using image segmentation, which clusters pixels into meaningful complete objects Blaschke (2010). Segmentation is different from semantic segmentation (or pixel-wise classification), which aims to classify each pixel in an image into specific categories and instance segmentation, whose goal is to identify and distinguish individual object instances within an image, providing pixel-level masks for each object. Image segmentation plays a crucial role in GeOBIA as well as other image analysis workflows. Objects are generally clustered with adjacent pixels with similar characteristics, such as spectral reflectance. Many efforts have been made to design precise segmentation methods and interpretable scale parameters for these methods. However, these methods fail to select a ’good’ scale parameter to generate desirable segmentation results, as they tend to exhibit segmentation bias, including over-segmentation and under-segmentation errors, especially when applied to large-area VHR images. 
It is thus paramount to design more accurate and efficient segmentation algorithms with interpretable scale parameters that can be applied to a wide range of applications.

To satisfy the high segmentation quality demanded by diverse applications, region-merging-based segmentation methods were proposed. These methods do not segment the image directly; instead, they initialize over-segmented super-pixels as primitives and then merge them into complete objects. Following the split-and-merge paradigm, many efforts have been made to design merging criteria that merge super-pixels into meaningful complete objects by learning similarities between neighbouring super-pixel pairs Zhang et al. (2014). The method involves three key steps: 1) initializing over-segmented super-pixels, 2) building a region adjacency graph (RAG) model Beaulieu and Goldberg (1989); Haris et al. (1998), and 3) merging the most-similar super-pixel pairs according to the least-weight edge extracted from the RAG. Fig.1 compares the segmentation results of the proposed method with those of state-of-the-art (SOTA) RAG-based region-merging segmentation methods.

Figure 1: Comparison of different region-merging based segmentation methods in region-merging. (a) Original remote sensing image. (b) Initial over-segmented super-pixels. (c) Region-merging results of SOTA method BCMS Zhang et al. (2013). (d) Region-merging results of SOTA method Local-SA Yang et al. (2017). (e) Region-merging results of the proposed method DeepMerge.

Region-merging-based methods via RAG can be implemented in an unsupervised or supervised manner. For unsupervised methods, users need to select a proper scale parameter (i.e., the threshold or the stopping rule) that stops merging by measuring the discrepancy (or similarity) between adjacent super-pixel pairs. However, scale parameter selection relies heavily on user experience and often introduces serious uncertainties in the segmentation results. Existing optimal scale parameter selection methods fail to adapt to various merging criteria, remote sensing sensors, and applications, and their segmentation results are far from satisfactory. Supervised segmentation methods measure the discrepancy or similarity among super-pixels via combined features between super-pixels and reference objects. However, they face challenges when applied to large VHR images: most existing supervised methods are designed for a single scene and fail to generalize to other scenes due to the discrepancy in objects’ characteristics.

As discussed above, both supervised and unsupervised methods suffer from low segmentation accuracy, caused by land objects with diverse shapes and sizes and by the difficulty of selecting an interpretable scale parameter. To address these challenges, we propose a deep stepwise optimization method to handle segmentation in large VHR images. The proposed DeepMerge method learns shift-scale features (shown in Fig.2) to measure the similarity between two adjacent super-pixels.

Figure 2: The shift-scale presentations in the DeepMerge.

To capture the interrelations among the inner scale, object scale, and environmental scale of objects, we introduce a shift-scale presentation module. This module extracts shift-scale features corresponding to these three scales, shown in red, green, and blue in Fig.2. Note that shift-scale differs from multi-scale: shift-scale uses object-specific scales, whereas multi-scale uses fixed scales. To our knowledge, this is the first effort to combine RAG with deep learning for remote sensing image segmentation. The key contributions of our study are the following:

  1. A deep-learning-based region-merging method was designed to enhance large VHR image segmentation. The proposed method achieves high accuracy compared to SOTA methods, correctly segmenting both small and large objects.

  2. The optimal scale parameter of DeepMerge is interpretable and stabilises at 0.5, relieving users of the need to select an optimal scale parameter manually.

  3. To facilitate sample collection for the supervised training of DeepMerge, we designed a user-friendly graphical interface. A skilled operator can select at least 10,000 training sample pairs in a working day.

2 Related works about region-merging-based segmentation

Image segmentation is an essential step in remote sensing image analysis workflows. Several efforts have been made to improve the segmentation quality of remote sensing images, such as the simple linear iterative clustering method Achanta et al. (2012), mean-shift Paris and Durand (2007), and multi-resolution segmentation (MRS) Baatz (2000). Contour-based methods are an important branch of segmentation methods focusing on the extraction of object boundaries Martin et al. (2004); Arbelaez et al. (2010); Pont-Tuset et al. (2016). However, segmentation methods that rely on a single strategy usually fail to meet the requirement of high-quality segmentation. Thus, region-merging-based segmentation has received wide attention. In this section, we review related works concerning the rationale of RAG, optimization of scale parameters, and merging criteria.

2.1 The rationale of RAG

Following the graph structure, RAG is constructed by taking the initial super-pixels as vertices and the connections between super-pixels as edges. Each vertex stores the features of the corresponding super-pixel, and each edge stores the feature distance (i.e., similarity or weight) between two vertices at its ends. Fig.3 describes the concept of a RAG model in a hypothetical over-segmented case.

Figure 3: The construction of RAG and a region-merging step in the RAG.

Based on the seven over-segmented super-pixels (Fig.3a), a RAG (i.e., an undirected graph) is constructed. The vertices in Fig.3b represent super-pixels from the original segmentation, and the edges between vertices depict the connections between super-pixels. The weight of an edge indicates the similarity between two neighbouring vertices and is calculated by user-designed merging criteria; a smaller weight means a higher similarity. Suppose that the weight of the dotted edge between $v_{3}$ and $v_{5}$ in Fig.3b is the smallest, indicating their high similarity. Then $v_{3}$ and $v_{5}$ are merged into a new vertex $v_{3'}$, shown in Fig.3c, and the weights of the edges connected to $v_{3'}$ are updated. The RAG model repeats these operations until the smallest weight is higher than a user-defined scale parameter.
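The merging loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in DeepMerge: the function name `merge_rag`, the dictionary-based graph, and the merging criterion (absolute difference of size-weighted mean features) are all hypothetical stand-ins for a user-designed criterion.

```python
def merge_rag(means, sizes, adjacency, scale):
    """Greedy global-best region merging on a region adjacency graph.

    means:     dict vertex -> mean feature value of the super-pixel
    sizes:     dict vertex -> number of pixels in the super-pixel
    adjacency: dict vertex -> set of neighbouring vertices (undirected)
    scale:     merging stops once the smallest edge weight exceeds this value
    Returns a dict mapping every original vertex to its final region label.
    """
    # Work on copies so the caller's graph is left untouched.
    means, sizes = dict(means), dict(sizes)
    adjacency = {v: set(n) for v, n in adjacency.items()}
    label = {v: v for v in means}

    def weight(a, b):
        # Hypothetical merging criterion: smaller weight = more similar.
        return abs(means[a] - means[b])

    while True:
        edges = [(weight(a, b), a, b)
                 for a in adjacency for b in adjacency[a] if a < b]
        if not edges:
            break
        w, a, b = min(edges)          # least-weight edge in the whole graph
        if w > scale:
            break
        # Merge b into a: update the vertex feature with a size-weighted mean.
        total = sizes[a] + sizes[b]
        means[a] = (means[a] * sizes[a] + means[b] * sizes[b]) / total
        sizes[a] = total
        # Re-attach b's neighbours to a; edge weights are recomputed next pass.
        for n in adjacency.pop(b):
            adjacency[n].discard(b)
            if n != a:
                adjacency[n].add(a)
                adjacency[a].add(n)
        del means[b], sizes[b]
        for v, l in label.items():
            if l == b:
                label[v] = a
    return label


# Four super-pixels in a chain 1-2-3-4: two dark ones, two bright ones.
means = {1: 0.0, 2: 0.125, 3: 1.0, 4: 1.125}
sizes = {v: 10 for v in means}
adjacency = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(merge_rag(means, sizes, adjacency, scale=0.5))  # {1: 1, 2: 1, 3: 3, 4: 3}
```

Recomputing every edge weight on each pass keeps the sketch short; a practical implementation would typically keep the edges in a priority queue and update only those touching the merged vertex.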

2.2 Merging criteria in RAG

Merging criteria in RAG are an important component that determines the segmentation results. To enhance the quality of segmentation, a supervised classic watershed segmentation method was improved via a multispectral gradient Derivaux et al. (2006). Furthermore, Wassenberg et al. (2009) utilized a graph-cutting heuristic to accelerate the Minimum Spanning Tree-based algorithm. In Johnson and Xie (2011), an unsupervised image segmentation evaluation and refinement method using weighted variance and Moran’s I over a series of scales was developed. The studies in Zhang et al. (2014, 2013); Lee and Cok (1991); Su et al. (2020) introduced edge strength, compactness, and standard deviation features into the merging criteria. A multiscale segmentation method that achieves results through stepwise refinement guided by area and boundary features was designed in Chen et al. (2014). Local spectral statistics were used to measure the homogeneity and heterogeneity between segments Wang et al. (2019). To delineate dune-field landscape patches, Zheng et al. (2020) proposed a new method that integrates multisource features representing dune-field landscapes at multiple scales. Su et al. (2020) viewed region merging as a classification process and trained a random forest using segment-based features. However, the segment-based features failed to fully capture the details of objects, so complete objects were not segmented correctly.

2.3 Scale parameter optimisation in RAG

Efforts have been made to optimize scale parameters (thresholds) in RAG. The segmentation output varies with the scale parameter set for the investigated area. A low scale parameter results in small-area segments, benefiting the segmentation of small objects; however, large objects then tend to contain multiple super-pixels, leading to over-segmentation errors. The manually selected optimal scale is often determined by a trial-and-error process, causing uncertainties in segmentation results Lv et al. (2021); Zhang et al. (2018). Therefore, scholars developed automatic and self-adaptive scale optimization methods. The majority of region-merging algorithms calculate the homogeneity within objects and the heterogeneity between objects to determine the optimal scale for objects. For example, an automated approach to parameterizing multiscale image segmentation was proposed to detect scale transitions in objects relying on the local variance Drăguţ et al. (2014). In Ming et al. (2015), a spatial and spectral statistics-based scale parameter selection method was proposed for object-based information extraction using an average local variance graph. Hu et al. (2018) developed a general stepwise evolution analysis framework for optimal scale parameter estimation using local variance and Moran’s I, a measure of spatial autocorrelation. To better segment objects of various sizes, a scale-variable segmentation method was proposed in which scale parameters are adaptively estimated Zhang et al. (2014). To obtain the objective-adaptive scale for each object, Shen et al. (2019); Zhang et al. (2020) proposed object-specific optimization strategies using hierarchical tree structures of multiscale segmentation.

3 Methodology

3.1 Outline of the proposed method

The proposed DeepMerge integrates deep learning and RAG to achieve desirable segmentation in VHR remote sensing images. Fig.4 summarizes the workflow of DeepMerge. The original image is first over-segmented into super-pixels by a standard segmentation method. Then we utilize a Siamese network Chopra et al. (2005); Guo et al. (2017) to learn the similarity between neighbouring super-pixels. We propose the shift-scale transformer (S2Former) model as the backbone network. S2Former is trained using a training set composed of positive and negative samples: pairs of adjacent super-pixels of the same object are positive samples, whereas pairs of neighbouring super-pixels of different objects are negative samples. To train the S2Former, we manually select positive and negative samples. After training, the model measures the similarities (the weights in Fig.4) between adjacent super-pixels. Finally, the RAG model iteratively merges the most-similar super-pixel pairs via the global best-merging strategy until all edge weights in the RAG are higher than 0.5.

Figure 4: The workflow of DeepMerge for super-pixel segmentation in high-resolution remote sensing imagery. W means shared weights. Aux is the loss function auxiliary module. FC is the fully connected layers. SFE is the segment-based feature embedding module.

We recommend MRS for the initial (over)segmentation, a region-growing segmentation method that follows the minimum heterogeneity principle Baatz (2000). MRS has been proven efficient in generating super-pixels as polygons in a shapefile format Lv et al. (2018). Of course, other segmentation methods can also be used as the initial segmentation method in our framework.

We describe the sample collection, shift-scale inputs, segment-based feature embedding, shift-scale attention, and the feature updating strategy for the merging process in the following sections. All code is written in either C# (sample selection software, shift-scale input extraction, RAG) or Python (S2Former and RAG) and was tested on a computer with Windows 10, an Intel i7-13700K CPU (3.4 GHz), 64 GB RAM, and an NVIDIA RTX 4090 GPU. We open-source the code at https://github.com/lvxianwei/DeepMerge.

3.2 Sample collection for training S2Former

To train the S2Former, we manually collected samples according to the following steps. We visually analysed the super-pixels in the initial over-segmentation results and determined the dominant categories in the study areas. Then, we selected neighbouring super-pixels (Fig.5a) with high homogeneity as positive samples (Fig.5b), and neighbouring super-pixels of different categories as negative samples (Fig.5c). It is worth noting that more negative than positive samples need to be collected to account for the large variety of possible discrepancies between two different super-pixels. In VHR images, one object may contain highly heterogeneous super-pixels, and adjacent super-pixels of different objects can contain similar pixels, leading to segmentation errors. Therefore, it is necessary to select many samples of both situations. In addition, particular attention must be paid to similar neighbouring super-pixels in different categories, such as rivers and vegetation, asphalt roads and shadows, and vegetation and roads. We developed a user-friendly graphical interface for collecting the training samples, which automatically records the indexes of the sample pairs selected by operators. As a result, a skilled operator can select at least 10,000 sample pairs in a working day, greatly improving work efficiency.

Figure 5: Sample collection graphical interface and sample pairs. (a) is the graphical interface of sample collection. (b) and (c) are the positive sample pairs and negative sample pairs, respectively.

The S2Former requires image patches as inputs, similar to standard deep learning models. However, the samples collected above are super-pixel pairs, i.e., positive and negative samples with varying shapes and sizes, as shown in Fig.5. This leads to a gap between the super-pixel samples and the input requirement of the proposed model. To bridge the gap between super-pixels and the requirement for square patches, we represent super-pixels using patches with global and local information. Thus, a binary tree sampling method (BTS), a crucial step in object-based convolutional neural networks Zhang et al. (2018); Lv et al. (2019), is implemented to generate suitable inputs for the proposed model from super-pixel samples Lv et al. (2022) and to improve the super-pixel identification quality. The original BTS partitions a super-pixel into sub-super-pixels by recursively dividing it into two parts until a user-defined threshold is reached. Based on the positions of the sub-parts, shift-scale data can be extracted with an adaptive window size. Fig.6 demonstrates how BTS works. For more information, please refer to Lv et al. (2022).

Figure 6: The basic theory about BTS.

The threshold value and the extraction window size are the key issues in the sampling procedure. Thus, we improve BTS by developing an automatic strategy. According to Lv et al. (2022), up to three positions are needed to extract information for the representation of a super-pixel. The window size is determined by the ratio of the intersection of the window and the super-pixel to the window area. Therefore, each super-pixel corresponds to different window sizes. To obtain local and global information of super-pixels, we designed extraction windows at three scales ($P_{1}$, $P_{2}$, and $P_{3}$). The automatic strategy is summarized as follows.

Input: a super-pixel Seg and a centre position Pos for extracting image patches
Output: three patches $P_{1}$, $P_{2}$, and $P_{3}$ at three spatial scales
patch ← ExtractPatch(Pos, 5)
iter ← 0
while $P_{1}$ = null or $P_{2}$ = null do
    intersect_area ← Seg.Intersect(patch).area
    ratio ← intersect_area / patch.area
    if ratio ≤ 0.90 and $P_{1}$ = null then $P_{1}$ ← patch
    if ratio ≤ 0.30 and $P_{2}$ = null then $P_{2}$ ← patch
    iter ← iter + 1
    patch ← ExtractPatch(Pos, 5 + iter × 5)
end while
$P_{3}$ ← ExtractPatch(Pos, $P_{2}$.width + ($P_{2}$.width − $P_{1}$.width))
return $P_{1}$, $P_{2}$, and $P_{3}$
Algorithm 1 shift-scale input data generation

Seg is a super-pixel to be represented by multiple patches in DeepMerge. The red stars in Seg are the positions for extracting square patches (i.e., the red middle star in Fig.5 is the Pos). ExtractPatch(.) extracts a square patch centred on Pos, with an initial side length of 5 pixels. The function Intersect(.) computes the intersection between Seg and patch, whose area is intersect_area. We iteratively increase the width of the square patch by 5 pixels, so intersect_area increases accordingly, while the ratio of the intersection area to the current patch area decreases from 100%. When the ratio first drops to 90% or below, the current patch serves as $P_{1}$. When the ratio first drops to 30% or below, the current patch serves as $P_{2}$. $P_{3}$ is then defined by Pos and the width difference between $P_{1}$ and $P_{2}$, as shown in the pseudocode above. The 90% and 30% thresholds were chosen by visual inspection of the objects. We found that when the ratio approaches 90%, the current patch captures most of the inner information without much external information, and when the ratio approaches 30%, it captures a balance of inner and external information. Note that other positions in the super-pixel share the same multi-scale window sizes. Because super-pixels have different shapes, the square patches extracted from each super-pixel also differ. As most super-pixels tend to be over-segmented, an object can contain multiple super-pixels. Thus, information from $P_{1}$ and $P_{2}$ can sometimes represent the object only partially. To enhance the representation of global objects, information from $P_{3}$ needs to be extracted. In certain cases, some small super-pixels are completely segmented without over-segmentation errors. In this situation, the designed shift-scale information strategy provides not only global information for better overall representation but also neighbouring information that improves the robustness of region merging. Thus, small and large super-pixels can be well represented following the designed protocol.
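As an illustration of Algorithm 1, the window-size search can be written as a short runnable Python sketch. It assumes the super-pixel is given as a boolean NumPy mask and Pos as a (row, column) index; the function names `intersect_ratio` and `shift_scale_widths` are hypothetical, and the sketch returns the three side lengths rather than the cropped patches.

```python
import numpy as np


def intersect_ratio(mask, pos, width):
    """Share of a width x width window centred on pos that lies in the super-pixel.

    The nominal window area (width ** 2) is used as the denominator, so window
    pixels falling outside the image count as background (an assumption).
    """
    r0, c0 = pos[0] - width // 2, pos[1] - width // 2
    rows = slice(max(r0, 0), max(r0 + width, 0))
    cols = slice(max(c0, 0), max(c0 + width, 0))
    return mask[rows, cols].sum() / float(width * width)


def shift_scale_widths(mask, pos, step=5, t1=0.90, t2=0.30):
    """Return the side lengths of P1, P2 and P3 for one extraction position."""
    w1 = w2 = None
    width = step
    # Grow the window until both thresholds have been crossed.
    while w1 is None or w2 is None:
        ratio = intersect_ratio(mask, pos, width)
        if ratio <= t1 and w1 is None:
            w1 = width          # mostly inner information
        if ratio <= t2 and w2 is None:
            w2 = width          # balance of inner and external information
        width += step
    w3 = w2 + (w2 - w1)         # environmental scale
    return w1, w2, w3


# A 21 x 21 square super-pixel in a 101 x 101 image, sampled at its centre.
mask = np.zeros((101, 101), dtype=bool)
mask[40:61, 40:61] = True
print(shift_scale_widths(mask, (50, 50)))  # (25, 40, 55)
```

In this toy case the window first covers at most 90% super-pixel at a width of 25 pixels (441/625) and at most 30% at a width of 40 pixels (441/1600), so the third window is 40 + (40 − 25) = 55 pixels wide.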

3.3 S2Former model

To learn the similarity between neighbouring super-pixels, we propose the S2Former model as the backbone of the Siamese network, which is used for supervised contrastive learning Guo et al. (2017). The basic structure of the Siamese network is presented in Fig.4. The features of negative and positive samples are extracted by a weight-shared backbone. The similarity is obtained from the loss function, which calculates the distance in the feature space. The S2Former comprises shift-scale attention modules, scale-wise pooling, auxiliary modules, and segment-based feature embedding modules, as shown in Fig.7. These component modules are described in the following sections.
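The weight-sharing idea can be illustrated with a small NumPy sketch. The loss used in DeepMerge is not specified in this section, so the sketch assumes a classic margin-based contrastive loss on the Euclidean distance between the two embeddings; the one-layer `embed` function is only a hypothetical stand-in for the S2Former backbone, and `contrastive_loss` and `margin` are illustrative names.

```python
import numpy as np


def embed(x, weights):
    """Stand-in backbone: one linear layer + tanh, shared by both branches."""
    return np.tanh(x @ weights)


def contrastive_loss(x1, x2, same_object, weights, margin=1.0):
    """Margin-based contrastive loss for a batch of super-pixel pairs.

    same_object is 1 for positive pairs (same object) and 0 for negative pairs.
    Returns the mean loss and the per-pair feature-space distances.
    """
    # Both inputs pass through the SAME weights: this is the Siamese part.
    d = np.linalg.norm(embed(x1, weights) - embed(x2, weights), axis=-1)
    pos = same_object * d ** 2                                   # pull positives together
    neg = (1 - same_object) * np.maximum(margin - d, 0.0) ** 2   # push negatives apart
    return float(np.mean(pos + neg)), d


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))
x1, x2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
loss, dist = contrastive_loss(x1, x2, np.array([1.0, 0.0]), weights)
print(loss, dist)
```

With such a loss, a small distance indicates that the two super-pixels likely belong to the same object, which is the kind of quantity the RAG edge weights need.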

Figure 7: The architecture of the S2Former in the DeepMerge.

3.3.1 Shift-scale attention mechanism

Shift-scale attention forms the basic S2Former block, which captures long-range and cross-scale pixel dependencies based on the self-attention mechanism with 3D relative position embedding (Fig.8). Inspired by human visual concentration, the attention mechanism focuses on important information, thereby saving resources and extracting accurate information rapidly Arnab et al. (2021). The S2Former is a feature extraction structure composed of multi-head modules using a shift-scale attention mechanism. The input patches are first resized to 32×32, 64×64, and 128×128, and then embedded into three 8×8 feature maps by the shift-scale embedding module Conv2D (2D convolution functions), whose kernel sizes are 4×4, 8×8, and 16×16, respectively. The embedded features are then flattened into 1D vectors and concatenated for the shift-scale attention modules. Linear is a one-layer fully connected network, MLP denotes the multilayer perceptron, and the Gaussian error linear unit (GELU) is the activation function Hendrycks and Gimpel (2016). The Dropout layer is employed as a regularization strategy.
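The shapes involved in the shift-scale embedding can be checked with a small NumPy sketch. It assumes that each convolution uses a stride equal to its kernel size and no padding, which is the setting that maps 32×32, 64×64, and 128×128 inputs to 8×8 grids; under that assumption the convolution reduces to a linear projection of non-overlapping blocks. The embedding dimension `dim`, the random projection weights, and the function names are illustrative only, and resizing is assumed to have been done beforehand.

```python
import numpy as np


def patch_embed(img, kernel, proj):
    """Non-overlapping convolution (stride == kernel) written as a block projection.

    img:  (H, W, C) array with H and W divisible by kernel
    proj: (kernel * kernel * C, dim) projection matrix
    Returns (H // kernel * W // kernel, dim) tokens.
    """
    h, w, c = img.shape
    gh, gw = h // kernel, w // kernel
    blocks = img.reshape(gh, kernel, gw, kernel, c).transpose(0, 2, 1, 3, 4)
    tokens = blocks.reshape(gh * gw, kernel * kernel * c)
    return tokens @ proj


def shift_scale_tokens(patches, dim=16, seed=0):
    """Embed three already-resized patches and concatenate their tokens."""
    rng = np.random.default_rng(seed)
    out = []
    for img, kernel in zip(patches, (4, 8, 16)):
        proj = rng.normal(scale=0.02, size=(kernel * kernel * img.shape[2], dim))
        out.append(patch_embed(img, kernel, proj))   # 8 x 8 = 64 tokens per scale
    return np.concatenate(out, axis=0)


patches = [np.zeros((s, s, 3)) for s in (32, 64, 128)]
print(shift_scale_tokens(patches).shape)  # (192, 16)
```

Each scale contributes 64 tokens, so the concatenated sequence fed to the attention modules has 3 × 64 = 192 tokens in this sketch.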

Figure 8: Shift-scale attention mechanism.

Eq.1 describes the attention mechanism. $V$ ($=AW^{v}$) is the input $A=\{a_{1},a_{2},a_{3}\}$ (Fig.8) linearly transformed by the weight $W^{v}$. In visual image processing, $A$ is a 1D vector composed of the embedded features $a_{1}$, $a_{2}$, and $a_{3}$. The softmax function scores $V$ based on the query ($Q=AW^{q}$) and the key ($K=AW^{k}$), where $W^{q}$ and $W^{k}$ are weights updated during backpropagation, and $Q$ and $K$ are two matrices used to determine the importance of pixels. Inspired by the 2D relative-position embedding, we extend it to a 3D relative-position embedding $B$ to measure the relative spatial relationships of the shift-scale inputs. The features output by the self-attention are finally fed into an MLP.

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}+B\right)V \tag{1}$$
$$\mathrm{softmax}(z_{i})=\frac{\exp(z_{i})}{\sum_{j=1}^{K}\exp(z_{j})} \tag{2}$$

where $d_{k}$ is the dimension of $K$. The output of the attention module, named head, contains scored information in $A$. The above process is the basic principle of the shift-scale attention mechanism. The softmax function applies the standard exponential function to each element $z_{i}$ of the input vector $\mathbf{z}$ and normalizes these values by dividing by the sum of all the exponentials; this normalization ensures that the components of the output vector sum to 1. The shift-scale attention forms the S2Former block in Fig.7.
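Eq.1 and Eq.2 translate directly into NumPy. The sketch below is a single-head illustration: the token count, the feature sizes, and the random weights are arbitrary, and the bias `B` is simply passed in as a matrix, since the construction of the 3D relative-position embedding is not reproduced here.

```python
import numpy as np


def softmax(z, axis=-1):
    """Eq.2, with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(A, Wq, Wk, Wv, B):
    """Eq.1: softmax(Q K^T / sqrt(d_k) + B) V for one attention head.

    A: (n_tokens, d_in) embedded features; B: (n_tokens, n_tokens) position bias.
    Returns the head output and the attention scores.
    """
    Q, K, V = A @ Wq, A @ Wk, A @ Wv
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k) + B)
    return scores @ V, scores


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 8))                       # 6 tokens, 8 features each
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
B = np.zeros((6, 6))                              # placeholder for the learned bias
head, scores = attention(A, Wq, Wk, Wv, B)
print(head.shape, scores.sum(axis=-1))            # each row of scores sums to 1
```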

3.3.2 Scale-wise pooling

To decrease the model size, we propose scale-wise pooling, which pools the outputs of an S2Former block at three scales. Fig.9 depicts the process. The output of an S2Former block is a 1D vector composed of features at three scales, marked in red, green, and blue. The features are first un-flattened into their 2D state at each of the three scales. Average pooling is then applied to reduce the size of the 2D features. Finally, the new 2D features at the three scales are flattened and concatenated into a 1D vector, whose size is reduced by half.
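The un-flatten, pool, and re-flatten sequence can be sketched with NumPy as follows. The pooling window is an assumption: the sketch uses 2×2 average pooling, which halves each side of an 8×8 grid (so the token count per scale drops from 64 to 16); the source only states that the size is reduced by half, and the function name `scale_wise_pool` is hypothetical.

```python
import numpy as np


def scale_wise_pool(tokens, grid=8, n_scales=3, window=2):
    """Average-pool each scale's tokens separately, then re-concatenate.

    tokens: (n_scales * grid * grid, dim) output of a block, scales stacked in order.
    Returns (n_scales * (grid // window) ** 2, dim) pooled tokens.
    """
    n, dim = tokens.shape
    assert n == n_scales * grid * grid, "unexpected token count"
    out_grid = grid // window
    per_scale = grid * grid
    pooled = []
    for s in range(n_scales):
        # Un-flatten this scale's tokens back into a 2D feature map.
        fmap = tokens[s * per_scale:(s + 1) * per_scale].reshape(grid, grid, dim)
        # Average over non-overlapping window x window neighbourhoods.
        fmap = fmap.reshape(out_grid, window, out_grid, window, dim).mean(axis=(1, 3))
        pooled.append(fmap.reshape(out_grid * out_grid, dim))
    return np.concatenate(pooled, axis=0)


tokens = np.arange(192.0).reshape(192, 1)
print(scale_wise_pool(tokens).shape)  # (48, 1)
```

Pooling each scale on its own grid keeps tokens from different scales from being averaged together, which a plain 1D pooling over the concatenated vector would not guarantee.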

Figure 9: Scale-wise pooling module.
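The un-flatten, pool, flatten, and concatenate steps can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the grid size of each scale, the non-overlapping pooling window, and the name `scale_wise_pool` are assumptions. With the 2×2 window used here each spatial side is halved; the paper's exact pooling window is not stated in this section.

```python
import numpy as np

def scale_wise_pool(tokens, grid_sizes, kernel=2):
    """Pool a token sequence scale by scale.

    tokens: (N, C) array holding the tokens of all scales back to back.
    grid_sizes: list of (h, w) per scale, with sum(h * w) == N (assumed layout).
    kernel: side of the non-overlapping average-pooling window (assumed);
            h and w must be divisible by it.
    """
    pooled, start = [], 0
    for h, w in grid_sizes:
        # un-flatten the tokens of this scale into an (h, w, C) grid
        block = tokens[start:start + h * w].reshape(h, w, -1)
        start += h * w
        # average over each kernel x kernel window
        block = block.reshape(h // kernel, kernel, w // kernel, kernel, -1)
        block = block.mean(axis=(1, 3))
        # flatten back to a token sequence
        pooled.append(block.reshape(-1, block.shape[-1]))
    # concatenate the pooled scales into one sequence again
    return np.concatenate(pooled, axis=0)
```

For three 4×4 grids (48 tokens), the sketch returns 12 tokens, three 2×2 grids concatenated in the original scale order.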

3.3.3 Auxiliary module and segment-based feature embedding

The auxiliary module can assist in improving the performance and robustness of the S2Former. The 1D features of the S2Former block are processed individually in three branches, one per scale, as shown in Fig.10a and Fig.10b. Conv2D (Fig.10b) and Conv1D (Fig.10c) are two-dimensional and one-dimensional convolutions, respectively. The Norm block applies batch normalization to its inputs. ReLU is the rectified linear unit activation. AvgPool is a global pooling layer.

Figure 10: Auxiliary module. (a) is the auxiliary module; (b) is the branch of the auxiliary module; (c) is the segment-based feature embedding.

The segment-based feature embedding block encodes engineered features so that they can be integrated with the deep features extracted by the network (Fig.10c). A total of eighteen features are designed for the segment-based feature embedding module, including texture features, statistical features, shape features, standard deviation of each band, mean value of each band, shape indicator, compactness, brightness, border indicator, and resizing factors. These features are computed from the pixels within the segment. They capture characteristics for assessing region similarity that complement the deep learning features, which are calculated using a fixed window size. Their calculations follow:

$$Mean_{i}=\frac{1}{n}\sum_{j=1}^{n}v_{j} \tag{3}$$
$$Std_{i}=\sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(v_{j}-Mean_{i}\right)^{2}} \tag{4}$$
$$Shape=\frac{l}{4\sqrt{C}} \tag{5}$$
$$Compactness=l\sqrt{n} \tag{6}$$
$$Brightness=\frac{1}{w^{B}}\sum_{i=1}^{K}w_{i}^{B}Mean_{i} \tag{7}$$
$$Border=\frac{l}{2\left(length+width\right)} \tag{8}$$

where i (i=1,2,3) indicates the ith band, n denotes the super-pixel size in pixels, $v_{j}$ denotes the jth pixel value, and $Mean_{i}$ denotes the mean value of the ith band in a super-pixel. $Std_{i}$ denotes the standard deviation of the ith band in a super-pixel. Shape denotes the shape indicator defined by the perimeter (l) of a super-pixel and the perimeter (C) of the minimum bounding rectangle (MBR) of a super-pixel. The length and width are the long edge and the short edge of the MBR. The length and width are used to define the border indicator (Border) and Compactness. K denotes the number of bands, and $w_{i}^{B}$ denotes the related weight.
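A small NumPy sketch of Eqs. 3 to 8 may make the definitions concrete. It is illustrative only: the perimeter and MBR edges are taken as given inputs rather than derived from a polygon, $w^{B}$ is assumed to be the sum of the band weights $w_{i}^{B}$ (the text does not define it), and the function name and dictionary keys are hypothetical.

```python
import numpy as np

def segment_features(pixels, perimeter, mbr_length, mbr_width, band_weights=None):
    """Hand-crafted features of one super-pixel, following Eqs. 3-8.

    pixels: (n, K) array of pixel values, n >= 2 pixels, K bands.
    perimeter: perimeter l of the super-pixel.
    mbr_length, mbr_width: long and short edges of the MBR.
    band_weights: weights w_i^B; equal weights are assumed if omitted.
    """
    pixels = np.asarray(pixels, dtype=float)
    n, k = pixels.shape
    w = np.ones(k) if band_weights is None else np.asarray(band_weights, dtype=float)

    mean = pixels.mean(axis=0)                       # Eq. 3, per band
    std = pixels.std(axis=0, ddof=1)                 # Eq. 4, n - 1 denominator
    mbr_perimeter = 2.0 * (mbr_length + mbr_width)   # C, perimeter of the MBR

    return {
        "mean": mean,
        "std": std,
        "shape": perimeter / (4.0 * np.sqrt(mbr_perimeter)),  # Eq. 5
        "compactness": perimeter * np.sqrt(n),                # Eq. 6 as written
        # Eq. 7; w^B is ASSUMED to be the sum of the band weights
        "brightness": float((w * mean).sum() / w.sum()),
        "border": perimeter / mbr_perimeter,                  # Eq. 8
    }
```

For a two-pixel, three-band segment with values (1, 2, 3) and (3, 4, 5), the band means are (2, 3, 4) and the equal-weight brightness is 3.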

3.3.4 Loss function

The output of the S2Former is a 1D vector storing the features used for super-pixel representation, which are later used to calculate similarity in the RAG model. The features of positive or negative sample pairs are extracted by the network and supervised by the loss function. Given a pair of super-pixels, $L_{aux1}$, $L_{aux2}$, and $L_{main}$ in Fig.4 are calculated by the loss function:

$$Loss=\alpha\left\|A_{i}^{left}-A_{i}^{right}\right\|_{2}+\left(1-\alpha\right)\max\left(0,\ \lambda-\left\|A_{i}^{left}-A_{i}^{right}\right\|_{2}\right) \tag{9}$$

where A is the feature vector of a super-pixel and i indexes the feature items of A. $\alpha$ is a binary indicator (positive pair: $\alpha$=1; negative pair: $\alpha$=0). $\lambda$ is a user-defined parameter that represents the cluster centre of the Euclidean distance in feature space between negative sample pairs (we recommend $\lambda$=1). The function max(.,.) returns the larger of its two arguments. In the proposed DeepMerge, the distance in feature space between positive sample pairs should be close to 0, while the distance between negative sample pairs is pushed towards 1 whenever it is less than 1. With positive pairs near 0 and negative pairs near 1, this predicts that the optimal scale parameter of the proposed method will be 0.5. Once the super-pixel features are extracted by the S2Former, they serve as the vertex features in the RAG model (described in Fig.4). The final loss of the whole S2Former is the weighted sum of the loss values above:

$$L_{final}=L_{main}+0.1\times L_{aux1}+0.2\times L_{aux2} \tag{10}$$
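Eqs. 9 and 10 can be written out for a single pair as a short NumPy sketch. It only illustrates the formulas: batching and gradient computation are left out, and the function names are assumptions rather than names from the authors' code.

```python
import numpy as np

def pair_loss(a_left, a_right, is_positive, margin=1.0):
    """Eq. 9 for one pair of super-pixel feature vectors.

    is_positive plays the role of alpha (True -> 1, False -> 0);
    margin plays the role of lambda (the text recommends 1).
    """
    diff = np.asarray(a_left, dtype=float) - np.asarray(a_right, dtype=float)
    dist = np.linalg.norm(diff)
    alpha = 1.0 if is_positive else 0.0
    # positive pairs are pulled together; negative pairs are pushed apart
    # until their distance reaches the margin
    return float(alpha * dist + (1.0 - alpha) * max(0.0, margin - dist))

def final_loss(l_main, l_aux1, l_aux2):
    """Eq. 10: weighted sum of the main and auxiliary losses."""
    return l_main + 0.1 * l_aux1 + 0.2 * l_aux2
```

A negative pair that is already farther apart than the margin contributes zero loss, so only negative pairs closer than $\lambda$ are penalized.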

3.4 Merging criteria and feature updating

Merging criteria are designed for calculating the similarity between super-pixels, i.e., the edge weight in the RAG model. We use the Euclidean distance between the features of two neighbouring super-pixels as the merging criterion in the proposed method:

$$MC=\left\|A_{i}^{left}-A_{i}^{right}\right\|_{2} \tag{11}$$

where MC is the Euclidean distance between the two feature vectors. It measures the similarity of the two super-pixels and is recorded as the weight of the connecting edge in the RAG model. Conventional merging criteria update the edge weights of a newly merged super-pixel by re-extracting features and re-calculating the weights between the super-pixel and its adjacent super-pixels. However, re-extracting the deep features of a newly merged super-pixel with the pre-trained model would make region merging inefficient. In DeepMerge, the features of a new super-pixel are instead calculated as the weighted average of the two original feature vectors by Eq.12:

$$A_{i}^{left+right}=\frac{1}{m+n}\left(mA_{i}^{left}+nA_{i}^{right}\right) \tag{12}$$

where $A_{i}^{left+right}$ are the features of the new super-pixel. m and n denote the feature vector weights of the left and right super-pixels, i.e., the numbers of extraction centres sampled by BTS in each of them. Therefore, the feature vector weight of the new super-pixel is m+n. In effect, calculating the features of the newly merged super-pixel amounts to calculating the clustering centre of the features of all initial super-pixels it contains. The feature vector of an initial super-pixel is the average of the features extracted from its multiple extraction centres by BTS.
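The interplay of Eq. 11 and Eq. 12 can be sketched as a small region-merging loop. This is a hedged illustration, not the authors' algorithm: it assumes that the scale parameter acts as a threshold on MC, that the globally most similar adjacent pair is merged first, and that super-pixel ids are integers. The graph representation and all function names are hypothetical.

```python
import numpy as np

def merging_criterion(a_left, a_right):
    """Eq. 11: Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a_left - a_right))

def merge_features(a_left, m, a_right, n):
    """Eq. 12: weighted average of the features; the new weight is m + n."""
    return (m * a_left + n * a_right) / (m + n), m + n

def greedy_merge(features, weights, edges, scale=0.5):
    """Merge adjacent super-pixels while the smallest MC is below `scale`.

    features: {id: feature vector}, weights: {id: weight (m or n)},
    edges: iterable of (id, id) adjacency pairs.
    """
    features = {k: np.asarray(v, dtype=float) for k, v in features.items()}
    weights = dict(weights)
    adj = {k: set() for k in features}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    while True:
        # most similar adjacent pair, i.e. the edge with the smallest MC
        best = min(
            ((merging_criterion(features[u], features[v]), u, v)
             for u in adj for v in adj[u] if u < v),
            default=None,
        )
        if best is None or best[0] >= scale:
            break
        _, u, v = best
        # update features by Eq. 12 instead of re-running the network
        features[u], weights[u] = merge_features(
            features[u], weights[u], features[v], weights[v])
        # rewire the neighbours of v to the merged super-pixel u
        for w in adj.pop(v):
            if w != u:
                adj[w].discard(v)
                adj[w].add(u)
                adj[u].add(w)
        adj[u].discard(v)
        del features[v], weights[v]
    return features, weights
```

Because merged features are weighted averages, only the edge weights around the merged super-pixel change after each step; this sketch recomputes all distances in every iteration for brevity, which a practical implementation would avoid.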

3.5 Segmentation accuracy estimation

To assess the segmentation performance of the proposed DeepMerge, two groups of accuracy assessment metrics are applied, considering over-segmentation, under-segmentation, and overall segmentation performance. The first group of metrics includes precision, recall, and F value Zhang et al. (2015). The second group of metrics includes the global over-segmentation error (GOSE), global under-segmentation error (GUSE), and total error (TE) Su and Zhang (2017). These metrics require polygon segmentation results and vectorized reference objects, and have been proven to be effective and robust for measuring local and global segmentation performance from various aspects. The calculations of these metrics are presented in Table 1. S is the set of polygon segmentation results containing M segments {$S_{1}$, $S_{2}$, …, $S_{M}$}, and R is the set of polygon reference objects containing N reference objects {$R_{1}$, $R_{2}$, …, $R_{N}$}. $\left|*\right|$ represents the area of a segment. $R_{i,max}$ denotes the largest area reference object related to the segment $S_{i}$, and $S_{i,max}$ denotes the largest area segment related to the reference object $R_{i}$. $S_{ij}$ denotes the set of segments related to $R_{i}$. $S_{i}\cap R_{i,max}$ is the intersection of $S_{i}$ and $R_{i,max}$, and $S_{i}\cup R_{i,max}$ is their union. The difference set $R_{i}\backslash S_{i,max}$ contains pixels in $R_{i}$ but not in $S_{i,max}$. $\alpha$ is set as 0.5. $\uparrow$ means that higher values indicate better performance, and $\downarrow$ means the opposite.

Table 1: Assessment metrics used for segmentation accuracy estimation.
Assessment metrics Formulas Range Trend
precision $precision=\sum_{i=1}^{M}\left|S_{i}\cap R_{i,\max}\right|/\left|S\right|$, where $\left|S\right|=\sum_{i=1}^{M}\left|S_{i}\right|$ [0,1] $\uparrow$
recall $recall=\sum_{i=1}^{N}\left|R_{i}\cap S_{i,\max}\right|/\left|R\right|$, where $\left|R\right|=\sum_{i=1}^{N}\left|R_{i}\right|$ [0,1] $\uparrow$
F $F=1/\left(\alpha\frac{1}{p}+(1-\alpha)\frac{1}{r}\right)$ [0,1] $\uparrow$
GOSE $GOSE=\frac{1}{\left|R\right|}\sum_{i=1}^{N}\left|R_{i}\right|\frac{\left|R_{i}\backslash S_{i,\max}\right|}{\left|R_{i}\right|-1}$ [0,1] $\downarrow$
GUSE $GUSE=\frac{1}{\left|R\right|}\sum_{i=1}^{N}\min\left(\left|R_{i}\cup S_{ij}\right|\backslash\left|R_{i}\cap S_{ij}\right|,\left|R_{i}\right|\right)$ [0,1] $\downarrow$
TE $TE=GOSE+GUSE$ [0,2] $\downarrow$
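The precision, recall, F, and TE rows of Table 1 can be illustrated with a short NumPy sketch. It assumes that the intersection areas between segments and reference objects have already been computed (for example with a GIS overlay), and that the object "related to" a segment or reference is the one with the largest overlap. Both points, as well as the function names, are assumptions of this sketch.

```python
import numpy as np

def precision_recall(overlap, seg_areas, ref_areas):
    """Area-based precision and recall from Table 1.

    overlap[i, j]: intersection area of segment S_i and reference R_j.
    seg_areas, ref_areas: areas |S_i| and |R_j|.
    The best-matching object is ASSUMED to be the one with maximal overlap.
    """
    overlap = np.asarray(overlap, dtype=float)
    precision = overlap.max(axis=1).sum() / np.sum(seg_areas)
    recall = overlap.max(axis=0).sum() / np.sum(ref_areas)
    return float(precision), float(recall)

def f_value(p, r, alpha=0.5):
    """F = 1 / (alpha / p + (1 - alpha) / r), with alpha = 0.5 as in the text."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

def total_error(gose, guse):
    """TE = GOSE + GUSE."""
    return gose + guse
```

With $\alpha$=0.5, F reduces to the harmonic mean of precision and recall. Plugging in the DeepMerge precision (0.9679) and recall (0.9425) from Table 2 gives F of about 0.9550, and the reported GOSE and GUSE sum to the reported TE of 0.0895.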

4 Experimental Results

4.1 Dataset

The dataset used in this study covers nine cities that belong to the Phoenix city cluster, Arizona, U.S., including Phoenix, Glendale, Scottsdale, Tempe, Mesa, Chandler, Peoria, Surprise, and Goodyear. The images are composites from Google Earth Yu and Gong (2012), acquired by a variety of sensors (e.g., WorldView, QuickBird, and IKONOS) at different times. The image contains 18.7B pixels ($182,272\times 102,626$, covering 5,660 km$^{2}$) with 0.55-meter resolution and RGB bands encoded as 8-bit integers. A variety of scenes are included, e.g., urban residential zones, urban green spaces, industrial zones, rural farmlands, water areas, and bare lands (Fig.11).

Figure 11: The study area of the Phoenix city cluster. The translucent blue tiles are used for training and the green tiles are used for accuracy assessment.

We partition the image into a total of 135 tiles encoded by row and column numbers (Fig.11). The average size of a tile is 167M ($12,800\times 12,800$) pixels. The training of the proposed DeepMerge requires negative and positive samples. In the Phoenix dataset, negative and positive pairs are selected in the patches with blue masks, as shown in Fig.11. A total of 71,948 super-pixel pairs (54,945 negative sample pairs and 17,003 positive sample pairs) involving 92,421 super-pixels are manually collected as training data. There is a total of 47,995,928 super-pixels in the whole study area. The training data accounts for 0.19% (=92,421/47,995,928) of the total number of super-pixels.

To assess the performance of DeepMerge, a total of 3,776 polygons are manually digitized as reference objects. To validate the transferability of DeepMerge, the reference objects, covering a variety of land uses (Fig.12), are selected in the green mask tiles in Fig.11.

Figure 12: Examples of reference objects.

4.2 Region-merging-based image segmentation results

For consistency, we set the shape, compactness, and scale parameters of MRS for the initial over-segmentation to 0.5, 0.5, and 25, respectively. Too many segments can lead to low efficiency of region merging. With these parameters, MRS segments the smallest objects in the dataset with few over-segmentation errors, balancing over-segmentation errors against the number of super-pixels. Relying on the same initial over-segmentation and the same RAG model, ten bottom-up supervised and unsupervised methods are selected as competing algorithms, including BCMS Zhang et al. (2013), FHS Zhang et al. (2014), HRM Zhang et al. (2014), CSVD Chen et al. (2015), USIH Hu et al. (2017), Local-SA Yang et al. (2017), OMS Shen et al. (2019), OSO (fine and coarse) Zhang et al. (2020), MLRM Su et al. (2020), and IOseg Lv et al. (2021). These methods differ in optimization strategies, including merging criteria (e.g., edge penalty), object optimization, scale sets, and supervised segmentation. Diverse merging criteria suggest different scale parameters. Some multiscale segmentation methods (BCMS and FHS) set scale parameters by trial and error. The scale parameters in HRM, CSVD, Local-SA, MLRM, and IOseg are set as recommended. USIH, OMS, and OSO automatically generate specific objects with optimized scales, where OSO can produce coarse and fine segmentations at the same time. In our experiments, we set the optimal scale value of DeepMerge to 0.5.

In addition to standard image segmentation methods, deep-learning-based semantic segmentation has made great progress in past years. Encoder-decoder neural networks are able to predict pixel-wise labels in the image, and novel modules have been designed to improve semantic segmentation accuracy in various applications. Standard deep-learning-based semantic segmentation methods, namely UNet Ronneberger et al. (2015), UNet++ Zhou et al. (2018), U2Net Qin et al. (2020), UNetFormer Wang et al. (2022), FCN Long et al. (2015), SegNet Badrinarayanan et al. (2017), DeepLab Chen et al. (2014), PspNet Zhao et al. (2017), ABCNet Li et al. (2021a), MAResUNet Li et al. (2021b), EANet Guo et al. (2022), CCNet Huang et al. (2019), SegFormer Xie et al. (2021), DenseASPP Yang et al. (2018), and ENet Paszke et al. (2016), are selected as competing algorithms. We labeled the blue mask areas (Fig.11) as the training dataset for the semantic segmentation networks. We preserved the boundaries of the semantic segmentation results in the green mask areas to serve as polygon objects for comparison with DeepMerge. The details of dataset labeling, network training, and image segmentation output are given in Appendix A. The segmentation results based on the semantic segmentation methods have low F values and high TE values, caused by low precision and high GUSE. This indicates serious under-segmentation errors caused by pixel adhesion, whose details are described in Appendix A.

The segmentation performances of DeepMerge and the competing methods are shown in Table 2. A sensitivity analysis of the scale parameter is reported in Section 4.3. The precision (0.9772) of the segmentation results from MRS, the initial segmentation approach, is the highest, and the GUSE value (0.0346) of MRS is the lowest among all investigated methods. These two metrics are closely related to over-segmentation errors. In general, higher precision and lower GUSE values indicate stronger over-segmentation errors and weaker under-segmentation errors. We notice that the proposed DeepMerge achieves the best performance in the remaining metrics compared with the other methods. The F value of DeepMerge (0.9550) is higher than that of FHS, the second-best method (F value: 0.8465). The TE value (0.0895) of DeepMerge is the lowest, suggesting small segmentation errors. Note that the proposed DeepMerge, among all competing algorithms, achieves the highest recall value and the lowest GOSE value at the same time, indicating its superiority and robustness. The image segmentation results derived from semantic segmentation show weak performance compared to standard super-pixel segmentation methods. The precision values of these methods are extremely low and their GUSE values are high, as similar nearby objects can be clustered into the same object, resulting in substantial under-segmentation errors.

Table 2: Image segmentation performance of the DeepMerge and other competing methods.
Method precision↑ recall↑ F↑ GOSE↓ GUSE↓ TE↓
MRS 0.9772 0.1913 0.3200 0.1028 0.0346 0.1374
BCMS 0.9540 0.4233 0.5864 0.1392 0.0431 0.1823
FHS 0.9568 0.7590 0.8465 0.1247 0.0459 0.1706
HRM 0.9714 0.3288 0.4913 0.1246 0.0394 0.1641
CSVD 0.9657 0.5954 0.7366 0.1762 0.0504 0.2265
USIH 0.8870 0.6504 0.7505 0.1164 0.0483 0.1647
Local-SA 0.7062 0.8179 0.7580 0.0610 0.1370 0.1980
OMS 0.9733 0.3154 0.4764 0.1145 0.0819 0.1964
OSO(fine) 0.9700 0.4624 0.6262 0.1387 0.0406 0.1793
OSO(coarse) 0.9586 0.5150 0.6700 0.1283 0.0469 0.1751
MLRM 0.9699 0.3860 0.5522 0.1606 0.0390 0.1996
IOseg 0.7346 0.8737 0.7981 0.0903 0.0563 0.1466
UNet 0.2424 0.9165 0.3834 0.0558 0.4076 0.4634
FCN 0.0364 0.9029 0.0700 0.0615 0.4468 0.5083
SegNet 0.0678 0.9222 0.1263 0.0470 0.4295 0.4765
PspNet 0.0182 0.8810 0.0357 0.0775 0.6564 0.7339
DeepLab 0.0169 0.8751 0.0331 0.0742 0.5394 0.6136
U2Net 0.0677 0.7685 0.1244 0.1289 0.3774 0.5063
EANet 0.0024 0.8264 0.0048 0.1061 0.5206 0.6267
ABCNet 0.0350 0.8032 0.0671 0.1187 0.4549 0.5736
MAResUNet 0.0288 0.7858 0.0556 0.1271 0.3302 0.4573
UNetFormer 0.0019 0.7837 0.0038 0.1324 0.7611 0.8935
CCNet 0.0060 0.8581 0.0120 0.0834 0.5953 0.6787
SegFormer 0.0235 0.8200 0.0458 0.1097 0.4691 0.5788
UNet++ 0.1307 0.7692 0.2234 0.1295 0.1718 0.3013
DenseASPP 0.0058 0.6871 0.0114 0.1689 0.5051 0.6740
ENet 0.0302 0.9278 0.0585 0.0458 0.5962 0.6420
Ours (DeepMerge) 0.9679 0.9425 0.9550 0.0454 0.0441 0.0895

Example segmentation results of the investigated algorithms are presented in Figs.13-15, where three typical landscape types are selected: 1) urban residential areas (Fig.13), 2) rural industrial zones (Fig.14), and 3) rural green spaces (Fig.15). We notice that the results from MRS contain notable over-segmentation errors. Starting from the MRS results, the other investigated methods generate their optimized segmentations. We observe that the segmentation results of DeepMerge are satisfactory and superior to the others. Almost all house outlines, road boundaries, green spaces, and even sidewalk boundaries are precisely delineated by DeepMerge (Fig.13). FHS and USIH also achieve good performance, but with discontinuous roads. The excellent performance of DeepMerge is also demonstrated in Fig.14, where objects vary greatly in size, texture, shape, and spectrum. DeepMerge successfully delineates large areas of bare land, factory buildings, roads, and even individual trees. Local-SA and USIH also present satisfactory segmentation results, but some over-segmentation errors still exist (see the spots in the buildings of Local-SA and the bare lands of USIH). The above qualitative results support the robustness of the proposed DeepMerge.

Figure 13: Image segmentation results in urban residential areas.
Figure 14: Image segmentation results in rural industrial zones.
Figure 15: Image segmentation results in rural green spaces.

4.3 Optimal scale parameters in DeepMerge

The final segmentation results of the proposed method can be derived by setting different scale parameters. The precision, recall, and F values of the proposed DeepMerge are presented in Fig.16a, where the scale parameters are set as 0.01, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, and 0.99. The precision curve (red line in Fig.16a) decreases slowly for scale parameters from 0.0 to 0.6. After 0.6, however, the precision decreases sharply as the scale parameter increases further. The recall value of the proposed DeepMerge increases sharply when the scale parameter is above 0.2. In comparison, the F values first increase and then decrease as the scale parameter increases. The highest F value is achieved when the scale parameter is 0.5. In general, the segmentation performance of DeepMerge varies with the setting of the scale parameter. Fig.16b depicts the global and local F values for varying scale parameters, where the gray curves of the local F values first increase and then decrease, like the global F values. All of the optimal scale parameters of the local F values are distributed around 0.5. The above results suggest that the optimal scale for DeepMerge is 0.5, as predicted in Section 3.3.4.

Figure 16: Segmentation performance of the proposed DeepMerge with different scale settings. (a) Performance curves by precision, recall, F. (b) Performance curves by global and local F values in the study areas.

4.4 Ablation experiments

The margin value, number of training epochs, batch size, and learning rate of DeepMerge are set to 1.0, 100, 120, and 0.0001, respectively. The quantitative segmentation measures of the ablation experiments, based on six evaluation metrics, are given in Table 3. 'S2E' denotes a model with the shift-scale embedding module and '3DP' a model with the 3D position embedding module. As expected, the S2E model achieves satisfactory performance as it can learn shift-scale features of objects. 'Aux' denotes a model with an auxiliary module added to the backbone. 'SEF' denotes a model with a segment-based feature module added to the backbone. 'depth' gives the number of layers in each stage of the S2Former. The results suggest desirable performance, evidenced by the high recall and F values and the low GOSE and TE values, supporting the ability of S2Former to improve image segmentation performance.

Table 3: Ablation experiments on model variations.
S2E 3DP Aux SEF depth precision↑ recall↑ F↑ GOSE↓ GUSE↓ TE↓
[6,4,2] 0.9142 0.9122 0.9132 0.0534 0.0539 0.1073
[6,4,2] 0.9254 0.9370 0.9311 0.0453 0.0594 0.1048
[6,4,2] 0.9386 0.9056 0.9218 0.056 0.0532 0.1093
[6,4,2] 0.9464 0.9178 0.9319 0.0547 0.0485 0.1032
[6,4,2] 0.9483 0.9320 0.9401 0.048 0.0533 0.1014
[6,4,2] 0.9483 0.9209 0.9344 0.0543 0.0487 0.1031
[6,4,2] 0.9246 0.9431 0.9338 0.0415 0.0551 0.0966
[6,4,2] 0.9679 0.9425 0.9550 0.0454 0.0441 0.0895
[3,2,1] 0.8823 0.9077 0.8948 0.0580 0.0639 0.1219

5 Discussion

The proposed DeepMerge uses only 0.19% of the total number of super-pixels as training data to achieve desirable image segmentation results. The optimal scale parameter value stabilizes at 0.5 in the study areas, relieving users of the need to select the optimal scale parameter value. Thus, unlike other multi-scale segmentation methods, DeepMerge overcomes the problem of scale parameter selection. A scale parameter closer to zero denotes a high similarity, while values close to or greater than one denote a low similarity. From Fig.16, we notice that segmentation results in the scale range of [0.4, 0.6] are desirable for many applications, and the proposed DeepMerge can generate segmentation results at different scale parameters as needed.

To ensure the efficiency of the merging process, many region-merging methods introduce the object size as a parameter in the merging criteria. Although such a strategy improves the merging efficiency and generates evenly sized objects, it fails to meet the demands of many applications because object sizes in an image are unevenly distributed. For example, the tiny objects in our dataset include individual trees and three-pixel-wide intermittent sidewalks, whereas factory buildings and roads usually contain thousands of pixels, as shown in Fig. 17. The blue mask in the figure is a road segment that forms a single object of 166,981 m². The red polygons are independent residential houses with an average area of 1,000 m². The green areas are individual lawns, the smallest of which covers only 42 m². In this case, the blue road area is 3,976 times the size of the smallest lawn. In this study, the segmentation results of DeepMerge are able to capture the true sizes of objects.
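
The bias that a size term introduces can be illustrated with one classic size-weighted merging cost, the increase in within-region squared error caused by a merge. This is only one example of such a criterion and is not claimed to be the one used by any particular method compared in this paper; the function name and the toy numbers are hypothetical.

```python
def size_weighted_cost(mean_a, size_a, mean_b, size_b):
    """Increase in within-region squared error when regions a and b are merged.

    The factor size_a * size_b / (size_a + size_b) grows with region size,
    so merging two large regions is expensive even when they look alike.
    """
    return size_a * size_b / (size_a + size_b) * (mean_a - mean_b) ** 2


# Two small, clearly different regions (42 pixels each, spectral gap of 10).
cost_small_different = size_weighted_cost(100.0, 42, 110.0, 42)
# Two large, nearly identical halves of one road (80,000 pixels each, gap of 1).
cost_large_similar = size_weighted_cost(100.0, 80_000, 101.0, 80_000)
```

In this toy case the two halves of the road are far more expensive to merge than the two dissimilar small regions, so a size-weighted criterion tends to keep large objects fragmented while merging small ones, which is the even-size tendency described above.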

Figure 17: An example of objects with widely varying sizes.

6 Conclusion

In this study, we proposed a deep step-wise optimization method for segmenting large-area, high-resolution remote sensing imagery. Given an initial super-pixel segmentation, DeepMerge automatically produces an optimal segmentation in vector format, controlled by an interpretable scale parameter.

Our method combines deep learning and RAG. We proposed the shift-scale embedding, the shift-scale attention mechanism, and an interpretable scale parameter, which together form S2Former and capture shift-scale information. In addition, we introduced a segment-based feature embedding module into the network to retain object features. On the Phoenix city cluster, the proposed method outperforms SOTA methods in both qualitative and quantitative evaluations. The proposed method requires only a small number of training samples relative to the total number of super-pixels: in our experimental analysis, only 0.19% of the super-pixels were labelled. Trained on this small fraction of the data, DeepMerge improves the F value by 0.1 and reduces TE by 0.04 relative to SOTA methods. The optimal scale parameter of the proposed method stabilizes at 0.5, which makes selecting the parameter value easier for the user. DeepMerge is suitable for the precise segmentation of large-area, very high-spatial-resolution remote sensing images and can provide efficient and precise segmentation products. However, it still requires training samples, which makes it time-consuming to apply to small areas.

In future work, we plan to improve DeepMerge and broaden its applications by exploring unsupervised DeepMerge-based segmentation approaches and by further evaluating its performance on various land-cover classification problems.

7 Acknowledgements

This work was supported in part by the China Scholarship Council and the Scientific Research Startup Fund of Northeastern University at Qinhuangdao.

Appendix A Details about semantic segmentation

Fig. 18 depicts the workflow for the semantic segmentation networks: labeling the dataset, training the networks, preserving boundaries, and estimating segmentation accuracy. To keep the training data consistent, all super-pixels in the blue tiles are selected as the training dataset for the semantic segmentation methods. We manually label the data in the blue zones, obtaining 11 classes (shown in Fig. 18b) and 1,803 training images of 512×512 pixels. After training, the output of the networks on the test area is a pixel-wise category prediction. To make the results comparable, we preserved only the boundaries of the predictions and saved them as standard segmentation outputs, as shown in Fig. 18d.

Figure 18: Workflow of image segmentation based on semantic segmentation networks. (a) is the training data area corresponding to Fig. 4; (b) shows the land cover categories in the dataset; (c) is the training dataset for semantic segmentation; (d) is the processing of remote sensing image segmentation based on semantic segmentation methods.

On this dataset, the semantic segmentation methods perform reasonably well when evaluated with a pixel-based F score in the test area shown in Fig. 19. For example, the pixel-based F score of UNet reaches 0.7085 in the test area. However, the image segmentation assessment metric is an F value based on reference polygons, which differs from the pixel-based F score. The precision and recall measures are calculated from region overlap. The matching direction for the precision measure is defined as a reference-to-segment directional correspondence. For the recall measure, we reverse this and match segments to reference objects.

Pixel adhesion often occurs between similar neighbouring objects in the semantic segmentation results, as shown in Fig. 20. Although the semantic segmentation in Fig. 20b looks desirable on visual assessment, the image segmentation results derived from its boundaries are poor, which causes the low F values in Table 2.
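
The adhesion effect can be reproduced on a toy grid. In the sketch below, which is purely illustrative and uses hypothetical names and data, two adjacent houses are predicted with the correct class at every pixel, yet extracting segments as connected regions of equal class fuses them into a single object because no boundary separates pixels of the same class.

```python
def connected_segments(grid):
    """Count 4-connected regions of identical value in a 2D grid (flood fill)."""
    rows, cols = len(grid), len(grid[0])
    labels = [[-1] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue
            labels[r][c] = count
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and labels[ny][nx] == -1
                            and grid[ny][nx] == grid[y][x]):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
            count += 1
    return count


# Reference objects: background (0) and two touching houses (1 and 2).
reference_objects = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
# A pixel-perfect class map: both houses are class "H", the rest is "B".
predicted_classes = [["H" if v else "B" for v in row] for row in reference_objects]

n_reference = connected_segments(reference_objects)  # background + 2 houses
n_predicted = connected_segments(predicted_classes)  # background + 1 fused blob
```

Every pixel carries the right class, so a pixel-based score is perfect, but the two reference houses map to a single predicted segment, and a polygon-based F value penalises exactly this kind of under-segmentation.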

Figure 19: Semantic segmentation results of five standard networks.
Figure 20: Pixel adhesion of semantic segmentation methods. (a) and (e) are the output of semantic segmentation and image segmentation; (b) is a zoomed-in view of a local region; (c) and (f) are image segmentation results; (d) is the image segmentation result of semantic segmentation showing pixel adhesion; (g) is the image segmentation of our method.
