License: arXiv.org perpetual non-exclusive license
arXiv:2402.16242v1 [cs.CV] 26 Feb 2024

HSONet: A Siamese foreground association-driven hard case sample optimization network for high-resolution remote sensing image change detection

Chao Tao, Dongsheng Kuang, Zhenyang Huang, Chengli Peng, Haifeng Li

The work was supported in part by the Major Program Project of Xiangjiang Laboratory under Grant 22XJ01010, in part by the National Natural Science Foundation of China under Grant 61973047 and Grant 42171458, and in part by using Computing Resources at the High-Performance Computing Platform of Central South University. (Corresponding author: Haifeng Li.)

Chao Tao, Dongsheng Kuang, Zhenyang Huang, Chengli Peng, and Haifeng Li are with the School of Geosciences and Info-Physics, Central South University, Changsha 410083, China, and also with the Xiangjiang Laboratory, Changsha 410205, China.

Abstract

Deep learning technologies have driven significant advances in remote sensing change detection (RS-CD). RS-CD relies on the model’s ability to learn features of labelled change objects, known as foreground targets. In addition to foreground targets, the most valuable samples in the image space for model optimization are the unlabelled and semantically ambiguous samples in the background, which include hard samples, pseudochanges, and noninteresting changes; these samples are collectively referred to as “hard case samples” in this paper. In the later training stages, further improvement of the model’s ability to determine changes relies on how well the model learns hard cases; however, learning hard case samples poses two challenges. (1) Change labels are limited and tend to point only to foreground targets, yet hard case samples are prevalent in the background; as a result, the optimization of the loss function focuses on the foreground targets and ignores the background hard cases, which we call ‘imbalance’. (2) Complex situations, such as light shadows, target occlusion, and seasonal changes, induce hard case samples, and in the absence of both supervisory and scene information, it is difficult for the model to learn hard case samples directly and obtain accurate feature representations of the change information, which we call ‘missingness’. We propose a Siamese foreground association-driven hard case sample optimization network (HSONet). To deal with the imbalance, we propose an equilibrium optimization loss function that regulates the optimization focus of the foreground and background, determines the hard case samples through the distribution of the loss values, and introduces dynamic weights in the loss term to gradually shift the optimization focus of the loss from the foreground to the background hard cases as the training progresses. To address the missingness, we interpret hard case samples with the help of the scene context and propose the foreground-scene association module, which uses latent remote sensing spatial scene information to model the association between the targets of interest in the foreground and the related context, obtains a scene embedding, and applies this embedding to the feature reinforcement of hard cases. Experiments on four public datasets show that HSONet outperforms current state-of-the-art CD methods, particularly in detecting hard case samples.

Index Terms:
Change detection, remote sensing image, hard case samples, foreground-scene association.

I Introduction

Change detection is a crucial technology in interpreting and analyzing remote sensing images. Remote sensing change detection (RS-CD) focuses on the same geographical areas on the Earth’s surface, to capture changes in land cover information over time and to generate binary change maps representing differential information. Today, a multitude of CD algorithms and theoretical models have been proposed. Coupled with the development of remote sensing imaging technology, RS-CD has been widely applied in many significant fields, including land resource management[1], natural disaster monitoring[2], agricultural land analysis[3], and urban-rural development planning[4, 5].

In recent years, owing to the gradual maturation of the commercialization of high-resolution remote sensing satellites, current high-resolution remote sensing imagery (HRRS) can describe a variety of feature entities more comprehensively and in more detail[6, 7]. However, the prevalence of light shading, target occlusion, and seasonal changes in optical remote sensing images, coupled with difficulties such as spectral feature confusion, increases the intraclass variance in features, while the interclass variance decreases, which in turn leads to a significant reduction in the separability of remotely sensed targets. This problem is particularly pronounced in HRRS and, therefore, induces intractable CD problems involving hard samples, pseudochanges, and noninteresting changes. Like in semantic segmentation tasks, RS-CD relies heavily on the effectiveness of feature learning for key objects, which are also called foreground targets. In addition to these foreground targets, the most valuable elements for model optimization in the image space are the semantically ambiguous samples in the background, i.e., hard case samples, which include hard samples, pseudochanges, and noninteresting changes. After many experimental explorations, we realized that hard case samples are often more valuable for optimization than foreground targets are in the later model training stages because, for the pixel-level prediction task, the deep neural network first obtains the foreground target from the background information and then calculates the conditional probability of the category to which the foreground target belongs on a pixel-by-pixel basis and selects the highest probability score as the category information for that pixel. The probability of a hard case sample being correctly discriminated can increase if and when the hard case target and the foreground target are optimized to the same degree. 
Conversely, hard case samples that interfere with foreground feature extraction and feature discrimination degrade RS-CD performance, so optimizing the feature learning process for background hard cases is highly important.

At present, deep learning techniques are increasingly widely used in remote sensing, and deep neural network-based change detection models are also focused on solving hard case sample problems and subproblems. These methods can be broadly categorized into two groups: (1) strengthening the feature representations of key information and (2) adding an attention mechanism and its variants. In terms of enhancing feature representation, Zheng et al.[8] considered the importance of the accuracy and completeness of building boundary recognition for CD and proposed a target edge guidance module and a feature differential enhancement module based on a transformer for refining edge features and fusing different levels of change information, respectively, thus enhancing the high-frequency information of buildings. Xu et al.[9] realized the impact of the structural information of remote sensing targets on the detection accuracy and proposed a multiscale context aggregation network in which the global attention pyramid module and the dense feature fusion module are used to enhance the depth features of the original target and bridge the semantic gap between multiscale features, respectively. In terms of incorporating attention mechanisms, Fang et al.[10] introduced an integrated channel attention module for the deep supervision of interest features and designed a tightly coupled information transmission mechanism between encoders and decoders. This approach mitigates the loss of deep localization information in neural networks and ultimately enables the localization of edge pixels and the capture of small targets. 
Chen et al.[11] considered the prevalent pseudochange problem in CD and mitigated it by applying spatial and channel attention mechanisms to sensitize the network to changes of interest, additionally balancing the variability between samples by penalizing noninteresting changes and increasing the attention given to changes of interest. Within the first category of methods, Chen et al.[12] used differential features at different scales to strengthen the feature representation of changes, effectively mitigating issues such as pseudochanges. However, the greatest challenge in optical remote sensing image feature extraction lies in learning accurate semantic features from hard case samples. Strengthening the feature representation or designing an attention mechanism alone cannot guarantee an accurate and complete semantic understanding of hard case samples, so there is currently no learning method designed specifically for hard case samples. Networks such as DMINet[13] and DASNet[11] have addressed several challenges, such as pseudochanges and differences between foreground and background samples, and have optimized the focus of the network using joint attention and spatial and channel attention mechanisms, thus gaining some robustness against pseudochange issues. Nevertheless, changing the network’s attention to various types of samples does not improve the model’s ability to identify hard case samples, because hard case samples tend to be weak, dispersed, or small in volume in the feature space and therefore rarely dominate the gradient during training. As a result, they do not receive the same optimization effort as the foreground samples, and attention mechanisms do not readily focus on them.

In HRRS, hard case samples in the background are often represented in the image space as light shadows, object occlusions, small targets, etc. Each of these cases can be considered some form of ‘camouflage’ of a simple sample. ‘Occlusion’ is the covering of a part of the target with a mask; for example, the tree in the red box in Figure 1 occludes the main body of the road. ‘Shadows’ are dark areas on the same side of a 3-D target, such as the shaded area around a high-rise building in the yellow box in Figure 1, while ‘small targets’ can be regarded as scaled-down versions of conventional targets, such as the discretely distributed remote sensing targets in the orange box in Figure 1. Based on this reasoning, the network can start learning from ‘simple samples’ and, once it has learned the simple samples well, gradually move on to background hard cases. Our experiments indicate that this staged optimization yields a better learning effect on hard case samples. This is in line with the core idea of curriculum learning[14, 15, 16], which lets the model learn relatively simple samples first so that it can form basic concepts and patterns and then gradually presents it with more challenging hard samples as the model matures. A large body of work has verified that this strategy is beneficial for learning complex concepts via neural networks.

Figure 1: Examples of hard case samples in remote sensing images. Target occlusion, shadows, and small targets are shown in red, yellow, and orange boxes, respectively.

After analysing the aforementioned issues, we propose a Siamese foreground association-driven hard case sample optimization network (HSONet) for remote sensing image change detection. Scene context can deepen the understanding of background hard cases; therefore, we propose the foreground-scene association module, which uses latent remote sensing spatial scene information to model the relationship between targets of interest in the foreground and the associated context. The module activates the output of hard case samples in the background, increasing the difference between the foreground and the background while reducing the difference between the foreground and the hard cases of interest, which alleviates missed detections and false alarms in CD. In addition, background samples are less likely to dominate the gradient during training, even though hard case samples are more valuable for model optimization in the later stages of training, and the number of background hard cases is much smaller than the number of simple samples. To address this problem, we propose an equilibrium optimization loss function (EO-loss) so that the network focuses on the foreground targets in the early training stage and on the background hard cases in the later training stage, achieving balanced optimization. We conducted experiments on four public change detection datasets, and the results demonstrate the efficiency and robustness of HSONet. The main contributions of this paper are as follows:

  1. To address the common issue of hard case samples in HRRS, we propose a Siamese foreground association-driven hard case sample optimization network for remote sensing image change detection. To address the ‘imbalance’ in optimizing hard case samples, we introduce an equilibrium optimization loss function. Hard case samples are identified by analysing the distribution of loss values, and dynamic weights are introduced in the loss term. This gradually shifts the focus of loss optimization from the foreground to background hard cases, so that hard case samples of interest are discovered during training.

  2. To better understand the semantics of hard case samples, we propose a foreground-scene association module. It uses latent remote sensing spatial scene information to model the association between foreground targets of interest and related context, obtaining scene embeddings. These are then applied to the feature discrimination of hard case samples. Moreover, this approach reduces the feature difference between the foreground and the samples of interest among the background hard cases, thereby suppressing issues such as missed detections in CD.

  3. Compared to 11 advanced baseline methods, HSONet achieves state-of-the-art results on four datasets, CDD[17], LEVIR-CD[18], Google-CD[19], and SYSU-CD[20], with F1 scores reaching 92.42%, 98.11%, 93.14%, and 82.84%, respectively. Compared with the latest method, USSFC-Net[21], HSONet yields improvements of 2.29%, 2.61%, 6.34%, and 1.77%, with an average increase of 3.26%.

The remainder of this paper is organized as follows. Section II reviews related work. Section III describes our method. Section IV describes the design of the validation experiments, followed by the analysis and discussion. Section V concludes the paper and proposes ideas and suggestions for future research.

II Related works

In recent years, deep learning technology has taken remote sensing image change detection to a new level, with a large number of DL-based CD algorithms and theoretical models being successively proposed. In the following text, we review two parts of the related work: CD methods based on two mainstream deep networks (CNNs and transformers) and the current status and issues of optimization methods for hard case samples.

II-A CD methods based on CNNs or Transformers

The introduction of convolutional neural networks (CNNs) has driven rapid developments in RS-CD research, with existing studies divided into two main stages: (1) the optimization stage of the network structure and (2) the application stage of attention mechanisms. In the network structure optimization stage, CNN-based methods have evolved towards depth and modularity, aiming to obtain more abstract feature representations and achieve plug-and-play functionality, which aids in understanding complex objects and scenes. For instance, Zhan et al.[22] first organized the CNN backbone within a Siamese framework to obtain feature information and subsequently determined changes by measuring the distance between features and applying threshold segmentation. In 2018, Daudt et al.[23] proposed three FCN-based CD methods (FC-EF, FC-Siam-diff, and FC-Siam-conc) and discussed the impact of network structures and feature fusion methods on CD tasks through comparisons. Numerous FCN-based CD networks have been proposed; for example, Guo et al.[24] proposed a fully convolutional metric network that learns implicit metric criteria to measure the similarity of feature mappings, thereby learning discrimination patterns for change and no change; Zheng et al.[25] introduced an end-to-end U-shaped FCN change detection network called CLNet, which combines features of different scales and multilevel context information to enhance CD effectiveness; and Peng et al.[26] proposed a difference-enhanced dense-attention convolutional neural network, which enhances the accuracy of change information retrieval by introducing dense attention and the DE unit. In the application stage of attention mechanisms, numerous works guide the network to emphasize the most relevant features through attention mechanisms, thus learning key discriminative information related to changes. 
For example, Chen et al.[11] introduced a dual-attention mechanism (spatial and channel attention) into a fully convolutional Siamese network, solving problems such as pseudochanges by enhancing the focus on interest changes.
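
The Siamese distance-and-threshold idea described above for Zhan et al.[22] can be illustrated with a minimal NumPy sketch. It is not the cited implementation: the toy encoder (a single shared 1×1 convolution with ReLU), the weight matrix, the Euclidean distance, and the threshold value are all assumptions chosen only to show the data flow of two branches sharing weights.

```python
import numpy as np

def extract_features(image, weights):
    """Toy shared-weight encoder (assumption): 1x1 convolution followed by ReLU.

    image:   (C_in, H, W) array
    weights: (C_out, C_in) array, reused by both Siamese branches
    """
    return np.maximum(np.einsum("oc,chw->ohw", weights, image), 0.0)

def change_map(image_t1, image_t2, weights, threshold):
    """Binary change map obtained by thresholding the per-pixel feature distance."""
    f1 = extract_features(image_t1, weights)  # branch 1
    f2 = extract_features(image_t2, weights)  # branch 2, same weights
    distance = np.sqrt(((f1 - f2) ** 2).sum(axis=0))  # (H, W) Euclidean distance
    return (distance > threshold).astype(np.uint8)

# Hypothetical fixed weights and synthetic bitemporal images.
weights = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.5, 0.5, 0.5]])
rng = np.random.default_rng(0)
t1 = rng.random((3, 16, 16))
t2 = t1.copy()
t2[:, 4:8, 4:8] += 1.0  # simulate a changed 4x4 patch at the second date

cm = change_map(t1, t2, weights, threshold=0.5)
print(cm.sum())  # number of pixels flagged as changed
```

Because both dates pass through the same encoder, unchanged pixels map to identical features and fall below any positive threshold; only the altered patch is flagged.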

To address the complexity and semantic uncertainty of remote sensing targets, recent years have seen a surge of work introducing context modelling to enhance the learning capability of neural networks, which has proven to be crucial for RS-CD tasks. Current context modelling approaches fall into four mainstream groups: multiscale feature fusion, deeply supervised network architectures, the application of dilated convolution, and the design of various attention mechanisms. Many works combine the advantages of these approaches, integrating them into a single network. For example, Jiang et al.[27] used various methods to merge low-level and coattention-level features and establish long-range contextual connections. Wang et al.[28] proposed a deeply supervised network based on self-attention that extracts multilevel image features during the encoding phase. To further highlight change features, they proposed an adaptive attention mechanism that combines spatial and channel features, capturing the relationships between change features of different scales. Using CNNs to obtain multiscale feature mappings of targets is significant for CD tasks, but fundamentally, CNNs lack the capability to model long-range dependencies. Fortunately, the advent of the transformer has changed this situation.

Dosovitskiy et al.[29] first applied a transformer to image classification tasks in 2020, achieving remarkable results; the transformer has since taken RS-CD to new heights as well. Chen et al.[30] were the first to apply transformers to CD tasks and proposed the bitemporal image transformer (BIT) network. Its encoder, embedded with a transformer, models the spatiotemporal context of abstract feature mappings obtained from CNNs. Then, tokens enriched with contextual information are fed back to the pixel space through the transformer decoder to locate changes of interest. BIT once again proved the superiority of the transformer in context modelling. Since contextual information is crucial for semantically understanding remote sensing imagery, many RS-CD networks based on ViT have been proposed. Currently, transformer-based methods can be divided into two main categories. The first comprises hybrid methods that combine transformers and CNNs. For example, Feng et al.[31] reported that the sequential use of a CNN and ViT hindered the interaction of depth and breadth features; consequently, they proposed a parallel method to realize the coupling and complementarity of local and global features. Li et al.[32] found the advantages of U-Net and ViT to be complementary and embedded ViT into a U-shaped network, overcoming the difficulty of feature layer relationship modelling and the inaccuracy of differential feature representation. Xu et al.[33] adopted progressive sampling ViT to eliminate irrelevant change interference, addressing pseudochanges and missed detections, and subsequently applied a fusion module to obtain complete edge information, ensuring the accuracy of the change information. 
The other category includes pure transformer-based CD methods, such as those used by Yan et al.[34], who introduced a pyramid structure to aggregate multilevel features of the transformer for solving irregular change area boundaries using a progressive attention module to enhance the representation level of interdependent features. Similarly, Zhang et al.[35] designed a dual U-shaped CD network using the Swin-T backbone, aiming to overcome the inherent limitations of convolutional operations completely and fully exploit the global modelling advantage of the transformer.

II-B Current research status of optimization methods for hard case samples

Hard case sample problems are widespread in machine learning and data science, typically referring to situations where certain samples are more difficult to classify, learn, and predict correctly during model training. Several aspects of the hard case sample problem in remote sensing imagery can be summarized as follows: 1) sample imbalance, 2) small targets, 3) shadow coverage and object occlusion, and 4) pseudochanges. Many studies have explored the above problems in computer vision tasks; for example, for the sample imbalance problem, the current mainstream approach is to optimize loss function design. Lin et al.[36] proposed the focal loss for object detection tasks, introducing an adjustable focusing parameter to reduce the weight of easy-to-classify samples and focus more on hard case samples, while weighted cross-entropy loss[37] adjusts the model’s focus on different categories by introducing different weights for each class, thus alleviating the sample imbalance problem. Zhao et al.[38] introduced Dice loss in medical image segmentation tasks, minimizing one minus the Dice coefficient (i.e., maximizing the overlap between prediction and label) to increase the sensitivity of the model to classes with fewer pixels and suppress the background weight; however, Dice loss in its basic form is defined for binary segmentation tasks. For issues such as small targets and pseudochanges, in addition to data augmentation operations, optimizing the model structure is also crucial. For instance, Lin et al.[39] proposed the FPN method, which generates feature maps of different resolutions at different levels of the network. High-level features contain stronger semantics, and low-level features contain more details, enhancing the perception of small targets. Fu et al.[40] introduced a dual-attention mechanism in semantic segmentation tasks, using spatial attention to capture key spatial information and channel attention to capture features more relevant to the task, thereby improving key information extraction. 
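
As a concrete reference for the loss functions mentioned above, the following is a minimal NumPy sketch of the commonly used binary focal loss and soft Dice loss. These are the standard textbook forms, not necessarily the exact variants used in the cited works; the default values of `gamma`, `alpha`, and `eps` are conventional choices and are assumptions here.

```python
import numpy as np

def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    prob:   predicted probability of the positive class, values in [0, 1]
    target: binary labels (0 or 1), same shape as prob
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)        # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def dice_loss(prob, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|X∩Y| / (|X| + |Y|)."""
    intersection = np.sum(prob * target)
    dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return float(1.0 - dice)

labels = np.array([1.0, 0.0, 1.0, 0.0])
easy = binary_focal_loss(np.array([0.9]), np.array([1]))  # confident and correct
hard = binary_focal_loss(np.array([0.1]), np.array([1]))  # confident and wrong
print(easy < hard)               # the focal term down-weights the easy sample
print(dice_loss(labels, labels)) # a perfect prediction gives a loss of (about) zero
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half of the ordinary binary cross-entropy, which makes the role of the focusing parameter easy to verify.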
Of course, research on CD has also actively explored optimization solutions for hard case samples. For example, MFCN[41] designs multiscale convolutional kernels to extract detailed surface features and combines the WBCE and Dice loss for balanced sample optimization, improving the model’s ability to discriminate minute features from two aspects. Zhu et al.[42] also focused on problems such as small target changes and edge pixel misclassification, proposing ECFNet and designing three processes—feature extraction, feature comparison, and feature fusion. In the fusion process, the number of channels is constrained to better utilize the fine-grained information in multiscale features for result prediction. Guo et al.[43] realized the importance of interest feature extraction and feature fusion for CD and proposed three modules, deep multiscale feature extraction, parallel convolutional feature fusion, and self-attention-based feature refinement, to integrate multiscale change information and further enhance the feature representation of interest targets. The current optimization methods for hard case samples mainly focus on model structure improvement, interaction of deep and shallow layer information, and optimization of attention mechanisms. However, the idea of this paper is to start from the learning rules of hard case samples and divide the feature learning process according to difficulty characteristics. Sample optimization can be achieved through two steps: hard case sample mining and balanced optimization of foreground and hard cases. In this process, background hard cases are enhanced using relevant scene information in the foreground, thus improving feature discriminability and accuracy. Therefore, we designed the foreground-scene association module and an EO-loss to learn hard case samples commonly found in the background.
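
The two steps named above (hard case sample mining followed by balanced optimization of foreground and hard cases) can be illustrated with a small NumPy sketch. This is not the EO-loss of this paper: the mining rule (background pixels whose loss exceeds the background mean by `k` standard deviations), the linear weight schedule driven by `progress`, and all function names are hypothetical choices made only to show how a loss-distribution threshold and a dynamic weight could interact.

```python
import numpy as np

def pixel_bce(prob, label, eps=1e-7):
    """Per-pixel binary cross-entropy."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))

def balanced_loss(prob, label, progress, k=1.0):
    """Illustrative foreground / hard-background balancing (assumed form).

    progress: fraction of training completed, in [0, 1]
    Returns (loss value, number of background pixels flagged as hard).
    """
    loss = pixel_bce(prob, label)
    fg = label == 1
    bg_loss = loss[~fg]

    # Step 1, mining: flag background pixels in the upper tail of the loss distribution.
    if bg_loss.size:
        hard = bg_loss > bg_loss.mean() + k * bg_loss.std()
    else:
        hard = np.zeros(0, dtype=bool)

    fg_term = loss[fg].mean() if fg.any() else 0.0
    hard_term = bg_loss[hard].mean() if hard.any() else 0.0

    # Step 2, balancing: shift weight from foreground to hard background over time.
    w = float(np.clip(progress, 0.0, 1.0))
    return float((1.0 - w) * fg_term + w * hard_term), int(hard.sum())

label = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
prob = np.array([0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7])  # last pixel: a confusing background pixel

early, n_hard = balanced_loss(prob, label, progress=0.0)
late, _ = balanced_loss(prob, label, progress=1.0)
print(n_hard, early, late)
```

In this toy input only the last background pixel lies in the upper tail of the background loss distribution, so the late-stage loss is driven entirely by that pixel while the early-stage loss depends only on the two foreground pixels.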

III Methodology

In this section, we provide a detailed introduction to the proposed CD method. Section A presents the overall framework of HSONet. Section B introduces the variant feature pyramid network encoder. Section C describes the foreground-scene association module. Section D covers the dual-temporal feature fusion and feature decoding structure. Section E introduces the EO-loss.

Figure 2: Overview of the HSONet.

III-A Network Architecture

As shown in Figure 2, the overall structure of HSONet is an end-to-end Siamese network. A pair of dual-temporal remote sensing images is taken as input, and a binary change map is output. The network consists of a variant feature pyramid network module, a foreground-scene association module (FS-relation), an EO-loss, and a feature fusion and lightweight decoder. The V-FPN module is responsible for multiscale target feature extraction and obtaining deep feature information. The FS-relation module is built on a collection of multiscale feature maps. Our idea is that change misses and false detections are due to numerous difficult-to-distinguish samples in the background, while the foreground lacks sufficient discriminative information for interest targets. Therefore, the FS-relation module is designed to improve the associative capacity of interest features, thereby enhancing the discrimination ability for hard case targets. Additionally, to better learn about hard case targets in the background and interest targets in the foreground, we propose an EO-loss for balanced optimization and hard case information learning of foreground-background samples. Finally, we design a multilevel feature fusion module and a lightweight decoder for fusing dual-temporal change information and restoring it to the original pixel space to obtain the final change map.

III-B Foreground-Scene Association module

III-B1 Variant feature pyramid network encoder

To obtain deep features of scene embedding and utilize scene embedding at multiple scales to enhance feature representation, we designed the variant feature pyramid network (V-FPN) for multiscale feature extraction and adaptation to multilayer scene information embedding. The V-FPN consists of a feature extraction branch and a scene information embedding branch. Both branches utilize a variant FPN network divided into multiscale feature layers. The FPN, originally proposed by Lin et al.[39] for object detection tasks, aims to capture strong semantic information in multiple feature layers. To further enhance the model’s multiscale feature extraction and fusion capabilities, we designed each layer of the FPN with ResNet as the backbone network for basic feature information extraction. As shown in Figure 2, given a pair of input images $\mathrm{T}^{m}, m=1,2$, the size of the feature layers in the feature extraction backbone network decreases with depth. This results in a multilevel feature layer set $\mathrm{F}_{i}^{m}, i=1,2,3,4; m=1,2$, with output strides relative to the input image of (4, 8, 16, 32) pixels. Building on the standard FPN structure, we use feature layers from deep to shallow and lateral feature connections to generate a pyramid feature mapping set $\mathrm{P}_{i}^{m}, i=1,2,3,4; m=1,2$ with the same channel dimension. This method fully links the spatial detail information of shallow features with the strong semantic information of deep features, aiding in the modelling of multiscale target context information. This process can be expressed by Equation 1.

$$\mathrm{P}_{i}^{m}=\tau\left(\mathrm{F}_{i}^{m}\right)+\sigma\left(\mathrm{P}_{i+1}^{m}\right),\quad i=1,2,3,4;\ m=1,2 \tag{1}$$

where $\tau$ represents a learnable $1\times1$ convolutional layer used for lateral feature connections and $\sigma$ represents a nearest-neighbour upsampling operation with a scale factor of 2.
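
The top-down pathway of Equation 1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the V-FPN implementation: the 1×1 convolution is written as a per-pixel linear map, the backbone channel widths, the pyramid width `d`, and the input size are hypothetical, and the topmost level is assumed to be $\mathrm{P}_{4}=\tau(\mathrm{F}_{4})$ because no $\mathrm{P}_{5}$ is available.

```python
import numpy as np

def lateral_conv(feature, weight):
    """tau: 1x1 convolution, i.e. a per-pixel linear map over channels.

    feature: (C_in, H, W), weight: (C_out, C_in)
    """
    return np.einsum("oc,chw->ohw", weight, feature)

def upsample2x(feature):
    """sigma: nearest-neighbour upsampling by a factor of 2 in height and width."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)

def top_down(features, weights):
    """features: [F_1, ..., F_4] from shallow to deep; returns [P_1, ..., P_4]."""
    pyramid = [None] * len(features)
    # Assumption: the deepest level has no coarser neighbour, so P_4 = tau(F_4).
    pyramid[-1] = lateral_conv(features[-1], weights[-1])
    for i in range(len(features) - 2, -1, -1):
        pyramid[i] = lateral_conv(features[i], weights[i]) + upsample2x(pyramid[i + 1])
    return pyramid

rng = np.random.default_rng(0)
size, d = 256, 8              # hypothetical input size and pyramid channel width
channels = (16, 32, 64, 128)  # hypothetical backbone channel widths
strides = (4, 8, 16, 32)      # output strides stated in the text

features = [rng.normal(size=(c, size // s, size // s)) for c, s in zip(channels, strides)]
weights = [rng.normal(size=(d, c)) for c in channels]

pyramid = top_down(features, weights)
shapes = [p.shape for p in pyramid]
print(shapes)
```

Each level keeps the spatial size of its backbone feature map while sharing the channel width `d`, which is what allows the lateral and upsampled terms to be added element-wise.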

In the final layer $\mathrm{F}_{4}$ of the pyramid feature set, we designed an important feature flow branch, namely, the foreground-scene information embedding branch. This structure is based on global context aggregation to obtain the spatial scene embedding vector SV, which models the dependency relationship between geographical scenes and targets of interest, including targets of interest in the foreground and hard case samples. The specific structure and principle of this process will be introduced in the next section. Additionally, in our experiments, we used two types of backbone networks for feature extraction: ResNet[44] and the pyramid vision transformer (PVT)[45]. The former, with its residual modules, has strong feature extraction capabilities; considering accuracy and efficiency, we chose ResNet50 as the first backbone network. The PVT can consider both low-level local features and high-level global features, and as a transformer it can learn long-range dependencies between features, giving it strong feature representation capabilities at different scales. In the experiments, we used PVTv2 as the second backbone network.

III-B2 Foreground-scene associated information embedding

HRRS is characterized by rich spectral features and complex scene information. Issues such as varying lighting conditions, seasonal changes, and shadows and occlusions caused by tall buildings commonly occur. This means that there is a significant difference in image spectral features and a large intraclass variance in the background, specifically manifested as a large number of hard case samples in the background information, leading to problems such as false positives, false negatives, and pseudochanges in CD. To address this, we propose a foreground-scene association module that leverages the scene context of remote sensing space to model the feature representation of foreground targets and enhance the feature discrimination ability of background samples. The main principle is shown in Figure 3. The FS-relation module first explicitly models the foreground targets and the geographic scene information and then uses implicit geographic scenes to associate the foreground with the relevant context. This relationship is subsequently used to enhance the input feature mapping, increasing the difference between foreground targets and background hard cases and reducing the likelihood of false alarms and missed detections.

Figure 3: Schematic diagram of the FS-relation module.

As depicted in Figure 4, the FS-relation module generates a new feature mapping set $\mathrm{R}_{i}^{m}$ based on the feature pyramid collection $\mathrm{P}_{i}^{m}$, $i=1,2,3,4$; $m=1,2$. The module first re-encodes each layer of $\mathrm{P}_{i}^{m}$ to form the preliminary feature mapping $\mathrm{Q}_{i}^{m}$. This mapping is then reweighted according to the correlation map $\mathrm{r}_{i}^{m}$, ultimately producing the relation-enhanced feature mapping set $\mathrm{R}_{i}^{m}$. The correlation map $\mathrm{r}_{i}^{m}$ is the similarity matrix between the geographical scene and the foreground representation.
To make $\mathrm{R}_{i}^{m}$ more discriminative, the model learns a feature mapping function that projects the feature input $\mathrm{P}_{i}^{m}$ into the same dimension as the scene vector (SV), facilitating feature interaction. Here, $Q_{i}^{m}\in R^{d\times H\times W}$ is obtained from the feature pyramid layer $P_{i}^{m}\in R^{C\times H\times W}$ through the scale projection function $v(\cdot)$, as expressed in Equations 2 and 3.

$$v(\cdot): R^{C\times H\times W}\rightarrow R^{d\times H\times W} \tag{2}$$
$$Q_{i}^{m}=v\left(P_{i}^{m}\right) \tag{3}$$

We implement $v(\cdot)$ efficiently as a 1×1 convolution layer followed by batch normalization and ReLU.

The similarity relation matrix set $r_{i}^{m}$ results from the interaction between the scene embedding information and the foreground pyramid features. To compute this set, the 1-D scene embedding $\mathrm{SV}\in\mathrm{R}^{d}$ interacts with the foreground feature projection $Q_{i}^{m}$. Here, SV is obtained by linearly projecting $P_{4}^{m}$, as represented by Equation 4.

$$SV=\varphi\left(P_{4}^{m}\right) \tag{4}$$

Here, $\varphi(\cdot)$ denotes a scene-focused projection function. For ease of computation, it is implemented as a learnable 1×1 convolution with the number of output channels set to $d$. For the same pair of input images, the remote sensing scene information should be identical, allowing every feature layer $Q_{i}^{m}$ to share SV. The similarity relation matrix set $r_{i}^{m}$ is then derived using Equation 5.

$$r_{i}^{m}=\delta\left(SV,Q_{i}^{m}\right)=SV\odot Q_{i}^{m} \tag{5}$$

where $\delta(\cdot)$ denotes the similarity measure function. To simplify the computation and improve efficiency, the similarity measure is realized by the vector inner product.

The details of the foreground-scene correlation process are illustrated in Figure 4. The final relation-enhanced feature mapping $R_{i}^{m}$ is computed according to Equation 6.

$$R_{i}^{m}=\frac{\varepsilon\left(P_{i}^{m}\right)}{1+\exp\left(-r_{i}^{m}\right)} \tag{6}$$

Here, $\varepsilon(\cdot)$ is a feature re-encoding structure, similar to the aforementioned $v(\cdot)$, that re-encodes the pyramid feature mapping set $P_{i}^{m}$. In this equation, the sigmoid of $r_{i}^{m}$, that is, $1/(1+\exp(-r_{i}^{m}))$, weights and highlights the re-encoded feature mapping $\varepsilon(P_{i}^{m})$.
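To make the data flow of Equations 3 to 6 concrete, the following NumPy sketch applies the FS-relation gating to a single pyramid level. It is an illustration under stated assumptions rather than the authors' implementation: batch normalization is omitted, the scene vector is formed by global average pooling of the deepest feature map before the linear projection (the pooling step is assumed), and the weight names `w_v`, `w_eps`, `w_phi` and all tensor sizes are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x has shape (C, H, W), weight has shape (d, C)."""
    return np.einsum("dc,chw->dhw", weight, x)

def fs_relation(p, p4, w_v, w_eps, w_phi):
    """Relation-enhance one pyramid level p (C, H, W) with scene features p4 (C4, H4, W4)."""
    q = np.maximum(conv1x1(p, w_v), 0.0)     # Q = v(P): 1x1 conv + ReLU (BN omitted here)
    e = np.maximum(conv1x1(p, w_eps), 0.0)   # eps(P): second re-encoding with the same structure
    pooled = p4.mean(axis=(1, 2))            # assumed global average pooling -> (C4,)
    sv = w_phi @ pooled                      # SV = phi(P4), shape (d,)
    r = np.einsum("d,dhw->hw", sv, q)        # Eq. 5: per-pixel inner product -> (H, W)
    gate = 1.0 / (1.0 + np.exp(-r))          # sigmoid of the correlation map
    return e * gate[None, :, :]              # Eq. 6: gated features, shape (d, H, W)

rng = np.random.default_rng(0)
p = rng.standard_normal((8, 4, 4))           # hypothetical pyramid level, C = 8
p4 = rng.standard_normal((16, 2, 2))         # hypothetical deepest level, C4 = 16
w_v = rng.standard_normal((6, 8))            # d = 6
w_eps = rng.standard_normal((6, 8))
w_phi = rng.standard_normal((6, 16))

out = fs_relation(p, p4, w_v, w_eps, w_phi)
print(out.shape)  # (6, 4, 4)
```

Because the gate is a sigmoid, every spatial position of the re-encoded features is scaled by a factor between 0 and 1 according to how strongly it agrees with the scene vector.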

Figure 4: Scene information embedding schematic.

III-C Lightweight decoder and dual-branch information fusion

Figure 5: Lightweight decoder architecture diagram

As shown in Figure 1, to capture multiscale change information, we fuse the relationship-enhanced feature mapping sets $R_{i}^{1}$ and $R_{i}^{2}$ of the dual branches; the multiscale feature map $C_{i}$ is derived via Equation 7.

$$C_{i}=\operatorname{abs}\left(R_{i}^{1}-R_{i}^{2}\right) \tag{7}$$

To restore the spatial resolution of the relationship-enhanced feature mapping set, we designed a lightweight decoder, the structure of which is shown in Figure 5. Given the relationship-enhanced feature mapping $R_{i}^{m}$ produced by the foreground-scene correlation module, the decoder computes the upscaled feature mapping set $S_{i}^{m}\in R^{c\times\alpha H\times\alpha W}$. The decoder consists of several upsampling modules, each containing two parts: a dimension transformation operation $B(\cdot)$ and a bilinear interpolation upsampling operation $U(\cdot)$. These are responsible for transforming multiscale feature channels and restoring feature dimensions, respectively, as represented by Equation 8.

$$\hat{S}_{i}^{m}=U_{N}\left(B_{N}\left(S_{i}^{m}\right)\right) \tag{8}$$

Here, $N$ represents the number of times the upsampling module is used, which is determined by the dimensions of the $R_{i}^{m}$ feature layer. The $B(\cdot)$ operation consists of a 3×3 convolution, BN, and ReLU, while $U(\cdot)$ represents bilinear interpolation upsampling with a scaling factor of 2; thus, the overall upsampling scale is $2^{N}$. Subsequently, the layers of the upscaled feature mapping set $\hat{S}_{i}^{m}$ are aggregated from high to low to obtain a semantically rich feature mapping $M^{m}$, $m=1,2$. This is achieved through a pointwise mean for feature fusion, followed by a 1×1 convolution for feature integration, as depicted by Equation 9.

$$M^{m}=\operatorname{Conv}_{1\times 1}\left(\frac{\sum_{i=1}^{4}\hat{S}_{i}^{m}}{\max(i)}\right),\quad m=1,2 \tag{9}$$

For ease of expression and clarity, we collectively refer to the upsampling and aggregation modules as $\operatorname{deo}(\cdot)$. The change feature $t$ is then obtained by fusing the dual-branch aggregated features $M^{m}$, as illustrated by Equation 10.

$$t=\operatorname{abs}\left(M^{1}-M^{2}\right) \tag{10}$$

Finally, to better represent the multiscale information in the final prediction map, a skip connection is established: the change feature $t$ is aggregated with the multiscale feature information $C_{i}$, as represented by Equation 11.

$$T=\operatorname{Concat}\left(t,\operatorname{deo}\left(C_{i}\right)\right) \tag{11}$$
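The fusion steps of Equations 9 to 11 can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: the features are assumed to be already upsampled to a common size, the 1×1 convolution weights `w_agg` are hypothetical and assumed to be shared by both branches, and the per-level absolute differences of the upsampled features serve only as a stand-in for the upsampled $C_{i}$.

```python
import numpy as np

def aggregate(levels, weight):
    """Eq. 9: pointwise mean over the levels, then a 1x1 convolution (weight: (c_out, c))."""
    mean = np.mean(np.stack(levels, axis=0), axis=0)   # (c, H, W)
    return np.einsum("oc,chw->ohw", weight, mean)

rng = np.random.default_rng(0)
c, h, w = 4, 8, 8
w_agg = rng.standard_normal((c, c))                        # hypothetical 1x1 conv weights
s1 = [rng.standard_normal((c, h, w)) for _ in range(4)]    # upsampled features, time 1
s2 = [rng.standard_normal((c, h, w)) for _ in range(4)]    # upsampled features, time 2

m1, m2 = aggregate(s1, w_agg), aggregate(s2, w_agg)
t = np.abs(m1 - m2)                                        # Eq. 10: change feature
c_levels = [np.abs(a - b) for a, b in zip(s1, s2)]         # stand-in for upsampled C_i
skip = aggregate(c_levels, w_agg)                          # aggregation part of deo(C_i)
T = np.concatenate([t, skip], axis=0)                      # Eq. 11: channel concatenation
print(T.shape)  # (8, 8, 8)
```

Concatenation doubles the channel count, so the subsequent 3×3 convolution sees both the late-fused change feature and the level-wise differences.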

This yields the semantically enriched change feature $T$. To derive the final predicted change map $T_{\text{change}}$, the high-level semantic feature $T$ must be mapped back into the original pixel space. To this end, we built a lightweight deconvolution decoder based on the fully convolutional network (FCN) framework. The architecture has three components: the feature fusion layer, the deconvolution layer, and the classification layer. These comprise, respectively, a 3×3 convolution layer with batch normalization (BN) and ReLU, a 4×4 deconvolution layer also equipped with BN and ReLU, and a deconvolution layer with BN only. Finally, pixel-level classification of change information is performed through the sigmoid function, as given in Equation 12.

$$T_{\text{change}}=\operatorname{Sigmoid}\left(\eta\left(\operatorname{Conv}_{3\times 3}(T)\right)\right) \tag{12}$$

Here, $\eta(\cdot)$ denotes a sequence of two consecutive deconvolution operations.

III-D Equilibrium optimization loss function

Remote sensing imagery backgrounds usually contain a certain number of hard case samples, and models often learn these samples poorly. One reason is that the number of hard case samples is relatively small, and only the portion related to areas of interest is meaningful for model optimization in the later stages of training. Another reason is that in the early stages of training, the model's judgement of hard case samples is uncertain, and learning directly from these samples is less effective. We therefore designed an equilibrium optimization loss (EO-loss) to address these issues. The training loss measures the distance between the target supervision and the predicted values, and the difference between predicted and true values is often greater for hard case samples than for simple samples. Thus, the distribution of hard case samples can be roughly estimated from the loss values: the more difficult a sample is to learn, the greater its loss value, and the easier it is to learn, the smaller its loss value. In the later training stages, the model's predictions for simple samples have already reached a good level, and only the more challenging samples in the background and foreground remain meaningful for optimization. As shown in Figure 6, the EO-loss consists of three steps: hard case sample assessment, dynamic weight optimization, and backpropagation.

To obtain weights that represent the difficulty level of each sample, and thereby adjust the loss distribution pixel by pixel to achieve balanced optimization, we draw on the focal loss[36] approach to optimizing hard samples. We use an exponential term with trainable parameters, $(1-p)^{\gamma}$, to predict the distribution of hard case samples, where $p$ represents the predicted probability and $\gamma$ the focus factor. For the pixel-level prediction task of distinguishing between foreground targets and background hard case samples, we aim to adjust the distribution of the loss without changing its total value, to avoid vanishing gradients. Therefore, we introduce a normalization parameter $Z$ to eliminate the impact of outliers and anomalies on the overall results, ensuring stable and efficient learning of hard case samples. The parameter $Z$ must satisfy Equation 13.

$$\sum_{i=0}^{H\times W}\operatorname{loss}\left(p_{i},y_{i}\right)=\frac{1}{Z}\sum_{i=0}^{H\times W}\left(1-p_{i}\right)^{\gamma}\operatorname{loss}\left(p_{i},y_{i}\right) \tag{13}$$

Here, $\operatorname{loss}(p_{i},y_{i})$ represents the binary cross-entropy loss value for the $i$-th pixel, which is calculated from the predicted probability $p_{i}$ and the actual value $y_{i}$. Therefore, the weight of each pixel's loss value is $\frac{1}{Z}(1-p)^{\gamma}$.
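The normalization in Equation 13 can be checked numerically with a short pure-Python sketch. It rests on two assumptions that are not stated in the text: the probability in the focus term is taken as the probability assigned to the true class, as in focal loss, and $\gamma$ is fixed at a hypothetical value of 2 instead of being trained. The helper names `bce` and `hard_case_weights` are illustrative.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one pixel, with clamping for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def hard_case_weights(probs, labels, gamma=2.0):
    """Per-pixel weights (1/Z)(1 - p_t)^gamma, with Z chosen so that Eq. 13 holds."""
    losses = [bce(p, y) for p, y in zip(probs, labels)]
    # Assumption: p_t is the probability the model assigns to the true class.
    focus = [(1.0 - (p if y == 1 else 1.0 - p)) ** gamma
             for p, y in zip(probs, labels)]
    z = sum(f * l for f, l in zip(focus, losses)) / sum(losses)
    return [f / z for f in focus], losses

probs = [0.9, 0.6, 0.2, 0.05]   # predicted change probabilities
labels = [1, 1, 1, 0]           # ground-truth change labels
weights, losses = hard_case_weights(probs, labels)

# The total loss is unchanged; only its distribution over pixels moves.
print(sum(losses), sum(w * l for w, l in zip(weights, losses)))
```

In this toy example the third pixel (a changed pixel predicted with probability 0.2) receives the largest weight, which matches the intent of concentrating the loss on hard cases while keeping its sum fixed.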

Figure 6: Equilibrium optimization loss function structure.

The learning effectiveness of hard case samples depends on the optimization method of the model. In the early stages of training, the model's confidence in sample category determination is not high, and dealing with too many hard case samples at this time may lead to an unstable learning process. To ensure stable and convergent model training, we propose a dynamic weighting strategy based on a decay function. This strategy focuses on learning from simple samples in the early training stages and strengthens the optimization of hard case samples in the later stages. Based on the binary cross-entropy loss $\operatorname{BCE}(\cdot)$, we can express the EO-loss as shown in Equation 14.

$$\operatorname{loss}=\left[\lambda(t)\left(1-\frac{1}{Z}\left(1-p_{i}\right)^{\gamma}\right)+\frac{1}{Z}\left(1-p_{i}\right)^{\gamma}\right]\operatorname{BCE}\left(p_{i},y_{i}\right) \tag{14}$$

Here, $\lambda(t)$ represents a weight function of the training progress $t$ that is used for weighting nonhard case samples; $\lambda(t)\in(0,1)$ decreases monotonically. Therefore, as training progresses and the model's confidence in predicting nonhard case samples increases, the focus of the loss value distribution gradually shifts to hard case samples. Because $\lambda(t)$ can be decreased in multiple ways, we design three representative and computationally convenient weight functions, with specific details shown in TABLE I. A linear decrease lets the model transition smoothly from simple to hard case samples; an exponential decrease is based on the idea that simple samples are learned very quickly, so it shifts the optimization focus towards hard case samples earlier and increases the time spent optimizing them; and a cosine decrease distributes the optimization weights evenly between simple and hard case samples, reducing the proportion of the transition process.

TABLE I: INFORMATION ON DIFFERENT WEIGHTING FUNCTIONS

| Function | Expression | Hyperparameter |
|---|---|---|
| Linear | $\lambda(t)=1-\frac{t}{\text{step}}$ | step |
| Exponential | $\lambda(t)=\left(1-\frac{t}{\text{step}}\right)^{\text{decay}}$ | step and decay |
| Cosine | $\lambda(t)=\frac{1}{2}\left(1+\cos\left(\frac{t\pi}{\text{step}}\right)\right)$ | step |
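The three weighting functions and the bracketed factor of Equation 14 can be sketched in plain Python as follows. Clamping $t$ at `step` (so that $\lambda$ stays at 0 afterwards) and the default `decay=2.0` are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def linear_decay(t, step):
    """Linear schedule: 1 at t = 0, 0 at t = step (clamped afterwards, by assumption)."""
    return max(0.0, 1.0 - t / step)

def exponential_decay(t, step, decay=2.0):
    """Power-law schedule from the table; decay=2.0 is a hypothetical default."""
    return max(0.0, 1.0 - t / step) ** decay

def cosine_decay(t, step):
    """Cosine schedule: 1 at t = 0, 0 at t = step."""
    return 0.5 * (1.0 + math.cos(math.pi * min(t, step) / step))

def eo_weight(lam, hard_weight):
    """Bracketed factor of Eq. 14: lam * (1 - w) + w, where w = (1/Z)(1 - p)^gamma."""
    return lam * (1.0 - hard_weight) + hard_weight

for t in (0, 25, 50):
    print(t, linear_decay(t, 50), exponential_decay(t, 50), cosine_decay(t, 50))
```

With $\lambda=1$ the factor equals 1 for every pixel, so the loss reduces to plain binary cross-entropy; with $\lambda=0$ it equals the hard case weight alone, so the optimization focus rests entirely on hard cases.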

IV Experiments and discussion

To evaluate the performance of HSONet, we conducted experiments on four public CD datasets, LEVIR-CD, Google-CD, CDD, and SYSU-CD, and compared it with the latest RS-CD methods. Additionally, we designed ablation studies and effectiveness experiments and discussed the experimental results.

IV-A Datasets

IV-A1 LEVIR-CD[17]

A large-scale building change detection dataset consisting of 637 pairs of Google Earth images, each with a size of 1024×1024 pixels and a resolution of 0.5 m, spanning from 2002 to 2018. It contains a variety of building change information, such as villas, large factories, high-rise apartments, and garages.

IV-A2 Google-CD[18]

This dataset records changes in various types of buildings, including large factories, villages, and warehouses, in the suburbs of Guangzhou, China, from 2006 to 2019. The 19 pairs of images are sourced from Google Earth, with spatial resolutions of 0.55 m and image sizes ranging from 1006×1168 to 4936×5224.

IV-A3 CDD[16]

This dataset contains 11 pairs of multisource remote sensing images collected in different seasons; 7 pairs have a size of 7425×2202 pixels, and 4 pairs have a size of 1900×1000 pixels, with spatial resolutions varying from 0.03 m to 1 m. The challenge of this dataset lies in accurately detecting changes in buildings, roads, farmland, and vehicles, regardless of the impact of seasonal changes.

IV-A4 SYSU-CD[19]

This dataset includes 20,000 pairs of aerial images collected in Hong Kong, China, from 2007 to 2014; all the images have a size of 256×256 pixels and a resolution of 0.5 m. The main types of changes in the data include urban buildings, road construction, marine construction, and vegetation changes.

IV-B Metrics

IV-B1 Evaluation criteria

To evaluate the performance of the model, we report seven evaluation metrics: precision (P), recall (R), F1 score (F1), intersection over union (IOU), mean intersection over union (mIOU), overall accuracy (OA), and kappa coefficient (Kappa). In RS-CD tasks, higher precision indicates that more of the predicted positive cases are correct, higher recall means that fewer true positive cases are missed, and larger F1, IOU, and mIOU indicate better CD performance. A higher kappa coefficient indicates better consistency between the predictions and the reference labels, while the OA is the overall proportion of pixels that are correctly classified. The calculation formulas for these metrics are as follows.

$$P=\frac{TP}{TP+FP} \tag{15}$$
$$R=\frac{TP}{TP+FN} \tag{16}$$
$$F1=\frac{2}{P^{-1}+R^{-1}} \tag{17}$$
$$IOU=\frac{TP}{TP+FN+FP} \tag{18}$$
$$OA=\frac{TP+TN}{TP+FN+FP+TN} \tag{19}$$
$$mIOU=\frac{1}{2}\left(\frac{TP}{TP+FN+FP}+\frac{TN}{TN+FN+FP}\right) \tag{20}$$
$$Kappa=\frac{OA-Pe}{1-Pe} \tag{21}$$
Here, TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative pixels, respectively, and $Pe$ is the expected agreement by chance.
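As a worked example, the metrics can be computed from the four confusion-matrix counts with a few lines of Python. The sketch uses the standard harmonic-mean definition of F1 and the standard Cohen's kappa with chance agreement computed from the row and column totals; the function name `cd_metrics` and the example counts are hypothetical.

```python
def cd_metrics(tp, fp, fn, tn):
    """Compute change detection metrics from pixel-level confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fn + fp)              # IoU of the change class
    iou_bg = tn / (tn + fn + fp)           # IoU of the no-change class
    oa = (tp + tn) / total
    # Expected chance agreement from the marginal totals of both classes.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return {"P": precision, "R": recall, "F1": f1, "IOU": iou,
            "mIOU": 0.5 * (iou + iou_bg), "OA": oa, "Kappa": kappa}

m = cd_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

For these counts the overall accuracy is 0.96 while the IOU of the change class is only about 0.67, which illustrates why OA alone is a weak indicator when changed pixels are rare.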

IV-B2 Implementation details

We conducted experiments using an NVIDIA 4080Ti graphics card and trained the model with a minibatch Adam optimizer in the PyTorch deep learning framework. The number of epochs was set to 150, the batch size to 16, the learning rate to $5\times10^{-4}$, the weight decay to $5\times10^{-4}$, the step size to 50, and the momentum to 0.9. For the CDD dataset, we cropped the original images into small images of 256×256 pixels and randomly applied simple operations such as rotation, flipping, and central cropping. We obtained 10,000/3,000/3,000 pairs of image patches for training, validation, and testing, respectively. For the LEVIR-CD dataset, we cropped the original images into nonoverlapping blocks of 256×256 pixels, similarly applying random rotation, flipping, and central cropping for data augmentation. We divided the images randomly into three parts: 8392/600/1200 pairs for training, validation, and testing. For the Google-CD dataset, we cropped the original images into blocks of 256×256 pixels and applied random rotation, flipping, and central cropping; we obtained 2661/157/312 pairs of images for training, validation, and testing. For the SYSU-CD dataset, we cropped the original images into small images of 256×256 pixels, similarly applying random rotation, flipping, and central cropping, and obtained 14,001/1,999/4,000 pairs of images for training, validation, and testing.
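As a quick consistency check on the nonoverlapping cropping described above, the following Python sketch enumerates the top-left corners of 256×256 tiles. Dropping partial tiles at the image border is an assumption, and `tile_origins` is a hypothetical helper name.

```python
def tile_origins(height, width, patch=256):
    """Top-left corners of nonoverlapping patch x patch tiles; partial border tiles are dropped."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]

origins = tile_origins(1024, 1024)
print(len(origins))         # 16 tiles per 1024x1024 image
print(637 * len(origins))   # 10192 tiles for the 637 LEVIR-CD pairs
```

The total of 10,192 tiles equals 8392 + 600 + 1200, the LEVIR-CD split reported above.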

IV-C Results and Discussion

IV-C1 Benchmark methods

In this subsection, we introduce the benchmark methods used for comparison. To demonstrate the superiority of HSONet, we selected representative methods from each research direction, including three pure convolution-based CD methods, FC-EF, FC-Siam-Di, and FC-Siam-Co; a method based on channel and spatial attention, DASNet; a method embedding ViT into the CNN backbone, BIT; a pure Swin-T based method, SwinSUNet; a method based on cross-temporal joint attention, DMINet; a method based on cross-interaction attention and multiscale feature fusion, ICIF-Net; a CD method for building edge feature enhancement, EGCTNet; a method based on high-frequency information enhancement and spatial attention, HFA-Net; and a 3-D attention method based on spatial-spectral feature synergy, USSFC-Net.

  1. (a)

    FC-EF[23]: An early feature fusion CD method. It directly concatenates dual-temporal images in the channel dimension before feeding them into the network.

  2. (b)

    FC-Siam-Co[23]: A weight-sharing dual-encoder Siamese network that acquires change information by fusing features at various levels.

  3. (c)

    FC-Siam-Di[23]: Compared with FC-Siam-Co, this method changes the feature map fusion method in the network from concatenation to absolute difference.

  4. (d)

    DASNet[11]: Based on VGG16 or ResNet50, this Siamese network introduces spatial and channel attention in the feature encoding phase to enhance the network’s focus on interest changes and resist pseudochange interference. The change map is obtained through a feature metric module.

  5. (e)

    BIT[30]: This method embeds a transformer module in the convolutional network, converting dual-temporal feature mappings into compact semantic tokens to facilitate modelling of the spatiotemporal context in feature space, thereby optimizing feature representation in image space.

  6. (f)

    SwinSUNet[35]: Based on Swin-T, this dual U-shaped CD network uses multiple pure transformer encoders to capture the spatiotemporal context of dual-temporal images. After feature fusion and linear mapping, the change information is decoded using multiple Swin-T blocks combined with skip connections.

  7. (g)

    DMINet[13]: Based on dual-branch ResNet18 for feature extraction, DMINet unifies self-attention and cross-attention in one module to guide the global feature distribution of each input. Two change information acquisition structures are also designed, namely, the fusion method based on subtraction and concatenation and a multilevel differential aggregation method based on incremental feature alignment.

  8. (h)

    ICIFNet[31]: This method designs a four-branch cross-interaction feature extraction structure with parallel CNN and ViT to promote the mutual penetration of local and global features. Mask-based aggregation and spatial alignment (SA) schemes are also introduced for scale integration, achieving information integration at different resolutions.

  9. (i)

    EGCTNet[46]: Focusing on the issue that local fine features and global information in CD cannot be obtained simultaneously, a fusion encoder is designed combining a CNN and transformers. Additionally, an edge detection branch is proposed that uses edge information to guide mask feature generation.

  10. (j)

    HFA-Net[47]: Addressing the issue of insufficient high-frequency information acquisition for clearly defined targets, a high-frequency attention-guided module is proposed. It consists of two main stages: first, spatial attention is used to search for and focus on buildings; second, high-frequency enhancement is used to highlight the high-frequency information of features, which better represents the edges of changed buildings.

  11. (k)

    USSFC-Net[21]: Simultaneously modelling spatial and spectral features, this method proposes an efficient collaborative network to generate three-dimensional attention for richer feature information. Additionally, a multiscale decoupled convolution is designed to flexibly capture the multiscale features of changing objects.

IV-C2 Performance comparison

Figure 7: Precision metric radar chart. The above figures show the accuracy metric radar charts on LEVIR-CD, Google-CD, CDD, and SYSU-CD.

To accurately assess the effectiveness of HSONet, we conducted quantitative experiments on four public CD datasets. TABLE II, TABLE III, TABLE IV, and TABLE V show the accuracy of HSONet on the test sets of LEVIR-CD, Google-CD, CDD, and SYSU-CD, respectively. The quantitative results indicate that our method outperforms the others on most metrics on the first three datasets and remains competitive on the SYSU-CD dataset. For instance, on the first three datasets, the F1 scores of HSONet exceed those of the recent method DMINet by 1.16%, 1.16%, and 1.72%, respectively. Overall, our method achieves the best or near-best results for most of the seven accuracy metrics on the four datasets, which supports the value of introducing an equilibrium optimization loss function and a novel CD network architecture with a foreground-scene association module. Additionally, Figures 8, 9, 10, and 11 show the visualization results of HSONet on each test set. To clearly observe the accuracy of each area classification, we used different colours to represent TP (blue), TN (light blue), FP (red), and FN (orange). To obtain a clearer view of the performance comparison between the various methods, we visualized the scores of the seven accuracy metrics using radar charts; as Figure 7 shows, our method performs well across the metrics on the four datasets, and its scores are well balanced, with no notably poor results.
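For readers who want to reproduce the seven reported metrics from pixel counts, the following is a minimal sketch. It assumes the standard confusion-matrix definitions (with "change" as the positive class, IOU taken on the change class, and mIOU averaged over the change and no-change classes); the function name `cd_metrics` is hypothetical, and the exact conventions used for the tables may differ.

```python
def cd_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Precision, Recall, F1, OA, IOU, mIOU and Kappa from pixel counts."""

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    total = tp + fp + fn + tn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    oa = ratio(tp + tn, total)                 # overall accuracy
    iou = ratio(tp, tp + fp + fn)              # IoU of the change class
    iou_bg = ratio(tn, tn + fp + fn)           # IoU of the no-change class
    miou = (iou + iou_bg) / 2
    # Expected agreement by chance, used by Cohen's kappa.
    pe = ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), total * total)
    kappa = ratio(oa - pe, 1 - pe)
    return {
        "precision": precision, "recall": recall, "f1": f1, "oa": oa,
        "iou": iou, "miou": miou, "kappa": kappa,
    }
```

Under these definitions, F1 and IOU of the change class are tied by IOU = F1 / (2 − F1), which gives a quick sanity check on any row of the tables: an F1 of 92.42% corresponds to an IOU of about 85.9%.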

IV-C2a Experimental Results on the LEVIR-CD Dataset

As Figure 8 shows, we selected typical samples from the LEVIR-CD test set for visual comparison, such as the shadow area in the middle of the ring-shaped building in (1), the large scene changes in (2), and the strip-like small target buildings in (3). For (1), only our method shows strong robustness to shadow areas and more reasonable segmentation results for the interior and boundaries of buildings. In (2), HSONet exhibits stronger resistance to noninteresting changes and fewer misses and false detections than do the other methods. In (3), our method not only accurately detects the complete change area of large buildings but also senses changes in small targets at the bottom right, which were not annotated in the GT, demonstrating HSONet’s strong perception of detailed information. In summary, our method achieves SOTA performance according to the qualitative results on LEVIR-CD, which is consistent with the quantitative results in TABLE II.

TABLE II: COMPARISON RESULT ON LEVIR-CD TEST SET
Method Year Precision Recall F1 OA mIOU IOU Kappa
FC-EF 2018 86.91 80.17 83.40 98.39 80.98 71.53 80.44
FC-Siam-Di 2018 89.53 83.31 86.31 98.67 83.87 75.92 84.28
FC-Siam-Co 2018 91.99 76.77 83.69 98.49 81.42 71.96 80.94
DASNet 2020 80.76 90.70 79.91 94.32 79.22 74.65 75.39
BIT 2021 89.24 89.37 89.31 98.92 89.02 80.68 88.35
SwinSUNet 2022 88.03 84.76 86.37 99.16 87.57 76.01 85.93
DMINet 2022 93.02 89.58 91.26 99.46 91.69 83.94 90.99
ICIFNet 2022 92.61 90.80 91.69 99.48 92.07 84.66 91.43
EGCTNet 2022 87.66 91.42 89.50 99.33 90.15 81.00 89.15
HFA-Net 2022 90.10 80.48 85.02 99.11 86.51 73.94 94.56
USSFC-Net 2023 91.06 88.72 89.81 99.37 90.48 81.61 89.55
Ours - 92.80 92.04 92.42 99.53 92.71 85.90 92.17
TABLE III: COMPARISON RESULT ON Google-CD TEST SET
Method Year Precision Recall F1 OA mIOU IOU Kappa
FC-EF 2018 80.81 64.39 71.67 85.85 67.83 55.85 66.40
FC-Siam-Di 2018 85.44 63.28 72.71 87.27 69.37 57.12 70.01
FC-Siam-Co 2018 82.07 64.73 72.38 84.56 68.39 56.71 69.05
DASNet 2020 69.92 68.83 77.46 94.59 79.22 74.65 75.39
BIT 2021 92.04 72.03 80.82 96.59 79.11 67.81 77.33
SwinSUNet 2022 90.77 75.92 82.68 96.59 83.39 70.48 80.81
DMINet 2022 93.06 90.92 91.98 98.30 91.63 85.15 91.03
ICIFNet 2022 92.37 91.02 91.69 98.23 91.35 84.66 90.70
EGCTNet 2022 90.34 86.23 88.24 97.53 88.12 78.95 86.86
HFA-Net 2022 81.97 77.91 79.89 95.80 80.97 66.52 77.55
USSFC-Net 2023 90.58 83.33 86.80 97.29 86.86 76.69 85.3
Ours - 95.13 91.25 93.14 98.56 92.79 87.17 92.34
TABLE IV: COMPARISON RESULT ON CDD TEST SET
Method Year Precision Recall F1 OA mIOU IOU Kappa
FC-EF 2018 65.60 55.01 57.65 93.58 65.32 52.20 52.42
FC-Siam-Di 2018 78.25 65.76 69.04 94.44 67.24 56.57 62.39
FC-Siam-Co 2018 74.51 73.87 71.16 94.92 69.48 60.11 66.44
DASNet 2020 90.00 90.50 91.00 99.10 90.28 83.01 90.12
BIT 2021 96.19 93.99 95.07 98.62 94.43 90.61 93.00
SwinSUNet 2022 95.70 92.30 94.00 98.50 93.36 91.74 93.28
DMINet 2022 98.40 94.66 96.49 99.15 96.13 93.23 96.01
ICIFNet 2022 98.81 94.35 96.53 99.16 96.17 93.29 96.05
EGCTNet 2022 94.32 86.74 90.37 97.72 89.94 82.43 89.08
HFA-Net 2022 92.23 62.55 74.54 94.73 76.85 59.42 71.72
USSFC-Net 2023 96.19 95.46 95.92 98.97 95.41 91.99 95.24
Ours - 98.32 98.12 98.21 99.56 98.00 96.49 98.33
TABLE V: COMPARISON RESULT ON SYSU-CD TEST SET
Method Year Precision Recall F1 OA mIOU IOU Kappa
FC-EF 2018 74.32 75.84 75.07 86.02 73.62 60.09 69.37
FC-Siam-Di 2018 89.13 61.21 72.57 82.11 68.77 56.96 67.01
FC-Siam-Co 2018 82.54 71.03 76.35 86.17 74.21 61.75 70.09
DASNet 2020 68.14 70.01 69.14 80.16 56.88 60.65 64.37
BIT 2021 82.18 74.49 78.15 90.18 75.29 64.13 70.93
SwinSUNet 2022 78.27 73.93 76.04 89.01 74.02 61.34 68.92
DMINet 2022 82.08 84.86 83.45 92.06 80.53 71.60 78.23
ICIFNet 2022 78.64 77.75 78.20 89.77 75.84 64.21 71.51
EGCTNet 2022 75.92 79.81 77.82 89.27 75.24 63.69 70.75
HFA-Net 2022 80.93 70.91 75.60 89.20 73.90 60.76 68.70
USSFC-Net 2023 79.84 82.34 81.07 90.93 78.45 68.16 75.11
Ours - 87.21 78.88 82.84 92.29 80.62 70.70 78.28
IV-C2b Experimental Results on the Google-CD Dataset

Similarly, we selected representative prediction samples from Google-CD for visual comparison. As shown in Figure 9, large scene changes occur in (1), (2), and (3). Specifically, (1) involves interference from noninteresting changes such as roads and trees, and the challenge in (2) is to reduce misclassifications around buildings. Despite the limited number of samples in this dataset, DMINet and our method have significant advantages in detection accuracy and completeness over the other methods, consistent with the quantitative results in TABLE III. In addition, only HSONet simultaneously achieves high recognition rates and low false alarm rates for all three samples, further demonstrating the superior performance of our method on Google-CD images.

Figure 8: Visualization results of several CD methods on the LEVIR-CD test set. (a)T1, (b)T2, (c)DMINet, (d)EGCTNet, (e)ICIFNet, (f)USSFC-Net, (g)GT, and (h)Ours, where blue, red, orange, and light blue denote TP, FP, FN and TN, respectively.
Figure 9: Visualization results of several CD methods on the Google-CD test set. (a)T1, (b)T2, (c)DMINet, (d)EGCTNet, (e)ICIFNet, (f)USSFC-Net, (g)GT, and (h)Ours, where blue, red, orange, and light blue denote TP, FP, FN and TN, respectively.
Figure 10: Visualization results of several CD methods on the CDD test set. (a)T1, (b)T2, (c)DMINet, (d)EGCTNet, (e)ICIFNet, (f)USSFC-Net, (g)GT, and (h)Ours, where blue, red, orange, and light blue denote TP, FP, FN and TN, respectively.
Figure 11: Visualization results of several CD methods on the SYSU-CD test set. (a)T1, (b)T2, (c)DMINet, (d)EGCTNet, (e)ICIFNet, (f)USSFC-Net, (g)GT, and (h)Ours, where blue, red, orange, and light blue denote TP, FP, FN and TN, respectively.
IV-C2c Experimental Results on the CDD Dataset

As Figure 10 shows, we selected typical samples from the CDD test set for visual comparison. These include (1) numerous small target changes, (2) tree occlusion that hampers the discrimination of feature information, and (3) significant seasonal changes, which can interfere with the change detection of rural roads. As observed in TABLE IV, HSONet achieves the highest scores on most metrics, and its visual results contain the fewest FPs and FNs among the compared methods. This indicates that our method can accurately detect changes of interest under seasonal interference, perceive small target changes, rarely miss or falsely detect changes, and is less affected by light, colour, and weather.

IV-C2d Experimental Results on the SYSU-CD Dataset

While the first three datasets mainly focus on changes in buildings, to verify our method's detection performance for different types of target changes, we selected typical samples from the SYSU-CD test set for visual comparison. As shown in Figure 11, HSONet yields fewer misjudgments and higher recognition accuracy when detecting changes in ambiguous areas such as roads, bare land, and residential areas. It is also robust to noninteresting changes such as forests and rural roads. This suggests that, compared to other methods, our method has a general ability to discriminate changes of interest and can more accurately detect changes in various categories of land objects, consistent with the quantitative results in TABLE V.

IV-C3 Learning curve comparison

To evaluate the training behaviour of HSONet, we compared the F1 score variation over epochs for USSFC-Net, EGCTNet, and HSONet on four datasets. Figure 12 shows the line graphs for LEVIR-CD, Google-CD, CDD, and SYSU-CD from top to bottom and left to right. The graphs show that our model has higher accuracy, faster convergence, and a more stable training process than the other two methods, with very few large fluctuations in accuracy during training. This indicates that HSONet is more robust and stable than the other models. Additionally, the model converges after reaching a certain point, and training for more epochs does not significantly enhance the CD capability. Therefore, in our experiments, we set the epochs according to the dataset: 60 epochs for LEVIR-CD and 100 epochs for Google-CD, CDD, and SYSU-CD.

Figure 12: Comparison of F1 scores with USSFC-Net and EGCTNet on the LEVIR-CD, Google-CD, CDD, and SYSU-CD validation sets.

IV-D Ablation Study

TABLE VI: COMPARISON OF RESULTS WITH DIFFERENT LOSS FUNCTIONS
Loss function FS Backbone LEVIR-CD (F1 IOU) Google-CD (F1 IOU) SYSU-CD (F1 IOU)
BCE - ResNet 91.74 84.74 91.40 84.18 76.65 62.14
BCE ✓ PVT 92.24 85.59 92.01 85.38 82.72 70.58
BCE ✓ ResNet 92.12 85.39 91.90 85.01 81.42 68.66
SF-loss - ResNet 91.77 84.80 90.87 83.27 73.39 57.96
SF-loss ✓ PVT 92.31 85.73 92.16 85.46 82.53 70.25
SF-loss ✓ ResNet 91.86 84.89 91.20 83.82 82.14 69.69
EO-loss - ResNet 91.75 84.76 91.40 84.17 74.73 59.65
EO-loss ✓ PVT 92.42 85.90 93.14 87.17 82.84 70.70
EO-loss ✓ ResNet 91.87 84.97 91.69 84.65 81.96 69.44

IV-D1 Comparison of different loss function effects

To explore the contribution of EO-loss to model performance, we compared it with the BCE loss and the SF-loss. As shown in TABLE VI, in the control experiments using the PVT, the F1 scores of the EO-loss on the three datasets are 0.18%, 1.13%, and 0.12% greater than those of the BCE, and 0.11%, 0.98%, and 0.31% greater than those of the SF-loss. These results indicate that EO-loss has a consistent advantage. Because hard case samples constitute only a small proportion of the data, the improvement is modest; nevertheless, the consistent gains support the effectiveness and rationality of the method.

IV-D2 Comparison of the FS-relation effect

To explore the contribution of the FS-relation module to the performance of HSONet, we conducted ablation experiments on the FS-relation module on three datasets. As shown in TABLE VI, when ResNet was used as the backbone network, the three sets of experiments introducing the FS-relation module were comprehensively superior to those without it. For example, when the loss function is BCE, the F1 scores of the former are 0.38%, 0.50%, and 4.77% greater than those of the latter on three datasets, and similar results are obtained for the SF-loss and EO-loss experiment sets. This finding suggests that enhancing the feature representation of hard case targets by modelling the association of foreground targets of interest with related context in latent remote sensing spatial scene information is a feasible feature modelling strategy.

IV-D3 Comparison of different backbone effects

To verify the role of the multiscale backbone in the network, we compared the experimental results of PVT and ResNet on three datasets. As shown in TABLE VI, in all six sets of comparative experiments, PVT achieves more competitive accuracy results than ResNet. For instance, when the loss function is EO-loss, the F1 scores of the PVT group are 0.55%, 1.45%, and 0.88% greater than those of the ResNet group. This indicates that the PVT has a stronger feature extraction capability than ResNet and is more suitable for CD tasks. This might be due to ViT's superior global modelling capability and the multiscale information of the pyramid feature layer, which allow the FS-relation module to obtain more representative scene embedding vectors (SVs), thereby enhancing the feature representation of background hard case targets.

IV-E Parameter Verification Experiment

The proposed network contains several important parameters and functions. To explore their effect on model performance, we conducted parameter validation experiments on three datasets.

IV-E1 Effect of the dynamic weighting function

TABLE VII: COMPARISON OF RESULTS FOR WEIGHTING FUNCTIONS
LEVIR-CD Google-CD CDD
  Backbone   Norm   Linear   Exponential   Cosine F1 IOU F1 IOU F1 IOU
    PVT 92.08 85.32 91.90 85.01 97.20 94.61
92.31 85.72 92.15 85.45 97.38 94.88
92.23 85.59 92.70 86.39 97.74 95.57
92.07 85.29 90.10 81.99 97.42 94.91
92.42 85.90 92.56 86.15 97.56 95.24
    ResNet 91.26 83.92 91.11 83.68 97.14 94.45
91.35 84.04 91.21 83.87 97.48 95.09
91.85 84.92 91.23 83.86 98.11 96.28
91.62 84.54 92.06 85.28 97.36 94.87
91.87 84.97 91.69 84.65 97.85 95.78

Optimizing hard case samples in the later stages of model training is crucial for enhancing model performance. At this stage, background hard case samples should become the focus of model optimization. Therefore, we introduce dynamic weight functions to shift the learning focus of the model from the foreground to the background. We designed three types of dynamic weight functions and conducted experiments on three datasets. As shown in TABLE VII, in the experiments using ResNet as the backbone, the maximum F1 score improvement achieved by using dynamic weight functions compared to not using them was 0.52%, 0.85%, and 0.63%, respectively; similar results were obtained in experiments using PVT as the backbone, with increases of 0.34%, 0.80%, and 0.54%, respectively. This indicates that the method based on dynamic weighting reduces early training errors in predicting hard case samples, thereby enhancing the CD performance of the model; moreover, focusing on optimizing background hard case samples in the later stages of training is both reasonable and effective. According to both sets of experiments, although all three dynamic weight functions achieved some accuracy improvements, they had no absolute advantage. This might be due to the varying numbers and distributions of hard case samples in different datasets; hence, different dynamic weight functions are suitable for different datasets. We can choose the appropriate dynamic weight function based on data characteristics. For example, cosine dynamic weighting is more suitable for datasets with a clear gap between simple and hard case samples, allowing for stable adjustment of the loss distribution to achieve healthy convergence; exponential dynamic weighting is more suitable for situations where there are fewer foreground samples and more background samples.
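To make the idea of shifting the optimization focus concrete, the sketch below shows one plausible form for linear, exponential, and cosine dynamic weights. The specific formulas, the sharpness parameter `k`, the function names, and the way the weight mixes a foreground loss with a background hard case loss are all assumptions for illustration; this section does not give the exact definitions used in EO-loss.

```python
import math


def linear_weight(t: float) -> float:
    """Weight grows at a constant rate with training progress t in [0, 1]."""
    return t


def exponential_weight(t: float, k: float = 5.0) -> float:
    """Weight stays small early and rises sharply late; k is a hypothetical sharpness."""
    return math.expm1(k * t) / math.expm1(k)


def cosine_weight(t: float) -> float:
    """Smooth S-shaped ramp: slow at both ends, fastest in mid-training."""
    return 0.5 * (1.0 - math.cos(math.pi * t))


SCHEDULES = {
    "linear": linear_weight,
    "exponential": exponential_weight,
    "cosine": cosine_weight,
}


def dynamic_weight(epoch: int, total_epochs: int, kind: str = "cosine") -> float:
    """Map an epoch index to a background weight in [0, 1]."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return SCHEDULES[kind](t)


def mixed_loss(fg_loss: float, bg_hard_loss: float, w: float) -> float:
    """Blend the two loss terms; w = 0 is all foreground, w = 1 all background."""
    return (1.0 - w) * fg_loss + w * bg_hard_loss
```

All three schedules start at 0 and end at 1, so they differ only in when the shift happens. Under these assumed forms, the exponential schedule keeps the focus on the foreground for most of training, whereas the cosine schedule changes the balance gradually around the midpoint.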

IV-E2 Effect of the focusing factor

TABLE VIII: COMPARISON OF DIFFERENT FOCUSING FACTOR RESULTS
Dataset $\gamma$ 0.0 0.3 0.5 1.0 2.0 4.0 6.0
LEVIR-CD F1 92.23 92.25 92.27 92.30 92.42 92.27 92.17
IOU 85.67 85.66 85.66 85.70 85.90 85.65 85.49
Google-CD F1 91.94 92.45 92.47 93.14 92.70 92.45 90.92
IOU 85.09 85.97 85.99 87.17 86.39 85.95 83.36

Introducing the hard case awareness term $(1-p)^{\gamma}$ leverages the distribution of loss values to approximate the regions of hard case samples and then adjusts the attention weight distribution between hard case samples and foreground samples using the focusing factor $\gamma$. Generally, the larger $\gamma$ is, the greater the relative weight on hard case samples and the higher the degree of attention they receive. We conducted comparative experiments with progressively increasing $\gamma$ on two datasets and observed similar experimental effects. As shown in TABLE VIII, starting from $\gamma=0.0$ and increasing $\gamma$, the CD accuracy generally improved up to an optimum. On the LEVIR-CD and Google-CD datasets, when $\gamma$ was set to 2.0 and 1.0, respectively, the F1 scores were 0.19% and 1.20% higher than when $\gamma$ was 0.0, reaching absolute values of 92.42% and 93.14%, respectively. However, as $\gamma$ continued to increase, the performance of the model began to decline. This may be due to two reasons: first, image noise also received a large weight under a large $\gamma$, leading to its misclassification as hard case samples; second, an excessively large $\gamma$ caused the model to neglect foreground targets, thus reducing its discriminative ability for foreground targets of interest. From this, we can conclude that an appropriate focusing factor $\gamma$ can enhance the model's CD performance, but different data characteristics often require different optimal focusing factors.
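The effect of the focusing factor can be illustrated with a small numerical sketch. It assumes that $p$ in $(1-p)^{\gamma}$ is the predicted probability of the true class and that the term multiplies a per-pixel binary cross-entropy, as in focal-loss style weighting; the function names are hypothetical, and the full EO-loss also contains the dynamic weights, which are omitted here.

```python
import math


def hard_case_weight(p_true: float, gamma: float) -> float:
    """Hard case awareness term (1 - p)^gamma for the true-class probability p."""
    return (1.0 - p_true) ** gamma


def weighted_bce(probs, labels, gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with each pixel scaled by (1 - p)^gamma.

    probs  : predicted probabilities of the 'change' class
    labels : 0/1 ground-truth labels
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_true = p if y == 1 else 1.0 - p        # probability of the true class
        total += -hard_case_weight(p_true, gamma) * math.log(p_true)
    return total / len(probs)
```

With $\gamma=0$ the weight is 1 for every pixel and the expression reduces to plain BCE. With $\gamma=2$, a confidently correct pixel ($p=0.9$) is scaled by about 0.01 and a hard pixel ($p=0.1$) by about 0.81, so hard pixels dominate the gradient. This also shows why a very large $\gamma$ can hurt: well-classified foreground pixels contribute almost nothing, while noisy pixels with low $p$ are weighted just like genuine hard cases.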

IV-F Analysis of effectiveness

IV-F1 T-SNE and attention heatmap visualization effect

To demonstrate how our method focuses on various remote sensing targets, we visualized the attention heatmap of the last layer of HSONet and overlaid it on the T2 image to obtain the images shown in panel (f) of Figures 13 and 14. In (f), the attention distribution of the changing targets can be clearly observed, where red denotes higher attention values and blue denotes lower values. As (f) in Figures 13 and 14 shows, areas with high heat values completely cover the areas of the changes of interest. The house change areas in LEVIR-CD and Google-CD both have high heat values, indicating that our method learned the feature representation of targets of interest and is not sensitive to noninteresting changes. Notably, HSONet demonstrates strong robustness to building shadows, as shown in (2) and (3) of Figure 13. The shadows around buildings do not affect our method's ability to accurately detect real change areas. Moreover, shadows around buildings may even be a useful cue for change recognition, because our model can implicitly learn certain additional feature concepts that facilitate the identification of change areas.

Figure 13: t-SNE visual comparison on LEVIR-CD:(a)T1, (b)T2, (c)GT, (d)ICIF-Net, (e)Our HSONet, (f)HSONet heatmap, (g)ICIF-Net t-SNE, and (h)HSONet t-SNE.
Figure 14: t-SNE visual comparison on Google-CD:(a) T1, (b)T2, (c)GT, (d)ICIF-Net, (e)Our HSONet, (f)HSONet heatmap, (g)ICIF-Net t-SNE, and (h)HSONet t-SNE.

In addition, to more intuitively reflect the degree of separation between the changed and unchanged samples, we used t-distributed stochastic neighbour embedding (t-SNE)[48] to reduce the dimensionality of the sample features for visualization and compared the feature distribution of our method with that of ICIF-Net. In (g) and (h) of Figures 13 and 14, red represents changed samples, and blue represents unchanged samples. Notably, in the dimensionality reduction results of HSONet, the separation boundary between the red and blue samples is clearer than in the ICIF-Net results, with both sample categories being more distinctly clustered and with few instances of red and blue intermingling. This indicates that our method has more accurate change perception and anti-interference capabilities, which is why HSONet's change confidence is greater and its change regions have clear boundaries. In contrast, in the ICIF-Net dimensionality reduction results, a certain number of red samples appear among the blue samples, which indicates that some of the changed and unchanged samples are indistinguishable in the feature space. As a result, some low-confidence samples at the edge of the changed area are misclassified by the network, which produces blurred boundaries around the changed area.

IV-F2 Attention shift heatmap visualization

To validate the attention shift pattern of the model during the early and late training stages, we visualized the attention focus distribution on sample data at different epochs. The visualized epochs are equally spaced, allowing the observation of the pattern of shifting optimization focus. Figures 15, 16, and 17 show the heatmaps of a particular data sample in LEVIR-CD, Google-CD, and CDD, respectively, visualized using Grad-CAM[49]. All three figures exhibit the same pattern: in the early stages of training, the optimization focus is concentrated on targets of interest in the foreground, which in this case are buildings. By the middle stages of training, the model gradually shifts towards learning about background information, such as the areas around roads or less obvious buildings. In the later stages of training, the model's attention increasingly turns to learning about background features, during which foreground targets are neglected to varying degrees.

Figure 15: Attention shift heatmap on the LEVIR-CD validation set.
Figure 16: Attention shift heatmap on the Google-CD validation set.
Figure 17: Attention shift heatmap on the CDD validation set.

V Conclusion and future work

In response to the challenge of obtaining hard case samples from remote sensing imagery, this paper presents a Siamese foreground association-driven hard case sample optimization network, named HSONet, which includes a foreground-scene association module and an EO-loss. In the equilibrium optimization loss function, the distribution of hard case samples is estimated through the hard case awareness term, and the optimization focus of the network is shifted dynamically to balance the learning process for both foreground targets and background hard case samples. In the foreground-scene association module, the associations between interest targets in the foreground and related context in latent remote sensing spatial scene information are modelled, and background hard case samples are actively output to ensure the successful extraction of interest changes in both foreground and background hard case samples. This strategy effectively captures and perceives changing features, showing advantages in detecting changes in hard case samples. Experimental results on four change detection datasets show that our method obtains more accurate CD results than other methods, is less prone to problems such as omissions and misdetections, and is robust against hard case samples.

Additionally, our method is versatile. Although it is applied only to CD tasks here, the proposed strategy can be readily transferred to other application scenarios. Of course, supervised learning algorithms are limited by data availability. Therefore, in the future, we plan to explore optimization methods further for hard case samples in semisupervised[50] or self-supervised learning[51, 52] contexts.

References

  • [1] Işin Onur, Derya Maktav, Mustafa Sari, and N Kemal Sönmez. Change detection of land cover and land use using remote sensing and gis: a case study in kemer, turkey. International Journal of Remote Sensing, 30(7):1749–1757, 2009.
  • [2] Tao Lei, Yuxiao Zhang, Zhiyong Lv, Shuying Li, Shigang Liu, and Asoke K Nandi. Landslide inventory mapping from bitemporal images using deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters, 16(6):982–986, 2019.
  • [3] Michelle Cristina Araujo Picoli, Gilberto Camara, Ieda Sanches, Rolf Simões, Alexandre Carvalho, Adeline Maciel, Alexandre Coutinho, Julio Esquerdo, João Antunes, Rodrigo Anzolin Begotti, et al. Big earth observation time series analysis for monitoring brazilian agriculture. ISPRS journal of photogrammetry and remote sensing, 145:328–339, 2018.
  • [4] Begüm Demir, Francesca Bovolo, and Lorenzo Bruzzone. Updating land-cover maps by classification of image time series: A novel change-detection-driven transfer learning approach. IEEE Transactions on Geoscience and Remote Sensing, 51(1):300–312, 2012.
  • [5] Shunping Ji, Shiqing Wei, and Meng Lu. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on geoscience and remote sensing, 57(1):574–586, 2018.
  • [6] ZhiYong Lv, Tongfei Liu, Jon Atli Benediktsson, and Nicola Falco. Land cover change detection techniques: Very-high-resolution optical images: A review. IEEE Geoscience and Remote Sensing Magazine, 10(1):44–63, 2021.
  • [7] Anju Asokan and JJESI Anitha. Change detection techniques for remote sensing applications: A survey. Earth Science Informatics, 12:143–160, 2019.
  • [8] Hanhong Zheng, Maoguo Gong, Tongfei Liu, Fenlong Jiang, Tao Zhan, Di Lu, and Mingyang Zhang. Hfa-net: High frequency attention siamese network for building change detection in vhr remote sensing images. Pattern Recognition, 129:108717, 2022.
  • [9] Xintao Xu, Zhe Yang, and Jinjiang Li. Amca: Attention-guided multi-scale context aggregation network for remote sensing image change detection. IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [10] Sheng Fang, Kaiyu Li, Jinyuan Shao, and Zhe Li. Snunet-cd: A densely connected siamese network for change detection of vhr images. IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2021.
  • [11] Jie Chen, Ziyang Yuan, Jian Peng, Li Chen, Haozhe Huang, Jiawei Zhu, Yu Liu, and Haifeng Li. Dasnet: Dual attentive fully convolutional siamese networks for change detection in high-resolution satellite images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:1194–1206, 2020.
  • [12] Pan Chen, Bing Zhang, Danfeng Hong, Zhengchao Chen, Xuan Yang, and Baipeng Li. Fccdn: Feature constraint network for vhr image change detection. ISPRS Journal of Photogrammetry and Remote Sensing, 187:101–119, 2022.
  • [13] Yuchao Feng, Jiawei Jiang, Honghui Xu, and Jianwei Zheng. Change detection on remote sensing images using dual-branch multilevel intertemporal network. IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023.
  • [14] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  • [15] Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021.
  • [16] Jian Peng, Dingqi Ye, Bo Tang, Yinjie Lei, Yu Liu, and Haifeng Li. Lifelong learning with cycle memory networks. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [17] MA Lebedev, Yu V Vizilter, OV Vygolov, Vladimir A Knyaz, and A Yu Rubis. Change detection in remote sensing images using conditional adversarial networks. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 42:565–571, 2018.
  • [18] Hao Chen and Zhenwei Shi. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing, 12(10):1662, 2020.
  • [19] Mengxi Liu, Qian Shi, Andrea Marinoni, Da He, Xiaoping Liu, and Liangpei Zhang. Super-resolution-based change detection network with stacked attention module for images with different resolutions. IEEE Transactions on Geoscience and Remote Sensing, 60:1–18, 2021.
  • [20] Qian Shi, Mengxi Liu, Shengchen Li, Xiaoping Liu, Fei Wang, and Liangpei Zhang. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE transactions on geoscience and remote sensing, 60:1–16, 2021.
  • [21] Tao Lei, Xinzhe Geng, Hailong Ning, Zhiyong Lv, Maoguo Gong, Yaochu Jin, and Asoke K Nandi. Ultralightweight spatial–spectral feature cooperation network for change detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 61:1–14, 2023.
  • [22] Yang Zhan, Kun Fu, Menglong Yan, Xian Sun, Hongqi Wang, and Xiaosong Qiu. Change detection based on deep siamese convolutional network for optical aerial images. IEEE Geoscience and Remote Sensing Letters, 14(10):1845–1849, 2017.
  • [23] Rodrigo Caye Daudt, Bertrand Le Saux, and Alexandre Boulch. Fully convolutional siamese networks for change detection. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 4063–4067. IEEE, 2018.
  • [24] Enqiang Guo, Xinsha Fu, Jiawei Zhu, Min Deng, Yu Liu, Qing Zhu, and Haifeng Li. Learning to measure change: Fully convolutional siamese metric networks for scene change detection. arXiv preprint arXiv:1810.09111, 2018.
  • [25] Zhi Zheng, Yi Wan, Yongjun Zhang, Sizhe Xiang, Daifeng Peng, and Bin Zhang. Clnet: Cross-layer convolutional neural network for change detection in optical remote sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 175:247–267, 2021.
  • [26] Xueli Peng, Ruofei Zhong, Zhen Li, and Qingyang Li. Optical remote sensing image change detection based on attention mechanism and image difference. IEEE Transactions on Geoscience and Remote Sensing, 59(9):7296–7307, 2020.
  • [27] Huiwei Jiang, Xiangyun Hu, Kun Li, Jinming Zhang, Jinqi Gong, and Mi Zhang. Pga-siamnet: Pyramid feature-based attention-guided siamese network for remote sensing orthoimagery building change detection. Remote Sensing, 12(3):484, 2020.
  • [28] Decheng Wang, Xiangning Chen, Mingyong Jiang, Shuhan Du, Bijie Xu, and Junda Wang. Ads-net: An attention-based deeply supervised network for remote sensing image change detection. International Journal of Applied Earth Observation and Geoinformation, 101:102348, 2021.
  • [29] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [30] Hao Chen, Zipeng Qi, and Zhenwei Shi. Remote sensing image change detection with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021.
  • [31] Yuchao Feng, Honghui Xu, Jiawei Jiang, Hao Liu, and Jianwei Zheng. Icif-net: Intra-scale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–13, 2022.
  • [32] Qingyang Li, Ruofei Zhong, Xin Du, and Yu Du. Transunetcd: A hybrid transformer network for change detection in optical remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 60:1–19, 2022.
  • [33] Xintao Xu, Jinjiang Li, and Zheng Chen. Tcianet: Transformer-based context information aggregation network for remote sensing image change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16:1951–1971, 2023.
  • [34] Tianyu Yan, Zifu Wan, and Pingping Zhang. Fully transformer network for change detection of remote sensing images. In Proceedings of the Asian Conference on Computer Vision, pages 1691–1708, 2022.
  • [35] Cui Zhang, Liejun Wang, Shuli Cheng, and Yongming Li. Swinsunet: Pure transformer network for remote sensing image change detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–13, 2022.
  • [36] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [37] Trong Huy Phan and Kazuma Yamamoto. Resolving class imbalance in object detection with weighted cross entropy losses. arXiv preprint arXiv:2006.01413, 2020.
  • [38] Rongjian Zhao, Buyue Qian, Xianli Zhang, Yang Li, Rong Wei, Yang Liu, and Yinggang Pan. Rethinking dice loss for medical image segmentation. In 2020 IEEE International Conference on Data Mining (ICDM), pages 851–860. IEEE, 2020.
  • [39] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [40] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3146–3154, 2019.
  • [41] Xinghua Li, Meizhen He, Huifang Li, and Huanfeng Shen. A combined loss-based multiscale fully convolutional network for high-resolution remote sensing image change detection. IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2021.
  • [42] Siyuan Zhu, Yonghong Song, Yu Zhang, and Yuanlin Zhang. Ecfnet: A siamese network with fewer fps and fewer fns for change detection of remote-sensing images. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023.
  • [43] Qingle Guo, Junping Zhang, Shengyu Zhu, Chongxiao Zhong, and Ye Zhang. Deep multiscale siamese network with parallel convolutional structure and self-attention for change detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–12, 2021.
  • [44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [45] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021.
  • [46] Liegang Xia, Jun Chen, Jiancheng Luo, Junxia Zhang, Dezhi Yang, and Zhanfeng Shen. Building change detection based on an edge-guided convolutional neural network combined with a transformer. Remote Sensing, 14(18):4524, 2022.
  • [47] Hanhong Zheng, Maoguo Gong, Tongfei Liu, Fenlong Jiang, Tao Zhan, Di Lu, and Mingyang Zhang. Hfa-net: High frequency attention siamese network for building change detection in vhr remote sensing images. Pattern Recognition, 129:108717, 2022.
  • [48] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  • [49] Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-cam: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.
  • [50] Chao Tao, Ji Qi, Mingning Guo, Qing Zhu, and Haifeng Li. Self-supervised remote sensing feature learning: Learning paradigms, challenges, and future works. IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [51] Haifeng Li, Jun Cao, Jiawei Zhu, Qinyao Luo, Silu He, and Xuying Wang. Augmentation-free graph contrastive learning of invariant-discriminative representations. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [52] Zhaoyang Zhang, Zhen Ren, Chao Tao, Yunsheng Zhang, Chengli Peng, and Haifeng Li. Grass: Contrastive learning with gradient-guided sampling strategy for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 61:1–14, 2023.