PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

Deyi Ji1,2    Wenwei **2    Hongtao Lu3    Feng Zhao1
1University of Science and Technology of China
2Alibaba Group
3Dept. of CSE, MOE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Corresponding author.
Abstract

The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.

Figure 1: Examples of UAV scene segmentation dataset showing the diverse perspectives encountered during UAV flights. Perspectives such as parallel, oblique, and perpendicular are presented across different flights and varying stages within the same flight. These images illustrate the significant changes in viewpoint that can occur when observing various scenes from a UAV. However, there is often a lack of datasets that capture multiple perspectives of the same scene, which complicates the task of accurate UAV segmentation.

1 Introduction

Semantic segmentation is a fundamental component of many practical applications in computer vision, such as autonomous driving, video surveillance, and precision agriculture, where the goal is to predict a category label for each pixel in an image. In recent years, a plethora of algorithms and networks Long et al. (2015); Ji et al. (2023c, 2022) have been developed for segmentation in conventional scenes, showcasing significant advancements in the field.

With the burgeoning growth of Unmanned Aerial Vehicles (UAVs), UAV segmentation has emerged as a pivotal area of research, playing a crucial role in applications ranging from environmental monitoring to disaster response. Unlike data captured from fixed perspectives, UAVs operate at varying altitudes and angles, offering a wealth of dynamic viewpoints, as shown in Figure 1. The images captured by UAVs reflect the rich and varied visual information from different angles and altitudes, which is of great importance for monitoring rapidly changing environments.

Addressing the segmentation challenges posed by such rich variability in perspectives necessitates a novel approach. The most intuitive solution would be to collect actual multi-perspective data for training segmentation networks. However, as analyzed in Ji et al. (2024a); Zheng et al. (2020); Yang and Ma (2022); Ji et al. (2024b), due to the prohibitive costs of collection and fine-grained annotation, existing datasets lack multi-perspective images with detailed labeling. Conventional segmentation methods typically rely on explicit perspective transformation-based data augmentation techniques, such as scaling, rotating, or flipping images in 2D or 3D. These rudimentary augmentation methods produce stiff and unnatural perspectives that fail to represent the true changes in perspective experienced during UAV flight, resulting in limited model performance in real-world UAV scenarios.

In light of these challenges and the rapid development of Vision Transformers in semantic segmentation, a subset of methodologies Xie et al. (2021); Zheng et al. (2021) has been applied to UAV scene segmentation. However, these Transformer networks were neither analyzed in depth for nor designed around UAV perspective dynamics. Based on this analysis, we propose PPTFormer, a new Pseudo Multi-Perspective Transformer network designed for UAV segmentation. It targets efficient perspective learning by integrating a specialized encoder and a universal decoder. At its core are the PPTFormer Blocks for pseudo multi-perspective learning. These blocks leverage perspective prototypes, consistent across the network, to facilitate perspective-aware learning. Inspired by Ji et al. (2024b), the key component of each PPTFormer Block is the Perspective Transformation module, which adjusts visual features to simulate varying UAV viewpoints while preserving scene semantics. Pseudo Multi-Perspective Attention (PMP Attention) layers then fuse these adjusted features with the original input, enriching the model’s semantic understanding from multiple perspectives.

Our contributions can be summarized as follows:

  • We propose PPTFormer that enables implicit multi-perspective learning even in the absence of authentic multi-perspective datasets. By generating pseudo multi-perspective characterization about the scene and engaging in joint learning across them, PPTFormer can effectively simulate the varying viewpoints encountered during actual UAV flight, thereby improving the segmentation accuracy.

  • Particularly, the PPTFormer begins with Perspective Representation, distills high-dimensional Perspective Prototypes, generates Pseudo Perspectives through transformations, and finally performs fusion learning with original and Pseudo Perspectives.

  • PPTFormer achieves state-of-the-art performance on five UAV segmentation datasets, demonstrating its effectiveness in capturing the intricate dynamics of UAV-captured scenes through Pseudo Multi-Perspective Learning.

2 Related Work

2.1 Semantic Segmentation

Semantic segmentation has been a classical and fundamental task in computer vision Chen et al. (2024); Ji et al. (2024a, 2020); Zhu et al. (2024b); Wang et al. (2021b, a); Feng et al. (2018); Ji et al. (2019, 2023a); Zhu et al. (2023e, f, c, 2022); Yu et al. (2023); Chen et al. (2023); Yang et al. (2024). In recent years, the majority of segmentation advancements have been grounded in the use of fully convolutional networks (FCN) Zhu et al. (2021, 2024a); Ji et al. (2023c, 2022); Zhu et al. (2023d, b); Long et al. (2015); Zheng et al. (2023a, b); Yu et al. (2022a); Zhou et al. (2023); Yu et al. (2023). Subsequent research has concentrated on capturing contextual relationships within images, employing sophisticated network architectures to enhance the understanding of scene composition Hu et al. (2020); Ji et al. (2022); Zhu et al. (2023a). Further innovations have focused on exploiting the contextual richness embedded within deep features. Encoder-decoder structures have also been pivotal in refining predictions by capturing high-level semantic information and detailed spatial relationships.

2.2 UAV Scene Segmentation

Despite these advancements, there is a noticeable gap in the literature pertaining to semantic segmentation tailored for UAV imagery. Existing UAV segmentation methods mainly focus on solving class-imbalance problems, such as SCO Yang and Ma (2022), FarSeg Zheng et al. (2020) and PointFlow Li et al. (2021). SCO Yang and Ma (2022) tackles the large intra-class variance of both foreground and background classes via prototypes. FarSeg Zheng et al. (2020) proposes a foreground-aware relation network to address the larger intra-class variance of the background. PointFlow presents a point-wise affinity propagation module to address the imbalanced foreground-background distribution. The unique challenges posed by UAV scenes, characterized by diverse and dynamic changes in perspective, are seldom addressed. DLPL Ji et al. (2024b) first presents a universal framework. Yet the development of segmentation models that can effectively handle the variability inherent in UAV-captured images remains an area in need of further exploration.

2.3 Ultra-High Resolution Segmentation

In comparison to natural scene imagery, images captured from Unmanned Aerial Vehicles (UAVs) generally exhibit higher resolution characteristics. Existing research has also attempted to introduce segmentation algorithms that can concurrently balance accuracy and efficiency, such as WSDNet Ji et al. (2023c), GPWFormer Ji et al. (2023b), among others. In this paper, we endeavor to enhance segmentation precision from the perspective of UAV flight viewpoints. This approach is universally applicable and can be integrated with the aforementioned high-resolution image segmentation algorithms.

Figure 2: The overview of the proposed PPTFormer. The encoder comprises four Transformer Blocks: one Plain Transformer Block followed by three PPTFormer Blocks. The former is responsible for extracting basic low-level information, which serves as the foundation for pseudo multi-perspective learning in the subsequent blocks. Specifically, within the PPTFormer Blocks, we extract implicit perspective representations from the visual features. In conjunction with training across the entire dataset, we create perspective prototypes of the images present throughout the dataset. These prototypes are shared across the three PPTFormer Blocks to ensure a consistent learning of perspectives during the training process. Interleaved between the PPTFormer Blocks are Perspective Calibration modules, which prevent potential scene domain shifts. Finally, we concatenate features of varying scales produced by each block and feed them into the decoder network.

3 PPTFormer

3.1 Overall Structure

The overall architecture of our proposed Pseudo Multi-Perspective Transformer (PPTFormer) is inspired by Ji et al. (2024b) and depicted in Figure 2. Following the classic paradigms Zheng et al. (2021); Xie et al. (2021), PPTFormer consists of a meticulously designed encoder for perspective learning and a generic decoder. The encoder comprises four Transformer Blocks: one Plain Transformer Block followed by three PPTFormer Blocks. The former is responsible for extracting basic low-level information, which serves as the foundation for pseudo multi-perspective learning in the subsequent blocks. Specifically, within the PPTFormer Blocks, we extract implicit perspective representations from the visual features. In conjunction with training across the entire dataset, we create perspective prototypes of the images present throughout the dataset. These prototypes are shared across the three PPTFormer Blocks to ensure a consistent learning of perspectives during the training process. Interleaved between the PPTFormer Blocks are Perspective Calibration modules, which are instrumental in aligning the visually fused features from pseudo perspectives with the original perspective of the image. This alignment prevents potential scene domain shifts. Finally, we concatenate features of varying scales produced by each block and feed them into the decoder network for further processing.

3.2 PPTFormer Block

As shown in Figure 2, the PPTFormer Block comprises a Perspective Transformation module and $M$ layers of Pseudo Multi-Perspective Attention (PMP Attention). Given the input of low-level visual features $F$ from block 1, the Perspective Transformation module implicitly represents and transforms the image’s perspective $p$, generating a pseudo perspective $p'$ that simulates the movement and shift of viewpoints during an actual UAV flight, all while preserving the semantic information of the scene. During this process, the acquired perspective representation $p$ contributes to the construction of perspective prototypes $P$ for the entire dataset, and the perspective transformation is in turn based on these prototypes. The output visual feature $F'$ with the pseudo perspective $p'$ and the original feature $F$ are both fed into the $M$ layers of PMP Attention for multi-perspective fusion. This allows the model to understand the scene’s semantic information from both the original perspective and the new pseudo perspective simultaneously. Specifically, in the first layer of PMP Attention, the inputs are $F$ and $F'$, and the output is the first level of perspective fusion. Subsequently, in the following $M-1$ layers of PMP Attention, the fused feature and $F'$ work together to achieve the subsequent $M-1$ levels of perspective fusion. Below, we introduce the specific structures of the Perspective Transformation and PMP Attention in detail.

Figure 3: The Perspective Transformation module in PPTFormer Block.

3.3 Perspective Transformation

As illustrated in Figure 3, the input is the visual feature $F$ from Block 1, which first passes through a Perspective Representation encoder $\mathbf{E}_p$ to obtain the original perspective $p$. Subsequently, on one hand, $p$ contributes to the construction of the entire dataset’s perspective prototypes $P$ using the online sequential clustering updating technique. The length of $P$, which corresponds to the number of perspective prototypes in the dataset, is $N$. On the other hand, $p$ is also used for Pseudo Perspective Generation based on $P$. Through the transformation process $\mathcal{H}$, a pseudo perspective $p'$ is generated. Thereafter, a Perspective Reconstruction decoder $\mathbf{D}_p$ uses $p'$ to reconstruct the visual feature $F'$. $\mathbf{D}_p$ is an all-MLP architecture. During training, to ensure its reconstructive capability, $p$ is also fed directly into $\mathbf{D}_p$ with the aim of restoring the original visual feature $F$, that is,

$$L_{rec} = \|\mathbf{D}_p(p) - F\|_2 = \|\mathbf{D}_p(\mathbf{E}_p(F)) - F\|_2, \tag{1}$$

where $L_{rec}$ is the reconstruction loss.
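As an illustration of Eq. 1, the following minimal Python sketch computes the reconstruction loss for a single flattened feature vector. The `encode` and `decode_*` callables are hypothetical stand-ins for $\mathbf{E}_p$ and $\mathbf{D}_p$ (plain scalings, not the actual encoder or MLP decoder), chosen only so that the arithmetic can be checked by hand.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def reconstruction_loss(encode, decode, feature):
    """L_rec = || D_p(E_p(F)) - F ||_2 for one flattened feature vector F."""
    restored = decode(encode(feature))
    return l2_norm([r - f for r, f in zip(restored, feature)])


# Hypothetical toy encoder/decoder pairs (assumptions for illustration only).
def encode(f):
    return [2.0 * x for x in f]          # stand-in for E_p


def decode_exact(p):
    return [0.5 * x for x in p]          # perfect inverse of encode


def decode_lossy(p):
    return [0.4 * x for x in p]          # imperfect decoder


feature = [3.0, 4.0]
loss_exact = reconstruction_loss(encode, decode_exact, feature)
loss_lossy = reconstruction_loss(encode, decode_lossy, feature)
```

With the exact inverse the loss vanishes, while the imperfect decoder leaves a residual of roughly 20% of the feature norm; minimizing this quantity is what pushes $\mathbf{D}_p$ toward a faithful reconstruction.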

Next, we detail the structure of the Perspective Representation encoder $\mathbf{E}_p$, the construction process of the Perspective Prototypes $P$, and the Pseudo Perspective Generation $\mathcal{H}$.

3.3.1 Perspective Representation

As depicted in Figure 4, different from Ji et al. (2024b), the Perspective Representation encoder $\mathbf{E}_p$ primarily encompasses two processes: extracting low-level structural texture from the image using contourlet decomposition Do and Vetterli (2005) and, based on this texture, extracting interest super points that are related to the image’s perspective. The former ensures that the model captures the global structural texture information, which can represent the image’s contours, edges, and other structural features, including perspective information. To further distill perspective-related features, the latter identifies key support points representing perspective as super points, whose spatial distribution and feature intensity can construct a structured high-dimensional description of the image’s perspective.

Texture Decomposition.

Specifically, as traditional filters, contourlet decompositions inherently excel at texture representation across various geometric scales and directions in the spectral domain. Rather than describing texture features in the spatial domain, they analyze the energy distribution in the spectral domain to extract the inherent geometric structures of the texture, which naturally includes the image’s perspective.

The contourlet decomposition comprises a cascaded Laplacian Pyramid (LP) Burt and Adelson (1983) and a directional filter bank (DFB) Bamberger and Smith (1992). The LP decomposes input features into low-pass and high-pass subbands using pyramidal filters. The high-pass subband is then processed by the DFB, which is employed to reconstruct the original signal with minimal sample representation and is produced by a $z$-level binary tree decomposition in the two-dimensional frequency domain, resulting in $2^z$ directional subbands. For instance, when $z=3$, the frequency domain is divided into 8 directional subbands, with subbands 0-3 and 4-7 corresponding to vertical and horizontal details, respectively. Following Ji et al. (2022), for a richer expression, we stack multiple contourlet decomposition layers iteratively, and concatenate the output of each level to form the final extracted structural texture.

Specifically, the output of level $t \in [1,T]$ is denoted as $F_{bds,t}$,

$$F_{bds,t} = \mathbf{DFB}(F_{h,t}), \quad t \in [1,T], \tag{2}$$
$$\text{where}\quad F_{l,t},\, F_{h,t} = \mathbf{LP}(F_{l,t-1}),$$

where $l$ and $h$ represent the low-pass and high-pass subbands respectively, and $bds$ denotes the bandpass directional subbands. Then the structural texture $F_{texture}$ is denoted as:

$$F_{texture} = \mathop{\rm Cat}\limits_{t \in [1,T]} \{F_{bds,t}\}, \tag{3}$$

where ${\rm Cat}$ is the concatenation operation. $F_{texture}$ is rich in texture information including the image’s perspective.
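The recursion in Eq. 2 can be sketched in plain Python as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the pyramidal filter is replaced by 2×2 average pooling with nearest-neighbour upsampling, $F_{l,0}$ is taken to be the input feature map, spatial sizes are even at every level, and the DFB is left as a pluggable function that defaults to the identity (a real directional filter bank is considerably more involved). The per-level outputs are returned as a list rather than concatenated, because they have different resolutions in this toy setup.

```python
def avg_pool2(x):
    """Low-pass and downsample: mean over non-overlapping 2x2 blocks."""
    h, w = len(x), len(x[0])
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2 in both directions."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def lp_step(x):
    """One Laplacian-pyramid step: returns (low-pass, high-pass residual)."""
    low = avg_pool2(x)
    up = upsample2(low)
    high = [[a - b for a, b in zip(row_x, row_up)]
            for row_x, row_up in zip(x, up)]
    return low, high


def texture_levels(feature, num_levels, dfb=lambda high: high):
    """Collect F_bds,t = DFB(F_h,t) for t = 1..T, starting from F_l,0 = feature."""
    low, outputs = feature, []
    for _ in range(num_levels):
        low, high = lp_step(low)      # F_l,t, F_h,t = LP(F_l,t-1)
        outputs.append(dfb(high))     # placeholder DFB (identity by default)
    return outputs, low


feature = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0],
           [13.0, 14.0, 15.0, 16.0]]
outputs, low = texture_levels(feature, num_levels=2)
```

Each high-pass residual holds the detail removed by the low-pass step, so adding it back to the upsampled low-pass band restores the input of that level exactly.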

Perspective Support Description.

Based on $F_{texture}$, we use a SuperPoint network to extract key support points and corresponding support descriptors that are capable of characterizing the perspective from the texture features. Specifically, this network comprises two parallel heads, $\mathbf{S}_{sp}$ and $\mathbf{S}_{sd}$, which respectively output the “point-ness” probability map and the corresponding point feature descriptor. The final output, the perspective feature $p$, is the concatenation of output features from the two heads, along the channel dimension:

$$p = {\rm Cat}(\mathbf{S}_{sp}(F_{texture}),\, \mathbf{S}_{sd}(F_{texture})). \tag{4}$$

3.3.2 Perspective Prototypes Construction

The construction of Perspective Prototypes aims to obtain and manage the scene perspective types of the whole dataset by performing an online sequential clustering process on the incoming $p$'s. We utilize a lightweight memory bank whose length $N$ equals the number of prototypes. First, the $N$ prototypes $P=\{P_1,P_2,...,P_n,...,P_N\}$ are initialized with the first $N$ input $p$'s, and we set the counts $\{c_1,c_2,...,c_n,...,c_N\}$ to record the number of perspective features belonging to the corresponding prototype. Then, for each newly arriving $p$, we find its closest prototype $P_n$ by L2 distance and update that prototype with:

$$c_n \leftarrow c_n + 1, \qquad P_n \leftarrow P_n + \frac{1}{c_n}(p - P_n). \tag{5}$$

So the final resulting prototype $P_n$ is the moving average of the $p$'s that are closest to $P_n$.
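The update rule in Eq. 5 can be sketched as a small memory bank in plain Python. The class name `PrototypeBank` and its methods are hypothetical, perspective features are treated as flat lists of floats, and the initial count of each prototype is assumed to be 1 (the text does not state the initial value).

```python
import math


class PrototypeBank:
    """Online sequential clustering over incoming perspective features."""

    def __init__(self, num_prototypes):
        self.num_prototypes = num_prototypes
        self.prototypes = []   # P_1 .. P_N
        self.counts = []       # c_1 .. c_N

    def update(self, p):
        """Assign p to its nearest prototype and update it; returns the index n."""
        # The first N inputs initialise the prototypes (count assumed to start at 1).
        if len(self.prototypes) < self.num_prototypes:
            self.prototypes.append(list(p))
            self.counts.append(1)
            return len(self.prototypes) - 1

        # Nearest prototype by L2 distance.
        n = min(range(self.num_prototypes),
                key=lambda i: math.dist(p, self.prototypes[i]))

        # Eq. 5: c_n <- c_n + 1, then P_n <- P_n + (p - P_n) / c_n.
        self.counts[n] += 1
        c = self.counts[n]
        self.prototypes[n] = [pn + (x - pn) / c
                              for pn, x in zip(self.prototypes[n], p)]
        return n


bank = PrototypeBank(num_prototypes=2)
for p in ([0.0, 0.0], [10.0, 10.0], [2.0, 0.0], [0.0, 3.0]):
    bank.update(p)
```

After these four updates the first prototype equals the mean of the three features assigned to it, which is the running-mean behaviour described above; the second prototype is untouched because no later feature was closest to it.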

Then, following Ji et al. (2024b), we can formulate the overall perspective distribution of the whole dataset in the form of a Gaussian Mixture Model (GMM) as,

$$G(P) = \sum_{n=1}^{N} \pi_n \cdot \mathcal{N}(p \,|\, P_n, \Sigma_n), \tag{6}$$

where $\mathcal{N}(\cdot)$ indicates the Gaussian distribution, the $n$th component of the GMM has center $P_n$ and variance $\Sigma_n$, and $\pi_n$ is the mixture coefficient, which satisfies:

$$\sum_{n=1}^{N} \pi_n = 1, \qquad 0 \leq \pi_n \leq 1. \tag{7}$$
Figure 4: The detailed process of Perspective Representation.

3.3.3 Pseudo Perspective Generation

Based on the dynamically updated perspective distribution $G(P)$, we can generate a new semantic-related pseudo perspective $p'$ of the given probe $p$, by leveraging all the prototypes:

$$p' = p \cdot G(P, p) = p \cdot \sum_{n=1}^{N} \pi_n \cdot \mathcal{N}(p \,|\, P_n, \Sigma_n), \tag{8}$$

where $p'$ is generated based on the overall perspective distribution over all perspective prototypes.

Finally, the corresponding visual feature $F'$ for $p'$ can be reconstructed with $\mathbf{D}_p$:

$$F' = \mathbf{D}_p(p'). \tag{9}$$
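To make Eqs. 6 to 8 concrete, the sketch below evaluates the mixture density at a probe and scales the probe by it, reading Eq. 8 literally with $G(P, p)$ as a scalar. Several choices here are assumptions made only for illustration: every component uses an isotropic covariance $\Sigma_n = \sigma^2 I$ with a shared `sigma`, the mixture weights are passed in directly, and perspective features are flat lists of floats. How $\Sigma_n$ and $\pi_n$ are actually estimated is not specified in the text.

```python
import math


def gaussian_pdf(p, mean, sigma):
    """Density of an isotropic Gaussian N(p | mean, sigma^2 I)."""
    d = len(p)
    sq_dist = sum((x - m) ** 2 for x, m in zip(p, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (-d / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * sigma ** 2))


def mixture_density(p, prototypes, weights, sigma):
    """Eq. 6: sum_n pi_n * N(p | P_n, Sigma_n), evaluated at the probe p."""
    return sum(w * gaussian_pdf(p, mean, sigma)
               for w, mean in zip(weights, prototypes))


def pseudo_perspective(p, prototypes, weights, sigma):
    """Eq. 8 read literally: p' = p * G(P, p)."""
    g = mixture_density(p, prototypes, weights, sigma)
    return [g * x for x in p]


prototypes = [[0.0, 0.0], [2.0, 0.0]]   # toy P_1, P_2
weights = [0.5, 0.5]                    # toy pi_1, pi_2 (sum to 1, as in Eq. 7)
probe = [1.0, 0.0]

density = mixture_density(probe, prototypes, weights, sigma=1.0)
p_prime = pseudo_perspective(probe, prototypes, weights, sigma=1.0)
```

In this toy case the probe lies midway between the two prototypes, so both components contribute equally to the density that rescales $p$.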

3.4 Pseudo Multi-Perspective Attention

Through the Perspective Transformation, we obtain a semantic-related, perspective-transformed visual feature $F'$ for the input visual feature $F$. $F$ and $F'$ contain nearly identical scene context and structural information and differ only in perspective. Next, they are fed into the $M$ layers of Pseudo Multi-Perspective (PMP) Attention, as described in Sec. 3.2, to leverage the relationship between $F$ (with original perspective $p$) and $F'$ (with generated pseudo perspective $p'$). Formally, the first layer of PMP Attention is formulated as:

$${\rm PMP\_Att}(F, F') = {\rm Softmax}\left(\frac{F \times F'^{\top}}{\sqrt{C_1}}\right) \times F', \tag{10}$$

where $C_1$ is the number of feature channels, $F$ acts as the query, and $F'$ acts as the key and value, in a cross-perspective-attention calculation similar to Ji et al. (2024b).
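A minimal plain-Python sketch of Eq. 10 and of the $M$-layer stacking described in Sec. 3.2 is given below. Features are represented as lists of token vectors (one row per spatial position). Following Eq. 10 as written, no learned query/key/value projections, residual connections, or normalization layers are included; whether the actual implementation uses them is not stated, so their absence here is an assumption. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def pmp_attention(f, f_prime):
    """Eq. 10: Softmax(F F'^T / sqrt(C_1)) F', with F as query and F' as key/value."""
    channels = len(f[0])
    scale = math.sqrt(channels)
    fused = []
    for query in f:
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in f_prime]
        weights = softmax(scores)
        fused.append([sum(w * value[j] for w, value in zip(weights, f_prime))
                      for j in range(channels)])
    return fused


def pmp_stack(f, f_prime, num_layers):
    """Layer 1 fuses F with F'; each later layer fuses the previous output with F'."""
    fused = f
    for _ in range(num_layers):
        fused = pmp_attention(fused, f_prime)
    return fused


f = [[1.0, 0.0], [0.0, 1.0]]            # toy original-perspective tokens
f_prime = [[2.0, 3.0], [2.0, 3.0]]      # toy pseudo-perspective tokens
fused = pmp_stack(f, f_prime, num_layers=3)
uniform = pmp_attention([[0.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
```

Because each output row is a softmax-weighted average of the rows of $F'$, the fused feature always lies in the convex hull of the pseudo-perspective tokens; the toy inputs make this visible, since identical keys or an all-zero query both yield uniform weights.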

3.5 Perspective Calibration

Within each PPTFormer Block, after undergoing $M$ layers of PMP Attention, the original perspective and the pseudo perspective are thoroughly fused multiple times, ultimately enabling the model to capture scene information as observed from various perspectives. However, in practice, we observed that as the perspective fusion progresses, some domain shift can arise in the visual features’ depiction of the scene. To prevent this, we further incorporate a straightforward Perspective Calibration process after each PPTFormer Block. Specifically, the visual feature with the original perspective, which is the input to the current PPTFormer Block, is passed through a skip connection and used to calibrate the fused feature output by the current PPTFormer Block via several layers of PMP Attention. In practice, we found this simple approach to be effective in mitigating domain shift.

3.6 Optimization

The overall loss function $L$ is the combination of the main segmentation loss $L_{seg}$ and the reconstruction loss $L_{rec}$:

$$L = L_{seg} + \lambda L_{rec}, \tag{11}$$

where $\lambda$ is the weight for $L_{rec}$, and is set to 0.4.

4 Experiments

4.1 Datasets and Evaluation Metrics

In our experiments, we validate the effectiveness of PPTFormer on five datasets: UDD6, iSAID, UAVid, Aeroscapes, and DroneSeg.

4.1.1 UDD6

The Urban Drone Dataset (UDD) is collected by a DJI-Phantom 4 UAV at altitudes between 60 and 100 meters, and is extracted from 10 video sequences. The resolution is either 4K (4096×2160) or 12M (4000×3000). It contains a variety of urban scenes.

4.1.2 iSAID

iSAID consists of 2,806 images in total, of which 1,411, 458, and 937 images are used for the training, validation, and testing sets, respectively.

4.1.3 UAVid

The UAVid dataset has 300 images of size 3840×2160, where the training, validation, and testing sets contain 200, 70, and 30 images, respectively.

4.1.4 Aeroscapes

The Aeroscapes dataset provides 3,269 720p images and ground-truth masks for 11 categories, where the training and validation sets include 2,621 and 648 images respectively.

4.1.5 DroneSeg

The DroneSeg dataset Ji et al. (2024b) extends the VisDrone dataset with segmentation annotations. The dataset consists of 10,209 images with fine-grained pixel-level annotations of 14 categories.

| Method | UDD6 | iSAID | UAVid | Aeroscapes | DroneSeg |
| --- | --- | --- | --- | --- | --- |
| Deeplab | 71.84 | 59.20 | 56.82 | 51.40 | 38.69 |
| OCR_W48 | 73.37 | 62.73 | 63.10 | 58.19 | 43.10 |
| PSPNet | 72.95 | 60.30 | 58.20 | 57.98 | 37.03 |
| FarSeg | - | 63.70 | - | - | - |
| FarSeg++ | - | 63.70 | - | - | - |
| PFNet | - | 66.90 | - | - | - |
| SCO | - | 69.10 | - | - | - |
| SETR | 68.00 | 62.77 | 58.52 | 50.34 | 48.23 |
| UperNet | 73.13 | 66.45 | 61.91 | 64.32 | 53.34 |
| PoolFormer | 74.54 | 65.55 | 61.73 | 62.27 | 53.94 |
| SegFormer | 74.28 | 67.19 | 62.01 | 66.40 | 55.33 |
| PPTFormer | 76.70 | 69.87 | 65.00 | 68.50 | 57.71 |

Table 1: Moving UAV Semantic Segmentation: Comparison with state-of-the-arts on UDD6, iSAID, UAVid, Aeroscapes, and our proposed DroneSeg datasets. All values are mIoU (%).
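
For reference, the mIoU metric reported in Table 1 is conventionally the per-class intersection-over-union averaged over classes. The sketch below follows that standard definition on flat label sequences; it is a simplified illustration and not the evaluation code used in the experiments. In particular, skipping classes that appear in neither prediction nor ground truth is an assumption, and benchmark toolkits may also handle ignore labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU (%) over flat sequences of integer class labels."""
    tp = [0] * num_classes  # pixels correctly assigned to class c
    fp = [0] * num_classes  # pixels wrongly predicted as class c
    fn = [0] * num_classes  # pixels of class c predicted as something else
    for p, t in zip(pred, target):
        if p == t:
            tp[p] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # assumption: skip classes absent from both inputs
            ious.append(tp[c] / denom)
    return 100.0 * sum(ious) / len(ious) if ious else float("nan")


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3, class 2 is absent.
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=3), 2))
```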

4.2 Implementation Details

In our experiments, we follow Ji et al. (2024b) in adopting the MMSegmentation toolbox as the codebase and use its default augmentations without bells and whistles. The SuperPoint network is used for the perspective support description. To ensure training stability, we replace PMP Attention with plain self-attention during the initial 30% of epochs. This substitution aims to guarantee the reliability of perspective representation and reconstruction within $\mathbf{E}_p$ and $\mathbf{D}_p$, as well as the stability of learning the perspective prototypes. We then revert to PMP Attention to perform global joint optimization in the remaining epochs. For training, we use the SGD optimizer with momentum 0.98 for all parameters, an initial learning rate of $5 \times 10^{-3}$, and a maximum of 160K iterations for all datasets. In Eq. 2, $T$ is set to 2. The length of $P$ is set to $N = 64$.
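
The two-phase schedule described above can be sketched as a simple switch on the training step. This is an illustrative sketch only: the text states the warm-up as a share of epochs and the budget in iterations, so applying the 30% to the 160K iteration budget is an assumption, and the mode names are hypothetical placeholders rather than identifiers from the authors' code.

```python
MAX_ITERS = 160_000    # maximum iteration number stated in the text
WARMUP_PERCENT = 30    # initial share of training that uses plain self-attention


def attention_mode(iteration, max_iters=MAX_ITERS, warmup_percent=WARMUP_PERCENT):
    """Return which attention variant to use at a given training iteration."""
    # Integer arithmetic avoids floating-point rounding at the switch point.
    switch_iter = max_iters * warmup_percent // 100
    return "self_attention" if iteration < switch_iter else "pmp_attention"


print(attention_mode(0))        # warm-up phase
print(attention_mode(48_000))   # first iteration after the assumed switch point
```

Under this assumption the switch happens at iteration 48,000, after which all components are optimized jointly.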

4.3 Comparison with State-of-the-Arts

We compare PPTFormer with both representative CNN-based (DeepLabV3+ Chen et al. (2018), OCRNet_W48 Yuan et al. (2020), PSPNet Zhao et al. (2017), FarSeg Zheng et al. (2020), FarSeg++ Zheng et al. (2023c), PFNet Li et al. (2021) and SCO Yang and Ma (2022)) and ViT-based (SETR Zheng et al. (2021), UperNet Liu et al. (2021), PoolFormer Yu et al. (2022b), SegFormer Xie et al. (2021)) segmentation methods on five benchmark datasets.

4.3.1 UDD6, UAVid

Both datasets contain relatively few images and have lower scene complexity, so we discuss their results together. For fair comparison with ViT-based methods, we adopt large backbones for the CNN-based methods, including ResNet-101 and HRNet-W48. As shown in Table 1, the ordinary transformer (SETR) trails the best CNN-based method (OCR_W48) on both datasets, while the more advanced PoolFormer and SegFormer improve on SETR. The proposed PPTFormer achieves the best results on both datasets, reaching 76.70% mIoU on UDD6 and 65.00% on UAVid.

4.3.2 iSAID, Aeroscapes, DroneSeg

These three datasets contain more images than UDD6 and UAVid and exhibit higher variance in scene perspective, so they provide a more convincing test of the method. PPTFormer outperforms all compared methods on each of them (Table 1), with margins over the strongest competitor of 0.77 mIoU on iSAID (SCO), 2.10 on Aeroscapes, and 2.38 on DroneSeg (both SegFormer), which demonstrates the effectiveness of the proposed method in describing perspective information.
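
The gap between PPTFormer and the strongest competing method on each dataset can be read off Table 1 with a few lines of arithmetic. The snippet below is only a convenience for checking those gaps: the numbers are copied from Table 1, while the helper name `margins` and the data layout are illustrative.

```python
DATASETS = ["UDD6", "iSAID", "UAVid", "Aeroscapes", "DroneSeg"]

# mIoU (%) copied from Table 1; None marks entries reported as "-".
TABLE1 = {
    "Deeplab":    [71.84, 59.20, 56.82, 51.40, 38.69],
    "OCR_W48":    [73.37, 62.73, 63.10, 58.19, 43.10],
    "PSPNet":     [72.95, 60.30, 58.20, 57.98, 37.03],
    "FarSeg":     [None, 63.70, None, None, None],
    "FarSeg++":   [None, 63.70, None, None, None],
    "PFNet":      [None, 66.90, None, None, None],
    "SCO":        [None, 69.10, None, None, None],
    "SETR":       [68.00, 62.77, 58.52, 50.34, 48.23],
    "UperNet":    [73.13, 66.45, 61.91, 64.32, 53.34],
    "PoolFormer": [74.54, 65.55, 61.73, 62.27, 53.94],
    "SegFormer":  [74.28, 67.19, 62.01, 66.40, 55.33],
    "PPTFormer":  [76.70, 69.87, 65.00, 68.50, 57.71],
}


def margins(table, ours="PPTFormer"):
    """Map each dataset to (best competing method, gain of `ours` in mIoU points)."""
    result = {}
    for i, name in enumerate(DATASETS):
        rivals = {m: s[i] for m, s in table.items() if m != ours and s[i] is not None}
        best = max(rivals, key=rivals.get)
        result[name] = (best, round(table[ours][i] - rivals[best], 2))
    return result


for dataset, (best, gain) in margins(TABLE1).items():
    print(f"{dataset}: +{gain:.2f} mIoU over {best}")
```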

4.4 Ablation Study

All ablation studies are performed on the DroneSeg testing set, with SegFormer as the baseline network.

| Perspective-Oriented Learning Method | mIoU (%) |
| --- | --- |
| Baseline (SegFormer w/o data aug.) | 52.03 |
| + Random Rotate | 52.94 |
| + Random Scale | 53.09 |
| + Random Perspective-Vertical | 53.56 |
| + Random Perspective-Horizontal | 53.42 |
| + Random Combination | 55.33 |
| PPTFormer | 57.71 |

Table 2: Comparison of PPTFormer with other perspective-oriented learning methods (data augmentation). Perspective-Vertical and Perspective-Horizontal denote adjusting the perspective in the vertical and horizontal directions, respectively, and are implemented with the default “torchvision.transforms” interfaces in PyTorch.

4.4.1 Comparison with Perspective Learning Methods

Given that PPTFormer is a perspective-oriented learning approach, we begin by comparing it with various perspective-based augmentations. We observe that in UAV scenarios, perspective shifts are almost invariably linked to changes in altitude and angular positioning, which manifest as alterations in scale and rotation. We therefore benchmark PPTFormer against random rotation, random scaling, random perspective transforms, and their combination. As shown in Table 2, each augmentation in isolation yields only modest gains over the baseline (at most +1.53 mIoU), and their random combination reaches 55.33%. In contrast, PPTFormer secures a substantially larger gain, reaching 57.71%. We further demonstrate that PPTFormer remains compatible with standard data augmentations for additional gains.
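
To make the perspective baselines more concrete, the sketch below generates corner correspondences for a vertical or horizontal perspective change. It is a hypothetical illustration: the text only says the default torchvision interfaces were used, so the function name, the `max_shift` parameter, and the choice of which edge is pinched are all assumptions rather than the authors' settings.

```python
import random


def perspective_corners(width, height, direction, max_shift=0.3, rng=random):
    """Return (startpoints, endpoints), each ordered as
    [top-left, top-right, bottom-right, bottom-left] in (x, y) pixel coordinates."""
    w, h = width - 1, height - 1
    start = [[0, 0], [w, 0], [w, h], [0, h]]
    shift = rng.uniform(0.0, max_shift)  # random strength of the distortion
    if direction == "vertical":
        # Pinch the top edge inward while keeping the bottom edge fixed.
        dx = int(shift * w / 2)
        end = [[dx, 0], [w - dx, 0], [w, h], [0, h]]
    elif direction == "horizontal":
        # Pinch the right edge inward while keeping the left edge fixed.
        dy = int(shift * h / 2)
        end = [[0, 0], [w, dy], [w, h - dy], [0, h]]
    else:
        raise ValueError("direction must be 'vertical' or 'horizontal'")
    return start, end


start, end = perspective_corners(640, 480, "vertical", rng=random.Random(0))
print(start, end)
# The two lists use the corner order expected by
# torchvision.transforms.functional.perspective(img, startpoints, endpoints).
```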

| Contourlet Decomposition Layers | mIoU (%) |
| --- | --- |
| 0 | 56.88 |
| 1 | 57.40 |
| 2 | 57.71 |
| 3 | 57.73 |

Table 3: The impact of Contourlet Decomposition.

4.4.2 The Impact of Contourlet Decomposition

The contourlet decomposition is capable of extracting structural texture information from images, which encompasses a wealth of perspective details. By initially employing it within the Perspective Representation, the network can swiftly focus on shallow image textures, thereby facilitating further extraction of perspective information and enhancing learning efficiency. Table 3 demonstrates its efficacy: as the number of contourlet decomposition layers increases from 0 to 2, the mIoU improves from 56.88% to 57.71%, while a third layer adds only a further 0.02. Here, a layer count of zero indicates that no contourlet decomposition is applied.
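
The contourlet transform (Do and Vetterli [2005]) combines a Laplacian pyramid (Burt and Adelson [1983]) with a directional filter bank (Bamberger and Smith [1992]). The sketch below illustrates only the multiscale pyramid stage, using a simple box filter and omitting the directional stage entirely, to show what a varying number of decomposition levels means. Treating the "layers" of Table 3 as pyramid levels is an assumption, all function names are hypothetical, and this is not the decomposition used in PPTFormer.

```python
def downsample(img):
    """Average non-overlapping 2x2 blocks (box low-pass filter plus decimation)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)] for y in range(0, h, 2)]


def upsample(img):
    """Nearest-neighbour upsampling by a factor of two in both directions."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def laplacian_pyramid(img, levels):
    """Return (band-pass residuals from fine to coarse, remaining coarse image).
    Image height and width must be divisible by 2**levels; levels=0 returns the input."""
    bands, current = [], img
    for _ in range(levels):
        coarse = downsample(current)
        approx = upsample(coarse)
        bands.append([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(current, approx)])
        current = coarse
    return bands, current


def reconstruct(bands, coarse):
    """Invert laplacian_pyramid by adding the residuals back, coarse to fine."""
    current = coarse
    for band in reversed(bands):
        approx = upsample(current)
        current = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(approx, band)]
    return current


image = [[y * 4 + x for x in range(4)] for y in range(4)]
bands, coarse = laplacian_pyramid(image, levels=2)
print(len(bands), coarse)
```

Each residual band keeps the detail lost when moving to the next coarser scale, which is the kind of structural texture information the paragraph above refers to.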

4.4.3 Effectiveness of Perspective Calibration

The purpose of Perspective Calibration is to prevent the scene domain shift that may arise as a consequence of deep perspective fusion. Figure 5 illustrates how the number of PMP Attention layers used in Perspective Calibration affects model performance. “Layer=0” denotes the absence of calibration, and the results indicate a low mIoU under this condition. As the number of layers increases, the mIoU improves significantly, which underscores the effectiveness and necessity of Perspective Calibration.

4.4.4 The Quantity of Perspective Prototypes

The number of perspective prototypes determines how much of the perspective variation within the dataset can be represented, with a higher count allowing a more extensive set of prototypes to be retained. Figure 6 reveals that with a smaller allocation of prototypes (16, 32), the model fails to capture all perspectives and underfits. Conversely, as the number of prototypes increases (128, 256), the array of prototypes becomes overly dense. This surplus can introduce redundancy and give rise to numerous discrete perspectives, potentially hindering the learning process.

Figure 5: Layer number in Perspective Calibration.
Figure 6: The quantity of Perspective Prototypes.

5 Conclusion

This paper presents the novel PPTFormer, a Pseudo Multi-Perspective Transformer network for UAV scene segmentation. It addresses the challenges of capturing the dynamic perspectives inherent in UAV-captured imagery. By integrating systematic Pseudo Multi-Perspective Learning within the Transformer framework, PPTFormer adeptly performs Perspective Decomposition, constructs a rich Perspective Space, and achieves Multi-Perspective Fusion, leading to a more nuanced understanding of UAV scenes. The experiments on several datasets validate the superior performance of PPTFormer. The significant advancements made by PPTFormer underscore the importance of perspective-oriented learning in semantic segmentation and pave the way for further innovation in the processing of UAV-captured visual data.

Acknowledgments

This work was supported by the Anhui Provincial Natural Science Foundation under Grant 2108085UD12, the JKW Research Funds under Grant 20-163-14-LZ-001-004-01, the National Key R&D Program of China under Grant 2020AAA0103902, NSFC (No. 62176155), Shanghai Municipal Science and Technology Major Project, China (2021SHZDZX0102).

We acknowledge the support of the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.

Contribution Statement

The first two authors contributed equally to this work.

References

  • Bamberger and Smith [1992] R.H. Bamberger and M.J.T. Smith. A filter bank for the directional decomposition of images: theory and design. IEEE Transactions on Signal Processing, 40(4):882–893, 1992.
  • Burt and Adelson [1983] P. Burt and E. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.
  • Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, pages 801–818, 2018.
  • Chen et al. [2023] Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Shangzhan Zhang, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam fails to segment anything?–sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, and more. arXiv preprint arXiv:2304.09148, 2023.
  • Chen et al. [2024] Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, and Lingyun Sun. Reasoning3d – grounding and reasoning in 3d: Fine-grained zero-shot open-vocabulary 3d reasoning part segmentation via large vision-language models. arXiv preprint arXiv:2405.19326, 2024.
  • Do and Vetterli [2005] M.N. Do and M. Vetterli. The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on Image Processing, 14(12):2091–2106, 2005.
  • Feng et al. [2018] Weitao Feng, Deyi Ji, Yiru Wang, Shuorong Chang, Hansheng Ren, and Weihao Gan. Challenges on large scale surveillance video analysis. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 69–76, 2018.
  • Hu et al. [2020] Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, and Junjie Yan. Class-wise dynamic graph convolution for semantic segmentation. In European Conference on Computer Vision, pages 1–17, 2020.
  • Ji et al. [2019] Deyi Ji, Hongtao Lu, and Tongzhen Zhang. End to end multi-scale convolutional neural network for crowd counting. In Eleventh International Conference on Machine Vision, volume 11041, pages 761–766, 2019.
  • Ji et al. [2020] Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. arXiv preprint arXiv:2012.04298, 2020.
  • Ji et al. [2022] Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022.
  • Ji et al. [2023a] Deyi Ji, Siqi Gao, Mingyuan Tao, Hongtao Lu, and Feng Zhao. Changenet: Multi-temporal asymmetric change detection dataset. arXiv preprint arXiv:2312.17428, 2023.
  • Ji et al. [2023b] Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation. In International Joint Conference on Artificial Intelligence, pages 920–928, 2023.
  • Ji et al. [2023c] Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023.
  • Ji et al. [2024a] Deyi Ji, Siqi Gao, Lanyun Zhu, Qi Zhu, Yiru Zhao, Peng Xu, Hongtao Lu, Feng Zhao, and Jieping Ye. View-centric multi-object tracking with homographic matching in moving uav. arXiv preprint arXiv:2403.10830, 2024.
  • Ji et al. [2024b] Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. In International Conference on Machine Learning, 2024.
  • Li et al. [2021] Xiangtai Li, Hao He, Xia Li, Duo Li, Guangliang Cheng, Jianping Shi, Lubin Weng, Yunhai Tong, and Zhouchen Lin. Pointflow: Flowing semantics through points for aerial image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4217–4226, 2021.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  • Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • Wang et al. [2021a] Haoran Wang, Licheng Jiao, Fang Liu, Lingling Li, Xu Liu, Deyi Ji, and Weihao Gan. Ipgn: Interactiveness proposal graph network for human-object interaction detection. IEEE Transactions on Image Processing, 30:6583–6593, 2021.
  • Wang et al. [2021b] Haoran Wang, Licheng Jiao, Fang Liu, Lingling Li, Xu Liu, Deyi Ji, and Weihao Gan. Learning social spatio-temporal relation graph in the wild and a video benchmark. IEEE Transactions on Neural Networks and Learning Systems, 34(6):2951–2964, 2021.
  • Xie et al. [2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In 35th Conference on Neural Information Processing Systems, pages 1–13, 2021.
  • Yang and Ma [2022] Fengyu Yang and Chenyang Ma. Sparse and complete latent organization for geospatial semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1809–1818, 2022.
  • Yang et al. [2024] Zizheng Yang, Jie Huang, Man Zhou, Naishan Zheng, and Feng Zhao. IRVR: A general image restoration framework for visual recognition. IEEE Transactions on Multimedia, 26:7012–7026, 2024.
  • Yu et al. [2022a] Hu Yu, Naishan Zheng, Man Zhou, Jie Huang, Zeyu Xiao, and Feng Zhao. Frequency and spatial dual guidance for image dehazing. In European Conference on Computer Vision, pages 181–198. Springer, 2022.
  • Yu et al. [2022b] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10819–10829, 2022.
  • Yu et al. [2023] Wei Yu, Qi Zhu, Naishan Zheng, Jie Huang, Man Zhou, and Feng Zhao. Learning non-uniform-sampling for ultra-high-definition image enhancement. In ACM International Conference on Multimedia, pages 1412–1421, 2023.
  • Yuan et al. [2020] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In European Conference on Computer Vision, pages 1–17, 2020.
  • Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
  • Zheng et al. [2020] Zhuo Zheng, Yanfei Zhong, Junjue Wang, and Ailong Ma. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4096–4105, 2020.
  • Zheng et al. [2021] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–10, 2021.
  • Zheng et al. [2023a] Naishan Zheng, Jie Huang, Feng Zhao, Xueyang Fu, and Feng Wu. Unsupervised underexposed image enhancement via self-illuminated and perceptual guidance. IEEE Transactions on Multimedia, 25:5469–5484, 2023.
  • Zheng et al. [2023b] Naishan Zheng, Jie Huang, Man Zhou, Zizheng Yang, Qi Zhu, and Feng Zhao. Learning semantic degradation-aware guidance for recognition-driven unsupervised low-light image enhancement. In AAAI Conference on Artificial Intelligence, pages 3678–3686, 2023.
  • Zheng et al. [2023c] Zhuo Zheng, Yanfei Zhong, Junjue Wang, Ailong Ma, and Liangpei Zhang. FarSeg++: Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13715–13729, 2023.
  • Zhou et al. [2023] Man Zhou, Naishan Zheng, Yuan Xu, Chun-Le Guo, and Chongyi Li. Training your image restoration network better with random weight network as optimization function. In 37th Advances in Neural Information Processing Systems, pages 1270–1282, 2023.
  • Zhu et al. [2021] Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12537–12546, 2021.
  • Zhu et al. [2022] Qi Zhu, Zeyu Xiao, Jie Huang, and Feng Zhao. Dast-net: Depth-aware spatio-temporal network for video deblurring. In IEEE International Conference on Multimedia and Expo, pages 1–6, 2022.
  • Zhu et al. [2023a] Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Llafs: When large-language models meet few-shot segmentation. arXiv preprint arXiv:2311.16926, 2023.
  • Zhu et al. [2023b] Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, and Jun Liu. Continual semantic segmentation with automatic memory sample selection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3082–3092, 2023.
  • Zhu et al. [2023c] Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, and Jun Liu. Learning gabor texture features for fine-grained recognition. In IEEE/CVF International Conference on Computer Vision, pages 1621–1631, 2023.
  • Zhu et al. [2023d] Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, and Jun Liu. Learning gabor texture features for fine-grained recognition. In IEEE/CVF International Conference on Computer Vision, pages 1621–1631, 2023.
  • Zhu et al. [2023e] Qi Zhu, Jie Huang, Naishan Zheng, Hongzhi Gao, Chongyi Li, Yuan Xu, Feng Zhao, et al. Fouridown: Factoring down-sampling into shuffling and superposing. In 37th Advances in Neural Information Processing Systems, volume 36, pages 1–14, 2023.
  • Zhu et al. [2023f] Qi Zhu, Man Zhou, Naishan Zheng, Chongyi Li, Jie Huang, and Feng Zhao. Exploring temporal frequency spectrum in deep video deblurring. In IEEE/CVF International Conference on Computer Vision, pages 12428–12437, 2023.
  • Zhu et al. [2024a] Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, and Jun Liu. Addressing background context bias in few-shot segmentation through iterative modulation. In IEEE/CVF International Conference on Computer Vision, pages 1–10, 2024.
  • Zhu et al. [2024b] Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. arXiv preprint arXiv:2402.18476, 2024.