Search | arXiv e-print repository

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

Authors: Yanan Sun, Yanchen Liu, Yinhao Tang, Wenjie Pei, Kai Chen

Abstract: The field of text-to-image (T2I) generation has made significant progress in recent years, largely driven by advancements in diffusion models. Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. This challenge has been explored, to a great extent, by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models through extra encoding. However, multi-control image synthesis still faces several challenges. Specifically, current approaches are limited in handling free combinations of diverse input control signals, overlook the complex relationships among multiple spatial conditions, and often fail to maintain semantic alignment with provided textual prompts. This can lead to suboptimal user experiences. To address these challenges, we propose AnyControl, a multi-control image synthesis framework that supports arbitrary combinations of diverse control signals. AnyControl develops a novel Multi-Control Encoder that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals, as demonstrated by extensive quantitative and qualitative evaluations. Our project page is available at https://any-control.github.io.

Submitted 27 June, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

arXiv:2405.15191 [pdf, other]

Effectiveness of halo and galaxy properties in reducing the scatter in the stellar-to-halo mass relation

Authors: Wenxiang Pei, Qi Guo, Shi Shao, Yi He, Qing Gu

Abstract: The stellar-to-halo mass relation (SHMR) is a fundamental relationship between galaxies and their host dark matter haloes. In this study, we examine the scatter in this relation for primary galaxies in the semi-analytic L-Galaxies model and two cosmological hydrodynamical simulations, EAGLE and TNG. We find that in low-mass haloes, more massive galaxies tend to reside in haloes with higher concentration, earlier formation time, greater environmental density, earlier major mergers, and to have older stellar populations, which is consistent with findings in various studies. Quantitative analysis reveals the varying significance of halo and galaxy properties in determining SHMR scatter across simulations and models. In EAGLE and TNG, halo concentration and formation time primarily influence SHMR scatter for haloes with $M_{\rm h}<10^{12}~\rm M_\odot$, but the influence diminishes at high mass. Baryonic processes play a more significant role in L-Galaxies. For haloes with $M_{\rm h} <10^{11}~\rm M_\odot$ and $10^{12}~\rm M_\odot<M_{\rm h}<10^{13}~\rm M_\odot$, the main drivers of scatter are galaxy SFR and age. In the $10^{11.5}~\rm M_\odot<M_{\rm h} <10^{12}~\rm M_\odot$ range, halo concentration and formation time are the primary factors. And for haloes with $M_{\rm h} > 10^{13}~\rm M_\odot$, supermassive black hole mass becomes more important. Interestingly, it is found that AGN feedback may increase the amplitude of the scatter and decrease the dependence on halo properties at high masses.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 23 pages, 12 + 5 figures, 2 tables, including 4 Appendix; Accepted by MNRAS

arXiv:2405.09185 [pdf, other]

Influence Maximization in Hypergraphs Using A Genetic Algorithm with New Initialization and Evaluation Methods

Authors: Xilong Qu, Wenbin Pei, Yingchao Yang, Xirong Xu, Renquan Zhang, Qiang Zhang

Abstract: Influence maximization (IM) is a crucial optimization task related to analyzing complex networks in the real world, such as social networks, disease propagation networks, and marketing networks. Publications to date about the IM problem focus mainly on graphs, which fail to capture high-order interaction relationships from the real world. Therefore, the use of hypergraphs for addressing the IM problem has been receiving increasing attention. However, identifying the most influential nodes in hypergraphs remains challenging, mainly because nodes and hyperedges are often strongly coupled and correlated. In this paper, to effectively identify the most influential nodes, we first propose a novel hypergraph-independent cascade model that integrates the influences of both node and hyperedge failures. Afterward, we introduce genetic algorithms (GA) to identify the most influential nodes that leverage hypergraph collective influences. In the GA-based method, the hypergraph collective influence is effectively used to initialize the population, thereby enhancing the quality of initial candidate solutions. The designed fitness function considers the joint influences of both nodes and hyperedges. This ensures the optimal set of nodes with the best influence on both nodes and hyperedges to be evaluated accurately. Moreover, a new mutation operator is designed by introducing factors, i.e., the collective influence and overlapping effects of nodes in hypergraphs, to breed high-quality offspring. In the experiments, several simulations on both synthetic and real hypergraphs have been conducted, and the results demonstrate that the proposed method outperforms the compared methods.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2404.10322 [pdf, other]

Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation

Authors: Jiapeng Su, Qi Fan, Guangming Lu, Fanglin Chen, Wenjie Pei

Abstract: Few-shot semantic segmentation (FSS) has achieved great success on segmenting objects of novel classes, supported by only a few annotated samples. However, existing FSS methods often underperform in the presence of domain shifts, especially when encountering new domain styles that are unseen during training. It is suboptimal to directly adapt or generalize the entire model to new domains in the few-shot scenario. Instead, our key idea is to adapt a small adapter for rectifying diverse target domain styles to the source domain. Consequently, the rectified target domain features can fittingly benefit from the well-optimized source domain segmentation model, which is intently trained on sufficient source domain data. Training the domain-rectifying adapter requires sufficiently diverse target domains. We thus propose a novel local-global style perturbation method to simulate diverse potential target domains by perturbing the feature channel statistics of the individual images and collective statistics of the entire source domain, respectively. Additionally, we propose a cyclic domain alignment module to facilitate the adapter effectively rectifying domains using a reverse domain rectification supervision. The adapter is trained to rectify the image features from diverse synthesized target domains to align with the source domain. During testing on target domains, we start by rectifying the image features and then conduct few-shot segmentation on the domain-rectified features. Extensive experiments demonstrate the effectiveness of our method, achieving promising results on cross-domain few-shot semantic segmentation tasks. Our code is available at https://github.com/Matt-Su/DR-Adapter.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.00092 [pdf, other]

Simulating emission line galaxies for the next generation of large-scale structure surveys

Authors: Wenxiang Pei, Qi Guo, Ming Li, Qiao Wang, Jiaxin Han, Jia Hu, Tong Su, Liang Gao, Jie Wang, Yu Luo, Chengliang Wei

Abstract: We investigate emission line galaxies across cosmic time by combining the modified L-Galaxies semi-analytical galaxy formation model with the JiuTian cosmological simulation. We improve the tidal disruption model of satellite galaxies in L-Galaxies to address the time dependence problem. We utilise the public code CLOUDY to compute emission line ratios for a grid of HII region models. The emission line models assume the same initial mass function as that used to generate the spectral energy distribution of semi-analytical galaxies, ensuring a coherent treatment for modelling the full galaxy spectrum. By incorporating these emission line ratios with galaxy properties, we reproduce observed luminosity functions for H$α$, H$β$, [OII], and [OIII] in the local Universe and at high redshifts. We also find good agreement between model predictions and observations for auto-correlation and cross-correlation functions of [OII]-selected galaxies, as well as their luminosity dependence. The bias of emission line galaxies depends on both luminosity and redshift. At lower redshifts, it remains constant with increasing luminosity up to around $\sim 10^{42.5}\rm \, erg\,s^{-1}$ and then rises steeply for higher luminosities. The transition luminosity increases with redshift and becomes insignificant above $z$=1.5. Generally, galaxy bias shows an increasing trend with redshift. However, for luminous galaxies, the bias is higher at low redshifts, as the strong luminosity dependence observed at low redshifts diminishes at higher redshifts. We provide a fitting formula for the bias of emission line galaxies as a function of luminosity and redshift, which can be utilised for large-scale structure studies with future galaxy surveys.

Submitted 29 March, 2024; originally announced April 2024.

Comments: 22 pages, 18 figures, 5 tables, including 3 Appendix; Accepted by MNRAS

arXiv:2402.05492 [pdf, other]

Cosmological Forecast of the Void Size Function Measurement from the CSST Spectroscopic Survey

Authors: Yingxiao Song, Qi Xiong, Yan Gong, Furen Deng, Kwan Chuen Chan, Xuelei Chen, Qi Guo, Jiaxin Han, Guoliang Li, Ming Li, Yun Liu, Yu Luo, Wenxiang Pei, Chengliang Wei

Abstract: Void size function (VSF) contains information of the cosmic large-scale structure (LSS), and can be used to derive the properties of dark energy and dark matter. We predict the VSFs measured from the spectroscopic galaxy survey operated by the China Space Station Telescope (CSST), and study the strength of cosmological constraint. We employ a high-resolution Jiutian simulation to get CSST galaxy mock samples based on an improved semi-analytical model. We identify voids from this galaxy catalog using the watershed algorithm without assuming a spherical shape, and estimate the VSFs at different redshift bins from $z=0.5$ to 1.1. We propose a void selection method based on the ellipticity, and assume the void linear underdensity threshold $δ_{\rm v}$ in the theoretical model is redshift-dependent and set it as a free parameter in each redshift bin. The Markov Chain Monte Carlo (MCMC) method is adopted to implement the constraints on the cosmological and void parameters. We find that the CSST VSF measurement can constrain the cosmological parameters to a few percent level. The best-fit values of $δ_{\rm v}$ are ranging from $\sim-0.4$ to $-0.1$ as the redshift increases from 0.5 to 1.1, which has a distinct difference from the theoretical calculation with $δ_{\rm v}\simeq-2.7$ assuming the spherical evolution and using particles as tracer. Our method can provide a good reference for void identification and selection in the VSF analysis of the spectroscopic galaxy surveys.

Submitted 24 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: 10 pages, 7 figures, 3 tables. Accepted for publication in MNRAS

arXiv:2402.00404 [pdf, other]

Improving Critical Node Detection Using Neural Network-based Initialization in a Genetic Algorithm

Authors: Chanjuan Liu, Shike Ge, Zhihan Chen, Wenbin Pei, Enqiang Zhu, Yi Mei, Hisao Ishibuchi

Abstract: The Critical Node Problem (CNP) is concerned with identifying the critical nodes in a complex network. These nodes play a significant role in maintaining the connectivity of the network, and removing them can negatively impact network performance. CNP has been studied extensively due to its numerous real-world applications. Among the different versions of CNP, CNP-1a has gained the most popularity. The primary objective of CNP-1a is to minimize the pair-wise connectivity in the remaining network after deleting a limited number of nodes from a network. Due to the NP-hard nature of CNP-1a, many heuristic/metaheuristic algorithms have been proposed to solve this problem. However, most existing algorithms start with a random initialization, leading to a high cost of obtaining an optimal solution. To improve the efficiency of solving CNP-1a, a knowledge-guided genetic algorithm named K2GA has been proposed. Unlike the standard genetic algorithm framework, K2GA has two main components: a pretrained neural network to obtain prior knowledge on possible critical nodes, and a hybrid genetic algorithm with local search for finding an optimal set of critical nodes based on the knowledge given by the trained neural network. The local search process utilizes a cut node-based greedy strategy. The effectiveness of the proposed knowledge-guided genetic algorithm is verified by experiments on 26 real-world instances of complex networks. Experimental results show that K2GA outperforms the state-of-the-art algorithms regarding the best, median, and average objective values, and improves the best upper bounds on the best objective values for eight real-world instances.

Submitted 1 February, 2024; originally announced February 2024.

Comments: 14 pages, 13 figures

arXiv:2401.10342 [pdf, other]

A younger Universe implied by satellite pair correlations from SDSS observations of massive galaxy groups

Authors: Qing Gu, Qi Guo, Marius Cautun, Shi Shao, Wenxiang Pei, Wenting Wang, Liang Gao, Jie Wang

Abstract: Many of the satellites of galactic-mass systems such as the Milky Way, Andromeda and Centaurus A show evidence of coherent motions to a larger extent than most of the systems predicted by the standard cosmological model. It is an open question if correlations in satellite orbits are present in systems of different masses. Here, we report an analysis of the kinematics of satellite galaxies around massive galaxy groups. Unlike what is seen in Milky Way analogues, we find an excess of diametrically opposed pairs of satellites that have line-of-sight velocity offsets from the central galaxy of the same sign. This corresponds to a $6.0σ$ ($p$-value $=\ 9.9\times10^{-10}$) detection of non-random satellite motions. Such excess is predicted by up-to-date cosmological simulations but the magnitude of the effect is considerably lower than in observations. The observational data is discrepant at the $4.1σ$ and $3.6σ$ level with the expectations of the Millennium and the Illustris TNG300 cosmological simulations, potentially indicating that massive galaxy groups assembled later in the real Universe. The detection of velocity correlations of satellite galaxies and tension with theoretical predictions is robust against changes in sample selection. Using the largest sample to date, our findings demonstrate that the motions of satellite galaxies represent a challenge to the current cosmological model.

Submitted 18 January, 2024; originally announced January 2024.

Comments: 28 pages, 9 figures, accepted for publication in Nature Astronomy

arXiv:2401.00755 [pdf, other]

Saliency-Aware Regularized Graph Neural Network

Authors: Wenjie Pei, Weina Xu, Zongze Wu, Weichao Li, Jinfan Wang, Guangming Lu, Xiangrong Wang

Abstract: The crux of graph classification lies in the effective representation learning for the entire graph. Typical graph neural networks focus on modeling the local dependencies when aggregating features of neighboring nodes, and obtain the representation for the entire graph by aggregating node features. Such methods have two potential limitations: 1) the global node saliency w.r.t. graph classification is not explicitly modeled, which is crucial since different nodes may have different semantic relevance to graph classification; 2) the graph representation directly aggregated from node features may have limited effectiveness to reflect graph-level information. In this work, we propose the Saliency-Aware Regularized Graph Neural Network (SAR-GNN) for graph classification, which consists of two core modules: 1) a traditional graph neural network serving as the backbone for learning node features and 2) the Graph Neural Memory designed to distill a compact graph representation from node features of the backbone. We first estimate the global node saliency by measuring the semantic similarity between the compact graph representation and node features. Then the learned saliency distribution is leveraged to regularize the neighborhood aggregation of the backbone, which facilitates the message passing of features for salient nodes and suppresses the less relevant nodes. Thus, our model can learn more effective graph representation. We demonstrate the merits of SAR-GNN by extensive experiments on seven datasets across various types of graph data. Code will be released.

Submitted 1 January, 2024; originally announced January 2024.

Comments: Accepted by Artificial Intelligence Journal with minor revision

arXiv:2312.10608 [pdf, other]

Robust 3D Tracking with Quality-Aware Shape Completion

Authors: Jingwen Zhang, Zikun Zhou, Guangming Lu, Jiandong Tian, Wenjie Pei

Abstract: 3D single object tracking remains a challenging problem due to the sparsity and incompleteness of the point clouds. Existing algorithms attempt to address the challenges in two strategies. The first strategy is to learn dense geometric features based on the captured sparse point cloud. Nevertheless, it is quite a formidable task since the learned dense geometric features are with high uncertainty for depicting the shape of the target object. The other strategy is to aggregate the sparse geometric features of multiple templates to enrich the shape information, which is a routine solution in 2D tracking. However, aggregating the coarse shape representations can hardly yield a precise shape representation. Different from 2D pixels, 3D points of different frames can be directly fused by coordinate transform, i.e., shape completion. Considering that, we propose to construct a synthetic target representation composed of dense and complete point clouds depicting the target shape precisely by shape completion for robust 3D tracking. Specifically, we design a voxelized 3D tracking framework with shape completion, in which we propose a quality-aware shape completion mechanism to alleviate the adverse effect of noisy historical predictions. It enables us to effectively construct and leverage the synthetic target representation. Besides, we also develop a voxelized relation modeling module and box refinement module to improve tracking performance. Favorable performance against state-of-the-art algorithms on three benchmarks demonstrates the effectiveness and generalization ability of our method.

Submitted 16 December, 2023; originally announced December 2023.

Comments: A detailed version of the paper accepted by AAAI 2024

arXiv:2312.10376 [pdf, other]

SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt

Authors: Wenjie Pei, Tongqi Xia, Fanglin Chen, Jinsong Li, Jiandong Tian, Guangming Lu

Abstract: As a prominent parameter-efficient fine-tuning technique in NLP, prompt tuning is being explored for its potential in computer vision. Typical methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP, which represents an input image as a flattened sequence of token embeddings and then learns a set of unordered parameterized tokens prefixed to the sequence representation as the visual prompts for task adaptation of large vision models. While such a sequential modeling paradigm of visual prompt has shown great promise, there are two potential limitations. First, the learned visual prompts cannot model the underlying spatial relations in the input image, which is crucial for image encoding. Second, since all prompt tokens play the same role of prompting for all image tokens without distinction, it lacks the fine-grained prompting capability, i.e., individual prompting for different image tokens. In this work, we propose the Spatially Aligned-and-Adapted Visual Prompt model (SA$^2$VP), which learns a two-dimensional prompt token map with equal (or scaled) size to the image token map, thereby being able to spatially align with the image map. Each prompt token is designated to prompt knowledge only for the spatially corresponding image tokens. As a result, our model can conduct individual prompting for different image tokens in a fine-grained manner. Moreover, benefiting from the capability of preserving the spatial structure by the learned prompt token map, our SA$^2$VP is able to model the spatial relations in the input image, leading to more effective prompting. Extensive experiments on three challenging benchmarks for image classification demonstrate the superiority of our model over other state-of-the-art methods for visual prompt tuning. Code is available at https://github.com/tommy-xq/SA2VP.

Submitted 16 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024

arXiv:2312.01431 [pdf, other]

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Authors: Wenjie Pei, Qizhong Tan, Guangming Lu, Jiandong Tian

Abstract: Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. Typical fine-tuning based adaptation paradigm is prone to overfitting in the few-shot learning scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), which is a novel adapter tuning framework well-suited for few-shot action recognition due to lightweight design and low parameter-learning overhead. It is designed in a dual-pathway architecture to encode spatial and temporal features in a disentangled manner. In particular, we devise the anisotropic Deformable Spatio-Temporal Attention module as the core component of D$^2$ST-Adapter, which can be tailored with anisotropic sampling densities along spatial and temporal domains to learn spatial and temporal features specifically in corresponding pathways, allowing our D$^2$ST-Adapter to encode features in a global view in 3D spatio-temporal space while maintaining a lightweight design. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods for few-shot action recognition. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition.

Submitted 20 April, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2309.01867 [pdf, other]

Variable Time Step Method of DAHLQUIST, LINIGER and NEVANLINNA (DLN) for a Corrected Smagorinsky Model

Authors: Farjana Siddiqua, Wenlong Pei

Abstract: Turbulent flows strain resources, both memory and CPU speed. The DLN method has greater accuracy and allows larger time steps, requiring less memory and fewer FLOPS. The DLN method can also be implemented adaptively. The classical Smagorinsky model, as an effective way to approximate a (resolved) mean velocity, has recently been corrected to represent a flow of energy from unresolved fluctuations to the (resolved) mean velocity. In this paper, we apply a family of second-order, G-stable time-stepping methods proposed by Dahlquist, Liniger, and Nevanlinna (the DLN method) to one corrected Smagorinsky model and provide the detailed numerical analysis of the stability and consistency. We prove that the numerical solutions under any arbitrary time step sequences are unconditionally stable in the long term and converge at second order. We also provide error estimate under certain time step condition. Numerical tests are given to confirm the rate of convergence and also to show that the adaptive DLN algorithm helps to control numerical dissipation so that backscatter is visible.

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2308.14061 [pdf, other]

Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection

Authors: Xin Feng, Yifeng Xu, Guangming Lu, Wenjie Pei

Abstract: Effective image restoration with large-size corruptions, such as blind image inpainting, entails precise detection of corruption region masks which remains extremely challenging due to diverse shapes and patterns of corruptions. In this work, we present a novel method for automatic corruption detection, which allows for blind corruption restoration without known corruption masks. Specifically, we develop a hierarchical contrastive learning framework to detect corrupted regions by capturing the intrinsic semantic distinctions between corrupted and uncorrupted regions. In particular, our model detects the corrupted mask in a coarse-to-fine manner by first predicting a coarse mask by contrastive learning in low-resolution feature space and then refining the uncertain area of the mask by high-resolution contrastive learning. A specialized hierarchical interaction mechanism is designed to facilitate the knowledge propagation of contrastive learning in different scales, boosting the modeling performance substantially. The detected multi-scale corruption masks are then leveraged to guide the corruption restoration. Detecting corrupted regions by learning the contrastive distinctions rather than the semantic patterns of corruptions, our model generalizes well across different corruption patterns. Extensive experiments demonstrate the following merits of our model: 1) the superior performance over other methods on both corruption detection and various image restoration tasks including blind inpainting and watermark removal, and 2) strong generalization across different corruption patterns such as graffiti, random noise or other image content. Codes and trained weights are available at https://github.com/xyfJASON/HCL.

Submitted 27 August, 2023; originally announced August 2023.

Comments: ICCV 2023

arXiv:2308.05104 [pdf, other]

Scene-Generalizable Interactive Segmentation of Radiance Fields

Authors: Songlin Tang, Wenjie Pei, Xin Tao, Tanghui Jia, Guangming Lu, Yu-Wing Tai

Abstract: Existing methods for interactive segmentation in radiance fields entail scene-specific optimization and thus cannot generalize across different scenes, which greatly limits their applicability. In this work we make the first attempt at Scene-Generalizable Interactive Segmentation in Radiance Fields (SGISRF) and propose a novel SGISRF method, which can perform 3D object segmentation for novel (unseen) scenes represented by radiance fields, guided by only a few interactive user clicks in a given set of multi-view 2D images. In particular, the proposed SGISRF focuses on addressing three crucial challenges with three specially designed techniques. First, we devise the Cross-Dimension Guidance Propagation to encode the scarce 2D user clicks into informative 3D guidance representations. Second, the Uncertainty-Eliminated 3D Segmentation module is designed to achieve efficient yet effective 3D segmentation. Third, the Concealment-Revealed Supervised Learning scheme is proposed to reveal and correct the concealed 3D segmentation errors resulting from supervision in 2D space with only 2D mask annotations. Extensive experiments on two challenging real-world benchmarks covering diverse scenes demonstrate 1) the effectiveness and scene-generalizability of the proposed method, and 2) favorable performance compared to classical methods requiring scene-specific optimization.

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2308.03529 [pdf, other]

Feature Decoupling-Recycling Network for Fast Interactive Segmentation

Authors: Huimin Zeng, Weinong Wang, Xin Tao, Zhiwei Xiong, Yu-Wing Tai, Wenjie Pei

Abstract: Recent interactive segmentation methods iteratively take the source image, user guidance and previously predicted mask as the input without considering the invariant nature of the source image. As a result, extracting features from the source image is repeated in each interaction, resulting in substantial computational redundancy. In this work, we propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies and then recycles components for each user interaction. Thus, the efficiency of the whole interactive process can be significantly improved. To be specific, we apply the Decoupling-Recycling strategy from three perspectives to address three types of discrepancies, respectively. First, our model decouples the learning of source image semantics from the encoding of user guidance to process two types of input domains separately. Second, FDRN decouples high-level and low-level features from stratified semantic representations to enhance feature learning. Third, during the encoding of user guidance, current user guidance is decoupled from historical guidance to highlight the effect of current user guidance. We conduct extensive experiments on 6 datasets from different domains and modalities, which demonstrate the following merits of our model: 1) higher efficiency than other methods, particularly advantageous in challenging scenarios requiring long-term interactions (up to 4.25x faster), while achieving favorable segmentation performance; 2) strong applicability to various methods, serving as a universal enhancement technique; 3) good cross-task generalizability, e.g., to medical image segmentation, and robustness against misleading user guidance.

Submitted 8 August, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023

arXiv:2308.03177 [pdf, other]

Boosting Few-shot 3D Point Cloud Segmentation via Query-Guided Enhancement

Authors: Zhenhua Ning, Zhuotao Tian, Guangming Lu, Wenjie Pei

Abstract: Although extensive research has been conducted on 3D point cloud segmentation, effectively adapting generic models to novel categories remains a formidable challenge. This paper proposes a novel approach to improve point cloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods that directly utilize categorical information from support prototypes to recognize novel classes in query samples, our method identifies two critical aspects that substantially enhance model performance by reducing contextual gaps between support prototypes and query features. Specifically, we (1) adapt support background prototypes to match query context while removing extraneous cues that may obscure foreground and background in query samples, and (2) holistically rectify support prototypes under the guidance of query features to emulate the latter having no semantic gap to the query targets. Our proposed designs are agnostic to the feature extractor, rendering them readily applicable to any prototype-based methods. The experimental results on S3DIS and ScanNet demonstrate notable practical benefits, as our approach achieves significant improvements while still maintaining high efficiency. The code for our approach is available at https://github.com/AaronNZH/Boosting-Few-shot-3D-Point-Cloud-Segmentation-via-Query-Guided-Enhancement

Submitted 8 August, 2023; v1 submitted 6 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023

arXiv:2306.02461 [pdf, ps, other]

The Semi-implicit DLN Algorithm for the Navier Stokes Equations

Authors: Wenlong Pei

Abstract: Dahlquist, Liniger, and Nevanlinna designed a family of one-leg, two-step methods (the DLN method) that is second order, A- and G-stable for arbitrary, non-uniform time steps. Recently, the implementation of the DLN method has been simplified by a refactorization process (adding time filters to the backward Euler scheme). Due to these fine properties, the DLN method has strong potential for the numerical simulation of time-dependent fluid models. In this report, we propose a semi-implicit DLN algorithm for the Navier Stokes equations (avoiding a non-linear solver at each time step) and prove unconditional, long-term stability and second-order convergence under a moderate time step restriction. Moreover, adaptive DLN algorithms based on the required error or numerical dissipation criterion are presented to balance accuracy and computational cost. Numerical tests are given to support the main conclusions.

Submitted 4 June, 2023; originally announced June 2023.

Comments: 35 pages

arXiv:2303.14384 [pdf, other]

Reliability-Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation

Authors: Zikun Zhou, Kaige Mao, Wenjie Pei, Hongpeng Wang, Yaowei Wang, Zhenyu He

Abstract: This paper aims to solve the video object segmentation (VOS) task in a scribble-supervised manner, in which VOS models are not only trained by the sparse scribble annotations but also initialized with the sparse target scribbles for inference. Thus, the annotation burdens for both training and initialization can be substantially lightened. The difficulties of scribble-supervised VOS lie in two aspects. On the one hand, it requires a powerful ability to learn from the sparse scribble annotations during training. On the other hand, it demands strong reasoning capability during inference given only a sparse initial target scribble. In this work, we propose a Reliability-Hierarchical Memory Network (RHMNet) to predict the target mask in a step-wise expanding strategy w.r.t. the memory reliability level. To be specific, RHMNet first uses only the memory in the high-reliability level to locate the region with high reliability belonging to the target, which is highly similar to the initial target scribble. Then it expands the located high-reliability region to the entire target conditioned on the region itself and the memories in all reliability levels. Besides, we propose a scribble-supervised learning mechanism to facilitate the learning of our model to predict dense results. It mines the pixel-level relation within the single frame and the frame-level relation within the sequence to take full advantage of the scribble annotations in sequence training samples. The favorable performance on two popular benchmarks demonstrates that our method is promising.

Submitted 25 March, 2023; originally announced March 2023.

Comments: This project is available at https://github.com/mkg1204/RHMNet-for-SSVOS

arXiv:2303.07943 [pdf, other]

doi 10.1093/mnras/stad1375

SKA Science Data Challenge 2: analysis and results

Authors: P. Hartley, A. Bonaldi, R. Braun, J. N. H. S. Aditya, S. Aicardi, L. Alegre, A. Chakraborty, X. Chen, S. Choudhuri, A. O. Clarke, J. Coles, J. S. Collinson, D. Cornu, L. Darriba, M. Delli Veneri, J. Forbrich, B. Fraga, A. Galan, J. Garrido, F. Gubanov, H. Håkansson, M. J. Hardcastle, C. Heneka, D. Herranz, K. M. Hess, et al. (83 additional authors not shown)

Abstract: The Square Kilometre Array Observatory (SKAO) will explore the radio sky to new depths in order to conduct transformational science. SKAO data products made available to astronomers will be correspondingly large and complex, requiring the application of advanced analysis techniques to extract key science findings. To this end, SKAO is conducting a series of Science Data Challenges, each designed to familiarise the scientific community with SKAO data and to drive the development of new analysis techniques. We present the results from Science Data Challenge 2 (SDC2), which invited participants to find and characterise 233245 neutral hydrogen (HI) sources in a simulated data product representing a 2000 h SKA MID spectral line observation from redshifts 0.25 to 0.5. Through the generous support of eight international supercomputing facilities, participants were able to undertake the Challenge using dedicated computational resources. Alongside the main challenge, 'reproducibility awards' were made in recognition of those pipelines which demonstrated Open Science best practice. The Challenge saw over 100 participants develop a range of new and existing techniques, with results that highlight the strengths of multidisciplinary and collaborative effort. The winning strategy -- which combined predictions from two independent machine learning techniques to yield a 20 percent improvement in overall performance -- underscores one of the main Challenge outcomes: that of method complementarity. It is likely that the combination of methods in a so-called ensemble approach will be key to exploiting very large astronomical datasets.

Submitted 14 March, 2023; originally announced March 2023.

Comments: Under review by MNRAS; 28 pages, 16 figures

arXiv:2301.06690 [pdf, other]

Audio2Gestures: Generating Diverse Gestures from Audio

Authors: Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao, Zhenyu He

Abstract: People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. So we propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated to the audio while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including relaxed motion loss, bicycle constraint, and diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, quantitatively and qualitatively. Besides, our formulation is compatible with discrete cosine transformation (DCT) modeling and other popular backbones (i.e., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline.

Submitted 16 January, 2023; originally announced January 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2108.06720

arXiv:2212.06106 [pdf, other]

doi 10.1007/JHEP03(2023)144

Sensitivity of Future Tritium Decay Experiments to New Physics

Authors: James A. L. Canning, Frank F. Deppisch, Wenna Pei

Abstract: Tritium beta-decay is the most promising approach to measure the absolute masses of active light neutrinos in the laboratory and in a model-independent fashion. The development of Cyclotron Radiation Emission Spectroscopy techniques and the use of atomic tritium has the potential to improve the current limits by an order of magnitude in future experiments. In this paper, we analyse the potential sensitivity of such future searches to keV-mass sterile neutrinos and exotic interactions of either the active or sterile neutrinos. We calculate the relevant decay distributions in both energy and angle of the emitted electron with respect to a potential polarisation of the tritium, including the interference with the Standard Model case as well as incorporating relevant final state corrections for atomic tritium. We present projected sensitivities on the active-sterile neutrino mixing and effective coupling constants of exotic currents, demonstrating the potential to probe New Physics in tritium experiments.

Submitted 26 March, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: 44 pages, 14 figures, matches accepted version

arXiv:2212.01131 [pdf, other]

Activating the Discriminability of Novel Classes for Few-shot Segmentation

Authors: Dianwen Mei, Wei Zhuo, Jiandong Tian, Guangming Lu, Wenjie Pei

Abstract: Despite the remarkable success of existing methods for few-shot segmentation, there remain two crucial challenges. First, the feature learning for novel classes is suppressed during the training on base classes in that the novel classes are always treated as background. Thus, the semantics of novel classes are not well learned. Second, most existing methods fail to consider the underlying semantic gap between the support and the query resulting from the representative bias of the scarce support samples. To circumvent these two challenges, we propose to activate the discriminability of novel classes explicitly in both the feature encoding stage and the prediction stage for segmentation. In the feature encoding stage, we design the Semantic-Preserving Feature Learning module (SPFL) to first exploit and then retain the latent semantics contained in the whole input image, especially those in the background that belong to novel classes. In the prediction stage for segmentation, we learn a Self-Refined Online Foreground-Background classifier (SROFB), which is able to refine itself using the high-confidence pixels of the query image to facilitate its adaptation to the query image and bridge the support-query semantic gap. Extensive experiments on the PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate the advantages of these two novel designs both quantitatively and qualitatively.

Submitted 2 December, 2022; originally announced December 2022.

arXiv:2211.15143 [pdf, other]

Explaining Deep Convolutional Neural Networks for Image Classification by Evolving Local Interpretable Model-agnostic Explanations

Authors: Bin Wang, Wenbin Pei, Bing Xue, Mengjie Zhang

Abstract: Deep convolutional neural networks have proven their effectiveness, and have been acknowledged as the most dominant method for image classification. However, a severe drawback of deep convolutional neural networks is poor explainability. Unfortunately, in many real-world applications, users need to understand the rationale behind the predictions of deep convolutional neural networks when determining whether they should trust the predictions or not. To resolve this issue, a novel genetic algorithm-based method is proposed for the first time to automatically evolve local explanations that can assist users to assess the rationality of the predictions. Furthermore, the proposed method is model-agnostic, i.e., it can be utilised to explain any deep convolutional neural network model. In the experiments, ResNet is used as an example model to be explained, and the ImageNet dataset is selected as the benchmark dataset. DenseNet and MobileNet are further explained to demonstrate the model-agnostic characteristic of the proposed method. The evolved local explanations on four images, randomly selected from ImageNet, are presented, which show that the evolved local explanations are easy for humans to recognise. Moreover, the evolved explanations can explain the predictions of deep convolutional neural networks on all four images very well by successfully capturing meaningful interpretable features of the sample images. Further analysis based on the 30 runs of the experiments shows that the evolved local explanations can also improve the probabilities/confidences of the deep convolutional neural network models in making the predictions. The proposed method can obtain local explanations within one minute, which is more than ten times faster than LIME (the state-of-the-art method).

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.14705 [pdf, other]

Semantic-Aware Local-Global Vision Transformer

Authors: Jiatong Zhang, Zengwei Yao, Fanglin Chen, Guangming Lu, Wenjie Pei

Abstract: Vision Transformers have achieved remarkable progress, among which Swin Transformer has demonstrated the tremendous potential of Transformer for vision tasks. It surmounts the key challenge of high computational complexity by performing local self-attention within shifted windows. In this work we propose the Semantic-Aware Local-Global Vision Transformer (SALG), to further investigate two potential improvements towards Swin Transformer. First, unlike Swin Transformer, which performs uniform partition to produce regular windows of equal size for local self-attention, our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image. As a result, each segmented region can correspond to a semantically meaningful part in the image, potentially leading to more effective features within each of the segmented regions. Second, instead of only performing local self-attention within local windows as Swin Transformer does, the proposed SALG performs both 1) local intra-region self-attention for learning fine-grained features within each region and 2) global inter-region feature propagation for modeling global dependencies among all regions. Consequently, our model is able to obtain the global view when learning features for each token, which is the essential advantage of Transformer. Owing to the explicit modeling of the semantic priors and the proposed local-global modeling mechanism, our SALG is particularly advantageous for small-scale models when the modeling capacity is not sufficient for other models to learn semantics implicitly. Extensive experiments across various vision tasks demonstrate the merit of our model over other vision Transformers, especially in the small-scale modeling scenarios.

Submitted 26 November, 2022; originally announced November 2022.

arXiv:2210.16834 [pdf, other]

Alleviating the Sample Selection Bias in Few-shot Learning by Removing Projection to the Centroid

Authors: Jing Xu, Xu Luo, Xinglin Pan, Wenjie Pei, Yanan Li, Zenglin Xu

Abstract: Few-shot learning (FSL) targets the generalization of vision models towards unseen tasks without sufficient annotations. Despite the emergence of a number of few-shot learning methods, the sample selection bias problem, i.e., the sensitivity to the limited amount of support data, has not been well understood. In this paper, we find that this problem usually occurs when the positions of support samples are in the vicinity of the task centroid -- the mean of all class centroids in the task. This motivates us to propose an extremely simple feature transformation to alleviate this problem, dubbed Task Centroid Projection Removing (TCPR). TCPR is applied directly to all image features in a given task, aiming at removing the dimension of features along the direction of the task centroid. While the exact task centroid cannot be accurately obtained from limited data, we estimate it using base features that are each similar to one of the support features. Our method effectively prevents features from being too close to the task centroid. Extensive experiments over ten datasets from different domains show that TCPR can reliably improve classification accuracy across various feature extractors, training algorithms and datasets. The code has been made available at https://github.com/KikimorMay/FSL-TCBR.

Submitted 30 October, 2022; originally announced October 2022.

Comments: Accepted at NeurIPS 2022
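
The projection-removal step named in the TCPR abstract above can be illustrated with a minimal sketch. It assumes the transformation is the plain vector operation f' = f - (f·c / ‖c‖²) c, where c is an estimate of the task centroid; the function name `remove_centroid_projection` and the toy values are hypothetical, and the paper's actual pipeline (including how the centroid is estimated from base features and any normalization) may differ.

```python
def remove_centroid_projection(features, centroid):
    """Subtract from each feature vector its component along `centroid`.

    Assumption: `centroid` is an already-estimated task centroid; how it is
    estimated is outside the scope of this sketch.
    """
    norm_sq = sum(c * c for c in centroid)
    if norm_sq == 0.0:
        raise ValueError("centroid must be a non-zero vector")
    result = []
    for f in features:
        # Scalar projection coefficient of f onto the centroid direction.
        scale = sum(fi * ci for fi, ci in zip(f, centroid)) / norm_sq
        result.append([fi - scale * ci for fi, ci in zip(f, centroid)])
    return result


# Toy example: two 2-D features and a centroid along the first axis.
features = [[3.0, 4.0], [1.0, 0.0]]
centroid = [1.0, 0.0]
projected = remove_centroid_projection(features, centroid)
```

After the call, every vector in `projected` is orthogonal to `centroid`, which is the sense in which the dimension along the centroid direction has been removed.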

arXiv:2209.01193 [pdf]

doi 10.1016/j.apsusc.2022.155912

Oxygen dissociation on the C3N monolayer: A first-principles study

Authors: Liang Zhao, Wenjin Luo, Zhijing Huang, Zihan Yan, Hui Jia, Wei Pei, Yusong Tu

Abstract: The oxygen dissociation and the oxidized structure on the pristine C3N monolayer on exposure to air are inevitably critical issues for C3N engineering and surface functionalization, yet they have not been revealed in detail. Using first-principles calculations, we have systematically investigated the possible O2 adsorption sites, various O2 dissociation pathways and the oxidized structures. It is demonstrated that the pristine C3N monolayer shows more O2 physisorption sites and exhibits stronger O2 adsorption than pristine graphene. Among various dissociation pathways, the most preferable one is a two-step process involving an intermediate state with the chemisorbed O2, and the barrier is lower than that on pristine graphene, indicating that the pristine C3N monolayer is more susceptible to oxidation than pristine graphene. Furthermore, we found that the most stable oxidized structure is not produced by the most preferable dissociation pathway but generated from a direct dissociation process. These results can be generalized to a wide range of temperatures and pressures using ab initio atomistic thermodynamics. Our findings deepen the understanding of the chemical stability of 2D crystalline carbon nitrides under ambient conditions, and could provide insights into the tailoring of the surface chemical structures via doping and oxidation.

Submitted 7 December, 2022; v1 submitted 2 September, 2022; originally announced September 2022.

Comments: 23 pages, 8 figures

arXiv:2208.14093 [pdf, other]

SSORN: Self-Supervised Outlier Removal Network for Robust Homography Estimation

Authors: Yi Li, Wenjie Pei, Zhenyu He

Abstract: The traditional homography estimation pipeline consists of four main steps: feature detection, feature matching, outlier removal and transformation estimation. Recent deep learning models intend to address the homography estimation problem using a single convolutional network. While these models are trained in an end-to-end fashion to simplify the homography estimation problem, they lack the feature matching step and/or the outlier removal step, which are important steps in the traditional homography estimation pipeline. In this paper, we attempt to build a deep learning model that mimics all four steps in the traditional homography estimation pipeline. In particular, the feature matching step is implemented using the cost volume technique. To remove outliers in the cost volume, we treat this outlier removal problem as a denoising problem and propose a novel self-supervised loss to solve the problem. Extensive experiments on synthetic and real datasets demonstrate that the proposed model outperforms existing deep learning models.

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2208.06162 [pdf, other]

Layout-Bridging Text-to-Image Synthesis

Authors: Jiadong Liang, Wenjie Pei, Feng Lu

Abstract: The crux of text-to-image synthesis stems from the difficulty of preserving the cross-modality semantic consistency between the input text and the synthesized image. Typical methods, which seek to model the text-to-image mapping directly, can only capture keywords in the text that indicate common objects or actions but fail to learn their spatial distribution patterns. An effective way to circumvent this limitation is to generate an image layout as guidance, which is attempted by a few methods. Nevertheless, these methods fail to generate practically effective layouts due to the diversity of input text and object location. In this paper we push for effective modeling in both text-to-layout generation and layout-to-image synthesis. Specifically, we formulate the text-to-layout generation as a sequence-to-sequence modeling task, and build our model upon Transformer to learn the spatial relationships between objects by modeling the sequential dependencies between them. In the stage of layout-to-image synthesis, we focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesizing process. To evaluate the quality of the generated layout, we design a new metric, dubbed Layout Quality Score, which considers both the absolute distribution errors of bounding boxes in the layout and the mutual spatial relationships between them. Extensive experiments on three datasets demonstrate the superior performance of our method over state-of-the-art methods on both predicting the layout and synthesizing the image from the given text.

Submitted 12 August, 2022; originally announced August 2022.

arXiv:2207.12941 [pdf, other]

Learning Generalizable Latent Representations for Novel Degradations in Super Resolution

Authors: Fengjun Li, Xin Feng, Fanglin Chen, Guangming Lu, Wenjie Pei

Abstract: Typical methods for blind image super-resolution (SR) focus on dealing with unknown degradations by directly estimating them or learning the degradation representations in a latent space. A potential limitation of these methods is that they assume the unknown degradations can be simulated by the integration of various handcrafted degradations (e.g., bicubic downsampling), which is not necessarily true. The real-world degradations can be beyond the simulation scope by the handcrafted degradations, which are referred to as novel degradations. In this work, we propose to learn a latent representation space for degradations, which can be generalized from handcrafted (base) degradations to novel degradations. The obtained representations for a novel degradation in this latent space are then leveraged to generate degraded images consistent with the novel degradation to compose paired training data for SR model. Furthermore, we perform variational inference to match the posterior of degradations in latent representation space with a prior distribution (e.g., Gaussian distribution). Consequently, we are able to sample more high-quality representations for a novel degradation to augment the training data for SR model. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness and advantages of our method for blind super-resolution with novel degradations.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2207.12049 [pdf, other]

Few-Shot Object Detection by Knowledge Distillation Using Bag-of-Visual-Words Representations

Authors: Wenjie Pei, Shuang Wu, Dianwen Mei, Fanglin Chen, Jiandong Tian, Guangming Lu

Abstract: While fine-tuning based methods for few-shot object detection have achieved remarkable progress, a crucial challenge that has not been addressed well is the potential class-specific overfitting on base classes and sample-specific overfitting on novel classes. In this work we design a novel knowledge distillation framework to guide the learning of the object detector and thereby restrain the overfitting in both the pre-training stage on base classes and fine-tuning stage on novel classes. To be specific, we first present a novel Position-Aware Bag-of-Visual-Words model for learning a representative bag of visual words (BoVW) from a limited size of image set, which is used to encode general images based on the similarities between the learned visual words and an image. Then we perform knowledge distillation based on the fact that an image should have consistent BoVW representations in two different feature spaces. To this end, we pre-learn a feature space independently from the object detection, and encode images using BoVW in this space. The obtained BoVW representation for an image can be considered as distilled knowledge to guide the learning of object detector: the extracted features by the object detector for the same image are expected to derive the consistent BoVW representations with the distilled knowledge. Extensive experiments validate the effectiveness of our method and demonstrate the superiority over other state-of-the-art methods.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2207.11549 [pdf, other]

Self-Support Few-Shot Semantic Segmentation

Authors: Qi Fan, Wenjie Pei, Yu-Wing Tai, Chi-Keung Tang

Abstract: Existing few-shot segmentation methods have achieved great progress based on the support-query matching framework. But they still heavily suffer from the limited coverage of intra-class variations from the few-shot supports provided. Motivated by the simple Gestalt principle that pixels belonging to the same object are more similar than those to different objects of same class, we propose a novel self-support matching strategy to alleviate this problem, which uses query prototypes to match query features, where the query prototypes are collected from high-confidence query predictions. This strategy can effectively capture the consistent underlying characteristics of the query objects, and thus fittingly match query features. We also propose an adaptive self-support background prototype generation module and self-support loss to further facilitate the self-support matching procedure. Our self-support network substantially improves the prototype quality, benefits more improvement from stronger backbones and more supports, and achieves SOTA on multiple datasets. Codes are at https://github.com/fanq15/SSP.

Submitted 23 July, 2022; originally announced July 2022.

Comments: ECCV 2022

arXiv:2207.11184 [pdf, other]

Multi-Faceted Distillation of Base-Novel Commonality for Few-shot Object Detection

Authors: Shuang Wu, Wenjie Pei, Dianwen Mei, Fanglin Chen, Jiandong Tian, Guangming Lu

Abstract: Most existing methods for few-shot object detection follow the fine-tuning paradigm, which potentially assumes that the class-agnostic generalizable knowledge can be learned and transferred implicitly from base classes with abundant samples to novel classes with limited samples via such a two-stage training strategy. However, it is not necessarily true since the object detector can hardly distinguish between class-agnostic knowledge and class-specific knowledge automatically without explicit modeling. In this work we propose to learn three types of class-agnostic commonalities between base and novel classes explicitly: recognition-related semantic commonalities, localization-related semantic commonalities and distribution commonalities. We design a unified distillation framework based on a memory bank, which is able to perform distillation of all three types of commonalities jointly and efficiently. Extensive experiments demonstrate that our method can be readily integrated into most existing fine-tuning based methods and consistently improve the performance by a large margin.

Submitted 3 November, 2022; v1 submitted 22 July, 2022; originally announced July 2022.

Comments: Accepted to ECCV 2022

arXiv:2207.09710 [pdf, other]

Learning Sequence Representations by Non-local Recurrent Neural Memory

Authors: Wenjie Pei, Xin Feng, Canmiao Fu, Qiong Cao, Guangming Lu, Yu-Wing Tai

Abstract: The key challenge of sequence representation learning is to capture the long-range temporal dependencies. Typical methods for supervised sequence representation learning are built upon recurrent neural networks to capture temporal dependencies. One potential limitation of these methods is that they only model one-order information interactions explicitly between adjacent time steps in a sequence, hence the high-order interactions between nonadjacent time steps are not fully exploited. It greatly limits the capability of modeling the long-range temporal dependencies since the temporal features learned by one-order interactions cannot be maintained for a long term due to temporal information dilution and gradient vanishing. To tackle this limitation, we propose the Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning, which performs non-local operations by means of self-attention mechanism to learn full-order interactions within a sliding temporal memory block and models global interactions between memory blocks in a gated recurrent manner. Consequently, our model is able to capture long-range dependencies. Besides, the latent high-level features contained in high-order interactions can be distilled by our model. We validate the effectiveness and generalization of our NRNM on three types of sequence applications across different modalities, including sequence classification, step-wise sequential prediction and sequence similarity learning. Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.

Submitted 20 July, 2022; originally announced July 2022.

Comments: To appear in International Journal of Computer Vision (IJCV). arXiv admin note: substantial text overlap with arXiv:1908.09535

arXiv:2207.08808 [pdf, other]

Global-Local Stepwise Generative Network for Ultra High-Resolution Image Restoration

Authors: Xin Feng, Haobo Ji, Wenjie Pei, Fanglin Chen, Guangming Lu

Abstract: While the research on image background restoration from regular size of degraded images has achieved remarkable progress, restoring ultra high-resolution (e.g., 4K) images remains an extremely challenging task due to the explosion of computational complexity and memory usage, as well as the deficiency of annotated data. In this paper we present a novel model for ultra high-resolution image restoration, referred to as the Global-Local Stepwise Generative Network (GLSGN), which employs a stepwise restoring strategy involving four restoring pathways: three local pathways and one global pathway. The local pathways focus on conducting image restoration in a fine-grained manner over local but high-resolution image patches, while the global pathway performs image restoration coarsely on the scale-down but intact image to provide cues for the local pathways in a global view including semantics and noise patterns. To smooth the mutual collaboration between these four pathways, our GLSGN is designed to ensure the inter-pathway consistency in four aspects in terms of low-level content, perceptual attention, restoring intensity and high-level semantics, respectively. As another major contribution of this work, we also introduce the first ultra high-resolution dataset to date for both reflection removal and rain streak removal, comprising 4,670 real-world and synthetic images. Extensive experiments across three typical tasks for image background restoration, including image reflection removal, image rain streak removal and image dehazing, show that our GLSGN consistently outperforms state-of-the-art methods.

Submitted 17 May, 2023; v1 submitted 16 July, 2022; originally announced July 2022.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2207.07253 [pdf, other]

Single Shot Self-Reliant Scene Text Spotter by Decoupled yet Collaborative Detection and Recognition

Authors: Jingjing Wu, Pengyuan Lyu, Guangming Lu, Chengquan Zhang, Wenjie Pei

Abstract: Typical text spotters follow the two-stage spotting paradigm which detects the boundary for a text instance first and then performs text recognition within the detected regions. Despite the remarkable progress of such spotting paradigm, an important limitation is that the performance of text recognition depends heavily on the precision of text detection, resulting in the potential error propagation from detection to recognition. In this work, we propose the single shot Self-Reliant Scene Text Spotter v2 (SRSTS v2), which circumvents this limitation by decoupling recognition from detection while optimizing two tasks collaboratively. Specifically, our SRSTS v2 samples representative feature points around each potential text instance, and conducts both text detection and recognition in parallel guided by these sampled points. Thus, the text recognition is no longer dependent on detection, thereby alleviating the error propagation from detection to recognition. Moreover, the sampling module is learned under the supervision from both detection and recognition, which allows for the collaborative optimization and mutual enhancement between two tasks. Benefiting from such sampling-driven concurrent spotting framework, our approach is able to recognize the text instances correctly even if the precise text boundaries are challenging to detect. Extensive experiments on four benchmarks demonstrate that our method compares favorably to state-of-the-art spotters.

Submitted 7 February, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

arXiv:2205.14815 [pdf, other]

doi 10.3847/1538-3881/ac746e

Influence of the Gaia-Sausage-Enceladus on the density shape of the Galactic stellar halo revealed by halo K giants from the LAMOST survey

Authors: Wenbo Wu, Gang Zhao, Xiang-Xiang Xue, Wenxiang Pei, Chengqun Yang

Abstract: We present a study of the influence of the Gaia-Sausage-Enceladus (GSE) on the density shape of the Galactic stellar halo using 11624 K giants from the LAMOST survey. Every star is assigned a probability of being a member of the GSE based on its spherical velocities and metallicity by a Gaussian Mixture Model. We divide the stellar halo into two parts by the obtained probabilities, of which one is composed of the GSE members and defined as the GSE-related halo, and the other one is referred to as the GSE-removed halo. Using a non-parametric method, the radial number density profiles of the two stellar halos can be well described by a single power law with a variable flattening $q$ ($r = \sqrt{R^2+[Z/q(r)]^2}$, $\nu = \nu_0 r^{-\alpha}$). The index $\alpha$ is $4.92\pm0.12$ for the GSE-related halo and $4.25\pm0.14$ for the GSE-removed halo. Both stellar halos are vertically flattened at smaller radii but become more spherical at larger radii. We find that the GSE-related halo is less vertically flattened than the GSE-removed halo, and the difference of $q$ between the two stellar halos ranges from 0.07 to 0.15. However, after the consideration of the bias, it is thought to be within 0.08 at most of the radii. Finally, we compare our results with two Milky Way analogues which experience a significant major merger in the TNG50 simulation. The study of the two analogues also shows that the major merger-related stellar halo has a smaller ellipticity than the major merger-removed stellar halo.

Submitted 29 May, 2022; originally announced May 2022.

Comments: 19 Pages, 14 figures, Accepted by AJ
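
The density profile quoted in the abstract above, $r = \sqrt{R^2+[Z/q(r)]^2}$ with $\nu = \nu_0 r^{-\alpha}$, can be evaluated with a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, the flattening `q` is treated as a constant (the abstract describes it as varying with radius), and `nu0 = 1.0` is an arbitrary normalization.

```python
import math


def flattened_radius(R, Z, q):
    """Flattened radius r = sqrt(R^2 + (Z/q)^2).

    R, Z are cylindrical Galactocentric coordinates; q is the flattening.
    Assumption: q is passed as a constant here, whereas the abstract
    describes a flattening q(r) that varies with radius.
    """
    return math.sqrt(R ** 2 + (Z / q) ** 2)


def halo_density(R, Z, q, alpha, nu0=1.0):
    """Single power-law number density nu = nu0 * r^(-alpha).

    nu0 = 1.0 is an arbitrary normalization chosen for illustration.
    """
    r = flattened_radius(R, Z, q)
    return nu0 * r ** (-alpha)


if __name__ == "__main__":
    # Power-law indices quoted in the abstract for the two halo components.
    for label, alpha in [("GSE-related", 4.92), ("GSE-removed", 4.25)]:
        # The position (R, Z) and the value of q are made up for the demo.
        print(label, halo_density(R=8.0, Z=5.0, q=0.8, alpha=alpha))
```

With `q < 1` a star at fixed `(R, Z)` with `Z != 0` gets a larger `r` than in the spherical case `q = 1`, so the model density there is lower; that is what a vertically flattened halo means in this parameterization.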

arXiv:2203.16092 [pdf, other]

Global Tracking via Ensemble of Local Trackers

Authors: Zikun Zhou, Jianqiu Chen, Wenjie Pei, Kaige Mao, Hongpeng Wang, Zhenyu He

Abstract: The crux of long-term tracking lies in the difficulty of tracking the target with discontinuous moving caused by out-of-view or occlusion. Existing long-term tracking methods follow two typical strategies. The first strategy employs a local tracker to perform smooth tracking and uses another re-detector to detect the target when the target is lost. While it can exploit the temporal context like historical appearances and locations of the target, a potential limitation of such strategy is that the local tracker tends to misidentify a nearby distractor as the target instead of activating the re-detector when the real target is out of view. The other long-term tracking strategy tracks the target in the entire image globally instead of local tracking based on the previous tracking results. Unfortunately, such global tracking strategy cannot leverage the temporal context effectively. In this work, we combine the advantages of both strategies: tracking the target in a global view while exploiting the temporal context. Specifically, we perform global tracking via ensemble of local trackers spreading the full image. The smooth moving of the target can be handled steadily by one local tracker. When the local tracker accidentally loses the target due to suddenly discontinuous moving, another local tracker close to the target is then activated and can readily take over the tracking to locate the target. While the activated local tracker performs tracking locally by leveraging the temporal context, the ensemble of local trackers renders our model the global view for tracking. Extensive experiments on six datasets demonstrate that our method performs favorably against state-of-the-art algorithms.

Submitted 30 March, 2022; originally announced March 2022.

Comments: 10 pages; 6 figures; accepted to CVPR2022

arXiv:2202.12471 [pdf, other]

Laboratory observation of plasmoid-dominated magnetic reconnection in hybrid collisional-collisionless regime

Authors: Z. H. Zhao, H. H. An, Y. Xie, Z. Lei, W. P. Yao, W. Q. Yuan, J. Xiong, C. Wang, J. J. Ye, Z. Y. Xie, Z. H. Fang, A. L. Lei, W. B. Pei, X. T. He, W. M. Zhou, W. Wang, S. P. Zhu, B. Qiao

Abstract: Magnetic reconnection, breaking and reorganization of magnetic field topology, is a fundamental process for rapid release of magnetic energy into plasma particles that occurs pervasively throughout the universe. In most natural circumstances, the plasma properties on either side of the reconnection layer are asymmetric, in particular for the collision rates that are associated with a combination of density and temperature and critically determine the reconnection mechanism. To date, all laboratory experiments on magnetic reconnections have been limited to purely collisional or collisionless regimes. Here, we report a well-designed experimental investigation on asymmetric magnetic reconnections in a novel hybrid collisional-collisionless regime by interactions between laser-ablated Cu and CH plasmas. We show that the growth rate of the tearing instability in such a hybrid regime is still extremely large, resulting in rapid formation of multiple plasmoids, lower than that in the purely collisionless regime but much higher than the collisional case. In addition, we, for the first time, directly observe the topology evolutions of the whole process of plasmoid-dominated magnetic reconnections by using highly-resolved proton radiography.

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2112.07224 [pdf, other]

Exploring Category-correlated Feature for Few-shot Image Classification

Authors: Jing Xu, Xinglin Pan, Xu Luo, Wenjie Pei, Zenglin Xu

Abstract: Few-shot classification aims to adapt classifiers to novel classes with a few training samples. However, the insufficiency of training data may cause a biased estimation of feature distribution in a certain class. To alleviate this problem, we present a simple yet effective feature rectification method by exploring the category correlation between novel and base classes as the prior knowledge. We explicitly capture such correlation by mapping features into a latent vector with dimension matching the number of base classes, treating it as the logarithm probability of the feature over base classes. Based on this latent vector, the rectified feature is directly constructed by a decoder, which we expect to maintain category-related information while removing other stochastic factors, and consequently to be closer to its class centroid. Furthermore, by changing the temperature value in softmax, we can re-balance the feature rectification and reconstruction for better performance. Our method is generic, flexible and agnostic to any feature extractor and classifier, and can be readily embedded into existing FSL approaches. Experiments verify that our method is capable of rectifying biased features, especially when the feature is far from the class centroid. The proposed approach consistently obtains considerable performance gains on three widely used benchmarks, evaluated with different backbones and classifiers. The code will be made public.

Submitted 14 December, 2021; originally announced December 2021.

Comments: 10 pages, 9 figures
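
The abstract above mentions changing the temperature value in softmax to re-balance rectification and reconstruction. The general mechanism of a temperature-scaled softmax can be sketched as follows. This is a generic illustration, not the paper's implementation: the function name and the example logits are hypothetical, and how the resulting distribution feeds the decoder is not specified in the abstract.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities.

    A low temperature sharpens the distribution toward the largest logit;
    a high temperature flattens it toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical latent vector over three base classes.
    latent = [2.0, 1.0, 0.0]
    for t in (0.5, 1.0, 5.0):
        print(t, softmax_with_temperature(latent, t))
```

In this sketch a smaller temperature concentrates probability mass on the most correlated base class, while a larger one spreads it across all base classes.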

arXiv:2112.06467 [pdf, other]

An Informative Tracking Benchmark

Authors: Xin Li, Qiao Liu, Wenjie Pei, Qiuhong Shen, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang

Abstract: Along with the rapid progress of visual tracking, existing benchmarks become less informative due to redundancy of samples and weak discrimination between current trackers, making evaluations on all datasets extremely time-consuming. Thus, a small and informative benchmark, which covers all typical challenging scenarios to facilitate assessing the tracker performance, is of great interest. In this work, we develop a principled way to construct a small and informative tracking benchmark (ITB) with 7% out of 1.2 M frames of existing and newly collected datasets, which enables efficient evaluation while ensuring effectiveness. Specifically, we first design a quality assessment mechanism to select the most informative sequences from existing benchmarks taking into account 1) challenging level, 2) discriminative strength, and 3) density of appearance variations. Furthermore, we collect additional sequences to ensure the diversity and balance of tracking scenarios, leading to a total of 20 sequences for each scenario. By analyzing the results of 15 state-of-the-art trackers re-trained on the same data, we determine the effective methods for robust tracking under each scenario and demonstrate new challenges for future research direction in this field.

Submitted 13 December, 2021; originally announced December 2021.

Comments: 10 pages, 6 figures

arXiv:2112.02279 [pdf, other]

U2-Former: A Nested U-shaped Transformer for Image Restoration

Authors: Haobo Ji, Xin Feng, Wenjie Pei, Jinxing Li, Guangming Lu

Abstract: While Transformer has achieved remarkable performance in various high-level vision tasks, it is still challenging to exploit the full potential of Transformer in image restoration. The crux lies in the limited depth of applying Transformer in the typical encoder-decoder framework for image restoration, resulting from heavy self-attention computation load and inefficient communications across different depth (scales) of layers. In this paper, we present a deep and effective Transformer-based network for image restoration, termed as U2-Former, which is able to employ Transformer as the core operation to perform image restoration in a deep encoding and decoding space. Specifically, it leverages the nested U-shaped structure to facilitate the interactions across different layers with different scales of feature maps. Furthermore, we optimize the computational efficiency for the basic Transformer block by introducing a feature-filtering mechanism to compress the token representation. Apart from the typical supervision ways for image restoration, our U2-Former also performs contrastive learning in multiple aspects to further decouple the noise component from the background image. Extensive experiments on various image restoration tasks, including reflection removal, rain streak removal and dehazing respectively, demonstrate the effectiveness of the proposed U2-Former.

Submitted 8 December, 2021; v1 submitted 4 December, 2021; originally announced December 2021.

arXiv:2111.08974 [pdf, other]

Pedestrian Detection by Exemplar-Guided Contrastive Learning

Authors: Zebin Lin, Wenjie Pei, Fanglin Chen, David Zhang, Guangming Lu

Abstract: Typical methods for pedestrian detection focus on either tackling mutual occlusions between crowded pedestrians, or dealing with the various scales of pedestrians. Detecting pedestrians with substantial appearance diversities such as different pedestrian silhouettes, different viewpoints or different dressing, remains a crucial challenge. Instead of learning each of these diverse pedestrian appearance features individually as most existing methods do, we propose to perform contrastive learning to guide the feature learning in such a way that the semantic distance between pedestrians with different appearances in the learned feature space is minimized to eliminate the appearance diversities, whilst the distance between pedestrians and background is maximized. To facilitate the efficiency and effectiveness of contrastive learning, we construct an exemplar dictionary with representative pedestrian appearances as prior knowledge to construct effective contrastive training pairs and thus guide contrastive learning. Besides, the constructed exemplar dictionary is further leveraged to evaluate the quality of pedestrian proposals during inference by measuring the semantic distance between the proposal and the exemplar dictionary. Extensive experiments on both daytime and nighttime pedestrian detection validate the effectiveness of the proposed method.

Submitted 9 July, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

arXiv:2111.04901 [pdf, other]

Label-Aware Distribution Calibration for Long-tailed Classification

Authors: Chaozheng Wang, Shuzheng Gao, Cuiyun Gao, Pengyun Wang, Wenjie Pei, Lujia Pan, Zenglin Xu

Abstract: Real-world data usually present long-tailed distributions. Training on imbalanced data tends to make neural networks perform well on head classes while much worse on tail classes. The severe sparseness of training instances for the tail classes is the main challenge, which results in biased distribution estimation during training. Plenty of efforts have been devoted to ameliorating the challenge, including data re-sampling and synthesizing new training instances for tail classes. However, no prior research has exploited the transferable knowledge from head classes to tail classes for calibrating the distribution of tail classes. In this paper, we suppose that tail classes can be enriched by similar head classes and propose a novel distribution calibration approach named Label-Aware Distribution Calibration (LADC). LADC transfers the statistics from relevant head classes to infer the distribution of tail classes. Sampling from calibrated distribution further facilitates re-balancing the classifier. Experiments on both image and text long-tailed datasets demonstrate that LADC significantly outperforms existing methods. The visualization also shows that LADC provides a more accurate distribution estimation.

Submitted 8 November, 2021; originally announced November 2021.

Comments: 9 pages

arXiv:2110.04791 [pdf, other]

doi 10.1109/TASLP.2022.3140556

Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain

Authors: Zengwei Yao, Wenjie Pei, Fanglin Chen, Guangming Lu, David Zhang

Abstract: The crux of single-channel speech separation is how to encode the mixture of signals into such a latent embedding space that the signals from different speakers can be precisely separated. Existing methods for speech separation either transform the speech signals into frequency domain to perform separation or seek to learn a separable embedding space by constructing a latent domain based on convolutional filters. While the latter type of methods learning an embedding space achieves substantial improvement for speech separation, we argue that the embedding space defined by only one latent domain does not suffice to provide a thoroughly separable encoding space for speech separation. In this paper, we propose the Stepwise-Refining Speech Separation Network (SRSSN), which follows a coarse-to-fine separation framework. It first learns a 1-order latent domain to define an encoding space and thereby performs a rough separation in the coarse phase. Then the proposed SRSSN learns a new latent domain along each basis function of the existing latent domain to obtain a high-order latent domain in the refining phase, which enables our model to perform a refining separation to achieve a more precise speech separation. We demonstrate the effectiveness of our SRSSN by conducting extensive experiments, including speech separation in a clean (noise-free) setting on WSJ0-2/3mix datasets as well as in noisy/reverberant settings on WHAM!/WHAMR! datasets. Furthermore, we also perform experiments of speech recognition on separated speech signals by our model to evaluate the performance of speech separation indirectly.

Submitted 31 January, 2022; v1 submitted 10 October, 2021; originally announced October 2021.

Comments: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

arXiv:2110.00261 [pdf, other]

Generative Memory-Guided Semantic Reasoning Model for Image Inpainting

Authors: Xin Feng, Wenjie Pei, Fengjun Li, Fanglin Chen, David Zhang, Guangming Lu

Abstract: Most existing methods for image inpainting focus on learning the intra-image priors from the known regions of the current input image to infer the content of the corrupted regions in the same image. While such methods perform well on images with small corrupted regions, it is challenging for these methods to deal with images with large corrupted areas due to two potential limitations: 1) such methods tend to overfit each single training pair of images relying solely on the intra-image prior knowledge learned from the limited known area; 2) the inter-image prior knowledge about the general distribution patterns of visual semantics, which can be transferred across images sharing similar semantics, is not exploited. In this paper, we propose the Generative Memory-Guided Semantic Reasoning Model (GM-SRM), which not only learns the intra-image priors from the known regions, but also distills the inter-image reasoning priors to infer the content of the corrupted regions. In particular, the proposed GM-SRM first pre-learns a generative memory from the whole training data to capture the semantic distribution patterns in a global view. Then the learned memory is leveraged to retrieve the matching inter-image priors for the current corrupted image to perform semantic reasoning during image inpainting. While the intra-image priors are used for guaranteeing the pixel-level content consistency, the inter-image priors are favorable for performing high-level semantic reasoning, which is particularly effective for inferring semantic content for large corrupted areas. Extensive experiments on Paris Street View, CelebA-HQ, and Places2 benchmarks demonstrate that our GM-SRM outperforms the state-of-the-art methods for image inpainting in terms of both the visual quality and quantitative metrics.

Submitted 20 March, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: 13 pages, 10 figures

arXiv:2108.09339 [pdf, ps, other]

Refactorization of a variable step, unconditionally stable method of Dahlquist, Liniger and Nevanlinna

Authors: William Layton, Wenlong Pei, Catalin Trenchea

Abstract: The one-leg, two-step time-stepping scheme proposed by Dahlquist, Liniger and Nevanlinna has clear advantages in complex, stiff numerical simulations: unconditional $G$-stability for variable time-steps and second-order accuracy. Yet it has been underutilized due, partially, to its complexity of direct implementation. We prove herein that this method is equivalent to the backward Euler method with pre- and post arithmetic steps added. This refactorization eases implementation in complex, possibly legacy codes. The realization we develop reduces complexity, including cognitive complexity and increases accuracy over both first order methods and constant time steps second order methods.

Submitted 20 August, 2021; originally announced August 2021.

arXiv:2108.06720 [pdf, other]

Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Authors: Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, Linchao Bao

Abstract: Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating random sampling along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.

Submitted 15 August, 2021; originally announced August 2021.

arXiv:2108.03637 [pdf, other]

Saliency-Associated Object Tracking

Authors: Zikun Zhou, Wenjie Pei, Xin Li, Hongpeng Wang, Feng Zheng, Zhenyu He

Abstract: Most existing trackers based on deep learning perform tracking in a holistic strategy, which aims to learn deep representations of the whole target for localizing the target. It is arduous for such methods to track targets with various appearance variations. To address this limitation, another type of methods adopts a part-based tracking strategy which divides the target into equal patches and tracks all these patches in parallel. The target state is inferred by summarizing the tracking results of these patches. A potential limitation of such trackers is that not all patches are equally informative for tracking. Some patches that are not discriminative may have adverse effects. In this paper, we propose to track the salient local parts of the target that are discriminative for tracking. In particular, we propose a fine-grained saliency mining module to capture the local saliencies. Further, we design a saliency-association modeling module to associate the captured saliencies together to learn effective correlation representations between the exemplar and the search image for state estimation. Extensive experiments on five diverse datasets demonstrate that the proposed method performs favorably against state-of-the-art trackers.

Submitted 8 August, 2021; originally announced August 2021.

Comments: Accepted by ICCV 2021

arXiv:2106.10900 [pdf, other]

Self-Supervised Tracking via Target-Aware Data Synthesis

Authors: Xin Li, Wenjie Pei, Yaowei Wang, Zhenyu He, Huchuan Lu, Ming-Hsuan Yang

Abstract: While deep-learning based tracking methods have achieved substantial progress, they entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised learning for visual tracking. In this work, we develop the Crop-Transform-Paste operation, which is able to synthesize sufficient training data by simulating various appearance variations during tracking, including appearance variations of objects and background interference. Since the target state is known in all synthesized data, existing deep trackers can be trained in routine ways using the synthesized data without human annotation. The proposed target-aware data-synthesis method adapts existing tracking approaches within a self-supervised learning framework without algorithmic changes. Thus, the proposed self-supervised learning mechanism can be seamlessly integrated into existing tracking frameworks to perform training. Extensive experiments show that our method 1) achieves favorable performance against supervised learning schemes under the cases with limited annotations; 2) helps deal with various tracking challenges such as object deformation, occlusion, or background clutter due to its manipulability; 3) performs favorably against state-of-the-art unsupervised tracking methods; 4) boosts the performance of various state-of-the-art supervised learning frameworks, including SiamRPN++, DiMP, and TransT.

Submitted 30 December, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

Comments: 11 pages, 7 figures, Accepted by IEEE Transactions on Neural Networks and Learning Systems

Showing 1–50 of 83 results for author: Pei, W