Search | arXiv e-print repository

arXiv:2406.19859 [pdf, other]

MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, **gdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, **-Peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G. Hauptmann

Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition… ▽ More MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt

arXiv:2405.20608 [pdf, other]

Identifying while Learning for Document Event Causality Identification

Authors: Cheng Liu, Wei Xiang, Bang Wang

Abstract: Event Causality Identification (ECI) aims to detect whether there exists a causal relation between two events in a document. Existing studies adopt a kind of identifying after learning paradigm, where events' representations are first learned and then used for the identification. Furthermore, they mainly focus on the causality existence, but ignoring causal direction. In this paper, we take care o… ▽ More Event Causality Identification (ECI) aims to detect whether there exists a causal relation between two events in a document. Existing studies adopt a kind of identifying after learning paradigm, where events' representations are first learned and then used for the identification. Furthermore, they mainly focus on the causality existence, but ignoring causal direction. In this paper, we take care of the causal direction and propose a new identifying while learning mode for the ECI task. We argue that a few causal relations can be easily identified with high confidence, and the directionality and structure of these identified causalities can be utilized to update events' representations for boosting next round of causality identification. To this end, this paper designs an *iterative learning and identifying framework*: In each iteration, we construct an event causality graph, on which events' causal structure representations are updated for boosting causal identification. Experiments on two public datasets show that our approach outperforms the state-of-the-art algorithms in both evaluations for causality existence identification and direction identification. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: Accepted at ACL 2024

arXiv:2405.18974 [pdf, other]

Encoding Hierarchical Schema via Concept Flow for Multifaceted Ideology Detection

Authors: Songtao Liu, Bang Wang, Wei Xiang, Han Xu, Minghua Xu

Abstract: Multifaceted ideology detection (MID) aims to detect the ideological leanings of texts towards multiple facets. Previous studies on ideology detection mainly focus on one generic facet and ignore label semantics and explanatory descriptions of ideologies, which are a kind of instructive information and reveal the specific concepts of ideologies. In this paper, we develop a novel concept semantics-… ▽ More Multifaceted ideology detection (MID) aims to detect the ideological leanings of texts towards multiple facets. Previous studies on ideology detection mainly focus on one generic facet and ignore label semantics and explanatory descriptions of ideologies, which are a kind of instructive information and reveal the specific concepts of ideologies. In this paper, we develop a novel concept semantics-enhanced framework for the MID task. Specifically, we propose a bidirectional iterative concept flow (BICo) method to encode multifaceted ideologies. BICo enables the concepts to flow across levels of the schema tree and enriches concept representations with multi-granularity semantics. Furthermore, we explore concept attentive matching and concept-guided contrastive learning strategies to guide the model to capture ideology features with the learned concept semantics. Extensive experiments on the benchmark dataset show that our approach achieves state-of-the-art performance in MID, including in the cross-topic scenario. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: 13pages, 4 figures (Accepted to Findings of ACL 2024)

arXiv:2405.10512 [pdf, other]

In-context Contrastive Learning for Event Causality Identification

Authors: Chao Liang, Wei Xiang, Bang Wang

Abstract: Event Causality Identification (ECI) aims at determining the existence of a causal relation between two events. Although recent prompt learning-based approaches have shown promising improvements on the ECI task, their performance are often subject to the delicate design of multiple prompts and the positive correlations between the main task and derivate tasks. The in-context learning paradigm prov… ▽ More Event Causality Identification (ECI) aims at determining the existence of a causal relation between two events. Although recent prompt learning-based approaches have shown promising improvements on the ECI task, their performance are often subject to the delicate design of multiple prompts and the positive correlations between the main task and derivate tasks. The in-context learning paradigm provides explicit guidance for label prediction in the prompt learning paradigm, alleviating its reliance on complex prompts and derivative tasks. However, it does not distinguish between positive and negative demonstrations for analogy learning. Motivated from such considerations, this paper proposes an In-Context Contrastive Learning (ICCL) model that utilizes contrastive learning to enhance the effectiveness of both positive and negative demonstrations. Additionally, we apply contrastive learning to event pairs to better facilitate event causality identification. Our ICCL is evaluated on the widely used corpora, including the EventStoryLine and Causal-TimeBank, and results show significant performance improvements over the state-of-the-art algorithms. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09985 [pdf, other]

VirtualModel: Generating Object-ID-retentive Human-object Interaction Image by Diffusion Model for E-commerce Marketing

Authors: Binghui Chen, Chongyang Zhong, Wangmeng Xiang, Yifeng Geng, Xuansong Xie

Abstract: Due to the significant advances in large-scale text-to-image generation by diffusion model (DM), controllable human image generation has been attracting much attention recently. Existing works, such as Controlnet [36], T2I-adapter [20] and HumanSD [10] have demonstrated good abilities in generating human images based on pose conditions, they still fail to meet the requirements of real e-commerce s… ▽ More Due to the significant advances in large-scale text-to-image generation by diffusion model (DM), controllable human image generation has been attracting much attention recently. Existing works, such as Controlnet [36], T2I-adapter [20] and HumanSD [10] have demonstrated good abilities in generating human images based on pose conditions, they still fail to meet the requirements of real e-commerce scenarios. These include (1) the interaction between the shown product and human should be considered, (2) human parts like face/hand/arm/foot and the interaction between human model and product should be hyper-realistic, and (3) the identity of the product shown in advertising should be exactly consistent with the product itself. To this end, in this paper, we first define a new human image generation task for e-commerce marketing, i.e., Object-ID-retentive Human-object Interaction image Generation (OHG), and then propose a VirtualModel framework to generate human images for product shown, which supports displays of any categories of products and any types of human-object interaction. As shown in Figure 1, VirtualModel not only outperforms other methods in terms of accurate pose control and image quality but also allows for the display of user-specified product objects by maintaining the product-ID consistency and enhancing the plausibility of human-object interaction. Codes and data will be released. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: project page: https://aigcdesigngroup.github.io/replace-anything;

arXiv:2405.09543 [pdf, other]

Algorithmic Fairness: A Tolerance Perspective

Authors: Renqiang Luo, Tao Tang, Feng Xia, Jiaying Liu, Chengpei Xu, Leo Yu Zhang, Wei Xiang, Chengqi Zhang

Abstract: Recent advancements in machine learning and deep learning have brought algorithmic fairness into sharp focus, illuminating concerns over discriminatory decision making that negatively impacts certain individuals or groups. These concerns have manifested in legal, ethical, and societal challenges, including the erosion of trust in intelligent systems. In response, this survey delves into the existi… ▽ More Recent advancements in machine learning and deep learning have brought algorithmic fairness into sharp focus, illuminating concerns over discriminatory decision making that negatively impacts certain individuals or groups. These concerns have manifested in legal, ethical, and societal challenges, including the erosion of trust in intelligent systems. In response, this survey delves into the existing literature on algorithmic fairness, specifically highlighting its multifaceted social consequences. We introduce a novel taxonomy based on 'tolerance', a term we define as the degree to which variations in fairness outcomes are acceptable, providing a structured approach to understanding the subtleties of fairness within algorithmic decisions. Our systematic review covers diverse industries, revealing critical insights into the balance between algorithmic decision making and social equity. By synthesizing these insights, we outline a series of emerging challenges and propose strategic directions for future research and policy making, with the goal of advancing the field towards more equitable algorithmic systems. △ Less

Submitted 26 April, 2024; originally announced May 2024.

Comments: 33 pages, 4 figures

MSC Class: 68T01; 68W40 ACM Class: I.2.6; K.4.2; H.1.2

arXiv:2405.05164 [pdf, other]

ProbRadarM3F: mmWave Radar based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion

Authors: Bing Zhu, Zixin He, Weiyi Xiong, Guanhua Ding, Jianan Liu, Tao Huang, Wei Chen, Wei Xiang

Abstract: Millimeter wave (mmWave) radar is a non-intrusive privacy and relatively convenient and inexpensive device, which has been demonstrated to be applicable in place of RGB cameras in human indoor pose estimation tasks. However, mmWave radar relies on the collection of reflected signals from the target, and the radar signals containing information is difficult to be fully applied. This has been a long… ▽ More Millimeter wave (mmWave) radar is a non-intrusive privacy and relatively convenient and inexpensive device, which has been demonstrated to be applicable in place of RGB cameras in human indoor pose estimation tasks. However, mmWave radar relies on the collection of reflected signals from the target, and the radar signals containing information is difficult to be fully applied. This has been a long-standing hindrance to the improvement of pose estimation accuracy. To address this major challenge, this paper introduces a probability map guided multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature extraction framework using a traditional FFT method in parallel with a probability map based positional encoding method. ProbRadarM3F fuses the traditional heatmap features and the positional features, then effectively achieves the estimation of 14 keypoints of the human body. Experimental evaluation on the HuPR dataset proves the effectiveness of the model proposed in this paper, outperforming other methods experimented on this dataset with an AP of 69.9 %. The emphasis of our study is focusing on the position information that is not exploited before in radar singal. This provides direction to investigate other potential non-redundant information from mmWave rader. △ Less

Submitted 28 June, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.16223 [pdf, other]

Deep RAW Image Super-Resolution. A NTIRE 2024 Challenge Survey

Authors: Marcos V. Conde, Florin-Alexandru Vasluianu, Radu Timofte, Jianxing Zhang, Jia Li, Fan Wang, Xiaopeng Li, Zikun Liu, Hyunhee Park, Sejun Song, Changho Kim, Zhijuan Huang, Hongyuan Yu, Cheng Wan, Wending Xiang, Jiamin Lin, Hang Zhong, Qiaosong Zhang, Yue Sun, Xuanwu Yin, Kunlong Zuo, Senyan Xu, Siyuan Jiang, Zhi**g Sun, Jiaying Zhu , et al. (10 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as nois… ▽ More This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during thee challenge period. The performance of the top-5 submissions is reviewed and provided here as a gauge for the current state-of-the-art in RAW Image Super-Resolution. △ Less

Submitted 24 April, 2024; originally announced April 2024.

Comments: CVPR 2024 - NTIRE Workshop

arXiv:2403.05050 [pdf, other]

DyRoNet: Dynamic Routing and Low-Rank Adapters for Autonomous Driving Streaming Perception

Authors: Xiang Huang, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Wangmeng Xiang, Baigui Sun, Xiao Wu

Abstract: The advancement of autonomous driving systems hinges on the ability to achieve low-latency and high-accuracy perception. To address this critical need, this paper introduces Dynamic Routering Network (DyRoNet), a low-rank enhanced dynamic routing framework designed for streaming perception in autonomous driving systems. DyRoNet integrates a suite of pre-trained branch networks, each meticulously f… ▽ More The advancement of autonomous driving systems hinges on the ability to achieve low-latency and high-accuracy perception. To address this critical need, this paper introduces Dynamic Routering Network (DyRoNet), a low-rank enhanced dynamic routing framework designed for streaming perception in autonomous driving systems. DyRoNet integrates a suite of pre-trained branch networks, each meticulously fine-tuned to function under distinct environmental conditions. At its core, the framework offers a speed router module, developed to assess and route input data to the most suitable branch for processing. This approach not only addresses the inherent limitations of conventional models in adapting to diverse driving conditions but also ensures the balance between performance and efficiency. Extensive experimental evaluations demonstrating the adaptability of DyRoNet to diverse branch selection strategies, resulting in significant performance enhancements across different scenarios. This work not only establishes a new benchmark for streaming perception but also provides valuable engineering insights for future work. △ Less

Submitted 18 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Project: https://tastevision.github.io/DyRoNet/

arXiv:2403.04435 [pdf, other]

Pilot Spoofing Attack on the Downlink of Cell-Free Massive MIMO: From the Perspective of Adversaries

Authors: Weiyang Xu, Ruiguang Wang, Yuan Zhang, Hien Quoc Ngo, Wei Xiang

Abstract: The channel hardening effect is less pronounced in the cell-free massive multiple-input multiple-output (mMIMO) system compared to its cellular counterpart, making it necessary to estimate the downlink effective channel gains to ensure decent performance. However, the downlink training inadvertently creates an opportunity for adversarial nodes to launch pilot spoofing attacks (PSAs). First, we dem… ▽ More The channel hardening effect is less pronounced in the cell-free massive multiple-input multiple-output (mMIMO) system compared to its cellular counterpart, making it necessary to estimate the downlink effective channel gains to ensure decent performance. However, the downlink training inadvertently creates an opportunity for adversarial nodes to launch pilot spoofing attacks (PSAs). First, we demonstrate that adversarial distributed access points (APs) can severely degrade the achievable downlink rate. They achieve this by estimating their channels to users in the uplink training phase and then precoding and sending the same pilot sequences as those used by legitimate APs during the downlink training phase. Then, the impact of the downlink PSA is investigated by rigorously deriving a closed-form expression of the per-user achievable downlink rate. By employing the min-max criterion to optimize the power allocation coefficients, the maximum per-user achievable rate of downlink transmission is minimized from the perspective of adversarial APs. As an alternative to the downlink PSA, adversarial APs may opt to precode random interference during the downlink data transmission phase in order to disrupt legitimate communications. In this scenario, the achievable downlink rate is derived, and then power optimization algorithms are also developed. We present numerical results to showcase the detrimental impact of the downlink PSA and compare the effects of these two types of attacks. △ Less

Submitted 11 April, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.17339 [pdf, other]

SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents

Authors: Wei Xiang, Haoteng Yin, He Wang, Xiaogang **

Abstract: Pedestrian trajectory prediction is the key technology in many applications for providing insights into human behavior and anticipating human future motions. Most existing empirical models are explicitly formulated by observed human behaviors using explicable mathematical terms with a deterministic nature, while recent work has focused on develo** hybrid models combined with learning-based techn… ▽ More Pedestrian trajectory prediction is the key technology in many applications for providing insights into human behavior and anticipating human future motions. Most existing empirical models are explicitly formulated by observed human behaviors using explicable mathematical terms with a deterministic nature, while recent work has focused on develo** hybrid models combined with learning-based techniques for powerful expressiveness while maintaining explainability. However, the deterministic nature of the learned steering behaviors from the empirical models limits the models' practical performance. To address this issue, this work proposes the social conditional variational autoencoder (SocialCVAE) for predicting pedestrian trajectories, which employs a CVAE to explore behavioral uncertainty in human motion decisions. SocialCVAE learns socially reasonable motion randomness by utilizing a socially explainable interaction energy map as the CVAE's condition, which illustrates the future occupancy of each pedestrian's local neighborhood area. The energy map is generated using an energy-based interaction model, which anticipates the energy cost (i.e., repulsion intensity) of pedestrians' interactions with neighbors. Experimental results on two public benchmarks including 25 scenes demonstrate that SocialCVAE significantly improves prediction accuracy compared with the state-of-the-art methods, with up to 16.85% improvement in Average Displacement Error (ADE) and 69.18% improvement in Final Displacement Error (FDE). △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI'24

arXiv:2402.11739 [pdf, other]

A Transition System Abstraction Framework for Neural Network Dynamical System Models

Authors: Yejiang Yang, Zihao Mo, Hoang-Dung Tran, Weiming Xiang

Abstract: This paper proposes a transition system abstraction framework for neural network dynamical system models to enhance the model interpretability, with applications to complex dynamical systems such as human behavior learning and verification. To begin with, the localized working zone will be segmented into multiple localized partitions under the data-driven Maximum Entropy (ME) partitioning method.… ▽ More This paper proposes a transition system abstraction framework for neural network dynamical system models to enhance the model interpretability, with applications to complex dynamical systems such as human behavior learning and verification. To begin with, the localized working zone will be segmented into multiple localized partitions under the data-driven Maximum Entropy (ME) partitioning method. Then, the transition matrix will be obtained based on the set-valued reachability analysis of neural networks. Finally, applications to human handwriting dynamics learning and verification are given to validate our proposed abstraction framework, which demonstrates the advantages of enhancing the interpretability of the black-box model, i.e., our proposed framework is able to abstract a data-driven neural network model into a transition system, making the neural network model interpretable through verifying specifications described in Computational Tree Logic (CTL) languages. △ Less

Submitted 18 February, 2024; originally announced February 2024.

Comments: ACC 2024

arXiv:2402.11737 [pdf, other]

Compression Repair for Feedforward Neural Networks Based on Model Equivalence Evaluation

Authors: Zihao Mo, Yejiang Yang, Shuaizheng Lu, Weiming Xiang

Abstract: In this paper, we propose a method of repairing compressed Feedforward Neural Networks (FNNs) based on equivalence evaluation of two neural networks. In the repairing framework, a novel neural network equivalence evaluation method is developed to compute the output discrepancy between two neural networks. The output discrepancy can quantitatively characterize the output difference produced by comp… ▽ More In this paper, we propose a method of repairing compressed Feedforward Neural Networks (FNNs) based on equivalence evaluation of two neural networks. In the repairing framework, a novel neural network equivalence evaluation method is developed to compute the output discrepancy between two neural networks. The output discrepancy can quantitatively characterize the output difference produced by compression procedures. Based on the computed output discrepancy, the repairing method first initializes a new training set for the compressed networks to narrow down the discrepancy between the two neural networks and improve the performance of the compressed network. Then, we repair the compressed FNN by re-training based on the training set. We apply our developed method to the MNIST dataset to demonstrate the effectiveness and advantages of our proposed repair method. △ Less

Submitted 18 February, 2024; originally announced February 2024.

Comments: ACC 2024

arXiv:2402.04584 [pdf, other]

Troublemaker Learning for Low-Light Image Enhancement

Authors: Yinghao Song, Zhiyuan Cao, Wanhong Xiang, Sifan Long, Bo Yang, Hongwei Ge, Yanchun Liang, Chunguo Wu

Abstract: Low-light image enhancement (LLIE) restores the color and brightness of underexposed images. Supervised methods suffer from high costs in collecting low/normal-light image pairs. Unsupervised methods invest substantial effort in crafting complex loss functions. We address these two challenges through the proposed TroubleMaker Learning (TML) strategy, which employs normal-light images as inputs for… ▽ More Low-light image enhancement (LLIE) restores the color and brightness of underexposed images. Supervised methods suffer from high costs in collecting low/normal-light image pairs. Unsupervised methods invest substantial effort in crafting complex loss functions. We address these two challenges through the proposed TroubleMaker Learning (TML) strategy, which employs normal-light images as inputs for training. TML is simple: we first dim the input and then increase its brightness. TML is based on two core components. First, the troublemaker model (TM) constructs pseudo low-light images from normal images to relieve the cost of pairwise data. Second, the predicting model (PM) enhances the brightness of pseudo low-light images. Additionally, we incorporate an enhancing model (EM) to further improve the visual performance of PM outputs. Moreover, in LLIE tasks, characterizing global element correlations is important because more information on the same object can be captured. CNN cannot achieve this well, and self-attention has high time complexity. Accordingly, we propose Global Dynamic Convolution (GDC) with O(n) time complexity, which essentially imitates the partial calculation process of self-attention to formulate elementwise correlations. Based on the GDC module, we build the UGDC model. Extensive quantitative and qualitative experiments demonstrate that UGDC trained with TML can achieve competitive performance against state-of-the-art approaches on public datasets. The code is available at https://github.com/Rainbowman0/TML_LLIE. △ Less

Submitted 2 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.11633 [pdf, other]

Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss

Authors: Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, Clinton Fookes

Abstract: The fusion of vision and language has brought about a transformative shift in computer vision through the emergence of Vision-Language Models (VLMs). However, the resource-intensive nature of existing VLMs poses a significant challenge. We need an accessible method for develo** the next generation of VLMs. To address this issue, we propose Zoom-shot, a novel method for transferring the zero-shot… ▽ More The fusion of vision and language has brought about a transformative shift in computer vision through the emergence of Vision-Language Models (VLMs). However, the resource-intensive nature of existing VLMs poses a significant challenge. We need an accessible method for develo** the next generation of VLMs. To address this issue, we propose Zoom-shot, a novel method for transferring the zero-shot capabilities of CLIP to any pre-trained vision encoder. We do this by exploiting the multimodal information (i.e. text and image) present in the CLIP latent space through the use of specifically designed multimodal loss functions. These loss functions are (1) cycle-consistency loss and (2) our novel prompt-guided knowledge distillation loss (PG-KD). PG-KD combines the concept of knowledge distillation with CLIP's zero-shot classification, to capture the interactions between text and image features. With our multimodal losses, we train a $\textbf{linear map**}$ between the CLIP latent space and the latent space of a pre-trained vision encoder, for only a $\textbf{single epoch}$. Furthermore, Zoom-shot is entirely unsupervised and is trained using $\textbf{unpaired}$ data. We test the zero-shot capabilities of a range of vision encoders augmented as new VLMs, on coarse and fine-grained classification datasets, outperforming the previous state-of-the-art in this problem domain. In our ablations, we find Zoom-shot allows for a trade-off between data and compute during training; and our state-of-the-art results can be obtained by reducing training from 20% to 1% of the ImageNet training data with 20 epochs. All code and models are available on GitHub. △ Less

Submitted 21 January, 2024; originally announced January 2024.

Comments: 15 pages

arXiv:2401.05800 [pdf, other]

Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values

Authors: Yu Zheng, Huan Yee Koh, Ming **, Lianhua Chi, Haishuai Wang, Khoa T. Phan, Yi-** Phoebe Chen, Shirui Pan, Wei Xiang

Abstract: The detection of anomalies in multivariate time series data is crucial for various practical applications, including smart power grids, traffic flow forecasting, and industrial process control. However, real-world time series data is usually not well-structured, posting significant challenges to existing approaches: (1) The existence of missing values in multivariate time series data along variabl… ▽ More The detection of anomalies in multivariate time series data is crucial for various practical applications, including smart power grids, traffic flow forecasting, and industrial process control. However, real-world time series data is usually not well-structured, posting significant challenges to existing approaches: (1) The existence of missing values in multivariate time series data along variable and time dimensions hinders the effective modeling of interwoven spatial and temporal dependencies, resulting in important patterns being overlooked during model training; (2) Anomaly scoring with irregularly-sampled observations is less explored, making it difficult to use existing detectors for multivariate series without fully-observed values. In this work, we introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and anomaly scorer to tackle the aforementioned challenges in detecting anomalies on irregularly-sampled multivariate time series. Our approach comprises two main components. First, we propose a graph spatiotemporal process based on neural controlled differential equations. This process enables effective modeling of multivariate time series from both spatial and temporal perspectives, even when the data contains missing values. Second, we present a novel distribution-based anomaly scoring mechanism that alleviates the reliance on complete uniform observations. By analyzing the predictions of the graph spatiotemporal process, our approach allows anomalies to be easily detected. Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods, regardless of whether there are missing values present in the data. Our code is available: https://github.com/huankoh/GST-Pro. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Comments: Accepted by Information Fusion

arXiv:2401.01699 [pdf, other]

WordArt Designer API: User-Driven Artistic Typography Synthesis with Large Language Models on ModelScope

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, **gdong Sun, Wangmeng Xiang, Yusen Hu, Xianhui Lin, Xiaoyang Kang, Zengke **, Bin Luo, Yifeng Geng, Xuansong Xie, **gren Zhou

Abstract: This paper introduces the WordArt Designer API, a novel framework for user-driven artistic typography synthesis utilizing Large Language Models (LLMs) on ModelScope. We address the challenge of simplifying artistic typography for non-professionals by offering a dynamic, adaptive, and computationally efficient alternative to traditional rigid templates. Our approach leverages the power of LLMs to u… ▽ More This paper introduces the WordArt Designer API, a novel framework for user-driven artistic typography synthesis utilizing Large Language Models (LLMs) on ModelScope. We address the challenge of simplifying artistic typography for non-professionals by offering a dynamic, adaptive, and computationally efficient alternative to traditional rigid templates. Our approach leverages the power of LLMs to understand and interpret user input, facilitating a more intuitive design process. We demonstrate through various case studies how users can articulate their aesthetic preferences and functional requirements, which the system then translates into unique and creative typographic designs. Our evaluations indicate significant improvements in user satisfaction, design flexibility, and creative expression over existing systems. The WordArt Designer API not only democratizes the art of typography but also opens up new possibilities for personalized digital communication and design. △ Less

Submitted 12 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: Spotlight Paper at the Workshop on Machine Learning for Creativity and Design, 37th Conference on Neural Information Processing Systems (NeurIPS 2023). 5 pages, 5 figures

arXiv:2312.10504 [pdf, other]

SubAnom: Efficient Subgraph Anomaly Detection Framework over Dynamic Graphs

Authors: Chi Zhang, Wenkai Xiang, Xingzhi Guo, Baojian Zhou, Deqing Yang

Abstract: Given a dynamic graph, the efficient tracking of anomalous subgraphs via their node embeddings poses a significant challenge. Addressing this issue necessitates an effective scoring mechanism and an innovative anomalous subgraph strategy. Existing methods predominantly focus on designing scoring strategies or employing graph structures that consider nodes in isolation, resulting in ineffective cap… ▽ More Given a dynamic graph, the efficient tracking of anomalous subgraphs via their node embeddings poses a significant challenge. Addressing this issue necessitates an effective scoring mechanism and an innovative anomalous subgraph strategy. Existing methods predominantly focus on designing scoring strategies or employing graph structures that consider nodes in isolation, resulting in ineffective capture of the anomalous subgraph structure information. In this paper, we introduce SUBANOM, a novel framework for subgraph anomaly detection that is adept at identifying anomalous subgraphs. SUBANOM has three key components: 1) We implement current state-of-the-art dynamic embedding methods to efficiently calculate node embeddings, thereby capturing all node-level anomalies successfully; 2) We devise novel subgraph identification strategies, which include k-hop and triadic-closure. These strategies form the crucial component that can proficiently differentiate between strong and weak neighbors, thus effectively capturing the anomaly structure information; 3) For qualifying the anomaly subgraphs, we propose using Lp-norm-based score aggregation functions. These iterative steps enable us to process large-scale dynamic graphs effectively. Experiments conducted on a real-world dynamic graph underscore the efficacy of our framework in detecting anomalous subgraphs, outperforming state-of-the-art methods. Experimental results further signify that our framework is a potent tool for identifying anomalous subgraphs in real-world scenarios. For instance, the F1 score under the optimal subgraph identification strategy, can peak at 0.6679, while the highest achievable score using the corresponding baseline method is 0.5677. △ Less

Submitted 16 December, 2023; originally announced December 2023.

arXiv:2312.01555 [pdf, other]

Explainable AI is Responsible AI: How Explainability Creates Trustworthy and Socially Responsible Artificial Intelligence

Authors: Stephanie Baker, Wei Xiang

Abstract: Artificial intelligence (AI) has been clearly established as a technology with the potential to revolutionize fields from healthcare to finance - if developed and deployed responsibly. This is the topic of responsible AI, which emphasizes the need to develop trustworthy AI systems that minimize bias, protect privacy, support security, and enhance transparency and accountability. Explainable AI (XA… ▽ More Artificial intelligence (AI) has been clearly established as a technology with the potential to revolutionize fields from healthcare to finance - if developed and deployed responsibly. This is the topic of responsible AI, which emphasizes the need to develop trustworthy AI systems that minimize bias, protect privacy, support security, and enhance transparency and accountability. Explainable AI (XAI) has been broadly considered as a building block for responsible AI (RAI), with most of the literature considering it as a solution for improved transparency. This work proposes that XAI and responsible AI are significantly more deeply entwined. In this work, we explore state-of-the-art literature on RAI and XAI technologies. Based on our findings, we demonstrate that XAI can be utilized to ensure fairness, robustness, privacy, security, and transparency in a wide range of contexts. Our findings lead us to conclude that XAI is an essential foundation for every pillar of RAI. △ Less

Submitted 3 December, 2023; originally announced December 2023.

Comments: 35 pages, 7 figures (figures 3-6 include subfigures)

arXiv:2312.00082 [pdf, other]

A Compact Implicit Neural Representation for Efficient Storage of Massive 4D Functional Magnetic Resonance Imaging

Authors: Ruoran Li, Runzhao Yang, Wenxin Xiang, Yuxiao Cheng, Tingxiong Xiao, **li Suo

Abstract: Functional Magnetic Resonance Imaging (fMRI) data is a widely used kind of four-dimensional biomedical data, which requires effective compression. However, fMRI compressing poses unique challenges due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Im… ▽ More Functional Magnetic Resonance Imaging (fMRI) data is a widely used kind of four-dimensional biomedical data, which requires effective compression. However, fMRI compressing poses unique challenges due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series by employing several methods, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and (iii) using proper initialization together with nonlinear fusion to describe the inter-region similarity. This scheme appropriately incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work in this paper paves the way for sharing massive fMRI data at low bandwidth and high fidelity. △ Less

Submitted 29 February, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

arXiv:2311.18214 [pdf, other]

Perception of Misalignment States for Sky Survey Telescopes with the Digital Twin and the Deep Neural Networks

Authors: Miao Zhang, Peng Jia, Zhengyang Li, Wennan Xiang, Jiameng Lv, Rui Sun

Abstract: Sky survey telescopes play a critical role in modern astronomy, but misalignment of their optical elements can introduce significant variations in point spread functions, leading to reduced data quality. To address this, we need a method to obtain misalignment states, aiding in the reconstruction of accurate point spread functions for data processing methods or facilitating adjustments of optical… ▽ More Sky survey telescopes play a critical role in modern astronomy, but misalignment of their optical elements can introduce significant variations in point spread functions, leading to reduced data quality. To address this, we need a method to obtain misalignment states, aiding in the reconstruction of accurate point spread functions for data processing methods or facilitating adjustments of optical components for improved image quality. Since sky survey telescopes consist of many optical elements, they result in a vast array of potential misalignment states, some of which are intricately coupled, posing detection challenges. However, by continuously adjusting the misalignment states of optical elements, we can disentangle coupled states. Based on this principle, we propose a deep neural network to extract misalignment states from continuously varying point spread functions in different field of views. To ensure sufficient and diverse training data, we recommend employing a digital twin to obtain data for neural network training. Additionally, we introduce the state graph to store misalignment data and explore complex relationships between misalignment states and corresponding point spread functions, guiding the generation of training data from experiments. Once trained, the neural network estimates misalignment states from observation data, regardless of the impacts caused by atmospheric turbulence, noise, and limited spatial sampling rates in the detector. The method proposed in this paper could be used to provide prior information for the active optics system and the optical system alignment. △ Less

Submitted 29 November, 2023; originally announced November 2023.

Comments: The aforementioned submission has been accepted by Optics Express. We kindly request any feedback or comments to be directed to the corresponding author, Peng Jia ([email protected]), or the second corresponding author, Zhengyang Li ([email protected]). Please note that Zhengyang is currently stationed in the South Antarctica and will not be available until after February 1st, 2024

arXiv:2311.03054 [pdf, other]

AnyText: Multilingual Visual Text Generation And Editing

Authors: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie

Abstract: Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generat… ▽ More Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model, that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced on https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology. △ Less

Submitted 21 February, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.18332 [pdf, other]

WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, **gdong Sun, Wangmeng Xiang, Xianhui Lin, Xiaoyang Kang, Zengke **, Yusen Hu, Bin Luo, Yifeng Geng, Xuansong Xie, **gren Zhou

Abstract: This paper introduces WordArt Designer, a user-driven framework for artistic typography synthesis, relying on the Large Language Model (LLM). The system incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets user inputs and generates actionable prompts for the other modules, thereby transforming abst… ▽ More This paper introduces WordArt Designer, a user-driven framework for artistic typography synthesis, relying on the Large Language Model (LLM). The system incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets user inputs and generates actionable prompts for the other modules, thereby transforming abstract concepts into tangible designs. 2) The SemTypo module optimizes font designs using semantic concepts, striking a balance between artistic transformation and readability. 3) Building on the semantic layout provided by the SemTypo module, the StyTypo module creates smooth, refined images. 4) The TexTypo module further enhances the design's aesthetics through texture rendering, enabling the generation of inventive textured fonts. Notably, WordArt Designer highlights the fusion of generative AI with artistic typography. Experience its capabilities on ModelScope: https://www.modelscope.cn/studios/WordArt/WordArt. △ Less

Submitted 26 November, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: Accepted by EMNLP 2023, 10 pages, 11 figures, 1 table, the system is at https://www.modelscope.cn/studios/WordArt/WordArt

arXiv:2310.17401 [pdf, other]

Energy Efficient Robust Beamforming for Vehicular ISAC with Imperfect Channel Estimation

Authors: Hanwen Zhang, Haijian Sun, Tianyi He, Weiming Xiang, Rose Qingyang Hu

Abstract: This paper investigates robust beamforming for system-centric energy efficiency (EE) optimization in the vehicular integrated sensing and communication (ISAC) system, where the mobility of vehicles poses significant challenges to channel estimation. To obtain the optimal beamforming under channel uncertainty, we first formulate an optimization problem for maximizing the system EE under bounded cha… ▽ More This paper investigates robust beamforming for system-centric energy efficiency (EE) optimization in the vehicular integrated sensing and communication (ISAC) system, where the mobility of vehicles poses significant challenges to channel estimation. To obtain the optimal beamforming under channel uncertainty, we first formulate an optimization problem for maximizing the system EE under bounded channel estimation errors. Next, fractional programming and semidefinite relaxation (SDR) are utilized to relax the rank-1 constraints. We further use Schur complement and S-Procedure to transform Cramer-Rao bound (CRB) and channel estimation error constraints into convex forms, respectively. Based on the Lagrangian dual function and Karush-Kuhn-Tucker (KKT) conditions, it is proved that the optimal beamforming solution is rank-1. Finally, we present comprehensive simulation results to demonstrate two key findings: 1) the proposed algorithm exhibits a favorable convergence rate, and 2) the approach effectively mitigates the impact of channel estimation errors. △ Less

Submitted 26 October, 2023; originally announced October 2023.

Comments: Submitted to IEEE for future publication

arXiv:2310.07182 [pdf, other]

Generate Coherent Rays Directly

Authors: Fengqi Liu, Zaonan Tan, Weilai Xiang, Chenhao Lu, Dan Li, Xu Gong, Yulong Shi, Songnan Shi, Qilong Kou, Bo Hu

Abstract: The path tracing method generates incoherent rays by randomly sampling directions. This randomness makes it unsuitable for modern processor architectures that rely on coherence to achieve optimal performance. Many efforts have been made to address this issue by reordering rays based on their origin, end, or direction to enhance coherence. However, a drawback of reordering methods is the need to en… ▽ More The path tracing method generates incoherent rays by randomly sampling directions. This randomness makes it unsuitable for modern processor architectures that rely on coherence to achieve optimal performance. Many efforts have been made to address this issue by reordering rays based on their origin, end, or direction to enhance coherence. However, a drawback of reordering methods is the need to encode and sort rays before tracing, introducing additional overhead. We propose a technique to generate coherent rays directly by reusing the direction. Additionally, we introduce an interleaved reuse domain partition method to mitigate the impact of sampling correlation resulting from direction reuse. We demonstrate the effectiveness of our approach across various scenes, establishing its superiority over reordering methods. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: 8 pages

arXiv:2310.04180 [pdf, other]

Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

Authors: Qingguo Liu, Pan Gao, Kang Han, Ningzhong Liu, Wei Xiang

Abstract: Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-atte… ▽ More Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. Source code is available at: https://github.com/I2-Multimedia-Lab/DSAT/tree/main. △ Less

Submitted 6 October, 2023; originally announced October 2023.

Comments: 12 pages

arXiv:2309.07561 [pdf, other]

Adaptive Prompt Learning with Distilled Connective Knowledge for Implicit Discourse Relation Recognition

Authors: Bang Wang, Zhenglin Wang, Wei Xiang, Yijun Mo

Abstract: Implicit discourse relation recognition (IDRR) aims at recognizing the discourse relation between two text segments without an explicit connective. Recently, the prompt learning has just been applied to the IDRR task with great performance improvements over various neural network-based approaches. However, the discrete nature of the state-art-of-art prompting approach requires manual design of tem… ▽ More Implicit discourse relation recognition (IDRR) aims at recognizing the discourse relation between two text segments without an explicit connective. Recently, the prompt learning has just been applied to the IDRR task with great performance improvements over various neural network-based approaches. However, the discrete nature of the state-art-of-art prompting approach requires manual design of templates and answers, a big hurdle for its practical applications. In this paper, we propose a continuous version of prompt learning together with connective knowledge distillation, called AdaptPrompt, to reduce manual design efforts via continuous prompting while further improving performance via knowledge transfer. In particular, we design and train a few virtual tokens to form continuous templates and automatically select the most suitable one by gradient search in the embedding space. We also design an answer-relation map** rule to generate a few virtual answers as the answer space. Furthermore, we notice the importance of annotated connectives in the training dataset and design a teacher-student architecture for knowledge transfer. Experiments on the up-to-date PDTB Corpus V3.0 validate our design objectives in terms of the better relation recognition performance over the state-of-the-art competitors. △ Less

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2309.01365 [pdf, other]

Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

Authors: Hanbing Liu, Wangmeng Xiang, Jun-Yan He, Zhi-Qi Cheng, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Accurately estimating the 3D pose of humans in video sequences requires both accuracy and a well-structured architecture. With the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) struct… ▽ More Accurately estimating the 3D pose of humans in video sequences requires both accuracy and a well-structured architecture. With the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) structure and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module. In particular, TPCA block exploits a temporal pyramid paradigm, reinforcing key and value representation capabilities and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks with XLR that promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy embodies early-stage information with current flows, addressing typical deficits in detail and stability seen in other transformer-based methods. We demonstrate the effectiveness of RTPCA by achieving state-of-the-art results on Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with minimal computational overhead. The source code is available at https://github.com/hbing-l/RTPCA. △ Less

Submitted 4 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: 11 pages, 5 figures

arXiv:2308.09678 [pdf, other]

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

Authors: Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose \textit{Multi-Hypothesis \textbf{P}ose \textbf{Syn}thesis \textbf{D}omain \textbf{A}daptation} (\textbf{PoSynDA}) framework to bridge this data disparity gap in target domain. Typically, PoSynDA uses a diffusion-inspired structure to… ▽ More Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose \textit{Multi-Hypothesis \textbf{P}ose \textbf{Syn}thesis \textbf{D}omain \textbf{A}daptation} (\textbf{PoSynDA}) framework to bridge this data disparity gap in target domain. Typically, PoSynDA uses a diffusion-inspired structure to simulate 3D pose distribution in the target domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse pose hypotheses and aligns them with the target domain. To do this, it first utilizes target-specific source augmentation to obtain the target domain distribution data from the source domain by decoupling the scale and position parameters. The process is then further refined through the teacher-student paradigm and low-rank adaptation. With extensive comparison of benchmarks such as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance, even comparable to the target-trained MixSTE model\cite{zhang2022mixste}. This work paves the way for the practical application of 3D human pose estimation in unseen domains. The code is available at https://github.com/hbing-l/PoSynDA. △ Less

Submitted 16 October, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA

arXiv:2308.03262 [pdf, other]

A Benchmark for Chinese-English Scene Text Image Super-resolution

Authors: Jianqi Ma, Zhetong Liang, Wangmeng Xiang, Xi Yang, Lei Zhang

Abstract: Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input. Most existing works focus on recovering English texts, which have relatively simple character structures, while little work has been done on the more challenging Chinese texts with diverse and complex character s… ▽ More Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input. Most existing works focus on recovering English texts, which have relatively simple character structures, while little work has been done on the more challenging Chinese texts with diverse and complex character structures. In this paper, we propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR with the emphasis on restoring structurally complex Chinese characters. The benchmark provides 1,935/783 real-world LR-HR text image pairs~(contains 33,789 text lines in total) for training/testing in 2$\times$ and 4$\times$ zooming modes, complemented by detailed annotations, including detection boxes and text transcripts. Moreover, we design an edge-aware learning method, which provides structural supervision in image and feature domains, to effectively reconstruct the dense structures of Chinese characters. We conduct experiments on the proposed Real-CE benchmark and evaluate the existing STISR models with and without our edge-aware loss. The benchmark, including data and source code, is available at https://github.com/mjq11302010044/Real-CE. △ Less

Submitted 6 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV2023

arXiv:2308.01251 [pdf, other]

Hyper-pixel-wise Contrastive Learning Augmented Segmentation Network for Old Landslide Detection through Fusing High-Resolution Remote Sensing Images and Digital Elevation Model Data

Authors: Yiming Zhou, Yuexing Peng, Wei Li, Junchuan Yu, Daqing Ge, Wei Xiang

Abstract: As a natural disaster, landslide often brings tremendous losses to human lives, so it urgently demands reliable detection of landslide risks. When detecting old landslides that present important information for landslide risk warning, problems such as visual blur and small-sized dataset cause great challenges when using remote sensing data. To extract accurate semantic features, a hyper-pixel-wise… ▽ More As a natural disaster, landslide often brings tremendous losses to human lives, so it urgently demands reliable detection of landslide risks. When detecting old landslides that present important information for landslide risk warning, problems such as visual blur and small-sized dataset cause great challenges when using remote sensing data. To extract accurate semantic features, a hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) is proposed, which augments the local salient feature extraction from boundaries of landslides through HPCL-Net and fuses heterogeneous infromation in the semantic space from high-resolution remote sensing images and digital elevation model data. For full utilization of precious samples, a global hyper-pixel-wise sample pair queues-based contrastive learning method is developed, which includes the construction of global queues that store hyper-pixel-wise samples and the updating scheme of a momentum encoder, reliably enhancing the extraction ability of semantic features. The proposed HPCL-Net is evaluated on the Loess Plateau old landslide dataset and experimental results verify that the proposed HPCL-Net greatly outperforms existing models, where the mIoU is increased from 0.620 to 0.651, the Landslide IoU is improved from 0.334 to 0.394 and the F1score is enhanced from 0.501 to 0.565. △ Less

Submitted 6 October, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

arXiv:2307.12327 [pdf, other]

End-to-end Hyperspectral Image Change Detection Network Based on Band Selection

Authors: Qingren Yao, Yuan Zhou, Chang Tang, Wei Xiang

Abstract: For hyperspectral image change detection (HSI-CD), one key challenge is to reduce band redundancy, as only a few bands are crucial for change detection while other bands may be adverse to it. However, most existing HSI-CD methods directly extract change feature from full-dimensional HSIs, suffering from a degradation of feature discrimination. To address this issue, we propose an end-to-end hypers… ▽ More For hyperspectral image change detection (HSI-CD), one key challenge is to reduce band redundancy, as only a few bands are crucial for change detection while other bands may be adverse to it. However, most existing HSI-CD methods directly extract change feature from full-dimensional HSIs, suffering from a degradation of feature discrimination. To address this issue, we propose an end-to-end hyperspectral image change detection network with band selection (ECDBS), which effectively retains the critical bands to promote change detection. The main ingredients of the network are a deep learning based band selection module and cascading band-specific spatial attention (BSA) blocks. The band selection module can be seamlessly integrated with subsequent CD models for joint optimization and end-to-end reasoning, rather than as a step separate from change detection. The BSA block extracts features from each band using a tailored strategy. Unlike the typically used feature extraction strategy that uniformly processes all bands, the BSA blocks considers the differences in feature distributions among widely spaced bands, thereupon extracting more sufficient change feature. Experimental evaluations conducted on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of our proposed method over other state-of-the-art techniques. △ Less

Submitted 16 November, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

arXiv:2307.09977 [pdf, ps, other]

Learner Referral for Cost-Effective Federated Learning Over Hierarchical IoT Networks

Authors: Yulan Gao, Ziqiang Ye, Yue Xiao, Wei Xiang

Abstract: The paradigm of federated learning (FL) to address data privacy concerns by locally training parameters on resource-constrained clients in a distributed manner has garnered significant attention. Nonetheless, FL is not applicable when not all clients within the coverage of the FL server are registered with the FL network. To bridge this gap, this paper proposes joint learner referral aided federat… ▽ More The paradigm of federated learning (FL) to address data privacy concerns by locally training parameters on resource-constrained clients in a distributed manner has garnered significant attention. Nonetheless, FL is not applicable when not all clients within the coverage of the FL server are registered with the FL network. To bridge this gap, this paper proposes joint learner referral aided federated client selection (LRef-FedCS), along with communications and computing resource scheduling, and local model accuracy optimization (LMAO) methods. These methods are designed to minimize the cost incurred by the worst-case participant and ensure the long-term fairness of FL in hierarchical Internet of Things (HieIoT) networks. Utilizing the Lyapunov optimization technique, we reformulate the original problem into a stepwise joint optimization problem (JOP). Subsequently, to tackle the mixed-integer non-convex JOP, we separatively and iteratively address LRef-FedCS and LMAO through the centralized method and self-adaptive global best harmony search (SGHS) algorithm, respectively. To enhance scalability, we further propose a distributed LRef-FedCS approach based on a matching game to replace the centralized method described above. Numerical simulations and experimental results on the MNIST/CIFAR-10 datasets demonstrate that our proposed LRef-FedCS approach could achieve a good balance between pursuing high global accuracy and reducing cost. △ Less

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.09813 [pdf, other]

DAPrompt: Deterministic Assumption Prompt Learning for Event Causality Identification

Authors: Wei Xiang, Chuanhong Zhan, Bang Wang

Abstract: Event Causality Identification (ECI) aims at determining whether there is a causal relation between two event mentions. Conventional prompt learning designs a prompt template to first predict an answer word and then maps it to the final decision. Unlike conventional prompts, we argue that predicting an answer word may not be a necessary prerequisite for the ECI task. Instead, we can first make a d… ▽ More Event Causality Identification (ECI) aims at determining whether there is a causal relation between two event mentions. Conventional prompt learning designs a prompt template to first predict an answer word and then maps it to the final decision. Unlike conventional prompts, we argue that predicting an answer word may not be a necessary prerequisite for the ECI task. Instead, we can first make a deterministic assumption on the existence of causal relation between two events and then evaluate its rationality to either accept or reject the assumption. The design motivation is to try the most utilization of the encyclopedia-like knowledge embedded in a pre-trained language model. In light of such considerations, we propose a deterministic assumption prompt learning model, called DAPrompt, for the ECI task. In particular, we design a simple deterministic assumption template concatenating with the input event pair, which includes two masks as predicted events' tokens. We use the probabilities of predicted events to evaluate the assumption rationality for the final event causality decision. Experiments on the EventStoryLine corpus and Causal-TimeBank corpus validate our design objective in terms of significant performance improvements over the state-of-the-art algorithms. △ Less

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.08390 [pdf, other]

doi 10.1109/TNNLS.2023.3325667

Correlation-aware Spatial-Temporal Graph Learning for Multivariate Time-series Anomaly Detection

Authors: Yu Zheng, Huan Yee Koh, Ming **, Lianhua Chi, Khoa T. Phan, Shirui Pan, Yi-** Phoebe Chen, Wei Xiang

Abstract: Multivariate time-series anomaly detection is critically important in many applications, including retail, transportation, power grid, and water treatment plants. Existing approaches for this problem mostly employ either statistical models which cannot capture the non-linear relations well or conventional deep learning models (e.g., CNN and LSTM) that do not explicitly learn the pairwise correlati… ▽ More Multivariate time-series anomaly detection is critically important in many applications, including retail, transportation, power grid, and water treatment plants. Existing approaches for this problem mostly employ either statistical models which cannot capture the non-linear relations well or conventional deep learning models (e.g., CNN and LSTM) that do not explicitly learn the pairwise correlations among variables. To overcome these limitations, we propose a novel method, correlation-aware spatial-temporal graph learning (termed CST-GL), for time series anomaly detection. CST-GL explicitly captures the pairwise correlations via a multivariate time series correlation learning module based on which a spatial-temporal graph neural network (STGNN) can be developed. Then, by employing a graph convolution network that exploits one- and multi-hop neighbor information, our STGNN component can encode rich spatial information from complex pairwise dependencies between variables. With a temporal module that consists of dilated convolutional functions, the STGNN can further capture long-range dependence over time. A novel anomaly scoring component is further integrated into CST-GL to estimate the degree of an anomaly in a purely unsupervised manner. Experimental results demonstrate that CST-GL can detect anomalies effectively in general settings as well as enable early detection across different time delays. △ Less

Submitted 16 November, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: 17 pages, double columns, 10 tables, 3 figures. Accepted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

arXiv:2306.09551 [pdf, other]

Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model

Authors: Lu Yu, Wei Xiang, Kang Han

Abstract: Recent research has demonstrated that the combination of pretrained diffusion models with neural radiance fields (NeRFs) has emerged as a promising approach for text-to-3D generation. Simply coupling NeRF with diffusion models will result in cross-view inconsistency and degradation of stylized view syntheses. To address this challenge, we propose the Edit-DiffNeRF framework, which is composed of a… ▽ More Recent research has demonstrated that the combination of pretrained diffusion models with neural radiance fields (NeRFs) has emerged as a promising approach for text-to-3D generation. Simply coupling NeRF with diffusion models will result in cross-view inconsistency and degradation of stylized view syntheses. To address this challenge, we propose the Edit-DiffNeRF framework, which is composed of a frozen diffusion model, a proposed delta module to edit the latent semantic space of the diffusion model, and a NeRF. Instead of training the entire diffusion for each scene, our method focuses on editing the latent semantic space in frozen pretrained diffusion models by the delta module. This fundamental change to the standard diffusion framework enables us to make fine-grained modifications to the rendered views and effectively consolidate these instructions in a 3D scene via NeRF training. As a result, we are able to produce an edited 3D scene that faithfully aligns to input text instructions. Furthermore, to ensure semantic consistency across different viewpoints, we propose a novel multi-view semantic consistency loss that extracts a latent semantic embedding from the input view as a prior, and aim to reconstruct it in different views. Our proposed method has been shown to effectively edit real-world 3D scenes, resulting in 25% improvement in the alignment of the performed 3D edits with text instructions compared to prior work. △ Less

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2305.17916 [pdf, other]

Volume Feature Rendering for Fast Neural Radiance Field Reconstruction

Authors: Kang Han, Wei Xiang, Lu Yu

Abstract: Neural radiance fields (NeRFs) are able to synthesize realistic novel views from multi-view images captured from distinct positions and perspectives. In NeRF's rendering pipeline, neural networks are used to represent a scene independently or transform queried learnable feature vector of a point to the expected color or density. With the aid of geometry guides either in occupancy grids or proposal… ▽ More Neural radiance fields (NeRFs) are able to synthesize realistic novel views from multi-view images captured from distinct positions and perspectives. In NeRF's rendering pipeline, neural networks are used to represent a scene independently or transform queried learnable feature vector of a point to the expected color or density. With the aid of geometry guides either in occupancy grids or proposal networks, the number of neural network evaluations can be reduced from hundreds to dozens in the standard volume rendering framework. Instead of rendering yielded color after neural network evaluation, we propose to render the queried feature vectors of a ray first and then transform the rendered feature vector to the final pixel color by a neural network. This fundamental change to the standard volume rendering framework requires only one single neural network evaluation to render a pixel, which substantially lowers the high computational complexity of the rendering framework attributed to a large number of neural network evaluations. Consequently, we can use a comparably larger neural network to achieve a better rendering quality while maintaining the same training and rendering time costs. Our model achieves the state-of-the-art rendering quality on both synthetic and real-world datasets while requiring a training time of several minutes. △ Less

Submitted 31 May, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2305.16652 [pdf, other]

Faster Ray Tracing through Hierarchy Cut Code

Authors: WeiLai Xiang, FengQi Liu, Dan Li, ZhaoNan Tan, PengZhan Xu, MeiZhi Liu, QiLong Kou

Abstract: We propose a novel ray reordering technique to accelerate the ray tracing process by encoding and sorting rays prior to traversal. Instead of spatial coordinates, our method encodes rays according to the cuts of the hierarchical acceleration structure, which is called the hierarchy cut code. This approach can better adapt to the acceleration structure and obtain a more reliable encoding result. We… ▽ More We propose a novel ray reordering technique to accelerate the ray tracing process by encoding and sorting rays prior to traversal. Instead of spatial coordinates, our method encodes rays according to the cuts of the hierarchical acceleration structure, which is called the hierarchy cut code. This approach can better adapt to the acceleration structure and obtain a more reliable encoding result. We also propose a compression scheme to decrease the sorting overhead by a shorter sorting key. In addition, based on the phenomenon of boundary drift, we theoretically explain the reason why existing reordering methods cannot achieve better performance by using longer sorting keys. The experiment demonstrates that our method can accelerate secondary ray tracing by up to 1.81 times, outperforming the existing methods. Such result proves the effectiveness of hierarchy cut code, and indicate that the reordering technique can achieve greater performance improvement, which worth further research. △ Less

Submitted 19 July, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2305.16437 [pdf, other]

KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration

Authors: Xu Bao, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Wangmeng Xiang, **gdong Sun, Hanbing Liu, Wei Liu, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Accurate facial landmark detection is critical for facial analysis tasks, yet prevailing heatmap and coordinate regression methods grapple with prohibitive computational costs and quantization errors. Through comprehensive theoretical analysis and experimentation, we identify and elucidate the limitations of existing techniques. To overcome these challenges, we pioneer the application of True-Rang… ▽ More Accurate facial landmark detection is critical for facial analysis tasks, yet prevailing heatmap and coordinate regression methods grapple with prohibitive computational costs and quantization errors. Through comprehensive theoretical analysis and experimentation, we identify and elucidate the limitations of existing techniques. To overcome these challenges, we pioneer the application of True-Range Multilateration, originally devised for GPS localization, to facial landmark detection. We propose KeyPoint Positioning System (KeyPosS) - the first framework to deduce exact landmark coordinates by triangulating distances between points of interest and anchor points predicted by a fully convolutional network. A key advantage of KeyPosS is its plug-and-play nature, enabling flexible integration into diverse decoding pipelines. Extensive experiments on four datasets demonstrate state-of-the-art performance, with KeyPosS outperforming existing methods in low-resolution settings despite minimal computational overhead. By spearheading the integration of Multilateration with facial analysis, KeyPosS marks a paradigm shift in facial landmark detection. The code is available at https://github.com/zhiqic/KeyPosS. △ Less

Submitted 23 September, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPosS

arXiv:2305.10866 [pdf, other]

TEPrompt: Task Enlightenment Prompt Learning for Implicit Discourse Relation Recognition

Authors: Wei Xiang, Chao Liang, Bang Wang

Abstract: Implicit Discourse Relation Recognition (IDRR) aims at classifying the relation sense between two arguments without an explicit connective. Recently, the ConnPrompt~\cite{Wei.X:et.al:2022:COLING} has leveraged the powerful prompt learning for IDRR based on the fusion of multi-prompt decisions from three different yet much similar connective prediction templates. Instead of multi-prompt ensembling,… ▽ More Implicit Discourse Relation Recognition (IDRR) aims at classifying the relation sense between two arguments without an explicit connective. Recently, the ConnPrompt~\cite{Wei.X:et.al:2022:COLING} has leveraged the powerful prompt learning for IDRR based on the fusion of multi-prompt decisions from three different yet much similar connective prediction templates. Instead of multi-prompt ensembling, we propose to design auxiliary tasks with enlightened prompt learning for the IDRR task. Although an auxiliary task is not used to directly output final prediction, we argue that during the joint training some of its learned features can be useful to boost the main task. In light of such motivations, we propose a task enlightenment prompt learning model, called TEPrompt, to fuse learned features from three related tasks for IDRR. In particular, the TEPrompt contains three tasks, viz., Discourse Relation Recognition (DRR), Sense Semantics Classification (SSC) and Annotated Connective Prediction (ACP), each with a unique prompt template and an answer space. In the training phase, we jointly train three prompt learning tasks with shared argument representation. In the testing phase, we only take the DRR output with fused features as the final IDRR decision. Experiments with the same conditions have shown that the proposed TEPrompt outperforms the ConnPrompt. This can be attributed to the promoted decision features and language models benefited from joint-training of auxiliary tasks. △ Less

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2304.13812 [pdf, other]

Guaranteed Quantization Error Computation for Neural Network Model Compression

Authors: Wesley Cooke, Zihao Mo, Weiming Xiang

Abstract: Neural network model compression techniques can address the computation issue of deep neural networks on embedded devices in industrial systems. The guaranteed output error computation problem for neural network compression with quantization is addressed in this paper. A merged neural network is built from a feedforward neural network and its quantized version to produce the exact output differenc… ▽ More Neural network model compression techniques can address the computation issue of deep neural networks on embedded devices in industrial systems. The guaranteed output error computation problem for neural network compression with quantization is addressed in this paper. A merged neural network is built from a feedforward neural network and its quantized version to produce the exact output difference between two neural networks. Then, optimization-based methods and reachability analysis methods are applied to the merged neural network to compute the guaranteed quantization error. Finally, a numerical example is proposed to validate the applicability and effectiveness of the proposed approach. △ Less

Submitted 26 April, 2023; originally announced April 2023.

arXiv:2304.13811 [pdf, other]

A Data-Driven Hybrid Automaton Framework to Modeling Complex Dynamical Systems

Authors: Yejiang Yang, Zihao Mo, Weiming Xiang

Abstract: In this paper, a computationally efficient data-driven hybrid automaton model is proposed to capture unknown complex dynamical system behaviors using multiple neural networks. The sampled data of the system is divided by valid partitions into groups corresponding to their topologies and based on which, transition guards are defined. Then, a collection of small-scale neural networks that are comput… ▽ More In this paper, a computationally efficient data-driven hybrid automaton model is proposed to capture unknown complex dynamical system behaviors using multiple neural networks. The sampled data of the system is divided by valid partitions into groups corresponding to their topologies and based on which, transition guards are defined. Then, a collection of small-scale neural networks that are computationally efficient are trained as the local dynamical description for their corresponding topologies. After modeling the system with a neural-network-based hybrid automaton, the set-valued reachability analysis with low computation cost is provided based on interval analysis and a split and combined process. At last, a numerical example of the limit cycle is presented to illustrate that the developed models can significantly reduce the computational cost in reachable set computation without sacrificing any modeling precision. △ Less

Submitted 26 April, 2023; originally announced April 2023.

arXiv:2303.17144 [pdf, other]

DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Wangmeng Xiang, Binghui Chen, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Real-time perception, or streaming perception, is a crucial aspect of autonomous driving that has yet to be thoroughly explored in existing research. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms, delivering a cutting-edge solution. The key innovat… ▽ More Real-time perception, or streaming perception, is a crucial aspect of autonomous driving that has yet to be thoroughly explored in existing research. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms, delivering a cutting-edge solution. The key innovations of DAMO-StreamNet are (1) A robust neck structure incorporating deformable convolution, enhancing the receptive field and feature alignment capabilities (2) A dual-branch structure that integrates short-path semantic features and long-path temporal features, improving motion state prediction accuracy. (3) Logits-level distillation for efficient optimization, aligning the logits of teacher and student networks in semantic space. (4) A real-time forecasting mechanism that updates support frame features with the current frame, ensuring seamless streaming perception during inference. Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% (normal size (600, 960)) and 43.3% (large size (1200, 1920)) sAP without using extra data. This work not only sets a new benchmark for real-time perception but also provides valuable insights for future research. Additionally, DAMO-StreamNet can be applied to various autonomous systems, such as drones and robots, paving the way for real-time perception. The code is at https://github.com/zhiqic/DAMO-StreamNet. △ Less

Submitted 20 May, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: Accepted to IJCAI 2023; 9 pages, 4 figures, 6 tables; the code is at https://github.com/zhiqic/DAMO-StreamNet

Journal ref: In the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023)

arXiv:2303.14395 [pdf, other]

MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos

Authors: Minghan Li, Shuai Li, Wangmeng Xiang, Lei Zhang

Abstract: While impressive progress has been achieved, video instance segmentation (VIS) methods with per-clip input often fail on challenging videos with occluded objects and crowded scenes. This is mainly because instance queries in these methods cannot encode well the discriminative embeddings of instances, making the query-based segmenter difficult to distinguish those `hard' instances. To address these… ▽ More While impressive progress has been achieved, video instance segmentation (VIS) methods with per-clip input often fail on challenging videos with occluded objects and crowded scenes. This is mainly because instance queries in these methods cannot encode well the discriminative embeddings of instances, making the query-based segmenter difficult to distinguish those `hard' instances. To address these issues, we propose to mine discriminative query embeddings (MDQE) to segment occluded instances on challenging videos. First, we initialize the positional embeddings and content features of object queries by considering their spatial contextual information and the inter-frame object motion. Second, we propose an inter-instance mask repulsion loss to distance each instance from its nearby non-target instances. The proposed MDQE is the first VIS method with per-clip input that achieves state-of-the-art results on challenging videos and competitive performance on simple videos. In specific, MDQE with ResNet50 achieves 33.0\% and 44.5\% mask AP on OVIS and YouTube-VIS 2021, respectively. Code of MDQE can be found at \url{https://github.com/MinghanLi/MDQE_CVPR2023}. △ Less

Submitted 25 March, 2023; originally announced March 2023.

Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023

arXiv:2303.09769 [pdf, other]

Denoising Diffusion Autoencoders are Unified Self-supervised Learners

Authors: Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang

Abstract: Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners: by pre-training on unconditional image ge… ▽ More Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners: by pre-training on unconditional image generation, DDAE has already learned strongly linear-separable representations within its intermediate layers without auxiliary encoders, thus making diffusion pre-training emerge as a general approach for generative-and-discriminative dual learning. To validate this, we conduct linear probe and fine-tuning evaluations. Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is comparable to contrastive learning and masked autoencoders for the first time. Transfer learning from ImageNet also confirms the suitability of DDAE for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models. Code is available at github.com/FutureXiang/ddae. △ Less

Submitted 19 August, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: ICCV 2023 Oral

arXiv:2303.08525 [pdf, other]

MRGAN360: Multi-stage Recurrent Generative Adversarial Network for 360 Degree Image Saliency Prediction

Authors: Pan Gao, Xinlang Chen, Rong Quan, Wei Xiang

Abstract: Thanks to the ability of providing an immersive and interactive experience, the uptake of 360 degree image content has been rapidly growing in consumer and industrial applications. Compared to planar 2D images, saliency prediction for 360 degree images is more challenging due to their high resolutions and spherical viewing ranges. Currently, most high-performance saliency prediction models for omn… ▽ More Thanks to the ability of providing an immersive and interactive experience, the uptake of 360 degree image content has been rapidly growing in consumer and industrial applications. Compared to planar 2D images, saliency prediction for 360 degree images is more challenging due to their high resolutions and spherical viewing ranges. Currently, most high-performance saliency prediction models for omnidirectional images (ODIs) rely on deeper or broader convolutional neural networks (CNNs), which benefit from CNNs' superior feature representation capabilities while suffering from their high computational costs. In this paper, inspired by the human visual cognitive process, i.e., human being's perception of a visual scene is always accomplished by multiple stages of analysis, we propose a novel multi-stage recurrent generative adversarial networks for ODIs dubbed MRGAN360, to predict the saliency maps stage by stage. At each stage, the prediction model takes as input the original image and the output of the previous stage and outputs a more accurate saliency map. We employ a recurrent neural network among adjacent prediction stages to model their correlations, and exploit a discriminator at the end of each stage to supervise the output saliency map. In addition, we share the weights among all the stages to obtain a lightweight architecture that is computationally cheap. Extensive experiments are conducted to demonstrate that our proposed model outperforms the state-of-the-art model in terms of both prediction accuracy and model size. △ Less

Submitted 15 March, 2023; originally announced March 2023.

arXiv:2303.04935 [pdf, other]

X-Pruner: eXplainable Pruning for Vision Transformers

Authors: Lu Yu, Wei Xiang

Abstract: Recently vision transformer models have become prominent models for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, which overlook the relationship between internal units of the model and t… ▽ More Recently vision transformer models have become prominent models for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, which overlook the relationship between internal units of the model and the target class, thereby leading to inferior performance. To alleviate this problem, we propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion. Specifically, to measure each prunable unit's contribution to predicting each target class, a novel explainability-aware mask is proposed and learned in an end-to-end manner. Then, to preserve the most informative units and learn the layer-wise pruning rate, we adaptively search the layer-wise threshold that differentiates between unpruned and pruned units based on their explainability-aware mask values. To verify and evaluate our method, we apply the X-Pruner on representative transformer models including the DeiT and Swin Transformer. Comprehensive simulation results demonstrate that the proposed X-Pruner outperforms the state-of-the-art black-box methods with significantly reduced computational costs and slight performance degradation. △ Less

Submitted 5 June, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2303.03808 [pdf, other]

Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis

Authors: Kang Han, Wei Xiang

Abstract: Rendering novel views from captured multi-view images has made considerable progress since the emergence of the neural radiance field. This paper aims to further advance the quality of view synthesis by proposing a novel approach dubbed the neural radiance feature field (NRFF). We first propose a multiscale tensor decomposition scheme to organize learnable features so as to represent scenes from c… ▽ More Rendering novel views from captured multi-view images has made considerable progress since the emergence of the neural radiance field. This paper aims to further advance the quality of view synthesis by proposing a novel approach dubbed the neural radiance feature field (NRFF). We first propose a multiscale tensor decomposition scheme to organize learnable features so as to represent scenes from coarse to fine scales. We demonstrate many benefits of the proposed multiscale representation, including more accurate scene shape and appearance reconstruction, and faster convergence compared with the single-scale representation. Instead of encoding view directions to model view-dependent effects, we further propose to encode the rendering equation in the feature space by employing the anisotropic spherical Gaussian mixture predicted from the proposed multiscale representation. The proposed NRFF improves state-of-the-art rendering results by over 1 dB in PSNR on both the NeRF and NSVF synthetic datasets. A significant improvement has also been observed on the real-world Tanks & Temples dataset. Code can be found at https://github.com/imkanghan/nrff. △ Less

Submitted 27 May, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

arXiv:2302.14302 [pdf, other]

Improving Model Generalization by On-manifold Adversarial Augmentation in the Frequency Domain

Authors: Chang Liu, Wenzhao Xiang, Yuan He, Hui Xue, Shibao Zheng, Hang Su

Abstract: Deep neural networks (DNNs) may suffer from significantly degenerated performance when the training and test data are of different underlying distributions. Despite the importance of model generalization to out-of-distribution (OOD) data, the accuracy of state-of-the-art (SOTA) models on OOD data can plummet. Recent work has demonstrated that regular or off-manifold adversarial examples, as a spec… ▽ More Deep neural networks (DNNs) may suffer from significantly degenerated performance when the training and test data are of different underlying distributions. Despite the importance of model generalization to out-of-distribution (OOD) data, the accuracy of state-of-the-art (SOTA) models on OOD data can plummet. Recent work has demonstrated that regular or off-manifold adversarial examples, as a special case of data augmentation, can be used to improve OOD generalization. Inspired by this, we theoretically prove that on-manifold adversarial examples can better benefit OOD generalization. Nevertheless, it is nontrivial to generate on-manifold adversarial examples because the real manifold is generally complex. To address this issue, we proposed a novel method of Augmenting data with Adversarial examples via a Wavelet module (AdvWavAug), an on-manifold adversarial data augmentation technique that is simple to implement. In particular, we project a benign image into a wavelet domain. With the assistance of the sparsity characteristic of wavelet transformation, we can modify an image on the estimated data manifold. We conduct adversarial augmentation based on AdvProp training framework. Extensive experiments on different models and different datasets, including ImageNet and its distorted versions, demonstrate that our method can improve model generalization, especially on OOD data. By integrating AdvWavAug into the training process, we have achieved SOTA results on some recent transformer-based models. △ Less

Submitted 8 June, 2024; v1 submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.14301 [pdf, other]

A Comprehensive Study on Robustness of Image Classification Models: Benchmarking and Rethinking

Authors: Chang Liu, Yinpeng Dong, Wenzhao Xiang, Xiao Yang, Hang Su, Jun Zhu, Yuefeng Chen, Yuan He, Hui Xue, Shibao Zheng

Abstract: The robustness of deep neural networks is usually lacking under adversarial examples, common corruptions, and distribution shifts, which becomes an important research problem in the development of deep learning. Although new deep learning methods and robustness improvement techniques have been constantly proposed, the robustness evaluations of existing methods are often inadequate due to their rap… ▽ More The robustness of deep neural networks is usually lacking under adversarial examples, common corruptions, and distribution shifts, which becomes an important research problem in the development of deep learning. Although new deep learning methods and robustness improvement techniques have been constantly proposed, the robustness evaluations of existing methods are often inadequate due to their rapid development, diverse noise patterns, and simple evaluation metrics. Without thorough robustness evaluations, it is hard to understand the advances in the field and identify the effective methods. In this paper, we establish a comprehensive robustness benchmark called \textbf{ARES-Bench} on the image classification task. In our benchmark, we evaluate the robustness of 55 typical deep learning models on ImageNet with diverse architectures (e.g., CNNs, Transformers) and learning algorithms (e.g., normal supervised training, pre-training, adversarial training) under numerous adversarial attacks and out-of-distribution (OOD) datasets. Using robustness curves as the major evaluation criteria, we conduct large-scale experiments and draw several important findings, including: 1) there is an inherent trade-off between adversarial and natural robustness for the same model architecture; 2) adversarial training effectively improves adversarial robustness, especially when performed on Transformer architectures; 3) pre-training significantly improves natural robustness based on more training data or self-supervised learning. Based on ARES-Bench, we further analyze the training tricks in large-scale adversarial training on ImageNet. By designing the training settings accordingly, we achieve the new state-of-the-art adversarial robustness. We have made the benchmarking results and code platform publicly available. △ Less

Submitted 27 February, 2023; originally announced February 2023.

Comments: International Journal of Computer Vision (IJCV) [under review]

Showing 1–50 of 125 results for author: Xiang, W