Search | arXiv e-print repository

arXiv:2406.19934 [pdf, other]

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

Authors: Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

Abstract: We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we construct $50$k visual reasoning examples. Then, we develop a visual reasoner through supervised fine-tuning, which is capable of generally enhancing the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner can consistently and significantly improve four VLMs on four VQA benchmarks. Our code and dataset are available at https://github.com/steven-ccq/VisualReasoner.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.16708 [pdf, other]

CausalFormer: An Interpretable Transformer for Temporal Causal Discovery

Authors: Lingbai Kong, Wengen Li, Hanchen Yang, Yichao Zhang, Jihong Guan, Shuigeng Zhou

Abstract: Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping process from the model parameters to the causality and fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the utilization of the whole deep learning models in temporal causal discovery, we propose an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution, which aggregates each input time series along the temporal dimension under the temporal priority constraint. Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer on discovering temporal causality. Our code is available at https://github.com/lingbai-kong/CausalFormer.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2405.18729 [pdf, other]

Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

Authors: Tianle Zhang, Jiayi Guan, Lin Zhao, Yihang Li, Dongjiang Li, Zecui Zeng, Lei Sun, Yue Chen, Xuelong Wei, Lusong Li, Xiaodong He

Abstract: Offline reinforcement learning (RL) aims to learn optimal policies from previously collected datasets. Recently, due to their powerful representational capabilities, diffusion models have shown significant potential as policy models for offline RL issues. However, previous offline RL algorithms based on diffusion policies generally adopt weighted regression to improve the policy. This approach optimizes the policy only using the collected actions and is sensitive to Q-values, which limits the potential for further performance enhancement. To this end, we propose a novel preferred-action-optimized diffusion policy for offline RL. In particular, an expressive conditional diffusion model is utilized to represent the diverse distribution of a behavior policy. Meanwhile, based on the diffusion model, preferred actions within the same behavior distribution are automatically generated through the critic function. Moreover, an anti-noise preference optimization is designed to achieve policy improvement by using the preferred actions, which can adapt to noise-preferred actions for stable training. Extensive experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods, particularly in sparse reward tasks such as Kitchen and AntMaze. Additionally, we empirically prove the effectiveness of anti-noise preference optimization.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.15769 [pdf, other]

FastDrag: Manipulate Anything in One Step

Authors: Xuanjia Zhao, Jian Guan, Congyi Fan, Dongli Xu, Youtian Lin, Haiwei Pan, Pengming Feng

Abstract: Drag-based image editing using generative models provides precise control over image contents, enabling users to manipulate anything in an image with a few clicks. However, prevailing methods typically adopt $n$-step iterations for latent semantic optimization to achieve drag-based image editing, which is time-consuming and limits practical applications. In this paper, we introduce a novel one-step drag-based image editing method, i.e., FastDrag, to accelerate the editing process. Central to our approach is a latent warpage function (LWF), which simulates the behavior of a stretched material to adjust the location of individual pixels within the latent space. This innovation achieves one-step latent semantic optimization and hence significantly promotes editing speeds. Meanwhile, null regions emerging after applying LWF are addressed by our proposed bilateral nearest neighbor interpolation (BNNI) strategy. This strategy interpolates these regions using similar features from neighboring areas, thus enhancing semantic integrity. Additionally, a consistency-preserving strategy is introduced to maintain the consistency between the edited and original images by adopting semantic information from the original image, saved as key and value pairs in self-attention module during diffusion inversion, to guide the diffusion sampling. Our FastDrag is validated on the DragBench dataset, demonstrating substantial improvements in processing time over existing methods, while achieving enhanced editing performance. Project page: https://fastdrag-site.github.io/ .

Submitted 6 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: 13 pages, 13 figures, Project page: https://fastdrag-site.github.io/

arXiv:2404.13528 [pdf, other]

doi 10.1145/3620666.3651384

SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

Authors: Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren

Abstract: This work is motivated by recent developments in Deep Neural Networks, particularly the Transformer architectures underlying applications such as ChatGPT, and the need for performing inference on mobile devices. Focusing on emerging transformers (specifically the ones with computationally efficient Swin-like architectures) and large models (e.g., Stable Diffusion and LLMs) based on transformers, we observe that layout transformations between the computational operators cause a significant slowdown in these applications. This paper presents SmartMem, a comprehensive framework for eliminating most layout transformations, with the idea that multiple operators can use the same tensor layout through careful choice of layout and implementation of operations. Our approach is based on classifying the operators into four groups, and considering combinations of producer-consumer edges between the operators. We develop a set of methods for searching such layouts. Another component of our work is developing efficient memory layouts for 2.5 dimensional memory commonly seen in mobile devices. Our experimental results show that SmartMem outperforms 5 state-of-the-art DNN execution frameworks on mobile devices across 18 varied neural networks, including CNNs, Transformers with both local and global attention, as well as LLMs. In particular, compared to DNNFusion, SmartMem achieves an average speedup of 2.8$\times$, and outperforms TVM and MNN with speedups of 6.9$\times$ and 7.9$\times$, respectively, on average.

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024
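
Note: the PSNR threshold quoted in the abstract above follows the standard definition, PSNR = 10 log10(MAX^2 / MSE). The following is a minimal sketch of that formula only, assuming 8-bit intensities passed as flat sequences; the function name `psnr` and its parameters are illustrative and are not taken from the challenge's evaluation code, which may apply additional conventions (such as border cropping or color-space conversion) not stated in the abstract.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel intensities (an assumption made
    for this sketch; real pipelines usually work on 2-D or 3-D arrays).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value means the super-resolved output is closer to the ground-truth image, which is why the challenge fixes a PSNR floor and then ranks entries on efficiency.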

arXiv:2403.15759 [pdf]

Deep Learning Approach to Forecasting COVID-19 Cases in Residential Buildings of Hong Kong Public Housing Estates: The Role of Environment and Sociodemographics

Authors: E. Leung, J. Guan, KO. Kwok, CT. Hung, CC. Ching, KC. Chong, CHK. Yam, T. Sun, WH. Tsang, EK. Yeoh, A. Lee

Abstract: Introduction: The current study investigates the complex association between COVID-19 and the studied districts' socioecology (e.g. internal and external built environment, sociodemographic profiles, etc.) to quantify their contributions to the early outbreaks and epidemic resurgence of COVID-19. Methods: We aligned the analytic model's architecture with the hierarchical structure of the resident's socioecology using a multi-headed hierarchical convolutional neural network to structure the vast array of hierarchically related predictive features representing buildings' internal and external built environments and residents' sociodemographic profiles as model input. COVID-19 cases accumulated in buildings across three adjacent districts in HK, both before and during HK's epidemic resurgence, were modeled. A forward-chaining validation was performed to examine the model's performance in forecasting COVID-19 cases over the 3-, 7-, and 14-day horizons during the two months subsequent to when the model for COVID-19 resurgence was built to align with the forecasting needs in an evolving pandemic. Results: Different sets of factors were found to be linked to the earlier waves of COVID-19 outbreaks compared to the epidemic resurgence of the pandemic. Sociodemographic factors such as work hours, monthly household income, employment types, and the number of non-working adults or children in household populations were of high importance to the studied buildings' COVID-19 case counts during the early waves of COVID-19. Factors constituting one's internal built environment, such as the number of distinct households in the buildings, the number of distinct households per floor, and the number of floors, corridors, and lifts, had the greatest unique contributions to the building-level COVID-19 case counts during epidemic resurgence.

Submitted 23 March, 2024; originally announced March 2024.
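
Note: the forward-chaining validation mentioned in the abstract above is commonly implemented as an expanding-window split over time-ordered samples, where each fold trains on everything up to a cutoff and tests on the next block. The sketch below illustrates that general scheme only; the function name `forward_chaining_splits` and its parameters are hypothetical, and the exact windowing used by the authors is not specified in the abstract.

```python
def forward_chaining_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, test_indices) pairs with an expanding train window.

    n_samples:     number of time-ordered observations
    initial_train: size of the first training window
    horizon:       number of future steps forecast in each fold
    """
    if initial_train <= 0 or horizon <= 0:
        raise ValueError("initial_train and horizon must be positive")
    train_end = initial_train
    # Stop once a full test block no longer fits in the remaining data.
    while train_end + horizon <= n_samples:
        train = list(range(train_end))
        test = list(range(train_end, train_end + horizon))
        yield train, test
        train_end += horizon  # the tested block joins the next training set


# Example: 10 daily observations, 4 initial training days, 3-day horizon.
for train_idx, test_idx in forward_chaining_splits(10, 4, 3):
    print(train_idx, "->", test_idx)
```

Because the test block always lies strictly after the training window, no future information leaks into training, which is the property that makes this scheme suitable for forecasting.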

arXiv:2403.13842 [pdf]

Analyzing the Variations in Emergency Department Boarding and Testing the Transferability of Forecasting Models across COVID-19 Pandemic Waves in Hong Kong: Hybrid CNN-LSTM approach to quantifying building-level socioecological risk

Authors: Eman Leung, Jingjing Guan, Kin On Kwok, CT Hung, CC. Ching, CK. Chung, Hector Tsang, EK Yeoh, Albert Lee

Abstract: Emergency department (ED) boarding (defined as ED waiting time greater than four hours) has been linked to poor patient outcomes and health system performance. Yet, effective forecasting models were rare before COVID-19 and are lacking during the peri-COVID era. Here, a hybrid convolutional neural network (CNN)-long short-term memory (LSTM) model was applied to public-domain data sourced from Hong Kong's Hospital Authority, Department of Health, and Housing Authority. In addition, we sought to identify the phase of the COVID-19 pandemic that most significantly perturbed our complex adaptive healthcare system, thereby revealing a stable pattern of interconnectedness among its components, using deep transfer learning methodology. Our results show that 1) the greatest proportion of days with ED boarding was found between waves four and five; 2) the best-performing model for forecasting ED boarding was observed between waves four and five, which was based on features representing time-invariant residential buildings' built environment and sociodemographic profiles and the historical time series of ED boarding and case counts, compared to during the waves when best-performing forecasting is based on time-series features alone; and 3) when the model built from the period between waves four and five was applied to data from other waves via deep transfer learning, the transferred model enhanced the performance of indigenous models.

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.07902 [pdf, other]

DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design

Authors: Jiaqi Guan, Xiangxin Zhou, Yuwei Yang, Yu Bao, Jian Peng, Jianzhu Ma, Qiang Liu, Liang Wang, Quanquan Gu

Abstract: Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structure-based drug design methods treat all ligand atoms equally, which ignores different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to -8.39 Avg. Vina Dock score and 24.5 Success Rate. The code is provided at https://github.com/bytedance/DecompDiff

Submitted 26 February, 2024; originally announced March 2024.

Comments: Accepted to ICML 2023

arXiv:2403.07040 [pdf, other]

All in One: Multi-Task Prompting for Graph Neural Networks (Extended Abstract)

Authors: Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, Jihong Guan

Abstract: This paper is an extended abstract of our original work published in KDD23, where we won the best research paper award (Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. KDD 23). The paper introduces a novel approach to bridging the gap between pre-trained graph models and the diverse tasks they're applied to, inspired by the success of prompt learning in NLP. Recognizing the challenge of aligning pre-trained models with varied graph tasks (node level, edge level, and graph level), which can lead to negative transfer and poor performance, we propose a multi-task prompting method for graphs. This method involves unifying graph and language prompt formats, enabling NLP's prompting strategies to be adapted for graph tasks. By analyzing the task space of graph applications, we reformulate problems to fit graph-level tasks and apply meta-learning to improve prompt initialization for multiple tasks. Experiments show our method's effectiveness in enhancing model performance across different graph tasks. Beyond the original work, in this extended abstract, we further discuss the graph prompt from a bigger picture and provide some of the latest work toward this area.

Submitted 11 March, 2024; originally announced March 2024.

Comments: submitted to IJCAI 2024 Sister Conferences Track. The original paper can be seen at arXiv:2307.01504

arXiv:2403.03218 [pdf, other]

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, et al. (32 additional authors not shown)

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: See the project page at https://wmdp.ai

arXiv:2402.19101 [pdf, other]

Effective Two-Stage Knowledge Transfer for Multi-Entity Cross-Domain Recommendation

Authors: Jianyu Guan, Zongming Yin, Tianyi Zhang, Leihui Chen, Yin Zhang, Fei Huang, Jufeng Chen, Shuguang Han

Abstract: In recent years, the recommendation content on e-commerce platforms has become increasingly rich -- a single user feed may contain multiple entities, such as selling products, short videos, and content posts. To deal with the multi-entity recommendation problem, an intuitive solution is to adopt the shared-network-based architecture for joint training. The idea is to transfer the extracted knowledge from one type of entity (source entity) to another (target entity). However, different from the conventional same-entity cross-domain recommendation, multi-entity knowledge transfer encounters several important issues: (1) data distributions of the source entity and target entity are naturally different, making the shared-network-based joint training susceptible to the negative transfer issue, (2) more importantly, the corresponding feature schema of each entity is not exactly aligned (e.g., price is an essential feature for selling product while missing for content posts), making the existing methods no longer appropriate. Recent researchers have also experimented with the pre-training and fine-tuning paradigm. Again, they only consider the scenarios with the same entity type and feature systems, which is inappropriate in our case. To this end, we design a pre-training & fine-tuning based Multi-entity Knowledge Transfer framework called MKT. MKT utilizes a multi-entity pre-training module to extract transferable knowledge across different entities. In particular, a feature alignment module is first applied to scale and align different feature schemas. Afterward, a couple of knowledge extractors are employed to extract the common and entity-specific knowledge. In the end, the extracted common knowledge is adopted for target entity model training. Through extensive offline and online experiments, we demonstrated the superiority of MKT over multiple State-Of-The-Art methods.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.03708 [pdf, other]

SISP: A Benchmark Dataset for Fine-grained Ship Instance Segmentation in Panchromatic Satellite Images

Authors: Pengming Feng, Mingjie Xie, Hongning Liu, Xuanjia Zhao, Guangjun He, Xueliang Zhang, Jian Guan

Abstract: Fine-grained ship instance segmentation in satellite images holds considerable significance for monitoring maritime activities at sea. However, existing datasets often suffer from the scarcity of fine-grained information or pixel-wise localization annotations, as well as the insufficient image diversity and variations, thus limiting the research of this task. To this end, we propose a benchmark dataset for fine-grained Ship Instance Segmentation in Panchromatic satellite images, namely SISP, which contains 56,693 well-annotated ship instances with four fine-grained categories across 10,000 sliced images, and all the images are collected from SuperView-1 satellite with the resolution of 0.5m. Targets in the proposed SISP dataset have characteristics that are consistent with real satellite scenes, such as high class imbalance, various scenes, large variations in target densities and scales, and high inter-class similarity and intra-class diversity, all of which make the SISP dataset more suitable for real-world applications. In addition, we introduce a Dynamic Feature Refinement-assist Instance segmentation network, namely DFRInst, as the benchmark method for ship instance segmentation in satellite images, which can fortify the explicit representation of crucial features, thus improving the performance of ship instance segmentation. Experiments and analysis are performed on the proposed SISP dataset to evaluate the benchmark method and several state-of-the-art methods to establish baselines for facilitating future research. The proposed dataset and source codes will be available at: https://github.com/Justlovesmile/SISP.

Submitted 6 February, 2024; originally announced February 2024.

Comments: 14 pages, 9 figures

arXiv:2402.01469 [pdf, other]

AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

Authors: Jian Guan, Wei Wu, Zujie Wen, Peng Xu, Hongning Wang, Minlie Huang

Abstract: The notable success of large language models (LLMs) has sparked an upsurge in building language agents to complete various complex tasks. We present AMOR, an agent framework based on open-source LLMs, which reasons with external knowledge bases and adapts to specific domains through human supervision to the reasoning process. AMOR builds reasoning logic over a finite state machine (FSM) that solves problems through autonomous executions and transitions over disentangled modules. This allows humans to provide direct feedback to the individual modules, and thus naturally forms process supervision. Based on this reasoning and feedback framework, we develop AMOR through two-stage fine-tuning: warm-up and adaptation. The former fine-tunes the LLM with examples automatically constructed from various public datasets and enables AMOR to generalize across different knowledge environments, while the latter tailors AMOR to specific domains using process feedback. Extensive experiments across multiple domains demonstrate the advantage of AMOR to strong baselines, thanks to its FSM-based reasoning and process feedback mechanism.

Submitted 2 February, 2024; originally announced February 2024.

Comments: Work in progress

arXiv:2401.11040 [pdf, other]

Design Frameworks for Spatial Zone Agents in XRI Metaverse Smart Environments

Authors: Jie Guan, Jiamin Liu, Alexis Morris

Abstract: The spatial XR-IoT (XRI) Zone Agents concept combines Extended Reality (XR), the Internet of Things (IoT), and spatial computing concepts to create hyper-connected spaces for metaverse applications, envisioning space as zones that are social, smart, scalable, expressive, and agent-based. These zone agents serve as applications and agents (partners, assistants, or guides) for users co-living and co-operating together in a shared spatial context. The zone agent concept aims to reduce the gap between the physical environment (space) and the classical two-dimensional user interface, through space-based interactions for future metaverse applications. This integration aims to enrich user engagement with their environments through intuitive and immersive experiences and pave the way for innovative human-machine interaction in smart spaces. Contributions include: i) a theoretical framework for creating XRI zone/space-agents using Mixed-Reality Agents (MiRAs) and XRI theory, ii) agent and scene design for spatial zone agents, and iii) prototype and user interaction design scenario concepts for human-to-space agent relationships in an early immersive smart-space application.

Submitted 19 January, 2024; originally announced January 2024.

Journal ref: 6th IEEE International Conference on Artificial Intelligence & extended and Virtual Reality (IEEE AIxVR 2024)

arXiv:2312.16855 [pdf, other]

Molecular Property Prediction Based on Graph Structure Learning

Authors: Bangyi Zhao, Weixia Xu, Jihong Guan, Shuigeng Zhou

Abstract: Molecular property prediction (MPP) is a fundamental but challenging task in the computer-aided drug discovery process. More and more recent works employ different graph-based models for MPP, which have made considerable progress in improving prediction performance. However, current models often ignore relationships between molecules, which could also be helpful for MPP. To this end, in this paper we propose a graph structure learning (GSL) based MPP approach, called GSL-MPP. Specifically, we first apply a graph neural network (GNN) over molecular graphs to extract molecular representations. Then, with molecular fingerprints, we construct a molecular similarity graph (MSG). Following that, we conduct graph structure learning on the MSG (i.e., molecule-level graph structure learning) to get the final molecular embeddings, which are the results of fusing both GNN encoded molecular representations and the relationships among molecules, i.e., combining both intra-molecule and inter-molecule information. Finally, we use these molecular embeddings to perform MPP. Extensive experiments on seven benchmark datasets show that our method could achieve state-of-the-art performance in most cases, especially on classification tasks. Further visualization studies also demonstrate the good molecular representations of our method.

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.16600 [pdf, other]

scRNA-seq Data Clustering by Cluster-aware Iterative Contrastive Learning

Authors: Weikang Jiang, Jinxian Wang, Jihong Guan, Shuigeng Zhou

Abstract: Single-cell RNA sequencing (scRNA-seq) enables researchers to analyze gene expression at single-cell level. One important task in scRNA-seq data analysis is unsupervised clustering, which helps identify distinct cell types, laying down the foundation for other downstream analysis tasks. In this paper, we propose a novel method called Cluster-aware Iterative Contrastive Learning (CICL in short) for scRNA-seq data clustering, which utilizes an iterative representation learning and clustering framework to progressively learn the clustering structure of scRNA-seq data with a cluster-aware contrastive loss. CICL consists of a Transformer encoder, a clustering head, a projection head and a contrastive loss module. First, CICL extracts the feature vectors of the original and augmented data by the Transformer encoder. Then, it computes the clustering centroids by K-means and employs the student t-distribution to assign pseudo-labels to all cells in the clustering head. The projection head uses a Multi-Layer Perceptron (MLP) to obtain projections of the augmented data. Finally, both pseudo-labels and projections are used in the contrastive loss to guide the model training. This process is repeated iteratively so that the clustering result progressively improves. Extensive experiments on 25 real world scRNA-seq datasets show that CICL outperforms the SOTA methods. Concretely, on average, CICL surpasses the existing methods by 14% to 280% in ARI and by 5% to 133% in NMI.

Submitted 27 December, 2023; originally announced December 2023.
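
The pseudo-labeling step described in the abstract above (centroids from K-means, then Student's t-distribution assignments) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `alpha` degrees-of-freedom parameter, the toy embeddings, and the hand-picked centroids (standing in for K-means output) are all assumptions made for illustration. The kernel form is the one commonly used in deep clustering, which the abstract does not spell out.

```python
def student_t_assignments(embeddings, centroids, alpha=1.0):
    """Soft-assign each embedding to each centroid with a Student's t kernel.

    q[i][j] is proportional to (1 + ||z_i - mu_j||^2 / alpha) ** (-(alpha + 1) / 2),
    normalised so that every row sums to 1.
    """
    assignments = []
    for z in embeddings:
        weights = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        assignments.append([w / total for w in weights])
    return assignments


def pseudo_labels(assignments):
    """Take the most probable cluster of each row as its pseudo-label."""
    return [max(range(len(row)), key=row.__getitem__) for row in assignments]


# Toy 2-D "cell embeddings" (hypothetical) and two centroids that a
# K-means run might plausibly return for them (assumed, not computed here).
embeddings = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
centroids = [[0.1, 0.05], [4.95, 5.05]]

q = student_t_assignments(embeddings, centroids)
labels = pseudo_labels(q)
```

In an iterative scheme of the kind the abstract describes, such pseudo-labels would be recomputed after each round of representation learning; how CICL feeds them into its contrastive loss is not specified in the listing and is not modelled here.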

arXiv:2312.04519 [pdf, other]

Bootstrapping Autonomous Driving Radars with Self-Supervised Learning

Authors: Yiduo Hao, Sohrab Madani, Junfeng Guan, Mohammed Alloulah, Saurabh Gupta, Haitham Hassanieh

Abstract: The perception of autonomous vehicles using radars has attracted increased research interest due to its ability to operate in fog and bad weather. However, training radar models is hindered by the cost and difficulty of annotating large-scale radar data. To overcome this bottleneck, we propose a self-supervised learning framework to leverage the large amount of unlabeled radar data to pre-train radar-only embeddings for self-driving perception tasks. The proposed method combines radar-to-radar and radar-to-vision contrastive losses to learn a general representation from unlabeled radar heatmaps paired with their corresponding camera images. When used for downstream object detection, we demonstrate that the proposed self-supervision framework can improve the accuracy of state-of-the-art supervised baselines by 5.8% in mAP. Code is available at https://github.com/yiduohao/Radical.

Submitted 18 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: 12 pages, 5 figures, to be published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

arXiv:2311.06073 [pdf, other]

Collaborative Inference in DNN-based Satellite Systems with Dynamic Task Streams

Authors: Jinglong Guan, Qiyang Zhang, Ilir Murturi, Praveen Kumar Donta, Schahram Dustdar, Shangguang Wang

Abstract: As a driving force in the advancement of intelligent in-orbit applications, DNN models have been gradually integrated into satellites, producing daily latency-constrained and computation-intensive tasks. However, the substantial computation demands of DNN models, coupled with the instability of the satellite-ground link, pose significant challenges, hindering timely completion of tasks. It becomes necessary to adapt to task stream changes when dealing with tasks requiring latency guarantees, such as dynamic observation tasks on the satellites. To this end, we consider a system model for a collaborative inference system with latency constraints, leveraging the multi-exit and model partition technology. We propose an algorithm tailored to balance the trade-off between task completion and satisfactory task accuracy by dynamically choosing early-exit and partition points. Simulation evaluations show that our proposed algorithm significantly outperforms baseline algorithms across the task stream with strict latency constraints.

Submitted 10 November, 2023; originally announced November 2023.

arXiv:2310.14564 [pdf, other]

Language Models Hallucinate, but May Excel at Fact Verification

Authors: Jian Guan, Jesse Dodge, David Wadden, Minlie Huang, Hao Peng

Abstract: Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," resulting in non-factual outputs. Our carefully-designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT-3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.

Submitted 20 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted in NAACL 2024

arXiv:2310.14173 [pdf, other]

First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation

Authors: Hejing Zhang, Qiaoxi Zhu, Jian Guan, Haohe Liu, Feiyang Xiao, Jiantong Tian, Xinhao Mei, Xubo Liu, Wenwu Wang

Abstract: First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it becomes challenging when adapting the existing ASD methods to the first-shot task. In this paper, we propose a new framework for the first-shot unsupervised ASD, where metadata-assisted audio generation is used to estimate unknown anomalies, by utilising the available machine information (i.e., metadata and sound data) to fine-tune a text-to-audio generation model for generating the anomalous sounds that contain unique acoustic characteristics accounting for each different machine type. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve the first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst top systems in DCASE 2023 Challenge Task 2, while requiring only 1% model parameters for detection, as validated in our experiments.

Submitted 11 March, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

Comments: Accepted at ICASSP 2024

arXiv:2310.08950 [pdf, ps, other]

Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection

Authors: Jian Guan, Youde Liu, Qiuqiang Kong, Feiyang Xiao, Qiaoxi Zhu, Jiantong Tian, Wenwu Wang

Abstract: Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. The autoencoder (AE) and self-supervised learning based methods are two mainstream methods. However, the AE-based methods could be limited as the feature learned from normal sounds can also fit with anomalous sounds, reducing the ability of the model in detecting anomalies from sound. The self-supervised methods are not always stable and perform differently, even for machines of the same type. In addition, the anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the difference in the distribution for the same machine type and enhance the ability of the model in distinguishing anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that only appear for a short time. Experiments performed on the DCASE 2020 Challenge Task 2 development dataset demonstrate the effectiveness and superiority of our proposed method.

Submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted by EURASIP Journal on Audio, Speech, and Music Processing

arXiv:2310.05330 [pdf, other]

A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

Authors: Yang Wang, Jiaogen Zhou, Jihong Guan

Abstract: Video anomaly detection is to determine whether there are any abnormal events, behaviors or objects in a given video, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which the training videos are labeled whether or not they contain any anomalies, but there is no information about which frames the anomalies are located in. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from wide deployment in real scenarios, especially in resource-limited situations such as edge computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which is based on the model's current status to select confident instances, thereby mitigating the uncertainty of weakly labeled data and subsequently promoting the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which can reduce the model parameters to only 0.56% of the existing methods (e.g. RTFM). Our extensive experiments on two public datasets UCF-Crime and ShanghaiTech show that our model can achieve comparable or even superior AUC score compared to the state-of-the-art methods, with a significantly reduced number of model parameters.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2310.04463 [pdf, other]

Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules with Desirable Properties

Authors: Siyuan Guo, Jihong Guan, Shuigeng Zhou

Abstract: In the past decade, Artificial Intelligence driven drug design and discovery has been a hot research topic, where an important branch is molecule generation by generative models, from GAN-based models and VAE-based models to the latest diffusion-based models. However, most existing models pursue only the basic properties like validity and uniqueness of the generated molecules, and a few go further to explicitly optimize one single important molecular property (e.g. QED or PlogP), which makes most generated molecules of little use in practice. In this paper, we present a novel approach to generating molecules with desirable properties, which expands the diffusion model framework with multiple innovative designs. The novelty is two-fold. On the one hand, considering that the structures of molecules are complex and diverse, and molecular properties are usually determined by some substructures (e.g. pharmacophores), we propose to perform diffusion on two structural levels: molecules and molecular fragments respectively, with which a mixed Gaussian distribution is obtained for the reverse diffusion process. To get desirable molecular fragments, we develop a novel electronic effect based fragmentation method. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity by an energy-guidance function. Second, since potential drug molecules should be desirable in various properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments with two benchmark datasets QM9 and ZINC250k show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current SOTA models.

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.09705 [pdf, other]

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

Authors: Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang

Abstract: Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpus to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, the text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals with captions from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.07498 [pdf, other]

Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection Under Domain Shift

Authors: Haiyan Lan, Qiaoxi Zhu, Jian Guan, Yuming Wei, Wenwu Wang

Abstract: Self-supervised learning methods have achieved promising performance for anomalous sound detection (ASD) under domain shift, where the type of domain shift is considered in feature learning by incorporating section IDs. However, the attributes accompanying audio files under each section, such as machine operating conditions and noise types, have not been considered, although they are also crucial for characterizing domain shifts. In this paper, we present a hierarchical metadata information constrained self-supervised (HMIC) ASD method, where the hierarchical relation between section IDs and attributes is constructed, and used as constraints to obtain finer feature representation. In addition, we propose an attribute-group-center (AGC)-based method for calculating the anomaly score under the domain shift condition. Experiments are performed to demonstrate its improved performance over the state-of-the-art self-supervised methods in DCASE 2022 challenge Task 2.

Submitted 18 December, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: To appear at ICASSP 2024

arXiv:2309.04819 [pdf, other]

doi 10.1145/3576915.3623108

Detecting Violations of Differential Privacy for Quantum Algorithms

Authors: Ji Guan, Wang Fang, Mingyu Huang, Mingsheng Ying

Abstract: Quantum algorithms for solving a wide range of practical problems have been proposed in the last ten years, such as data search and analysis, product recommendation, and credit scoring. Concern about privacy and other ethical issues in quantum computing naturally arises. In this paper, we define a formal framework for detecting violations of differential privacy for quantum algorithms. A detection algorithm is developed to verify whether a (noisy) quantum algorithm is differentially private and automatically generate bugging information when the violation of differential privacy is reported. The information consists of a pair of quantum states that violate the privacy, to illustrate the cause of the violation. Our algorithm is equipped with Tensor Networks, a highly efficient data structure, and executed both on TensorFlow Quantum and TorchQuantum, which are the quantum extensions of famous machine learning platforms (TensorFlow and PyTorch, respectively). The effectiveness and efficiency of our algorithm are confirmed by the experimental results of almost all types of quantum algorithms already implemented on realistic quantum computers, including quantum supremacy algorithms (beyond the capability of classical algorithms), quantum machine learning models, quantum approximate optimization algorithms, and variational quantum eigensolvers with up to 21 quantum bits.

Submitted 9 September, 2023; originally announced September 2023.

Journal ref: In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS 2023)

arXiv:2308.14063 [pdf, other]

Anomalous Sound Detection Using Self-Attention-Based Frequency Pattern Analysis of Machine Sounds

Authors: Hejing Zhang, Jian Guan, Qiaoxi Zhu, Feiyang Xiao, Youde Liu

Abstract: Different machines can exhibit diverse frequency patterns in their emitted sound. This feature has been recently explored in anomaly sound detection and reached state-of-the-art performance. However, existing methods rely on the manual or empirical determination of the frequency filter by observing the effective frequency range in the training data, which may be impractical for general application. This paper proposes an anomalous sound detection method using self-attention-based frequency pattern analysis and spectral-temporal information fusion. Our experiments demonstrate that the self-attention module automatically and adaptively analyses the effective frequencies of a machine sound and enhances that information in the spectral feature representation. With spectral-temporal information fusion, the obtained audio feature eventually improves the anomaly detection performance on the DCASE 2020 Challenge Task 2 dataset.

Submitted 6 September, 2023; v1 submitted 27 August, 2023; originally announced August 2023.

Comments: Published in INTERSPEECH 2023

arXiv:2308.09540 [pdf, other]

Meta-ZSDETR: Zero-shot DETR with Meta-learning

Authors: Lu Zhang, Chenbo Zhang, Jiajia Zhao, Jihong Guan, Shuigeng Zhou

Abstract: Zero-shot object detection aims to localize and recognize objects of unseen classes. Most existing works face two problems: the low recall of RPN in unseen classes and the confusion of unseen classes with background. In this paper, we present the first method that combines DETR and meta-learning to perform zero-shot object detection, named Meta-ZSDETR, where model training is formalized as an individual episode based meta-learning task. Different from Faster R-CNN based methods that first generate class-agnostic proposals and then classify them with a visual-semantic alignment module, Meta-ZSDETR directly predicts class-specific boxes with class-specific queries and further filters them with the predicted accuracy from the classification head. The model is optimized with meta-contrastive learning, which contains a regression head to generate the coordinates of class-specific boxes, a classification head to predict the accuracy of generated boxes, and a contrastive head that utilizes the proposed contrastive-reconstruction loss to further separate different classes in visual space. We conduct extensive experiments on two benchmark datasets MS COCO and PASCAL VOC. Experimental results show that our method outperforms the existing ZSD methods by a large margin.

Submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted in ICCV 2023

arXiv:2308.06107 [pdf, other]

Test-Time Backdoor Defense via Detecting and Repairing

Authors: Jiyang Guan, Jian Liang, Ran He

Abstract: Deep neural networks have played a crucial part in many critical domains, such as autonomous driving, face recognition, and medical diagnosis. However, deep neural networks are facing security threats from backdoor attacks and can be manipulated into attacker-decided behaviors by the backdoor attacker. To defend against backdoors, prior research has focused on using clean data to remove backdoor attacks before model deployment. In this paper, we investigate the possibility of defending against backdoor attacks at test time by utilizing partially poisoned data to remove the backdoor from the model. To address the problem, a two-stage method Test-Time Backdoor Defense (TTBD) is proposed. In the first stage, we propose a backdoor sample detection method DDP to identify poisoned samples from a batch of mixed, partially poisoned samples. Once the poisoned samples are detected, we employ Shapley estimation to calculate the contribution of each neuron in the network, locate the poisoned neurons, and prune them to remove the backdoor in the models. Our experiments demonstrate that TTBD removes the backdoor successfully with only a batch of partially poisoned data across different model architectures and datasets against different types of backdoor attacks.

Submitted 29 November, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

arXiv:2308.02362 [pdf, other]

Flexible Differentially Private Vertical Federated Learning with Adaptive Feature Embeddings

Authors: Yuxi Mi, Hongquan Liu, Yewei Xia, Yiheng Sun, Jihong Guan, Shuigeng Zhou

Abstract: The emergence of vertical federated learning (VFL) has stimulated concerns about the imperfection in privacy protection, as shared feature embeddings may reveal sensitive information under privacy attacks. This paper studies the delicate equilibrium between data privacy and task utility goals of VFL under differential privacy (DP). To address the generality issue of prior arts, this paper advocates a flexible and generic approach that decouples the two goals and addresses them successively. Specifically, we initially derive a rigorous privacy guarantee by applying norm clipping on shared feature embeddings, which is applicable across various datasets and models. Subsequently, we demonstrate that task utility can be optimized via adaptive adjustments on the scale and distribution of feature embeddings in an accuracy-appreciative way, without compromising established DP mechanisms. We concretize our observation into the proposed VFL-AFE framework, which exhibits effectiveness against privacy attacks and the capacity to retain favorable task utility, as substantiated by extensive experiments.

Submitted 26 July, 2023; originally announced August 2023.

arXiv:2308.01999 [pdf, other]

cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science

Authors: Harun Bayraktar, Ali Charara, David Clark, Saul Cohen, Timothy Costa, Yao-Lung L. Fang, Yang Gao, Jack Guan, John Gunnels, Azzam Haidar, Andreas Hehn, Markus Hohnerbach, Matthew Jones, Tom Lubowe, Dmitry Lyakh, Shinya Morino, Paul Springer, Sam Stanwyck, Igor Terentyev, Satya Varadhan, Jonathan Wong, Takuma Yamaguchi

Abstract: We present the NVIDIA cuQuantum SDK, a state-of-the-art library of composable primitives for GPU-accelerated quantum circuit simulations. As the size of quantum devices continues to increase, making their classical simulation progressively more difficult, the availability of fast and scalable quantum circuit simulators becomes vital for quantum algorithm developers, as well as quantum hardware engineers focused on the validation and optimization of quantum devices. The cuQuantum SDK was created to accelerate and scale up quantum circuit simulators developed by the quantum information science community by enabling them to utilize efficient scalable software building blocks optimized for NVIDIA GPU platforms. The functional building blocks provided cover the needs of both state vector- and tensor network- based simulators, including approximate tensor network simulation methods based on matrix product state, projected entangled pair state, and other factorized tensor representations. By leveraging the enormous computing power of the latest NVIDIA GPU architectures, quantum circuit simulators that have adopted the cuQuantum SDK demonstrate significant acceleration, compared to CPU-only execution, for both the state vector and tensor network simulation methods. Furthermore, by utilizing the parallel primitives available in the cuQuantum SDK, one can easily transition to distributed GPU-accelerated platforms, including those furnished by cloud service providers and high-performance computing systems deployed by supercomputing centers, extending the scale of possible quantum circuit simulations. The rich capabilities provided by the SDK are conveniently made available via both Python and C application programming interfaces, where the former is directly targeting a broad Python quantum community and the latter allows tight integration with simulators written in any programming language.

Submitted 3 August, 2023; originally announced August 2023.

Comments: paper accepted at QCE 2023, journal reference will be updated whenever available

MSC Class: 68Q12; 68Q09; 81P68;

arXiv:2307.16410 [pdf, other]

HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution

Authors: Minyi Zhao, Yi Xu, Bingjia Li, Jie Wang, Jihong Guan, Shuigeng Zhou

Abstract: Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to do STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN) that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch aiming at generating high-quality (HQ) text images based on the HR images to provide more accurate supervision to the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradation, and exploit the feedback from a recognizer and text-level annotations as weak supervision signal to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the qualities of HQ images, which are used to suppress the erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performances.

Submitted 31 July, 2023; originally announced July 2023.

arXiv:2307.10803 [pdf, other]

Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and Opportunities

Authors: Hanchen Yang, Wengen Li, Shuyu Wang, Hui Li, Jihong Guan, Shuigeng Zhou, Jiannong Cao

Abstract: With the rapid amassing of spatial-temporal (ST) ocean data, many spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, including climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated but with unique characteristics, e.g., diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models on ST ocean data. To the best of our knowledge, a comprehensive survey of existing studies remains missing in the literature, which hinders not only computer scientists from identifying the research issues in ocean data mining but also ocean scientists to apply advanced STDM techniques. In this paper, we provide a comprehensive survey of existing STDM studies for ocean science. Concretely, we first review the widely-used ST ocean datasets and highlight their unique characteristics. Then, typical ST ocean data quality enhancement techniques are explored. Next, we classify existing STDM studies in ocean science into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are discussed. This survey can help scientists from both computer science and ocean science better understand the fundamental concepts, key techniques, and open challenges of STDM for ocean science.

Submitted 3 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

arXiv:2307.01542 [pdf, other]

Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation

Authors: Jian Guan, Minlie Huang

Abstract: Despite the huge progress in myriad generation tasks, pretrained language models (LMs) such as GPT2 still tend to generate repetitive texts with maximization-based decoding algorithms for open-ended generation. We attribute their overestimation of token-level repetition probabilities to the learning bias: LMs capture simple repetitive patterns faster with the MLE loss. We propose self-contrastive training to penalize the output of a premature checkpoint of the same model when it incorrectly predicts repetition, which is shown to mitigate repetition effectively while maintaining fluency on two datasets. Furthermore, we find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.

Submitted 4 July, 2023; originally announced July 2023.

Comments: ACL 2023 Short Findings

arXiv:2307.01504 [pdf, other]

All in One: Multi-task Prompting for Graph Neural Networks

Authors: Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, Jihong Guan

Abstract: Recently, ''pre-training and fine-tuning'' has been adopted as a standard workflow for many graph tasks since it can take general graph knowledge to relieve the lack of graph annotations from each application. However, graph tasks with node level, edge level, and graph level are far diversified, making the pre-training pretext often incompatible with these multiple tasks. This gap may even cause a ''negative transfer'' to the specific application, leading to poor results. Inspired by the prompt learning in natural language processing (NLP), which has presented significant effectiveness in leveraging prior knowledge for various NLP tasks, we study the prompting topic for graphs with the motivation of filling the gap between pre-trained models and various graph tasks. In this paper, we propose a novel multi-task prompting method for graph models. Specifically, we first unify the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern. In this way, the prompting idea from NLP can be seamlessly introduced to the graph area. Then, to further narrow the gap between various graph tasks and state-of-the-art pre-training strategies, we further study the task space of various graph applications and reformulate downstream problems to the graph-level task. Afterward, we introduce meta-learning to efficiently learn a better initialization for the multi-task prompt of graphs so that our prompting framework can be more reliable and general for different tasks. We conduct extensive experiments, results from which demonstrate the superiority of our method.

Submitted 17 December, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: KDD 23 Best Research Paper Award, which is the first for Hong Kong and Mainland China. A Python Library is released as ProG: https://github.com/sheldonresearch/ProG Submitted to SIGKDD'23 in 03 Feb 2023; Receive Acceptance in 17 May 2023 (Rating 3 4 4 4); Submit to arXiv 1st time in 4 Jul 2023

arXiv:2306.06230 [pdf, other]

doi 10.1109/MNET.005.2300129

Design Frameworks for Hyper-Connected Social XRI Immersive Metaverse Environments

Authors: Jie Guan, Alexis Morris

Abstract: The metaverse refers to the merger of technologies for providing a digital twin of the real world and the underlying connectivity and interactions for the many kinds of agents within. As this set of technology paradigms - involving artificial intelligence, mixed reality, the internet-of-things and others - gains in scale, maturity, and utility there are rapidly emerging design challenges and new research opportunities. In particular is the metaverse disconnect problem, the gap in task switching that inevitably occurs when a user engages with multiple virtual and physical environments simultaneously. Addressing this gap remains an open issue that affects the user experience and must be overcome to increase overall utility of the metaverse. This article presents design frameworks that consider how to address the metaverse as a hyper-connected meta-environment that connects and expands multiple user environments, modalities, contexts, and the many objects and relationships within them. This article contributes to i) a framing of the metaverse as a social XR-IoT (XRI) concept, ii) design Considerations for XRI metaverse experiences, iii) a design architecture for social multi-user XRI metaverse environments, and iv) descriptive exploration of social interaction scenarios within XRI multi-user metaverses. These contribute a new design framework for metaverse researchers and creators to consider the coming wave of interconnected and immersive smart environments.

Submitted 27 January, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Journal ref: IEEE Network ( Volume: 37, Issue: 4, July/August 2023)

arXiv:2306.05358 [pdf, other]

Trustworthy Sensor Fusion against Inaudible Command Attacks in Advanced Driver-Assistance System

Authors: Jiwei Guan, Lei Pan, Chen Wang, Shui Yu, Longxiang Gao, Xi Zheng

Abstract: There are increasing concerns about malicious attacks on autonomous vehicles. In particular, inaudible voice command attacks pose a significant threat as voice commands become available in autonomous driving systems. How to empirically defend against these inaudible attacks remains an open question. Previous research investigates utilizing deep learning-based multimodal fusion for defense, without considering the model uncertainty in trustworthiness. As deep learning has been applied to increasingly sensitive tasks, uncertainty measurement is crucial in helping improve model robustness, especially in mission-critical scenarios. In this paper, we propose the Multimodal Fusion Framework (MFF) as an intelligent security system to defend against inaudible voice command attacks. MFF fuses heterogeneous audio-vision modalities using VGG family neural networks and achieves the detection accuracy of 92.25% in the comparative fusion method empirical study. Additionally, extensive experiments on audio-vision tasks reveal the model's uncertainty. Using Expected Calibration Errors, we measure calibration errors and Monte-Carlo Dropout to estimate the predictive distribution for the proposed models. Our findings show empirically to train robust multimodal models, improve standard accuracy and provide a further step toward interpretability. Finally, we discuss the pros and cons of our approach and its applicability for Advanced Driver Assistance Systems.

Submitted 29 May, 2023; originally announced June 2023.

arXiv:2306.01139 [pdf, other]

doi 10.1109/ISMAR-Adjunct54149.2021.00024

An XRI Mixed-Reality Internet-of-Things Architectural Framework Toward Immersive and Adaptive Smart Environments

Authors: Alexis Morris, Jie Guan, Amna Azhar

Abstract: The internet-of-things (IoT) refers to the growing number of embedded interconnected devices within everyday ubiquitous objects and environments, especially their networks, edge controllers, data gathering and management, sharing, and contextual analysis capabilities. However, the IoT suffers from inherent limitations in terms of human-computer interaction. In this landscape, there is a need for interfaces that have the potential to translate the IoT more solidly into the foreground of everyday smart environments, where its users are multimodal, multifaceted, and where new forms of presentation, adaptation, and immersion are essential. This work highlights the synergetic opportunities for both IoT and XR to converge toward hybrid XR objects with strong real-world connectivity, and IoT objects with rich XR interfaces. The paper contributes i) an understanding of this multi-disciplinary domain XR-IoT (XRI); ii) a theoretical perspective on how to design XRI agents based on the literature; iii) a system design architectural framework for XRI smart environment development; and iv) an early discussion of this process. It is hoped that this research enables future researchers in both communities to better understand and deploy hybrid smart XRI environments.

Submitted 1 June, 2023; originally announced June 2023.

Journal ref: 2021 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct)

arXiv:2306.01137 [pdf, other]

doi 10.1109/VRW58643.2023.00251

Extending the Metaverse: Hyper-Connected Smart Environments with Mixed Reality and the Internet of Things

Authors: Jie Guan, Alexis Morris, Jay Irizawa

Abstract: The metaverse, i.e., the collection of technologies that provide a virtual twin of the real world via mixed reality, internet of things, and others, is gaining prominence. However, the metaverse faces challenges as it grows toward mainstream adoption. Among these is the lack of strong connections between metaverse objects and traditional physical objects and environments, which leads to inconsistencies for users within metaverse environments. To address this issue, this work explores the design and development of a framework for bridging the physical environment and the metaverse through the use of internet-of-things objects and mixed reality designs. The contributions of this include: i) an architectural framework for extending the metaverse, ii) design prototypes using the framework. Together, this exploration charts the course toward a more cohesive and hyper-connected metaverse smart environment.

Submitted 1 June, 2023; originally announced June 2023.

Journal ref: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops

arXiv:2306.01113 [pdf, other]

doi 10.1109/VRW58643.2023.00071

Cross-Reality for Extending the Metaverse: Designing Hyper-Connected Immersive Environments with XRI

Authors: Jie Guan, Alexis Morris, Jay Irizawa

Abstract: The Metaverse comprises technologies to enable virtual twins of the real world, via mixed reality, internet of things, and others. As it matures unique challenges arise such as a lack of strong connections between virtual and physical worlds. This work presents design frameworks for cross-reality hybrid spaces. Contributions include: i) clarifying the metaverse "disconnect", ii) extended metaverse design frameworks, iii) prototypes, and iv) discussions toward new metaverse smart environments.

Submitted 1 June, 2023; originally announced June 2023.

Journal ref: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)

arXiv:2306.01096 [pdf, other]

doi 10.1109/GEM56474.2022.10017701

Extended-XRI Body Interfaces for Hyper-Connected Metaverse Environments

Authors: Jie Guan, Alexis Morris

Abstract: Hybrid mixed-reality (XR) internet-of-things (IoT) research, here called XRI, aims at a strong integration between physical and virtual objects, environments, and agents wherein IoT-enabled edge devices are deployed for sensing, context understanding, networked communication and control of device actuators. Likewise, as augmented reality systems provide an immersive overlay on the environments, and virtual reality provides fully immersive environments, the merger of these domains leads to immersive smart spaces that are hyper-connected, adaptive and dynamic components that anchor the metaverse to real-world constructs. Enabling the human-in-the-loop to remain engaged and connected across these virtual-physical hybrid environments requires advances in user interaction that are multi-dimensional. This work investigates the potential to transition the user interface to the human body as an extended-reality avatar with hybrid extended-body interfaces that can interact both with the physical and virtual sides of the metaverse. It contributes: i) an overview of metaverses, XRI, and avatarization concepts, ii) a taxonomy landscape for extended XRI body interfaces, iii) an architecture and potential interactions for XRI body designs, iv) a prototype XRI body implementation based on the architecture, v) a design-science evaluation, toward enabling future design research directions.

Submitted 1 June, 2023; originally announced June 2023.

Journal ref: 2022 IEEE Games, 2022 IEEE Games, Entertainment, Media Conference (GEM)

arXiv:2305.12881 [pdf, other]

Building an Invisible Shield for Your Portrait against Deepfakes

Authors: Jiazhi Guan, Tianshu Hu, Hang Zhou, Zhizhi Guo, Lirui Deng, Chengbin Quan, Errui Ding, Youjian Zhao

Abstract: The issue of detecting deepfakes has garnered significant attention in the research community, with the goal of identifying facial manipulations for abuse prevention. Although recent studies have focused on developing generalized models that can detect various types of deepfakes, their performance is not always be reliable and stable, which poses limitations in real-world applications. Instead of learning a forgery detector, in this paper, we propose a novel framework - Integrity Encryptor, aiming to protect portraits in a proactive strategy. Our methodology involves covertly encoding messages that are closely associated with key facial attributes into authentic images prior to their public release. Unlike authentic images, where the hidden messages can be extracted with precision, manipulating the facial attributes through deepfake techniques can disrupt the decoding process. Consequently, the modified facial attributes serve as a mean of detecting manipulated images through a comparison of the decoded messages. Our encryption approach is characterized by its brevity and efficiency, and the resulting method exhibits a good robustness against typical image processing traces, such as image degradation and noise. When compared to baselines that struggle to detect deepfakes in a black-box setting, our method utilizing conditional encryption showcases superior performance when presented with a range of different types of forgeries. In experiments conducted on our protected data, our approach outperforms existing state-of-the-art methods by a significant margin.

Submitted 22 May, 2023; originally announced May 2023.

Comments: under review

arXiv:2305.11960 [pdf, other]

doi 10.1109/SMC42975.2020.9282914

Toward Mixed Reality Hybrid Objects with IoT Avatar Agents

Authors: Alexis Morris, Jie Guan, Nadine Lessio, Yiyi Shao

Abstract: The internet-of-things (IoT) refers to the growing field of interconnected pervasive computing devices and the networking that supports smart, embedded applications. The IoT has multiple human-computer interaction challenges due to its many formats and interlinked components, and central to these is the need to provide sensory information and situational context pertaining to users in a more human-friendly, easily understandable format. This work addresses this by applying mixed reality toward expressing the underlying behaviors and states internal to IoT devices and IoT-enabled objects. It extends the authors' previous research on IoT Avatars (mixed reality character representations of physical IoT devices), presenting a new head-mounted display framework and interconnection architecture. This contributes i) an exploration of mixed reality for smart spaces, ii) an approach toward expressive avatar behaviors using fuzzy inference, and iii) an early functional prototype of a hybrid physical and mixed reality IoT-enabled object. This approach is a step toward new information presentation, interaction, and engagement capabilities for smart devices and environments.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.07508 [pdf, other]

MolDiff: Addressing the Atom-Bond Inconsistency Problem in 3D Molecule Diffusion Generation

Authors: Xingang Peng, Jiaqi Guan, Qiang Liu, Jianzhu Ma

Abstract: Deep generative models have recently achieved superior performance in 3D molecule generation. Most of them first generate atoms and then add chemical bonds based on the generated atoms in a post-processing manner. However, there might be no corresponding bond solution for the temporally generated atoms as their locations are generated without considering potential bonds. We define this problem as the atom-bond inconsistency problem and claim it is the main reason for current approaches to generating unrealistic 3D molecules. To overcome this problem, we propose a new diffusion model called MolDiff which can generate atoms and bonds simultaneously while still maintaining their consistency by explicitly modeling the dependence between their relationships. We evaluated the generation ability of our proposed model and the quality of the generated molecules using criteria related to both geometry and chemical properties. The empirical studies showed that our model outperforms previous approaches, achieving a three-fold improvement in success rate and generating molecules with significantly better quality.

Submitted 11 May, 2023; originally announced May 2023.

arXiv:2305.05445 [pdf, other]

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

Authors: Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang

Abstract: Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability. Previous studies either require long-term data for training or produce a similar movement pattern on all subjects with low quality. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synchronization. We identify that a style-based generator would sufficiently enable such a charming property on both one-shot and few-shot scenarios. Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face. The mouth shapes are accurately modified by audio through modulated convolutions. Moreover, our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames. Thus the identity and talking style of a target person could be accurately preserved. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results on a variety of scenes. Resources can be found at https://hangz-nju-cuhk.github.io/projects/StyleSync.

Submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. Project page: https://hangz-nju-cuhk.github.io/projects/StyleSync

arXiv:2305.03328 [pdf, other]

Time-weighted Frequency Domain Audio Representation with GMM Estimator for Anomalous Sound Detection

Authors: Jian Guan, Youde Liu, Qiaoxi Zhu, Tieran Zheng, Jiqing Han, Wenwu Wang

Abstract: Although deep learning is the mainstream method in unsupervised anomalous sound detection, a Gaussian Mixture Model (GMM) with statistical audio frequency representation as input can achieve comparable results with much lower model complexity and fewer parameters. Existing statistical frequency representations, e.g., the log-Mel spectrogram's average or maximum over time, do not always work well for different machines. This paper presents Time-Weighted Frequency Domain Representation (TWFR) with the GMM method (TWFR-GMM) for anomalous sound detection. The TWFR is a generalized statistical frequency domain representation that can adapt to different machine types, using global weighted ranking pooling over the time domain. This allows the GMM estimator to recognize anomalies, even under domain-shift conditions, as visualized with a Mahalanobis distance-based metric. Experiments on the DCASE 2022 Challenge Task2 dataset show that our method has better detection performance than recent deep learning methods. TWFR-GMM is the core of our submission that achieved 3rd place in DCASE 2022 Challenge Task2.

Submitted 5 May, 2023; originally announced May 2023.

Comments: To appear at ICASSP 2023

arXiv:2305.02606 [pdf, other]

Re$^3$Dial: Retrieve, Reorganize and Rescale Dialogue Corpus for Long-Turn Open-Domain Dialogue Pre-training

Authors: Jiaxin Wen, Hao Zhou, Jian Guan, Minlie Huang

Abstract: Pre-training on large-scale open-domain dialogue data can substantially improve the performance of dialogue models. However, the pre-trained dialogue model's ability to utilize long-range context is limited due to the scarcity of long-turn dialogue sessions. Most dialogues in existing pre-training corpora contain fewer than three turns of dialogue. To alleviate this issue, we propose the Retrieve, Reorganize and Rescale framework (Re$^3$Dial), which can automatically construct billion-scale long-turn dialogues by reorganizing existing short-turn ones. Given a short-turn session, Re$^3$Dial first employs a session retriever to retrieve coherent consecutive sessions. To this end, we train the retriever to capture semantic and discourse relations within multi-turn dialogues through contrastive training. Next, Re$^3$Dial samples a session from retrieved results following a diversity sampling strategy, which is designed to penalize repetitive or generic sessions. A longer session is then derived by concatenating the original session and the sampled session. By repeating the above process, Re$^3$Dial can yield a coherent long-turn dialogue. Extensive experiments on multiple multi-turn dialogue benchmarks demonstrate that Re$^3$Dial significantly improves the dialogue model's ability to utilize long-range context and thus generate more sensible and informative responses. Finally, we build a toolkit for efficiently rescaling conversations with Re$^3$Dial, which enables us to construct a corpus containing 1B Chinese dialogue sessions with 11.3 turns on average (5$\times$ longer than the original corpus). Our retriever model, code, and data are publicly available at https://github.com/thu-coai/Re3Dial.

Submitted 22 October, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: EMNLP 2023 Main Conference

arXiv:2304.03588 [pdf, other]

Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining

Authors: Jian Guan, Feiyang Xiao, Youde Liu, Qiaoxi Zhu, Wenwu Wang

Abstract: Existing contrastive learning methods for anomalous sound detection refine the audio representation of each audio sample by using the contrast between the samples' augmentations (e.g., with time or frequency masking). However, they might be biased by the augmented data, due to the lack of physical properties of machine sound, thereby limiting the detection performance. This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample. The proposed two-stage method uses contrastive learning to pretrain the audio representation model by incorporating machine ID and a self-supervised ID classifier to fine-tune the learnt model, while enhancing the relation between audio features from the same ID. Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification in overall anomaly detection performance and stability on the DCASE 2020 Challenge Task2 dataset.

Submitted 10 April, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: To appear in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)

arXiv:2304.03586 [pdf, other]

doi 10.1109/LSP.2023.3266114

Graph Attention for Automated Audio Captioning

Authors: Feiyang Xiao, Jian Guan, Qiaoxi Zhu, Wenwu Wang

Abstract: State-of-the-art audio captioning methods typically use the encoder-decoder structure with pretrained audio neural networks (PANNs) as encoders for feature extraction. However, the convolution operation used in PANNs is limited in capturing the long-time dependencies within an audio signal, thereby leading to potential performance degradation in audio captioning. This letter presents a novel method using graph attention (GraphAC) for encoder-decoder based audio captioning. In the encoder, a graph attention module is introduced after the PANNs to learn contextual association (i.e. the dependency among the audio features over different time frames) through an adjacency graph, and a top-k mask is used to mitigate the interference from noisy nodes. The learnt contextual association leads to a more effective feature representation with feature node aggregation. As a result, the decoder can predict important semantic information about the acoustic scene and events based on the contextual associations learned from the audio signal. Experimental results show that GraphAC outperforms the state-of-the-art methods with PANNs as the encoders, thanks to the incorporation of the graph attention module into the encoder for capturing the long-time dependencies within the audio signal. The source code is available at https://github.com/LittleFlyingSheep/GraphAC.

Submitted 10 April, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: Accepted by IEEE Signal Processing Letters

Showing 1–50 of 148 results for author: Guan, J