AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

Authors: Jian Guan, Wei Wu, Zujie Wen, Peng Xu, Hongning Wang, Minlie Huang

Abstract: The notable success of large language models (LLMs) has sparked an upsurge in building language agents to complete various complex tasks. We present AMOR, an agent framework based on open-source LLMs, which reasons with external knowledge bases and adapts to specific domains through human supervision to the reasoning process. AMOR builds reasoning logic over a finite state machine (FSM) that solves problems through autonomous executions and transitions over disentangled modules. This allows humans to provide direct feedback to the individual modules, and thus naturally forms process supervision. Based on this reasoning and feedback framework, we develop AMOR through two-stage fine-tuning: warm-up and adaptation. The former fine-tunes the LLM with examples automatically constructed from various public datasets and enables AMOR to generalize across different knowledge environments, while the latter tailors AMOR to specific domains using process feedback. Extensive experiments across multiple domains demonstrate the advantage of AMOR to strong baselines, thanks to its FSM-based reasoning and process feedback mechanism.

Submitted 2 February, 2024; originally announced February 2024.

Comments: Work in progress

arXiv:2402.01440 [pdf, other]

Few-Shot Learning on Graphs: from Meta-learning to Pre-training and Prompting

Authors: Xingtong Yu, Yuan Fang, Zemin Liu, Yuxia Wu, Zhihao Wen, Jianyuan Bo, Xinming Zhang, Steven C. H. Hoi

Abstract: Graph representation learning, a critical step in graph-centric tasks, has seen significant advancements. Earlier techniques often operate in an end-to-end setting, where performance heavily relies on the availability of ample labeled data. This constraint has spurred the emergence of few-shot learning on graphs, where only a few task-specific labels are available for each task. Given the extensive literature in this field, this survey endeavors to synthesize recent developments, provide comparative insights, and identify future directions. We systematically categorize existing studies into three major families: meta-learning approaches, pre-training approaches, and hybrid approaches, with a finer-grained classification in each family to aid readers in their method selection process. Within each category, we analyze the relationships among these methods and compare their strengths and limitations. Finally, we outline prospective future directions for few-shot learning on graphs to catalyze continued innovation in this field.

Submitted 2 March, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.16702 [pdf, other]

Multi-granularity Correspondence Learning from Long-term Noisy Videos

Authors: Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

Abstract: Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to over-high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to the clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), hindering temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton) that addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out the irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits the potential faulty negative samples in clip-caption contrast by rectifying the alignment target with OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.

Submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted by ICLR 2024 (oral)

arXiv:2401.13503 [pdf, other]

Learning Representations for Clustering via Partial Information Discrimination and Cross-Level Interaction

Authors: Hai-Xin Zhang, Dong Huang, Hua-Bao Ling, Guang-Yu Zhang, Wei-jun Sun, Zi-hao Wen

Abstract: In this paper, we present a novel deep image clustering approach termed PICI, which enforces the partial information discrimination and the cross-level interaction in a joint learning framework. In particular, we leverage a Transformer encoder as the backbone, through which the masked image modeling with two paralleled augmented views is formulated. After deriving the class tokens from the masked images by the Transformer encoder, three partial information learning modules are further incorporated, including the PISD module for training the auto-encoder via masked image reconstruction, the PICD module for employing two levels of contrastive learning, and the CLI module for mutual interaction between the instance-level and cluster-level subspaces. Extensive experiments have been conducted on six real-world image datasets, which demonstrate the superior clustering performance of the proposed PICI approach over the state-of-the-art deep clustering approaches. The source code is available at https://github.com/Regan-Zhang/PICI.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.10296 [pdf, other]

The Study of Mode Switching behavior of PSR J0614+2229 Using the Parkes Ultra-wideband Receiver Observations

Authors: Yanqing Cai, Shijun Dang, Rai Yuen, Lunhua Shang, Feifei Kou, Jianping Yuan, Lei Zhang, Zurong Zhou, Na Wang, Qingying Li, Zhigang Wen, Wenming Yan, Shuangqiang Wang, Shengnan Sun, Habtamu Menberu Tedila, Shuo Xiao, Xin Xu, Rushuang Zhao, Qijun Zhi, Aijun Dong, Bing Zhang, Wei Li, Yingying Ren, Yujia Liu

Abstract: In this paper, we presented a detailed single pulse and polarization study of PSR J0614+2229 based on the archived data observed on 2019 August 15 (MJD 58710) and September 12 (MJD 58738) using the Ultra-wideband Low-frequency Receiver on the Parkes radio telescope. The single-pulse sequences show that this pulsar switches between two emission states, in which the emission of state A occurs earlier than that of state B in pulse longitude. We found that the variation in relative brightness between the two states is related to time and both states follow a simple power law very well. Based on the phase-aligned multi-frequency profiles, we found that there is a significant difference in the distributions of spectral index across the emission regions of the two states. Furthermore, we obtained the emission height evolution for the two emission states and found that, at a fixed frequency, the emission height of state A is higher than that of state B. What is even more interesting is that the emission heights of both states A and B have not changed with frequency. Our results suggest that the mode switching of this pulsar is possibly caused by changes in the emission heights that alter the distributions of spectral index across the emission regions of states A and B resulting in the frequency-dependent behaviors, i.e., intensity and pulse width.

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.09085 [pdf]

3D orientation super-resolution spatial-frequency-shift microscopy

Authors: Xiaowei Liu, Mingwei Tang, Ning Zhou, Chenlei Pang, Zhong Wen, Xu Liu, Qing Yang

Abstract: Super-resolution mapping of the 3D orientation of fluorophores reveals the alignment of biological structures where the fluorophores are tightly attached, and thus plays a vital role in studying the organization and dynamics of bio-complexes. However, current super-resolution imaging techniques are either limited to 2D orientation mapping or suffer from slow speed and the requirement of special labels in 3D orientation mapping. Here, we propose a novel polarized virtual spatial-frequency-shift effect to overcome these restrictions to achieve a universal 3D orientation super-resolution mapping capability. To demonstrate the mechanism, we simulate the imaging process and reconstruct the spatial-angular information for sparsely distributed dipoles with random 3D orientations and microfilament-like structures decorated with fluorophores oriented parallel to them. The 3D orientation distribution can be recovered with a doubled spatial resolution and an average angular precision of up to 2.39 degrees. The performance of the approach with noise has also been analyzed considering real implementation.

Submitted 22 January, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: 22 pages, 5 figures

arXiv:2401.05778 [pdf, other]

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

Authors: Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li

Abstract: Large language models (LLMs) have strong capabilities in solving diverse natural language processing tasks. However, the safety and security issues of LLM systems have become the major obstacle to their widespread application. Many studies have extensively investigated risks in LLM systems and developed the corresponding mitigation strategies. Leading-edge enterprises such as OpenAI, Google, Meta, and Anthropic have also made lots of efforts on responsible LLMs. Therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. In this paper, we delve into four essential modules of an LLM system, including an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting LLM-generated content. Based on this, we propose a comprehensive taxonomy, which systematically analyzes potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems. We hope that this paper can help LLM participants embrace a systematic perspective to build their responsible LLM systems.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.05596 [pdf]

POMP: Probability-driven Meta-graph Prompter for LLMs in Low-resource Unsupervised Neural Machine Translation

Authors: Shilong Pan, Zhiliang Tian, Liang Ding, Zhen Huang, Zhihua Wen, Dongsheng Li

Abstract: Low-resource languages (LRLs) face challenges in supervised neural machine translation due to limited parallel data, prompting research into unsupervised methods. Unsupervised neural machine translation (UNMT) methods, including back-translation, transfer learning, and pivot-based translation, offer practical solutions for LRL translation, but they are hindered by issues like synthetic data noise, language bias, and error propagation, which can potentially be mitigated by Large Language Models (LLMs). LLMs have advanced NMT with in-context learning (ICL) and supervised fine-tuning methods, but insufficient training data results in poor performance in LRLs. We argue that LLMs can mitigate the linguistic noise with auxiliary languages to improve translations in LRLs. In this paper, we propose Probability-driven Meta-graph Prompter (POMP), a novel approach employing a dynamic, sampling-based graph of multiple auxiliary languages to enhance LLMs' translation capabilities for LRLs. POMP involves constructing a directed acyclic meta-graph for each source language, from which we dynamically sample multiple paths to prompt LLMs to mitigate the linguistic noise and improve translations during training. We use the BLEURT metric to evaluate the translations and back-propagate rewards, estimated by scores, to update the probabilities of auxiliary languages in the paths. Our experiments show significant improvements in the translation quality of three LRLs, demonstrating the effectiveness of our approach.

Submitted 16 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

arXiv:2401.02682 [pdf, other]

Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering

Authors: Zichen Wen, Yawen Ling, Yazhou Ren, Tianyi Wu, Jianpeng Chen, Xiaorong Pu, Zhifeng Hao, Lifang He

Abstract: Recently there is a growing focus on graph data, and multi-view graph clustering has become a popular area of research interest. Most of the existing methods are only applicable to homophilous graphs, yet the extensive real-world graph data can hardly fulfill the homophily assumption, where the connected nodes tend to belong to the same class. Several studies have pointed out that the poor performance on heterophilous graphs is actually due to the fact that conventional graph neural networks (GNNs), which are essentially low-pass filters, discard information other than the low-frequency information on the graph. Nevertheless, on certain graphs, particularly heterophilous ones, neglecting high-frequency information and focusing solely on low-frequency information impedes the learning of node representations. To break this limitation, our motivation is to perform graph filtering that is closely related to the homophily degree of the given graph, with the aim of fully leveraging both low-frequency and high-frequency signals to learn distinguishable node embedding. In this work, we propose Adaptive Hybrid Graph Filter for Multi-View Graph Clustering (AHGFC). Specifically, a graph joint process and graph joint aggregation matrix are first designed by using the intrinsic node features and adjacency relationship, which makes the low and high-frequency signals on the graph more distinguishable. Then we design an adaptive hybrid graph filter that is related to the homophily degree, which learns the node embedding based on the graph joint aggregation matrix. After that, the node embedding of each view is weighted and fused into a consensus embedding for the downstream task. Experimental results show that our proposed model performs well on six datasets containing homophilous and heterophilous graphs.

Submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted by AAAI2024

arXiv:2312.16998 [pdf, other]

Deep Unfolding Network with Spatial Alignment for multi-modal MRI reconstruction

Authors: Hao Zhang, Qi Wang, Jun Shi, Shihui Ying, Zhijie Wen

Abstract: Multi-modal Magnetic Resonance Imaging (MRI) offers complementary diagnostic information, but some modalities are limited by the long scanning time. To accelerate the whole acquisition process, MRI reconstruction of one modality from highly undersampled k-space data with another fully-sampled reference modality is an efficient solution. However, the misalignment between modalities, which is common in clinic practice, can negatively affect reconstruction quality. Existing deep learning-based methods that account for inter-modality misalignment perform better, but still share two main common limitations: (1) The spatial alignment task is not adaptively integrated with the reconstruction process, resulting in insufficient complementarity between the two tasks; (2) the entire framework has weak interpretability. In this paper, we construct a novel Deep Unfolding Network with Spatial Alignment, termed DUN-SA, to appropriately embed the spatial alignment task into the reconstruction process. Concretely, we derive a novel joint alignment-reconstruction model with a specially designed cross-modal spatial alignment term. By relaxing the model into cross-modal spatial alignment and multi-modal reconstruction tasks, we propose an effective algorithm to solve this model alternatively. Then, we unfold the iterative steps of the proposed algorithm and design corresponding network modules to build DUN-SA with interpretability. Through end-to-end training, we effectively compensate for spatial misalignment using only reconstruction loss, and utilize the progressively aligned reference modality to provide inter-modality prior to improve the reconstruction of the target modality. Comprehensive experiments on three real datasets demonstrate that our method exhibits superior reconstruction performance compared to state-of-the-art methods.

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.12693 [pdf, other]

Anderson Accelerated Gauss-Newton-guided deep learning for nonlinear inverse problems with Application to Electrical Impedance Tomography

Authors: Qingping Zhou, Guixian Xu, Zhexin Wen, Hongqiao Wang

Abstract: Physics-guided deep learning is an important prevalent research topic in scientific machine learning, which has tremendous potential in various complex applications including science and engineering. In these applications, data is expensive to acquire and high accuracy is required for making decisions. In this work, we introduce an efficient physics-guided deep learning framework for the variational modeling of nonlinear inverse problems, which is then applied to solve an electrical impedance tomography (EIT) inverse problem. The framework is achieved by unrolling the proposed Anderson accelerated Gauss-Newton (GNAA) algorithm into an end-to-end deep learning method. Firstly, we show the convergence of the GNAA algorithm in both cases: Anderson depth is equal to one and Anderson depth is greater than one. Then, we propose three types of strategies by combining the complementary strengths of GNAA and deep learning: GNAA of learned regularization (GNAA-LRNet), where the singular values of the regularization matrix are learned by a deep neural network; GNAA of learned proximity (GNAA-LPNet), where the regularization proximal operator is learned by using a deep neural network; GNAA of plug-and-play method (GNAA-PnPNet) where the regularization proximal operator is replaced by a pre-trained deep denoisers. Lastly, we present some numerical experiments to illustrate that the proposed approaches greatly improve the convergence rate and the quality of inverse solutions.

Submitted 19 December, 2023; originally announced December 2023.

MSC Class: 78A46; 68U10; 68T07

arXiv:2312.07889 [pdf, other]

Adaptive Isogeometric Topology Optimization of Shell Structures based on PHT-splines

Authors: Zepeng Wen, Qiong Pan, Xiaoya Zhai, Hongmei Kang, Falai Chen

Abstract: This paper proposes an Adaptive Isogeometric Topology Optimization framework for shell structures based on PHT-splines (PHT-AITO). In this framework, the design domain, displacement, and density are represented by PHT-splines. Leveraging the local refinement capability of PHT-splines, mesh elements defining the density function are adaptively refined to achieve a suitable resolution at the interface between solid and void regions. This addresses the issue of excessive degrees of freedom resulting from global refinement. The refinement of the mesh elements is driven by their density. During the optimization of the density on a refined mesh, the initial value of the density is inherited from the optimization results on the previous mesh to accelerate the iteration process and maintain the stability of the optimized structure. Numerical experiments on various shell structures have verified the effectiveness of PHT-AITO. Compared with isogeometric topology optimization based on tensor-product splines, PHT-AITO can significantly reduce the degrees of freedom in the optimization problem, thereby improving computational efficiency.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.06993 [pdf]

Dynamically configured physics-informed neural network in topology optimization applications

Authors: Jichao Yin, Ziming Wen, Shuhao Li, Yaya Zhanga, Hu Wang

Abstract: Integration of machine learning (ML) into the topology optimization (TO) framework is attracting increasing attention, but data acquisition in data-driven models is prohibitive. Compared with popular ML methods, the physics-informed neural network (PINN) can avoid generating enormous amounts of data when solving forward problems and additionally provide better inference. To this end, a dynamically configured PINN-based topology optimization (DCPINN-TO) method is proposed. The DCPINN is composed of two subnetworks, namely the backbone neural network (NN) and the coefficient NN, where the coefficient NN has fewer trainable parameters. The designed architecture aims to dynamically configure trainable parameters; that is, an inexpensive NN is used to replace an expensive one at certain optimization cycles. Furthermore, an active sampling strategy is proposed to selectively sample collocations depending on the pseudo-densities at each optimization cycle. In this manner, the number of collocations will decrease with the optimization process but will hardly affect it. The Gaussian integral is used to calculate the strain energy of elements, which yields a byproduct of decoupling the mapping of the material at the collocations. Several examples with different resolutions validate the feasibility of the DCPINN-TO method, and multiload and multiconstraint problems are employed to illustrate its generalization. In addition, compared to finite element analysis-based TO (FEA-TO), the accuracy of the displacement prediction and optimization results indicate that the DCPINN-TO method is effective and efficient.

Submitted 12 December, 2023; originally announced December 2023.

Comments: 31 pages, 22 figures

arXiv:2312.06644 [pdf, other]

AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes

Authors: Rao Fu, Zehao Wen, Zichen Liu, Srinath Sridhar

Abstract: Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured and textured indoor scenes at a house-scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts provided textual narratives into amodal structured representations. These representations guarantee consistent and realistic spatial layouts by directing the synthesis of a geometry mesh within defined constraints. A Score Distillation Sampling process is then employed to refine the geometry, followed by an egocentric inpainting process that adds lifelike textures to it. AnyHome stands out with its editability, customizability, diversity, and realism. The structured representations for scenes allow for extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometries and textures that outperform existing methods in both quantitative and qualitative measures.

Submitted 20 March, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.04293 [pdf, other]

GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition

Authors: Zheng Lian, Licai Sun, Haiyang Sun, Kang Chen, Zhuofan Wen, Hao Gu, Bin Liu, Jianhua Tao

Abstract: Recently, GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated. To bridge this gap, we present the quantitative evaluation results of GPT-4V on 21 benchmark datasets covering 6 tasks: visual sentiment analysis, tweet sentiment analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition. This paper collectively refers to these tasks as "Generalized Emotion Recognition (GER)". Through experimental analysis, we observe that GPT-4V exhibits strong visual understanding capabilities in GER tasks. Meanwhile, GPT-4V shows the ability to integrate multimodal clues and exploit temporal information, which is also critical for emotion recognition. However, it's worth noting that GPT-4V is primarily designed for general domains and cannot recognize micro-expressions that require specialized knowledge. To the best of our knowledge, this paper provides the first quantitative assessment of GPT-4V for GER tasks. We have open-sourced the code and encourage subsequent researchers to broaden the evaluation scope by including more tasks and datasets. Our code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.

Submitted 17 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.01801 [pdf, other]

SPROUT: Authoring Programming Tutorials with Interactive Visualization of Large Language Model Generation Process

Authors: Yihan Liu, Zhen Wen, Luoxuan Weng, Ollie Woodman, Yi Yang, Wei Chen

Abstract: The rapid development of large language models (LLMs), such as ChatGPT, has revolutionized the efficiency of creating programming tutorials. LLMs can be instructed with text prompts to generate comprehensive text descriptions of code snippets. However, the lack of transparency in the end-to-end generation process has hindered the understanding of model behavior and limited user control over the generated results. To tackle this challenge, we introduce a novel approach that breaks down the programming tutorial creation task into actionable steps. By employing the tree-of-thought method, LLMs engage in an exploratory process to generate diverse and faithful programming tutorials. We then present SPROUT, an authoring tool equipped with a series of interactive visualizations that empower users to have greater control and understanding of the programming tutorial creation process. A formal user study demonstrated the effectiveness of SPROUT, showing that our tool assists users to actively participate in the programming tutorial creation process, leading to more reliable and customizable results. By providing users with greater control and understanding, SPROUT enhances the user experience and improves the overall quality of programming tutorial. A free copy of this paper and all supplemental materials are available at https://osf.io/uez2t/?view_only=5102e958802341daa414707646428f86.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2312.01273 [pdf, other]

An Augmented Lagrangian Primal-Dual Semismooth Newton Method for Multi-Block Composite Optimization

Authors: Zhanwang Deng, Kangkang Deng, Jiang Hu, Zaiwen Wen

Abstract: In this paper, we develop a novel primal-dual semismooth Newton method for solving linearly constrained multi-block convex composite optimization problems. First, a differentiable augmented Lagrangian (AL) function is constructed by utilizing the Moreau envelopes of the nonsmooth functions. It enables us to derive an equivalent saddle point problem and establish strong AL duality under Slater's condition. Consequently, a semismooth system of nonlinear equations is formulated to characterize the optimality of the original problem instead of the inclusion-form KKT conditions. We then develop a semismooth Newton method, called ALPDSN, which uses purely second-order steps and a nonmonotone line search based globalization strategy. Through a connection to the inexact first-order steps when the regularization parameter is sufficiently large, the global convergence of ALPDSN is established. Under the regularity conditions, partial smoothness, the local error bound, and the strict complementarity, we show that both the primal and the dual iteration sequences possess a superlinear convergence rate and provide concrete examples where these regularity conditions are met. Numerical results on image restoration with two regularization terms and the corrected tensor nuclear norm problem are presented to demonstrate the high efficiency and robustness of ALPDSN.

Submitted 15 May, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

Comments: 27 pages

arXiv:2312.01057 [pdf, other]

RLHF and IIA: Perverse Incentives

Authors: Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

Abstract: Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations on query formats and learning algorithms.

Submitted 1 February, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

arXiv:2310.18894 [pdf, other]

Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity

Authors: Tianqin Li, Ziqi Wen, Yangfan Li, Tai Sing Lee

Abstract: Current deep-learning models for object recognition are known to be heavily biased toward texture. In contrast, human visual systems are known to be biased toward shape and structure. What could be the design principles in human visual systems that led to this difference? How could we introduce more shape bias into the deep learning models? In this paper, we report that sparse coding, a ubiquitous principle in the brain, can in itself introduce shape bias into the network. We found that enforcing the sparse coding constraint using a non-differential Top-K operation can lead to the emergence of structural encoding in neurons in convolutional neural networks, resulting in a smooth decomposition of objects into parts and subparts and endowing the networks with shape bias. We demonstrated this emergence of shape bias and its functional benefits for different network structures with various datasets. For object recognition convolutional neural networks, the shape bias leads to greater robustness against style and pattern change distraction. For the image synthesis generative adversary networks, the emerged shape bias leads to more coherent and decomposable structures in the synthesized images. Ablation studies suggest that sparse codes tend to encode structures, whereas the more distributed codes tend to favor texture. Our code is hosted at the GitHub repository: https://github.com/Crazy-Jack/nips2023_shape_vs_texture

Submitted 29 October, 2023; originally announced October 2023.

Comments: Published as NeurIPS 2023 (Oral)

arXiv:2310.11531 [pdf, ps, other]

Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach

Authors: Dengwang Tang, Rahul Jain, Botao Hao, Zheng Wen

Abstract: In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert but with unknown level of competence, i.e., it is not perfect and not necessarily using the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can do substantially better in terms of minimizing cumulative regret, than if it doesn't do that. We establish an upper bound on regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite horizon setting. We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.

Submitted 1 February, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: 22 pages

MSC Class: 93E35

arXiv:2310.08869 [pdf, other]

doi 10.1109/TASLP.2024.3389643

Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection

Authors: Cunhang Fan, Mingming Ding, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv

Abstract: Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. However, in actual situations, noise interference is usually present, causing significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel data flow of the clean teacher branch and the noisy student branch is designed, and an interactive fusion module and response-based teacher-student paradigms are proposed to guide the training of noisy data from both the data distribution and decision-making perspectives. In the noisy student branch, speech enhancement is introduced initially for denoising, aiming to reduce the interference of strong noise. The proposed interactive fusion combines denoised features and noisy features to mitigate the impact of speech distortion and ensure consistency with the data distribution of the clean branch. The teacher-student paradigm maps the student's decision space to the teacher's decision space, enabling noisy speech to behave similarly to clean speech. Additionally, a joint training method is employed to optimize both branches for achieving global optimality. Experimental results based on multiple datasets demonstrate that the proposed method performs effectively in noisy environments and maintains its performance in cross-dataset experiments. Source code is available at https://github.com/fchest/DKDSSD.

Submitted 16 April, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.07555 [pdf, other]

Does resistance to style-transfer equal Global Shape Bias? Measuring network sensitivity to global shape configuration

Authors: Ziqi Wen, Tianqin Li, Zhi Jing, Tai Sing Lee

Abstract: Deep learning models are known to exhibit a strong texture bias, while humans tend to rely heavily on global shape structure for object recognition. The current benchmark for evaluating a model's global shape bias is a set of style-transferred images with the assumption that resistance to the attack of style transfer is related to the development of global structure sensitivity in the model. In this work, we show that networks trained with style-transfer images indeed learn to ignore style, but their shape bias arises primarily from local detail. We provide a Disrupted Structure Testbench (DiST) as a direct measurement of global structure sensitivity. Our test includes 2400 original images from ImageNet-1K, each of which is accompanied by two images with the global shapes of the original image disrupted while preserving its texture via the texture synthesis program. We found that (1) models that performed well on the previous cue-conflict dataset do not fare well in the proposed DiST; (2) the supervised trained Vision Transformer (ViT) loses its global spatial information from positional embedding, leading to no significant advantages over Convolutional Neural Networks (CNNs) on DiST, whereas self-supervised learning methods, especially masked autoencoders, significantly improve the global structure sensitivity of ViT; (3) improving the global structure sensitivity is orthogonal to resistance to style transfer, indicating that the relationship between global shape structure and local texture detail is not an either/or relationship. Training with DiST images and training with style-transferred images are complementary, and the two can be combined to train a network that has both enhanced global shape sensitivity and robust local features. Our code will be hosted on GitHub: https://github.com/leelabcnbc/DiST

Submitted 29 February, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.06713 [pdf, other]

Interpretable Traffic Event Analysis with Bayesian Networks

Authors: Tong Yuan, Jian Yang, Zeyi Wen

Abstract: Although existing machine learning-based methods for traffic accident analysis can provide good quality results to downstream tasks, they lack interpretability which is crucial for this critical problem. This paper proposes an interpretable framework based on Bayesian Networks for traffic accident prediction. To enable the ease of interpretability, we design a dataset construction pipeline to feed the traffic data into the framework while retaining the essential traffic data information. With a concrete case study, our framework can derive a Bayesian Network from a dataset based on the causal relationships between weather and traffic events across the United States. Consequently, our framework enables the prediction of traffic accidents with competitive accuracy while examining how the probability of these events changes under different conditions, thus illustrating transparent relationships between traffic and weather events. Additionally, the visualization of the network simplifies the analysis of relationships between different variables, revealing the primary causes of traffic accidents and ultimately providing a valuable reference for reducing traffic accidents.

Submitted 10 October, 2023; originally announced October 2023.

Comments: 11 pages, 7 figures

MSC Class: 62F15 ACM Class: G.3

arXiv:2310.05388 [pdf, other]

GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence

Authors: Zhihua Wen, Zhiliang Tian, Wei Wu, Yuxin Yang, Yanqi Shi, Zhen Huang, Dongsheng Li

Abstract: Conditional story generation is significant in human-machine interaction, particularly in producing stories with complex plots. While Large language models (LLMs) perform well on multiple NLP tasks, including story generation, it is challenging to generate stories with both complex and creative plots. Existing methods often rely on detailed prompts to guide LLMs to meet target conditions, which inadvertently restrict the creative potential of the generated stories. We argue that leveraging information from exemplary human-written stories facilitates generating more diverse plotlines. Delving deeper into story details helps build complex and credible plots. In this paper, we propose a retrieval-auGmented stoRy generation framework with a fOrest of eVidEnce (GROVE) to enhance stories' complexity. We build a retrieval repository for target conditions to produce few-shot examples to prompt LLMs. Additionally, we design an "asking-why" prompting scheme that extracts a forest of evidence, providing compensation for the ambiguities that may occur in the generated story. This iterative process uncovers underlying story backgrounds. Finally, we select the most fitting chains of evidence from the evidence forest and integrate them into the generated story, thereby enhancing the narrative's complexity and credibility. Experimental results and numerous examples verify the effectiveness of our method.

Submitted 23 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: Findings of EMNLP 2023

arXiv:2310.01419 [pdf, other]

Design Principles of Robust Multi-Armed Bandit Framework in Video Recommendations

Authors: Belhassen Bayar, Phanideep Gampa, Ainur Yessenalina, Zhen Wen

Abstract: Current multi-armed bandit approaches in recommender systems (RS) have focused more on devising effective exploration techniques, while not adequately addressing common exploitation challenges related to distributional changes and item cannibalization. Little work exists to guide the design of robust bandit frameworks that can address these frequent challenges in RS. In this paper, we propose new design principles to (i) make bandit models robust to time-variant metadata signals, (ii) make them less prone to item cannibalization, and (iii) prevent their weights from fluctuating due to data sparsity. Through a series of experiments, we systematically examine the influence of several important bandit design choices. We demonstrate the advantage of our proposed design principles at making bandit models robust to dynamic behavioral changes through in-depth analyses. Notably, compared to a baseline bandit model not incorporating our design choices, we show improved relative gains of up to $11.88\%$ and $44.85\%$ in ROC-AUC and PR-AUC, respectively. Case studies about fairness in recommending specific popular and unpopular titles are presented, to demonstrate the robustness of our proposed design at addressing popularity biases.

Submitted 24 September, 2023; originally announced October 2023.

Comments: RecSys CARS 2023 Workshop paper

arXiv:2310.00212 [pdf, other]

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Authors: Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, Jiantao Jiao

Abstract: Large Language Models (LLMs) can acquire extensive world knowledge through pre-training on large corpora. However, due to exposure to low-quality data, LLMs may exhibit harmful behavior without aligning with human values. The dominant approach for steering LLMs towards beneficial behavior involves Reinforcement Learning with Human Feedback (RLHF), with Proximal Policy Optimization (PPO) serving as the default RL optimizer. Despite its effectiveness, PPO has limitations when optimizing rewards trained from comparison-based loss. Primarily, PPO is not invariant to equivalent reward functions containing identical preference information due to the need to calibrate the reward scale. Additionally, PPO's necessity for token-wise updates introduces complexity in both function approximation and algorithm design compared to trajectory-wise optimization. This paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O) that operates directly on comparative rewards. We show theoretically that P3O is invariant to equivalent rewards and avoids the complexity of PPO. Empirical evaluations demonstrate that P3O outperforms PPO in the KL-Reward trade-off and can align with human preferences as well as or better than prior methods. In summary, this work introduces a simpler yet effective approach for aligning LLMs to human preferences through relative feedback.

Submitted 9 October, 2023; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: 19 pages, 5 figures

arXiv:2309.17409 [pdf, ps, other]

Sharper Convergence Guarantees for Federated Learning with Partial Model Personalization

Authors: Yiming Chen, Liyuan Cao, Kun Yuan, Zaiwen Wen

Abstract: Partial model personalization, which encompasses both shared and personal variables in its formulation, is a critical optimization problem in federated learning. It balances individual client needs with collective knowledge utilization, and serves as a general formulation covering various key scenarios, ranging from fully shared to fully personalized federated learning. This paper introduces two effective algorithms, FedAvg-P and Scaffold-P, to solve this problem and provides sharp convergence analyses, quantifying the influence of gradient variance, local steps, and partial client sampling on their performance. Our established rates surpass existing results and, meanwhile, are based on more relaxed assumptions. Additionally, our analyses are also applicable to fully shared or fully personalized federated learning, matching or even outperforming their best known convergence rates. Numerical experiments corroborate our theoretical findings.

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2308.13295 [pdf]

Resolution-independent generative models based on operator learning for physics-constrained Bayesian inverse problems

Authors: Xinchao Jiang, Xin Wang, Ziming Wen, Hu Wang

Abstract: The Bayesian inference approach is widely used to tackle inverse problems due to its versatile and natural ability to handle ill-posedness. However, it often faces challenges when dealing with situations involving continuous fields or large-resolution discrete representations (high-dimensional). Moreover, the prior distribution of unknown parameters is often difficult to determine. In this study, an Operator Learning-based Generative Adversarial Network (OL-GAN) is proposed and integrated into the Bayesian inference framework to handle these issues. Unlike most Bayesian approaches, the distinctive characteristic of the proposed method is to learn the joint distribution of parameters and responses. By leveraging the trained generative model, the posteriors of the unknown parameters can theoretically be approximated by any sampling algorithm (e.g., Markov Chain Monte Carlo, MCMC) in a low-dimensional latent space shared by the components of the joint distribution. The latent space is typically a simple and easy-to-sample distribution (e.g., Gaussian, uniform), which significantly reduces the computational cost associated with the Bayesian inference while avoiding prior selection concerns. Furthermore, incorporating operator learning enables resolution independence in the generator. Predictions can be obtained at desired coordinates, and inversions can be performed even if the observation data are misaligned with the training data. Finally, the effectiveness of the proposed method is validated through several numerical experiments.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2308.10028 [pdf, other]

doi 10.1145/3583780.3615505

Voucher Abuse Detection with Prompt-based Fine-tuning on Graph Neural Networks

Authors: Zhihao Wen, Yuan Fang, Yihan Liu, Yang Guo, Shuji Hao

Abstract: Voucher abuse detection is an important anomaly detection problem in E-commerce. While many GNN-based solutions have emerged, the supervised paradigm depends on a large quantity of labeled data. A popular alternative is to adopt self-supervised pre-training using label-free data, and further fine-tune on a downstream task with limited labels. Nevertheless, the "pre-train, fine-tune" paradigm is often plagued by the objective gap between pre-training and downstream tasks. Hence, we propose VPGNN, a prompt-based fine-tuning framework on GNNs for voucher abuse detection. We design a novel graph prompting function to reformulate the downstream task into a similar template as the pretext task in pre-training, thereby narrowing the objective gap. Extensive experiments on both proprietary and public datasets demonstrate the strength of VPGNN in both few-shot and semi-supervised scenarios. Moreover, an online deployment of VPGNN in a production environment shows a 23.4% improvement over two existing deployed models.

Submitted 30 August, 2023; v1 submitted 19 August, 2023; originally announced August 2023.

Comments: 7 pages, Accepted by CIKM23 Applied Research Track

arXiv:2308.06470 [pdf, ps, other]

On the Optimal Lower and Upper Complexity Bounds for a Class of Composite Optimization Problems

Authors: Zhenyuan Zhu, Fan Chen, Junyu Zhang, Zaiwen Wen

Abstract: We study the optimal lower and upper complexity bounds for finding approximate solutions to the composite problem $\min_x\ f(x)+h(Ax-b)$, where $f$ is smooth and $h$ is convex. Given access to the proximal operator of $h$, for strongly convex, convex, and nonconvex $f$, we design efficient first-order algorithms with complexities $\tilde{O}\left(\kappa_A\sqrt{\kappa_f}\log\left(1/\epsilon\right)\right)$, $\tilde{O}\left(\kappa_A\sqrt{L_f}D/\sqrt{\epsilon}\right)$, and $\tilde{O}\left(\kappa_A L_f\Delta/\epsilon^2\right)$, respectively. Here, $\kappa_A$ is the condition number of the matrix $A$ in the composition, $L_f$ is the smoothness constant of $f$, and $\kappa_f$ is the condition number of $f$ in the strongly convex case. $D$ is the initial point distance and $\Delta$ is the initial function value gap. Tight lower complexity bounds for the three cases are also derived and they match the upper bounds up to logarithmic factors, thereby demonstrating the optimality of both the upper and lower bounds proposed in this paper.

Submitted 12 August, 2023; originally announced August 2023.

MSC Class: 90C25; 90C26; 90C46; 90C60

arXiv:2308.04149 [pdf]

Fully epitaxial fcc(111) magnetic tunnel junctions with a Co90Fe10/MgAlO/Co90Fe10 structure

Authors: Jieyuan Song, Thomas Scheike, Cong He, Zhenchao Wen, Tadakatsu Ohkubo, Kazuhiro Hono, Hiroaki Sukegawa, Seiji Mitani

Abstract: Magnetic tunnel junctions (MTJs) with bcc(001)-type structures such as Fe(001)/MgO(001)/Fe(001), have been widely used as the core of various spintronic devices such as magnetoresistive memories; however, the limited material selection of (001)-type MTJs hinders the further development of spintronic devices. Here, as an alternative to the (001)-type MTJs, an fcc(111)-type MTJ using a fully epitaxial CoFe/rock-salt MgAlO (MAO)/CoFe is explored to introduce close-packed lattice systems into MTJs. Using an atomically flat Ru(0001) epitaxial buffer layer, fcc(111) epitaxial growth of the CoFe/MAO/CoFe trilayer is achieved. Sharp CoFe(111)/MAO(111) interfaces are confirmed due to the introduction of periodic dislocations by forming a 5:6 in-plane lattice matching structure. The fabricated (111) MTJ exhibits a tunnel magnetoresistance ratio of 37% at room temperature (47% at 10 K). Symmetric differential conductance curves with respect to bias polarity are observed, indicating the achievement of nearly identical upper and lower MAO interface qualities. Despite the charge-uncompensated (111) orientation for a rock-salt-like MAO barrier, the achievement of flat, stable, and spin-polarized barrier interfaces opens a promising avenue for expanding the design of MTJ structures.

Submitted 8 August, 2023; originally announced August 2023.

Comments: 18 pages, 5 figures

arXiv:2307.14024 [pdf, other]

Multi-view Hypergraph Contrastive Policy Learning for Conversational Recommendation

Authors: Sen Zhao, Wei Wei, Xian-Ling Mao, Shuai Zhu, Minghui Yang, Zujie Wen, Dangyang Chen, Feida Zhu

Abstract: Conversational recommendation systems (CRS) aim to interactively acquire user preferences and accordingly recommend items to users. Accurately learning the dynamic user preferences is of crucial importance for CRS. Previous works learn the user preferences with pairwise relations from the interactive conversation and item knowledge, while largely ignoring the fact that factors for a relationship in CRS are multiplex. Specifically, the user likes/dislikes the items that satisfy some attributes (Like/Dislike view). Moreover, social influence is another important factor that affects user preference towards the item (Social view), yet it is largely ignored by previous works in CRS. The user preferences from these three views are inherently different but also correlated as a whole. The user preferences from the same view should be more similar than those from different views. The user preferences from the Like view should be similar to those from the Social view while different from those from the Dislike view. To this end, we propose a novel model, namely Multi-view Hypergraph Contrastive Policy Learning (MHCPL). Specifically, MHCPL selects useful social information in a timely manner according to the interactive history and builds a dynamic hypergraph with three types of multiplex relations from different views. The multiplex relations in each view are successively connected according to their generation order.

Submitted 26 July, 2023; originally announced July 2023.

arXiv:2307.10230 [pdf, other]

Prompt Tuning on Graph-augmented Low-resource Text Classification

Authors: Zhihao Wen, Yuan Fang

Abstract: Text classification is a fundamental problem in information retrieval with many real-world applications, such as predicting the topics of online articles and the categories of e-commerce product descriptions. However, low-resource text classification, with no or few labeled samples, presents a serious concern for supervised learning. Meanwhile, many text data are inherently grounded on a network structure, such as a hyperlink/citation network for online articles, and a user-item purchase network for e-commerce products. These graph structures capture rich semantic relationships, which can potentially augment low-resource text classification. In this paper, we propose a novel model called Graph-Grounded Pre-training and Prompting (G2P2) to address low-resource text classification in a two-pronged approach. During pre-training, we propose three graph interaction-based contrastive strategies to jointly pre-train a graph-text model; during downstream classification, we explore handcrafted discrete prompts and continuous prompt tuning for the jointly pre-trained model to achieve zero- and few-shot classification, respectively. Moreover, we explore the possibility of employing continuous prompt tuning for zero-shot inference. Specifically, we aim to generalize continuous prompts to unseen classes while leveraging a set of base classes. To this end, we extend G2P2 into G2P2$^*$, hinging on a new architecture of conditional prompt tuning. Extensive experiments on four real-world datasets demonstrate the strength of G2P2 in zero- and few-shot low-resource text classification tasks, and illustrate the advantage of G2P2$^*$ in dealing with unseen classes.

Submitted 27 November, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

Comments: 14 pages, journal under review. arXiv admin note: substantial text overlap with arXiv:2305.03324

arXiv:2307.08969 [pdf, other]

doi 10.1109/TVCG.2023.3327148

Quantivine: A Visualization Approach for Large-scale Quantum Circuit Representation and Analysis

Authors: Zhen Wen, Yihan Liu, Siwei Tan, Jieyi Chen, Minfeng Zhu, Dongming Han, Jianwei Yin, Mingliang Xu, Wei Chen

Abstract: Quantum computing is a rapidly evolving field that enables exponential speed-up over classical algorithms. At the heart of this revolutionary technology are quantum circuits, which serve as vital tools for implementing, analyzing, and optimizing quantum algorithms. Recent advancements in quantum computing and the increasing capability of quantum devices have led to the development of more complex quantum circuits. However, traditional quantum circuit diagrams suffer from scalability and readability issues, which limit the efficiency of analysis and optimization processes. In this research, we propose a novel visualization approach for large-scale quantum circuits by adopting semantic analysis to facilitate the comprehension of quantum circuits. We first exploit meta-data and semantic information extracted from the underlying code of quantum circuits to create component segmentations and pattern abstractions, allowing for easier wrangling of massive circuit diagrams. We then develop Quantivine, an interactive system for exploring and understanding quantum circuits. A series of novel circuit visualizations are designed to uncover contextual details such as qubit provenance, parallelism, and entanglement. The effectiveness of Quantivine is demonstrated through two usage scenarios of quantum circuits with up to 100 qubits and a formal user evaluation with quantum experts. A free copy of this paper and all supplemental materials are available at https://osf.io/2m9yh/?view_only=0aa1618c97244f5093cd7ce15f1431f9.

Submitted 18 July, 2023; originally announced July 2023.

Comments: Accepted by IEEE VIS 2023

Journal ref: IEEE Transactions on Visualization and Computer Graphics, 2023

arXiv:2307.08929 [pdf, other]

Active learning of effective Hamiltonian for super-large-scale atomic structures

Authors: Xingyue Ma, Hongying Chen, Ri He, Zhanbo Yu, Sergei Prokhorenko, Zheng Wen, Zhicheng Zhong, Jorge Iñiguez, L. Bellaiche, Di Wu, Yurong Yang

Abstract: The first-principles-based effective Hamiltonian scheme provides one of the most accurate modeling techniques for large-scale structures, especially for ferroelectrics. However, the parameterization of the effective Hamiltonian is complicated and can be difficult for some complex systems such as high-entropy perovskites. Here, we propose a general form of effective Hamiltonian and develop an active machine learning approach to parameterize the effective Hamiltonian based on Bayesian linear regression. The parameterization is employed in molecular dynamics simulations with the prediction of energy, forces, stress and their uncertainties at each step, which decides whether first-principles calculations are executed to retrain the parameters. Structures of the BaTiO$_3$, Pb(Zr$_{0.75}$Ti$_{0.25}$)O$_3$ and (Pb,Sr)TiO$_3$ systems are taken as examples to show the accuracy of this approach, as compared with the conventional parametrization method and experiments. This machine learning approach provides a universal and automatic way to compute the effective Hamiltonian parameters for any considered complex systems with super-large-scale (more than $10^7$ atoms) atomic structures.

Submitted 14 May, 2024; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: 11 pages, 4 figures

arXiv:2307.08699 [pdf, other]

Pair then Relation: Pair-Net for Panoptic Scene Graph Generation

Authors: Jinghao Wang, Zhengyu Wen, Xiangtai Li, Zujin Guo, Jingkang Yang, Ziwei Liu

Abstract: Panoptic Scene Graph (PSG) is a challenging task in Scene Graph Generation (SGG) that aims to create a more comprehensive scene graph representation using panoptic segmentation instead of boxes. Compared to SGG, PSG has several challenging problems: pixel-level segment outputs and full relationship exploration (it also considers thing and stuff relations). Thus, current PSG methods have limited performance, which hinders downstream tasks or applications. This work aims to design a novel and strong baseline for PSG. To achieve that, we first conduct an in-depth analysis to identify the bottleneck of the current PSG models, finding that inter-object pair-wise recall is a crucial factor that was ignored by previous PSG methods. Based on this and the recent query-based frameworks, we present a novel framework: Pair then Relation (Pair-Net), which uses a Pair Proposal Network (PPN) to learn and filter sparse pair-wise relationships between subjects and objects. Moreover, we also observed the sparse nature of object pairs for both Motivated by this, we design a lightweight Matrix Learner within the PPN, which directly learns pair-wise relationships for pair proposal generation. Through extensive ablation and analysis, our approach significantly improves upon leveraging the segmenter solid baseline. Notably, our method achieves new state-of-the-art results on the PSG benchmark, with over 10\% absolute gains compared to PSGFormer. The code of this paper is publicly available at https://github.com/king159/Pair-Net.

Submitted 1 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Project Page: https://github.com/king159/Pair-Net

arXiv:2307.05074 [pdf, other]

Retrieval-augmented GPT-3.5-based Text-to-SQL Framework with Sample-aware Prompting and Dynamic Revision Chain

Authors: Chunxi Guo, Zhiliang Tian, Jintao Tang, Shasha Li, Zhihua Wen, Kaixuan Wang, Ting Wang

Abstract: Text-to-SQL aims at generating SQL queries for the given natural language questions and thus helping users to query databases. Prompt learning with large language models (LLMs) has emerged as a recent approach, which designs prompts to lead LLMs to understand the input question and generate the corresponding SQL. However, it faces challenges with strict SQL syntax requirements. Existing work prompts the LLMs with a list of demonstration examples (i.e. question-SQL pairs) to generate SQL, but the fixed prompts can hardly handle the scenario where the semantic gap between the retrieved demonstration and the input question is large. In this paper, we propose a retrieval-augmented prompting method for an LLM-based Text-to-SQL framework, involving sample-aware prompting and a dynamic revision chain. Our approach incorporates sample-aware demonstrations, which include the composition of SQL operators and fine-grained information related to the given question. To retrieve questions sharing similar intents with input questions, we propose two strategies for assisting retrieval. Firstly, we leverage LLMs to simplify the original questions, unifying the syntax and thereby clarifying the users' intentions. To generate executable and accurate SQLs without human intervention, we design a dynamic revision chain which iteratively adapts fine-grained feedback from the previously generated SQL. Experimental results on three Text-to-SQL benchmarks demonstrate the superiority of our method over strong baseline models.

Submitted 4 September, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

arXiv:2307.02046 [pdf, other]

Recommender Systems in the Era of Large Language Models (LLMs)

Authors: Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, Qing Li

Abstract: With the prosperity of e-commerce and web applications, Recommender Systems (RecSys) have become an important component of our daily life, providing personalized suggestions that cater to user preferences. While Deep Neural Networks (DNNs) have made significant advancements in enhancing recommender systems by modeling user-item interactions and incorporating textual side information, DNN-based methods still face limitations, such as difficulties in understanding users' interests and capturing textual side information, inabilities in generalizing to various recommendation scenarios and reasoning on their predictions, etc. Meanwhile, the emergence of Large Language Models (LLMs), such as ChatGPT and GPT4, has revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI), due to their remarkable abilities in fundamental responsibilities of language understanding and generation, as well as impressive generalization and reasoning capabilities. As a result, recent studies have attempted to harness the power of LLMs to enhance recommender systems. Given the rapid evolution of this research direction in recommender systems, there is a pressing need for a systematic overview that summarizes existing LLM-empowered recommender systems, to provide researchers in relevant fields with an in-depth understanding. Therefore, in this paper, we conduct a comprehensive review of LLM-empowered recommender systems from various aspects including Pre-training, Fine-tuning, and Prompting.
More specifically, we first introduce representative methods to harness the power of LLMs (as a feature encoder) for learning representations of users and items. Then, we review recent techniques of LLMs for enhancing recommender systems from three paradigms, namely pre-training, fine-tuning, and prompting. Finally, we comprehensively discuss future directions in this emerging field.

Submitted 29 April, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: Accepted by IEEE TKDE

arXiv:2307.00783 [pdf, other]

Monte Carlo Policy Gradient Method for Binary Optimization

Authors: Cheng Chen, Ruitao Chen, Tianyou Li, Ruichen Ao, Zaiwen Wen

Abstract: Binary optimization has a wide range of applications in combinatorial optimization problems such as MaxCut, MIMO detection, and MaxSAT. However, these problems are typically NP-hard due to the binary constraints. We develop a novel probabilistic model to sample the binary solution according to a parameterized policy distribution. Specifically, minimizing the KL divergence between the parameterized policy distribution and the Gibbs distributions of the function value leads to a stochastic optimization problem whose policy gradient can be derived explicitly similar to reinforcement learning. For coherent exploration in discrete spaces, parallel Markov Chain Monte Carlo (MCMC) methods are employed to sample from the policy distribution with diversity and approximate the gradient efficiently. We further develop a filter scheme to replace the original objective function by the one with the local search technique to broaden the horizon of the function landscape. Convergence to stationary points in expectation of the policy gradient method is established based on the concentration inequality for MCMC. Numerical results show that this framework is very promising to provide near-optimal solutions for quite a few binary optimization problems.

Submitted 3 July, 2023; originally announced July 2023.

MSC Class: 90C09; 90C27; 90C59; 60J45; 60J20

arXiv:2307.00731 [pdf, other]

doi 10.1088/1674-4527/ace179

Reciprocating Magnetic Fields in the Pulsar Wind Observed from the Black Widow Pulsar J1720-0534

Authors: Chen-Chen Miao, Victoria Blackmon, Wei-Wei Zhu, Dong-Zi Li, Mingyu Ge, Xiao-Peng You, Maura McLaughlin, Di Li, Na Wang, Pei Wang, Jia-Rui Niu, M. Cruces, Jian-Ping Yuan, Jun-Tao Bai, D. J. Champion, Yu-Tong Chen, Ming-Min Chi, P. C. C. Freire, Yi Feng, Zhen-Ye Gan, M. Kramer, Fei-Fei Kou, Yu-Xi Li, Xue-Li Miao, Ling-Qi Meng, et al. (19 additional authors not shown)

Abstract: We report the radio observations of the eclipsing black widow pulsar J1720-0534, a 3.26 ms pulsar in orbit with a low mass companion of mass 0.029 to 0.034 M$_{\odot}$. We obtain the phase-connected timing ephemeris and polarization profile of this millisecond pulsar (MSP) using the Five-hundred-meter Aperture Spherical Radio Telescope (FAST), the Green Bank Telescope (GBT), and the Parkes Telescope. For the first time from such a system, an oscillatory polarisation angle change was observed from a particular eclipse egress with partial depolarization, indicating 10-milliGauss-level reciprocating magnetic fields oscillating in a length scale of 5000 km (assuming an orbital inclination angle of 90 degrees) outside the companion's magnetosphere. The dispersion measure variation observed during the ingresses and egresses shows the rapid rise of the electron density in the shock boundary between the companion's magnetosphere and the surrounding pulsar wind. We suggest that the observed oscillatory magnetic fields originate from the pulsar wind outside the companion's magnetosphere.

Submitted 28 August, 2023; v1 submitted 2 July, 2023; originally announced July 2023.

Comments: 15 pages, 8 figures, 1 table, accepted by RAA

arXiv:2307.00358 [pdf, ps, other]

The Error in Multivariate Linear Extrapolation with Applications to Derivative-Free Optimization

Authors: Liyuan Cao, Zaiwen Wen, Ya-xiang Yuan

Abstract: We study in this paper the function approximation error of multivariate linear extrapolation. The sharp error bound of linear interpolation already exists in the literature. However, linear extrapolation is used far more often in applications such as derivative-free optimization, while its error is not well-studied. We introduce in this paper a method to numerically compute the sharp bound on the error, and then present several analytical bounds along with the conditions under which they are sharp. We analyze in depth the approximation error achievable by quadratic functions and the error bound for the bivariate case. All results are under the assumptions that the function being interpolated has Lipschitz continuous gradient and is interpolated on an affinely independent sample set.

Submitted 1 July, 2023; originally announced July 2023.

Comments: arXiv admin note: text overlap with arXiv:2209.12606

arXiv:2306.15401 [pdf, other]

Explainable Multimodal Emotion Recognition

Authors: Zheng Lian, Haiyang Sun, Licai Sun, Hao Gu, Zhuofan Wen, Siyuan Zhang, Shun Chen, Mingyu Xu, Ke Xu, Kang Chen, Lan Chen, Shan Liang, Ya Li, Jiangyan Yi, Bin Liu, Jianhua Tao

Abstract: Multimodal emotion recognition is an important research topic in artificial intelligence, whose main goal is to integrate multimodal clues to identify human emotional states. Current works generally assume accurate labels for benchmark datasets and focus on developing more effective architectures. However, emotion annotation relies on subjective judgment. To obtain more reliable labels, existing datasets usually restrict the label space to some basic categories, then hire plenty of annotators and use majority voting to select the most likely label. However, this process may result in some correct but non-candidate or non-majority labels being ignored. To ensure reliability without ignoring subtle emotions, we propose a new task called ``Explainable Multimodal Emotion Recognition (EMER)''. Unlike traditional emotion recognition, EMER takes a step further by providing explanations for these predictions. Through this task, we can extract relatively reliable labels since each label has a certain basis. Meanwhile, we borrow large language models (LLMs) to disambiguate unimodal clues and generate more complete multimodal explanations. From them, we can extract richer emotions in an open-vocabulary manner. This paper presents our initial attempt at this task, including introducing a new dataset, establishing baselines, and defining evaluation metrics. In addition, EMER can serve as a benchmark task to evaluate the audio-video-text understanding performance of multimodal LLMs.

Submitted 23 May, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.14112 [pdf, other]

Enhancing Dynamic Image Advertising with Vision-Language Pre-training

Authors: Zhoufutu Wen, Xinyu Zhao, Zhipeng Jin, Yi Yang, Wei Jia, Xiaodong Chen, Shuanglong Li, Lin Liu

Abstract: In the multimedia era, image is an effective medium in search advertising. Dynamic Image Advertising (DIA), a system that matches queries with ad images and generates multimodal ads, is introduced to improve user experience and ad revenue. The core of DIA is a query-image matching module performing ad image retrieval and relevance modeling. Current query-image matching suffers from limited and inconsistent data, and insufficient cross-modal interaction. Also, the separate optimization of retrieval and relevance models affects overall performance. To address this issue, we propose a vision-language framework consisting of two parts. First, we train a base model on large-scale image-text pairs to learn general multimodal representation. Then, we fine-tune the base model on advertising business data, unifying relevance modeling and retrieval through multi-objective learning. Our framework has been implemented in Baidu search advertising system "Phoenix Nest". Online evaluation shows that it improves cost per mille (CPM) and click-through rate (CTR) by 1.04% and 1.865%.

Submitted 24 June, 2023; originally announced June 2023.

Comments: 6 pages, 3 figures, accepted to SIRIP 2023

arXiv:2306.10508 [pdf, other]

QCNeXt: A Next-Generation Framework For Joint Multi-Agent Trajectory Prediction

Authors: Zikang Zhou, Zihao Wen, Jianping Wang, Yung-Hui Li, Yu-Kai Huang

Abstract: Estimating the joint distribution of on-road agents' future trajectories is essential for autonomous driving. In this technical report, we propose a next-generation framework for joint multi-agent trajectory prediction called QCNeXt. First, we adopt the query-centric encoding paradigm for the task of joint multi-agent trajectory prediction. Powered by this encoding scheme, our scene encoder is equipped with permutation equivariance on the set elements, roto-translation invariance in the space dimension, and translation invariance in the time dimension. These invariance properties not only enable accurate multi-agent forecasting fundamentally but also empower the encoder with the capability of streaming processing. Second, we propose a multi-agent DETR-like decoder, which facilitates joint multi-agent trajectory prediction by modeling agents' interactions at future time steps. For the first time, we show that a joint prediction model can outperform marginal prediction models even on the marginal metrics, which opens up new research opportunities in trajectory prediction. Our approach ranks 1st on the Argoverse 2 multi-agent motion forecasting benchmark, winning the championship of the Argoverse Challenge at the CVPR 2023 Workshop on Autonomous Driving.

Submitted 18 June, 2023; originally announced June 2023.

Comments: Technical report for the 1st place solution of the Argoverse 2 Multi-Agent Motion Forecasting Competition at the CVPR 2023 Workshop on Autonomous Driving

arXiv:2306.05118 [pdf, other]

doi 10.1145/3580305.3599796

Controllable Multi-Objective Re-ranking with Policy Hypernetworks

Authors: Sirui Chen, Yuan Wang, Zijing Wen, Zhiyu Li, Changshuo Zhang, Xiao Zhang, Quan Lin, Cheng Zhu, Jun Xu

Abstract: Multi-stage ranking pipelines have become widely used strategies in modern recommender systems, where the final stage aims to return a ranked list of items that balances a number of requirements such as user preference, diversity, novelty etc. Linear scalarization is arguably the most widely used technique to merge multiple requirements into one optimization objective, by summing up the requirements with certain preference weights. Existing final-stage ranking methods often adopt a static model where the preference weights are determined during offline training and kept unchanged during online serving. Whenever a modification of the preference weights is needed, the model has to be re-trained, which is time- and resource-inefficient. Meanwhile, the most appropriate weights may vary greatly for different groups of targeting users or at different time periods (e.g., during holiday promotions). In this paper, we propose a framework called controllable multi-objective re-ranking (CMR) which incorporates a hypernetwork to generate parameters for a re-ranking model according to different preference weights. In this way, CMR is enabled to adapt the preference weights according to the environment changes in an online manner, without retraining the models. Moreover, we classify practical business-oriented tasks into four main categories and seamlessly incorporate them in a newly proposed re-ranking model based on an Actor-Evaluator framework, which serves as a reliable real-world testbed for CMR.
Offline experiments based on the dataset collected from Taobao App showed that CMR improved several popular re-ranking models by using them as underlying models. Online A/B tests also demonstrated the effectiveness and trustworthiness of CMR.

Submitted 17 July, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.04187 [pdf, other]

doi 10.18653/v1/2023.findings-acl.671

Knowing-how & Knowing-that: A New Task for Machine Comprehension of User Manuals

Authors: Hongru Liang, Jia Liu, Weihong Du, Dingnan Jin, Wenqiang Lei, Zujie Wen, Jiancheng Lv

Abstract: The machine reading comprehension (MRC) of user manuals has huge potential in customer service. However, current methods have trouble answering complex questions. Therefore, we introduce the Knowing-how & Knowing-that task that requires the model to answer factoid-style, procedure-style, and inconsistent questions about user manuals. We resolve this task by jointly representing the steps and facts in a graph TARA, which supports a unified inference of various questions. Towards a systematic benchmarking study, we design a heuristic method to automatically parse user manuals into TARAs and build an annotated dataset to test the model's ability in answering real-world questions. Empirical results demonstrate that representing user manuals as TARAs is a desired solution for the MRC of user manuals. An in-depth investigation of TARA further sheds light on the issues and broader impacts of future representations of user manuals. We hope our work can move the MRC of user manuals to a more complex and realistic stage.

Submitted 8 August, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Journal ref: Findings of the Association for Computational Linguistics: ACL 2023. (2023)

arXiv:2306.04099 [pdf, other]

NTKCPL: Active Learning on Top of Self-Supervised Model by Estimating True Coverage

Authors: Ziting Wen, Oscar Pizarro, Stefan Williams

Abstract: High annotation cost for training machine learning classifiers has driven extensive research in active learning and self-supervised learning. Recent research has shown that in the context of supervised learning different active learning strategies need to be applied at various stages of the training process to ensure improved performance over the random baseline. We refer to the point where the number of available annotations changes the suitable active learning strategy as the phase transition point. In this paper, we establish that when combining active learning with self-supervised models to achieve improved performance, the phase transition point occurs earlier. It becomes challenging to determine which strategy should be used for previously unseen datasets. We argue that existing active learning algorithms are heavily influenced by the phase transition because the empirical risk over the entire active learning pool estimated by these algorithms is inaccurate and influenced by the number of labeled samples. To address this issue, we propose a novel active learning strategy, neural tangent kernel clustering-pseudo-labels (NTKCPL). It estimates empirical risk based on pseudo-labels and the model prediction with NTK approximation. We analyze the factors affecting this approximation error and design a pseudo-label clustering generation method to reduce the approximation error. We validate our method on five datasets, empirically demonstrating that it outperforms the baseline methods in most cases and is valid over a wider range of training budgets.

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.20068 [pdf, other]

TOFG: A Unified and Fine-Grained Environment Representation in Autonomous Driving

Authors: Zihao Wen, Yifan Zhang, Xinhong Chen, Jianping Wang

Abstract: In autonomous driving, an accurate understanding of the environment, e.g., the vehicle-to-vehicle and vehicle-to-lane interactions, plays a critical role in many driving tasks such as trajectory prediction and motion planning. Environment information comes from high-definition (HD) maps and historical trajectories of vehicles. Due to the heterogeneity of the map data and trajectory data, many data-driven models for trajectory prediction and motion planning extract vehicle-to-vehicle and vehicle-to-lane interactions in a separate and sequential manner. However, such a manner may capture a biased interpretation of interactions, causing lower prediction and planning accuracy. Moreover, separate extraction leads to a complicated model structure and hence the overall efficiency and scalability are sacrificed. To address the above issues, we propose an environment representation, Temporal Occupancy Flow Graph (TOFG). Specifically, the occupancy flow-based representation unifies the map information and vehicle trajectories into a homogeneous data format and enables a consistent prediction. The temporal dependencies among vehicles can help capture the change of occupancy flow in a timely manner to further promote model performance. To demonstrate that TOFG is capable of simplifying the model architecture, we incorporate TOFG with a simple graph attention (GAT) based neural network and propose TOFG-GAT, which can be used for both trajectory prediction and motion planning.
Experiment results show that TOFG-GAT achieves better or competitive performance compared with all the SOTA baselines with less training time.

Submitted 31 May, 2023; originally announced May 2023.

Comments: Accepted by ICRA 2023

arXiv:2305.13774 [pdf, other]

ADD 2023: the Second Audio Deepfake Detection Challenge

Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, and actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings are also reported in audio deepfake detection tasks.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.10011 [pdf]

Super-Resolution Imaging via Angular Magnification

Authors: Yi Zhou, Dingpeng Liao, Kun Zhang, Zijie Ma, Shikai Wu, Jun Ma, Xuemei Dai, Zhengguo Shang, Zhongquan Wen, Gang Chen

Abstract: The far-field resolution of optical imaging systems is restricted by the Abbe diffraction limit, a direct result of the wave nature of light. One successful technological approach to circumventing this limit is to reduce the effective size of a point-spread-function. In the past decades, great endeavors have been made to engineer an effective point-spread-function by exploiting different mechanisms, including optical nonlinearities and structured light illumination. However, these methods are hard to be applied to objects in a far distance. Here, we propose a new way to achieve super-resolution in a far field by utilizing angular magnification. We present the first proof-of-concept demonstration of such an idea and demonstrate a new class of lenses with angular magnification for far-field super-resolution imaging. Both theoretical and experimental results demonstrate a more than two-fold enhancement beyond the angular-resolution limit in the far-field imaging. The proposed approach can be applied to super-resolution imaging of objects in far distance. It has promising potential applications in super-resolution telescopes and remote sensing.

Submitted 17 May, 2023; originally announced May 2023.

Showing 51–100 of 487 results for author: Wen, Z