Search | arXiv e-print repository

Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

Authors: Bingxiang He, Ning Ding, Cheng Qian, Jia Deng, Ganqu Cui, Lifan Yuan, Huan-ang Gao, Huimin Chen, Zhiyuan Liu, Maosong Sun

Abstract: Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 33 pages, 14 figures

arXiv:2406.03949 [pdf, other]

UltraMedical: Building Specialized Generalists in Biomedicine

Authors: Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, Bowen Zhou

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, which have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning and reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks. Moreover, we develop powerful reward models skilled in biomedical and general reward benchmark, enhancing further online preference learning within the biomedical LLM community.

Submitted 6 June, 2024; originally announced June 2024.

Comments: Datasets and models are available at https://github.com/TsinghuaC3I/UltraMedical

arXiv:2405.17220 [pdf, other]

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

Authors: Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

Abstract: Learning from feedback reduces the hallucination of multimodal large language models (MLLMs) by aligning them with human preferences. While traditional methods rely on labor-intensive and time-consuming manual labeling, recent approaches employing models as automatic labelers have shown promising results without human intervention. However, these methods heavily rely on costly proprietary models like GPT-4V, resulting in scalability issues. Moreover, this paradigm essentially distills the proprietary models to provide a temporary solution to quickly bridge the performance gap. As this gap continues to shrink, the community is soon facing the essential challenge of aligning MLLMs using labeler models of comparable capability. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness. RLAIF-V maximally exploits the open-source feedback from two perspectives, including high-quality feedback data and online feedback learning algorithm. Extensive experiments on seven benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks. Using a 34B model as labeler, RLAIF-V 7B model reduces object hallucination by 82.9% and overall hallucination by 42.1%, outperforming the labeler model. Remarkably, RLAIF-V also reveals the self-alignment potential of open-source MLLMs, where a 12B model can learn from the feedback of itself to achieve less than 29.5% overall hallucination rate, surpassing GPT-4V (45.9%) by a large margin. The results shed light on a promising route to enhance the efficacy of leading-edge MLLMs.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Project Website: https://github.com/RLHF-V/RLAIF-V

arXiv:2404.06395 [pdf, other]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Authors: Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun

Abstract: The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM .

Submitted 3 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: revise according to peer review
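
Illustrative note on the Warmup-Stable-Decay (WSD) scheduler named in the MiniCPM abstract above: the abstract names three phases but gives no formula, so the following is only a minimal sketch under stated assumptions. The function name `wsd_lr`, its parameters, the linear warmup, and the linear decay shape are all hypothetical choices made for illustration and may differ from what the paper uses.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a three-phase Warmup-Stable-Decay schedule.

    Assumptions (not taken from the abstract): linear warmup, linear decay,
    and a floor of `min_lr` once the decay phase has finished.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate constant.
        return max_lr
    # Decay: move linearly from max_lr down to min_lr, then stay there.
    t = min(step - warmup_steps - stable_steps, decay_steps)
    return max_lr - (max_lr - min_lr) * t / decay_steps


if __name__ == "__main__":
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, max_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Because the stable phase holds a constant rate, a checkpoint taken during it can in principle be continued with more data before a decay phase is applied, which is consistent with the abstract's remark about continuous training.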

arXiv:2404.02078 [pdf, other]

Advancing LLM Reasoning Generalists with Preference Trees

Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun

Abstract: We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.

Submitted 2 April, 2024; originally announced April 2024.

Comments: Models and data are available at https://github.com/OpenBMB/Eurus

arXiv:2403.11038 [pdf, other]

Texture Edge detection by Patch consensus (TEP)

Authors: Guangyu Cui, Sung Ha Kang

Abstract: We propose Texture Edge detection using Patch consensus (TEP), which is a training-free method to detect the boundary of texture. We propose a new simple way to identify the texture edge location, using the consensus of segmented local patch information. On the boundary, even with local patch information, the distinction between textures is typically not clear, but using neighbor consensus gives a clear idea of the boundary. We utilize a local patch, and its response against neighboring regions, to emphasize the similarities and the differences across different textures. The step of segmentation of response further emphasizes the edge location, and the neighborhood voting gives consensus and stabilizes the edge detection. We analyze texture as a stationary process to give insight into the patch width parameter versus the quality of edge detection. We derive the necessary condition for textures to be distinguished, and analyze the patch width with respect to the scale of textures. Various experiments are presented to validate the proposed model.

Submitted 16 March, 2024; originally announced March 2024.

arXiv:2403.08281 [pdf, other]

Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models

Authors: Ning Ding, Yulin Chen, Ganqu Cui, Xingtai Lv, Weilin Zhao, Ruobing Xie, Bowen Zhou, Zhiyuan Liu, Maosong Sun

Abstract: Underlying data distributions of natural language, programming code, and mathematical symbols vary vastly, presenting a complex challenge for large language models (LLMs) that strive to achieve high performance across all three domains simultaneously. Achieving a very high level of proficiency for an LLM within a specific domain often requires extensive training with relevant corpora, which is typically accompanied by a sacrifice in performance in other domains. In this paper, we propose to fuse models that are already highly-specialized directly. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics. A token-level gating mechanism is introduced to blend the specialists' outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model could simultaneously achieve mastery of the three crucial domains.

Submitted 26 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.04180 [pdf, ps, other]

RATSF: Empowering Customer Service Volume Management through Retrieval-Augmented Time-Series Forecasting

Authors: Tianfeng Wang, Gaojie Cui

Abstract: An efficient customer service management system hinges on precise forecasting of service volume. In this scenario, where data non-stationarity is pronounced, successful forecasting heavily relies on identifying and leveraging similar historical data rather than merely summarizing periodic patterns. Existing models based on RNN or Transformer architectures may struggle with this flexible and effective utilization. To tackle this challenge, we initially developed the Time Series Knowledge Base (TSKB) with an advanced indexing system for efficient historical data retrieval. We also developed the Retrieval Augmented Cross-Attention (RACA) module, a variant of the cross-attention mechanism within Transformer's decoder layers, designed to be seamlessly integrated into the vanilla Transformer architecture to assimilate key historical data segments. The synergy between TSKB and RACA forms the backbone of our Retrieval-Augmented Time Series Forecasting (RATSF) framework. Based on the above two components, RATSF not only significantly enhances performance in the context of Fliggy hotel service volume forecasting but also adapts flexibly to various scenarios and integrates with a multitude of Transformer variants for time-series forecasting. Extensive experimentation has validated the effectiveness and generalizability of this system design across multiple diverse contexts.

Submitted 16 June, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.19085 [pdf, other]

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

Authors: Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun

Abstract: Alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax": a compromise where enhancements in alignment within one objective (e.g., harmlessness) can diminish performance in others (e.g., helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. To navigate this challenge, we argue the prominence of grounding LLMs with evident preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. Our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving Pareto improvements in multi-objective alignment.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.05369 [pdf, other]

Noise Contrastive Alignment of Language Models with Explicit Rewards

Authors: Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, Jun Zhu

Abstract: User intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (LMs). Existing alignment methods, such as Direct Preference Optimization (DPO), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. In this paper, we introduce a general framework for LM alignment, leveraging Noise Contrastive Estimation (NCE) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. Our framework comprises two parallel algorithms, NCA and InfoNCA, both enabling the direct extraction of an LM policy from reward data as well as preference data. Notably, we show that the DPO loss is a special case of our proposed InfoNCA objective under pairwise preference settings, thereby integrating and extending current alignment theories. By comparing NCA and InfoNCA, we demonstrate that the well-observed decreasing-likelihood trend of DPO/InfoNCA is caused by their focus on adjusting relative likelihood across different responses. In contrast, NCA optimizes the absolute likelihood for each response, thereby effectively preventing the chosen likelihood from decreasing. We evaluate our methods in both reward and preference settings with Mistral-8*7B and 7B models. Experiments suggest that InfoNCA/NCA surpasses various preference baselines when reward datasets are available. We also find NCA significantly outperforms DPO in complex reasoning tasks like math and coding.

Submitted 3 July, 2024; v1 submitted 7 February, 2024; originally announced February 2024.
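
Illustrative note on the pairwise DPO loss that the abstract above treats as a special case of InfoNCA: the sketch below computes the standard published DPO objective for a single preference pair from sequence log-probabilities. It is not the NCA or InfoNCA objective from this paper. The function name `dpo_loss`, the argument names, and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    Inputs are summed token log-probabilities of each full response.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss depends only on the difference between the two log-ratios, so it adjusts relative likelihood across responses, which is the property the abstract links to the decreasing-likelihood trend of DPO.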

arXiv:2402.05079 [pdf, other]

Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation

Authors: Ziyang Wang, Jian-Qing Zheng, Yichi Zhang, Ge Cui, Lei Li

Abstract: In recent advancements in medical image analysis, Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have set significant benchmarks. While the former excels in capturing local features through its convolution operations, the latter achieves remarkable global context understanding by leveraging self-attention mechanisms. However, both architectures exhibit limitations in efficiently modeling long-range dependencies within medical images, which is a critical aspect for precise segmentation. Inspired by the Mamba architecture, known for its proficiency in handling long sequences and global contextual information with enhanced computational efficiency as a State Space Model (SSM), we propose Mamba-UNet, a novel architecture that synergizes the U-Net in medical image segmentation with Mamba's capability. Mamba-UNet adopts a pure Visual Mamba (VMamba)-based encoder-decoder structure, infused with skip connections to preserve spatial information across different scales of the network. This design facilitates a comprehensive feature learning process, capturing intricate details and broader semantic contexts within medical images. We introduce a novel integration mechanism within the VMamba blocks to ensure seamless connectivity and information flow between the encoder and decoder paths, enhancing the segmentation performance. We conducted experiments on publicly available ACDC MRI Cardiac segmentation dataset, and Synapse CT Abdomen segmentation dataset. The results show that Mamba-UNet outperforms several types of UNet in medical image segmentation under the same hyper-parameter setting. The source code and baseline implementations are available.

Submitted 30 March, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.03456 [pdf, other]

Constrained Multiview Representation for Self-supervised Contrastive Learning

Authors: Siyuan Dai, Kai Ye, Kun Zhao, Ge Cui, Haoteng Tang, Liang Zhan

Abstract: Representation learning constitutes a pivotal cornerstone in contemporary deep learning paradigms, offering a conduit to elucidate distinctive features within the latent space and interpret the deep models. Nevertheless, the inherent complexity of anatomical patterns and the random nature of lesion distribution in medical image segmentation pose significant challenges to the disentanglement of representations and the understanding of salient features. Methods guided by the maximization of mutual information, particularly within the framework of contrastive learning, have demonstrated remarkable success and superiority in decoupling densely intertwined representations. However, the effectiveness of contrastive learning highly depends on the quality of the positive and negative sample pairs, i.e. the unselected average mutual information among multi-views would obstruct the learning strategy so the selection of the views is vital. In this work, we introduce a novel approach predicated on representation distance-based mutual information (MI) maximization for measuring the significance of different views, aiming at conducting more efficient contrastive learning and representation disentanglement. Additionally, we introduce an MI re-ranking strategy for representation selection, benefiting both the continuous MI estimating and representation significance distance measuring. Specifically, we harness multi-view representations extracted from the frequency domain, re-evaluating their significance based on mutual information across varying frequencies, thereby facilitating a multifaceted contrastive learning approach to bolster semantic comprehension. The statistical results under the five metrics demonstrate that our proposed framework proficiently constrains the MI maximization-driven representation selection and steers the multi-view contrastive learning process.

Submitted 5 February, 2024; originally announced February 2024.

Comments: 11 pages, 9 figures, 2 algorithms

arXiv:2312.07075 [pdf, other]

Motion Planning and Control of A Morphing Quadrotor in Restricted Scenarios

Authors: Guiyang Cui, Ruihao Xia, Xin Jin, Yang Tang

Abstract: Morphing quadrotors with four external actuators can adapt to different restricted scenarios by changing their geometric structure. However, previous works mainly focus on the improvements in structures and controllers, and existing planning algorithms don't consider the morphological modifications, which leads to safety and dynamic feasibility issues. In this paper, we propose a unified planning and control framework for morphing quadrotors to deform autonomously and efficiently. The framework consists of a milliseconds-level spatial-temporal trajectory optimizer that takes into account the morphological modifications of quadrotors. The optimizer can generate full-body safety trajectories including position and attitude. Additionally, it incorporates a nonlinear attitude controller that accounts for aerodynamic drag and dynamically adjusts dynamic parameters such as the inertia tensor and Center of Gravity. The controller can also online compute the thrust coefficient during morphing. Benchmark experiments compared with existing methods validate the robustness of the proposed controller. Extensive simulations and real-world experiments are performed to demonstrate the effectiveness of the proposed framework.

Submitted 12 December, 2023; originally announced December 2023.

Comments: 8 pages, 9 figures

arXiv:2312.00849 [pdf, other]

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Authors: Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua

Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. However, existing MLLMs prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically, RLHF-V collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that, RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8%, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs, and shows better robustness than GPT-4V in preventing hallucinations aroused from over-generalization. We open-source our code, model, and data at https://github.com/RLHF-V/RLHF-V.

Submitted 8 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Accepted by CVPR 2024

arXiv:2311.11340 [pdf, other]

RflyMAD: A Dataset for Multicopter Fault Detection and Health Assessment

Authors: Xiangli Le, Bo Jin, Gen Cui, Xunhua Dai, Quan Quan

Abstract: This paper presents an open-source dataset RflyMAD, a Multicopter Abnormal Dataset developed by Reliable Flight Control (Rfly) Group aiming to promote the development of research fields like fault detection and isolation (FDI) or health assessment (HA). The entire 114 GB dataset includes 11 types of faults under 6 flight statuses which are adapted from ADS-33 file to cover more occasions in which the multicopters have different mobility levels when faults occur. In the total 5629 flight cases, the fault time is up to 3283 minutes, and there are 2566 cases for software-in-the-loop (SIL) simulation, 2566 cases for hardware-in-the-loop (HIL) simulation and 497 cases for real flight. As it contains simulation data based on RflySim and real flight data, it is possible to improve the quantity while increasing the data quality. In each case, there are ULog, Telemetry log, Flight information and processed files for researchers to use and check. The RflyMAD dataset could be used as a benchmark for fault diagnosis methods and the support relationship between simulation data and real flight is verified through transfer learning methods. More methods as a baseline will be presented in the future, and RflyMAD will be updated with more data and types. In addition, the dataset and related toolkit can be accessed through https://rfly-openha.github.io/documents/4_resources/dataset.html.

Submitted 11 January, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.09868 [pdf, other]

INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair

Authors: Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, Ge Yu

Abstract: This paper introduces INTERVENOR (INTERactiVE chaiN Of Repair), a system designed to emulate the interactive code repair processes observed in humans, encompassing both code diagnosis and code repair. INTERVENOR prompts Large Language Models (LLMs) to play distinct roles during the code repair process, functioning as both a Code Learner and a Code Teacher. Specifically, the Code Learner is tasked with adhering to instructions to generate or repair code, while the Code Teacher is responsible for crafting a Chain-of-Repair (CoR) to serve as guidance for the Code Learner. During generating the CoR, the Code Teacher needs to check the generated codes from Code Learner and reassess how to address code bugs based on error feedback received from compilers. Experimental results demonstrate that INTERVENOR surpasses baseline models, exhibiting improvements of approximately 18% and 4.3% over GPT-3.5 in code generation and code translation tasks, respectively. Our further analyses show that CoR is effective to illuminate the reasons behind bugs and outline solution plans in natural language. With the feedback of code compilers, INTERVENOR can accurately identify syntax errors and assertion errors and provide precise instructions to repair codes. All data and codes are available at https://github.com/NEUIR/INTERVENOR

Submitted 12 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: 27 pages, 19 figures, 10 tables

arXiv:2310.01377 [pdf, other]

UltraFeedback: Boosting Language Models with High-quality Feedback

Authors: Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun

Abstract: Reinforcement learning from human feedback (RLHF) has become a pivot technique in aligning large language models (LLMs) with human preferences. In RLHF practice, preference data plays a crucial role in bridging human proclivity and LLMs. However, the scarcity of diverse, naturalistic datasets of human preferences on LLM outputs at scale poses a great challenge to RLHF as well as feedback learning research within the open-source community. Current preference datasets, either proprietary or limited in size and prompt variety, result in limited RLHF adoption in open-source models and hinder further exploration. In this study, we propose ULTRAFEEDBACK, a large-scale, high-quality, and diversified preference dataset designed to overcome these limitations and foster RLHF development. To create ULTRAFEEDBACK, we compile a diverse array of instructions and models from multiple sources to produce comparative data. We meticulously devise annotation instructions and employ GPT-4 to offer detailed feedback in both numerical and textual forms. ULTRAFEEDBACK establishes a reproducible and expandable preference data construction pipeline, serving as a solid foundation for future RLHF and feedback learning research. Utilizing ULTRAFEEDBACK, we train various models to demonstrate its effectiveness, including the reward model UltraRM, chat language model UltraLM-13B-PPO, and critique model UltraCM. Experimental results indicate that our models outperform existing open-source models, achieving top performance across multiple benchmarks. Our data and models are available at https://github.com/thunlp/UltraFeedback.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2307.11973 [pdf, other]

Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition

Authors: Yao Liu, Gangfeng Cui, Jiahui Luo, Xiaojun Chang, Lina Yao

Abstract: As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition serves as the foundation for many smart applications, with a strong focus on personal privacy. However, recognizing two-person interactions poses more challenges due to increased body occlusion and overlap compared to single-person actions. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a designed frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial information, appearance information, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches in most standard evaluation settings.

Submitted 14 May, 2024; v1 submitted 21 July, 2023; originally announced July 2023.

arXiv:2306.04618 [pdf, other]

Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

Authors: Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun

Abstract: This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP. We find that the distribution shift settings in previous studies commonly lack adequate challenges, hindering the accurate evaluation of OOD robustness. To address these issues, we propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we conduct a series of experiments on pre-trained language models for analysis and evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the relationship between in-distribution (ID) and OOD performance. We identify three typical types that unveil the inner learning mechanism, which could potentially facilitate the forecasting of OOD robustness, correlating with the advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and find that, despite exhibiting some effectiveness in specific cases, they do not offer significant improvement compared to vanilla fine-tuning. Further, we evaluate 5 LLMs with various adaptation paradigms and find that when sufficient ID data is available, fine-tuning domain-specific models outperform LLMs on ID examples significantly. However, in the case of OOD instances, prioritizing LLMs with in-context learning yields better results. We identify that both fine-tuned small models and LLMs face challenges in effectively addressing downstream tasks. The code is public at https://github.com/lifan-yuan/OOD_NLP.

Submitted 26 October, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted to NeurIPS 2023 Dataset and Benchmark Track. Code is available at https://github.com/lifan-yuan/OOD_NLP

arXiv:2305.18503 [pdf, other]

From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework

Authors: Yangyi Chen, Hongcheng Gao, Ganqu Cui, Lifan Yuan, Dehan Kong, Hanlu Wu, Ning Shi, Bo Yuan, Longtao Huang, Hui Xue, Zhiyuan Liu, Maosong Sun, Heng Ji

Abstract: Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs. The long-lasting adversarial attack-and-defense arms race in Natural Language Processing (NLP) is algorithm-centric, providing valuable techniques for automatic robustness evaluation. However, the existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples. In this paper, we aim to set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to further exploit the advantages of adversarial attacks. To address the above challenges, we first determine robustness evaluation dimensions based on model capabilities and specify the reasonable algorithm to generate adversarial samples for each dimension. Then we establish the evaluation protocol, including evaluation settings and metrics, under realistic demands. Finally, we use the perturbation degree of adversarial samples to control the sample validity. We implement a toolkit RobTest that realizes our automatic robustness evaluation framework. In our experiments, we conduct a robustness evaluation of RoBERTa models to demonstrate the effectiveness of our evaluation framework, and further show the rationality of each component in the framework. The code will be made public at https://github.com/thunlp/RobTest.

Submitted 29 May, 2023; originally announced May 2023.

Comments: Accepted to Findings of ACL 2023

arXiv:2304.08354 [pdf, other]

Tool Learning with Foundation Models

Authors: Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, et al. (16 additional authors not shown)

Abstract: Humans possess an extraordinary ability to create and utilize tools, allowing them to overcome physical limitations and explore new frontiers. With the advent of foundation models, AI systems have the potential to be equally adept in tool use as humans. This paradigm, i.e., tool learning with foundation models, combines the strengths of specialized tools and foundation models to achieve enhanced accuracy, efficiency, and automation in problem-solving. Despite its immense potential, there is still a lack of a comprehensive understanding of key challenges, opportunities, and future endeavors in this field. To this end, we present a systematic investigation of tool learning in this paper. We first introduce the background of tool learning, including its cognitive origins, the paradigm shift of foundation models, and the complementary roles of tools and models. Then we recapitulate existing tool learning research into tool-augmented and tool-oriented learning. We formulate a general tool learning framework: starting from understanding the user instruction, models should learn to decompose a complex task into several subtasks, dynamically adjust their plan through reasoning, and effectively conquer each sub-task by selecting appropriate tools. We also discuss how to train models for improved tool-use capabilities and facilitate the generalization in tool learning. Considering the lack of a systematic tool learning evaluation in prior works, we experiment with 18 representative tools and show the potential of current foundation models in skillfully utilizing tools. Finally, we discuss several open problems that require further investigation for tool learning. Overall, we hope this paper could inspire future research in integrating tools with foundation models.

Submitted 15 June, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

arXiv:2212.12418 [pdf]

Dynamic Speed Guidance for CAV Ramp Merging in Non-Cooperative Environment: An On-Site Experiment

Authors: Wei Ji, Yechi Ma, Guangzhang Cui, Xiaotian Qin, Wei Hua

Abstract: Ramp merging is a typical application of cooperative intelligent transportation system (C-ITS). Vehicle trajectories perceived by roadside sensors are importation complement to the limited visual field of on-board perception. Vehicle tracking and trajectory denoising algorithm is proposed in this paper to take full advantage of roadside cameras for vehicle trajectory and speed profile estimation. Dynamic speed guidance algorithm is proposed to help on-ramp vehicles to merge into mainline smoothly, even in non-cooperative environment where mainline vehicles are not expected to slow down to accommodate on-ramp vehicles. On-site experiments were taken out in a merging area of Hangzhou Belt Highway to testify our prototype system, and simulation analysis shows our proposed algorithm can achieve significant fuel savings during the ramp merging process.

Submitted 21 December, 2022; originally announced December 2022.

Comments: This work has been submitted to IFAC for possible publication

arXiv:2212.08408 [pdf, other]

Decoder Tuning: Efficient Language Understanding as Decoding

Authors: Ganqu Cui, Wentao Li, Ning Ding, Longtao Huang, Zhiyuan Liu, Maosong Sun

Abstract: With the evergrowing sizes of pre-trained models (PTMs), it has been an emerging practice to only provide the inference APIs for users, namely model-as-a-service (MaaS) setting. To adapt PTMs with model parameters frozen, most current approaches focus on the input side, seeking for powerful prompts to stimulate models for correct answers. However, we argue that input-side adaptation could be arduous due to the lack of gradient signals and they usually require thousands of API queries, resulting in high computation and time costs. In light of this, we present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side. Specifically, DecT first extracts prompt-stimulated output scores for initial predictions. On top of that, we train an additional decoder network on the output representations to incorporate posterior data knowledge. By gradient-based optimization, DecT can be trained within several seconds and requires only one PTM query per sample. Empirically, we conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.

Submitted 24 May, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

Comments: ACL 2023 main conference. Code: https://github.com/thunlp/DecT

arXiv:2211.05319 [pdf, other]

Few-shot Classification with Hypersphere Modeling of Prototypes

Authors: Ning Ding, Yulin Chen, Ganqu Cui, Xiaobin Wang, Hai-Tao Zheng, Zhiyuan Liu, Pengjun Xie

Abstract: Metric-based meta-learning is one of the de facto standards in few-shot learning. It composes of representation learning and metrics calculation designs. Previous works construct class representations in different ways, varying from mean output embedding to covariance and distributions. However, using embeddings in space lacks expressivity and cannot capture class information robustly, while statistical complex modeling poses difficulty to metric designs. In this work, we use tensor fields (``areas'') to model classes from the geometrical perspective for few-shot learning. We present a simple and effective method, dubbed hypersphere prototypes (HyperProto), where class information is represented by hyperspheres with dynamic sizes with two sets of learnable parameters: the hypersphere's center and the radius. Extending from points to areas, hyperspheres are much more expressive than embeddings. Moreover, it is more convenient to perform metric-based classification with hypersphere prototypes than statistical modeling, as we only need to calculate the distance from a data point to the surface of the hypersphere. Following this idea, we also develop two variants of prototypes under other measurements. Extensive experiments and analysis on few-shot learning tasks across NLP and CV and comparison with 20+ competitive baselines demonstrate the effectiveness of our approach.

Submitted 9 November, 2022; originally announced November 2022.

Comments: preprint

arXiv:2211.00151 [pdf, other]

A Close Look into the Calibration of Pre-trained Language Models

Authors: Yangyi Chen, Lifan Yuan, Ganqu Cui, Zhiyuan Liu, Heng Ji

Abstract: Pre-trained language models (PLMs) may fail in giving reliable estimates of their predictive uncertainty. We take a close look into this problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained control experiments to study the dynamic change in PLMs' calibration performance in training. We consider six factors as control variables, including dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. We observe a consistent change in calibration performance across six factors. We find that PLMs don't learn to become calibrated in training, evidenced by the continual increase in confidence, no matter whether the predictions are correct or not. We highlight that our finding somewhat contradicts two established conclusions: (a) Larger PLMs are more calibrated; (b) Pretraining improves model calibration. Next, we study the effectiveness of existing calibration methods in mitigating the overconfidence issue. Besides unlearnable calibration methods (e.g., label smoothing), we adapt and extend two recently proposed learnable methods that directly collect data to train models to have reasonable confidence estimations. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions. The code is available at https://github.com/lifan-yuan/PLMCalibration.

Submitted 8 May, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

Comments: Accepted to ACL 2023 main conference. Code is available at: https://github.com/lifan-yuan/PLMCalibration

arXiv:2210.10683 [pdf, other]

Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP

Authors: Yangyi Chen, Hongcheng Gao, Ganqu Cui, Fanchao Qi, Longtao Huang, Zhiyuan Liu, Maosong Sun

Abstract: Textual adversarial samples play important roles in multiple subfields of NLP research, including security, evaluation, explainability, and data augmentation. However, most work mixes all these roles, obscuring the problem definitions and research goals of the security role that aims to reveal the practical concerns of NLP models. In this paper, we rethink the research paradigm of textual adversarial samples in security scenarios. We discuss the deficiencies in previous work and propose our suggestions that the research on the Security-oriented adversarial NLP (SoadNLP) should: (1) evaluate their methods on security tasks to demonstrate the real-world concerns; (2) consider real-world attackers' goals, instead of developing impractical methods. To this end, we first collect, process, and release a security datasets collection Advbench. Then, we reformalize the task and adjust the emphasis on different goals in SoadNLP. Next, we propose a simple method based on heuristic rules that can easily fulfill the actual adversarial goals to simulate real-world attack methods. We conduct experiments on both the attack and the defense sides on Advbench. Experimental results show that our method has higher practical value, indicating that the research paradigm in SoadNLP may start from our new benchmark. All the code and data of Advbench can be obtained at https://github.com/thunlp/Advbench.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP 2022, main conference

arXiv:2208.08149 [pdf, ps, other]

A Concept and Argumentation based Interpretable Model in High Risk Domains

Authors: Haixiao Chi, Dawei Wang, Gaojie Cui, Feng Mao, Beishui Liao

Abstract: Interpretability has become an essential topic for artificial intelligence in some high-risk domains such as healthcare, bank and security. For commonly-used tabular data, traditional methods trained end-to-end machine learning models with numerical and categorical data only, and did not leverage human understandable knowledge such as data descriptions. Yet mining human-level knowledge from tabular data and using it for prediction remain a challenge. Therefore, we propose a concept and argumentation based model (CAM) that includes the following two components: a novel concept mining method to obtain human understandable concepts and their relations from both descriptions of features and the underlying data, and a quantitative argumentation-based method to do knowledge representation and reasoning. As a result of it, CAM provides decisions that are based on human-level knowledge and the reasoning process is intrinsically interpretable. Finally, to visualize the purposed interpretable model, we provide a dialogical explanation that contain dominated reasoning path within CAM. Experimental results on both open source benchmark dataset and real-word business dataset show that (1) CAM is transparent and interpretable, and the knowledge inside the CAM is coherent with human understanding; (2) Our interpretable approach can reach competitive results comparing with other state-of-art models.

Submitted 17 August, 2022; originally announced August 2022.

arXiv:2208.05263 [pdf, ps, other]

Enhanced Low-Redundancy Restricted Array for Direction of Arrival Estimation

Authors: Shidong Zhang, Zhengchun Zhou, Guolong Cui, Xiaohu Tang, Pingzhi Fan

Abstract: Sensor arrays play a significant role in direction of arrival (DOA) estimation. Specifically, arrays with low redundancy and reduced mutual coupling are desirable. In this paper, we investigate a sensor array configuration that has a restricted sensor spacing and propose a closed-form expression. We also propose several classes of low redundancy (LR) arrays. Interestingly, compared with super nested arrays (SNA) and maximum inter-element spacing constraint (MISC) arrays, one of the proposed arrays has a significant reduction in both redundancy ratio and mutual coupling. Numerical simulations are also conducted to verify the superiority of the proposed array over the known sparse arrays in terms of weight functions, mutual coupling matrices as well and DOA estimation performance.

Submitted 13 November, 2023; v1 submitted 10 August, 2022; originally announced August 2022.

arXiv:2206.08514 [pdf, other]

A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks

Authors: Ganqu Cui, Lifan Yuan, Bingxiang He, Yangyi Chen, Zhiyuan Liu, Maosong Sun

Abstract: Textual backdoor attacks are a kind of practical threat to NLP systems. By injecting a backdoor in the training phase, the adversary could control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) The differences between real-world scenarios (e.g. releasing poisoned datasets or models) are neglected, and we argue that each scenario has its own constraints and concerns, thus requires specific evaluation protocols; (2) The evaluation metrics only consider whether the attacks could flip the models' predictions on poisoned samples and retain performances on benign samples, but ignore that poisoned samples should also be stealthy and semantic-preserving. To address these issues, we categorize existing works into three practical scenarios in which attackers release datasets, pre-trained models, and fine-tuned models respectively, then discuss their unique evaluation methodologies. On metrics, to completely evaluate poisoned samples, we use grammar error increase and perplexity difference for stealthiness, along with text similarity for validity. After formalizing the frameworks, we develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning. With this toolkit, we perform extensive experiments to benchmark attack and defense models under the suggested paradigm. To facilitate the underexplored defenses against poisoned datasets, we further propose CUBE, a simple yet strong clustering-based defense baseline. We hope that our frameworks and benchmarks could serve as the cornerstones for future model development and evaluations.

Submitted 1 November, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022 Datasets & Benchmarks; Toolkits avaliable at https://github.com/thunlp/OpenBackdoor

arXiv:2204.05239 [pdf, other]

Exploring the Universal Vulnerability of Prompt-based Learning Paradigm

Authors: Lei Xu, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Zhiyuan Liu

Abstract: Prompt-based learning paradigm bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting. However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text. In this paper, we explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text. In both scenarios, we demonstrate that our triggers can totally control or severely decrease the performance of prompt-based models fine-tuned on arbitrary downstream tasks, reflecting the universal vulnerability of the prompt-based learning paradigm. Further experiments show that adversarial triggers have good transferability among language models. We also find conventional fine-tuning models are not vulnerable to adversarial triggers constructed from pre-trained language models. We conclude by proposing a potential solution to mitigate our attack methods. Code and data are publicly available at https://github.com/leix28/prompt-universal-vulnerability

Submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted to Findings of NAACL 2022

arXiv:2203.14101

A Roadmap for Big Model

Authors: Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, et al. (75 additional authors not shown)

Abstract: With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and the BM application in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides the follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory&Interpretability, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. In each topic, we summarize clearly the current studies and propose some future research directions. At the end of this paper, we conclude the further development of BMs in a more general view.

Submitted 20 April, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

Comments: This report has been withdrawn by the authors due to critical issues in Section 2.3.1 of Article 2

arXiv:2203.09770 [pdf, other]

Prototypical Verbalizer for Prompt-based Few-shot Tuning

Authors: Ganqu Cui, Shengding Hu, Ning Ding, Longtao Huang, Zhiyuan Liu

Abstract: Prompt-based tuning for pre-trained language models (PLMs) has shown its effectiveness in few-shot learning. Typically, prompt-based tuning wraps the input text into a cloze question. To make predictions, the model maps the output words to labels via a verbalizer, which is either manually designed or automatically built. However, manual verbalizers heavily depend on domain-specific prior knowledge and human efforts, while finding appropriate label words automatically still remains challenging. In this work, we propose the prototypical verbalizer (ProtoVerb) which is built directly from training data. Specifically, ProtoVerb learns prototype vectors as verbalizers by contrastive learning. In this way, the prototypes summarize training instances and are able to enclose rich class-level semantics. We conduct experiments on both topic classification and entity typing tasks, and the results demonstrate that ProtoVerb significantly outperforms current automatic verbalizers, especially when training data is extremely scarce. More surprisingly, ProtoVerb consistently boosts prompt-based tuning even on untuned PLMs, indicating an elegant non-tuning way to utilize PLMs. Our codes are available at https://github.com/thunlp/OpenPrompt.

Submitted 18 March, 2022; originally announced March 2022.

Comments: 11 pages. ACL 2022 main conference

arXiv:2202.10017 [pdf, other]

The PCG-AIID System for L3DAS22 Challenge: MIMO and MISO convolutional recurrent Network for Multi Channel Speech Enhancement and Speech Recognition

Authors: Jingdong Li, Yuanyuan Zhu, Dawei Luo, Yun Liu, Guohui Cui, Zhaoxia Li

Abstract: This paper describes the PCG-AIID system for the L3DAS22 challenge in Task 1: 3D speech enhancement in office reverberant environment. We propose a two-stage framework to address multi-channel speech denoising and dereverberation. In the first stage, a multiple input and multiple output (MIMO) network is applied to remove background noise while maintaining the spatial characteristics of multi-channel signals. In the second stage, a multiple input and single output (MISO) network is applied to enhance the speech from the desired direction and perform post-filtering. As a result, our system ranked 3rd in the ICASSP2022 L3DAS22 challenge and significantly outperforms the baseline system, while achieving 3.2% WER and 0.972 STOI on the blind test-set.

Submitted 21 February, 2022; originally announced February 2022.

Comments: To appear at ICASSP 2022 (Accepted)

arXiv:2202.04312 [pdf, other]

Using 5G in Smart Cities: A Systematic Mapping Study

Authors: Chen Yang, Peng Liang, Liming Fu, Guorui Cui, Fei Huang, Feng Teng, Yawar Abbas Bangash

Abstract: 5G is the fifth generation wireless network, with a set of characteristics, e.g., high bandwidth and data rates. The scenarios of using 5G include enhanced Mobile Broadband (eMBB), massive Machine Type Communications (mMTC), and ultra-Reliable and Low-Latency Communications (uRLLC). 5G is expected to support a wide variety of applications. We conducted a systematic mapping study that covers the literature published between Jan 2012 and Dec 2019 regarding using 5G in smart cities. The scenarios, architecture, technologies, challenges, and lessons learned of using 5G in smart cities are summarized and further analyzed based on 32 selected studies, and the results are that: (1) The studies are distributed over 27 publication venues. 17 studies report results based on academic studies and 13 studies use demonstration or toy examples. Only 2 studies report using 5G in smart cities based on industrial studies. 16 studies include assumptions of 5G network design or smart city scenarios. (2) The most discussed smart city scenario is transportation, followed by public safety, healthcare, city tourism, entertainment, and education. (3) 28 studies propose and/or discuss the architecture of 5G-enabled smart cities, containing smart city architecture (treating 5G as a component), 5G network architecture in smart cities, and business architecture of using 5G in smart cities. (4) The most mentioned 5G-related technologies are radio access technologies, network slicing, and edge computing. (5) Challenges are mainly about complex context, challenging requirements, and network development of using 5G in smart cities. (6) Most of the lessons learned identified are benefits regarding 5G itself or the proposed 5G-related methods in smart cities. This work provides a reflection of the past eight years of the state of the art on using 5G in smart cities, which can benefit both researchers and practitioners.

Submitted 15 February, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: Preprint accepted for publication in Intelligent Systems with Applications, 2022

arXiv:2201.12994 [pdf, other]

MGNN: Graph Neural Networks Inspired by Distance Geometry Problem

Authors: Guanyu Cui, Zhewei Wei

Abstract: Graph Neural Networks (GNNs) have emerged as a prominent research topic in the field of machine learning. Existing GNN models are commonly categorized into two types: spectral GNNs, which are designed based on polynomial graph filters, and spatial GNNs, which utilize a message-passing scheme as the foundation of the model. For the expressive power and universality of spectral GNNs, a natural approach is to improve the design of basis functions for better approximation ability. As for spatial GNNs, models like Graph Isomorphism Networks (GIN) analyze their expressive power based on Graph Isomorphism Tests. Recently, there have been attempts to establish connections between spatial GNNs and geometric concepts like curvature and cellular sheaves, as well as physical phenomena like oscillators. However, despite the recent progress, there is still a lack of comprehensive analysis regarding the universality of spatial GNNs from the perspectives of geometry and physics. In this paper, we propose MetricGNN (MGNN), a spatial GNN model inspired by the congruent-insensitivity property of classifiers in the classification phase of GNNs. We demonstrate that a GNN model is universal in the spatial domain if it can generate embedding matrices that are congruent to any given embedding matrix. This property is closely related to the Distance Geometry Problem (DGP). Since DGP is an NP-Hard combinatorial optimization problem, we propose optimizing an energy function derived from spring networks and the Multi-Dimensional Scaling (MDS) problem. This approach also allows our model to handle both homophilic and heterophilic graphs. Finally, we propose employing the iteration method to optimize our energy function. We extensively evaluate the effectiveness of our model through experiments conducted on both synthetic and real-world datasets. Our code is available at: https://github.com/GuanyuCui/MGNN.

Submitted 30 August, 2023; v1 submitted 30 January, 2022; originally announced January 2022.

Comments: Accepted by KDD 2023
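
As an illustration of the energy-minimization idea mentioned in the MGNN abstract above, here is a minimal sketch of the classical raw MDS stress (a spring-like energy) with a plain gradient-descent iteration. This is not the authors' energy function or update rule; the function names `stress` and `descent_step`, the learning rate, and the toy target distances are all hypothetical and chosen only for illustration.

```python
import math


def stress(points, targets):
    """Raw MDS stress: sum over listed pairs of (||z_i - z_j|| - d_ij)^2."""
    total = 0.0
    for (i, j), d in targets.items():
        total += (math.dist(points[i], points[j]) - d) ** 2
    return total


def descent_step(points, targets, lr=0.05):
    """One gradient-descent step on the raw stress (hypothetical step size)."""
    grads = [[0.0] * len(p) for p in points]
    for (i, j), d in targets.items():
        dist = math.dist(points[i], points[j])
        if dist == 0.0:
            continue  # gradient is undefined for coincident points; skip the pair
        coeff = 2.0 * (dist - d) / dist
        for k in range(len(points[i])):
            g = coeff * (points[i][k] - points[j][k])
            grads[i][k] += g
            grads[j][k] -= g
    return [[x - lr * g for x, g in zip(p, gr)] for p, gr in zip(points, grads)]


# Toy example: three 2-D points that should end up pairwise 2.0 apart.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
targets = {(0, 1): 2.0, (1, 2): 2.0, (0, 2): 2.0}
initial = stress(points, targets)
for _ in range(200):
    points = descent_step(points, targets)
final = stress(points, targets)
```

Each pair acts like a spring whose rest length is the target distance, so repeated steps pull the embedding toward a configuration that matches the requested distances.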

arXiv:2201.03199 [pdf, other]

Task planning and explanation with virtual actions

Authors: Guowei Cui, ** Chen

Abstract: One of the challenges of task planning is to find out what causes the planning failure and how to handle the failure intelligently. This paper shows how to achieve this. The idea is inspired by the connected graph: each vertex represents a set of compatible \textit{states}, and each edge represents an \textit{action}. For any given initial states and goals, we construct virtual actions to ensure that we always get a plan via task planning. This paper shows how to introduce virtual actions to extend action models so that the graph becomes connected: i) explicitly define static predicates (type, permanent properties, etc.) or dynamic predicates (state); ii) construct a full virtual action or a semi-virtual action for each state; iii) find the cause of the planning failure through a progressive planning approach. The implementation was evaluated in three typical scenarios.

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2106.08171 [pdf, other]

Evaluating Modules in Graph Contrastive Learning

Authors: Ganqu Cui, Yufeng Du, Cheng Yang, Jie Zhou, Liang Xu, Xing Zhou, Xingyi Cheng, Zhiyuan Liu

Abstract: The recent emergence of contrastive learning approaches facilitates the application on graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works only performed \textbf{model-level} evaluation, and did not explore the combination space of modules for more comprehensive and systematic studies. For effective \textbf{module-level} evaluation, we propose a framework that decomposes GCL models into four modules: (1) a \textbf{sampler} to generate anchor, positive and negative data samples (nodes or graphs); (2) an \textbf{encoder} and a \textbf{readout} function to get sample embeddings; (3) a \textbf{discriminator} to score each sample pair (anchor-positive and anchor-negative); and (4) an \textbf{estimator} to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we manage to quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout could achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension. OpenGCL is available at \url{https://github.com/thunlp/OpenGCL}.

Submitted 2 June, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2011.00621 [pdf, other]

Semantic Task Planning for Service Robots in Open World

Authors: Guowei Cui, Wei Shuai, ** Chen

Abstract: In this paper, we present a planning system based on semantic reasoning for a general-purpose service robot, which is aimed at behaving more intelligently in domains that contain incomplete information, under-specified goals, and dynamic changes. First, two kinds of data are generated by a Natural Language Processing module from the speech: (i) action frames and their relationships; (ii) the modifier used to indicate some property or characteristic of a variable in the action frame. Next, the goals of the task are generated from these action frames and modifiers. These goals are represented as AI symbols, combining world state and domain knowledge, which are used to generate plans by an Answer Set Programming solver. Finally, the actions of the plan are executed one by one, and continuous sensing grounds useful information, which enables the robot to use contingent knowledge to adapt to dynamic changes and faults. For each action in the plan, the planner gets its preconditions and effects from domain knowledge, so during the execution of the task, environmental changes, especially those that conflict with the actions (not only the action being performed but also the subsequent actions), can be detected and handled as early as possible. A series of case studies are used to evaluate the system and verify its ability to acquire knowledge through dialogue with users, solve problems with the acquired causal knowledge, and plan for complex tasks autonomously in the open world.

Submitted 1 November, 2020; originally announced November 2020.

arXiv:2010.13297 [pdf, other]

Discriminatively Constrained Semi-supervised Multi-view Nonnegative Matrix Factorization with Graph Regularization

Authors: Guosheng Cui, Ruxin Wang, Dan Wu, Ye Li

Abstract: In recent years, semi-supervised multi-view nonnegative matrix factorization (MVNMF) algorithms have achieved promising performances for multi-view clustering. However, most semi-supervised MVNMFs have failed to effectively consider discriminative information among clusters and feature alignment from multiple views simultaneously. In this paper, a novel Discriminatively Constrained Semi-Supervised Multi-View Nonnegative Matrix Factorization (DCS^2MVNMF) is proposed. Specifically, a discriminative weighting matrix is introduced for the auxiliary matrix of each view, which enhances the inter-class distinction. Meanwhile, a new graph regularization is constructed with the label and geometrical information. In addition, we design a new feature scale normalization strategy to align the multiple views and complete the corresponding iterative optimization schemes. Extensive experiments conducted on several real world multi-view datasets have demonstrated the effectiveness of the proposed method.

Submitted 25 October, 2020; originally announced October 2020.

arXiv:2007.08547 [pdf, other]

Talking-head Generation with Rhythmic Head Motion

Authors: Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, Chenliang Xu

Abstract: When people deliver a speech, they naturally move heads, and this rhythmic head motion conveys prosodic information. However, generating a lip-synced video while moving head naturally is challenging. While remarkably successful, existing works either generate still talkingface videos or rely on landmark/video frames as sparse/dense mapping guidance to generate head movements, which leads to unrealistic or uncontrollable video synthesis. To overcome the limitations, we propose a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module. Through modeling the head motion and facial expressions explicitly, manipulating 3D animation carefully, and embedding reference images dynamically, our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements. Thoughtful experiments on several standard benchmarks demonstrate that our method achieves significantly better results than the state-of-the-art methods in both quantitative and qualitative comparisons. The code is available on https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion.

Submitted 16 July, 2020; originally announced July 2020.

arXiv:2007.01594 [pdf, other]

doi 10.1145/3394486.3403140

Adaptive Graph Encoder for Attributed Graph Embedding

Authors: Ganqu Cui, Jie Zhou, Cheng Yang, Zhiyuan Liu

Abstract: Attributed graph embedding, which learns vector representations from graph topology and node features, is a challenging task for graph analysis. Recently, methods based on graph convolutional networks (GCNs) have made great progress on this task. However, existing GCN-based methods have three major drawbacks. Firstly, our experiments indicate that the entanglement of graph convolutional filters and weight matrices will harm both the performance and robustness. Secondly, we show that graph convolutional filters in these methods reveal to be special cases of generalized Laplacian smoothing filters, but they do not preserve optimal low-pass characteristics. Finally, the training objectives of existing algorithms are usually recovering the adjacency matrix or feature matrix, which are not always consistent with real-world applications. To address these issues, we propose Adaptive Graph Encoder (AGE), a novel attributed graph embedding framework. AGE consists of two modules: (1) To better alleviate the high-frequency noises in the node features, AGE first applies a carefully-designed Laplacian smoothing filter. (2) AGE employs an adaptive encoder that iteratively strengthens the filtered features for better node embeddings. We conduct experiments using four public benchmark datasets to validate AGE on node clustering and link prediction tasks. Experimental results show that AGE consistently outperforms state-of-the-art graph embedding methods considerably on these tasks.

Submitted 3 July, 2020; originally announced July 2020.

Comments: To appear in KDD 2020
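
To make the notion of a Laplacian smoothing filter in the AGE abstract above more concrete, the following is a minimal sketch of a generic filter that repeatedly applies X <- (I - kL)X to node features. It assumes a symmetric normalized Laplacian with added self-loops, a step size `k = 0.5`, and two filter steps; these choices and all function names are illustrative assumptions, not the filter design used in the paper.

```python
import math


def normalized_laplacian(adj):
    """L = I - D^{-1/2} (A + I) D^{-1/2}; the self-loops are an assumption."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[(1.0 if i == j else 0.0) - a[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]


def smooth(adj, feats, k=0.5, steps=2):
    """Apply the low-pass filter (I - k L) to the feature matrix `steps` times."""
    lap = normalized_laplacian(adj)
    n, dim = len(adj), len(feats[0])
    x = [list(row) for row in feats]
    for _ in range(steps):
        lx = [[sum(lap[i][m] * x[m][c] for m in range(n)) for c in range(dim)]
              for i in range(n)]
        x = [[x[i][c] - k * lx[i][c] for c in range(dim)] for i in range(n)]
    return x


def dirichlet_energy(adj, feats):
    """Sum over feature columns of x^T L x; lower means smoother on the graph."""
    lap = normalized_laplacian(adj)
    n, dim = len(adj), len(feats[0])
    return sum(feats[i][c] * lap[i][j] * feats[j][c]
               for c in range(dim) for i in range(n) for j in range(n))


# Toy path graph 0 - 1 - 2 with a noisy, alternating one-dimensional feature.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0], [-1.0], [1.0]]
smoothed = smooth(adj, feats)
```

Comparing `dirichlet_energy(adj, feats)` with `dirichlet_energy(adj, smoothed)` shows the effect of the filter: high-frequency variation between neighboring nodes is damped while the feature matrix keeps its shape.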

arXiv:2005.03201 [pdf, other]

What comprises a good talking-head video generation?: A Survey and Benchmark

Authors: Lele Chen, Guofeng Cui, Ziyi Kou, Haitian Zheng, Chenliang Xu

Abstract: Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations on this topic present many limitations. For example, most approaches use human subjects (e.g., via Amazon MTurk) to evaluate their research claims directly. This subjective evaluation is cumbersome, unreproducible, and may impede the evolution of new research. In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. As for evaluation, we either propose new metrics or select the most appropriate ones to evaluate results in what we consider as desired properties for a good talking-head video, namely, identity preserving, lip synchronization, high video quality, and natural-spontaneous motion. By conducting a thoughtful analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: https://github.com/lelechen63/talking-head-generation-survey.

Submitted 6 May, 2020; originally announced May 2020.

arXiv:1911.07160 [pdf, other]

Improve CAM with Auto-adapted Segmentation and Co-supervised Augmentation

Authors: Ziyi Kou, Guofeng Cui, Shaojie Wang, Wentian Zhao, Chenliang Xu

Abstract: Weakly Supervised Object Localization (WSOL) methods generate both classification and localization results by learning from only image category labels. Previous methods usually utilize class activation map (CAM) to obtain target object regions. However, most of them only focus on improving foreground object parts in CAM, but ignore the important effect of its background contents. In this paper, we propose a confidence segmentation (ConfSeg) module that builds confidence score for each pixel in CAM without introducing additional hyper-parameters. The generated sample-specific confidence mask is able to indicate the extent of determination for each pixel in CAM, and further supervises additional CAM extended from internal feature maps. Besides, we introduce Co-supervised Augmentation (CoAug) module to capture feature-level representation for foreground and background parts in CAM separately. Then a metric loss is applied at batch sample level to augment distinguish ability of our model, which helps a lot to localize more related object parts. Our final model, CSoA, combines the two modules and achieves superior performance, e.g. $37.69\%$ and $48.81\%$ Top-1 localization error on CUB-200 and ILSVRC datasets, respectively, which outperforms all previous methods and becomes the new state-of-the-art.

Submitted 13 January, 2021; v1 submitted 17 November, 2019; originally announced November 2019.

Comments: Accepted by WACV2021. Equal contribution for the first two authors

arXiv:1909.03619 [pdf, other]

Weakly Supervised Localization Using Background Images

Authors: Ziyi Kou, Wentian Zhao, Guofeng Cui, Shaojie Wang

Abstract: Weakly Supervised Object Localization (WSOL) methods usually rely on fully convolutional networks in order to obtain class activation maps (CAMs) of targeted labels. However, these networks always highlight the most discriminative parts to perform the task, so the located areas are much smaller than entire targeted objects. In this work, we propose a novel end-to-end model to enlarge CAMs generated from classification models, which can localize targeted objects more precisely. In detail, we add an additional module in traditional classification networks to extract foreground object proposals from images without classifying them into specific categories. Then we set these normalized regions as unrestricted pixel-level mask supervision for the following classification task. We collect a set of images defined as Background Image Set from the Internet. The number of them is much smaller than the targeted dataset but surprisingly well supports the method to extract foreground regions from different pictures. The region extracted is independent from classification task, where the extracted region in each image covers almost entire object rather than just a significant part. Therefore, these regions can serve as masks to supervise the response map generated from classification models to become larger and more precise. The method achieves state-of-the-art results on CUB-200-2011 in terms of Top-1 and Top-5 localization error while has a competitive result on ILSVRC2016 compared with other approaches.

Submitted 10 September, 2019; v1 submitted 8 September, 2019; originally announced September 2019.

Comments: Course project of CSC577, University of Rochester

arXiv:1907.11580 [pdf, other]

Edge User Allocation with Dynamic Quality of Service

Authors: Phu Lai, Qiang He, Guangming Cui, Xiaoyu Xia, Mohamed Abdelrazek, Feifei Chen, John Hosking, John Grundy, Yun Yang

Abstract: In edge computing, edge servers are placed in close proximity to end-users. App vendors can deploy their services on edge servers to reduce network latency experienced by their app users. The edge user allocation (EUA) problem challenges service providers with the objective to maximize the number of allocated app users with hired computing resources on edge servers while ensuring their fixed quality of service (QoS), e.g., the amount of computing resources allocated to an app user. In this paper, we take a step forward to consider dynamic QoS levels for app users, which generalizes but further complicates the EUA problem, turning it into a dynamic QoS EUA problem. This enables flexible levels of quality of experience (QoE) for app users. We propose an optimal approach for finding a solution that maximizes app users' overall QoE. We also propose a heuristic approach for quickly finding sub-optimal solutions to large-scale instances of the dynamic QoS EUA problem. Experiments are conducted on a real-world dataset to demonstrate the effectiveness and efficiency of our approaches against a baseline approach and the state of the art.

Submitted 26 July, 2019; originally announced July 2019.

Comments: This manuscript has been accepted for publication at the 17th International Conference on Service-Oriented Computing and may be published in the book series Lecture Notes in Computer Science. All copyrights reserved to Springer Nature Switzerland AG, Gewerbestrasse 11, 6330 Cham, Switzerland

arXiv:1812.08434 [pdf]

Graph Neural Networks: A Review of Methods and Applications

Authors: Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun

Abstract: Lots of learning tasks require dealing with graph data which contains rich relation information among elements. Modeling physics systems, learning molecular fingerprints, predicting protein interface, and classifying diseases demand a model to learn from graph inputs. In other domains such as learning from non-structural data like texts and images, reasoning on extracted structures (like the dependency trees of sentences and the scene graphs of images) is an important research topic which also needs graph reasoning models. Graph neural networks (GNNs) are neural models that capture the dependence of graphs via message passing between the nodes of graphs. In recent years, variants of GNNs such as graph convolutional network (GCN), graph attention network (GAT), graph recurrent network (GRN) have demonstrated ground-breaking performances on many deep learning tasks. In this survey, we propose a general design pipeline for GNN models and discuss the variants of each component, systematically categorize the applications, and propose four open problems for future research.

Submitted 6 October, 2021; v1 submitted 20 December, 2018; originally announced December 2018.

Comments: Published at AI Open 2021

arXiv:1804.08066 [pdf, other]

MQGrad: Reinforcement Learning of Gradient Quantization in Parameter Server

Authors: Guoxin Cui, Jun Xu, Wei Zeng, Yanyan Lan, Jiafeng Guo, Xueqi Cheng

Abstract: One of the most significant bottlenecks in training large scale machine learning models on parameter server (PS) is the communication overhead, because it needs to frequently exchange the model gradients between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. One key issue in gradient quantization is setting the number of bits for quantizing the gradients. A small number of bits can significantly reduce the communication overhead but hurts the gradient accuracy, and vice versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy by adjusting the number of bits according to the knowledge learned from the immediate past training iterations. Existing methods, however, quantize the gradients either with a fixed number of bits or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP) where the MDP state records the information collected from the past optimization iterations (e.g., the sequence of the loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to the changes of the loss function. Based on this information, the MDP learns to select the optimal actions (number of bits) to quantize the gradients. Experimental results based on a benchmark dataset showed that MQGrad can accelerate the learning of a large scale deep neural network while keeping its prediction accuracies.

Submitted 22 April, 2018; originally announced April 2018.

Comments: 7 pages, 5 figures

arXiv:1803.02256 [pdf, other]

Depth Information Guided Crowd Counting for Complex Crowd Scenes

Authors: Mingliang Xu, Zhaoyang Ge, Xiaoheng Jiang, Gaoge Cui, Pei Lv, Bing Zhou, Changsheng Xu

Abstract: It is important to monitor and analyze crowd events for the sake of city safety. In an EDOF (extended depth of field) image with a crowded scene, the distribution of people is highly imbalanced. People far away from the camera look much smaller and often occlude each other heavily, while people close to the camera look larger. In such a case, it is difficult to accurately estimate the number of people by using one technique. In this paper, we propose a Depth Information Guided Crowd Counting (DigCrowd) method to deal with crowded EDOF scenes. DigCrowd first uses the depth information of an image to segment the scene into a far-view region and a near-view region. Then DigCrowd maps the far-view region to its crowd density map and uses a detection method to count the people in the near-view region. In addition, we introduce a new crowd dataset that contains 1000 images. Experimental results demonstrate the effectiveness of our DigCrowd method.

Submitted 23 April, 2018; v1 submitted 3 March, 2018; originally announced March 2018.

Comments: 9 pages, 8 figures. The paper is under consideration at Pattern Recognition Letters

arXiv:1701.05700 [pdf, ps, other]

Antenna Deployment Method for MIMO Radar under the Situation of Multiple Interference Regions

Authors: Tianxian Zhang, Jiadong Liang, Yichuan Yang, Guolong Cui, Lingjiang Kong, Xiaobo Yang

Abstract: In this paper, considering multiple interference regions simultaneously, an optimal antenna deployment problem for distributed Multi-Input Multi-Output (MIMO) radar is investigated. The optimal antenna deployment problem is solved by proposing an antenna deployment method based on Multi-Objective Particle Swarm Optimization (MOPSO). Firstly, we construct a multi-objective optimization problem for MIMO radar antenna deployment by choosing the interference power densities of different regions as objective functions. Then, to obtain the optimal deployment result without wasting time and computational resources, an iteration convergence criterion based on interval distance is proposed. The iteration convergence criterion can be used to stop the MOPSO optimization process efficiently when the optimal antenna deployment algorithm reaches the desired convergence level. Finally, numerical results are provided to verify the validity of the proposed algorithm.

Submitted 20 January, 2017; originally announced January 2017.

Comments: 12 pages

Showing 1–49 of 49 results for author: Cui, G