-
EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction
Authors:
**gheng Ye,
Shang Qin,
Yinghui Li,
Xuxin Cheng,
Libo Qin,
Hai-Tao Zheng,
Peng Xing,
Zishan Xu,
Guo Cheng,
Zhao Wei
Abstract:
Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations. To bridge the gap, this paper introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for…
▽ More
Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations. To bridge the gap, this paper introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We benchmark several series of LLMs in multiple settings, covering post-explaining and pre-explaining. To promote the development of the task, we introduce a comprehensive suite of automatic metrics and conduct human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. All the codes and data will be released after the review.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement
Authors:
Yunlong Feng,
Yang Xu,
Dechuan Teng,
Honglin Mu,
Xiao Xu,
Libo Qin,
Wanxiang Che,
Qingfu Zhu
Abstract:
Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Con…
▽ More
Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc$^2$dec) method recompiles the LLM's decompilation results to construct pairs for in-context learning, hel** the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 7.35\% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 55.03\%.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
Authors:
Zhiyu Tan,
Xiaomeng Yang,
Luozheng Qin,
Meng** Yang,
Cheng Zhang,
Hao Li
Abstract:
The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our ap…
▽ More
The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
△ Less
Submitted 26 June, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Efficient Antagonistic k-plex Enumeration in Signed Graphs
Authors:
Lantian Xu,
Rong-Hua Li,
Dong Wen,
Qiangqiang Dai,
Guoren Wang,
Lu Qin
Abstract:
A signed graph is a graph where each edge receives a sign, positive or negative. The signed graph model has been used in many real applications, such as protein complex discovery and social network analysis. Finding cohesive subgraphs in signed graphs is a fundamental problem. A k-plex is a common model for cohesive subgraphs in which every vertex is adjacent to all but at most k vertices within t…
▽ More
A signed graph is a graph where each edge receives a sign, positive or negative. The signed graph model has been used in many real applications, such as protein complex discovery and social network analysis. Finding cohesive subgraphs in signed graphs is a fundamental problem. A k-plex is a common model for cohesive subgraphs in which every vertex is adjacent to all but at most k vertices within the subgraph. In this paper, we propose the model of size-constrained antagonistic k-plex in a signed graph. The proposed model guarantees that the resulting subgraph is a k-plex and can be divided into two sub-k-plexes, both of which have positive inner edges and negative outer edges. This paper aims to identify all maximal antagonistic k-plexes in a signed graph. Through rigorous analysis, we show that the problem is NP-Hardness. We propose a novel framework for maximal antagonistic k-plexes utilizing set enumeration. Efficiency is improved through pivot pruning and early termination based on the color bound. Preprocessing techniques based on degree and dichromatic graphs effectively narrow the search space before enumeration. Extensive experiments on real-world datasets demonstrate our algorithm's efficiency, effectiveness, and scalability.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought
Authors:
Yongheng Zhang,
Qiguang Chen,
Min Li,
Wanxiang Che,
Libo Qin
Abstract:
Cross-lingual chain-of-thought can effectively complete reasoning tasks across languages, which gains increasing attention. Recently, dominant approaches in the literature improve cross-lingual alignment capabilities by integrating reasoning knowledge from different languages. Despite achieving excellent performance, current methods still have two main challenges: (1) Manual language specification…
▽ More
Cross-lingual chain-of-thought can effectively complete reasoning tasks across languages, which gains increasing attention. Recently, dominant approaches in the literature improve cross-lingual alignment capabilities by integrating reasoning knowledge from different languages. Despite achieving excellent performance, current methods still have two main challenges: (1) Manual language specification: They still highly rely on manually selecting the languages to integrate, severely affecting their generalizability; (2) Static weight allocation: Current methods simply integrate all languages equally. In fact, different language reasoning paths should have different weights to achieve better complementation and integration. Motivated by this, we introduce an Automatic Cross-lingual Alignment Planning (AutoCAP) for zero-shot chain-of-thought to address the above challenges. The core of AutoCAP consists of two components: (1) Automatic Language Selection Prompting to guide LLMs to select appropriate languages and (2) Automatic Weight Allocation Prompting to automatically allocate alignment weight scores to each reasoning path. Extensive experiments on several benchmarks reveal that AutoCAP achieves state-of-the-art performance, surpassing previous methods that required manual effort.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Authors:
Chengqian Ma,
Zhanxiang Hua,
Alexandra Anderson-Frey,
Vikram Iyer,
Xin Liu,
Lianhui Qin
Abstract:
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather ben…
▽ More
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models, including GPT4, Claude3.5, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.
△ Less
Submitted 23 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Knowledge Distillation in Federated Learning: a Survey on Long Lasting Challenges and New Solutions
Authors:
Laiqiao Qin,
Tianqing Zhu,
Wanlei Zhou,
Philip S. Yu
Abstract:
Federated Learning (FL) is a distributed and privacy-preserving machine learning paradigm that coordinates multiple clients to train a model while kee** the raw data localized. However, this traditional FL poses some challenges, including privacy risks, data heterogeneity, communication bottlenecks, and system heterogeneity issues. To tackle these challenges, knowledge distillation (KD) has been…
▽ More
Federated Learning (FL) is a distributed and privacy-preserving machine learning paradigm that coordinates multiple clients to train a model while kee** the raw data localized. However, this traditional FL poses some challenges, including privacy risks, data heterogeneity, communication bottlenecks, and system heterogeneity issues. To tackle these challenges, knowledge distillation (KD) has been widely applied in FL since 2020. KD is a validated and efficacious model compression and enhancement algorithm. The core concept of KD involves facilitating knowledge transfer between models by exchanging logits at intermediate or output layers. These properties make KD an excellent solution for the long-lasting challenges in FL. Up to now, there have been few reviews that summarize and analyze the current trend and methods for how KD can be applied in FL efficiently. This article aims to provide a comprehensive survey of KD-based FL, focusing on addressing the above challenges. First, we provide an overview of KD-based FL, including its motivation, basics, taxonomy, and a comparison with traditional FL and where KD should execute. We also analyze the critical factors in KD-based FL in the appendix, including teachers, knowledge, data, and methods. We discuss how KD can address the challenges in FL, including privacy protection, data heterogeneity, communication efficiency, and personalization. Finally, we discuss the challenges facing KD-based FL algorithms and future research directions. We hope this survey can provide insights and guidance for researchers and practitioners in the FL area.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
CroPrompt: Cross-task Interactive Prompting for Zero-shot Spoken Language Understanding
Authors:
Libo Qin,
Fuxuan Wei,
Qiguang Chen,
**gxuan Zhou,
Shijue Huang,
Jiasheng Si,
Wenpeng Lu,
Wanxiang Che
Abstract:
Slot filling and intent detection are two highly correlated tasks in spoken language understanding (SLU). Recent SLU research attempts to explore zero-shot prompting techniques in large language models to alleviate the data scarcity problem. Nevertheless, the existing prompting work ignores the cross-task interaction information for SLU, which leads to sub-optimal performance. To solve this proble…
▽ More
Slot filling and intent detection are two highly correlated tasks in spoken language understanding (SLU). Recent SLU research attempts to explore zero-shot prompting techniques in large language models to alleviate the data scarcity problem. Nevertheless, the existing prompting work ignores the cross-task interaction information for SLU, which leads to sub-optimal performance. To solve this problem, we present the pioneering work of Cross-task Interactive Prompting (CroPrompt) for SLU, which enables the model to interactively leverage the information exchange across the correlated tasks in SLU. Additionally, we further introduce a multi-task self-consistency mechanism to mitigate the error propagation caused by the intent information injection. We conduct extensive experiments on the standard SLU benchmark and the results reveal that CroPrompt consistently outperforms the existing prompting approaches. In addition, the multi-task self-consistency mechanism can effectively ease the error propagation issue, thereby enhancing the performance. We hope this work can inspire more research on cross-task prompting for SLU.
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking
Authors:
Fangxu Yu,
Lai Jiang,
Haoqiang Kang,
Shibo Hao,
Lianhui Qin
Abstract:
Divergent thinking, the cognitive process of generating diverse solutions, is a hallmark of human creativity and problem-solving. For machines, sampling diverse solution trajectories in complex reasoning problems is crucial for robust outcomes, data augmentation, and enhanced model generalization. Large language models (LLMs) often struggle with generating high-quality, diverse reasoning. While su…
▽ More
Divergent thinking, the cognitive process of generating diverse solutions, is a hallmark of human creativity and problem-solving. For machines, sampling diverse solution trajectories in complex reasoning problems is crucial for robust outcomes, data augmentation, and enhanced model generalization. Large language models (LLMs) often struggle with generating high-quality, diverse reasoning. While supervised fine-tuning helps with quality, it requires extensive supervision data to capture the full diversity of solutions. Alternatively, reinforcement learning methods like PPO aim to find limited highest-reward solutions while neglecting the solution diversity, akin to convergent thinking. To address these limitations, we propose Flow of Reasoning (FoR) -- an efficient LLM training approach enabling diverse reasoning with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow from an initial state to terminal states. The formulation allows to adapt principled GFlowNet approaches to train the LLM as a policy, which is able to sample multiple reasoning paths with probabilities proportional to the unnormalized reward. Empirical results show that, with limited training data (e.g., 15 examples), FoR can discover diverse high-quality solutions that excel greatly beyond current state-of-the-art methods across three tasks, including embodied reasoning (BlocksWorld), math puzzle solving (Game24), and logical reasoning (PrOntoQA). Code is available at https://github.com/Yu-Fangxu/FoR.
△ Less
Submitted 24 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
EduNLP: Towards a Unified and Modularized Library for Educational Resources
Authors:
Zhenya Huang,
Yuting Ning,
Longhu Qin,
Shiwei Tong,
Shangzi Xue,
Tong Xiao,
Xin Lin,
Jiayu Liu,
Qi Liu,
Enhong Chen,
Shi**g Wang
Abstract:
Educational resource understanding is vital to online learning platforms, which have demonstrated growing applications recently. However, researchers and developers always struggle with using existing general natural language toolkits or domain-specific models. The issue raises a need to develop an effective and easy-to-use one that benefits AI education-related research and applications. To bridg…
▽ More
Educational resource understanding is vital to online learning platforms, which have demonstrated growing applications recently. However, researchers and developers always struggle with using existing general natural language toolkits or domain-specific models. The issue raises a need to develop an effective and easy-to-use one that benefits AI education-related research and applications. To bridge this gap, we present a unified, modularized, and extensive library, EduNLP, focusing on educational resource understanding. In the library, we decouple the whole workflow to four key modules with consistent interfaces including data configuration, processing, model implementation, and model evaluation. We also provide a configurable pipeline to unify the data usage and model usage in standard ways, where users can customize their own needs. For the current version, we primarily provide 10 typical models from four categories, and 5 common downstream-evaluation tasks in the education domain on 8 subjects for users' usage. The project is released at: https://github.com/bigdata-ustc/EduNLP.
△ Less
Submitted 4 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Guiding ChatGPT to Generate Salient Domain Summaries
Authors:
Jun Gao,
Ziqiang Cao,
Shaoyao Huang,
Luozheng Qin,
Chunhui Ai
Abstract:
ChatGPT is instruct-tuned to generate general and human-expected content to align with human preference through Reinforcement Learning from Human Feedback (RLHF), meanwhile resulting in generated responses not salient enough. Therefore, in this case, ChatGPT may fail to satisfy domain requirements in zero-shot settings, leading to poor ROUGE scores. Inspired by the In-Context Learning (ICL) and re…
▽ More
ChatGPT is instruct-tuned to generate general and human-expected content to align with human preference through Reinforcement Learning from Human Feedback (RLHF), meanwhile resulting in generated responses not salient enough. Therefore, in this case, ChatGPT may fail to satisfy domain requirements in zero-shot settings, leading to poor ROUGE scores. Inspired by the In-Context Learning (ICL) and retelling ability of ChatGPT, this paper proposes PADS, a \textbf{P}ipeline for \textbf{A}ssisting ChatGPT in \textbf{D}omain \textbf{S}ummarization. PADS consists of a retriever to retrieve similar examples from corpora and a rank model to rerank the multiple candidate summaries generated by ChatGPT. Specifically, given an inference document, we first retrieve an in-context demonstration via the retriever. Then, we require ChatGPT to generate $k$ candidate summaries for the inference document at a time under the guidance of the retrieved demonstration. Finally, the rank model independently scores the $k$ candidate summaries according to their quality and selects the optimal one. We extensively explore dense and sparse retrieval methods to select effective demonstrations for reference and efficiently train the rank model to reflect the quality of candidate summaries for each given summarized document. Additionally, PADS contains merely 400M trainable parameters originating from the rank model and we merely collect 2.5k data to train it. We evaluate PADS on five datasets from different domains, and the result indicates that each module in PADS is committed to effectively guiding ChatGPT to generate salient summaries fitting different domain requirements. Specifically, in the popular summarization dataset Gigaword, PADS achieves over +8 gain on ROUGE-L, compared with the naive ChatGPT in the zero-shot setting. \footnote{Our code are available at \url{https://github.com/jungao1106/PADS}}
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
Authors:
Qiguang Chen,
Libo Qin,
** Zhang,
Zhi Chen,
Xiao Xu,
Wanxiang Che
Abstract:
Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, which gains increasing attention. Nevertheless, the current MCoT benchmark still faces some challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) Domain missing, thereby hindering the development of MCoT. Motivate…
▽ More
Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, which gains increasing attention. Nevertheless, the current MCoT benchmark still faces some challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) Domain missing, thereby hindering the development of MCoT. Motivated by this, we introduce a novel benchmark (M$^3$CoT) to address the above challenges, advancing the multi-domain, multi-step, and multi-modal CoT. Additionally, we conduct a thorough evaluation involving abundant MCoT approaches on Vision Large Language Models (VLLMs). In addition, we highlight that the current VLLMs still struggle to correctly reason in M$^3$CoT and there remains a large gap between existing VLLMs and human performance in M$^3$CoT, despite their superior results on previous MCoT benchmarks. To our knowledge, we take the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M$^3$CoT can serve as a valuable resource, providing a pioneering foundation in multi-domain, multi-step, multi-modal chain-of-thought research.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Performance Optimization in RSMA-assisted Uplink xURLLC IIoT Networks with Statistical QoS Provisioning
Authors:
Yuang Chen,
Hancheng Lu,
Chang Wu,
Langtian Qin,
Xiaobo Guo
Abstract:
Industry 5.0 and beyond networks have driven the emergence of numerous mission-critical applications, prompting contemplation of the neXt-generation ultra-reliable low-latency communication (xURLLC). To guarantee low-latency requirements, xURLLC heavily relies on short-blocklength packets with sporadic arrival traffic. As a disruptive multi-access technique, rate-splitting multiple access (RSMA) h…
▽ More
Industry 5.0 and beyond networks have driven the emergence of numerous mission-critical applications, prompting contemplation of the neXt-generation ultra-reliable low-latency communication (xURLLC). To guarantee low-latency requirements, xURLLC heavily relies on short-blocklength packets with sporadic arrival traffic. As a disruptive multi-access technique, rate-splitting multiple access (RSMA) has emerged as a promising avenue to enhance quality of service (QoS) and flexibly manage interference for next-generation communication networks. In this paper, we investigate an innovative RSMA-assisted uplink xURLLC industrial internet-of-things (IIoT) (RSMA-xURLLC-IIoT) network. To unveil reliable insights into the statistical QoS provisioning (SQP) for our proposed network with sporadic arrival traffic, we leverage stochastic network calculus (SNC) to develop a dependable theoretical framework. Building upon this theoretical framework, we formulate the SQP-driven short-packet size maximization problem and the SQP-driven transmit power minimization problem, aiming to guarantee the SQP performance to latency, decoding, and reliability while maximizing the short-packet size and minimizing the transmit power, respectively. By exploiting Monte-Carlo methods, we have thoroughly validated the dependability of the developed theoretical framework. Moreover, through extensive comparison analysis with state-of-the-art multi-access techniques, including non-orthogonal multiple access (NOMA) and orthogonal multiple access (OMA), we have demonstrated the superior performance gains achieved by the proposed RSMA-xURLLC-IIoT networks.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Endowing Interpretability for Neural Cognitive Diagnosis by Efficient Kolmogorov-Arnold Networks
Authors:
Shangshang Yang,
Linrui Qin,
Xiaoshan Yu
Abstract:
In the realm of intelligent education, cognitive diagnosis plays a crucial role in subsequent recommendation tasks attributed to the revealed students' proficiency in knowledge concepts. Although neural network-based neural cognitive diagnosis models (CDMs) have exhibited significantly better performance than traditional models, neural cognitive diagnosis is criticized for the poor model interpret…
▽ More
In the realm of intelligent education, cognitive diagnosis plays a crucial role in subsequent recommendation tasks attributed to the revealed students' proficiency in knowledge concepts. Although neural network-based neural cognitive diagnosis models (CDMs) have exhibited significantly better performance than traditional models, neural cognitive diagnosis is criticized for the poor model interpretability due to the multi-layer perception (MLP) employed, even with the monotonicity assumption. Therefore, this paper proposes to empower the interpretability of neural cognitive diagnosis models through efficient kolmogorov-arnold networks (KANs), named KAN2CD, where KANs are designed to enhance interpretability in two manners. Specifically, in the first manner, KANs are directly used to replace the used MLPs in existing neural CDMs; while in the second manner, the student embedding, exercise embedding, and concept embedding are directly processed by several KANs, and then their outputs are further combined and learned in a unified KAN to get final predictions. To overcome the problem of training KANs slowly, we modify the implementation of original KANs to accelerate the training. Experiments on four real-world datasets show that the proposed KA2NCD exhibits better performance than traditional CDMs, and the proposed KA2NCD still has a bit of performance leading even over the existing neural CDMs. More importantly, the learned structures of KANs enable the proposed KA2NCD to hold as good interpretability as traditional CDMs, which is superior to existing neural CDMs. Besides, the training cost of the proposed KA2NCD is competitive to existing models.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
Authors:
Zhiyu Tan,
Meng** Yang,
Luozheng Qin,
Hao Yang,
Ye Qian,
Qiang Zhou,
Cheng Zhang,
Hao Li
Abstract:
One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Langu…
▽ More
One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation. Unfortunately, training text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual but also longer input context with superior image generation quality.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Efficient Influence Minimization via Node Blocking
Authors:
**ghao Wang,
Yan** Wu,
Xiaoyang Wang,
Ying Zhang,
Lu Qin,
Wenjie Zhang,
Xuemin Lin
Abstract:
Given a graph G, a budget k and a misinformation seed set S, Influence Minimization (IMIN) via node blocking aims to find a set of k nodes to be blocked such that the expected spread of S is minimized. This problem finds important applications in suppressing the spread of misinformation and has been extensively studied in the literature. However, existing solutions for IMIN still incur significant…
▽ More
Given a graph G, a budget k and a misinformation seed set S, Influence Minimization (IMIN) via node blocking aims to find a set of k nodes to be blocked such that the expected spread of S is minimized. This problem finds important applications in suppressing the spread of misinformation and has been extensively studied in the literature. However, existing solutions for IMIN still incur significant computation overhead, especially when k becomes large. In addition, there is still no approximation solution with non-trivial theoretical guarantee for IMIN via node blocking prior to our work. In this paper, we conduct the first attempt to propose algorithms that yield data-dependent approximation guarantees. Based on the Sandwich framework, we first develop submodular and monotonic lower and upper bounds for our non-submodular objective function and prove the computation of proposed bounds is \#P-hard. In addition, two advanced sampling methods are proposed to estimate the value of bounding functions. Moreover, we develop two novel martingale-based concentration bounds to reduce the sample complexity and design two non-trivial algorithms that provide (1-1/e-ε)-approximate solutions to our bounding functions. Comprehensive experiments on 9 real-world datasets are conducted to validate the efficiency and effectiveness of the proposed techniques. Compared with the state-of-the-art methods, our solutions can achieve up to two orders of magnitude speedup and provide theoretical guarantees for the quality of returned results.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Large Language Models Meet NLP: A Survey
Authors:
Libo Qin,
Qiguang Chen,
Xiachong Feng,
Yang Wu,
Yongheng Zhang,
Yinghui Li,
Min Li,
Wanxiang Che,
Philip S. Yu
Abstract:
While large language models (LLMs) like ChatGPT have shown impressive capabilities in Natural Language Processing (NLP) tasks, a systematic investigation of their potential in this field remains largely unexplored. This study aims to address this gap by exploring the following questions: (1) How are LLMs currently applied to NLP tasks in the literature? (2) Have traditional NLP tasks already been…
▽ More
While large language models (LLMs) like ChatGPT have shown impressive capabilities in Natural Language Processing (NLP) tasks, a systematic investigation of their potential in this field remains largely unexplored. This study aims to address this gap by exploring the following questions: (1) How are LLMs currently applied to NLP tasks in the literature? (2) Have traditional NLP tasks already been solved with LLMs? (3) What is the future of the LLMs for NLP? To answer these questions, we take the first step to provide a comprehensive overview of LLMs in NLP. Specifically, we first introduce a unified taxonomy including (1) parameter-frozen application and (2) parameter-tuning application to offer a unified perspective for understanding the current progress of LLMs in NLP. Furthermore, we summarize the new frontiers and the associated challenges, aiming to inspire further groundbreaking advancements. We hope this work offers valuable insights into the {potential and limitations} of LLMs in NLP, while also serving as a practical guide for building effective LLMs in NLP.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding
Authors:
Ting Liu,
Xuyang Liu,
Siteng Huang,
Honggang Chen,
Quanjun Yin,
Long Qin,
Donglin Wang,
Yue Hu
Abstract:
Visual grounding (VG) is a challenging task to localize an object in an image based on a textual description. Recent surge in the scale of VG models has substantially improved performance, but also introduced a significant burden on computational costs during fine-tuning. In this paper, we explore applying parameter-efficient transfer learning (PETL) to efficiently transfer the pre-trained vision-…
▽ More
Visual grounding (VG) is a challenging task to localize an object in an image based on a textual description. Recent surge in the scale of VG models has substantially improved performance, but also introduced a significant burden on computational costs during fine-tuning. In this paper, we explore applying parameter-efficient transfer learning (PETL) to efficiently transfer the pre-trained vision-language knowledge to VG. Specifically, we propose \textbf{DARA}, a novel PETL method comprising \underline{\textbf{D}}omain-aware \underline{\textbf{A}}dapters (DA Adapters) and \underline{\textbf{R}}elation-aware \underline{\textbf{A}}dapters (RA Adapters) for VG. DA Adapters first transfer intra-modality representations to be more fine-grained for the VG domain. Then RA Adapters share weights to bridge the relation between two modalities, improving spatial reasoning. Empirical results on widely-used benchmarks demonstrate that DARA achieves the best accuracy while saving numerous updated parameters compared to the full fine-tuning and other PETL methods. Notably, with only \textbf{2.13\%} tunable backbone parameters, DARA improves average accuracy by \textbf{0.81\%} across the three benchmarks compared to the baseline model. Our code is available at \url{https://github.com/liuting20/DARA}.
△ Less
Submitted 8 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
GRSN: Gated Recurrent Spiking Neurons for POMDPs and MARL
Authors:
Lang Qin,
Ziming Wang,
Runhao Jiang,
Rui Yan,
Hua** Tang
Abstract:
Spiking neural networks (SNNs) are widely applied in various fields due to their energy-efficient and fast-inference capabilities. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms,…
▽ More
Spiking neural networks (SNNs) are widely applied in various fields due to their energy-efficient and fast-inference capabilities. Applying SNNs to reinforcement learning (RL) can significantly reduce the computational resource requirements for agents and improve the algorithm's performance under resource-constrained conditions. However, in current spiking reinforcement learning (SRL) algorithms, the simulation results of multiple time steps can only correspond to a single-step decision in RL. This is quite different from the real temporal dynamics in the brain and also fails to fully exploit the capacity of SNNs to process temporal data. In order to address this temporal mismatch issue and further take advantage of the inherent temporal dynamics of spiking neurons, we propose a novel temporal alignment paradigm (TAP) that leverages the single-step update of spiking neurons to accumulate historical state information in RL and introduces gated units to enhance the memory capacity of spiking neurons. Experimental results show that our method can solve partially observable Markov decision processes (POMDPs) and multi-agent cooperation problems with similar performance as recurrent neural networks (RNNs) but with about 50% power consumption.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
EduAgent: Generative Student Agents in Learning
Authors:
Songlin Xu,
Xinyu Zhang,
Lianhui Qin
Abstract:
Student simulation in online education is important to address dynamic learning behaviors of students with diverse backgrounds. Existing simulation models based on deep learning usually need massive training data, lacking prior knowledge in educational contexts. Large language models (LLMs) may contain such prior knowledge since they are pre-trained from a large corpus. However, because student be…
▽ More
Student simulation in online education is important to address dynamic learning behaviors of students with diverse backgrounds. Existing simulation models based on deep learning usually need massive training data, lacking prior knowledge in educational contexts. Large language models (LLMs) may contain such prior knowledge since they are pre-trained from a large corpus. However, because student behaviors are dynamic and multifaceted with individual differences, directly prompting LLMs is not robust nor accurate enough to capture fine-grained interactions among diverse student personas, learning behaviors, and learning outcomes. This work tackles this problem by presenting a newly annotated fine-grained large-scale dataset and proposing EduAgent, a novel generative agent framework incorporating cognitive prior knowledge (i.e., theoretical findings revealed in cognitive science) to guide LLMs to first reason correlations among various behaviors and then make simulations. Our two experiments show that EduAgent could not only mimic and predict learning behaviors of real students but also generate realistic learning behaviors of virtual students without real data.
△ Less
Submitted 23 March, 2024;
originally announced April 2024.
-
Improving Language Model Reasoning with Self-motivated Learning
Authors:
Yunlong Feng,
Yang Xu,
Libo Qin,
Yasheng Wang,
Wanxiang Che
Abstract:
Large-scale high-quality training data is important for improving the performance of models. After trained with data that has rationales (reasoning steps), models gain reasoning capability. However, the dataset with high-quality rationales is relatively scarce due to the high annotation cost. To address this issue, we propose \textit{Self-motivated Learning} framework. The framework motivates the…
▽ More
Large-scale high-quality training data is important for improving the performance of models. After trained with data that has rationales (reasoning steps), models gain reasoning capability. However, the dataset with high-quality rationales is relatively scarce due to the high annotation cost. To address this issue, we propose \textit{Self-motivated Learning} framework. The framework motivates the model itself to automatically generate rationales on existing datasets. Based on the inherent rank from correctness across multiple rationales, the model learns to generate better rationales, leading to higher reasoning capability. Specifically, we train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning. Experiment results of Llama2 7B on multiple reasoning datasets show that our method significantly improves the reasoning ability of models, even outperforming text-davinci-002 in some datasets.
△ Less
Submitted 30 April, 2024; v1 submitted 10 April, 2024;
originally announced April 2024.
-
Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
Authors:
Weilin Cai,
Juyong Jiang,
Le Qin,
Junwei Cui,
Sunghun Kim,
Jiayi Huang
Abstract:
Expert parallelism has been introduced as a strategy to distribute the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple computing devices, facilitating the execution of these increasingly large-scale models. However, the All-to-All communication intrinsic to expert parallelism constitutes a significant overhead, diminishing the MoE models' efficiency. Curren…
▽ More
Expert parallelism has been introduced as a strategy to distribute the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple computing devices, facilitating the execution of these increasingly large-scale models. However, the All-to-All communication intrinsic to expert parallelism constitutes a significant overhead, diminishing the MoE models' efficiency. Current optimization approaches offer some relief, yet they are constrained by the sequential interdependence of communication and computation operations. To address this limitation, we present a novel shortcut-connected MoE architecture with overlap** parallel strategy, designated as ScMoE, which effectively decouples communication from its conventional sequence, allowing for a substantial overlap of 70% to 100% with computation. When compared with the prevalent top-2 MoE architecture, ScMoE demonstrates training speed improvements of 30% and 11%, and inference improvements of 40% and 15%, in our PCIe and NVLink hardware environments, respectively, where communication constitutes 60% and 15% of the total MoE time consumption. On the other hand, extensive experiments and theoretical analyses indicate that ScMoE not only achieves comparable but in some instances surpasses the model quality of existing approaches in vision and language tasks.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Authors:
Libo Qin,
Qiguang Chen,
Yuhang Zhou,
Zhi Chen,
Yinghui Li,
Lizi Liao,
Min Li,
Wanxiang Che,
Philip S. Yu
Abstract:
Multilingual Large Language Models are capable of using powerful Large Language Models to handle and respond to queries in multiple languages, which achieves remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there still remains a lack of a comprehensive survey to summarize existing approaches and recent developments in this field. To this end, in th…
▽ More
Multilingual Large Language Models are capable of using powerful Large Language Models to handle and respond to queries in multiple languages, which achieves remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there still remains a lack of a comprehensive survey to summarize existing approaches and recent developments in this field. To this end, in this paper, we present a thorough review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. The contributions of this paper can be summarized: (1) First survey: to our knowledge, we take the first step and present a thorough review in MLLMs research field according to multi-lingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra
Authors:
Darioush Kevian,
Usman Syed,
Xingang Guo,
Aaron Havens,
Geir Dullerud,
Peter Seiler,
Lianhui Qin,
Bin Hu
Abstract:
In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the bread…
▽ More
In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Large Language Model based Situational Dialogues for Second Language Learning
Authors:
Shuyao Xu,
Long Qin,
Tianyang Chen,
Zhenzhou Zha,
Bingxue Qiu,
Weizhi Wang
Abstract:
In second language learning, scenario-based conversation practice is important for language learners to achieve fluency in speaking, but students often lack sufficient opportunities to practice their conversational skills with qualified instructors or native speakers. To bridge this gap, we propose situational dialogue models for students to engage in conversational practice. Our situational dialo…
▽ More
In second language learning, scenario-based conversation practice is important for language learners to achieve fluency in speaking, but students often lack sufficient opportunities to practice their conversational skills with qualified instructors or native speakers. To bridge this gap, we propose situational dialogue models for students to engage in conversational practice. Our situational dialogue models are fine-tuned on large language models (LLMs), with the aim of combining the engaging nature of an open-ended conversation with the focused practice of scenario-based tasks. Leveraging the generalization capabilities of LLMs, we demonstrate that our situational dialogue models perform effectively not only on training topics but also on topics not encountered during training. This offers a promising solution to support a wide range of conversational topics without extensive manual work. Additionally, research in the field of dialogue systems still lacks reliable automatic evaluation metrics, leading to human evaluation as the gold standard (Smith et al., 2022), which is typically expensive. To address the limitations of existing evaluation methods, we present a novel automatic evaluation method that employs fine-tuned LLMs to efficiently and effectively assess the performance of situational dialogue models.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction
Authors:
Hejie Cui,
Zhuocheng Shen,
Jieyu Zhang,
Hui Shao,
Lianhui Qin,
Joyce C. Ho,
Carl Yang
Abstract:
Electronic health records (EHRs) contain valuable patient data for health-related prediction tasks, such as disease prediction. Traditional approaches rely on supervised learning methods that require large labeled datasets, which can be expensive and challenging to obtain. In this study, we investigate the feasibility of applying Large Language Models (LLMs) to convert structured patient visit dat…
▽ More
Electronic health records (EHRs) contain valuable patient data for health-related prediction tasks, such as disease prediction. Traditional approaches rely on supervised learning methods that require large labeled datasets, which can be expensive and challenging to obtain. In this study, we investigate the feasibility of applying Large Language Models (LLMs) to convert structured patient visit data (e.g., diagnoses, labs, prescriptions) into natural language narratives. We evaluate the zero-shot and few-shot performance of LLMs using various EHR-prediction-oriented prompting strategies. Furthermore, we propose a novel approach that utilizes LLM agents with different roles: a predictor agent that makes predictions and generates reasoning processes and a critic agent that analyzes incorrect predictions and provides guidance for improving the reasoning of the predictor agent. Our results demonstrate that with the proposed approach, LLMs can achieve decent few-shot performance compared to traditional supervised learning methods in EHR-based disease predictions, suggesting its potential for health-oriented applications.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration
Authors:
Zhichao Wei,
Qingkun Su,
Long Qin,
Weizhi Wang
Abstract:
Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. Also, these methods falter in multi-subject image generation due to the unconstrained cross-attention…
▽ More
Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. Also, these methods falter in multi-subject image generation due to the unconstrained cross-attention mechanism. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to simultaneously enhance text consistency and subject fidelity, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings, both of which are efficiently integrated into the diffusion model through the well-designed multimodal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, ensuring flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.
△ Less
Submitted 22 March, 2024;
originally announced March 2024.
-
EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks
Authors:
Ziming Wang,
Ziling Wang,
Huaning Li,
Lang Qin,
Runhao Jiang,
De Ma,
Hua** Tang
Abstract:
Event cameras, with their high dynamic range and temporal resolution, are ideally suited for object detection, especially under scenarios with motion blur and challenging lighting conditions. However, while most existing approaches prioritize optimizing spatiotemporal representations with advanced detection backbones and early aggregation functions, the crucial issue of adaptive event sampling rem…
▽ More
Event cameras, with their high dynamic range and temporal resolution, are ideally suited for object detection, especially under scenarios with motion blur and challenging lighting conditions. However, while most existing approaches prioritize optimizing spatiotemporal representations with advanced detection backbones and early aggregation functions, the crucial issue of adaptive event sampling remains largely unaddressed. Spiking Neural Networks (SNNs), which operate on an event-driven paradigm through sparse spike communication, emerge as a natural fit for addressing this challenge. In this study, we discover that the neural dynamics of spiking neurons align closely with the behavior of an ideal temporal event sampler. Motivated by this insight, we propose a novel adaptive sampling module that leverages recurrent convolutional SNNs enhanced with temporal memory, facilitating a fully end-to-end learnable framework for event-based detection. Additionally, we introduce Residual Potential Dropout (RPD) and Spike-Aware Training (SAT) to regulate potential distribution and address performance degradation encountered in spike-based sampling modules. Through rigorous testing on neuromorphic datasets for event-based detection, our approach demonstrably surpasses existing state-of-the-art spike-based methods, achieving superior performance with significantly fewer parameters and time steps. For instance, our method achieves a 4.4\% mAP improvement on the Gen1 dataset, while requiring 38\% fewer parameters and three time steps. Moreover, the applicability and effectiveness of our adaptive sampling methodology extend beyond SNNs, as demonstrated through further validation on conventional non-spiking detection models.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
EffiVED:Efficient Video Editing via Text-instruction Diffusion Models
Authors:
Zhenghao Zhang,
Zuozhuo Dai,
Long Qin,
Weizhi Wang
Abstract:
Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that d…
▽ More
Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality editing videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. Code can be found at https://github.com/alibaba/EffiVED.
△ Less
Submitted 5 June, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Faceptor: A Generalist Model for Face Perception
Authors:
Lixiong Qin,
Mei Wang,
Xuannan Liu,
Yuhang Zhang,
Wei Deng,
Xiaoshuai Song,
Weiran Xu,
Weihong Deng
Abstract:
With the comprehensive research conducted on various face analysis tasks, there is a growing interest among researchers to develop a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an…
▽ More
With the comprehensive research conducted on various face analysis tasks, there is a growing interest among researchers to develop a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on the unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of the standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed to adopt a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent new-coming semantics. This design enhances the unification of model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, achieving or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available at https://github.com/lxq1000/Faceptor.
△ Less
Submitted 14 March, 2024;
originally announced March 2024.
-
"Did They F***ing Consent to That?": Safer Digital Intimacy via Proactive Protection Against Image-Based Sexual Abuse
Authors:
Lucy Qin,
Vaughn Hamilton,
Sharon Wang,
Yigit Aydinalp,
Marin Scarlett,
Elissa M. Redmiles
Abstract:
As many as 8 in 10 adults share intimate content such as nude or lewd images. Sharing such content has significant benefits for relationship intimacy and body image, and can offer employment. However, stigmatizing attitudes and a lack of technological mitigations put those sharing such content at risk of sexual violence. An estimated 1 in 3 people have been subjected to image-based sexual abuse (I…
▽ More
As many as 8 in 10 adults share intimate content such as nude or lewd images. Sharing such content has significant benefits for relationship intimacy and body image, and can offer employment. However, stigmatizing attitudes and a lack of technological mitigations put those sharing such content at risk of sexual violence. An estimated 1 in 3 people have been subjected to image-based sexual abuse (IBSA), a spectrum of violence that includes the nonconsensual distribution or threat of distribution of consensually-created intimate content (also called NDII). In this work, we conducted a rigorous empirical interview study of 52 European creators of intimate content to examine the threats they face and how they defend against them, situated in the context of their different use cases for intimate content sharing and their choice of technologies for storing and sharing such content. Synthesizing our results with the limited body of prior work on technological prevention of NDII, we offer concrete next steps for both platforms and security & privacy researchers to work toward safer intimate content sharing through proactive protection.
Content Warning: This work discusses sexual violence, specifically, the harms of image-based sexual abuse (particularly in Sections 2 and 6).
△ Less
Submitted 13 June, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
FakeNewsGPT4: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs
Authors:
Xuannan Liu,
Peipei Li,
Huaibo Huang,
Zekun Li,
Xing Cui,
Jiahao Liang,
Lixiong Qin,
Weihong Deng,
Zhaofeng He
Abstract:
The massive generation of multimodal fake news exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training within specific domains restricts the capability of classical detectors to obtain open-world facts. In this paper, we propose FakeNewsGPT4, a novel framework that augments Large Vision-Language Models (LVLMs) with fo…
▽ More
The massive generation of multimodal fake news exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training within specific domains restricts the capability of classical detectors to obtain open-world facts. In this paper, we propose FakeNewsGPT4, a novel framework that augments Large Vision-Language Models (LVLMs) with forgery-specific knowledge for manipulation reasoning while inheriting extensive world knowledge as complementary. Knowledge augmentation in FakeNewsGPT4 involves acquiring two types of forgery-specific knowledge, i.e., semantic correlation and artifact trace, and merging them into LVLMs. Specifically, we design a multi-level cross-modal reasoning module that establishes interactions across modalities for extracting semantic correlations. Concurrently, a dual-branch fine-grained verification module is presented to comprehend localized details to encode artifact traces. The generated knowledge is translated into refined embeddings compatible with LVLMs. We also incorporate candidate answer heuristics and soft prompts to enhance input informativeness. Extensive experiments on the public benchmark demonstrate that FakeNewsGPT4 achieves superior cross-domain performance compared to previous methods. Code will be available.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction
Authors:
Yinghui Li,
Shang Qin,
**gheng Ye,
Shirong Ma,
Yangning Li,
Libo Qin,
Xuming Hu,
Wenhao Jiang,
Hai-Tao Zheng,
Philip S. Yu
Abstract:
Recently, Large Language Models (LLMs) have been widely studied by researchers for their roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences. Previous studies have shown that LLMs' performance as correctors on CGEC remains unsatisfactory due to its challeng…
▽ More
Recently, Large Language Models (LLMs) have been widely studied by researchers for their roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences. Previous studies have shown that LLMs' performance as correctors on CGEC remains unsatisfactory due to its challenging task focus. To promote the CGEC field to better adapt to the era of LLMs, we rethink the roles of LLMs in the CGEC task so that they can be better utilized and explored in CGEC. Considering the rich grammatical knowledge stored in LLMs and their powerful semantic understanding capabilities, we utilize LLMs as explainers to provide explanation information for the CGEC small models during error correction to enhance performance. We also use LLMs as evaluators to bring more reasonable CGEC evaluations, thus alleviating the troubles caused by the subjectivity of the CGEC task. In particular, our work is also an active exploration of how LLMs and small models better collaborate in downstream tasks. Extensive experiments and detailed analyses on widely used datasets verify the effectiveness of our thinking intuition and the proposed methods.
△ Less
Submitted 17 February, 2024;
originally announced February 2024.
-
Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts
Authors:
Xianzhen Luo,
Qingfu Zhu,
Zhiming Zhang,
Libo Qin,
Xuanyu Zhang,
Qing Yang,
Dongliang Xu,
Wanxiang Che
Abstract:
Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive e…
▽ More
Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models. The effectiveness of each language varies depending on the specific scenarios. Inspired by this, we propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
△ Less
Submitted 16 June, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Authors:
Xingang Guo,
Fangxu Yu,
Huan Zhang,
Lianhui Qin,
Bin Hu
Abstract:
Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally for…
▽ More
Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent (suffix) attack with continuation constraint, but also allow us to address new controllable attack settings such as revising a user query adversarially with paraphrasing constraint, and inserting stealthy attacks in context with position constraint. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.
△ Less
Submitted 6 June, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
You Still See Me: How Data Protection Supports the Architecture of ML Surveillance
Authors:
Rui-Jie Yew,
Lucy Qin,
Suresh Venkatasubramanian
Abstract:
Human data forms the backbone of machine learning. Data protection laws thus have strong bearing on how ML systems are governed. Given that most requirements in data protection laws accompany the processing of personal data, organizations have an incentive to keep their data out of legal scope. This makes the development and application of certain privacy-preserving techniques--data protection tec…
▽ More
Human data forms the backbone of machine learning. Data protection laws thus have strong bearing on how ML systems are governed. Given that most requirements in data protection laws accompany the processing of personal data, organizations have an incentive to keep their data out of legal scope. This makes the development and application of certain privacy-preserving techniques--data protection techniques--an important strategy for ML compliance. In this paper, we examine the impact of a rhetoric that deems data wrapped in these techniques as data that is "good-to-go". We show how their application in the development of ML systems--from private set intersection as part of dataset curation to homomorphic encryption and federated learning as part of model computation--can further support individual monitoring and data consolidation. With data accumulation at the core of how the ML pipeline is configured, we argue that data protection techniques are often instrumentalized in ways that support infrastructures of surveillance, rather than in ways that protect individuals associated with data. Finally, we propose technology and policy strategies to evaluate data protection techniques in light of the protections they actually confer. We conclude by highlighting the role that technologists might play in devising policies that combat surveillance ML technologies.
△ Less
Submitted 18 February, 2024; v1 submitted 9 February, 2024;
originally announced February 2024.
-
Pro-HAN: A Heterogeneous Graph Attention Network for Profile-Based Spoken Language Understanding
Authors:
Dechuan Teng,
Chunlin Lu,
Xiao Xu,
Wanxiang Che,
Libo Qin
Abstract:
Recently, Profile-based Spoken Language Understanding (SLU) has gained increasing attention, which aims to incorporate various types of supplementary profile information (i.e., Knowledge Graph, User Profile, Context Awareness) to eliminate the prevalent ambiguities in user utterances. However, existing approaches can only separately model different profile information, without considering their in…
▽ More
Recently, Profile-based Spoken Language Understanding (SLU) has gained increasing attention, which aims to incorporate various types of supplementary profile information (i.e., Knowledge Graph, User Profile, Context Awareness) to eliminate the prevalent ambiguities in user utterances. However, existing approaches can only separately model different profile information, without considering their interrelationships or excluding irrelevant and conflicting information within them. To address the above issues, we introduce a Heterogeneous Graph Attention Network to perform reasoning across multiple Profile information, called Pro-HAN. Specifically, we design three types of edges, denoted as intra-Pro, inter-Pro, and utterance-Pro, to capture interrelationships among multiple Pros. We establish a new state-of-the-art on the ProSLU dataset, with an improvement of approximately 8% across all three metrics. Further analysis experiments also confirm the effectiveness of our method in modeling multi-source profile information.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Local Privacy-preserving Mechanisms and Applications in Machine Learning
Authors:
Likun Qin,
Tianshuo Qiu
Abstract:
The emergence and evolution of Local Differential Privacy (LDP) and its various adaptations play a pivotal role in tackling privacy issues related to the vast amounts of data generated by intelligent devices, which are crucial for data-informed decision-making in the realm of crowdsensing. Utilizing these extensive datasets can provide critical insights but also introduces substantial privacy conc…
▽ More
The emergence and evolution of Local Differential Privacy (LDP) and its various adaptations play a pivotal role in tackling privacy issues related to the vast amounts of data generated by intelligent devices, which are crucial for data-informed decision-making in the realm of crowdsensing. Utilizing these extensive datasets can provide critical insights but also introduces substantial privacy concerns for the individuals involved. LDP, noted for its decentralized framework, excels in providing strong privacy protection for individual users during the stages of data collection and processing. The core principle of LDP lies in its technique of altering each user's data locally at the client end before it is sent to the server, thus preventing privacy violations at both stages. There are many LDP variances in the privacy research community aimed to improve the utility-privacy tradeoff. On the other hand, one of the major applications of the privacy-preserving mechanisms is machine learning. In this paper, we firstly delves into a comprehensive analysis of LDP and its variances, focusing on their various models, the diverse range of its adaptations, and the underlying structure of privacy mechanisms; then we discuss the state-of-art privacy mechanisms applications in machine learning.
△ Less
Submitted 8 January, 2024;
originally announced January 2024.
-
Open-Set Facial Expression Recognition
Authors:
Yuhang Zhang,
Yue Yao,
Xuannan Liu,
Lixiong Qin,
Wen**g Wang,
Weihong Deng
Abstract:
Facial expression recognition (FER) models are typically trained on datasets with a fixed number of seven basic classes. However, recent research works point out that there are far more expressions than the basic ones. Thus, when these models are deployed in the real world, they may encounter unknown classes, such as compound expressions that cannot be classified into existing basic classes. To ad…
▽ More
Facial expression recognition (FER) models are typically trained on datasets with a fixed number of seven basic classes. However, recent research works point out that there are far more expressions than the basic ones. Thus, when these models are deployed in the real world, they may encounter unknown classes, such as compound expressions that cannot be classified into existing basic classes. To address this issue, we propose the open-set FER task for the first time. Though there are many existing open-set recognition methods, we argue that they do not work well for open-set FER because FER data are all human faces with very small inter-class distances, which makes the open-set samples very similar to close-set samples. In this paper, we are the first to transform the disadvantage of small inter-class distance into an advantage by proposing a new way for open-set FER. Specifically, we find that small inter-class distance allows for sparsely distributed pseudo labels of open-set samples, which can be viewed as symmetric noisy labels. Based on this novel observation, we convert the open-set FER to a noisy label detection problem. We further propose a novel method that incorporates attention map consistency and cycle training to detect the open-set samples. Extensive experiments on various FER datasets demonstrate that our method clearly outperforms state-of-the-art open-set recognition methods by large margins. Code is available at https://github.com/zyh-uaiaaaa.
△ Less
Submitted 23 January, 2024;
originally announced January 2024.
-
SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection
Authors:
Shijue Huang,
Libo Qin,
Bingbing Wang,
Geng Tu,
Ruifeng Xu
Abstract:
Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse different features of modalities and (2) the limited labeled multi-modal intent training data. In this work, we intro…
▽ More
Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse different features of modalities and (2) the limited labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address the above challenges. Firstly, SDIF-DA leverages a shallow-to-deep interaction module to progressively and effectively align and fuse features across text, video, and audio modalities. Secondly, we propose a ChatGPT-based data augmentation approach to automatically augment sufficient training data. Experimental results demonstrate that SDIF-DA can effectively align and fuse multi-modal features by achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmentation approach can successfully distill knowledge from the large language model.
△ Less
Submitted 31 December, 2023;
originally announced January 2024.
-
Camera calibration for the surround-view system: a benchmark and dataset
Authors:
L Qin,
C Lin,
S Huang,
S Yang,
Y Zhao
Abstract:
Surround-view system (SVS) is widely used in the Advanced Driver Assistance System (ADAS). SVS uses four fisheye lenses to monitor real-time scenes around the vehicle. However, accurate intrinsic and extrinsic parameter estimation is required for the proper functioning of the system. At present, the intrinsic calibration can be pipeline by utilizing checkerboard algorithm, while extrinsic calibrat…
▽ More
Surround-view system (SVS) is widely used in the Advanced Driver Assistance System (ADAS). SVS uses four fisheye lenses to monitor real-time scenes around the vehicle. However, accurate intrinsic and extrinsic parameter estimation is required for the proper functioning of the system. At present, the intrinsic calibration can be pipeline by utilizing checkerboard algorithm, while extrinsic calibration is still immature. Therefore, we proposed a specific calibration pipeline to estimate extrinsic parameters robustly. This scheme takes a driving sequence of four cameras as input. It firstly utilizes lane line to roughly estimate each camera pose. Considering the environmental condition differences in each camera, we separately select strategies from two methods to accurately estimate the extrinsic parameters. To achieve accurate estimates for both front and rear camera, we proposed a method that mutually iterating line detection and pose estimation. As for bilateral camera, we iteratively adjust the camera pose and position by minimizing texture and edge error between ground projections of adjacent cameras. After estimating the extrinsic parameters, the surround-view image can be synthesized by homography-based transformation. The proposed pipeline can robustly estimate the four SVS camera extrinsic parameters in real driving environments. In addition, to evaluate the proposed scheme, we build a surround-view fisheye dataset, which contains 40 videos with 32,000 frames, acquired from different real traffic scenarios. All the frames in each video are manually labeled with lane annotation, with its GT extrinsic parameters. Moreover, this surround-view dataset could be used by other researchers to evaluate their performance. The dataset will be available soon.
△ Less
Submitted 27 December, 2023;
originally announced December 2023.
-
DTIAM: A unified framework for predicting drug-target interactions, binding affinities and activation/inhibition mechanisms
Authors:
Zhangli Lu,
Chuqi Lei,
Kaili Wang,
Libo Qin,
**g Tang,
Min Li
Abstract:
Accurate and robust prediction of drug-target interactions (DTIs) plays a vital role in drug discovery. Despite extensive efforts have been invested in predicting novel DTIs, existing approaches still suffer from insufficient labeled data and cold start problems. More importantly, there is currently a lack of studies focusing on elucidating the mechanism of action (MoA) between drugs and targets.…
▽ More
Accurate and robust prediction of drug-target interactions (DTIs) plays a vital role in drug discovery. Despite extensive efforts have been invested in predicting novel DTIs, existing approaches still suffer from insufficient labeled data and cold start problems. More importantly, there is currently a lack of studies focusing on elucidating the mechanism of action (MoA) between drugs and targets. Distinguishing the activation and inhibition mechanisms is critical and challenging in drug development. Here, we introduce a unified framework called DTIAM, which aims to predict interactions, binding affinities, and activation/inhibition mechanisms between drugs and targets. DTIAM learns drug and target representations from large amounts of label-free data through self-supervised pre-training, which accurately extracts the substructure and contextual information of drugs and targets, and thus benefits the downstream prediction based on these representations. DTIAM achieves substantial performance improvement over other state-of-the-art methods in all tasks, particularly in the cold start scenario. Moreover, independent validation demonstrates the strong generalization ability of DTIAM. All these results suggested that DTIAM can provide a practically useful tool for predicting novel DTIs and further distinguishing the MoA of candidate drugs. DTIAM, for the first time, provides a unified framework for accurate and robust prediction of drug-target interactions, binding affinities, and activation/inhibition mechanisms.
△ Less
Submitted 23 December, 2023;
originally announced December 2023.
-
Towards Decentralized Task Offloading and Resource Allocation in User-Centric Mobile Edge Computing
Authors:
Langtian Qin,
Hancheng Lu,
Yuang Chen,
Baolin Chong,
Feng Wu
Abstract:
In the traditional cellular-based mobile edge computing (MEC), users at the edge of the cell are prone to suffer severe inter-cell interference and signal attenuation, leading to low throughput even transmission interruptions. Such edge effect severely obstructs offloading of tasks to MEC servers. To address this issue, we propose user-centric mobile edge computing (UCMEC), a novel MEC architectur…
▽ More
In the traditional cellular-based mobile edge computing (MEC), users at the edge of the cell are prone to suffer severe inter-cell interference and signal attenuation, leading to low throughput even transmission interruptions. Such edge effect severely obstructs offloading of tasks to MEC servers. To address this issue, we propose user-centric mobile edge computing (UCMEC), a novel MEC architecture integrating user-centric transmission, which can ensure high throughput and reliable communication for task offloading. Then, we formulate an optimization problem with joint consideration of task offloading, power control, and computing resource allocation in UCMEC, aiming at obtaining the optimal performance in terms of long-term average total delay. To solve the intractable problem, we propose two decentralized joint optimization schemes based on multi-agent deep reinforcement learning (MADRL) and convex optimization, which consider both cooperation and non-cooperation among network nodes. Simulation results demonstrate that the proposed schemes in UCMEC can significantly improve the uplink transmission rate by at most 343.56% and reduce the long-term average total delay by at most 45.57% compared to traditional cellular-based MEC.
△ Less
Submitted 3 December, 2023;
originally announced December 2023.
-
From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models
Authors:
Zachary Englhardt,
Chengqian Ma,
Margaret E. Morris,
Xuhai "Orson" Xu,
Chun-Cheng Chang,
Lianhui Qin,
Daniel McDuff,
Xin Liu,
Shwetak Patel,
Vikram Iyer
Abstract:
Passively collected behavioral health data from ubiquitous sensors holds significant promise to provide mental health professionals insights from patient's daily lives; however, develo** analysis tools to use this data in clinical practice requires addressing challenges of generalization across devices and weak or ambiguous correlations between the measured signals and an individual's mental hea…
▽ More
Passively collected behavioral health data from ubiquitous sensors holds significant promise to provide mental health professionals insights from patient's daily lives; however, develo** analysis tools to use this data in clinical practice requires addressing challenges of generalization across devices and weak or ambiguous correlations between the measured signals and an individual's mental health. To address these challenges, we take a novel approach that leverages large language models (LLMs) to synthesize clinically useful insights from multi-sensor data. We develop chain of thought prompting methods that use LLMs to generate reasoning about how trends in data such as step count and sleep relate to conditions like depression and anxiety. We first demonstrate binary depression classification with LLMs achieving accuracies of 61.1% which exceed the state of the art. While it is not robust for clinical use, this leads us to our key finding: even more impactful and valued than classification is a new human-AI collaboration approach in which clinician experts interactively query these tools and combine their domain expertise and context about the patient with AI generated reasoning to support clinical decision-making. We find models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.
△ Less
Submitted 25 November, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
-
AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance
Authors:
Zuozhuo Dai,
Zhenghao Zhang,
Yao Yao,
Bingxue Qiu,
Siyu Zhu,
Long Qin,
Weizhi Wang
Abstract:
Image animation is a key task in computer vision which aims to generate dynamic visual content from static image. Recent image animation methods employ neural based rendering technique to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real…
▽ More
Image animation is a key task in computer vision which aims to generate dynamic visual content from static image. Recent image animation methods employ neural based rendering technique to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real environments. In this paper, we introduce an open domain image animation method that leverages the motion prior of video diffusion model. Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control the movable area and its motion speed. This results in enhanced alignment between the animated visual elements and the prompting text, thereby facilitating a fine-grained and interactive animation generation process for intricate motion sequences. We validate the effectiveness of our method through rigorous experiments on an open-domain dataset, with the results showcasing its superior performance. Project page can be found at https://animationai.github.io/AnimateAnything.
△ Less
Submitted 4 December, 2023; v1 submitted 20 November, 2023;
originally announced November 2023.
-
Fast Heavy Inner Product Identification Between Weights and Inputs in Neural Network Training
Authors:
Lianke Qin,
Saayan Mitra,
Zhao Song,
Yuanyuan Yang,
Tianyi Zhou
Abstract:
In this paper, we consider a heavy inner product identification problem, which generalizes the Light Bulb problem~(\cite{prr89}): Given two sets $A \subset \{-1,+1\}^d$ and $B \subset \{-1,+1\}^d$ with $|A|=|B| = n$, if there are exact $k$ pairs whose inner product passes a certain threshold, i.e., $\{(a_1, b_1), \cdots, (a_k, b_k)\} \subset A \times B$ such that…
▽ More
In this paper, we consider a heavy inner product identification problem, which generalizes the Light Bulb problem~(\cite{prr89}): Given two sets $A \subset \{-1,+1\}^d$ and $B \subset \{-1,+1\}^d$ with $|A|=|B| = n$, if there are exact $k$ pairs whose inner product passes a certain threshold, i.e., $\{(a_1, b_1), \cdots, (a_k, b_k)\} \subset A \times B$ such that $\forall i \in [k], \langle a_i,b_i \rangle \geq ρ\cdot d$, for a threshold $ρ\in (0,1)$, the goal is to identify those $k$ heavy inner products. We provide an algorithm that runs in $O(n^{2 ω/ 3+ o(1)})$ time to find the $k$ inner product pairs that surpass $ρ\cdot d$ threshold with high probability, where $ω$ is the current matrix multiplication exponent. By solving this problem, our method speed up the training of neural networks with ReLU activation function.
△ Less
Submitted 19 November, 2023;
originally announced November 2023.
-
Cross-Layer Optimization for Statistical QoS Provision in C-RAN with Finite-Length Coding
Authors:
Chang Wu,
Hancheng Lu,
Yuang Chen,
Langtian Qin
Abstract:
The cloud radio access network (C-RAN) has become the foundational structure for various emerging communication paradigms, leveraging the flexible deployment of distributed access points (APs) and centralized task processing. In this paper, we propose a cross-layer optimization framework based on a practical finite-length coding communication system in C-RAN, aiming at maximizing bandwidth efficie…
▽ More
The cloud radio access network (C-RAN) has become the foundational structure for various emerging communication paradigms, leveraging the flexible deployment of distributed access points (APs) and centralized task processing. In this paper, we propose a cross-layer optimization framework based on a practical finite-length coding communication system in C-RAN, aiming at maximizing bandwidth efficiency while providing statistical quality of service (QoS) for individual services. Based on the theoretical results from effective capacity and finite-length coding, we formulate a joint optimization problem involving modulation and coding schemes (MCS), retransmission count, initial bandwidth allocation and AP selection, which reflects the coordinated decision of parameters across the physical layer, data link layer and transport layer. To tackle such a mixed-integer nonlinear programming (MINLP) problem, we firstly decompose it into a transmission parameter decision (TPD) sub-problem and a user association (UA) sub-problem, which can be solved by a binary search-based algorithm and an auction-based algorithm respectively. Simulation results demonstrate that the proposed model can accurately capture the impact of QoS requirements and channel quality on the optimal transmission parameters. Furthermore, compared with fixed transmission parameter setting, the proposed algorithms achieve the bandwidth efficiency gain up to 27.87% under various traffic and channel scenarios.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
MacGyver: Are Large Language Models Creative Problem Solvers?
Authors:
Yufei Tian,
Abhilasha Ravichander,
Lianhui Qin,
Ronan Le Bras,
Raja Marjieh,
Nanyun Peng,
Ye** Choi,
Thomas L. Griffiths,
Faeze Brahman
Abstract:
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their…
▽ More
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking.
This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.
△ Less
Submitted 27 March, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Structured Chemistry Reasoning with Large Language Models
Authors:
Siru Ouyang,
Zhuosheng Zhang,
Bing Yan,
Xuan Liu,
Ye** Choi,
Jiawei Han,
Lianhui Qin
Abstract:
Large Language Models (LLMs) excel in diverse areas, yet struggle with complex scientific reasoning, especially in the field of chemistry. Different from the simple chemistry tasks (e.g., molecule classification) addressed in previous studies, complex chemistry problems require not only vast knowledge and precise calculation, but also compositional reasoning about rich dynamic interactions of diff…
▽ More
Large Language Models (LLMs) excel in diverse areas, yet struggle with complex scientific reasoning, especially in the field of chemistry. Different from the simple chemistry tasks (e.g., molecule classification) addressed in previous studies, complex chemistry problems require not only vast knowledge and precise calculation, but also compositional reasoning about rich dynamic interactions of different concepts (e.g., temperature changes). Our study shows that even advanced LLMs, like GPT-4, can fail easily in different ways. Interestingly, the errors often stem not from a lack of domain knowledge within the LLMs, but rather from the absence of an effective reasoning structure that guides the LLMs to elicit the right knowledge, incorporate the knowledge in step-by-step reasoning, and iteratively refine results for further improved quality. On this basis, we introduce StructChem, a simple yet effective prompting strategy that offers the desired guidance and substantially boosts the LLMs' chemical reasoning capability. Testing across four chemistry areas -- quantum chemistry, mechanics, physical chemistry, and kinetics -- StructChem substantially enhances GPT-4's performance, with up to 30\% peak improvement. Our analysis also underscores the unique difficulties of precise grounded reasoning in science with LLMs, highlighting a need for more research in this area. Code is available at \url{https://github.com/ozyyshr/StructChem}.
△ Less
Submitted 9 February, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
End-to-end Task-oriented Dialogue: A Survey of Tasks, Methods, and Future Directions
Authors:
Libo Qin,
Wenbo Pan,
Qiguang Chen,
Lizi Liao,
Zhou Yu,
Yue Zhang,
Wanxiang Che,
Min Li
Abstract:
End-to-end task-oriented dialogue (EToD) can directly generate responses in an end-to-end fashion without modular training, which attracts escalating popularity. The advancement of deep neural networks, especially the successful use of large pre-trained models, has further led to significant progress in EToD research in recent years. In this paper, we present a thorough review and provide a unifie…
▽ More
End-to-end task-oriented dialogue (EToD) can directly generate responses in an end-to-end fashion without modular training, which attracts escalating popularity. The advancement of deep neural networks, especially the successful use of large pre-trained models, has further led to significant progress in EToD research in recent years. In this paper, we present a thorough review and provide a unified perspective to summarize existing approaches as well as recent trends to advance the development of EToD research. The contributions of this paper can be summarized: (1) \textbf{\textit{First survey}}: to our knowledge, we take the first step to present a thorough survey of this research field; (2) \textbf{\textit{New taxonomy}}: we first introduce a unified perspective for EToD, including (i) \textit{Modularly EToD} and (ii) \textit{Fully EToD}; (3) \textbf{\textit{New Frontiers}}: we discuss some potential frontier areas as well as the corresponding challenges, ho** to spur breakthrough research in EToD field; (4) \textbf{\textit{Abundant resources}}: we build a public website\footnote{We collect the related papers, baseline projects, and leaderboards for the community at \url{https://etods.net/}.}, where EToD researchers could directly access the recent progress. We hope this work can serve as a thorough reference for the EToD research community.
△ Less
Submitted 15 November, 2023;
originally announced November 2023.