-
A Survey on Deep Clustering: From the Prior Perspective
Authors:
Yiding Lu,
Haobin Li,
Yunfan Li,
Yijie Lin,
Xi Peng
Abstract:
Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation…
▽ More
Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.
△ Less
Submitted 30 June, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Authors:
Xin Lai,
Zhuotao Tian,
Yukang Chen,
Senqiao Yang,
Xiangru Peng,
Jiaya Jia
Abstract:
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benef…
▽ More
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Leveraging LLMs for Dialogue Quality Measurement
Authors:
**ghan Jia,
Abi Komma,
Timothy Leffel,
Xujun Peng,
Ajay Nagesh,
Tamer Soliman,
Aram Galstyan,
Anoop Kumar
Abstract:
In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. This paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and pro…
▽ More
In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. This paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection; (3) CoT reasoning where an LLM is asked to provide justifications before outputting final labels improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices
Authors:
Li Wang,
Liang Li,
Lianming Xu,
Xian Peng,
Aiguo Fei
Abstract:
The distributed inference paradigm enables the computation workload to be distributed across multiple devices, facilitating the implementations of deep learning based intelligent services on extremely resource-constrained Internet of Things (IoT) scenarios. Yet it raises great challenges to perform complicated inference tasks relying on a cluster of IoT devices that are heterogeneous in their comp…
▽ More
The distributed inference paradigm enables the computation workload to be distributed across multiple devices, facilitating the implementations of deep learning based intelligent services on extremely resource-constrained Internet of Things (IoT) scenarios. Yet it raises great challenges to perform complicated inference tasks relying on a cluster of IoT devices that are heterogeneous in their computing/communication capacity and prone to crash or timeout failures. In this paper, we present RoCoIn, a robust cooperative inference mechanism for locally distributed execution of deep neural network-based inference tasks over heterogeneous edge devices. It creates a set of independent and compact student models that are learned from a large model using knowledge distillation for distributed deployment. In particular, the devices are strategically grouped to redundantly deploy and execute the same student model such that the inference process is resilient to any local failures, while a joint knowledge partition and student model assignment scheme are designed to minimize the response latency of the distributed inference system in the presence of devices with diverse capacities. Extensive simulations are conducted to corroborate the superior performance of our RoCoIn for distributed inference compared to several baselines, and the results demonstrate its efficacy in timely inference and failure resiliency.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Authors:
Zebang Cheng,
Zhi-Qi Cheng,
Jun-Yan He,
**gdong Sun,
Kai Wang,
Yuxiang Lin,
Zheng Lian,
Xiaojiang Peng,
Alexander Hauptmann
Abstract:
Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing su…
▽ More
Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG
Authors:
Xueying Du,
Geng Zheng,
Kaixin Wang,
Jiayi Feng,
Wentai Deng,
Mingwei Liu,
Bihuan Chen,
Xin Peng,
Tao Ma,
Yiling Lou
Abstract:
Vulnerability detection is essential for software quality assurance. In recent years, deep learning models (especially large language models) have shown promise in vulnerability detection. In this work, we propose a novel LLM-based vulnerability detection technique Vul-RAG, which leverages knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerability for the given code in…
▽ More
Vulnerability detection is essential for software quality assurance. In recent years, deep learning models (especially large language models) have shown promise in vulnerability detection. In this work, we propose a novel LLM-based vulnerability detection technique Vul-RAG, which leverages knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerability for the given code in three phases. First, Vul-RAG constructs a vulnerability knowledge base by extracting multi-dimension knowledge via LLMs from existing CVE instances; second, for a given code snippet, Vul-RAG} retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics; third, Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning the presence of vulnerability causes and fixing solutions of the retrieved vulnerability knowledge. Our evaluation of Vul-RAG on our constructed benchmark PairVul shows that Vul-RAG substantially outperforms all baselines by 12.96\%/110\% relative improvement in accuracy/pairwise-accuracy. In addition, our user study shows that the vulnerability knowledge generated by Vul-RAG can serve as high-quality explanations which can improve the manual detection accuracy from 0.60 to 0.77.
△ Less
Submitted 19 June, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
MemDPT: Differential Privacy for Memory Efficient Language Models
Authors:
Yanming Liu,
Xinyue Peng,
Jiannan Cao,
Yuwei Zhang,
Chen Ma,
Songhang Deng,
Mengchen Fu,
Xuhong Zhang,
Sheng Cheng,
Xun Wang,
Jianwei Yin,
Tianyu Du
Abstract:
Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, the deployment of these models can inadvertently expose user privacy to potential risks. The substantial memory demands of these models during training represent a significant resource consumption challenge. The sheer size of these models imposes a considerable burden on…
▽ More
Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, the deployment of these models can inadvertently expose user privacy to potential risks. The substantial memory demands of these models during training represent a significant resource consumption challenge. The sheer size of these models imposes a considerable burden on memory resources, which is a matter of significant concern in practice. In this paper, we present an innovative training framework MemDPT that not only reduces the memory cost of large language models but also places a strong emphasis on safeguarding user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves $2 \sim 3 \times$ memory optimization but also provides robust privacy protection, ensuring that user data remains secure and confidential. Extensive experiments have demonstrated that MemDPT can effectively provide differential privacy efficient fine-tuning across various task scenarios.
△ Less
Submitted 20 June, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis
Authors:
Junwei Liu,
Yixuan Chen,
Mingwei Liu,
Xin Peng,
Yiling Lou
Abstract:
Repository-level code completion is challenging as it involves complicated contexts from multiple files in the repository. To date, researchers have proposed two technical categories to enhance LLM-based repository-level code completion, i.e., retrieval-augmented generation (RAG) and static analysis integration. This work performs the first study on the static analysis integration in LLM-based rep…
▽ More
Repository-level code completion is challenging as it involves complicated contexts from multiple files in the repository. To date, researchers have proposed two technical categories to enhance LLM-based repository-level code completion, i.e., retrieval-augmented generation (RAG) and static analysis integration. This work performs the first study on the static analysis integration in LLM-based repository-level code completion by investigating both the effectiveness and efficiency of static analysis integration strategies across different phases of code completion. We first implement a framework STALL+, which supports an extendable and customizable integration of multiple static analysis strategies into the complete pipeline of LLM-based repository-level code completion; and based on STALL+, we perform extensive experiments by including different code LLMs on the latest repository-level code completion benchmark CrossCodeEval. Our findings show that integrating file-level dependencies in prompting phase performs the best while the integration in post-processing phase performs the worse. Additionally, we observe different improvements from static analysis between dynamic languages and static languages, i.e., the best combination is prompting-phase with decoding-phase integration for Java while the best combination is prompting-phase with post-processing-phase integration for Python given the limitations of statically analyzing dynamic languages. Additionally, we find the complementarity between RAG and static analysis integration as well as their cost-effectiveness after combination.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study
Authors:
Chong Wang,
Kaifeng Huang,
Jian Zhang,
Yebo Feng,
Lyuye Zhang,
Yang Liu,
Xin Peng
Abstract:
Large language models (LLMs), pre-trained or fine-tuned on large code corpora, have shown effectiveness in generating code completions. However, in LLM-based code completion, LLMs may struggle to use correct and up-to-date Application Programming Interfaces (APIs) due to the rapid and continuous evolution of libraries. While existing studies have highlighted issues with predicting incorrect APIs,…
▽ More
Large language models (LLMs), pre-trained or fine-tuned on large code corpora, have shown effectiveness in generating code completions. However, in LLM-based code completion, LLMs may struggle to use correct and up-to-date Application Programming Interfaces (APIs) due to the rapid and continuous evolution of libraries. While existing studies have highlighted issues with predicting incorrect APIs, the specific problem of deprecated API usage in LLM-based code completion has not been thoroughly investigated.
To address this gap, we conducted the first evaluation study on deprecated API usage in LLM-based code completion. This study involved seven advanced LLMs, 145 API map**s from eight popular Python libraries, and 28,125 completion prompts. The study results reveal the \textit{status quo} and \textit{root causes} of deprecated API usage in LLM-based code completion from the perspectives of \textit{model}, \textit{prompt}, and \textit{library}. Based on these findings, we propose two lightweight fixing approaches, \textsc{ReplaceAPI} and \textsc{InsertPrompt}, which can serve as baseline approaches for future research on mitigating deprecated API usage in LLM-based completion. Additionally, we provide implications for future research on integrating library evolution with LLM-driven software development.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
Language Guided Skill Discovery
Authors:
Seungeun Rho,
Laura Smith,
Tianyu Li,
Sergey Levine,
Xue Bin Peng,
Sehoon Ha
Abstract:
Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for unknown downstream tasks, obtaining a semantically diverse repertoire of skills is essential. While some approaches introduce a discriminator to distinguish skills and others aim to increase state coverage, no existing work directly addresses the "semantic diversity…
▽ More
Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for unknown downstream tasks, obtaining a semantically diverse repertoire of skills is essential. While some approaches introduce a discriminator to distinguish skills and others aim to increase state coverage, no existing work directly addresses the "semantic diversity" of skills. We hypothesize that leveraging the semantic knowledge of large language models (LLMs) can lead us to improve semantic diversity of resulting behaviors. In this sense, we introduce Language Guided Skill Discovery (LGSD), a skill discovery framework that aims to directly maximize the semantic diversity between skills. LGSD takes user prompts as input and outputs a set of semantically distinctive skills. The prompts serve as a means to constrain the search space into a semantically desired subspace, and the generated LLM outputs guide the agent to visit semantically diverse states within the subspace. We demonstrate that LGSD enables legged robots to visit different user-intended areas on a plane by simply changing the prompt. Furthermore, we show that language guidance aids in discovering more diverse skills compared to five existing skill discovery methods in robot-arm manipulation environments. Lastly, LGSD provides a simple way of utilizing learned skills via natural language.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs
Authors:
Claire **,
Sudha Rao,
Xiangyu Peng,
Portia Botchway,
Jessica Quaye,
Chris Brockett,
Bill Dolan
Abstract:
Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detec…
▽ More
Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Tool-Planner: Dynamic Solution Tree Planning for Large Language Model with Tool Clustering
Authors:
Yanming Liu,
Xinyue Peng,
Yuwei Zhang,
Jiannan Cao,
Xuhong Zhang,
Sheng Cheng,
Xun Wang,
Jianwei Yin,
Tianyu Du
Abstract:
Large language models (LLMs) have demonstrated exceptional reasoning capabilities, enabling them to solve various complex problems. Recently, this ability has been applied to the paradigm of tool learning. Tool learning involves providing examples of tool usage and their corresponding functions, allowing LLMs to formulate plans and demonstrate the process of invoking and executing each tool. LLMs…
▽ More
Large language models (LLMs) have demonstrated exceptional reasoning capabilities, enabling them to solve various complex problems. Recently, this ability has been applied to the paradigm of tool learning. Tool learning involves providing examples of tool usage and their corresponding functions, allowing LLMs to formulate plans and demonstrate the process of invoking and executing each tool. LLMs can address tasks that they cannot complete independently, thereby enhancing their potential across different tasks. However, this approach faces two key challenges. First, redundant error correction leads to unstable planning and long execution time. Additionally, designing a correct plan among multiple tools is also a challenge in tool learning. To address these issues, we propose Tool-Planner, a task-processing framework based on toolkits. Tool-Planner groups tools based on the API functions with the same function into a toolkit and allows LLMs to implement planning across the various toolkits. When a tool error occurs, the language model can reselect and adjust tools based on the toolkit. Experiments show that our approach demonstrates a high pass and win rate across different datasets and optimizes the planning scheme for tool learning in models such as GPT-4 and Claude 3, showcasing the potential of our method.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Genshin: General Shield for Natural Language Processing with Large Language Models
Authors:
Xiao Peng,
Tao Liu,
Ying Wang
Abstract:
Large language models (LLMs) like ChatGPT, Gemini, or LLaMA have been trending recently, demonstrating considerable advancement and generalizability power in countless domains. However, LLMs create an even bigger black box exacerbating opacity, with interpretability limited to few approaches. The uncertainty and opacity embedded in LLMs' nature restrict their application in high-stakes domains lik…
▽ More
Large language models (LLMs) like ChatGPT, Gemini, or LLaMA have been trending recently, demonstrating considerable advancement and generalizability power in countless domains. However, LLMs create an even bigger black box exacerbating opacity, with interpretability limited to few approaches. The uncertainty and opacity embedded in LLMs' nature restrict their application in high-stakes domains like financial fraud, phishing, etc. Current approaches mainly rely on traditional textual classification with posterior interpretable algorithms, suffering from attackers who may create versatile adversarial samples to break the system's defense, forcing users to make trade-offs between efficiency and robustness. To address this issue, we propose a novel cascading framework called Genshin (General Shield for Natural Language Processing with Large Language Models), utilizing LLMs as defensive one-time plug-ins. Unlike most applications of LLMs that try to transform text into something new or structural, Genshin uses LLMs to recover text to its original state. Genshin aims to combine the generalizability of the LLM, the discrimination of the median model, and the interpretability of the simple model. Our experiments on the task of sentimental analysis and spam detection have shown fatal flaws of the current median models and exhilarating results on LLMs' recovery ability, demonstrating that Genshin is both effective and efficient. In our ablation study, we unearth several intriguing observations. Utilizing the LLM defender, a tool derived from the 4th paradigm, we have reproduced BERT's 15% optimal mask rate results in the 3rd paradigm of NLP. Additionally, when employing the LLM as a potential adversarial tool, attackers are capable of executing effective attacks that are nearly semantically lossless.
△ Less
Submitted 3 June, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Dataset Growth
Authors:
Ziheng Qin,
Zhaopan Xu,
Yukun Zhou,
Zangwei Zheng,
Zebang Cheng,
Hao Tang,
Lei Shang,
Baigui Sun,
Xiaojiang Peng,
Radu Timofte,
Hongxun Yao,
Kai Wang,
Yang You
Abstract:
Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Data publicly available are from different sources with various qualities, and it is impractical to do manual cleaning against noise and redundancy given today's data scale. There are existing techniques for cleaning/selecting the collected data. H…
▽ More
Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Data publicly available are from different sources with various qualities, and it is impractical to do manual cleaning against noise and redundancy given today's data scale. There are existing techniques for cleaning/selecting the collected data. However, these methods are mainly proposed for offline settings that target one of the cleanness and redundancy problems. In practice, data are growing exponentially with both problems. This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that keeps up to date with awareness of cleanliness and diversity. InfoGrowth can improve data quality/efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Authors:
Kai Wang,
Yukun Zhou,
Mingjia Shi,
Zhihang Yuan,
Yuzhang Shang,
Xiaojiang Peng,
Hanwang Zhang,
Yang You
Abstract:
Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many…
▽ More
Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Flexible Motion In-betweening with Diffusion Models
Authors:
Setareh Cohan,
Guy Tevet,
Daniele Reda,
Xue Bin Peng,
Michiel van de Panne
Abstract:
Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous inbetweening methods, we propose a s…
▽ More
Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous inbetweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI) which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.
△ Less
Submitted 23 May, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
Automatic News Generation and Fact-Checking System Based on Language Processing
Authors:
Xirui Peng,
Qiming Xu,
Zheng Feng,
Haopeng Zhao,
Lianghao Tan,
Yan Zhou,
Zecheng Zhang,
Chenwei Gong,
Yingqiao Zheng
Abstract:
This paper explores an automatic news generation and fact-checking system based on language processing, aimed at enhancing the efficiency and quality of news production while ensuring the authenticity and reliability of the news content. With the rapid development of Natural Language Processing (NLP) and deep learning technologies, automatic news generation systems are capable of extracting key in…
▽ More
This paper explores an automatic news generation and fact-checking system based on language processing, aimed at enhancing the efficiency and quality of news production while ensuring the authenticity and reliability of the news content. With the rapid development of Natural Language Processing (NLP) and deep learning technologies, automatic news generation systems are capable of extracting key information from massive data and generating well-structured, fluent news articles. Meanwhile, by integrating fact-checking technology, the system can effectively prevent the spread of false news and improve the accuracy and credibility of news. This study details the key technologies involved in automatic news generation and factchecking, including text generation, information extraction, and the application of knowledge graphs, and validates the effectiveness of these technologies through experiments. Additionally, the paper discusses the future development directions of automatic news generation and fact-checking systems, emphasizing the importance of further integration and innovation of technologies. The results show that with continuous technological optimization and practical application, these systems will play an increasingly important role in the future news industry, providing more efficient and reliable news services.
△ Less
Submitted 20 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Dim Small Target Detection and Tracking: A Novel Method Based on Temporal Energy Selective Scaling and Trajectory Association
Authors:
Weihua Gao,
Wenlong Niu,
Wenlong Lu,
Pengcheng Wang,
Zhaoyuan Qi,
Xiaodong Peng,
Zhen Yang
Abstract:
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most of the previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty based on spatial features an…
▽ More
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most of the previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty based on spatial features and the feasibility based on temporal features of realizing effective detection. According to this analysis, we use a multi-frame as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigated the composition of intensity temporal profiles (ITPs) formed by pixels on a multi-frame detection unit. For the target-present pixel, the target passing through the pixel will bring a weak transient disturbance on the ITP and introduce a change in the statistical properties of ITP. We use a well-designed function to amplify the transient disturbance, suppress the background and noise components, and output the trajectory of the target on the multi-frame detection unit. Subsequently, to solve the contradiction between the detection rate and the false alarm rate brought by the traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Compared with the various state-of-the-art detection and tracking methods, experiments in multiple scenarios prove the superiority of our proposed methods.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
A Mixture-of-Experts Approach to Few-Shot Task Transfer in Open-Ended Text Worlds
Authors:
Christopher Z. Cui,
Xiangyu Peng,
Mark O. Riedl
Abstract:
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signal. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a pr…
▽ More
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signal. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a priori known tasks are combined into a Mixture-of-Experts model with an attention mechanism across a mix of frozen and unfrozen experts. The model learns when to attend to frozen task-specific experts when appropriate and learns new experts to handle novel situations. We work in an open-ended text-based environment in which the agent is tasked with behaving like different types of character roles and must rapidly learn behaviors associated with new character role types. We show that our agent both obtains more rewards in the zero-shot setting, and discovers these rewards with greater sample efficiency in the few-shot learning settings.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
DiffuseLoco: Real-Time Legged Locomotion Control with Diffusion from Offline Datasets
Authors:
Xiaoyu Huang,
Yufeng Chi,
Ruofeng Wang,
Zhongyu Li,
Xue Bin Peng,
Sophia Shao,
Borivoje Nikolic,
Koushil Sreenath
Abstract:
This work introduces DiffuseLoco, a framework for training multi-skill diffusion-based policies for dynamic legged locomotion from offline datasets, enabling real-time control of diverse skills on robots in the real world. Offline learning at scale has led to breakthroughs in computer vision, natural language processing, and robotic manipulation domains. However, scaling up learning for legged rob…
▽ More
This work introduces DiffuseLoco, a framework for training multi-skill diffusion-based policies for dynamic legged locomotion from offline datasets, enabling real-time control of diverse skills on robots in the real world. Offline learning at scale has led to breakthroughs in computer vision, natural language processing, and robotic manipulation domains. However, scaling up learning for legged robot locomotion, especially with multiple skills in a single policy, presents significant challenges for prior online reinforcement learning methods. To address this challenge, we propose a novel, scalable framework that leverages diffusion models to directly learn from offline multimodal datasets with a diverse set of locomotion skills. With design choices tailored for real-time control in dynamical systems, including receding horizon control and delayed inputs, DiffuseLoco is capable of reproducing multimodality in performing various locomotion skills, zero-shot transfer to real quadrupedal robots, and it can be deployed on edge computing devices. Furthermore, DiffuseLoco demonstrates free transitions between skills and robustness against environmental variations. Through extensive benchmarking in real-world experiments, DiffuseLoco exhibits better stability and velocity tracking performance compared to prior reinforcement learning and non-diffusion-based behavior cloning baselines. The design choices are validated via comprehensive ablation studies. This work opens new possibilities for scaling up learning-based legged locomotion controllers through the scaling of large, expressive models and diverse offline datasets.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Multimodal Fusion on Low-quality Data: A Comprehensive Survey
Authors:
Qingyang Zhang,
Yake Wei,
Zongbo Han,
Huazhu Fu,
Xi Peng,
Cheng Deng,
Qinghua Hu,
Cai Xu,
Jie Wen,
Di Hu,
Changqing Zhang
Abstract:
Multimodal fusion focuses on integrating information from multiple modalities with the goal of more accurate prediction, which has achieved remarkable progress in a wide range of scenarios, including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion remains largely unexplored especially under low-quality data settings. This paper surveys the common challenges…
▽ More
Multimodal fusion focuses on integrating information from multiple modalities with the goal of more accurate prediction, which has achieved remarkable progress in a wide range of scenarios, including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion remains largely unexplored especially under low-quality data settings. This paper surveys the common challenges and recent advances of multimodal fusion in the wild and presents them in a comprehensive taxonomy. From a data-centric view, we identify four main challenges that are faced by multimodal fusion on low-quality data, namely (1) noisy multimodal data that are contaminated with heterogeneous noises, (2) incomplete multimodal data that some modalities are missing, (3) imbalanced multimodal data that the qualities or properties of different modalities are significantly different and (4) quality-varying multimodal data that the quality of each modality dynamically changes with respect to different samples. This new taxonomy will enable researchers to understand the state of the field and identify several potential directions. We also provide discussion for the open problems in this field together with interesting future research directions.
△ Less
Submitted 5 May, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
-
MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis
Authors:
Xiang Li,
Zhi-Qi Cheng,
Jun-Yan He,
Xiaojiang Peng,
Alexander G. Hauptmann
Abstract:
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-…
▽ More
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
△ Less
Submitted 28 April, 2024;
originally announced April 2024.
-
Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer
Authors:
Xinpeng Li,
Teng Wang,
Jian Zhao,
Shuyi Mao,
**bao Wang,
Feng Zheng,
Xiaojiang Peng,
Xuelong Li
Abstract:
Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects by off-the-shelf detectors, then perform emotion classification through the late fusion of subject and context features. However, the complicated paradigm suffers from disjoint t…
▽ More
Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects by off-the-shelf detectors, then perform emotion classification through the late fusion of subject and context features. However, the complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject-context elements. To address the challenge, we present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT), for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a decouple-then-fuse manner. The decoupled query token--subject queries and context queries--gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameter numbers, achieving a 3.39% accuracy improvement and a 6.46% average precision gain on CAER-S and EMOTIC datasets, respectively.
△ Less
Submitted 28 April, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
Player-Driven Emergence in LLM-Driven Game Narrative
Authors:
Xiangyu Peng,
Jessica Quaye,
Sudha Rao,
Weijia Xu,
Portia Botchway,
Chris Brockett,
Nebojsa Jojic,
Gabriel DesGarennes,
Ken Lobb,
Michael Xu,
Jorge Leandro,
Claire **,
Bill Dolan
Abstract:
We explore how interaction with large language models (LLMs) can give rise to emergent behaviors, empowering players to participate in the evolution of game narratives. Our testbed is a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise, but can freely interact with non-player characters generated by GPT-4, a large language model. We recruit 28 gamers t…
▽ More
We explore how interaction with large language models (LLMs) can give rise to emergent behaviors, empowering players to participate in the evolution of game narratives. Our testbed is a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise, but can freely interact with non-player characters generated by GPT-4, a large language model. We recruit 28 gamers to play the game and use GPT-4 to automatically convert the game logs into a node-graph representing the narrative in the player's gameplay. We find that through their interactions with the non-deterministic behavior of the LLM, players are able to discover interesting new emergent nodes that were not a part of the original narrative but have potential for being fun and engaging. Players that created the most emergent nodes tended to be those that often enjoy games that facilitate discovery, exploration and experimentation.
△ Less
Submitted 3 June, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
LEAF: Unveiling Two Sides of the Same Coin in Semi-supervised Facial Expression Recognition
Authors:
Fan Zhang,
Zhi-Qi Cheng,
Jian Zhao,
Xiaojiang Peng,
Xuelong Li
Abstract:
Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition (FER) task. However, current state-of-the-art methods primarily focus on one side of the coin, i.e., generating high-quality pseudo-labels, while overlooking the other side: enhancing expression-relevant representations. In this paper, we unveil both sides of the…
▽ More
Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition (FER) task. However, current state-of-the-art methods primarily focus on one side of the coin, i.e., generating high-quality pseudo-labels, while overlooking the other side: enhancing expression-relevant representations. In this paper, we unveil both sides of the coin by proposing a unified framework termed hierarchicaL dEcoupling And Fusing (LEAF) to coordinate expression-relevant representations and pseudo-labels for semi-supervised FER. LEAF introduces a hierarchical expression-aware aggregation strategy that operates at three levels: semantic, instance, and category. (1) At the semantic and instance levels, LEAF decouples representations into expression-agnostic and expression-relevant components, and adaptively fuses them using learnable gating weights. (2) At the category level, LEAF assigns ambiguous pseudo-labels by decoupling predictions into positive and negative parts, and employs a consistency loss to ensure agreement between two augmented views of the same image. Extensive experiments on benchmark datasets demonstrate that by unveiling and harmonizing both sides of the coin, LEAF outperforms state-of-the-art semi-supervised FER methods, effectively leveraging both labeled and unlabeled data. Moreover, the proposed expression-aware aggregation strategy can be seamlessly integrated into existing semi-supervised frameworks, leading to significant performance gains. Our code is available at https://anonymous.4open.science/r/LEAF-BC57/.
△ Less
Submitted 26 April, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
Functionality Locality, Mixture & Control = Logic = Memory
Authors:
Xiangjun Peng
Abstract:
This work provides new insights and constructs to the field of computer architecture and systems, and these insights are expected to be useful for the broad software stack. First, this work introduces Functionality Locality: this form of Functionality Locality shows that functionalities can be changed with a single piece of information, by solely changing the access order. This broadens the scope…
▽ More
This work provides new insights and constructs to the field of computer architecture and systems, and these insights are expected to be useful for the broad software stack. First, this work introduces Functionality Locality: this form of Functionality Locality shows that functionalities can be changed with a single piece of information, by solely changing the access order. This broadens the scope of ``principle of locality", which originally includes spatial and temporal locality. Second, this work coins the term Mixture, by incorporating the layout-directed functionalities with the original quantifiers such as scalar and vector. The implications of Mixture significantly expands new understanding of quantifiers, and this work identifies several important ones (from the author perspective). Third, with Functionality and Mixture, this work identifies the principle ``Control = Logic = Memory", and provides a revisit to Von Neumann architectures and Harvard architectures. This centers the focus on the memory, and brings further guidelines on memory-centric architectures with a new analytic framework. Fourth, this work discusses several important implications from this work in a variety of aspects.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Generating Human Interaction Motions in Scenes with Text Control
Authors:
Hongwei Yi,
Justus Thies,
Michael J. Black,
Xue Bin Peng,
Davis Rempe
Abstract:
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model,…
▽ More
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at https://research.nvidia.com/labs/toronto-ai/tesmo.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Efficient sEMG-based Cross-Subject Joint Angle Estimation via Hierarchical Spiking Attentional Feature Decomposition Network
Authors:
Xin Zhou,
Chuang Lin,
Can Wang,
Xiaojiang Peng
Abstract:
Surface electromyography (sEMG) has demonstrated significant potential in simultaneous and proportional control (SPC). However, existing algorithms for predicting joint angles based on sEMG often suffer from high inference costs or are limited to specific subjects rather than cross-subject scenarios. To address these challenges, we introduced a hierarchical Spiking Attentional Feature Decompositio…
▽ More
Surface electromyography (sEMG) has demonstrated significant potential in simultaneous and proportional control (SPC). However, existing algorithms for predicting joint angles based on sEMG often suffer from high inference costs or are limited to specific subjects rather than cross-subject scenarios. To address these challenges, we introduced a hierarchical Spiking Attentional Feature Decomposition Network (SAFE-Net). This network initially compresses sEMG signals into neural spiking forms using a Spiking Sparse Attention Encoder (SSAE). Subsequently, the compressed features are decomposed into kinematic and biological features through a Spiking Attentional Feature Decomposition (SAFD) module. Finally, the kinematic and biological features are used to predict joint angles and identify subject identities, respectively. Our validation on two datasets (SIAT-DB1 and SIAT-DB2) and comparison with two existing methods, Informer and Spikformer, demonstrate that SSAE achieves significant power consumption savings of 39.1% and 37.5% respectively over them in terms of inference costs. Furthermore, SAFE-Net surpasses Informer and Spikformer in recognition accuracy on both datasets. This study underscores the potential of SAFE-Net to advance the field of SPC in lower limb rehabilitation exoskeleton robots.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations
Authors:
Yiming Li,
Xueqing Peng,
Jianfu Li,
Xu Zuo,
Suyuan Peng,
Donghong Pei,
Cui Tao,
Hua Xu,
Na Hong
Abstract:
In acupuncture therapy, the accurate location of acupoints is essential for its effectiveness. The advanced language understanding capabilities of large language models (LLMs) like Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations from textual knowledge sources. This study aims to compare the performance of GPT with t…
▽ More
In acupuncture therapy, the accurate location of acupoints is essential for its effectiveness. The advanced language understanding capabilities of large language models (LLMs) like Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations from textual knowledge sources. This study aims to compare the performance of GPT with traditional deep learning models (Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT)) in extracting acupoint-related location relations and assess the impact of pretraining and fine-tuning on GPT's performance. We utilized the World Health Organization Standard Acupuncture Point Locations in the Western Pacific Region (WHO Standard) as our corpus, which consists of descriptions of 361 acupoints. Five types of relations ('direction_of,' 'distance_of,' 'part_of,' 'near_acupoint,' and 'located_near') (n= 3,174) between acupoints were annotated. Five models were compared: BioBERT, LSTM, pre-trained GPT-3.5, fine-tuned GPT-3.5, as well as pre-trained GPT-4. Performance metrics included micro-average exact match precision, recall, and F1 scores. Our results demonstrate that fine-tuned GPT-3.5 consistently outperformed other models in F1 scores across all relation types. Overall, it achieved the highest micro-average F1 score of 0.92. This study underscores the effectiveness of LLMs like GPT in extracting relations related to acupoint locations, with implications for accurately modeling acupuncture knowledge and promoting standard implementation in acupuncture training and practice. The findings also contribute to advancing informatics applications in traditional and complementary medicine, showcasing the potential of LLMs in natural language processing.
△ Less
Submitted 14 April, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
Unlocking Parameter-Efficient Fine-Tuning for Low-Resource Language Translation
Authors:
Tong Su,
Xin Peng,
Sarubi Thillainathan,
David Guzmán,
Surangika Ranathunga,
En-Shiun Annie Lee
Abstract:
Parameter-efficient fine-tuning (PEFT) methods are increasingly vital in adapting large-scale pre-trained language models for diverse tasks, offering a balance between adaptability and computational efficiency. They are important in Low-Resource Language (LRL) Neural Machine Translation (NMT) to enhance translation accuracy with minimal resources. However, their practical effectiveness varies sign…
▽ More
Parameter-efficient fine-tuning (PEFT) methods are increasingly vital in adapting large-scale pre-trained language models for diverse tasks, offering a balance between adaptability and computational efficiency. They are important in Low-Resource Language (LRL) Neural Machine Translation (NMT) to enhance translation accuracy with minimal resources. However, their practical effectiveness varies significantly across different languages. We conducted comprehensive empirical experiments with varying LRL domains and sizes to evaluate the performance of 8 PEFT methods with in total of 15 architectures using the SacreBLEU score. We showed that 6 PEFT architectures outperform the baseline for both in-domain and out-domain tests and the Houlsby+Inversion adapter has the best performance overall, proving the effectiveness of PEFT methods.
△ Less
Submitted 5 April, 2024;
originally announced April 2024.
-
MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models
Authors:
Zebang Cheng,
Fuqiang Niu,
Yuxiang Lin,
Zhi-Qi Cheng,
Bowen Zhang,
Xiaojiang Peng
Abstract:
This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveragin…
▽ More
This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveraging modality-specific features for enhanced emotion understanding and causality inference. Experimental evaluation demonstrates the advantages of our multimodal approach, with our submission achieving a competitive weighted F1 score of 0.3435, ranking third with a margin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team. Project: https://github.com/MIPS-COLT/MER-MCE.git
△ Less
Submitted 11 April, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Value, Representation, Information and Communication
Authors:
Xiangjun Peng
Abstract:
A new analytic framework is first formalized via the usage of the Monadology (Leibniz 1898), to expand the understanding of Zermelo-Fraenkel-choice set theory (ZFC) and Von Neumann-Bernays-Godel set theory (NBG). Implicitly, the framework levels value, representation and information separately. Given the fact that there exists a coincidental equivalence between Von Neumann universe and originally-…
▽ More
A new analytic framework is first formalized via the usage of the Monadology (Leibniz 1898), to expand the understanding of Zermelo-Fraenkel-choice set theory (ZFC) and Von Neumann-Bernays-Godel set theory (NBG). Implicitly, the framework levels value, representation and information separately. Given the fact that there exists a coincidental equivalence between Von Neumann universe and originally-formalized motivation in ZFC, this work hypothesizes the essential of ordered values for one monand, to carry out efficient communication with the rest. This work then focuses on the relationship among values, representation and information (and suggests potential methods for quantitative analysis). First, this framework generalizes the definition of values and representations from "Indexes approximate Values" principle by (Peng 2023) via surreal numbers (Knuth 1974). Second, credited to surreal numbers, this work recursively connects representations and information via subsets of sets. Therefore, the definition to metric space(s) is naturally formed by representations, and quantitative methods (e.g., Hausdorff Distance) can be applied for quantitative analysis among (sub)sets. Third, this framework conjectures that: as long as the metric space is (or can be formed as) complete, the existence tests can be performed via Cauchy Sequence (or its generalized methods). This work finally revisits the communication theory, and suggests new perspectives from the new analytic framework. Particularly, this work hypothesizes a (quantitative) relationship between values and representation, and conjectures that: the optimal construction of representations exists, and it can be derived as the core value of one monad via Cauchy Inequality (or its generalized methods).
△ Less
Submitted 30 March, 2024;
originally announced April 2024.
-
PointCloud-Text Matching: Benchmark Datasets and a Baseline
Authors:
Yanglin Feng,
Yang Qin,
Dezhong Peng,
Hongyuan Zhu,
Xi Peng,
Peng Hu
Abstract:
In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching~(PTM), which aims to find the exact cross-modal instance that matches a given point-cloud query or text query. PTM could be applied to various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there exists no suitable and targeted dataset for PTM in practice. Therefore,…
▽ More
In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching~(PTM), which aims to find the exact cross-modal instance that matches a given point-cloud query or text query. PTM could be applied to various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there exists no suitable and targeted dataset for PTM in practice. Therefore, we construct three new PTM benchmark datasets, namely 3D2T-SR, 3D2T-NR, and 3D2T-QA. We observe that the data is challenging and with noisy correspondence due to the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which make existing cross-modal matching methods ineffective for PTM. To tackle these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL divides negative pairs, which are much less error-prone than positive pairs, into clean and noisy subsets, and assigns them forward and reverse optimization directions respectively, thus enhancing robustness against noisy correspondence. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
Invisible Gas Detection: An RGB-Thermal Cross Attention Network and A New Benchmark
Authors:
Jue Wang,
Yuxiang Lin,
Qi Zhao,
Dong Luo,
Shuaibao Chen,
Wei Chen,
Xiaojiang Peng
Abstract:
The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the lo…
▽ More
The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate the research of invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images with eight variant collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods, surpassing single-stream SOTA models in terms of accuracy, Intersection of Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data will be made available soon.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
CPA-Enhancer: Chain-of-Thought Prompted Adaptive Enhancer for Object Detection under Unknown Degradations
Authors:
Yuwei Zhang,
Yan Wu,
Yanming Liu,
Xinyue Peng
Abstract:
Object detection methods under known single degradations have been extensively investigated. However, existing approaches require prior knowledge of the degradation type and train a separate model for each, limiting their practical applications in unpredictable environments. To address this challenge, we propose a chain-of-thought (CoT) prompted adaptive enhancer, CPA-Enhancer, for object detectio…
▽ More
Object detection methods under known single degradations have been extensively investigated. However, existing approaches require prior knowledge of the degradation type and train a separate model for each, limiting their practical applications in unpredictable environments. To address this challenge, we propose a chain-of-thought (CoT) prompted adaptive enhancer, CPA-Enhancer, for object detection under unknown degradations. Specifically, CPA-Enhancer progressively adapts its enhancement strategy under the step-by-step guidance of CoT prompts, that encode degradation-related information. To the best of our knowledge, it's the first work that exploits CoT prompting for object detection tasks. Overall, CPA-Enhancer is a plug-and-play enhancement model that can be integrated into any generic detectors to achieve substantial gains on degraded images, without knowing the degradation type priorly. Experimental results demonstrate that CPA-Enhancer not only sets the new state of the art for object detection but also boosts the performance of other downstream vision tasks under unknown degradations.
△ Less
Submitted 22 March, 2024; v1 submitted 17 March, 2024;
originally announced March 2024.
-
A Challenge Dataset and Effective Models for Conversational Stance Detection
Authors:
Fuqiang Niu,
Min Yang,
Ang Li,
Baoquan Zhang,
Xiaojiang Peng,
Bowen Zhang
Abstract:
Previous stance detection studies typically concentrate on evaluating stances within individual instances, thereby exhibiting limitations in effectively modeling multi-party discussions concerning the same specific topic, as naturally transpire in authentic social media interactions. This constraint arises primarily due to the scarcity of datasets that authentically replicate real social media con…
▽ More
Previous stance detection studies typically concentrate on evaluating stances within individual instances, thereby exhibiting limitations in effectively modeling multi-party discussions concerning the same specific topic, as naturally transpire in authentic social media interactions. This constraint arises primarily due to the scarcity of datasets that authentically replicate real social media contexts, hindering the research progress of conversational stance detection. In this paper, we introduce a new multi-turn conversation stance detection dataset (called \textbf{MT-CSD}), which encompasses multiple targets for conversational stance detection. To derive stances from this challenging dataset, we propose a global-local attention network (\textbf{GLAN}) to address both long and short-range dependencies inherent in conversational data. Notably, even state-of-the-art stance detection methods, exemplified by GLAN, exhibit an accuracy of only 50.47\%, highlighting the persistent challenges in conversational stance detection. Furthermore, our MT-CSD dataset serves as a valuable resource to catalyze advancements in cross-domain stance detection, where a classifier is adapted from a different yet related target. We believe that MT-CSD will contribute to advancing real-world applications of stance detection research. Our source code, data, and models are available at \url{https://github.com/nfq729/MT-CSD}.
△ Less
Submitted 21 March, 2024; v1 submitted 17 March, 2024;
originally announced March 2024.
-
An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
Authors:
Yuxin Tian,
Mouxing Yang,
Yunfan Li,
Dayiheng Liu,
Xingzhang Ren,
Xi Peng,
Jiancheng Lv
Abstract:
Recent studies applied Parameter Efficient Fine-Tuning techniques (PEFTs) to efficiently narrow the performance gap between pre-training and downstream. There are two important factors for various PEFTs, namely, the accessible data size and fine-tunable parameter size. A natural expectation for PEFTs is that the performance of various PEFTs is positively related to the data size and fine-tunable p…
▽ More
Recent studies applied Parameter Efficient Fine-Tuning techniques (PEFTs) to efficiently narrow the performance gap between pre-training and downstream. There are two important factors for various PEFTs, namely, the accessible data size and fine-tunable parameter size. A natural expectation for PEFTs is that the performance of various PEFTs is positively related to the data size and fine-tunable parameter size. However, according to the evaluation of five PEFTs on two downstream vision-language (VL) tasks, we find that such an intuition holds only if the downstream data and task are not consistent with pre-training. For downstream fine-tuning consistent with pre-training, data size no longer affects the performance, while the influence of fine-tunable parameter size is not monotonous. We believe such an observation could guide the choice of training strategy for various PEFTs.
△ Less
Submitted 18 May, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Feint in Multi-Player Games
Authors:
Junyu Liu,
Wangkai **,
Xiangjun Peng
Abstract:
This paper introduces the first formalization, implementation and quantitative evaluation of Feint in Multi-Player Games. Our work first formalizes Feint from the perspective of Multi-Player Games, in terms of the temporal, spatial, and their collective impacts. The formalization is built upon Non-transitive Active Markov Game Model, where Feint can have a considerable amount of impacts. Then, our…
▽ More
This paper introduces the first formalization, implementation and quantitative evaluation of Feint in Multi-Player Games. Our work first formalizes Feint from the perspective of Multi-Player Games, in terms of the temporal, spatial, and their collective impacts. The formalization is built upon Non-transitive Active Markov Game Model, where Feint can have a considerable amount of impacts. Then, our work considers practical implementation details of Feint in Multi-Player Games, under the state-of-the-art progress of multi-agent modeling to date (namely Multi-Agent Reinforcement Learning). Finally, our work quantitatively examines the effectiveness of our design, and the results show that our design of Feint can (1) greatly improve the reward gains from the game; (2) significantly improve the diversity of Multi-Player Games; and (3) only incur negligible overheads in terms of time consumption. We conclude that our design of Feint is effective and practical, to make Multi-Player Games more interesting.
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
Formalizing Feint Actions, and Example Studies in Two-Player Games
Authors:
Junyu Liu,
Wangkai **,
Xiangjun Peng
Abstract:
Feint actions refer to a set of deceptive actions, which enable players to obtain temporal advantages from their opponents. Such actions are regarded as widely-used tactic in most non-deterministic Two-player Games (e.g. boxing and fencing). However, existing literature does not provide comprehensive and concrete formalization on Feint actions, and their implications on Two-Player Games. We argue…
▽ More
Feint actions refer to a set of deceptive actions, which enable players to obtain temporal advantages from their opponents. Such actions are regarded as widely-used tactic in most non-deterministic Two-player Games (e.g. boxing and fencing). However, existing literature does not provide comprehensive and concrete formalization on Feint actions, and their implications on Two-Player Games. We argue that a full exploration on Feint actions is of great importance towards more realistic Two-player Games. In this paper, we provide the first comprehensive and concrete formalization of Feint actions. The key idea of our work is to (1) allow automatic generation of Feint actions, via our proposed Palindrome-directed Generation of Feint actions; and (2) provide concrete principles to properly combine Feint and attack actions. Based on our formalization of Feint actions, we also explore the implications on the game strategy model, and provide optimizations to better incorporate Feint actions. Our experimental results shows that accounting for Feint actions in Non-Deterministic Games (1) brings overall benefits to the game design; and (2) has great benefits on on either game animations or strategy designs, which also introduces a great extent of randomness into randomness-demanded Game models.
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation
Authors:
Yanming Liu,
Xinyue Peng,
Jiannan Cao,
Le Dai,
Xingzu Liu,
Ruilin Nong,
Weihao Liu
Abstract:
Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of stri…
▽ More
Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of strict on-devices computation and memory constraints. Compared with other on-devices seq2seq generation, SPA could make a fast and stable inference on low-resource constraints, allowing it to obtain cost effiency. Our method establish an interaction between a pretrained LLMs on-cloud and additive parameters on-devices, which could provide the knowledge on both pretrained LLMs and featured personal feature. Further more, SPA provides a framework to keep feature-base parameters on low computational devices while leave the parameters containing general information on the high computational devices.
△ Less
Submitted 20 June, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis
Authors:
Yanming Liu,
Xinyue Peng,
Tianyu Du,
Jianwei Yin,
Weihao Liu,
Xuhong Zhang
Abstract:
Large language models (LLMs) have achieved commendable accomplishments in various natural language processing tasks. However, LLMs still encounter significant challenges when dealing with complex scenarios involving multiple entities. These challenges arise from the presence of implicit relationships that demand multi-step reasoning. In this paper, we propose a novel approach ERA-CoT, which aids L…
▽ More
Large language models (LLMs) have achieved commendable accomplishments in various natural language processing tasks. However, LLMs still encounter significant challenges when dealing with complex scenarios involving multiple entities. These challenges arise from the presence of implicit relationships that demand multi-step reasoning. In this paper, we propose a novel approach ERA-CoT, which aids LLMs in understanding context by capturing relationships between entities and supports the reasoning of diverse tasks through Chain-of-Thoughts (CoT). Experimental results show that ERA-CoT demonstrates the superior performance of our proposed method compared to current CoT prompting methods, achieving a significant improvement of an average of 5.1\% on GPT3.5 compared to previous SOTA baselines. Our analysis indicates that ERA-CoT increases the LLM's understanding of entity relationships, significantly improves the accuracy of question answering, and enhances the reasoning ability of LLMs.
△ Less
Submitted 6 June, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
Authors:
Yanming Liu,
Xinyue Peng,
Xuhong Zhang,
Weihao Liu,
Jianwei Yin,
Jiannan Cao,
Tianyu Du
Abstract:
Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn't previously by retrieving knowledge relevant to th…
▽ More
Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn't previously by retrieving knowledge relevant to the query. This approach improves performance in certain scenarios for specific tasks. However, if irrelevant texts are retrieved, it may impair model performance. In this paper, we propose Retrieval Augmented Iterative Self-Feedback (RA-ISF), a framework that iteratively decomposes tasks and processes them in three submodules to enhance the model's problem-solving capabilities. Experiments show that our method outperforms existing benchmarks, performing well on models like GPT3.5, Llama2, significantly enhancing factual reasoning capabilities and reducing hallucinations.
△ Less
Submitted 6 June, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation
Authors:
Xiaobin Hu,
Xu Peng,
Donghao Luo,
Xiaozhong Ji,
**long Peng,
Zhengkai Jiang,
Jiangning Zhang,
Taisong **,
Chengjie Wang,
Rongrong Ji
Abstract:
Due to the difficulty and labor-consuming nature of getting highly accurate or matting annotations, there only exists a limited amount of highly accurate labels available to the public. To tackle this challenge, we propose a DiffuMatting which inherits the strong Everything generation ability of diffusion and endows the power of "matting anything". Our DiffuMatting can 1). act as an anything matti…
▽ More
Due to the difficulty and labor-consuming nature of getting highly accurate or matting annotations, there only exists a limited amount of highly accurate labels available to the public. To tackle this challenge, we propose a DiffuMatting which inherits the strong Everything generation ability of diffusion and endows the power of "matting anything". Our DiffuMatting can 1). act as an anything matting factory with high accurate annotations 2). be well-compatible with community LoRAs or various conditional control approaches to achieve the community-friendly art design and controllable generation. Specifically, inspired by green-screen-matting, we aim to teach the diffusion model to paint on a fixed green screen canvas. To this end, a large-scale greenscreen dataset (Green100K) is collected as a training dataset for DiffuMatting. Secondly, a green background control loss is proposed to keep the drawing board as a pure green color to distinguish the foreground and background. To ensure the synthesized object has more edge details, a detailed-enhancement of transition boundary loss is proposed as a guideline to generate objects with more complicated edge structures. Aiming to simultaneously generate the object and its matting annotation, we build a matting head to make a green color removal in the latent space of the VAE decoder. Our DiffuMatting shows several potential applications (e.g., matting-data generator, community-friendly art design and controllable generation). As a matting-data generator, DiffuMatting synthesizes general object and portrait matting sets, effectively reducing the relative MSE error by 15.4% in General Object Matting and 11.4% in Portrait Matting tasks.
△ Less
Submitted 10 March, 2024;
originally announced March 2024.
-
Self-supervised Photographic Image Layout Representation Learning
Authors:
Zhaoran Zhao,
Peng Lu,
Xujun Peng,
Wenhao Guo
Abstract:
In the domain of image layout representation learning, the critical process of translating image layouts into succinct vector forms is increasingly significant across diverse applications, such as image retrieval, manipulation, and generation. Most approaches in this area heavily rely on costly labeled datasets and notably lack in adapting their modeling and learning methods to the specific nuance…
▽ More
In the domain of image layout representation learning, the critical process of translating image layouts into succinct vector forms is increasingly significant across diverse applications, such as image retrieval, manipulation, and generation. Most approaches in this area heavily rely on costly labeled datasets and notably lack in adapting their modeling and learning methods to the specific nuances of photographic image layouts. This shortfall makes the learning process for photographic image layouts suboptimal. In our research, we directly address these challenges. We innovate by defining basic layout primitives that encapsulate various levels of layout information and by map** these, along with their interconnections, onto a heterogeneous graph structure. This graph is meticulously engineered to capture the intricate layout information within the pixel domain explicitly. Advancing further, we introduce novel pretext tasks coupled with customized loss functions, strategically designed for effective self-supervised learning of these layout graphs. Building on this foundation, we develop an autoencoder-based network architecture skilled in compressing these heterogeneous layout graphs into precise, dimensionally-reduced layout representations. Additionally, we introduce the LODB dataset, which features a broader range of layout categories and richer semantics, serving as a comprehensive benchmark for evaluating the effectiveness of layout representation learning methods. Our extensive experimentation on this dataset demonstrates the superior performance of our approach in the realm of photographic image layout representation learning.
△ Less
Submitted 6 March, 2024;
originally announced March 2024.
-
A general approach to enhance the survivability of backdoor attacks by decision path coupling
Authors:
Yufei Zhao,
Dingji Wang,
Bihuan Chen,
Ziqian Chen,
Xin Peng
Abstract:
Backdoor attacks have been one of the emerging security threats to deep neural networks (DNNs), leading to serious consequences. One of the mainstream backdoor defenses is model reconstruction-based. Such defenses adopt model unlearning or pruning to eliminate backdoors. However, little attention has been paid to survive from such defenses. To bridge the gap, we propose Venom, the first generic ba…
▽ More
Backdoor attacks have been one of the emerging security threats to deep neural networks (DNNs), leading to serious consequences. One of the mainstream backdoor defenses is model reconstruction-based. Such defenses adopt model unlearning or pruning to eliminate backdoors. However, little attention has been paid to survive from such defenses. To bridge the gap, we propose Venom, the first generic backdoor attack enhancer to improve the survivability of existing backdoor attacks against model reconstruction-based defenses. We formalize Venom as a binary-task optimization problem. The first is the original backdoor attack task to preserve the original attack capability, while the second is the attack enhancement task to improve the attack survivability. To realize the second task, we propose attention imitation loss to force the decision path of poisoned samples in backdoored models to couple with the crucial decision path of benign samples, which makes backdoors difficult to eliminate. Our extensive evaluation on two DNNs and three datasets has demonstrated that Venom significantly improves the survivability of eight state-of-the-art attacks against eight state-of-the-art defenses without impacting the capability of the original attacks.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection
Authors:
Chenchen Tao,
Chong Wang,
Yuexian Zou,
Xiaohao Peng,
Jiafei Wu,
Jiangbo Qian
Abstract:
Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. The ambiguous nature of anomaly definitions across contexts introduces bias in detecting abnormal and normal snippets within the abnormal bag. Taking the first step to show the model why it is anomalous, a…
▽ More
Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. The ambiguous nature of anomaly definitions across contexts introduces bias in detecting abnormal and normal snippets within the abnormal bag. Taking the first step to show the model why it is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected anomalous events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (82.6\%, 87.7\%, 93.1\%, and 97.4\%). Furthermore, it shows promising performance in open-set and cross-dataset cases.
△ Less
Submitted 2 March, 2024;
originally announced March 2024.
-
Dose Prediction Driven Radiotherapy Paramters Regression via Intra- and Inter-Relation Modeling
Authors:
Jiaqi Cui,
Yuanyuan Xu,
Jianghong Xiao,
Yuchen Fei,
Jiliu Zhou,
Xingcheng Peng,
Yan Wang
Abstract:
Deep learning has facilitated the automation of radiotherapy by predicting accurate dose distribution maps. However, existing methods fail to derive the desirable radiotherapy parameters that can be directly input into the treatment planning system (TPS), impeding the full automation of radiotherapy. To enable more thorough automatic radiotherapy, in this paper, we propose a novel two-stage framew…
▽ More
Deep learning has facilitated the automation of radiotherapy by predicting accurate dose distribution maps. However, existing methods fail to derive the desirable radiotherapy parameters that can be directly input into the treatment planning system (TPS), impeding the full automation of radiotherapy. To enable more thorough automatic radiotherapy, in this paper, we propose a novel two-stage framework to directly regress the radiotherapy parameters, including a dose map prediction stage and a radiotherapy parameters regression stage. In stage one, we combine transformer and convolutional neural network (CNN) to predict realistic dose maps with rich global and local information, providing accurate dosimetric knowledge for the subsequent parameters regression. In stage two, two elaborate modules, i.e., an intra-relation modeling (Intra-RM) module and an inter-relation modeling (Inter-RM) module, are designed to exploit the organ-specific and organ-shared features for precise parameters regression. Experimental results on a rectal cancer dataset demonstrate the effectiveness of our method.
△ Less
Submitted 29 February, 2024;
originally announced February 2024.
-
Adjusting Dynamics of Hopfield Neural Network via Time-variant Stimulus
Authors:
Xuenan Peng,
Chengqing Li,
Yicheng Zeng,
Chun-Lai Li
Abstract:
As a paradigmatic model for nonlinear dynamics studies, the Hopfield Neural Network (HNN) demonstrates a high susceptibility to external disturbances owing to its intricate structure. This paper delves into the challenge of modulating HNN dynamics through time-variant stimuli. The effects of adjustments using two distinct types of time-variant stimuli, namely the Weight Matrix Stimulus (WMS) and t…
▽ More
As a paradigmatic model for nonlinear dynamics studies, the Hopfield Neural Network (HNN) demonstrates a high susceptibility to external disturbances owing to its intricate structure. This paper delves into the challenge of modulating HNN dynamics through time-variant stimuli. The effects of adjustments using two distinct types of time-variant stimuli, namely the Weight Matrix Stimulus (WMS) and the State Variable Stimulus (SVS), along with a Constant Stimulus (CS) are reported. The findings reveal that deploying four WMSs enables the HNN to generate either a four-scroll or a coexisting two-scroll attractor. When combined with one SVS, four WMSs can lead to the formation of an eight-scroll or four-scroll attractor, while the integration of four WMSs and multiple SVSs can induce grid-multi-scroll attractors. Moreover, the introduction of a CS and an SVS can significantly disrupt the dynamic behavior of the HNN. Consequently, suitable adjustment methods are crucial for enhancing the network's dynamics, whereas inappropriate applications can lead to the loss of its chaotic characteristics. To empirically validate these enhancement effects, the study employs an FPGA hardware platform. Subsequently, an image encryption scheme is designed to demonstrate the practical application benefits of the dynamically adjusted HNN in secure multimedia communication. This exploration into the dynamic modulation of HNN via time-variant stimuli offers insightful contributions to the advancement of secure communication technologies.
△ Less
Submitted 15 January, 2024;
originally announced February 2024.
-
Learning to See Through Dazzle
Authors:
Xiaopeng Peng,
Erin F. Fleet,
Abbie T. Watnik,
Grover A. Swartzlander
Abstract:
Machine vision is susceptible to laser dazzle, where intense laser light can blind and distort its perception of the environment through oversaturation or permanent damage to sensor pixels. Here we employ a wavefront-coded phase mask to diffuse the energy of laser light and introduce a sandwich generative adversarial network (SGAN) to restore images from complex image degradations, such as varying…
▽ More
Machine vision is susceptible to laser dazzle, where intense laser light can blind and distort its perception of the environment through oversaturation or permanent damage to sensor pixels. Here we employ a wavefront-coded phase mask to diffuse the energy of laser light and introduce a sandwich generative adversarial network (SGAN) to restore images from complex image degradations, such as varying laser-induced image saturation, mask-induced image blurring, unknown lighting conditions, and various noise corruptions. The SGAN architecture combines discriminative and generative methods by wrap** two GANs around a learnable image deconvolution module. In addition, we make use of Fourier feature representations to reduce the spectral bias of neural networks and improve its learning of high-frequency image details. End-to-end training includes the realistic physics-based synthesis of a large set of training data from publicly available images. We trained the SGAN to suppress the peak laser irradiance as high as $10^6$ times the sensor saturation threshold - the point at which camera sensors may experience damage without the mask. The trained model was evaluated on both a synthetic data set and data collected from the laboratory. The proposed image restoration model quantitatively and qualitatively outperforms state-of-the-art methods for a wide range of scene contents, laser powers, incident laser angles, ambient illumination strengths, and noise characteristics.
△ Less
Submitted 4 March, 2024; v1 submitted 24 February, 2024;
originally announced February 2024.
-
A Duality Analysis of Kernel Ridge Regression in the Noiseless Regime
Authors:
Jihao Long,
Xiaojun Peng,
Lei Wu
Abstract:
In this paper, we conduct a comprehensive analysis of generalization properties of Kernel Ridge Regression (KRR) in the noiseless regime, a scenario crucial to scientific computing, where data are often generated via computer simulations. We prove that KRR can attain the minimax optimal rate, which depends on both the eigenvalue decay of the associated kernel and the relative smoothness of target…
▽ More
In this paper, we conduct a comprehensive analysis of generalization properties of Kernel Ridge Regression (KRR) in the noiseless regime, a scenario crucial to scientific computing, where data are often generated via computer simulations. We prove that KRR can attain the minimax optimal rate, which depends on both the eigenvalue decay of the associated kernel and the relative smoothness of target functions. Particularly, when the eigenvalue decays exponentially fast, KRR achieves the spectral accuracy, i.e., a convergence rate faster than any polynomial. Moreover, the numerical experiments well corroborate our theoretical findings. Our proof leverages a novel extension of the duality framework introduced by Chen et al. (2023), which could be useful in analyzing kernel-based methods beyond the scope of this work.
△ Less
Submitted 23 February, 2024;
originally announced February 2024.