Search | arXiv e-print repository

Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning

Authors: Junzhi Chen, Juhao Liang, Benyou Wang

Abstract: The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks that are often comparable to human performance. Despite their capabilities, LLMs still encounter difficulties in completing tasks that require high levels of accuracy and complexity due to their inherent limitations in handling multifaceted problems single-handedly. This paper intro… ▽ More The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks that are often comparable to human performance. Despite their capabilities, LLMs still encounter difficulties in completing tasks that require high levels of accuracy and complexity due to their inherent limitations in handling multifaceted problems single-handedly. This paper introduces `Smurfs', a cutting-edge multi-agent framework designed to revolutionize the application of LLMs. By seamlessly transforming a conventional LLM into a synergistic multi-agent ensemble, Smurfs can enhance the model's ability to solve complex tasks at no additional cost. This is achieved through innovative prompting strategies that allocate distinct roles within the model, thereby facilitating collaboration among specialized agents and forming an intelligent multi-agent system. Our empirical investigation on both open-ended task of StableToolBench and closed-ended task on HotpotQA showcases Smurfs' superior capability in intricate tool utilization scenarios. Notably, Smurfs outmatches all the baseline methods in both experiments, setting new state-of-the-art performance. Furthermore, through comprehensive ablation studies, we dissect the contribution of the core components of the multi-agent framework to its overall efficacy. This not only verifies the effectiveness of the framework, but also sets a route for future exploration of multi-agent LLM systems. △ Less

Submitted 23 June, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.05757 [pdf, other]

Design and Implementation of Energy-Efficient Wireless Tire Sensing System with Delay Analysis for Intelligent Vehicles

Authors: Shashank Mishra, Jia-Ming Liang

Abstract: The growing prevalence of Internet of Things (IoT) technologies has led to a rise in the popularity of intelligent vehicles that incorporate a range of sensors to monitor various aspects, such as driving speed, fuel usage, distance proximity and tire anomalies. Nowadays, real-time tire sensing systems play important roles for intelligent vehicles in increasing mileage, reducing fuel consumption, i… ▽ More The growing prevalence of Internet of Things (IoT) technologies has led to a rise in the popularity of intelligent vehicles that incorporate a range of sensors to monitor various aspects, such as driving speed, fuel usage, distance proximity and tire anomalies. Nowadays, real-time tire sensing systems play important roles for intelligent vehicles in increasing mileage, reducing fuel consumption, improving driving safety, and reducing the potential for traffic accidents. However, the current tire sensing system drains a significant vehicle' energy and lacks effective collection of sensing data, which may not guarantee the immediacy of driving safety. Thus, this paper designs an energy-efficient wireless tire sensing system (WTSS), which leverages energy-saving techniques to significantly reduce power consumption while ensuring data retrieval delays during real-time monitoring. Additionally, we mathematically analyze the worst-case transmission delay of the system to ensure the immediacy based on the collision probabilities of sensor transmissions. This system has been implemented and verified by the simulation and field trial experiments. These results show that the proposed scheme provides enhanced performance in energy efficiency and accurately identifies the worst transmission delay. △ Less

Submitted 27 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.04859 [pdf, other]

Guarding Force: Safety-Critical Compliant Control for Robot-Environment Interaction

Authors: Xinming Wang, Jun Yang, Jianliang Mao, **zhuo Liang, Shihua Li, Yunda Yan

Abstract: In this study, we propose a safety-critical compliant control strategy designed to strictly enforce interaction force constraints during the physical interaction of robots with unknown environments. The interaction force constraint is interpreted as a new force-constrained control barrier function (FC-CBF) by exploiting the generalized contact model and the prior information of the environment, i.… ▽ More In this study, we propose a safety-critical compliant control strategy designed to strictly enforce interaction force constraints during the physical interaction of robots with unknown environments. The interaction force constraint is interpreted as a new force-constrained control barrier function (FC-CBF) by exploiting the generalized contact model and the prior information of the environment, i.e., the prior stiffness and rest position, for robot kinematics. The difference between the real environment and the generalized contact model is approximated by constructing a tracking differentiator, and its estimation error is quantified based on Lyapunov theory. By interpreting strict interaction safety specification as a dynamic constraint, restricting the desired joint angular rates in kinematics, the proposed approach modifies nominal compliant controllers using quadratic programming, ensuring adherence to interaction force constraints in unknown environments. The strict force constraint and the stability of the closed-loop system are rigorously analyzed. Experimental tests using a UR3e industrial robot with different environments verify the effectiveness of the proposed method in achieving the force constraints in unknown environments. △ Less

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.04855 [pdf, other]

Revisiting general dark matter-bound-electron interactions

Authors: **-Han Liang, Yi Liao, Xiao-Dong Ma, Hao-Lin Wang

Abstract: In this letter we revisit general dark matter (DM)-bound-electron interactions studied previously in the influential work of [Catena et al., Phys. Rev. Res. 2, 033195 (2020)]. We derive the DM-electron response functions and find a crucial minus sign was missed for the second atomic response function $W_2$ defined in that work. The minus sign has significant phenomenological consequences when expl… ▽ More In this letter we revisit general dark matter (DM)-bound-electron interactions studied previously in the influential work of [Catena et al., Phys. Rev. Res. 2, 033195 (2020)]. We derive the DM-electron response functions and find a crucial minus sign was missed for the second atomic response function $W_2$ defined in that work. The minus sign has significant phenomenological consequences when explaining experimental bounds on specific DM scenarios. Furthermore, for the most general DM-electron nonrelativistic or relativistic interactions for DM with spin up to one, we find there are three DM response functions ($a_{0,1,2}$) whose corresponding atomic response functions ($\widetilde W_{0,1,2}$) are linear combinations of the four response functions ($W_{1,2,3,4}$) given in that work, $$ \widetilde W_0 = W_1, \, \widetilde W_2 = |\mathbf{v}_0^\perp|^2 W_1+ W_3 - 2 {m_e\, \mathbf{q}\cdot \mathbf{v}_0^\perp \over \mathbf{q}^2} W_2,\, \widetilde W_3 = { (\mathbf{q}\cdot \mathbf{v}_0^\perp)^2 \over \mathbf{q}^2} W_1 + {m_e^2 \over \mathbf{q}^2}W_4 - 2 {m_e\, \mathbf{q}\cdot \mathbf{v}_0^\perp \over \mathbf{q}^2} W_2. $$ Due to the minus sign correction for $W_2$, there can be significant cancellations between the $W_2$ and $W_{3,4}$ terms, so that $\widetilde W_{2,3}$ are dominated by the usual response function $W_1$ in some cases. Ignoring the sign could thus result in misinterpretation of the experimental data in some DM scenarios. As an example, we show that the recent XENON1T constraint on the fermionic DM anapole moment is weakened by a factor of 2 or so. Many DM scenarios involving DM or electron axial-vector current can yield $W_2$ and thus are potentially affected by the sign. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: 6+2 pages, 4 figures

arXiv:2405.04434 [pdf, other]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding , et al. (132 additional authors not shown)

Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference… ▽ More We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. △ Less

Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.03988 [pdf, other]

Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application

Authors: Jian Jia, Yipei Wang, Yan Li, Honggang Chen, Xuehan Bai, Zhaocheng Liu, Jian Liang, Quan Chen, Han Li, Peng Jiang, Kun Gai

Abstract: Contemporary recommender systems predominantly rely on collaborative filtering techniques, employing ID-embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance in cold-start scenarios and long-tail user recommendations. Leveraging the capabili… ▽ More Contemporary recommender systems predominantly rely on collaborative filtering techniques, employing ID-embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance in cold-start scenarios and long-tail user recommendations. Leveraging the capabilities of Large Language Models (LLMs) pretrained on massive text corpus presents a promising avenue for enhancing recommender systems by integrating open-world domain knowledge. In this paper, we propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge. We address computational complexity concerns by utilizing pretrained LLMs as item encoders and freezing LLM parameters to avoid catastrophic forgetting and preserve open-world knowledge. To bridge the gap between the open-world and collaborative domains, we design a twin-tower structure supervised by the recommendation task and tailored for practical industrial application. Through offline experiments on the large-scale industrial dataset and online experiments on A/B tests, we demonstrate the efficacy of our approach. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: 11 pages, 6 figures

arXiv:2404.17186 [pdf, other]

MCSDNet: Mesoscale Convective System Detection Network via Multi-scale Spatiotemporal Information

Authors: Jiajun Liang, Baoquan Zhang, Yunming Ye, Xutao Li, Chuyao Luo, Xukai Fu

Abstract: The accurate detection of Mesoscale Convective Systems (MCS) is crucial for meteorological monitoring due to their potential to cause significant destruction through severe weather phenomena such as hail, thunderstorms, and heavy rainfall. However, the existing methods for MCS detection mostly targets on single-frame detection, which just considers the static characteristics and ignores the tempor… ▽ More The accurate detection of Mesoscale Convective Systems (MCS) is crucial for meteorological monitoring due to their potential to cause significant destruction through severe weather phenomena such as hail, thunderstorms, and heavy rainfall. However, the existing methods for MCS detection mostly targets on single-frame detection, which just considers the static characteristics and ignores the temporal evolution in the life cycle of MCS. In this paper, we propose a novel encoder-decoder neural network for MCS detection(MCSDNet). MCSDNet has a simple architecture and is easy to expand. Different from the previous models, MCSDNet targets on multi-frames detection and leverages multi-scale spatiotemporal information for the detection of MCS regions in remote sensing imagery(RSI). As far as we know, it is the first work to utilize multi-scale spatiotemporal information to detect MCS regions. Firstly, we design a multi-scale spatiotemporal information module to extract multi-level semantic from different encoder levels, which makes our models can extract more detail spatiotemporal features. Secondly, a Spatiotemporal Mix Unit(STMU) is introduced to MCSDNet to capture both intra-frame features and inter-frame correlations, which is a scalable module and can be replaced by other spatiotemporal module, e.g., CNN, RNN, Transformer and our proposed Dual Spatiotemporal Attention(DSTA). This means that the future works about spatiotemporal modules can be easily integrated to our model. Finally, we present MCSRSI, the first publicly available dataset for multi-frames MCS detection based on visible channel images from the FY-4A satellite. We also conduct several experiments on MCSRSI and find that our proposed MCSDNet achieve the best performance on MCS detection task when comparing to other baseline methods. △ Less

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.16348 [pdf, other]

Dual Expert Distillation Network for Generalized Zero-Shot Learning

Authors: Zhijie Rao, **gcai Guo, Xiaocheng Lu, **gming Liang, Jie Zhang, Haozhao Wang, Kang Wei, Xiaofeng Cao

Abstract: Zero-shot learning has consistently yielded remarkable progress via modeling nuanced one-to-one visual-attribute correlation. Existing studies resort to refining a uniform map** function to align and correlate the sample regions and subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by… ▽ More Zero-shot learning has consistently yielded remarkable progress via modeling nuanced one-to-one visual-attribute correlation. Existing studies resort to refining a uniform map** function to align and correlate the sample regions and subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by introducing a simple yet effective approach, dubbed Dual Expert Distillation Network (DEDN), where two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, one coarse expert, namely cExp, has a complete perceptual scope to coordinate visual-attribute similarity metrics across dimensions, and moreover, another fine expert, namely fExp, consists of multiple specialized subnetworks, each corresponds to an exclusive set of attributes. Two experts cooperatively distill from each other to reach a mutual agreement during training. Meanwhile, we further equip DEDN with a newly designed backbone network, i.e., Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit and leverage visual semantic knowledge. Experiments on various benchmark datasets indicate a new state-of-the-art. △ Less

Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: 9 pages, 4 figures; Accepted to IJCAI 2024

arXiv:2404.16297 [pdf, other]

When Fuzzing Meets LLMs: Challenges and Opportunities

Authors: Yu Jiang, Jie Liang, Fuchen Ma, Yuanliang Chen, Chi** Zhou, Yuheng Shen, Zhiyong Wu, **gzhou Fu, Mingzhe Wang, ShanShan Li, Quan Zhang

Abstract: Fuzzing, a widely-used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identified five major challenges of LLM-assisted fuzzing. To support our findings, we revisited the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a rem… ▽ More Fuzzing, a widely-used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identified five major challenges of LLM-assisted fuzzing. To support our findings, we revisited the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a remedy, we propose some actionable recommendations to help improve applying LLM in Fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.15846 [pdf, other]

From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models

Authors: Qianyu He, Jie Zeng, Qianxi He, Jiaqing Liang, Yanghua Xiao

Abstract: It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found tha… ▽ More It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found that training LLMs with instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels. The improvement can even generalize to compositions of out-of-domain constraints. Additionally, we further propose methods addressing how to obtain and utilize the effective training data. Finally, we conduct extensive experiments to prove the effectiveness of our methods in terms of overall performance and training efficiency. We also demonstrate that our methods improve models' ability to follow instructions generally and generalize effectively across out-of-domain, in-domain, and adversarial settings, while maintaining general capabilities. △ Less

Submitted 18 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.15672 [pdf, other]

Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision

Authors: Mohammad Reza Hosseinzadeh Taher, Michael B. Gotway, Jianming Liang

Abstract: Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning excels in learning multi-level feature spaces, but they often lack explicit coding of part-whole relations, a prominent property of medical imaging. To overcome this limitation, we introduce Adam-v2, a new self-supervised learning framework extending Adam [79] by explicitly incorporating part-whole hier… ▽ More Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning excels in learning multi-level feature spaces, but they often lack explicit coding of part-whole relations, a prominent property of medical imaging. To overcome this limitation, we introduce Adam-v2, a new self-supervised learning framework extending Adam [79] by explicitly incorporating part-whole hierarchies into its learning objectives through three key branches: (1) Localizability, acquiring discriminative representations to distinguish different anatomical patterns; (2) Composability, learning each anatomical structure in a parts-to-whole manner; and (3) Decomposability, comprehending each anatomical structure in a whole-to-parts manner. Experimental results across 10 tasks, compared to 11 baselines in zero-shot, few-shot transfer, and full fine-tuning settings, showcase Adam-v2's superior performance over large-scale medical models and existing SSL methods across diverse downstream tasks. The higher generality and robustness of Adam-v2's representations originate from its explicit construction of hierarchies for distinct anatomical structures from unlabeled medical images. Adam-v2 preserves a semantic balance of anatomical diversity and harmony in its embedding, yielding representations that are both generic and semantically meaningful, yet overlooked in existing SSL methods. All code and pretrained models are available at https://github.com/JLiangLab/Eden. △ Less

Submitted 24 April, 2024; originally announced April 2024.

Comments: Accepted at CVPR 2024 [main conference]

arXiv:2404.14713 [pdf, other]

Enhancing High-Speed Cruising Performance of Autonomous Vehicles through Integrated Deep Reinforcement Learning Framework

Authors: **hao Liang, Kaidi Yang, Chaopeng Tan, **xiang Wang, Guodong Yin

Abstract: High-speed cruising scenarios with mixed traffic greatly challenge the road safety of autonomous vehicles (AVs). Unlike existing works that only look at fundamental modules in isolation, this work enhances AV safety in mixed-traffic high-speed cruising scenarios by proposing an integrated framework that synthesizes three fundamental modules, i.e., behavioral decision-making, path-planning, and mot… ▽ More High-speed cruising scenarios with mixed traffic greatly challenge the road safety of autonomous vehicles (AVs). Unlike existing works that only look at fundamental modules in isolation, this work enhances AV safety in mixed-traffic high-speed cruising scenarios by proposing an integrated framework that synthesizes three fundamental modules, i.e., behavioral decision-making, path-planning, and motion-control modules. Considering that the integrated framework would increase the system complexity, a bootstrapped deep Q-Network (DQN) is employed to enhance the deep exploration of the reinforcement learning method and achieve adaptive decision making of AVs. Moreover, to make AV behavior understandable by surrounding HDVs to prevent unexpected operations caused by misinterpretations, we derive an inverse reinforcement learning (IRL) approach to learn the reward function of skilled drivers for the path planning of lane-changing maneuvers. Such a design enables AVs to achieve a human-like tradeoff between multi-performance requirements. Simulations demonstrate that the proposed integrated framework can guide AVs to take safe actions while guaranteeing high-speed cruising performance. △ Less

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.14225 [pdf]

GravitoMagneto-Hydrodynamics and Spacetime Turbulence in Early Universe

Authors: Jiaxiang Liang, Minghui Du, Peng Xu

Abstract: Based on the gravitoelectromagnetic formalism and inspired by the rich analogies between electrodynamics and general relativity, we try one step further along this line and suggest a new counterpart in the gravitoelectromagnetic world analogue to the electromagnetic physics. A counterpart model of the MagnetoHydroDynamics that could help us to understand the possible new physics in tightly bounded… ▽ More Based on the gravitoelectromagnetic formalism and inspired by the rich analogies between electrodynamics and general relativity, we try one step further along this line and suggest a new counterpart in the gravitoelectromagnetic world analogue to the electromagnetic physics. A counterpart model of the MagnetoHydroDynamics that could help us to understand the possible new physics in tightly bounded spacetime-matter systems such as the case of extremely relativistic fluids in the early Universe. This new viewpoint also suggests a possible new form of spacetime-matter turbulence which may be tested through gravitational wave observations. △ Less

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.12753 [pdf, other]

AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

Authors: Wenhao Huang, Chenghao Peng, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Liqian Wen, Zulong Chen

Abstract: Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency, and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhib… ▽ More Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency, and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages and the paradigm of combining LLMs with crawlers, which helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler can learn from erroneous actions and continuously prune HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources of this paper can be found at \url{https://github.com/EZ-hwh/AutoCrawler} △ Less

Submitted 19 April, 2024; originally announced April 2024.

Comments: 18 pages, 5 figures

arXiv:2404.12322 [pdf, other]

Generalizable Face Landmarking Guided by Conditional Face War**

Authors: Jiayi Liang, Haotian Liu, Hongteng Xu, Dixin Luo

Abstract: As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of faci… ▽ More As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study, we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images, the conditional face warper predicts a war** field from the real face to the stylized one, in which the face landmarker predicts the ending points of the war** field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy, we learn the face landmarker to minimize $i)$ the discrepancy between the stylized faces and the warped real ones and $ii)$ the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks, leading to a face landmarker with better generalizability. Code is available at https://plustwo0.github.io/project-face-landmarker. △ Less

Submitted 21 April, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted in CVPR 2024

arXiv:2404.12138 [pdf, other]

Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?

Authors: Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, Yanghua Xiao

Abstract: Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investi… ▽ More Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions provided with the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 1,401 character decision points from 395 books. Then, we conduct comprehensive experiments on LIFECHOICE, with various LLMs and methods for LLM role-playing. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement. Hence, we further propose the CHARMAP method, which achieves a 6.01% increase in accuracy via persona-based memory retrieval. We will make our datasets and code publicly available. △ Less

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.10315 [pdf, other]

Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience

Authors: Haixia Han, Tingyun Li, Shisong Chen, Jie Shi, Chengyu Du, Yanghua Xiao, Jiaqing Liang, Xin Lin

Abstract: Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks, but they may generate inaccurate or false information with a confident tone. One of the possible solutions is to empower the LLM confidence expression capability, in which the confidence expressed can be well-aligned with the true probability of the generated answer being correct. However, leveragin… ▽ More Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks, but they may generate inaccurate or false information with a confident tone. One of the possible solutions is to empower the LLM confidence expression capability, in which the confidence expressed can be well-aligned with the true probability of the generated answer being correct. However, leveraging the intrinsic ability of LLMs or the signals from the output logits of answers proves challenging in accurately capturing the response uncertainty in LLMs. Therefore, drawing inspiration from cognitive diagnostics, we propose a method of Learning from Past experience (LePe) to enhance the capability for confidence expression. Specifically, we first identify three key problems: (1) How to capture the inherent confidence of the LLM? (2) How to teach the LLM to express confidence? (3) How to evaluate the confidence expression of the LLM? Then we devise three stages in LePe to deal with these problems. Besides, to accurately capture the confidence of an LLM when constructing the training data, we design a complete pipeline including question preparation and answer sampling. We also conduct experiments using the Llama family of LLMs to verify the effectiveness of our proposed method on four datasets. △ Less

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.09593 [pdf, other]

Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction

Authors: Zepeng Ding, Wenhao Huang, Jiaqing Liang, Deqing Yang, Yanghua Xiao

Abstract: Relation triple extraction, which outputs a set of triples from long sentences, plays a vital role in knowledge acquisition. Large language models can accurately extract triples from simple sentences through few-shot learning or fine-tuning when given appropriate instructions. However, they often miss out when extracting from complex sentences. In this paper, we design an evaluation-filtering fram… ▽ More Relation triple extraction, which outputs a set of triples from long sentences, plays a vital role in knowledge acquisition. Large language models can accurately extract triples from simple sentences through few-shot learning or fine-tuning when given appropriate instructions. However, they often miss out when extracting from complex sentences. In this paper, we design an evaluation-filtering framework that integrates large language models with small models for relational triple extraction tasks. The framework includes an evaluation model that can extract related entity pairs with high precision. We propose a simple labeling principle and a deep neural network to build the model, embedding the outputs as prompts into the extraction process of the large model. We conduct extensive experiments to demonstrate that the proposed method can assist large language models in obtaining more accurate extraction results, especially from complex sentences containing multiple relational triples. Our evaluation model can also be embedded into traditional extraction models to enhance their extraction precision from complex sentences. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted at LREC-COLING 2024 main conference

arXiv:2404.09145 [pdf, other]

ToNER: Type-oriented Named Entity Recognition with Generative Language Model

Authors: Guochao Jiang, Ziqin Luo, Yuchen Shi, Dixuan Wang, Jiaqing Liang, Deqing Yang

Abstract: In recent years, the fine-tuned generative models have been proven more powerful than the previous tagging-based or span-based models on named entity recognition (NER) task. It has also been found that the information related to entities, such as entity types, can prompt a model to achieve NER better. However, it is not easy to determine the entity types indeed existing in the given sentence in ad… ▽ More In recent years, the fine-tuned generative models have been proven more powerful than the previous tagging-based or span-based models on named entity recognition (NER) task. It has also been found that the information related to entities, such as entity types, can prompt a model to achieve NER better. However, it is not easy to determine the entity types indeed existing in the given sentence in advance, and inputting too many potential entity types would distract the model inevitably. To exploit entity types' merit on promoting NER task, in this paper we propose a novel NER framework, namely ToNER based on a generative model. In ToNER, a type matching model is proposed at first to identify the entity types most likely to appear in the sentence. Then, we append a multiple binary classification task to fine-tune the generative model's encoder, so as to generate the refined representation of the input sentence. Moreover, we add an auxiliary task for the model to discover the entity types which further fine-tunes the model to output more accurate results. Our extensive experiments on some NER benchmarks verify the effectiveness of our proposed strategies in ToNER that are oriented towards entity types' exploitation. △ Less

Submitted 11 June, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: Accepted by LREC-COLING 2024

arXiv:2404.09120 [pdf]

Interfacial reaction boosts thermal conductance of room-temperature integrated semiconductor interfaces stable up to 1100 C

Authors: Zhe Cheng, Xiaoyang Ji, Zifeng Huang, Yutaka Ohno, Koji Inoue, Yasusyohi Nagai, Yoshiki Sakaida, Hiroki Uratani, Naoteru Shigekawa, Jianbo Liang

Abstract: Overheating has emerged as a primary challenge constraining the reliability and performance of next-generation high-performance electronics, such as chiplets and (ultra)wide bandgap electronics. Advanced heterogeneous integration not only constitutes a pivotal technique for fabricating these electronics but also offers potential solutions for thermal management. This study presents the integration… ▽ More Overheating has emerged as a primary challenge constraining the reliability and performance of next-generation high-performance electronics, such as chiplets and (ultra)wide bandgap electronics. Advanced heterogeneous integration not only constitutes a pivotal technique for fabricating these electronics but also offers potential solutions for thermal management. This study presents the integration of high thermal conductivity semiconductors, specifically, 3C-SiC thin films and diamond substrates, through a room-temperature surface-activated bonding technique. Notably, the thermal conductivity of the 3C-SiC films is among the highest for all semiconductor films which can be integrated near room temperature with similar thicknesses. Furthermore, following annealing, the interfaces between 3C-SiC and diamond demonstrate a remarkable enhancement in thermal boundary conductance (TBC), reaching up to approximately 300%, surpassing all other grown and bonded heterointerfaces. This enhancement is attributed to interfacial reactions, specifically the transformation of amorphous silicon into SiC upon interaction with diamond, which is further corroborated by picosecond ultrasonics measurements. Subsequent to annealing at 1100 C, the achieved TBC (150 MW/m2-K) is record-high among all bonded diamond interfaces. Additionally, the visualization of large-area TBC, facilitated by femtosecond laser-based time-domain thermoreflectance measurements, shows the uniformity of the interfaces which are capable of withstanding temperatures as high as 1100 C. Our research marks a significant advancement in the realm of thermally conductive heterogeneous integration, which is promising for enhanced cooling of next-generation electronics. △ Less

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2404.08707 [pdf, other]

Large Language Model Can Continue Evolving From Mistakes

Authors: Haokun Zhao, Haixia Han, Jie Shi, Chengyu Du, Jiaqing Liang, Yanghua Xiao

Abstract: As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for kee** Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains… ▽ More As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for kee** Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains challenging to collect CPT data that addresses the knowledge deficiencies in models while maintaining adequate volume, and improving the efficiency of utilizing this data also presents significant difficulties. Inspired by the 'summarizing mistakes' learning skill, we propose the Continue Evolving from Mistakes (CEM) method, aiming to provide a data-efficient approach for collecting CPT data and continually improving LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To efficiently utilize these CPT data and mitigate forgetting, we design a novel CL training set construction paradigm that integrates parallel CIT and CPT data. Extensive experiments demonstrate the efficacy of the CEM method, achieving up to a 17% improvement in accuracy in the best case. Furthermore, additional experiments confirm the potential of combining CEM with catastrophic forgetting mitigation methods, enabling iterative and continual model evolution. △ Less

Submitted 17 June, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.06970 [pdf, other]

Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning

Authors: Peipei Liu, Gaosheng Wang, Ying Tong, Jian Liang, Zhenquan Ding, Hongsong Zhu

Abstract: Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from the computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits the… ▽ More Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from the computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits the general NER into two stages: entity-span detection and entity classification. There are 3 processes for introducing MsFNER: training, finetuning, and inference. In the training process, we train and get the best entity-span detection model and the entity classification model separately on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During finetuning, we finetune the both models on the support dataset of target domain. In the inference process, for the unlabeled data, we first detect the entity-spans, then the entity-spans are jointly determined by the entity classification model and the KNN. We conduct experiments on the open FewNERD dataset and the results demonstrate the advance of MsFNER. △ Less

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.05609 [pdf, other]

Feedback Stability Under Mixed Gain and Phase Uncertainty

Authors: Jia** Liang, Di Zhao, Li Qiu

Abstract: In this study, we investigate the robust feedback stability problem for multiple-input-multiple-output linear time-invariant systems involving sectored-disk uncertainty, namely, dynamic uncertainty subject to simultaneous gain and phase constraints. This problem is thereby called a sectored-disk problem. Employing a frequency-wise analysis approach, we derive a fundamental static matrix problem th… ▽ More In this study, we investigate the robust feedback stability problem for multiple-input-multiple-output linear time-invariant systems involving sectored-disk uncertainty, namely, dynamic uncertainty subject to simultaneous gain and phase constraints. This problem is thereby called a sectored-disk problem. Employing a frequency-wise analysis approach, we derive a fundamental static matrix problem that serves as a key component in addressing the feedback stability. The study of this matrix problem heavily relies on the Davis-Wielandt (DW) shells of matrices, providing a profound insight into matrices subjected to simultaneous gain and phase constraints. This understanding is pivotal for establishing a less conservative sufficient condition for the matrix sectored-disk problem, from which we formulate several robust feedback stability conditions against sectored-disk uncertainty. Finally, several conditions based on linear matrix inequalities are developed for efficient computation and verification of feedback robust stability against sectored-disk uncertainty. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.04293 [pdf, other]

Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding

Authors: Yanda Li, Dixuan Wang, Jiaqing Liang, Guochao Jiang, Qianyu He, Yanghua Xiao, Deqing Yang

Abstract: Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks, but they still struggle with some complicated reasoning tasks including logical reasoning. One non-negligible reason for LLMs' suboptimal performance on logical reasoning is their overlooking of understanding logical fallacies correctly. To evaluate LLMs' capability of logical fallacy understanding (LFU), we p… ▽ More Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks, but they still struggle with some complicated reasoning tasks including logical reasoning. One non-negligible reason for LLMs' suboptimal performance on logical reasoning is their overlooking of understanding logical fallacies correctly. To evaluate LLMs' capability of logical fallacy understanding (LFU), we propose five concrete tasks from three cognitive dimensions of WHAT, WHY, and HOW in this paper. Towards these LFU tasks, we have successfully constructed a new dataset LFUD based on GPT-4 accompanied by a little human effort. Our extensive experiments justify that our LFUD can be used not only to evaluate LLMs' LFU capability, but also to fine-tune LLMs to obtain significantly enhanced performance on logical reasoning. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.03845 [pdf]

Buck You: Designing Easy-to-Onboard Blockchain Applications with Zero-Knowledge Login and Sponsored Transactions on Sui

Authors: Eason Chen, Zimo Xiao, Justa Liang, Damien Chen, Pierce Hung, Kostas Kryptos Chalkias

Abstract: In this paper, we developed a blockchain application to demonstrate the functionality of Sui's recent innovations: Zero Knowledge Login and Sponsored Transactions. Zero Knowledge Login allows users to create and access their blockchain wallets just with their OAuth accounts (e.g., Google, Facebook, Twitch), while Sponsored Transactions eliminate the need for users to prepare transaction fees, as t… ▽ More In this paper, we developed a blockchain application to demonstrate the functionality of Sui's recent innovations: Zero Knowledge Login and Sponsored Transactions. Zero Knowledge Login allows users to create and access their blockchain wallets just with their OAuth accounts (e.g., Google, Facebook, Twitch), while Sponsored Transactions eliminate the need for users to prepare transaction fees, as they can delegate fees to sponsors' accounts. Additionally, thanks to Sui's Storage Rebate feature, sponsors in Sponsored Transactions can profit from the sponsorship, achieving a win-win and sustainable service model. Zero Knowledge Login and Sponsored Transactions are pivotal in overcoming key challenges novice blockchain users face, particularly in managing private keys and depositing initial transaction fees. By addressing these challenges in the user experience of blockchain, Sui makes the blockchain more accessible and engaging for novice users and paves the way for the broader adoption of blockchain applications in everyday life. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.02885 [pdf, other]

PoCo: Point Context Cluster for RGBD Indoor Place Recognition

Authors: **g Liang, Zhuo Deng, Zheming Zhou, Omid Ghasemalizadeh, Dinesh Manocha, Min Sun, Cheng-Hao Kuo, Arnie Sen

Abstract: We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (… ▽ More We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning. Moreover, we develop the architecture by integrating both color and geometric modalities into the point features to enhance the global descriptor representation. We conducted evaluations on public datasets ScanNet-PR and ARKit with 807 and 5047 scenarios, respectively. PoCo achieves SOTA performance: on ScanNet-PR, we achieve R@1 of 64.63%, a 5.7% improvement from the best-published result CGis (61.12%); on Arkit, we achieve R@1 of 45.12%, a 13.3% improvement from the best-published result CGis (39.82%). In addition, PoCo shows higher efficiency than CGis in inference time (1.75X-faster), and we demonstrate the effectiveness of PoCo in recognizing places within a real-world laboratory environment. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.02663 [pdf]

Ground-to-UAV sub-Terahertz channel measurement and modeling

Authors: Da Li, Peian Li, Jiabiao Zhao, Jianjian Liang, Jiacheng Liu, Guohao Liu, Yuanshuai Lei, Wenbo Liu, Jianqin Deng, Fuyong Liu, Jianjun Ma

Abstract: Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications have been expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless… ▽ More Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications have been expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless channels leveraging UAVs remain under explored. This work delves into a ground-to-UAV channel at 140 GHz, with a specific focus on the influence of UAV hovering behavior on channel performance. Employing experimental measurements through an unmodulated channel setup and a geometry-based stochastic model (GBSM) that integrates three-dimensional positional coordinates and beamwidth, this work evaluates the impact of UAV dynamic movements and antenna orientation on channel performance. Our findings highlight the minimal impact of UAV orientation adjustments on channel performance and underscore the diminishing necessity for precise alignment between UAVs and ground stations as beamwidth increases. △ Less

Submitted 28 June, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

Comments: Submitted to Optics Express

arXiv:2404.02239 [pdf, ps, other]

Proximal Oracles for Optimization and Sampling

Authors: Jiaming Liang, Yongxin Chen

Abstract: We consider convex optimization with non-smooth objective function and log-concave sampling with non-smooth potential (negative log density). In particular, we study two specific settings where the convex objective/potential function is either semi-smooth or in composite form as the finite sum of semi-smooth components. To overcome the challenges caused by non-smoothness, our algorithms employ two… ▽ More We consider convex optimization with non-smooth objective function and log-concave sampling with non-smooth potential (negative log density). In particular, we study two specific settings where the convex objective/potential function is either semi-smooth or in composite form as the finite sum of semi-smooth components. To overcome the challenges caused by non-smoothness, our algorithms employ two powerful proximal frameworks in optimization and sampling: the proximal point framework for optimization and the alternating sampling framework (ASF) that uses Gibbs sampling on an augmented distribution. A key component of both optimization and sampling algorithms is the efficient implementation of the proximal map by the regularized cutting-plane method. We establish the iteration-complexity of the proximal map in both semi-smooth and composite settings. We further propose an adaptive proximal bundle method for non-smooth optimization. The proposed method is universal since it does not need any problem parameters as input. Additionally, we develop a proximal sampling oracle that resembles the proximal map in optimization and establish its complexity using a novel technique (a modified Gaussian integral). Finally, we combine this proximal sampling oracle and ASF to obtain a Markov chain Monte Carlo method with non-asymptotic complexity bounds for sampling in semi-smooth and composite settings. △ Less

Submitted 2 April, 2024; originally announced April 2024.

Comments: 25 pages. arXiv admin note: text overlap with arXiv:2202.13975

arXiv:2404.01564 [pdf, other]

The radiative decay of scalar glueball from lattice QCD

Authors: **tao Zou, Long-Cheng Gui, Ying Chen, Jian Liang, Xiangyu Jiang, Wen Qin

Abstract: We perform the first lattice QCD study on the radiative decay of the scalar glueball to the vector meson $φ$ in the quenched approximation. The calculations are carried out on three gauge ensembles with different lattice spaicings, which enable us to do the continuum extrapolation. We first revisit the radiative $J/ψ$ decay into the scalar glueball $G$ and obtain the partial decay width… ▽ More We perform the first lattice QCD study on the radiative decay of the scalar glueball to the vector meson $φ$ in the quenched approximation. The calculations are carried out on three gauge ensembles with different lattice spaicings, which enable us to do the continuum extrapolation. We first revisit the radiative $J/ψ$ decay into the scalar glueball $G$ and obtain the partial decay width $Γ(J/ψ\to γG)=0.449(44)~\text{keV}$ and the branching fraction $\text{Br}(J/ψ\to γG) = 4.8(5)\times 10^{-3}$, which are in agreement with the previous lattice results. We then extend the similar calculation to the process $G\to γφ$ and get the partial decay width $Γ(G \to γφ)= 0.074(47)~\text{keV}$, which implies that the combined branching fraction of $J/ψ\toγG\to γγφ$ is as small as $\mathcal{O}(10^{-9})$ such that this process is hardly detected by the BESIII experiment even with the large $J/ψ$ sample of $\mathcal{O}(10^{10})$. With the vector meson dominance model, the two-photon decay width of the scalar glueball is estimated to be $Γ(G\toγγ)=0.53(46)~\text{eV}$, which results in a large stickiness $S(G)\sim \mathcal{O}(10^4)$ of the scalar glueball by assuming the stickiness of $f_2(1270)$ to be one. △ Less

Submitted 4 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 12 pages,10 figures

arXiv:2404.00210 [pdf, other]

Socially Aware Robot Navigation through Scoring Using Vision-Language Models

Authors: Daeun Song, **g Liang, Amirreza Payandeh, Xuesu Xiao, Dinesh Manocha

Abstract: We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's trajectory in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot beha… ▽ More We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's trajectory in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner. Our overall approach reduces reliance on large datasets (for training) and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least 36.37% improvement in average success rate and 20.00% improvement in average collision rate in the four social navigation scenarios. The user study score shows that VLM-Social-Nav generates the most socially compliant navigation behavior. △ Less

Submitted 29 March, 2024; originally announced April 2024.

arXiv:2403.20254 [pdf, other]

Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions

Authors: Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, Yong Guo

Abstract: Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incu… ▽ More Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR2024

arXiv:2403.18638 [pdf, other]

Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection

Authors: **hua Liang, Ines Nolasco, Burooj Ghani, Huy Phan, Emmanouil Benetos, Dan Stowell

Abstract: Detecting the presence of animal vocalisations in nature is essential to study animal populations and their behaviors. A recent development in the field is the introduction of the task known as few-shot bioacoustic sound event detection, which aims to train a versatile animal sound detector using only a small set of audio samples. Previous efforts in this area have utilized different architectures… ▽ More Detecting the presence of animal vocalisations in nature is essential to study animal populations and their behaviors. A recent development in the field is the introduction of the task known as few-shot bioacoustic sound event detection, which aims to train a versatile animal sound detector using only a small set of audio samples. Previous efforts in this area have utilized different architectures and data augmentation techniques to enhance model performance. However, these approaches have not fully bridged the domain gap between source and target distributions, limiting their applicability in real-world scenarios. In this work, we introduce an new dataset designed to augment the diversity and breadth of classes available for few-shot bioacoustic event detection, building on the foundations of our previous datasets. To establish a robust baseline system tailored for the DCASE 2024 Task 5 challenge, we delve into an array of acoustic features and adopt negative hard sampling as our primary domain adaptation strategy. This approach, chosen in alignment with the challenge's guidelines that necessitate the independent treatment of each audio file, sidesteps the use of transductive learning to ensure compliance while aiming to enhance the system's adaptability to domain shifts. Our experiments show that the proposed baseline system achieves a better performance compared with the vanilla prototypical network. The findings also confirm the effectiveness of each domain adaptation method by ablating different components within the networks. This highlights the potential to improve few-shot bioacoustic sound event detection by further reducing the impact of domain shift. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.18105 [pdf, other]

Large Language Models for Education: A Survey and Outlook

Authors: Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang, Philip S. Yu, Qingsong Wen

Abstract: The advent of Large Language Models (LLMs) has brought in a new era of possibilities in the realm of education. This survey paper summarizes the various technologies of LLMs in educational settings from multifaceted perspectives, encompassing student and teacher assistance, adaptive learning, and commercial tools. We systematically review the technological advancements in each perspective, organiz… ▽ More The advent of Large Language Models (LLMs) has brought in a new era of possibilities in the realm of education. This survey paper summarizes the various technologies of LLMs in educational settings from multifaceted perspectives, encompassing student and teacher assistance, adaptive learning, and commercial tools. We systematically review the technological advancements in each perspective, organize related datasets and benchmarks, and identify the risks and challenges associated with deploying LLMs in education. Furthermore, we outline future research opportunities, highlighting the potential promising directions. Our survey aims to provide a comprehensive technological picture for educators, researchers, and policymakers to harness the power of LLMs to revolutionize educational practices and foster a more effective personalized learning environment. △ Less

Submitted 1 April, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.16536 [pdf, other]

VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

Authors: Yu** Tang, Peijie Dong, Zhenheng Tang, Xiaowen Chu, Junwei Liang

Abstract: Combining CNNs or ViTs, with RNNs for spatiotemporal forecasting, has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge; CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. The emergence of recent Mamba-based… ▽ More Combining CNNs or ViTs, with RNNs for spatiotemporal forecasting, has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge; CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. The emergence of recent Mamba-based architectures has been met with enthusiasm for their exceptional long-sequence modeling capabilities, surpassing established vision models in efficiency and accuracy, which motivates us to develop an innovative architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks effectively. Our extensive evaluations show that our proposed approach secures competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyu**tang/VMRNN-PyTorch. △ Less

Submitted 29 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: CVPR2024 Precognition Workshop

arXiv:2403.16428 [pdf, other]

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Authors: Zicong Fan, Takehiko Ohkawa, Linlin Yang, Nie Lin, Zhishan Zhou, Shihao Zhou, Jiajun Liang, Zhong Gao, Xuanyang Zhang, Xue Zhang, Fei Li, Liu Zheng, Feng Lu, Karim Abou Zeid, Bastian Leibe, Jeongwan On, Seungryul Baek, Aditya Prakash, Saurabh Gupta, Kun He, Yoichi Sato, Otmar Hilliges, Hyung ** Chang, Angela Yao

Abstract: We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the… ▽ More We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.16396 [pdf, other]

Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases

Authors: Wenhao Huang, Qianyu He, Zhixu Li, Jiaqing Liang, Yanghua Xiao

Abstract: Definition bias is a negative phenomenon that can mislead models. Definition bias in information extraction appears not only across datasets from different domains but also within datasets sharing the same domain. We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets. To systematic… ▽ More Definition bias is a negative phenomenon that can mislead models. Definition bias in information extraction appears not only across datasets from different domains but also within datasets sharing the same domain. We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets. To systematically investigate definition bias, we conduct three probing experiments to quantitatively analyze it and discover the limitations of unified information extraction and large language models in solving definition bias. To mitigate definition bias in information extraction, we propose a multi-stage framework consisting of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation. Experimental results demonstrate the effectiveness of our framework in addressing definition bias. Resources of this paper can be found at https://github.com/EZ-hwh/definition-bias △ Less