-
Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning
Authors:
Junzhi Chen,
Juhao Liang,
Benyou Wang
Abstract:
The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks, often at a level comparable to human performance. Despite their capabilities, LLMs still struggle with tasks that demand high accuracy and complexity, owing to their inherent limitations in handling multifaceted problems single-handedly. This paper introduces 'Smurfs', a multi-agent framework designed to improve how LLMs are applied to such tasks. By transforming a conventional LLM into a synergistic multi-agent ensemble, Smurfs can enhance the model's ability to solve complex tasks at no additional cost. This is achieved through prompting strategies that assign distinct roles within the model, thereby facilitating collaboration among specialized agents and forming an intelligent multi-agent system. Our empirical investigation on both the open-ended StableToolBench task and the closed-ended HotpotQA task demonstrates Smurfs' superior capability in intricate tool-utilization scenarios. Notably, Smurfs outperforms all baseline methods in both experiments, setting new state-of-the-art performance. Furthermore, through comprehensive ablation studies, we dissect the contribution of the framework's core components to its overall efficacy. This not only verifies the effectiveness of the framework but also charts a route for future exploration of multi-agent LLM systems.
Submitted 23 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Design and Implementation of Energy-Efficient Wireless Tire Sensing System with Delay Analysis for Intelligent Vehicles
Authors:
Shashank Mishra,
Jia-Ming Liang
Abstract:
The growing prevalence of Internet of Things (IoT) technologies has led to a rise in the popularity of intelligent vehicles that incorporate a range of sensors to monitor aspects such as driving speed, fuel usage, distance proximity, and tire anomalies. Real-time tire sensing systems now play important roles in intelligent vehicles by increasing mileage, reducing fuel consumption, improving driving safety, and reducing the potential for traffic accidents. However, current tire sensing systems drain a significant amount of the vehicle's energy and collect sensing data ineffectively, so they may not deliver safety-relevant data in a timely manner. This paper therefore designs an energy-efficient wireless tire sensing system (WTSS) that leverages energy-saving techniques to significantly reduce power consumption while keeping data retrieval delays bounded during real-time monitoring. Additionally, we mathematically analyze the system's worst-case transmission delay, based on the collision probabilities of sensor transmissions, to guarantee timeliness. The system has been implemented and verified through simulation and field-trial experiments. The results show that the proposed scheme improves energy efficiency and accurately identifies the worst-case transmission delay.
Submitted 27 May, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
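The WTSS worst-case delay analysis rests on the collision probabilities of sensor transmissions. The paper's channel model is not given in the abstract, so the sketch below swaps in a simple hypothetical one: slotted random access in which the other n - 1 sensors each transmit in a slot independently with probability p, and a tagged sensor retries every slot until an attempt is collision-free. Under those assumptions the chance that k consecutive attempts all collide is P_c^k, which yields a delay that holds with probability at least 1 - ε. Function names and parameters are illustrative only.

```python
import math


def collision_probability(n_sensors: int, p_tx: float) -> float:
    """Chance that a slot used by one tagged sensor is also used by at least one
    of the other n_sensors - 1 sensors, each transmitting independently with
    probability p_tx (hypothetical slotted random-access model)."""
    return 1.0 - (1.0 - p_tx) ** (n_sensors - 1)


def delay_bound_ms(n_sensors: int, p_tx: float, slot_ms: float, eps: float) -> float:
    """Delay within which the tagged sensor succeeds with probability >= 1 - eps,
    assuming it retries in every slot until one attempt is collision-free."""
    pc = collision_probability(n_sensors, p_tx)
    if pc <= 0.0:
        return slot_ms  # no contention: the first attempt always succeeds
    if pc >= 1.0:
        raise ValueError("every attempt collides; no finite bound exists")
    # Smallest k with pc**k <= eps, i.e. k straight failures are unlikely enough.
    attempts = max(1, math.ceil(math.log(eps) / math.log(pc)))
    return attempts * slot_ms


# Example: four tire sensors, 10% per-slot transmission probability, 2 ms slots.
print(delay_bound_ms(n_sensors=4, p_tx=0.1, slot_ms=2.0, eps=1e-6))
```

With these illustrative inputs the per-slot collision probability is 1 - 0.9^3 = 0.271, eleven attempts are needed for 0.271^k ≤ 10^-6, and the sketch prints 22.0 (ms). Under pure random access no finite delay holds with certainty, which is why the bound is stated for a target ε.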
-
Guarding Force: Safety-Critical Compliant Control for Robot-Environment Interaction
Authors:
Xinming Wang,
Jun Yang,
Jianliang Mao,
**zhuo Liang,
Shihua Li,
Yunda Yan
Abstract:
In this study, we propose a safety-critical compliant control strategy designed to strictly enforce interaction force constraints during the physical interaction of robots with unknown environments. The interaction force constraint is interpreted as a new force-constrained control barrier function (FC-CBF) by exploiting the generalized contact model and the prior information of the environment, i.e., the prior stiffness and rest position, for robot kinematics. The difference between the real environment and the generalized contact model is approximated by constructing a tracking differentiator, and its estimation error is quantified based on Lyapunov theory. By interpreting the strict interaction safety specification as a dynamic constraint that restricts the desired joint angular rates in kinematics, the proposed approach modifies nominal compliant controllers using quadratic programming, ensuring adherence to interaction force constraints in unknown environments. The strict force constraint and the stability of the closed-loop system are rigorously analyzed. Experimental tests using a UR3e industrial robot in different environments verify the effectiveness of the proposed method in enforcing the force constraints in unknown environments.
Submitted 8 May, 2024;
originally announced May 2024.
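The Guarding Force abstract describes modifying a nominal compliant controller through quadratic programming so that a barrier constraint on the desired joint rates holds. As a minimal sketch, assuming a single affine constraint a·u ≤ b on the joint-rate command u (for a barrier h ≥ 0, the usual CBF condition ∇h·u ≥ -αh takes this form with a = -∇h and b = αh), the QP has a closed-form solution. Here a and b are assumed given; the paper's FC-CBF, contact model, and tracking differentiator are not modeled, and a controller with several constraints would call a general QP solver instead. The function name is hypothetical.

```python
from typing import List, Sequence


def safety_filter(u_nom: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Solve min ||u - u_nom||^2 subject to a.u <= b (one affine constraint).

    With a single constraint the QP has a closed form: keep u_nom if it is
    feasible, otherwise project it onto the hyperplane a.u = b.
    """
    violation = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if violation <= 0.0:
        return list(u_nom)  # nominal command already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint gradient a must be nonzero")
    step = violation / norm_sq
    return [ui - step * ai for ui, ai in zip(u_nom, a)]


# Two-joint toy: the nominal rate of joint 1 (0.2) exceeds the allowed 0.05.
print(safety_filter(u_nom=[0.2, 0.1], a=[1.0, 0.0], b=0.05))
```

The projection changes the nominal command as little as possible in the Euclidean sense, which is the same "minimally invasive" property that makes CBF-QP filters attractive for wrapping existing compliant controllers.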
-
Revisiting general dark matter-bound-electron interactions
Authors:
**-Han Liang,
Yi Liao,
Xiao-Dong Ma,
Hao-Lin Wang
Abstract:
In this letter we revisit general dark matter (DM)-bound-electron interactions studied previously in the influential work of [Catena et al., Phys. Rev. Res. 2, 033195 (2020)]. We derive the DM-electron response functions and find a crucial minus sign was missed for the second atomic response function $W_2$ defined in that work. The minus sign has significant phenomenological consequences when explaining experimental bounds on specific DM scenarios. Furthermore, for the most general DM-electron nonrelativistic or relativistic interactions for DM with spin up to one, we find there are three DM response functions ($a_{0,1,2}$) whose corresponding atomic response functions ($\widetilde W_{0,1,2}$) are linear combinations of the four response functions ($W_{1,2,3,4}$) given in that work, $$ \widetilde W_0 = W_1, \, \widetilde W_2 = |\mathbf{v}_0^\perp|^2 W_1+ W_3 - 2 {m_e\, \mathbf{q}\cdot \mathbf{v}_0^\perp \over \mathbf{q}^2} W_2,\, \widetilde W_3 = { (\mathbf{q}\cdot \mathbf{v}_0^\perp)^2 \over \mathbf{q}^2} W_1 + {m_e^2 \over \mathbf{q}^2}W_4 - 2 {m_e\, \mathbf{q}\cdot \mathbf{v}_0^\perp \over \mathbf{q}^2} W_2. $$ Due to the minus sign correction for $W_2$, there can be significant cancellations between the $W_2$ and $W_{3,4}$ terms, so that $\widetilde W_{2,3}$ are dominated by the usual response function $W_1$ in some cases. Ignoring the sign could thus result in misinterpretation of the experimental data in some DM scenarios. As an example, we show that the recent XENON1T constraint on the fermionic DM anapole moment is weakened by a factor of 2 or so. Many DM scenarios involving DM or electron axial-vector current can yield $W_2$ and thus are potentially affected by the sign.
Submitted 8 May, 2024;
originally announced May 2024.
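The linear combinations displayed in the dark-matter abstract can be evaluated directly. The sketch below assumes the $W_i$ values, the momentum transfer $\mathbf{q}$, the velocity $\mathbf{v}_0^\perp$, and the electron mass $m_e$ are supplied in consistent units, and it keeps the index labels of the displayed equation; the example uses toy numbers in arbitrary units, not physical inputs. The shared `cross` term is the $W_2$ contribution whose sign the letter corrects, so flipping the sign of `W2` shows how the cancellation against the $W_3$ and $W_4$ terms changes. Function names are hypothetical.

```python
from typing import Dict, Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def combined_responses(W1: float, W2: float, W3: float, W4: float,
                       q: Sequence[float], v_perp: Sequence[float],
                       m_e: float) -> Dict[str, float]:
    """Linear combinations of W_1..W_4 as displayed in the abstract (q must be nonzero)."""
    q2 = dot(q, q)
    qv = dot(q, v_perp)
    # cross = 2 m_e (q . v_perp) / q^2 * W2, subtracted in both combinations.
    cross = 2.0 * m_e * qv / q2 * W2
    return {
        "W0_tilde": W1,
        "W2_tilde": dot(v_perp, v_perp) * W1 + W3 - cross,
        "W3_tilde": qv ** 2 / q2 * W1 + m_e ** 2 / q2 * W4 - cross,
    }


# Toy values in arbitrary units; compare the two signs of W2.
for w2 in (0.5, -0.5):
    print(w2, combined_responses(1.0, w2, 2.0, 3.0, (1.0, 0.0, 0.0), (0.5, 0.0, 0.0), 1.0))
```

With `W2 = 0.5` the $W_2$ term partially cancels the others (1.75 and 2.75); with the opposite sign it adds to them (2.75 and 3.75), which is the kind of difference the sign correction makes.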
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
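The headline ratios in the DeepSeek-V2 abstract can be checked with simple arithmetic. The sketch below only re-derives fractions from the numbers quoted there (236B total and 21B activated parameters, a 42.5% training-cost saving, a 93.3% KV-cache reduction); the function name is hypothetical. Per token, roughly 8.9% of the parameters are active, the KV cache shrinks to about 6.7% of DeepSeek 67B's, and training cost falls to about 57.5%.

```python
def moe_summary(total_b: float, active_b: float,
                kv_reduction: float, cost_saving: float) -> dict:
    """Fractions implied by the figures quoted in the abstract."""
    return {
        "active_fraction": active_b / total_b,         # parameters used per token
        "kv_cache_fraction": 1.0 - kv_reduction,       # remaining vs. DeepSeek 67B
        "training_cost_fraction": 1.0 - cost_saving,   # remaining vs. DeepSeek 67B
    }


summary = moe_summary(total_b=236, active_b=21, kv_reduction=0.933, cost_saving=0.425)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

Sparse activation reduces per-token compute rather than the number of stored weights: all 236B parameters still exist, only 21B of them are used for any given token.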
-
Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application
Authors:
Jian Jia,
Yipei Wang,
Yan Li,
Honggang Chen,
Xuehan Bai,
Zhaocheng Liu,
Jian Liang,
Quan Chen,
Han Li,
Peng Jiang,
Kun Gai
Abstract:
Contemporary recommender systems predominantly rely on collaborative filtering techniques, employing ID-embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance in cold-start scenarios and long-tail user recommendations. Leveraging the capabilities of Large Language Models (LLMs) pretrained on massive text corpora presents a promising avenue for enhancing recommender systems by integrating open-world domain knowledge. In this paper, we propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge. We address computational complexity concerns by utilizing pretrained LLMs as item encoders and freezing LLM parameters to avoid catastrophic forgetting and preserve open-world knowledge. To bridge the gap between the open-world and collaborative domains, we design a twin-tower structure supervised by the recommendation task and tailored for practical industrial application. Through offline experiments on a large-scale industrial dataset and online A/B tests, we demonstrate the efficacy of our approach.
Submitted 7 May, 2024;
originally announced May 2024.
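As a rough illustration of the LEARN twin-tower idea under stated assumptions: item text is encoded once by a frozen LLM (represented here simply as precomputed vectors), a user tower aggregates the embeddings of a user's interaction history, an item tower embeds a candidate, and a dot product scores the pair. The mean-pooling user tower, the linear adapters, the dot-product score, and all function names are hypothetical simplifications, not LEARN's actual architecture.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def project(weights: Matrix, x: Sequence[float]) -> Vector:
    """Trainable linear adapter from the frozen LLM space to the recommendation space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def user_tower(history: Sequence[Sequence[float]], w_user: Matrix) -> Vector:
    """Mean of the projected frozen embeddings of the items a user interacted with."""
    projected = [project(w_user, emb) for emb in history]
    return [sum(col) / len(projected) for col in zip(*projected)]


def item_tower(item_emb: Sequence[float], w_item: Matrix) -> Vector:
    return project(w_item, item_emb)


def score(user_vec: Sequence[float], item_vec: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(user_vec, item_vec))


# Frozen LLM item embeddings are plain constants here; only w_user / w_item
# would be updated by a recommendation loss during training.
identity = [[1.0, 0.0], [0.0, 1.0]]
u = user_tower([[1.0, 0.0], [0.0, 1.0]], identity)
print(score(u, item_tower([1.0, 0.0], identity)))
```

Keeping the LLM frozen means item embeddings can be computed once and cached, so only the small adapters sit on the training path.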
-
MCSDNet: Mesoscale Convective System Detection Network via Multi-scale Spatiotemporal Information
Authors:
Jiajun Liang,
Baoquan Zhang,
Yunming Ye,
Xutao Li,
Chuyao Luo,
Xukai Fu
Abstract:
The accurate detection of Mesoscale Convective Systems (MCS) is crucial for meteorological monitoring because of their potential to cause significant destruction through severe weather phenomena such as hail, thunderstorms, and heavy rainfall. However, existing methods for MCS detection mostly target single-frame detection, which considers only static characteristics and ignores the temporal evolution over the MCS life cycle. In this paper, we propose a novel encoder-decoder neural network for MCS detection (MCSDNet). MCSDNet has a simple architecture and is easy to extend. Unlike previous models, MCSDNet targets multi-frame detection and leverages multi-scale spatiotemporal information to detect MCS regions in remote sensing imagery (RSI). To our knowledge, it is the first work to utilize multi-scale spatiotemporal information to detect MCS regions. First, we design a multi-scale spatiotemporal information module to extract multi-level semantics from different encoder levels, which enables our model to extract more detailed spatiotemporal features. Second, a Spatiotemporal Mix Unit (STMU) is introduced into MCSDNet to capture both intra-frame features and inter-frame correlations; it is a scalable module that can be replaced by other spatiotemporal modules, e.g., CNN, RNN, Transformer, and our proposed Dual Spatiotemporal Attention (DSTA). This means that future spatiotemporal modules can be easily integrated into our model. Finally, we present MCSRSI, the first publicly available dataset for multi-frame MCS detection, based on visible-channel images from the FY-4A satellite. Experiments on MCSRSI show that our proposed MCSDNet achieves the best performance on the MCS detection task compared with other baseline methods.
Submitted 26 April, 2024;
originally announced April 2024.
-
Dual Expert Distillation Network for Generalized Zero-Shot Learning
Authors:
Zhijie Rao,
**gcai Guo,
Xiaocheng Lu,
**gming Liang,
Jie Zhang,
Haozhao Wang,
Kang Wei,
Xiaofeng Cao
Abstract:
Zero-shot learning has consistently yielded remarkable progress via modeling nuanced one-to-one visual-attribute correlation. Existing studies resort to refining a uniform mapping function to align and correlate the sample regions and subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by introducing a simple yet effective approach, dubbed Dual Expert Distillation Network (DEDN), in which two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, the coarse expert, cExp, has a complete perceptual scope to coordinate visual-attribute similarity metrics across dimensions, while the fine expert, fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. The two experts cooperatively distill from each other to reach a mutual agreement during training. Meanwhile, we further equip DEDN with a newly designed backbone network, the Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit visual semantic knowledge. Experiments on various benchmark datasets establish a new state of the art.
Submitted 29 April, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
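One common way to make two networks "distill from each other", as in deep mutual learning, is to penalize the divergence between their predicted class distributions in both directions. The sketch below assumes each DEDN expert produces logits over the same set of candidate classes and uses a symmetric KL term; it is a generic illustration, not DEDN's actual loss, and the function names and temperature parameter are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_distillation_loss(coarse_logits: Sequence[float],
                             fine_logits: Sequence[float],
                             temperature: float = 1.0) -> float:
    """Symmetric KL between the coarse and fine experts' class distributions."""
    p_coarse = softmax(coarse_logits, temperature)
    p_fine = softmax(fine_logits, temperature)
    return kl_divergence(p_coarse, p_fine) + kl_divergence(p_fine, p_coarse)


print(mutual_distillation_loss([2.0, 0.5, -1.0], [0.0, 1.0, 0.5]))
```

The loss is zero exactly when both experts agree and grows as their predictions diverge. In a real training loop each expert would usually receive gradients from the term where it acts as the student, alongside its own classification loss.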
-
When Fuzzing Meets LLMs: Challenges and Opportunities
Authors:
Yu Jiang,
Jie Liang,
Fuchen Ma,
Yuanliang Chen,
Chi** Zhou,
Yuheng Shen,
Zhiyong Wu,
**gzhou Fu,
Mingzhe Wang,
ShanShan Li,
Quan Zhang
Abstract:
Fuzzing, a widely used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identify five major challenges of LLM-assisted fuzzing. To support our findings, we review the most recent papers from top-tier conferences and confirm that these challenges are widespread. As a remedy, we propose actionable recommendations for improving the application of LLMs to fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges.
Submitted 24 April, 2024;
originally announced April 2024.
-
From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models
Authors:
Qianyu He,
Jie Zeng,
Qianxi He,
Jiaqing Liang,
Yanghua Xiao
Abstract:
It is imperative for large language models (LLMs) to follow instructions with elaborate requirements (i.e., complex instruction following). Yet how to enhance the ability of LLMs to follow complex instructions with multiple constraints remains under-explored. To bridge this gap, we first study what training data is effective in enhancing the ability to follow complex constraints. We find that training LLMs with instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels. The improvement can even generalize to compositions of out-of-domain constraints. We further propose methods for obtaining and utilizing the effective training data. Finally, we conduct extensive experiments to demonstrate the effectiveness of our methods in terms of overall performance and training efficiency. We also show that our methods improve models' general instruction-following ability and generalize effectively across out-of-domain, in-domain, and adversarial settings, while maintaining general capabilities.
Submitted 18 June, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
-
Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision
Authors:
Mohammad Reza Hosseinzadeh Taher,
Michael B. Gotway,
Jianming Liang
Abstract:
Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning excels at learning multi-level feature spaces, but deep models often lack explicit coding of part-whole relations, a prominent property of medical imaging. To overcome this limitation, we introduce Adam-v2, a new self-supervised learning framework extending Adam [79] by explicitly incorporating part-whole hierarchies into its learning objectives through three key branches: (1) Localizability, acquiring discriminative representations to distinguish different anatomical patterns; (2) Composability, learning each anatomical structure in a parts-to-whole manner; and (3) Decomposability, comprehending each anatomical structure in a whole-to-parts manner. Experimental results across 10 tasks, compared with 11 baselines in zero-shot, few-shot transfer, and full fine-tuning settings, showcase Adam-v2's superior performance over large-scale medical models and existing SSL methods across diverse downstream tasks. The higher generality and robustness of Adam-v2's representations originate from its explicit construction of hierarchies for distinct anatomical structures from unlabeled medical images. Adam-v2 preserves a semantic balance of anatomical diversity and harmony in its embedding, yielding representations that are both generic and semantically meaningful, a property overlooked by existing SSL methods. All code and pretrained models are available at https://github.com/JLiangLab/Eden.
Submitted 24 April, 2024;
originally announced April 2024.
-
Enhancing High-Speed Cruising Performance of Autonomous Vehicles through Integrated Deep Reinforcement Learning Framework
Authors:
**hao Liang,
Kaidi Yang,
Chaopeng Tan,
**xiang Wang,
Guodong Yin
Abstract:
High-speed cruising scenarios with mixed traffic greatly challenge the road safety of autonomous vehicles (AVs). Unlike existing works that examine fundamental modules only in isolation, this work enhances AV safety in mixed-traffic high-speed cruising scenarios by proposing an integrated framework that synthesizes three fundamental modules, i.e., the behavioral decision-making, path-planning, and motion-control modules. Because the integrated framework increases system complexity, a bootstrapped deep Q-Network (DQN) is employed to enhance the deep exploration of the reinforcement learning method and achieve adaptive decision-making for AVs. Moreover, to make AV behavior understandable to surrounding human-driven vehicles (HDVs) and thereby prevent unexpected operations caused by misinterpretation, we derive an inverse reinforcement learning (IRL) approach that learns the reward function of skilled drivers for planning lane-changing paths. Such a design enables AVs to achieve a human-like tradeoff among multiple performance requirements. Simulations demonstrate that the proposed integrated framework can guide AVs to take safe actions while guaranteeing high-speed cruising performance.
Submitted 22 April, 2024;
originally announced April 2024.
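The core mechanism of a bootstrapped DQN, as generally described in the literature, is to keep K Q-value heads, pick one head at random at the start of each episode and act greedily with it, and train each head only on the transitions that its bootstrap mask selects. The toy sketch below shows just that bookkeeping in a single-state, tabular setting; the neural network, replay buffer, target networks, and the paper's driving state and action spaces are all omitted, and the class name and parameters are hypothetical.

```python
import random
from typing import List


class BootstrappedHeads:
    """K independent Q-value heads over a single toy state (hypothetical sketch)."""

    def __init__(self, num_heads: int, num_actions: int,
                 mask_prob: float = 0.5, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.num_heads = num_heads
        self.mask_prob = mask_prob
        self.q: List[List[float]] = [[0.0] * num_actions for _ in range(num_heads)]
        self.active = 0

    def start_episode(self) -> None:
        # Commit to one randomly chosen head for the whole episode (deep exploration).
        self.active = self.rng.randrange(self.num_heads)

    def act(self) -> int:
        # Greedy action under the active head (first maximum on ties).
        values = self.q[self.active]
        return max(range(len(values)), key=values.__getitem__)

    def sample_mask(self) -> List[int]:
        # Bernoulli bootstrap mask: which heads may learn from this transition.
        return [1 if self.rng.random() < self.mask_prob else 0
                for _ in range(self.num_heads)]

    def update(self, action: int, target: float, mask: List[int], lr: float = 0.1) -> None:
        # Move each selected head's estimate toward the target.
        for k in range(self.num_heads):
            if mask[k]:
                self.q[k][action] += lr * (target - self.q[k][action])


agent = BootstrappedHeads(num_heads=5, num_actions=3)
for episode in range(3):
    agent.start_episode()
    action = agent.act()
    reward = 1.0 if action == 2 else 0.0  # toy reward, not a driving objective
    agent.update(action, reward, agent.sample_mask())
```

Because the heads see different subsets of experience, they disagree, and committing to one head per episode produces temporally consistent exploration rather than per-step random dithering.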
-
GravitoMagneto-Hydrodynamics and Spacetime Turbulence in Early Universe
Authors:
Jiaxiang Liang,
Minghui Du,
Peng Xu
Abstract:
Based on the gravitoelectromagnetic formalism and inspired by the rich analogies between electrodynamics and general relativity, we go one step further along this line and suggest a new counterpart, in the gravitoelectromagnetic world, to electromagnetic physics: a counterpart of MagnetoHydroDynamics that could help us understand possible new physics in tightly bound spacetime-matter systems, such as extremely relativistic fluids in the early Universe. This new viewpoint also suggests a possible new form of spacetime-matter turbulence, which may be tested through gravitational-wave observations.
Submitted 22 April, 2024;
originally announced April 2024.
-
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
Authors:
Wenhao Huang,
Chenghao Peng,
Zhixu Li,
Jiaqing Liang,
Yanghua Xiao,
Liqian Wen,
Zulong Chen
Abstract:
Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages and the paradigm of combining LLMs with crawlers, which helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler can learn from erroneous actions and continuously prune HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper can be found at https://github.com/EZ-hwh/AutoCrawler.
Submitted 19 April, 2024;
originally announced April 2024.
-
Generalizable Face Landmarking Guided by Conditional Face Warping
Authors:
Jiayi Liang,
Haotian Liu,
Hongteng Xu,
Dixin Luo
Abstract:
As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study, we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images, the conditional face warper predicts a warping field from the real face to the stylized one, in which the face landmarker predicts the ending points of the warping field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy, we learn the face landmarker to minimize $i)$ the discrepancy between the stylized faces and the warped real ones and $ii)$ the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks, leading to a face landmarker with better generalizability. Code is available at https://plustwo0.github.io/project-face-landmarker.
Submitted 21 April, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?
Authors:
Rui Xu,
Xintao Wang,
Jiangjie Chen,
Siyu Yuan,
Xinfeng Yuan,
Jiaqing Liang,
Zulong Chen,
Xiaoqing Dong,
Yanghua Xiao
Abstract:
Can large language models substitute for humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions when given the preceding story in high-quality novels. Leveraging character analyses written by literary experts, we construct LIFECHOICE, a dataset comprising 1,401 character decision points from 395 books. We then conduct comprehensive experiments on LIFECHOICE with various LLMs and methods for LLM role-playing. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement. Hence, we further propose the CHARMAP method, which achieves a 6.01% increase in accuracy via persona-based memory retrieval. We will make our datasets and code publicly available.
Submitted 18 April, 2024;
originally announced April 2024.
-
Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience
Authors:
Haixia Han,
Tingyun Li,
Shisong Chen,
Jie Shi,
Chengyu Du,
Yanghua Xiao,
Jiaqing Liang,
Xin Lin
Abstract:
Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks, but they may generate inaccurate or false information in a confident tone. One possible solution is to equip LLMs with a confidence-expression capability, in which the expressed confidence is well aligned with the true probability that the generated answer is correct. However, accurately capturing the response uncertainty of LLMs is challenging, whether by leveraging their intrinsic ability or the signals from the output logits of answers. Therefore, drawing inspiration from cognitive diagnostics, we propose a method of Learning from Past experience (LePe) to enhance the capability for confidence expression. Specifically, we first identify three key problems: (1) How can the inherent confidence of the LLM be captured? (2) How can the LLM be taught to express confidence? (3) How can the confidence expression of the LLM be evaluated? We then devise three stages in LePe to address these problems. In addition, to accurately capture the confidence of an LLM when constructing the training data, we design a complete pipeline including question preparation and answer sampling. We also conduct experiments using the Llama family of LLMs to verify the effectiveness of our proposed method on four datasets.
Submitted 16 April, 2024;
originally announced April 2024.
-
Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction
Authors:
Zepeng Ding,
Wenhao Huang,
Jiaqing Liang,
Deqing Yang,
Yanghua Xiao
Abstract:
Relation triple extraction, which outputs a set of triples from long sentences, plays a vital role in knowledge acquisition. Large language models can accurately extract triples from simple sentences through few-shot learning or fine-tuning when given appropriate instructions. However, they often miss triples when extracting from complex sentences. In this paper, we design an evaluation-filtering framework that integrates large language models with small models for relational triple extraction tasks. The framework includes an evaluation model that can extract related entity pairs with high precision. We propose a simple labeling principle and a deep neural network to build the model, embedding its outputs as prompts into the extraction process of the large model. We conduct extensive experiments to demonstrate that the proposed method can assist large language models in obtaining more accurate extraction results, especially from complex sentences containing multiple relational triples. Our evaluation model can also be embedded into traditional extraction models to enhance their extraction precision on complex sentences.
Submitted 15 April, 2024;
originally announced April 2024.
-
ToNER: Type-oriented Named Entity Recognition with Generative Language Model
Authors:
Guochao Jiang,
Ziqin Luo,
Yuchen Shi,
Dixuan Wang,
Jiaqing Liang,
Deqing Yang
Abstract:
In recent years, fine-tuned generative models have been proven more powerful than the previous tagging-based or span-based models on the named entity recognition (NER) task. It has also been found that information related to entities, such as entity types, can prompt a model to perform NER better. However, it is not easy to determine in advance which entity types are actually present in a given sentence, and inputting too many potential entity types would inevitably distract the model. To exploit the benefit of entity types for the NER task, in this paper we propose a novel NER framework, namely ToNER, based on a generative model. In ToNER, a type matching model is first proposed to identify the entity types most likely to appear in the sentence. Then, we append a multiple binary classification task to fine-tune the generative model's encoder, so as to generate a refined representation of the input sentence. Moreover, we add an auxiliary task for the model to discover the entity types, which further fine-tunes the model to output more accurate results. Our extensive experiments on several NER benchmarks verify the effectiveness of the proposed strategies in ToNER that are oriented towards exploiting entity types.
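The abstract does not say how the selected types are fed to the generator. The following is a hypothetical sketch of the underlying idea that only the most likely entity types should be passed on: given per-type scores from some type-matching model (assumed here to be a precomputed dict), keep at most `k` types above a threshold and prepend them to the sentence. The prompt format and the names `select_types` and `build_input` are illustrative, not ToNER's.

```python
def select_types(type_scores: dict[str, float], k: int = 3,
                 threshold: float = 0.5) -> list[str]:
    """Keep the k highest-scoring entity types whose score passes the threshold."""
    ranked = sorted(type_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [t for t, s in ranked if s >= threshold][:k]


def build_input(sentence: str, types: list[str]) -> str:
    # Hypothetical input format: candidate types first, then the sentence.
    return f"types: {', '.join(types)} | sentence: {sentence}"
```

Capping the list at `k` reflects the concern that too many candidate types distract the model.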
Submitted 11 June, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
-
Interfacial reaction boosts thermal conductance of room-temperature integrated semiconductor interfaces stable up to 1100 C
Authors:
Zhe Cheng,
Xiaoyang Ji,
Zifeng Huang,
Yutaka Ohno,
Koji Inoue,
Yasusyohi Nagai,
Yoshiki Sakaida,
Hiroki Uratani,
Naoteru Shigekawa,
Jianbo Liang
Abstract:
Overheating has emerged as a primary challenge constraining the reliability and performance of next-generation high-performance electronics, such as chiplets and (ultra)wide bandgap electronics. Advanced heterogeneous integration not only constitutes a pivotal technique for fabricating these electronics but also offers potential solutions for thermal management. This study presents the integration of high thermal conductivity semiconductors, specifically, 3C-SiC thin films and diamond substrates, through a room-temperature surface-activated bonding technique. Notably, the thermal conductivity of the 3C-SiC films is among the highest for all semiconductor films which can be integrated near room temperature with similar thicknesses. Furthermore, following annealing, the interfaces between 3C-SiC and diamond demonstrate a remarkable enhancement in thermal boundary conductance (TBC), reaching up to approximately 300%, surpassing all other grown and bonded heterointerfaces. This enhancement is attributed to interfacial reactions, specifically the transformation of amorphous silicon into SiC upon interaction with diamond, which is further corroborated by picosecond ultrasonics measurements. Subsequent to annealing at 1100 °C, the achieved TBC (150 MW/(m²·K)) is record-high among all bonded diamond interfaces. Additionally, the visualization of large-area TBC, facilitated by femtosecond laser-based time-domain thermoreflectance measurements, shows the uniformity of the interfaces which are capable of withstanding temperatures as high as 1100 °C. Our research marks a significant advancement in the realm of thermally conductive heterogeneous integration, which is promising for enhanced cooling of next-generation electronics.
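To put the reported TBC in perspective, the temperature drop across an interface follows from the standard definition of thermal boundary conductance, $q = G\,\Delta T$. The heat flux used below (1 kW/cm² = 10⁷ W/m²) is an illustrative assumption, not a value from the study:

```latex
\Delta T = \frac{q}{G}
         = \frac{10^{7}~\text{W/m}^{2}}{1.5\times 10^{8}~\text{W/(m}^{2}\text{K)}}
         \approx 0.067~\text{K},
\qquad
R = \frac{1}{G} \approx 6.7\times 10^{-9}~\text{m}^{2}\text{K/W}.
```

A higher $G$ lowers the interfacial thermal resistance $R$ and hence the temperature rise at the bonded interface for a given heat flux.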
Submitted 13 April, 2024;
originally announced April 2024.
-
Large Language Model Can Continue Evolving From Mistakes
Authors:
Haokun Zhao,
Haixia Han,
Jie Shi,
Chengyu Du,
Jiaqing Liang,
Yanghua Xiao
Abstract:
As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for keeping Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains challenging to collect CPT data that addresses the knowledge deficiencies in models while maintaining adequate volume, and improving the efficiency of utilizing this data also presents significant difficulties. Inspired by the 'summarizing mistakes' learning skill, we propose the Continue Evolving from Mistakes (CEM) method, aiming to provide a data-efficient approach for collecting CPT data and continually improving LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To efficiently utilize these CPT data and mitigate forgetting, we design a novel CL training set construction paradigm that integrates parallel CIT and CPT data. Extensive experiments demonstrate the efficacy of the CEM method, achieving up to a 17% improvement in accuracy in the best case. Furthermore, additional experiments confirm the potential of combining CEM with catastrophic forgetting mitigation methods, enabling iterative and continual model evolution.
Submitted 17 June, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning
Authors:
Peipei Liu,
Gaosheng Wang,
Ying Tong,
Jian Liang,
Zhenquan Ding,
Hongsong Zhu
Abstract:
Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from a heavy computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits general NER into two stages: entity-span detection and entity classification. MsFNER comprises three processes: training, fine-tuning, and inference. In the training process, we separately train the best entity-span detection model and the best entity classification model on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During fine-tuning, we fine-tune both models on the support set of the target domain. In the inference process, for the unlabeled data, we first detect the entity spans, and their types are then jointly determined by the entity classification model and a KNN. We conduct experiments on the open FewNERD dataset, and the results demonstrate the advantages of MsFNER.
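The abstract says that span types are "jointly determined" by the classifier and a KNN, but not how. One common combination, assumed here purely for illustration, interpolates the classifier's class distribution with a KNN label distribution computed from labeled support embeddings: p = λ·p_clf + (1 − λ)·p_knn. The weight `lam`, the Euclidean metric, and the function names are hypothetical.

```python
import math
from collections import Counter


def knn_distribution(query, support, labels, classes, k=3):
    """Label distribution among the k nearest support embeddings (Euclidean)."""
    dists = sorted((math.dist(query, s), y) for s, y in zip(support, labels))
    n = min(k, len(dists))
    votes = Counter(y for _, y in dists[:n])
    return [votes[c] / n for c in classes]


def joint_predict(clf_probs, query, support, labels, classes, k=3, lam=0.5):
    """Interpolate classifier and KNN distributions; return the argmax class."""
    knn = knn_distribution(query, support, labels, classes, k)
    mixed = [lam * p + (1 - lam) * q for p, q in zip(clf_probs, knn)]
    best = max(range(len(classes)), key=lambda i: mixed[i])
    return classes[best]
```

With `lam=1.0` this reduces to the classifier alone, and with `lam=0.0` to pure nearest-neighbor voting over the support set.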
Submitted 10 April, 2024;
originally announced April 2024.
-
Feedback Stability Under Mixed Gain and Phase Uncertainty
Authors:
Jia** Liang,
Di Zhao,
Li Qiu
Abstract:
In this study, we investigate the robust feedback stability problem for multiple-input-multiple-output linear time-invariant systems involving sectored-disk uncertainty, namely, dynamic uncertainty subject to simultaneous gain and phase constraints. This problem is thereby called a sectored-disk problem. Employing a frequency-wise analysis approach, we derive a fundamental static matrix problem that serves as a key component in addressing the feedback stability. The study of this matrix problem heavily relies on the Davis-Wielandt (DW) shells of matrices, providing a profound insight into matrices subjected to simultaneous gain and phase constraints. This understanding is pivotal for establishing a less conservative sufficient condition for the matrix sectored-disk problem, from which we formulate several robust feedback stability conditions against sectored-disk uncertainty. Finally, several conditions based on linear matrix inequalities are developed for efficient computation and verification of feedback robust stability against sectored-disk uncertainty.
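For readers unfamiliar with the Davis-Wielandt (DW) shell, its standard definition for a matrix $A \in \mathbb{C}^{n\times n}$ is given below. Its first coordinate ranges over the numerical range (phase-type information) and its second over $\|Ax\|_2^2$ (gain-type information). The inequality on the right follows directly from the Cauchy-Schwarz inequality and is included only as a sanity check, not as a result from the paper.

```latex
\mathrm{DW}(A) = \left\{ \left( x^{*}Ax,\; x^{*}A^{*}Ax \right) : x \in \mathbb{C}^{n},\ \|x\|_2 = 1 \right\}
\subset \mathbb{C}\times\mathbb{R},
\qquad
\left| x^{*}Ax \right|^{2} \le \|x\|_2^{2}\,\|Ax\|_2^{2} = x^{*}A^{*}Ax .
```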
Submitted 8 April, 2024;
originally announced April 2024.
-
Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding
Authors:
Yanda Li,
Dixuan Wang,
Jiaqing Liang,
Guochao Jiang,
Qianyu He,
Yanghua Xiao,
Deqing Yang
Abstract:
Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks, but they still struggle with some complicated reasoning tasks, including logical reasoning. One non-negligible reason for LLMs' suboptimal performance on logical reasoning is their failure to correctly understand logical fallacies. To evaluate LLMs' capability of logical fallacy understanding (LFU), we propose in this paper five concrete tasks from three cognitive dimensions: WHAT, WHY, and HOW. For these LFU tasks, we have successfully constructed a new dataset, LFUD, based on GPT-4 with a small amount of human effort. Our extensive experiments show that LFUD can be used not only to evaluate LLMs' LFU capability, but also to fine-tune LLMs to obtain significantly enhanced performance on logical reasoning.
Submitted 4 April, 2024;
originally announced April 2024.
-
Buck You: Designing Easy-to-Onboard Blockchain Applications with Zero-Knowledge Login and Sponsored Transactions on Sui
Authors:
Eason Chen,
Zimo Xiao,
Justa Liang,
Damien Chen,
Pierce Hung,
Kostas Kryptos Chalkias
Abstract:
In this paper, we developed a blockchain application to demonstrate the functionality of Sui's recent innovations: Zero Knowledge Login and Sponsored Transactions. Zero Knowledge Login allows users to create and access their blockchain wallets just with their OAuth accounts (e.g., Google, Facebook, Twitch), while Sponsored Transactions eliminate the need for users to prepare transaction fees, as they can delegate fees to sponsors' accounts. Additionally, thanks to Sui's Storage Rebate feature, sponsors in Sponsored Transactions can profit from the sponsorship, achieving a win-win and sustainable service model. Zero Knowledge Login and Sponsored Transactions are pivotal in overcoming key challenges novice blockchain users face, particularly in managing private keys and depositing initial transaction fees. By addressing these challenges in the user experience of blockchain, Sui makes the blockchain more accessible and engaging for novice users and paves the way for the broader adoption of blockchain applications in everyday life.
Submitted 4 April, 2024;
originally announced April 2024.
-
PoCo: Point Context Cluster for RGBD Indoor Place Recognition
Authors:
**g Liang,
Zhuo Deng,
Zheming Zhou,
Omid Ghasemalizadeh,
Dinesh Manocha,
Min Sun,
Cheng-Hao Kuo,
Arnie Sen
Abstract:
We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning. Moreover, we develop the architecture by integrating both color and geometric modalities into the point features to enhance the global descriptor representation. We conducted evaluations on the public datasets ScanNet-PR and ARKit with 807 and 5047 scenarios, respectively. PoCo achieves SOTA performance: on ScanNet-PR, we achieve an R@1 of 64.63%, a 5.7% relative improvement over the best published result, CGis (61.12%); on ARKit, we achieve an R@1 of 45.12%, a 13.3% relative improvement over CGis (39.82%). In addition, PoCo is more efficient than CGis at inference (1.75× faster), and we demonstrate the effectiveness of PoCo in recognizing places within a real-world laboratory environment.
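Recall@1 (R@1) is the fraction of queries whose single nearest database descriptor is a correct match, and the quoted gains are relative to the previous best: (64.63 − 61.12)/61.12 ≈ 5.7% and (45.12 − 39.82)/39.82 ≈ 13.3%. The sketch below computes both. The toy descriptors in the checks are made up, and Euclidean distance is an assumption, since the abstract does not state PoCo's matching metric.

```python
import math


def recall_at_1(queries, database, positives):
    """Fraction of queries whose nearest database entry is a true match.

    positives[i] is the set of database indices counted as correct for query i.
    """
    hits = 0
    for i, q in enumerate(queries):
        nearest = min(range(len(database)), key=lambda j: math.dist(q, database[j]))
        if nearest in positives[i]:
            hits += 1
    return hits / len(queries)


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, e.g. 0.057 for 5.7%."""
    return (new - old) / old
```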
Submitted 3 April, 2024;
originally announced April 2024.
-
Ground-to-UAV sub-Terahertz channel measurement and modeling
Authors:
Da Li,
Peian Li,
Jiabiao Zhao,
Jianjian Liang,
Jiacheng Liu,
Guohao Liu,
Yuanshuai Lei,
Wenbo Liu,
Jianqin Deng,
Fuyong Liu,
Jianjun Ma
Abstract:
Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications have been expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless channels leveraging UAVs remain underexplored. This work delves into a ground-to-UAV channel at 140 GHz, with a specific focus on the influence of UAV hovering behavior on channel performance. Employing experimental measurements through an unmodulated channel setup and a geometry-based stochastic model (GBSM) that integrates three-dimensional positional coordinates and beamwidth, this work evaluates the impact of UAV dynamic movements and antenna orientation on channel performance. Our findings highlight the minimal impact of UAV orientation adjustments on channel performance and underscore the diminishing necessity for precise alignment between UAVs and ground stations as beamwidth increases.
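The finding that precise alignment matters less as beamwidth grows can be illustrated with the common Gaussian approximation of an antenna main lobe, in which a pointing error θ costs about 12(θ/θ₃dB)² dB of gain, where θ₃dB is the full half-power beamwidth. This textbook approximation is assumed here for illustration and is not the paper's GBSM; the free-space path loss uses the Friis formula.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def fspl_db(distance_m: float, freq_hz: float) -> float:
    """Free-space path loss (Friis): 20*log10(4*pi*d*f/c)."""
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / C)


def misalignment_loss_db(error_deg: float, beamwidth_3db_deg: float) -> float:
    """Gain loss of a Gaussian main lobe exp(-4 ln2 (theta/theta_3dB)^2), in dB."""
    x = error_deg / beamwidth_3db_deg
    return 10 * math.log10(math.exp(4 * math.log(2) * x * x))
```

For the same 5° pointing error, a 10° beam loses about 3 dB (the half-power point), while a 20° beam loses only about 0.75 dB. At 140 GHz, the free-space path loss over 10 m is roughly 95.4 dB.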
Submitted 28 June, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Proximal Oracles for Optimization and Sampling
Authors:
Jiaming Liang,
Yongxin Chen
Abstract:
We consider convex optimization with non-smooth objective function and log-concave sampling with non-smooth potential (negative log density). In particular, we study two specific settings where the convex objective/potential function is either semi-smooth or in composite form as the finite sum of semi-smooth components. To overcome the challenges caused by non-smoothness, our algorithms employ two powerful proximal frameworks in optimization and sampling: the proximal point framework for optimization and the alternating sampling framework (ASF) that uses Gibbs sampling on an augmented distribution. A key component of both optimization and sampling algorithms is the efficient implementation of the proximal map by the regularized cutting-plane method. We establish the iteration-complexity of the proximal map in both semi-smooth and composite settings. We further propose an adaptive proximal bundle method for non-smooth optimization. The proposed method is universal since it does not need any problem parameters as input. Additionally, we develop a proximal sampling oracle that resembles the proximal map in optimization and establish its complexity using a novel technique (a modified Gaussian integral). Finally, we combine this proximal sampling oracle and ASF to obtain a Markov chain Monte Carlo method with non-asymptotic complexity bounds for sampling in semi-smooth and composite settings.
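The proximal point framework iterates $x_{k+1} = \operatorname{prox}_{\lambda f}(x_k) = \arg\min_y \{ f(y) + \|y - x_k\|^2/(2\lambda) \}$. The paper evaluates the proximal map with a regularized cutting-plane method. The sketch below instead assumes the non-smooth $f(x) = \|x\|_1$, whose proximal map has the closed form of soft thresholding, so it illustrates only the outer proximal point loop.

```python
import math


def prox_l1(v: list[float], lam: float) -> list[float]:
    """Proximal map of lam*||.||_1: coordinate-wise soft thresholding."""
    return [math.copysign(max(abs(vi) - lam, 0.0), vi) for vi in v]


def proximal_point(x0: list[float], lam: float, iters: int) -> list[float]:
    """Proximal point method x_{k+1} = prox_{lam f}(x_k), with f = ||.||_1 assumed."""
    x = list(x0)
    for _ in range(iters):
        x = prox_l1(x, lam)
    return x
```

Starting from (5, −2.5, 0.3) with λ = 1, three iterations give (2, 0, 0), and the non-smooth minimizer 0 is reached exactly after five.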
Submitted 2 April, 2024;
originally announced April 2024.
-
The radiative decay of scalar glueball from lattice QCD
Authors:
**tao Zou,
Long-Cheng Gui,
Ying Chen,
Jian Liang,
Xiangyu Jiang,
Wen Qin
Abstract:
We perform the first lattice QCD study of the radiative decay of the scalar glueball to the vector meson $\phi$ in the quenched approximation. The calculations are carried out on three gauge ensembles with different lattice spacings, which enables a continuum extrapolation. We first revisit the radiative $J/\psi$ decay into the scalar glueball $G$ and obtain the partial decay width $\Gamma(J/\psi\to\gamma G)=0.449(44)~\text{keV}$ and the branching fraction $\text{Br}(J/\psi\to\gamma G) = 4.8(5)\times 10^{-3}$, which agree with previous lattice results. We then extend the calculation to the process $G\to\gamma\phi$ and obtain the partial decay width $\Gamma(G\to\gamma\phi)= 0.074(47)~\text{keV}$. This implies that the combined branching fraction of $J/\psi\to\gamma G\to\gamma\gamma\phi$ is as small as $\mathcal{O}(10^{-9})$, so this process is unlikely to be detected by the BESIII experiment even with its large $J/\psi$ sample of $\mathcal{O}(10^{10})$ events. With the vector meson dominance model, the two-photon decay width of the scalar glueball is estimated to be $\Gamma(G\to\gamma\gamma)=0.53(46)~\text{eV}$, which results in a large stickiness $S(G)\sim \mathcal{O}(10^4)$ for the scalar glueball, assuming the stickiness of $f_2(1270)$ to be one.
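The quoted numbers can be checked against each other. Dividing the partial width by the total $J/\psi$ width reproduces the branching fraction. The total width of about 92.6 keV is the PDG value and is not stated in the abstract. Multiplying an $\mathcal{O}(10^{-9})$ combined branching fraction by an $\mathcal{O}(10^{10})$ sample gives only an order-of-magnitude event count, before the $\phi$ decay branching fraction and detection efficiency are applied:

```latex
\text{Br}(J/\psi\to\gamma G) = \frac{\Gamma(J/\psi\to\gamma G)}{\Gamma_{J/\psi}}
  \approx \frac{0.449~\text{keV}}{92.6~\text{keV}} \approx 4.8\times 10^{-3},
\qquad
N_{\gamma\gamma\phi} \sim \mathcal{O}(10^{10}) \times \mathcal{O}(10^{-9}) = \mathcal{O}(10).
```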
Submitted 4 April, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
Socially Aware Robot Navigation through Scoring Using Vision-Language Models
Authors:
Daeun Song,
**g Liang,
Amirreza Payandeh,
Xuesu Xiao,
Dinesh Manocha
Abstract:
We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's trajectory in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner. Our overall approach reduces reliance on large datasets (for training) and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least 36.37% improvement in average success rate and 20.00% improvement in average collision rate in the four social navigation scenarios. The user study score shows that VLM-Social-Nav generates the most socially compliant navigation behavior.
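The abstract describes a VLM-based scoring module whose cost term is combined with the underlying planner. A hypothetical sketch of that combination follows. Each candidate action has a planner cost, a stand-in function returns a social-compliance score in [0, 1] (in the real system this would come from prompting the VLM), and the action that minimizes planner cost plus a weighted social penalty is chosen. The weight `w_social` and all names are assumptions.

```python
from typing import Callable, Sequence


def choose_action(actions: Sequence[str],
                  planner_cost: Callable[[str], float],
                  social_score: Callable[[str], float],
                  w_social: float = 1.0) -> str:
    """Pick the action minimizing planner cost + w_social * (1 - social score)."""
    def total(a: str) -> float:
        # social_score is assumed to lie in [0, 1]; 1 means fully socially compliant.
        return planner_cost(a) + w_social * (1.0 - social_score(a))
    return min(actions, key=total)
```

Setting `w_social=0` recovers the plain planner, which makes the effect of the social term easy to ablate.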
Submitted 29 March, 2024;
originally announced April 2024.
-
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions
Authors:
Runhao Zeng,
Xiaoyong Chen,
Jiaming Liang,
Huisi Wu,
Guangzhong Cao,
Yong Guo
Abstract:
Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.
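The abstract names a FrameDrop augmentation without detailing it. One plausible reading, assumed here purely for illustration, is to randomly corrupt frames of the per-frame feature sequence during training, for example by zeroing them, so the detector learns to tolerate missing frames. The drop probability `p`, the zero-filling choice, and the function name are hypothetical.

```python
import random


def frame_drop(frames: list[list[float]], p: float,
               rng: random.Random) -> tuple[list[list[float]], list[bool]]:
    """Zero out each frame independently with probability p.

    Returns the augmented sequence (the input is not modified) and a mask
    marking which frames were dropped.
    """
    out, dropped = [], []
    for f in frames:
        drop = rng.random() < p
        out.append([0.0] * len(f) if drop else list(f))
        dropped.append(drop)
    return out, dropped
```

Passing an explicit `random.Random` keeps the augmentation reproducible. The returned mask could feed a consistency term between clean and corrupted inputs.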
Submitted 29 March, 2024;
originally announced March 2024.
-
Mind the Domain Gap: a Systematic Analysis on Bioacoustic Sound Event Detection
Authors:
**hua Liang,
Ines Nolasco,
Burooj Ghani,
Huy Phan,
Emmanouil Benetos,
Dan Stowell
Abstract:
Detecting the presence of animal vocalisations in nature is essential to study animal populations and their behaviors. A recent development in the field is the introduction of the task known as few-shot bioacoustic sound event detection, which aims to train a versatile animal sound detector using only a small set of audio samples. Previous efforts in this area have utilized different architectures and data augmentation techniques to enhance model performance. However, these approaches have not fully bridged the domain gap between source and target distributions, limiting their applicability in real-world scenarios. In this work, we introduce a new dataset designed to augment the diversity and breadth of classes available for few-shot bioacoustic event detection, building on the foundations of our previous datasets. To establish a robust baseline system tailored for the DCASE 2024 Task 5 challenge, we delve into an array of acoustic features and adopt negative hard sampling as our primary domain adaptation strategy. This approach, chosen in alignment with the challenge's guidelines that necessitate the independent treatment of each audio file, sidesteps the use of transductive learning to ensure compliance while aiming to enhance the system's adaptability to domain shifts. Our experiments show that the proposed baseline system achieves better performance than the vanilla prototypical network. The findings also confirm the effectiveness of each domain adaptation method by ablating different components within the networks. This highlights the potential to improve few-shot bioacoustic sound event detection by further reducing the impact of domain shift.
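A prototypical network represents each class by the mean embedding of its few labeled examples and labels a query with its nearest prototype. The hard-negative idea can be sketched as selecting, from candidate negative segments, those whose embeddings lie closest to the positive prototype. Embeddings are plain vectors here and Euclidean distance is assumed. This is an illustration, not the DCASE baseline code.

```python
import math


def prototype(embeddings: list[list[float]]) -> list[float]:
    """Class prototype: element-wise mean of the support embeddings."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def hardest_negatives(pos_proto: list[float], negatives: list[list[float]],
                      k: int) -> list[list[float]]:
    """Return the k negative embeddings closest to the positive prototype."""
    return sorted(negatives, key=lambda e: math.dist(e, pos_proto))[:k]


def classify(query: list[float], prototypes: dict[str, list[float]]) -> str:
    """Assign the label of the nearest prototype."""
    return min(prototypes, key=lambda c: math.dist(query, prototypes[c]))
```

Building the negative prototype from the hardest negatives, rather than from all negatives, pushes the decision boundary toward the confusable background segments.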
Submitted 27 March, 2024;
originally announced March 2024.
-
Large Language Models for Education: A Survey and Outlook
Authors:
Shen Wang,
Tianlong Xu,
Hang Li,
Chaoli Zhang,
Joleen Liang,
Jiliang Tang,
Philip S. Yu,
Qingsong Wen
Abstract:
The advent of Large Language Models (LLMs) has brought in a new era of possibilities in the realm of education. This survey paper summarizes the various technologies of LLMs in educational settings from multifaceted perspectives, encompassing student and teacher assistance, adaptive learning, and commercial tools. We systematically review the technological advancements in each perspective, organize related datasets and benchmarks, and identify the risks and challenges associated with deploying LLMs in education. Furthermore, we outline future research opportunities, highlighting the potential promising directions. Our survey aims to provide a comprehensive technological picture for educators, researchers, and policymakers to harness the power of LLMs to revolutionize educational practices and foster a more effective personalized learning environment.
Submitted 1 April, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting
Authors:
Yu** Tang,
Peijie Dong,
Zhenheng Tang,
Xiaowen Chu,
Junwei Liang
Abstract:
Combining CNNs or ViTs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge; CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. The emergence of recent Mamba-based architectures has been met with enthusiasm for their exceptional long-sequence modeling capabilities, surpassing established vision models in efficiency and accuracy, which motivates us to develop an innovative architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks effectively. Our extensive evaluations show that our proposed approach secures competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyu**tang/VMRNN-PyTorch.
Submitted 29 June, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
Authors:
Zicong Fan,
Takehiko Ohkawa,
Linlin Yang,
Nie Lin,
Zhishan Zhou,
Shihao Zhou,
Jiajun Liang,
Zhong Gao,
Xuanyang Zhang,
Xue Zhang,
Fei Li,
Liu Zheng,
Feng Lu,
Karim Abou Zeid,
Bastian Leibe,
Jeongwan On,
Seungryul Baek,
Aditya Prakash,
Saurabh Gupta,
Kun He,
Yoichi Sato,
Otmar Hilliges,
Hyung ** Chang,
Angela Yao
Abstract:
We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions.
Submitted 25 March, 2024;
originally announced March 2024.
-
Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases
Authors:
Wenhao Huang,
Qianyu He,
Zhixu Li,
Jiaqing Liang,
Yanghua Xiao
Abstract:
Definition bias is a negative phenomenon that can mislead models. Definition bias in information extraction appears not only across datasets from different domains but also within datasets sharing the same domain. We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets. To systematically investigate definition bias, we conduct three probing experiments to quantitatively analyze it and discover the limitations of unified information extraction and large language models in solving definition bias. To mitigate definition bias in information extraction, we propose a multi-stage framework consisting of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation. Experimental results demonstrate the effectiveness of our framework in addressing definition bias. Resources of this paper can be found at https://github.com/EZ-hwh/definition-bias
Submitted 24 March, 2024;
originally announced March 2024.
-
Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning
Authors:
Siyuan Liang,
Kuanrong Liu,
Jiajun Gong,
Jiawei Liang,
Yuan Xun,
Ee-Chien Chang,
Xiaochun Cao
Abstract:
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features using the complementary strengths of various data modalities. However, the open nature of such systems inadvertently increases the possibility of backdoor attacks. These attacks subtly embed malicious behaviors within the model during training, which can be activated by specific triggers in the inference phase, posing significant security risks. Although existing fine-tuning countermeasures reduce the adverse impacts of such attacks, these defenses often degrade the clean accuracy and require the construction of extensive clean training pairs. In this paper, we explore the possibility of a lower-cost defense from the perspective of model unlearning, that is, whether the model can be made to quickly unlearn backdoor threats (UBT) by constructing a small set of poisoned samples. Specifically, we strengthen the backdoor shortcuts to discover suspicious samples through overfitting training prioritized by weak similarity samples. Building on the initial identification of suspicious samples, we introduce an innovative token-based localized forgetting training regime. This technique specifically targets the poisoned aspects of the model, applying a focused effort to unlearn the backdoor associations while trying not to damage the integrity of the overall model. Experimental results show that our method not only ensures a minimal success rate for attacks, but also preserves the model's high clean accuracy.
Submitted 24 March, 2024;
originally announced March 2024.
-
Adversarially Masked Video Consistency for Unsupervised Domain Adaptation
Authors:
Xiaoyu Zhu,
Junwei Liang,
Po-Yao Huang,
Alex Hauptmann
Abstract:
We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domain. The mask generator, conversely, aims at producing challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module to learn class-discriminative representations. It enforces the prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmark.
Submitted 24 March, 2024;
originally announced March 2024.
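The Masked Consistency Learning module above "enforces the prediction consistency between the masked target videos and their full forms." A minimal sketch of one way to express such a consistency term, assuming (this is not stated in the abstract) a KL divergence with the full-video prediction as the target distribution; `masked_consistency_loss` and `softmax` are hypothetical helper names, and plain NumPy stands in for a training framework.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def masked_consistency_loss(logits_full, logits_masked, eps=1e-12):
    """Mean KL(p_full || p_masked) over a batch of class logits (assumed loss form).

    The full-video prediction acts as the target distribution; in a training
    framework it would commonly be detached (no gradient) so that only the
    masked branch is pulled toward it.
    """
    p = softmax(logits_full)
    q = softmax(logits_masked)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))


# Hypothetical logits for two target clips over three classes.
full = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
masked = np.array([[1.2, 0.9, -0.5], [0.0, 1.0, 0.8]])
print(masked_consistency_loss(full, masked))
```

The loss is zero only when the masked clip yields the same class distribution as the full clip, which is the behavior the abstract describes. Whether the actual module uses KL, cross-entropy with pseudo-labels, or another distance is not specified there.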
-
Bi-Level Control of Weaving Sections in Mixed Traffic Environments with Connected and Automated Vehicles
Authors:
Longhao Yan,
**hao Liang,
Kaidi Yang
Abstract:
Connected and automated vehicles (CAVs) can be beneficial for improving the operation of highway bottlenecks such as weaving sections. This paper proposes a bi-level control approach based on an upper-level deep reinforcement learning controller and a lower-level model predictive controller to coordinate the lane changes of a mixed fleet of CAVs and human-driven vehicles (HVs) in weaving sections. The upper level represents a roadside controller that collects vehicular information from the entire weaving section and determines the control weights used in the lower-level controller. The lower level is implemented within each CAV, which takes the control weights from the upper-level controller and generates the acceleration and steering angle for individual CAVs based on the local situation. The lower-level controller further incorporates an HV trajectory predictor, which is capable of handling the dynamic topology of vehicles in weaving scenarios with intensive mandatory lane changes. A case study inspired by a real weaving section in Basel, Switzerland, shows that our method consistently outperforms state-of-the-art benchmarks.
Submitted 24 March, 2024;
originally announced March 2024.
-
Ideal spin-polarized Weyl-half-semimetal with a single pair of Weyl points in half-Heusler compounds XCrTe (X=K, Rb)
Authors:
Hongshuang Liu,
** Cao,
Zeying Zhang,
Jiashuo Liang,
Liying Wang,
Shengyuan A. Yang
Abstract:
Realizing an ideal Weyl semimetal state with a single pair of Weyl points has been a long-sought goal in the field of topological semimetals. Here, we reveal such a state in the Cr-based half-Heusler compounds XCrTe (X=K, Rb). We show that these materials have a half-metal ground state, with the Fermi level crossing only one spin channel. Importantly, the Fermi surface is clean, consisting of the minimal number (i.e., a single pair) of spin-polarized Weyl points, so the state represents an ideal Weyl half semimetal. We show that the locations of the two Weyl points and the associated Chern vector can be flexibly tuned by rotating the magnetization vector. The minimal surface Fermi arc pattern and its contribution to anomalous Hall transport are discussed. Our finding offers an ideal material platform for exploring magnetic Weyl fermions, which will also facilitate the interplay between Weyl physics and spintronics.
Submitted 24 March, 2024;
originally announced March 2024.
-
Particip-AI: A Democratic Surveying Framework for Anticipating Future AI Use Cases, Harms and Benefits
Authors:
Jimin Mun,
Liwei Jiang,
Jenny Liang,
Inyoung Cheong,
Nicole DeCario,
Ye** Choi,
Tadayoshi Kohno,
Maarten Sap
Abstract:
General purpose AI, such as ChatGPT, seems to have lowered the barriers for the public to use AI and harness its power. However, the governance and development of AI still remain in the hands of a few, and the pace of development is accelerating without proper assessment of risks. As a first step towards democratic governance and risk assessment of AI, we introduce Particip-AI, a framework to gather current and future AI use cases and their harms and benefits from the non-expert public. Our framework allows us to study more nuanced and detailed public opinions on AI through collecting use cases, surfacing diverse harms through risk assessment under alternate scenarios (i.e., developing and not developing a use case), and illuminating tensions over AI development through making a concluding choice on its development. To showcase the promise of our framework towards guiding democratic AI, we gather responses from 295 demographically diverse participants. We find that participants' responses emphasize applications for personal life and society, contrasting with most current AI development's business focus. This shows the value of surfacing diverse harms that are complementary to expert assessments. Furthermore, we found that perceived impact of not developing use cases predicted participants' judgements of whether AI use cases should be developed, and highlighted lay users' concerns of techno-solutionism. We conclude with a discussion on how frameworks like Particip-AI can further guide democratic AI governance and regulation.
Submitted 21 March, 2024;
originally announced March 2024.
-
Connection between galaxy morphology and dark-matter halo structure I: a running threshold for thin discs and size predictors from the dark sector
Authors:
**ning Liang,
Fangzhou Jiang,
Houjun Mo,
Andrew Benson,
Avishai Dekel,
Noa Tavron,
Philip F. Hopkins,
Luis C. Ho
Abstract:
We present a series of studies on the connection between galaxy morphology and the structure of host dark-matter (DM) haloes using cosmological simulations. In this work, we introduce a new kinematic decomposition scheme that features physical identification of morphological components, enabling robust separation of thin and thick discs; and measure a wide range of halo properties, including their locations in the cosmic web, internal structures, and assembly histories. Our analysis of the TNG50 simulation reveals that the orbital-circularity threshold for disc differentiation varies across galaxies, with systematic trends in mass and redshift, so the widely used decomposition method with constant circularity cuts is oversimplified and underestimates thin discs at JWST redshifts. The energy threshold between the stellar halo and the inner galaxy is also a function of mass and redshift, minimizing at the sub-Galactic halo mass, where the circularity threshold peaks. Revisiting the issue of galaxy size predictor, we show that disc sizes in TNG50 exhibit correlations with three structural parameters besides virial mass and redshift: 1) a positive correlation with halo spin $λ$ across redshifts -- stronger than previously reported for zoom-in simulations but still weaker than the simple $r_{1/2}/R_{\rm vir} \propto λ$ scaling; 2) an anti-correlation with DM concentration $c$ that is well described by $r_{1/2}/R_{\rm vir} \propto c^{-0.7}$ even when $c$ is measured in the DM only run; 3) more actively accreting haloes having slightly larger discs, as well as more significant stellar haloes and lower thin-to-thick ratio. Disc mass fraction is higher in rounder haloes and in cosmic knots and filaments, implying that disc development needs both stable halo conditions and continuous material supply. Our methodology is public and adaptable to other simulations.
Submitted 21 March, 2024;
originally announced March 2024.
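To make the concentration scaling in the abstract above concrete, here is a worked example. It compares two haloes that differ only in concentration. Holding virial mass, redshift and spin fixed is an assumption made for illustration. Under $r_{1/2}/R_{\rm vir} \propto c^{-0.7}$, doubling $c$ gives:

```latex
\frac{r_{1/2}}{R_{\rm vir}} \propto c^{-0.7}
\quad\Longrightarrow\quad
\frac{\left(r_{1/2}/R_{\rm vir}\right)\big|_{2c}}{\left(r_{1/2}/R_{\rm vir}\right)\big|_{c}}
= 2^{-0.7} = e^{-0.7\ln 2} \approx 0.62
```

So a halo twice as concentrated would host a disc whose size, relative to its virial radius, is roughly 38% smaller, other things being equal.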
-
Developing and Deploying Industry Standards for Artificial Intelligence in Education (AIED): Challenges, Strategies, and Future Directions
Authors:
Richard Tong,
Haoyang Li,
Joleen Liang,
Qingsong Wen
Abstract:
The adoption of Artificial Intelligence in Education (AIED) holds the promise of revolutionizing educational practices by offering personalized learning experiences, automating administrative and pedagogical tasks, and reducing the cost of content creation. However, the lack of standardized practices in the development and deployment of AIED solutions has led to fragmented ecosystems, which presents challenges in interoperability, scalability, and ethical governance. This article aims to address the critical need to develop and implement industry standards in AIED, offering a comprehensive analysis of the current landscape, challenges, and strategic approaches to overcome these obstacles. We begin by examining applications of AIED across educational settings and identify key areas lacking in standardization, including system interoperability, ontology mapping, data integration, evaluation, and ethical governance. Then, we propose a multi-tiered framework for establishing robust industry standards for AIED. In addition, we discuss methodologies for the iterative development and deployment of standards, incorporating feedback loops from real-world applications to refine and adapt standards over time. The paper also highlights the role of emerging technologies and pedagogical theories in shaping future standards for AIED. Finally, we outline a strategic roadmap for stakeholders to implement these standards, fostering a cohesive and ethical AIED ecosystem. By establishing comprehensive industry standards, such as those by IEEE Artificial Intelligence Standards Committee (AISC) and International Organization for Standardization (ISO), we can accelerate and scale AIED solutions to improve educational outcomes, ensuring that technological advances align with the principles of inclusivity, fairness, and educational excellence.
Submitted 25 March, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
InfNeRF: Towards Infinite Scale NeRF Rendering with O(log n) Space Complexity
Authors:
Jiabin Liang,
Lanqing Zhang,
Zhuoran Zhao,
Xiangyu Xu
Abstract:
The conventional mesh-based Level of Detail (LoD) technique, exemplified by applications such as Google Earth and many game engines, can holistically represent a large scene, even the entire Earth, and achieves rendering with a space complexity of O(log n). This constrained data requirement not only enhances rendering efficiency but also facilitates dynamic data fetching, thereby enabling a seamless 3D navigation experience for users. In this work, we extend this proven LoD technique to Neural Radiance Fields (NeRF) by introducing an octree structure to represent the scenes in different scales. This innovative approach provides a mathematically simple and elegant representation with a rendering space complexity of O(log n), aligned with the efficiency of mesh-based LoD techniques. We also present a novel training strategy that maintains a complexity of O(n). This strategy allows for parallel training with minimal overhead, ensuring the scalability and efficiency of our proposed method. Our contribution is not only in extending the capabilities of existing techniques but also in establishing a foundation for scalable and efficient large-scale scene representation using NeRF and octree structures.
Submitted 21 March, 2024;
originally announced March 2024.
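The O(log n) figure in the abstract above comes from the octree: each level splits a cell into 8 children, so a scene with n finest-level cells needs only about log_8 n levels, and a lookup touches one cell per level. The sketch below illustrates only that counting argument. It assumes a complete octree over the unit cube and uses hypothetical helpers `levels_needed` and `path_to_point`. It is not the InfNeRF implementation, which also has to choose levels per view and store a NeRF per node.

```python
def levels_needed(n_leaves: int) -> int:
    """Depth of the smallest complete octree (8 children per node) with >= n_leaves leaves."""
    depth, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= 8
        depth += 1
    return depth


def path_to_point(point, depth):
    """Cells (level, ix, iy, iz) containing `point` in the unit cube, from the root
    (level 0) down to `depth`. Exactly one cell per level is visited."""
    path = []
    for level in range(depth + 1):
        cells = 2 ** level  # cells per axis at this level
        ix, iy, iz = (min(int(coord * cells), cells - 1) for coord in point)
        path.append((level, ix, iy, iz))
    return path


n = 8 ** 6  # hypothetical scene with 262,144 finest-level cells
d = levels_needed(n)
print(d, len(path_to_point((0.1, 0.2, 0.3), d)))
```

Multiplying the number of leaf cells by 8 adds just one level, and so one more node on any root-to-leaf path. That is the logarithmic growth the mesh-based LoD comparison relies on.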
-
Picotesla-sensitivity microcavity optomechanical magnetometry
Authors:
Zhi-Gang Hu,
Yi-Meng Gao,
Jian-Fei Liu,
Hao Yang,
Min Wang,
Yuechen Lei,
Xin Zhou,
**cheng Li,
Xuening Cao,
****g Liang,
Chao-Qun Hu,
Zhilin Li,
Yong-Chang Lau,
Jian-Wang Cai,
Bei-Bei Li
Abstract:
Cavity optomechanical systems have enabled precision sensing of magnetic fields, by leveraging the optical resonance-enhanced readout and mechanical resonance-enhanced response. Previous studies have successfully achieved scalable and reproducible microcavity optomechanical magnetometry (MCOM) by incorporating Terfenol-D thin films into high-quality ($Q$) factor whispering gallery mode (WGM) microcavities. However, the sensitivity was limited to 585 pT/Hz$^{1/2}$, over 20 times inferior to those using Terfenol-D particles. In this work, we propose and demonstrate a high-sensitivity and scalable MCOM approach by sputtering a FeGaB thin film onto a high-$Q$ SiO$_2$ WGM microdisk. Theoretical studies are conducted to explore the magnetic actuation constant and noise-limited sensitivity by varying the parameters of the FeGaB film and SiO$_2$ microdisk. Multiple magnetometers with different radii are fabricated and characterized. By utilizing a microdisk with a radius of 355 $μ$m and a thickness of 1 $μ$m, along with a FeGaB film with a radius of 330 $μ$m and a thickness of 1.3 $μ$m, we have achieved a remarkable peak sensitivity of 1.68 pT/Hz$^{1/2}$ at 9.52 MHz. This represents a significant improvement of over two orders of magnitude compared with previous studies employing sputtered Terfenol-D film. Notably, the magnetometer operates without a bias magnetic field, thanks to the remarkable soft magnetic properties of the FeGaB film. Furthermore, as a proof-of-concept, we have demonstrated the real-time measurement of a pulsed magnetic field simulating the corona current in a high-voltage transmission line using our developed magnetometer. These high-sensitivity magnetometers hold great potential for various applications, such as magnetic induction tomography and corona current monitoring.
Submitted 21 March, 2024;
originally announced March 2024.
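The abstract above claims an improvement of "over two orders of magnitude" relative to the sputtered Terfenol-D film. The short calculation below checks that claim using only the two sensitivities quoted there. It also computes the bound implied by the "over 20 times inferior" comparison with Terfenol-D particles. The variable names are illustrative.

```python
import math

terfenol_film = 585.0  # pT/Hz^1/2, earlier sputtered Terfenol-D film (from the abstract)
fegab_film = 1.68      # pT/Hz^1/2, FeGaB film peak sensitivity at 9.52 MHz (from the abstract)

# Lower values mean better sensitivity, so the improvement factor is old / new.
ratio = terfenol_film / fegab_film
orders = math.log10(ratio)

# "Over 20 times inferior" to particle-based devices implies those reached
# a sensitivity better (lower) than 585 / 20 pT/Hz^1/2.
particle_bound = terfenol_film / 20

print(f"{ratio:.0f}x improvement, {orders:.2f} orders of magnitude")
print(f"particle-based devices: better than {particle_bound:.2f} pT/Hz^1/2")
```

The factor is roughly 348, or about 2.5 orders of magnitude, consistent with the abstract. The FeGaB figure is a peak value at a single frequency, so the comparison applies to that operating point.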
-
The Wreaths of KHAN: Uniform Graph Feature Selection with False Discovery Rate Control
Authors:
Jiajun Liang,
Yue Liu,
Doudou Zhou,
Sinian Zhang,
Junwei Lu
Abstract:
Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, etc. While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially those graph features related to topological structures including cycles (i.e., wreaths), cliques, hubs, etc. These features play a vital role in protein substructure analysis, drug molecular design, and brain network connectivity analysis. To fill the gap, we propose a novel inferential framework for general high dimensional graphical models to select graph features with a controlled false discovery rate. Our method is based on the maximum of $p$-values from single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all the homological features within $K$ dimensions with the uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors. We score the residues for all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states and identify the residues crucial for protein conformational changes and thus being potential targets for inhibition.
Submitted 18 March, 2024;
originally announced March 2024.
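The abstract above bases feature selection on "the maximum of $p$-values from single edges that comprise the topological feature of interest." A feature such as a cycle exists only if all of its edges exist, so the largest edge p-value is a valid p-value for the whole feature (an intersection-union test). The sketch below pairs this with the standard Benjamini-Hochberg step-up procedure across features. BH is used here as a familiar stand-in, not as the paper's method: KHAN's uniform control over continuous filtration levels and its discrete Gram-Schmidt step are not reproduced. The feature names and edge p-values are made up for illustration.

```python
def feature_pvalue(edge_pvalues):
    """p-value for 'every edge of this feature is present' (intersection-union test):
    the feature is only as significant as its weakest edge."""
    return max(edge_pvalues)


def benjamini_hochberg(pvalues, alpha=0.1):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])


# Hypothetical features and edge-level p-values, for illustration only.
features = {
    "cycle_a": [0.001, 0.002, 0.004],
    "cycle_b": [0.0005, 0.3, 0.001],   # one weak edge sinks the whole cycle
    "clique_c": [0.01, 0.02, 0.015],
}
names = list(features)
pvals = [feature_pvalue(features[name]) for name in names]
selected = [names[i] for i in benjamini_hochberg(pvals, alpha=0.1)]
print(selected)  # ['cycle_a', 'clique_c']
```

Taking the maximum is conservative per feature, but it needs no assumption about how edge-level tests are correlated within a feature.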
-
Prioritized Semantic Learning for Zero-shot Instance Navigation
Authors:
Xander Sun,
Louis Lau,
Hoyard Zhi,
Ronghe Qiu,
Junwei Liang
Abstract:
We study zero-shot instance navigation, in which the agent navigates to a specific object without using object annotations for training. Previous object navigation approaches apply the image-goal navigation (ImageNav) task (go to the location of an image) for pretraining, and transfer the agent to achieve object goals using a vision-language model. However, these approaches lead to issues of semantic neglect, where the model fails to learn meaningful semantic alignments. In this paper, we propose a Prioritized Semantic Learning (PSL) method to improve the semantic understanding ability of navigation agents. Specifically, a semantic-enhanced PSL agent is proposed and a prioritized semantic training strategy is introduced to select goal images that exhibit clear semantic supervision and relax the reward function from strict exact view matching. At inference time, a semantic expansion inference scheme is designed to preserve the same granularity level of the goal-semantic as training. Furthermore, for the popular HM3D environment, we present an Instance Navigation (InstanceNav) task that requires going to a specific object instance with detailed descriptions, as opposed to the Object Navigation (ObjectNav) task where the goal is defined merely by the object category. Our PSL agent outperforms the previous state-of-the-art by 66% on zero-shot ObjectNav in terms of success rate and is also superior on the new InstanceNav task. Code will be released at https://anonymous.4open.science/r/PSL/.
Submitted 18 March, 2024;
originally announced March 2024.
-
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Authors:
Tianhe Wu,
Kede Ma,
Jie Liang,
Yujiu Yang,
Lei Zhang
Abstract:
While Multimodal Large Language Models (MLLMs) have made significant advances in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account for human perception of image quality, but it is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing the visual quality of multiple images, tasks humans can perform effortlessly.
Submitted 16 March, 2024;
originally announced March 2024.
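The abstract above builds its nine prompting systems as the cross product of three psychophysical testing procedures and three prompting strategies. The sketch below enumerates that grid with `itertools.product`. The images-per-query counts follow the usual meaning of these procedures (one image, a pair, or a set), with three used here as an illustrative size for the multiple-stimulus case. The actual prompt wording used in the study is not reproduced.

```python
from itertools import product

# The three psychophysical testing procedures named in the abstract, with the
# number of images each shows per query (3 is an illustrative set size).
procedures = {"single-stimulus": 1, "double-stimulus": 2, "multiple-stimulus": 3}

# The three prompting strategies named in the abstract.
strategies = ["standard", "in-context", "chain-of-thought"]

# Each prompting system pairs one procedure with one strategy: 3 x 3 = 9 systems.
prompting_systems = list(product(procedures, strategies))

for procedure, strategy in prompting_systems:
    print(f"{procedure:<18} {strategy:<17} images per query: {procedures[procedure]}")
```

Organizing the study as a grid makes it straightforward to pick the best strategy separately for each procedure. The abstract reports that each MLLM was paired with its "respective optimal prompting systems" before the harder sample-selection stage.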
-
Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection
Authors:
Yuting Xu,
Jian Liang,
Lijun Sheng,
Xiao-Yu Zhang
Abstract:
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but incur high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. This transformation process involves sequentially masking frames at the same positions within each frame. These frames are then resized into sub-frames and reorganized into the predetermined layout, forming thumbnails. TALL is model-agnostic and has remarkable simplicity, necessitating only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The semantic consistency loss imposes consistency constraints on semantic features to improve model generalization ability. Extensive experiments on intra-dataset, cross-dataset, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to the state-of-the-art methods, demonstrating the effectiveness of our approaches for various deepfake detection problems. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
Submitted 20 March, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
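The Thumbnail Layout transformation described above masks every frame at the same position, shrinks the frames into sub-frames, and tiles them into one image. The NumPy sketch below illustrates that data flow. It assumes four frames in a 2x2 grid, a rectangular zero mask, and naive stride-based downsampling in place of proper resizing. The `thumbnail_layout` helper is hypothetical and is not the code released in the TALL4Deepfake repository.

```python
import numpy as np


def thumbnail_layout(clip, grid=(2, 2), mask_box=None):
    """Tile a clip of shape (T, H, W, C), with T == rows * cols, into one (H, W, C) image.

    mask_box: optional (y0, y1, x0, x1) region zeroed at the same position in every frame.
    """
    rows, cols = grid
    t, h, w, c = clip.shape
    if t != rows * cols:
        raise ValueError("number of frames must equal rows * cols")
    if h % rows or w % cols:
        raise ValueError("H and W must be divisible by the grid size")
    frames = clip.copy()
    if mask_box is not None:
        y0, y1, x0, x1 = mask_box
        frames[:, y0:y1, x0:x1, :] = 0  # same spatial position masked in every frame
    # Naive stride-based downsampling to (H / rows, W / cols); a real pipeline would interpolate.
    sub = frames[:, ::rows, ::cols, :]
    sh, sw = h // rows, w // cols
    out = np.empty((h, w, c), dtype=clip.dtype)
    for idx in range(t):
        r, col = divmod(idx, cols)  # frames are placed in row-major order
        out[r * sh:(r + 1) * sh, col * sw:(col + 1) * sw] = sub[idx]
    return out


clip = np.random.rand(4, 224, 224, 3)  # hypothetical 4-frame RGB clip
thumb = thumbnail_layout(clip, mask_box=(64, 96, 64, 96))
print(thumb.shape)  # (224, 224, 3)
```

The output has the same spatial size as a single frame, which is why the abstract can call TALL model-agnostic: an ordinary 2D image backbone consumes the thumbnail, and temporal order is encoded by tile position.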
-
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Authors:
Yueqian Wang,
Xiaojun Meng,
Jianxin Liang,
Yuxuan Wang,
Qun Liu,
Dongyan Zhao
Abstract:
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform close to random chance when grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images. In this paper, we propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye is better at temporal video grounding and comparable on other video-text tasks with existing video-text LLMs, which verifies its superior video-text multi-modal understanding abilities.
Submitted 15 March, 2024;
originally announced March 2024.