Search | arXiv e-print repository

arXiv:2406.11937 [pdf, other]

Using graph neural networks to reconstruct charged pion showers in the CMS High Granularity Calorimeter

Authors: M. Aamir, B. Acar, G. Adamov, T. Adams, C. Adloff, S. Afanasiev, C. Agrawal, C. Agrawal, A. Ahmad, H. A. Ahmed, S. Akbar, N. Akchurin, B. Akgul, B. Akgun, R. O. Akpinar, E. Aktas, A. AlKadhim, V. Alexakhin, J. Alimena, J. Alison, A. Alpana, W. Alshehri, P. Alvarez Dominguez, M. Alyari, C. Amendola, et al. (550 additional authors not shown)

Abstract: A novel method to reconstruct the energy of hadronic showers in the CMS High Granularity Calorimeter (HGCAL) is presented. The HGCAL is a sampling calorimeter with very fine transverse and longitudinal granularity. The active media are silicon sensors and scintillator tiles read out by SiPMs, and the absorbers are a combination of lead and Cu/CuW in the electromagnetic section, and steel in the hadronic section. The shower reconstruction method is based on graph neural networks and it makes use of a dynamic reduction network architecture. It is shown that the algorithm is able to capture and mitigate the main effects that normally hinder the reconstruction of hadronic showers using classical reconstruction methods, by compensating for fluctuations in the multiplicity, energy, and spatial distributions of the shower's constituents. The performance of the algorithm is evaluated using test beam data collected in 2018 with a prototype of the CMS HGCAL accompanied by a section of the CALICE AHCAL prototype. The capability of the method to mitigate the impact of energy leakage from the calorimeter is also demonstrated.

Submitted 30 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Prepared for submission to JINST

arXiv:2406.10591 [pdf, other]

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

Authors: Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

Abstract: Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio (TTA) technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing. In addition, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audio Content Planning, Generation, and Alignment (CPGA) framework is proposed, which includes a content planning module leveraging large language models for comprehension of complex multi-modal prompts. Additionally, the training process is optimized using Proximal Policy Optimization based reinforcement learning, significantly improving the alignment and auditory realism of generated foley audio. Experimental results demonstrate that our approach significantly advances the field of foley audio dubbing, providing robust solutions for the challenges of multi-modal dubbing. Even when utilizing the relatively lightweight GPT-2 model, our framework outperforms open-source multimodal large models such as LLaVA, DeepSeek-VL, and Moondream2. The dataset is available at https://github.com/borisfrb/MINT.

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.09670 [pdf]

Delayed phosphate release can highly improve energy efficiency of muscle contraction

Authors: Jiaxiang Xu, Jiangke Tao, Bin Chen

Abstract: The power stroke of myosin and the release of inorganic phosphate (Pi) are pivotal in the conversion of ATP's chemical energy into mechanical work. Although the precise sequence of these two events remains a subject of debate, it is generally agreed that Pi-release into the solution doesn't occur instantly upon the binding of a myosin to actin. Here, we examine how Pi-release that is not directly coupled with the power stroke affects muscle contraction. Utilizing a cross-scale mechanics model for a sarcomere unit that integrates the chemomechanical cycle of individual myosins, we find that relatively slow Pi-release can markedly improve energy efficiency during muscle contraction in silico. Our analysis leads us to propose that gradual Pi-release may offer a route to finely adjust the bond strength of an attached myosin, thereby indirectly modulating the power stroke to influence muscle performance. When our model is applied to simulate muscle performance in response to rapid jumps in Pi concentrations, we observe asymmetric rates of force alteration, which corroborate previous experimental findings. Indeed, our model's predictions in the current work are largely consistent with experimental data. This research provides crucial insights into the kinetics of Pi-release within the myosin's chemomechanical cycle and its significant regulatory impact on muscle contraction.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08112 [pdf, other]

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

Abstract: With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to-end generation process, skipping the final step of vocoder processing. This poses a significant challenge for current audio deepfake detection (ADD) models based on vocoder artifacts. To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform. We propose the Codecfake dataset, which is generated by seven representative neural codec methods. Experimental results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models on the Codecfake test set.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024. arXiv admin note: substantial text overlap with arXiv:2405.04880

arXiv:2406.07381 [pdf, other]

World Models with Hints of Large Language Models for Goal Achieving

Authors: Zeyuan Liu, Ziyu Huan, Xiyao Wang, Jiafei Lyu, Jian Tao, Xiu Li, Furong Huang, Huazhe Xu

Abstract: Reinforcement learning struggles in the face of long-horizon tasks and sparse goals due to the difficulty in manual reward specification. While existing methods address this by adding intrinsic rewards, they may fail to provide meaningful guidance in long-horizon decision-making tasks with large state and action spaces, lacking purposeful exploration. Inspired by human cognition, we propose a new multi-modal model-based RL approach named Dreaming with Large Language Models (DLLM). DLLM integrates the proposed hinting subgoals from the LLMs into the model rollouts to encourage goal discovery and reaching in challenging tasks. By assigning higher intrinsic rewards to samples that align with the hints outlined by the language model during model rollouts, DLLM guides the agent toward meaningful and efficient exploration. Extensive experiments demonstrate that the DLLM outperforms recent methods in various challenging, sparse-reward environments such as HomeGrid, Crafter, and Minecraft by 27.7%, 21.1%, and 9.9%, respectively.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06086 [pdf, other]

RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

Abstract: Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use a sinc layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.

Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.04840 [pdf, other]

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these problems, we propose TraceableSpeech, a novel TTS model that directly generates watermarked speech, improving watermark imperceptibility and speech quality. Furthermore, we design frame-wise imprinting and extraction of watermarks, achieving higher robustness against resplicing attacks and temporal flexibility in operation. Experimental results show that TraceableSpeech outperforms the strong baseline, in which VALL-E or HiFicodec individually uses WavMark, in watermark imperceptibility, speech quality, and resilience against resplicing attacks. It can also be applied to speech of various durations.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.04683 [pdf, other]

PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

Authors: Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang

Abstract: Text-to-Audio (TTA) aims to generate audio that corresponds to the given text description, playing a crucial role in media production. The text descriptions in TTA datasets lack rich variations and diversity, resulting in a drop in TTA model performance when faced with complex text. To address this issue, we propose a method called Portable Plug-in Prompt Refiner, which utilizes rich knowledge about textual descriptions inherent in large language models to effectively enhance the robustness of TTA acoustic models without altering the acoustic training set. Furthermore, a Chain-of-Thought that mimics human verification is introduced to enhance the accuracy of audio descriptions, thereby improving the accuracy of generated content in practical applications. The experiments show that our method achieves a state-of-the-art Inception Score (IS) of 8.72, surpassing AudioGen, AudioLDM and Tango.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.04027 [pdf, other]

PowerPeeler: A Precise and General Dynamic Deobfuscation Method for PowerShell Scripts

Authors: Ruijie Li, Chenyang Zhang, Huajun Chai, Lingyun Ying, Haixin Duan, Jun Tao

Abstract: PowerShell is a powerful and versatile task automation tool. Unfortunately, it is also widely abused by cyber attackers. To bypass malware detection and hinder threat analysis, attackers often employ diverse techniques to obfuscate malicious PowerShell scripts. Existing deobfuscation tools suffer from the limitation of static analysis, which fails to simulate the real deobfuscation process accurately. In this paper, we propose PowerPeeler. To the best of our knowledge, it is the first dynamic PowerShell script deobfuscation approach at the instruction level. It utilizes expression-related Abstract Syntax Tree (AST) nodes to identify potential obfuscated script pieces. Then, PowerPeeler correlates the AST nodes with their corresponding instructions and monitors the script's entire execution process. Subsequently, PowerPeeler dynamically tracks the execution of these instructions and records their execution results. Finally, PowerPeeler stringifies these results to replace the corresponding obfuscated script pieces and reconstruct the deobfuscated script. To evaluate the effectiveness of PowerPeeler, we collect 1,736,669 real-world malicious PowerShell samples with diverse obfuscation methods. We compare PowerPeeler with five state-of-the-art deobfuscation tools and GPT-4. The evaluation results demonstrate that PowerPeeler can effectively handle all well-known obfuscation methods. Additionally, the deobfuscation correctness rate of PowerPeeler reaches 95%, significantly surpassing that of other tools. PowerPeeler not only recovers the highest amount of sensitive data but also maintains a semantic consistency over 97%, which is also the best. Moreover, PowerPeeler effectively obtains the largest quantity of valid deobfuscated results within a limited time frame. Furthermore, PowerPeeler is extendable and can be used as a helpful tool for other cyber security solutions.

Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: To appear in the ACM CCS 2024

arXiv:2406.03247 [pdf, other]

Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

Authors: Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi

Abstract: The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework, called GFL-FAD, aiming for highly generalized FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, maintaining the focus on learning genuine data features. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve robustness. Our method achieves state-of-the-art performance with an EER of 0.25% on ASVspoof2019 LA.

Submitted 9 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.03240 [pdf, other]

Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy

Authors: Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua Tao

Abstract: With the proliferation of deepfake audio, there is an urgent need to investigate its attribution. Current source tracing methods can effectively distinguish in-distribution (ID) categories. However, the rapid evolution of deepfake algorithms poses a critical challenge in the accurate identification of out-of-distribution (OOD) novel deepfake algorithms. In this paper, we propose the Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition, demonstrating its effectiveness in discriminating ID samples while identifying OOD samples. For effective OOD detection, we first explore current post-hoc OOD methods and propose NSD, a novel OOD approach for identifying novel deepfake algorithms that considers the similarity of both feature and logit scores. REFD achieves an 86.83% F1-score as a single system in Audio Deepfake Detection Challenge 2023 Track3, showcasing its state-of-the-art performance.

Submitted 8 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.03237 [pdf, other]

Generalized Fake Audio Detection via Deep Stable Learning

Authors: Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, Shuchen Shi

Abstract: Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate the training process. In this work, we propose a stable learning-based training scheme that involves a Sample Weight Learning (SWL) module, addressing distribution shift by decorrelating all selected features via learning weights from training samples. The proposed portable plug-in-like SWL is easy to apply to multiple base models and generalizes them without using extra data during training. Experiments conducted on the ASVspoof datasets clearly demonstrate the effectiveness of SWL in generalizing different models across three evaluation datasets from different distributions.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.00504 [pdf]

Research on an Autonomous UAV Search and Rescue System Based on the Improved

Authors: Haobin Chen, Junyu Tao, Bize Zhou, Xiaoyan Liu

Abstract: The aim is to enable a UAV (unmanned aerial vehicle) to operate autonomously and perform practical functions such as search and rescue in complex unknown environments. This paper proposes an autonomous search and rescue UAV system based on an EGO-Planner algorithm, which is improved by an innovative UAV body design and uses inverse motor backstepping to enhance the overall flight efficiency of the UAV and the miniaturization of the whole machine. At the same time, the system introduces the EGO-Planner planning tool, which is optimized by a bidirectional A* algorithm along with an object detection algorithm. It addresses intelligent obstacle avoidance and search and rescue. In simulation and field verification, and in comparison with traditional algorithms, this method shows greater efficiency and reliability in the task. In addition, because it improves the robustness of the existing algorithm, this application shows good prospects.

Submitted 7 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

Comments: 2024 5th International Conference on Computer Engineering and Application

arXiv:2405.20914 [pdf, other]

RASE: Efficient Privacy-preserving Data Aggregation against Disclosure Attacks for IoTs

Authors: Zuyan Wang, Jun Tao, Dika Zou

Abstract: The growing popular awareness of personal privacy raises the following quandary: what is the new paradigm for collecting and protecting the data produced by ever-increasing sensor devices? Most previous studies on co-design of data aggregation and privacy preservation assume that a trusted fusion center adheres to privacy regimes. Very recent work has taken steps towards relaxing the assumption by allowing data contributors to locally perturb their own data. Although these solutions withhold some data content to mitigate privacy risks, they have been shown to offer insufficient protection against disclosure attacks. Aiming at providing a more rigorous data safeguard for the Internet of Things (IoTs), this paper initiates the study of privacy-preserving data aggregation. We propose a novel paradigm (called RASE), which can be generalized into a 3-step sequential procedure: noise addition, followed by random permutation, and then parameter estimation. Specifically, we design a differentially private randomizer, which carefully guides data contributors to obfuscate the truth. Then, a shuffler is employed to receive the noisy data from all data contributors. After that, it breaks the correct linkage between senders and receivers by applying a random permutation. The estimation phase involves using inaccurate data to calculate an approximate aggregate value. Extensive simulations are provided to explore the privacy-utility landscape of our RASE.

Submitted 31 May, 2024; originally announced May 2024.

Comments: 14 pages, 19 figures
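
Note: the three-step procedure named in the RASE abstract above (noise addition, random permutation, parameter estimation) can be illustrated with a minimal sketch. This is not the authors' implementation. The Laplace mechanism, the mean as the aggregate, and all function and parameter names (`randomize`, `shuffle`, `estimate_mean`, `aggregate`, `epsilon`, `sensitivity`) are assumptions chosen for illustration; the abstract does not specify them.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def randomize(value, epsilon, sensitivity, rng):
    # Step 1 (assumed Laplace mechanism): each contributor perturbs
    # its own value locally before sending it.
    return value + laplace_noise(sensitivity / epsilon, rng)


def shuffle(reports, rng):
    # Step 2: a shuffler applies a random permutation so the receiver
    # cannot link a report to its sender by position.
    shuffled = list(reports)
    rng.shuffle(shuffled)
    return shuffled


def estimate_mean(reports):
    # Step 3 (assumed aggregate): the zero-mean noise averages out,
    # so the sample mean approximates the true mean.
    return sum(reports) / len(reports)


def aggregate(values, epsilon=1.0, sensitivity=1.0, seed=0):
    rng = random.Random(seed)
    noisy = [randomize(v, epsilon, sensitivity, rng) for v in values]
    return estimate_mean(shuffle(noisy, rng))
```

In this sketch a smaller `epsilon` means larger noise per report, so the estimate needs more contributors to reach the same accuracy; that is one simple view of a privacy-utility trade-off.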

arXiv:2405.14913 [pdf, other]

High Rank Path Development: an approach of learning the filtration of stochastic processes

Authors: Jiajie Tao, Hao Ni, Chong Liu

Abstract: Since weak convergence of stochastic processes does not account for the growth of information over time, which is represented by the underlying filtration, a slightly erroneous stochastic model in the weak topology may cause huge losses in multi-period decision-making problems. To address such discontinuities, Aldous introduced extended weak convergence, which can fully characterise all essential properties, including the filtration, of stochastic processes; however, efficient numerical implementations were considered hard to find. In this paper, we introduce a novel metric called High Rank PCF Distance (HRPCFD) for extended weak convergence based on the high rank path development method from rough path theory, which also defines the characteristic function for measure-valued processes. We then show that such HRPCFD admits many favourable analytic properties which allow us to design an efficient algorithm for training HRPCFD from data and construct the HRPCF-GAN by using HRPCFD as the discriminator for conditional time series generation. Our numerical experiments on both hypothesis testing and generative modelling validate the out-performance of our approach compared with several state-of-the-art methods, highlighting its potential in broad applications of synthetic time series generation and in addressing classic financial and economic challenges, such as optimal stopping or utility maximisation problems.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.10576 [pdf, other]

An Efficient Learning Control Framework With Sim-to-Real for String-Type Artificial Muscle-Driven Robotic Systems

Authors: Jiyue Tao, Yunsong Zhang, Sunil Kumar Rajendran, Feitian Zhang, Dexin Zhao, Tongsheng Shen

Abstract: Robotic systems driven by artificial muscles present unique challenges due to the nonlinear dynamics of actuators and the complex designs of mechanical structures. Traditional model-based controllers often struggle to achieve desired control performance in such systems. Deep reinforcement learning (DRL), a trending machine learning technique widely adopted in robot control, offers a promising alternative. However, integrating DRL into these robotic systems faces significant challenges, including the requirement for large amounts of training data and the inevitable sim-to-real gap when deployed to real-world robots. This paper proposes an efficient reinforcement learning control framework with sim-to-real transfer to address these challenges. Bootstrap and augmentation enhancements are designed to improve the data efficiency of baseline DRL algorithms, while a sim-to-real transfer technique, namely randomization of muscle dynamics, is adopted to bridge the gap between simulation and real-world deployment. Extensive experiments and ablation studies are conducted utilizing two string-type artificial muscle-driven robotic systems including a two degree-of-freedom robotic eye and a parallel robotic wrist, the results of which demonstrate the effectiveness of the proposed learning control strategy.

Submitted 7 June, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.08596

EVDA: Evolving Deepfake Audio Detection Continual Learning Benchmark

Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

Abstract: The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts as an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.

Submitted 15 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

Comments: This paper need more modification

arXiv:2405.05741 [pdf, ps, other]

Can large language models understand uncommon meanings of common words?

Authors: **yang Wu, Feihu Che, Xinxin Zheng, Shuai Zhang, Ruihan **, Shuai Nie, Pengpeng Shao, Jianhua Tao

Abstract: Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, lacking widely acknowledged testing mechanisms, answering `whether LLMs are stochastic parrots or genuinely comprehend the world' remains unclear, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU, neglecting fine-grained explorations. However, such explorations are crucial for understanding their unique comprehension mechanisms, aligning with human cognition, and finally enhancing LLMs' general NLU capacities. To address this gap, our study delves into LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the innovative construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Introducing models of both open-source and closed-source, varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models in this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are also introduced to help alleviate this trouble, yet limitations persist. By highlighting the above critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.04880 [pdf, other]

The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

Abstract: With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits widespread, high deception, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including 2 languages, over 1M audio samples, and various test conditions, focusing on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain balanced and generalized minima. In our experiments, we first demonstrate that ADD model training with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.

Submitted 15 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.18661 [pdf, other]

On the determination of path signature from its unitary development

Authors: Siran Li, Zijiu Lyu, Hao Ni, Jiajie Tao

Abstract: We establish an explicit, constructive approach to determine any element $X$ in the tensor algebra $\mathcal{T}\left(\mathbb{R}^d\right) = \bigoplus_{n=0}^\infty\left(\mathbb{R}^d\right)^{\otimes n}$ from its moment generating function. The only assumption is that $X$ has a nonzero radius of convergence, which relaxes the condition of having an infinite radius of convergence in the literature. The key building block of our approach is tridiagonal antisymmetric matrices, whose sparsity offers a considerable advantage for dimension reduction in applications. In particular, specialising $X$ to the signature space of bounded $p$-variation paths in $\mathbb{R}^d$ with $1\leq p <2$, we show that the developments of such sparse matrices are sufficient to separate points over the space of signatures, which yields a refined answer to the "moment problem" concerning the signature. Based on the above theoretical investigations, we propose a new distance function for probability measures on the path space, termed as the "Restricted Path Characteristic Function Distance" (RPCFD), and validate its effectiveness via numerical experiments on hypothesis testing for examples of fractional Brownian motions.

Submitted 29 April, 2024; originally announced April 2024.

Comments: 24 pages, 8 figures, and 3 tables

MSC Class: 60L20 (Primary); 62M99; 60G35; 62M07 (Secondary)

arXiv:2404.17113 [pdf, other]

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Authors: Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, **ming Zhao, Ziyang Ma, Xie Chen, Jiangyan Yi, Rui Liu, Kele Xu, Bin Liu, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao

Abstract: Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate annotations), current systems struggle to meet the demands of practical applications. Therefore, we organize a series of challenges around emotion recognition to further promote the development of this area. Last year, we launched MER2023, focusing on three topics: multi-label learning, noise robustness, and semi-supervised learning. This year, we continue to organize MER2024. In addition to expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition. The main consideration for this track is that existing datasets often fix the label space and use majority voting to enhance annotator consistency, but this process may limit the model's ability to describe subtle emotions. In this track, we encourage participants to generate any number of labels in any category, aiming to describe the emotional state as accurately as possible. Our baseline is based on MERTools and the code is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.

Submitted 23 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.15660 [pdf, other]

KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering

Authors: Xinxin Zheng, Feihu Che, **yang Wu, Shuai Zhang, Shuai Nie, Kang Liu, Jianhua Tao

Abstract: Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise information and impair the performance of large language models. To tackle this problem, we propose a novel Knowledge Selection of Large Language Models (KS-LLM) method, aiming to identify valuable information from evidence documents. The KS-LLM approach utilizes triples to effectively select knowledge snippets from evidence documents that are beneficial to answering questions. Specifically, we first generate triples based on the input question, then select the evidence sentences most similar to triples from the evidence document, and finally combine the evidence sentences and triples to assist large language models in generating answers. Experimental comparisons on several question answering datasets, such as TriviaQA, WebQ, and NQ, demonstrate that the proposed method surpasses the baselines and achieves the best results.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.09606 [pdf, other]

A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions

Authors: Pengfei Liu, Jun Tao, Zhixiang Ren

Abstract: The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.07454 [pdf, other]

Representation Learning of Tangled Key-Value Sequence Data for Early Classification

Authors: Tao Duan, Junzhou Zhao, Shuo Zhang, **g Tao, **hui Wang

Abstract: Key-value sequence data has become ubiquitous and naturally appears in a variety of real-world applications, ranging from the user-product purchasing sequences in e-commerce, to network packet sequences forwarded by routers in networking. Classifying these key-value sequences is important in many scenarios such as user profiling and malicious applications identification. In many time-sensitive scenarios, besides the requirement of classifying a key-value sequence accurately, it is also desired to classify a key-value sequence early, in order to respond fast. However, these two goals are conflicting in nature, and it is challenging to achieve them simultaneously. In this work, we formulate a novel tangled key-value sequence early classification problem, where a tangled key-value sequence is a mixture of several concurrent key-value sequences with different keys. The goal is to classify each individual key-value sequence sharing the same key both accurately and early. To address this problem, we propose a novel method, i.e., Key-Value sequence Early Co-classification (KVEC), which leverages both inner- and inter-correlations of items in a tangled key-value sequence through key correlation and value correlation to learn a better sequence representation. Meanwhile, a time-aware halting policy decides when to stop the ongoing key-value sequence and classify it based on current sequence representation. Experiments on both real-world and synthetic datasets demonstrate that our method outperforms the state-of-the-art baselines significantly. KVEC improves the prediction accuracy by up to $4.7 - 17.5\%$ under the same prediction earliness condition, and improves the harmonic mean of accuracy and earliness by up to $3.7 - 14.0\%$.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 12 pages, 31 figures, Accepted by ICDE2024

arXiv:2404.03571 [pdf, other]

Extra Higgs boson searches at the LHC

Authors: Junquan Tao

Abstract: Many searches for additional Higgs bosons, which are predicted by a lot of interesting models beyond the standard model, have been performed at the LHC. Some selected latest results of the searches for extra Higgs bosons at the LHC are presented. These additional Higgs bosons could be produced either directly from the parton interactions or from the decays of the observed standard model Higgs boson or a new heavier resonance. The searches used the data from proton-proton collisions delivered by the LHC at a centre-of-mass energy of $\sqrt{s}=13$ TeV and recorded with the ATLAS and CMS detectors. No direct evidence of new physics has been observed yet. Several mild excesses were observed in some final states. More data is needed to conclude on the nature of these excesses.

Submitted 4 April, 2024; originally announced April 2024.

Comments: Presented at the 12th Workshop on the CKM Unitarity Triangle, 18-22 September 2023, Santiago de Compostela

arXiv:2404.01089 [pdf, other]

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

Authors: Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, ** Tao, Xiangmin Xu

Abstract: Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular, as they are excellent at image synthesis tasks. However, these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image, which affects the try-on's efficiency and fidelity. To address these issues, we propose a Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results and introduces no additional image encoders. Accordingly, we make contributions from two aspects. First, we propose to concatenate the masked person and reference garment images along the spatial dimension and utilize the resulting image as the input for the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second, we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images, further enhancing the reliability of the try-on results. In addition, we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks, e.g., garment-to-person and person-to-person try-ons, and significantly outperforms state-of-the-art methods on popular VITON, VITON-HD databases.

Submitted 1 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.15044 [pdf, other]

Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild

Authors: Zhuofan Wen, Fengyu Zhang, Siyuan Zhang, Haiyang Sun, Mingyu Xu, Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao

Abstract: Multimodal fusion is a significant method for most multimodal tasks. With the recent surge in the number of large pre-trained models, combining both multimodal fusion methods and pre-trained model features can achieve outstanding performance in many multimodal tasks. In this paper, we present our approach, which leverages both advantages for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation. We evaluate the Aff-Wild2 database using pre-trained models, then extract the final hidden layers of the models as features. Following preprocessing and interpolation or convolution to align the extracted features, different models are employed for modal fusion. Our code is available at GitHub - FulgenceWen/ABAW6th.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.03821 [pdf, other]

Identifying Black Holes Through Space Telescopes and Deep Learning

Authors: Yeqi Fang, Wei Hong, Jun Tao

Abstract: The EHT has captured a series of images of black holes. These images could provide valuable information about the gravitational environment near the event horizon. However, accurate detection and parameter estimation for candidate black holes are necessary. This paper explores the potential for identifying black holes in the ultraviolet band using space telescopes. We establish a data pipeline for generating simulated observations and present an ensemble neural network model for black hole detection and parameter estimation. The model achieves mean average precision [0.5] values of 0.9176 even when reaching the imaging FWHM ($\theta_c$) and maintains the detection ability until $0.54\theta_c$. The parameter estimation is also accurate. These results indicate that our methodology enables super-resolution recognition. Moreover, the model successfully detects the shadow of M87* from background noise and other celestial bodies and estimates its inclination and positional angle. Our work demonstrates the feasibility of detecting black holes in the ultraviolet band and provides a new method for black hole detection and further parameter estimation.

Submitted 11 March, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

Comments: 18 pages, 18 figures, 8 tables. We propose an ensemble neural network which demonstrates the feasibility of detecting black holes in the UV band and provides a new method for the accurate and real-time detection of candidate black holes and further parameter estimation

arXiv:2403.01318 [pdf, other]

High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social Media

Authors: Yuya Sasaki, **g Tao, Yulong Wang

Abstract: Motivated by the empirical power law of the distributions of credits (e.g., the number of "likes") of viral posts in social media, we introduce the high-dimensional tail index regression and methods of estimation and inference for its parameters. We propose a regularized estimator, establish its consistency, and derive its convergence rate. To conduct inference, we propose to debias the regularized estimate, and establish the asymptotic normality of the debiased estimator. Simulation studies support our theory. These methods are applied to text analyses of viral posts in X (formerly Twitter) concerning LGBTQ+.

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2402.11432 [pdf, other]

Can Deception Detection Go Deeper? Dataset, Evaluation, and Benchmark for Deception Reasoning

Authors: Kang Chen, Zheng Lian, Haiyang Sun, Bin Liu, Jianhua Tao

Abstract: Deception detection has attracted increasing attention due to its importance in real-world scenarios. Its main goal is to detect deceptive behaviors from multimodal clues such as gestures, facial expressions, prosody, etc. However, these bases are usually subjective and related to personal habits. Therefore, we extend deception detection to deception reasoning, further providing objective evidence to support subjective judgment. Specifically, we provide potential lies and basic facts and then analyze why this sentence may be a lie by combining factual inconsistencies and intent behind them. Compared with deception detection, this task is more applicable to real-world scenarios. For example, in interrogation, the police should judge whether a person is lying based on solid evidence. This paper presents our initial attempts at this task, including constructing a dataset and defining evaluation metrics. Meanwhile, this task can serve as a benchmark for evaluating the complex reasoning capability of large language models. Code and data will be made publicly available.

Submitted 16 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

arXiv:2402.11082 [pdf, other]

The AI Security Pyramid of Pain

Authors: Chris M. Ward, Josh Harguess, Julia Tao, Daniel Christman, Paul Spicer, Mike Tan

Abstract: We introduce the AI Security Pyramid of Pain, a framework that adapts the cybersecurity Pyramid of Pain to categorize and prioritize AI-specific threats. This framework provides a structured approach to understanding and addressing various levels of AI threats. Starting at the base, the pyramid emphasizes Data Integrity, which is essential for the accuracy and reliability of datasets and AI models, including their weights and parameters. Ensuring data integrity is crucial, as it underpins the effectiveness of all AI-driven decisions and operations. The next level, AI System Performance, focuses on MLOps-driven metrics such as model drift, accuracy, and false positive rates. These metrics are crucial for detecting potential security breaches, allowing for early intervention and maintenance of AI system integrity. Advancing further, the pyramid addresses the threat posed by Adversarial Tools, identifying and neutralizing tools used by adversaries to target AI systems. This layer is key to staying ahead of evolving attack methodologies. At the Adversarial Input layer, the framework addresses the detection and mitigation of inputs designed to deceive or exploit AI models. This includes techniques like adversarial patterns and prompt injection attacks, which are increasingly used in sophisticated attacks on AI systems. Data Provenance is the next critical layer, ensuring the authenticity and lineage of data and models. This layer is pivotal in preventing the use of compromised or biased data in AI systems. At the apex is the tactics, techniques, and procedures (TTPs) layer, dealing with the most complex and challenging aspects of AI security. This involves a deep understanding and strategic approach to counter advanced AI-targeted attacks, requiring comprehensive knowledge and planning.

Submitted 16 February, 2024; originally announced February 2024.

Comments: SPIE DCS 2024

arXiv:2402.04119 [pdf, other]

Scientific Language Modeling: A Quantitative Review of Large Language Models in Molecular Science

Authors: Pengfei Liu, Jun Tao, Zhixiang Ren

Abstract: Efficient molecular modeling and design are crucial for the discovery and exploration of novel molecules, and the incorporation of deep learning methods has revolutionized this field. In particular, large language models (LLMs) offer a fresh approach to tackle scientific problems from a natural language processing (NLP) perspective, introducing a research paradigm called scientific language modeling (SLM). However, two key issues remain: how to quantify the match between model and data modalities and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263 experiments to assess the model's compatibility with data modalities and knowledge acquisition. Through the modal transition probability matrix, we provide insights into the most suitable modalities for tasks. Furthermore, we introduce a statistically interpretable approach to discover context-specific knowledge mapping by localized feature filtering. Our pioneering analysis offers an exploration of the learning mechanism and paves the way for advancing SLM in molecular science.

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.12997 [pdf, other]

doi 10.1609/aaai.v38i8.28680

Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion

Authors: Cunhang Fan, Yujie Chen, Jun Xue, Yonghui Kong, Jianhua Tao, Zhao Lv

Abstract: In recent years, knowledge graph completion (KGC) models based on pre-trained language model (PLM) have shown promising results. However, the large number of parameters and high computational cost of PLM models pose challenges for their application in downstream tasks. This paper proposes a progressive distillation method based on masked generation features for KGC task, aiming to significantly reduce the complexity of pre-trained models. Specifically, we perform pre-distillation on PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. However, traditional feature distillation suffers from the limitation of having a single representation of information in teacher models. To solve this problem, we propose masked generation of teacher-student features, which contain richer representation information. Furthermore, there is a significant gap in representation ability between teacher and student. Therefore, we design a progressive distillation method to distill student models at each grade level, enabling efficient knowledge transfer from teachers to students. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods. Furthermore, in the progressive distillation stage, the model significantly reduces the model parameters while maintaining a certain level of performance. Specifically, the model parameters of the lower-grade student model are reduced by 56.7% compared to the baseline.

Submitted 10 June, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: Accepted by AAAI2024

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 38(8), 8380-8388 (2024), Vol. 38 No. 8: AAAI-24 Technical Tracks 8

arXiv:2401.10273 [pdf]

Revolutionizing Pharma: Unveiling the AI and LLM Trends in the Pharmaceutical Industry

Authors: Yu Han, Jingwen Tao

Abstract: This document offers a critical overview of the emerging trends and significant advancements in artificial intelligence (AI) within the pharmaceutical industry. Detailing its application across key operational areas, including research and development, animal testing, clinical trials, hospital clinical stages, production, regulatory affairs, quality control and other supporting areas, the paper categorically examines AI's role in each sector. Special emphasis is placed on cutting-edge AI technologies like machine learning algorithms and their contributions to various aspects of pharmaceutical operations. Through this comprehensive analysis, the paper highlights the transformative potential of AI in reshaping the pharmaceutical industry's future.

Submitted 21 January, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

arXiv:2401.09750 [pdf, other]

Exploration and Anti-Exploration with Distributional Random Network Distillation

Authors: Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li

Abstract: Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although the prevailing exploration Random Network Distillation (RND) algorithm has been demonstrated to be effective in numerous environments, it often needs more discriminative power in bonus allocation. This paper highlights the "bonus inconsistency" issue within RND, pinpointing its primary limitation. To address this issue, we introduce the Distributional RND (DRND), a derivative of the RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks. Our code is publicly available at https://github.com/yk7333/DRND.

Submitted 19 May, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: ICML 2024 accepted

arXiv:2401.05698 [pdf, other]

doi 10.1016/j.inffus.2024.102382

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao

Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior arts in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike them which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Codes and models will be publicly available at https://github.com/sunlicai/HiCMAE.

Submitted 1 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: Accepted by Information Fusion. The code is available at https://github.com/sunlicai/HiCMAE

Journal ref: Information Fusion, 2024

arXiv:2401.03429 [pdf, other]

MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition

Authors: Zheng Lian, Licai Sun, Yong Ren, Hao Gu, Haiyang Sun, Lan Chen, Bin Liu, Jianhua Tao

Abstract: Multimodal emotion recognition plays a crucial role in enhancing user experience in human-computer interaction. Over the past few decades, researchers have proposed a series of algorithms and achieved impressive progress. Although each method shows its superior performance, different methods lack a fair comparison due to inconsistencies in feature extractors, evaluation manners, and experimental settings. These inconsistencies severely hinder the development of this field. Therefore, we build MERBench, a unified evaluation benchmark for multimodal emotion recognition. We aim to reveal the contribution of some important techniques employed in previous works, such as feature selection, multimodal fusion, robustness analysis, fine-tuning, pre-training, etc. We hope this benchmark can provide clear and comprehensive guidance for follow-up researchers. Based on the evaluation results of MERBench, we further point out some promising research directions. Additionally, we introduce a new emotion dataset MER2023, focusing on the Chinese language environment. This dataset can serve as a benchmark dataset for research on multi-label learning, noise robustness, and semi-supervised learning. We encourage the follow-up researchers to evaluate their algorithms under the same experimental setup as MERBench for fair comparisons. Our code is available at: https://github.com/zeroQiaoba/MERTools.

Submitted 20 April, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

arXiv:2401.02393 [pdf, ps, other]

A PDE approach for solving the characteristic function of the generalised signature process

Authors: Terry Lyons, Hao Ni, Jiajie Tao

Abstract: The signature of a path, as a fundamental object in Rough path theory, serves as a generating function for non-commutative monomials on path space. It transforms the path into a grouplike element in the tensor algebra space, summarising the path faithfully up to a generalised form of re-parameterisation (a negligible equivalence class in this context). Our paper concerns stochastic processes and studies the characteristic function of the path signature of the stochastic process. In contrast to the expected signature, it determines the law on the random signatures without any regularity condition. The computation of the characteristic function of the random signature offers potential applications in stochastic analysis and machine learning, where the expected signature plays an important role. In this paper, we focus on a time-homogeneous Itô diffusion process, and adopt a PDE approach to derive the characteristic function of its signature defined at any fixed time horizon. A key ingredient of our approach is the introduction of the generalised-signature process. This lifting enables us to establish the Feynman-Kac-type theorem for the characteristic function of the generalised-signature process by following the martingale approach. Moreover, as an application of our results, we present a novel derivation of the joint characteristic function of Brownian motion coupled with the Lévy area, leveraging the structure theorem of anti-symmetric matrices.

Submitted 29 February, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

arXiv:2401.00416 [pdf, other]

SVFAP: Self-supervised Video Facial Affect Perceiver

Authors: Licai Sun, Zheng Lian, Kexin Wang, Yu He, Mingyu Xu, Haiyang Sun, Bin Liu, Jianhua Tao

Abstract: Video-based facial affect analysis has recently attracted increasing attention owing to its critical role in human-computer interaction. Previous studies mainly focus on developing various deep learning architectures and training them in a fully supervised manner. Although significant progress has been achieved by these supervised methods, the longstanding lack of large-scale high-quality labeled data severely hinders their further improvements. Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised methods. Specifically, SVFAP leverages masked facial video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos. Considering that large spatiotemporal redundancy exists in facial videos, we propose a novel temporal pyramid and spatial bottleneck Transformer as the encoder of SVFAP, which not only enjoys low computational cost but also achieves excellent performance. To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition. Comprehensive results demonstrate that SVFAP can learn powerful affect-related representations via large-scale self-supervised pre-training and it significantly outperforms previous state-of-the-art methods on all datasets. Codes will be available at https://github.com/sunlicai/SVFAP.

Submitted 31 December, 2023; originally announced January 2024.

Comments: Submitted to IEEE Trans. on Affective Computing (February 8, 2023)

arXiv:2312.15760 [pdf, other]

Gravitational Lensing of Spherically Symmetric Black Holes in Dark Matter Halos

Authors: Yi-Gao Liu, Chen-Kai Qiao, Jun Tao

Abstract: The gravitational lensing of supermassive black holes surrounded by dark matter halo has attracted a great number of interests in recent years. However, many studies employed simplified dark matter density models, which makes it very hard to give a precise prediction on the dark matter effects in real astrophysical galaxies. In this work, to more accurately describe the distribution of dark matter in real astrophysical galaxies, we study the gravitational lensing of black holes in astrophysical dark matter halo models (Beta, Burkert, Brownstein, and Moore). The deflection angle is obtained using a generalized Gibbons-Werner approach. The visual angular positions and the Einstein rings are also calculated by adopting the gravitational lens equation. Specifically, we choose the supermassive black holes in Milky Way Galaxy, Andromeda galaxy (M31), Virgo galaxy (M87), and ESO138-G014 galaxy as examples, including the corresponding fitted value of dark matter halos. The results suggest that the dark matter halo described by the Beta model has non-negligible influences on the gravitational deflection angle and gravitational lensing observations. However, the Burkert, Brownstein, and Moore models have relatively small influences on angular position of images and the Einstein ring.

Submitted 25 May, 2024; v1 submitted 25 December, 2023; originally announced December 2023.

Comments: 29 pages, 9 figures, 2 appendices

arXiv:2312.15583 [pdf, other]

ITEACH-Net: Inverted Teacher-studEnt seArCH Network for Emotion Recognition in Conversation

Authors: Haiyang Sun, Zheng Lian, Chenglong Wang, Kang Chen, Licai Sun, Bin Liu, Jianhua Tao

Abstract: There remain two critical challenges that hinder the development of ERC. Firstly, there is a lack of exploration into mining deeper insights from the data itself for conversational emotion tasks. Secondly, the systems exhibit vulnerability to random modality feature missing, which is a common occurrence in realistic settings. Focusing on these two key challenges, we propose a novel framework for incomplete multimodal learning in ERC, called "Inverted Teacher-studEnt seArCH Network (ITEACH-Net)." ITEACH-Net comprises two novel components: the Emotion Context Changing Encoder (ECCE) and the Inverted Teacher-Student (ITS) framework. Specifically, leveraging the tendency for emotional states to exhibit local stability within conversational contexts, ECCE captures these patterns and further perceives their evolution over time. Recognizing the varying challenges of handling incomplete versus complete data, ITS employs a teacher-student framework to decouple the respective computations. Subsequently, through Neural Architecture Search, the student model develops enhanced computational capabilities for handling incomplete data compared to the teacher model. During testing, we design a novel evaluation method, testing the model's performance under different missing rate conditions without altering the model weights. We conduct experiments on three benchmark ERC datasets, and the results demonstrate that our ITEACH-Net outperforms existing methods in incomplete multimodal ERC. We believe ITEACH-Net can inspire relevant research on the intrinsic nature of emotions within conversation scenarios and pave a more robust route for incomplete learning techniques. Codes will be made available.

Submitted 1 June, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

arXiv:2312.15258 [pdf, other]

Human101: Training 100+FPS Human Gaussians in 100s from 1 View

Authors: Mingwei Li, Jiachen Tao, Zongxin Yang, Yi Yang

Abstract: Reconstructing the human body from single-view videos plays a pivotal role in the virtual reality domain. One prevalent application scenario necessitates the rapid reconstruction of high-fidelity 3D digital humans while simultaneously ensuring real-time rendering and interaction. Existing methods often struggle to fulfill both requirements. In this paper, we introduce Human101, a novel framework adept at producing high-fidelity dynamic 3D human reconstructions from 1-view videos by training 3D Gaussians in 100 seconds and rendering in 100+ FPS. Our method leverages the strengths of 3D Gaussian Splatting, which provides an explicit and efficient representation of 3D humans. Standing apart from prior NeRF-based pipelines, Human101 ingeniously applies a Human-centric Forward Gaussian Animation method to deform the parameters of 3D Gaussians, thereby enhancing rendering speed (i.e., rendering 1024-resolution images at an impressive 60+ FPS and rendering 512-resolution images at 100+ FPS). Experimental results indicate that our approach substantially eclipses current methods, clocking up to a 10 times surge in frames per second and delivering comparable or superior rendering quality. Code and demos will be released at https://github.com/longxiang-ai/Human101.

Submitted 23 December, 2023; originally announced December 2023.

Comments: Website: https://github.com/longxiang-ai/Human101

arXiv:2312.14944 [pdf]

Surface termination effect of SrTiO3 substrate on ultrathin SrRuO3

Authors: Huiyu Wang, Zhen Wang, Zeeshan Ali, Enling Wang, Mohammad Saghayezhian, Jiandong Guo, Yimei Zhu, Jing Tao, Jiandi Zhang

Abstract: A uniform one-unit-cell-high step on the SrTiO3 substrate is a prerequisite for growing high-quality epitaxial oxide heterostructures. However, it is inevitable that defects induced by mixed substrate surface termination exist at the interface, significantly impacting the properties of ultrathin films. In this study, we microscopically identify the origin for the lateral inhomogeneity in the growth of ultrathin SrRuO3 films due to the step effects of SrTiO3(001). By using atomic-resolved scanning transmission electron microscopy, we observe two distinct types of step propagation along the [011] and [0-11] crystallographic direction in SrTiO3-SrRuO3 heterostructures, respectively. In particular, the type-II [0-11] step results in lateral discontinuity of monolayer SrRuO3 and originates from the SrO-terminated regions along the TiO2-terminated step edge. Such an induced lateral discontinuity should be responsible for the distinct electronic and magnetic properties of monolayer SrRuO3. Our findings underscore the critical importance of using single termination STO substrate to achieve high-quality termination selective films and to unveil the intrinsic properties of epitaxial films in the atomic limit.

Submitted 6 December, 2023; originally announced December 2023.

Comments: 19 pages, 10 figures, 30 references

arXiv:2312.11912 [pdf, other]

Probing the thermodynamics of charged Gauss Bonnet AdS black holes with the Lyapunov exponent

Authors: Xin Lyu, Jun Tao, Peng Wang

Abstract: In this paper, we investigate the thermodynamic properties of charged AdS Gauss-Bonnet black holes and the associations with the Lyapunov exponent. The chaotic features of the black holes and the isobaric heat capacity characterized by Lyapunov exponent are studied to reveal the stability of black hole phases. With the consideration of both timelike and null geodesic, we find the relationship between Lyapunov exponent and Hawking temperature can fully embody the feature of the Small/Large phase transition and the triple point even further. Then we briefly reveal the properties of Lyapunov exponent as an order parameter and explore the black hole shadow with it.

Submitted 19 December, 2023; originally announced December 2023.

Report number: CTP-SCU/2023039

arXiv:2312.09651 [pdf, other]

What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao

Abstract: The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition.

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by the main track The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2312.04293 [pdf, other]

GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition

Authors: Zheng Lian, Licai Sun, Haiyang Sun, Kang Chen, Zhuofan Wen, Hao Gu, Bin Liu, Jianhua Tao

Abstract: Recently, GPT-4 with Vision (GPT-4V) has demonstrated remarkable visual capabilities across various tasks, but its performance in emotion recognition has not been fully evaluated. To bridge this gap, we present the quantitative evaluation results of GPT-4V on 21 benchmark datasets covering 6 tasks: visual sentiment analysis, tweet sentiment analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition. This paper collectively refers to these tasks as "Generalized Emotion Recognition (GER)". Through experimental analysis, we observe that GPT-4V exhibits strong visual understanding capabilities in GER tasks. Meanwhile, GPT-4V shows the ability to integrate multimodal clues and exploit temporal information, which is also critical for emotion recognition. However, it's worth noting that GPT-4V is primarily designed for general domains and cannot recognize micro-expressions that require specialized knowledge. To the best of our knowledge, this paper provides the first quantitative assessment of GPT-4V for GER tasks. We have open-sourced the code and encourage subsequent researchers to broaden the evaluation scope by including more tasks and datasets. Our code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.

Submitted 17 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.02243 [pdf, other]

FlowHON: Representing Flow Fields Using Higher-Order Networks

Authors: Nan Chen, Zhihong Li, Jun Tao

Abstract: Flow fields are often partitioned into data blocks for massively parallel computation and analysis based on blockwise relationships. However, most of the previous techniques only consider the first-order dependencies among blocks, which is insufficient in describing complex flow patterns. In this work, we present FlowHON, an approach to construct higher-order networks (HONs) from flow fields. FlowHON captures the inherent higher-order dependencies in flow fields as nodes and estimates the transitions among them as edges. We formulate the HON construction as an optimization problem with three linear transformations. The first two layers correspond to the node generation and the third one corresponds to edge estimation. Our formulation allows the node generation and edge estimation to be solved in a unified framework. With FlowHON, the rich set of traditional graph algorithms can be applied without any modification to analyze flow fields, while leveraging the higher-order information to understand the inherent structure and manage flow data for efficiency. We demonstrate the effectiveness of FlowHON using a series of downstream tasks, including estimating the density of particles during tracing, partitioning flow fields for data management, and understanding flow fields using the node-link diagram representation of networks.

Submitted 4 December, 2023; originally announced December 2023.

Comments: To be submitted to TVCG

arXiv:2311.13231 [pdf, other]

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Authors: Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li

Abstract: Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO.

Submitted 23 March, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: CVPR 2024 accepted; huggingface daily paper

arXiv:2311.11606 [pdf, other]

Topology of Hořava-Lifshitz black holes in different ensembles

Authors: Deyou Chen, Yucheng He, Jun Tao, Wei Yang

Abstract: In this paper, we study topological numbers for uncharged and charged static black holes obtained in Hořava-Lifshitz gravity theory in different ensembles. We first calculate the topological numbers for the uncharged black holes by changing the value of the dynamic coupling constant, and find that the black holes with spherical and flat horizons have the same topological number. When the black hole's horizon is hyperbolic, different values of the coupling constant generate different topological numbers, which can be $1$, $0$ or $-1$. This shows that the coupling constant plays an important role in the topological classification. Then, we study the topological numbers for the charged black holes in different ensembles. The black hole with a spherical horizon has the same topological number in canonical and grand canonical ensembles. When the horizons are flat or hyperbolic, they have different topological numbers in canonical and grand canonical ensembles. Therefore, the topological numbers for the uncharged black holes are parameter dependent, and those for the charged black holes are ensemble dependent.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 17 pages, 13 figures

Journal ref: Eur. Phys. J. C (2024) 84:96

arXiv:2311.11586 [pdf, other]

Attractive interactions in the microstructures of asymptotically flat black holes

Authors: Deyou Chen, Jun Tao, Xuetao Yang

Abstract: In this work, we investigate the microstructure of asymptotically flat black holes with Ruppeiner curvature. Specifically, the cosmological constant is considered to have a fluctuation around 0. Under such consideration, both repulsive and attractive interactions are found in the Reissner-Nordström and Kerr black holes, while the Schwarzschild black hole has dominant attractive interaction. The result obtained is quite different from that of excluding the fluctuation of cosmological constant, where these black holes are found to be always characterised by repulsive interaction.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 17 pages, 3 figures

Journal ref: Physics of the Dark Universe, 42 (2023) 101379

Showing 1–50 of 453 results for author: Tao, J