
Showing 1–50 of 699 results for author: Xie, L

  1. arXiv:2406.18862  [pdf, other]

    cs.SD eess.AS

    Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

    Authors: Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie

    Abstract: Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire s…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  2. arXiv:2406.18462  [pdf, other]

    cs.CV cs.GR

    GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality

    Authors: Taoran Yi, Jiemin Fang, Zanwei Zhou, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Xinggang Wang, Qi Tian

    Abstract: Recently, 3D Gaussian splatting (3D-GS) has achieved great success in reconstructing and rendering real-world scenes. To transfer the high rendering quality to generation tasks, a series of research works attempt to generate 3D-Gaussian assets from text. However, the generated assets have not achieved the same quality as those in reconstruction tasks. We observe that Gaussians tend to grow without…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Project page: https://taoranyi.com/gaussiandreamerpro/

  3. arXiv:2406.17777  [pdf, other]

    cs.CV

    Text-Animator: Controllable Visual Text Video Generation

    Authors: Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

    Abstract: Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video (T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summar…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Project Page: https://laulampaul.github.io/text-animator.html

  4. arXiv:2406.15983  [pdf, other]

    cs.IR

    Learning k-Determinantal Point Processes for Personalized Ranking

    Authors: Yuli Liu, Christian Walder, Lexing Xie

    Abstract: The key to personalized recommendation is to predict a personalized ranking on a catalog of items by modeling the user's preferences. There are many personalized ranking approaches for item recommendation from implicit feedback like Bayesian Personalized Ranking (BPR) and listwise ranking. Although these methods have shown performance benefits, there are still limitations affecting recommendation p…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: 14 pages, accepted at ICDE 2024 (40th IEEE International Conference on Data Engineering)

  5. arXiv:2406.15339  [pdf, other]

    cs.CV cs.AI cs.MM

    Image Conductor: Precision Control for Interactive Video Synthesis

    Authors: Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan

    Abstract: Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/

  6. arXiv:2406.11327  [pdf, other]

    cs.CV

    ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding

    Authors: Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Yuan Zhang, David Doermann, Qixiang Ye

    Abstract: An essential topic for multimodal large language models (MLLMs) is aligning vision and language concepts at a finer level. In particular, we devote efforts to encoding visual referential information for tasks such as referring and grounding. Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location, bringing extra burdens in tra…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Project page: https://github.com/martian422/ClawMachine

  7. arXiv:2406.09844  [pdf, other]

    cs.SD eess.AS

    Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

    Authors: Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

    Abstract: Zero-shot voice conversion (VC) aims to transform source speech into an arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model im…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  8. arXiv:2406.08393  [pdf, other]

    eess.AS cs.SD

    SCDNet: Self-supervised Learning Feature-based Speaker Change Detection

    Authors: Yue Li, Xinsheng Wang, Li Zhang, Lei Xie

    Abstract: Speaker Change Detection (SCD) aims to identify the boundaries between speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including HuBERT, wav…

    Submitted 12 June, 2024; originally announced June 2024.

  9. arXiv:2406.08196  [pdf, other]

    cs.SD eess.AS

    FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

    Authors: Yuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

    Abstract: Vocoders reconstruct speech waveforms from acoustic features and play a pivotal role in modern TTS systems. Frequency-domain GAN vocoders like Vocos and APNet2 have recently seen rapid advancements, outperforming time-domain models in inference speed while achieving comparable audio quality. However, these frequency-domain vocoders suffer from large parameter sizes, thus introducing extra memory bu…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024; 5 pages, 5 figures

  10. arXiv:2406.07498  [pdf, other]

    cs.SD eess.AS

    RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention

    Authors: Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed with superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, failure to use future information and the constrained receptive field of convolution layers limit the system's perfor…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  11. arXiv:2406.07256  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection

    Authors: Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li

    Abstract: The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks like automatic speech recognition (ASR) for fluent speech. However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the large…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  12. arXiv:2406.06579  [pdf, other]

    cs.CL cs.AI cs.CV

    From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models

    Authors: Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye

    Abstract: Recently, multimodal large language models have exploded in variety. Most of the popular Large Vision Language Models (LVLMs) depend on sequential visual representation, where images are converted into hundreds or thousands of tokens before being input into the Large Language Model (LLM) along with language prompts. The black-box design hinders the interpretability of visual-language…

    Submitted 13 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  13. arXiv:2406.05681  [pdf, other]

    cs.SD eess.AS

    Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

    Authors: Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang

    Abstract: Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timb…

    Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, accepted by Interspeech2024

  14. arXiv:2406.04062  [pdf, other]

    cs.GT

    Online Learning in Betting Markets: Profit versus Prediction

    Authors: Haiqing Zhu, Alexander Soen, Yun Kuen Cheung, Lexing Xie

    Abstract: We examine two types of binary betting markets, whose primary goal is either to profit (such as sports gambling) or to gain information (such as prediction markets). We articulate the interplay between belief and price-setting to analyse both types of markets, and show that the goals of maximising bookmaker profit and eliciting information are fundamentally incompatible. A key insight is that profit hin…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: ICML 2024

  15. arXiv:2406.03262  [pdf, other]

    cs.CV

    ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection

    Authors: Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

    Abstract: Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across differen…

    Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  16. arXiv:2406.00364  [pdf, other]

    cs.RO

    Cognitive Manipulation: Semi-supervised Visual Representation and Classroom-to-real Reinforcement Learning for Assembly in Semi-structured Environments

    Authors: Chuang Wang, Lie Yang, Ze Lin, Yizhi Liao, Gang Chen, Longhan Xie

    Abstract: Assembling a slave object into a fixture-free master object represents a critical challenge in flexible manufacturing. Existing deep reinforcement learning-based methods, while benefiting from visual or operational priors, often struggle with small-batch precise assembly tasks due to their reliance on insufficient priors and costly model development. To address these limitations, this paper i…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 15 pages, 14 figures

  17. arXiv:2406.00258  [pdf, other]

    cs.CV cs.AI

    Artemis: Towards Referential Understanding in Complex Videos

    Authors: Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian

    Abstract: Videos carry rich visual information including object description, action, interaction, etc., but existing multimodal large language models (MLLMs) fall short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language quest…

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: 19 pages, 14 figures. Code and data are available at https://github.com/qiujihao19/Artemis

  18. arXiv:2405.20892  [pdf, other]

    cs.CV cs.AI

    MALT: Multi-scale Action Learning Transformer for Online Action Detection

    Authors: Zhipeng Yang, Ruoyu Wang, Yang Tan, …

    Abstract: Online action detection (OAD) aims to identify ongoing actions from streaming video in real-time, without access to future frames. Since these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may result in a lack of local information, necessitating the acquisition of action features across multiple…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: 8 pages, 3 figures

  19. arXiv:2405.18840  [pdf, other]

    cs.CV

    Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

    Authors: Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yaoming Wang, Lingxi Xie, Qi Tian, Wei Shen

    Abstract: Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between…

    Submitted 29 May, 2024; originally announced May 2024.

  20. arXiv:2405.16876  [pdf, other]

    cs.LG cs.AI

    Transfer Learning for Diffusion Models

    Authors: Yidong Ouyang, Liyan Xie, Hongyuan Zha, Guang Cheng

    Abstract: Diffusion models, a specific type of generative model, have achieved unprecedented performance in recent years and consistently produce high-quality synthetic samples. A critical prerequisite for their notable success lies in the presence of a substantial number of training samples, which can be impractical in real-world applications due to high collection costs or associated risks. Consequently,…

    Submitted 27 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: 24 pages

  21. arXiv:2405.10570  [pdf]

    eess.IV cs.AI

    Simultaneous Deep Learning of Myocardium Segmentation and T2 Quantification for Acute Myocardial Infarction MRI

    Authors: Yirong Zhou, Chengyan Wang, Mengtian Lu, Kunyuan Guo, Zi Wang, Dan Ruan, Rui Guo, Peijun Zhao, Jianhua Wang, Naiming Wu, Jianzhong Lin, Yinyin Chen, Hang Jin, Lianxin Xie, Lilan Wu, Liuhong Zhu, Jianjun Zhou, Congbo Cai, He Wang, Xiaobo Qu

    Abstract: In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features…

    Submitted 29 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

    Comments: 10 pages, 8 figures, 6 tables

  22. arXiv:2405.07973  [pdf, other]

    cs.PL

    A Natural Formalized Proof Language

    Authors: Lihan Xie, Zhicheng Hui, Qinxiang Cao

    Abstract: Artificial intelligence assisted mathematical proof has become an area of intense focus. One key problem in this field is to generate formal mathematical proofs from natural language proofs. For historical reasons, the formal proof languages adopted by traditional theorem provers were not intended to represent natural language proofs. Therefore, they are not well-suited for the aforementi…

    Submitted 13 May, 2024; originally announced May 2024.

  23. arXiv:2405.06822  [pdf, other]

    cs.LG cs.AI

    MH-pFLID: Model Heterogeneous personalized Federated Learning via Injection and Distillation for Medical Data Analysis

    Authors: Luyuan Xie, Manqing Lin, Tianyu Luan, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu

    Abstract: Federated learning is widely used in medical applications for training global models without needing local data access. However, varying computational capabilities and network architectures (system heterogeneity) across clients pose significant challenges in effectively aggregating information from non-independently and identically distributed (non-IID) data. Current federated learning methods us…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: This paper is accepted by ICML 2024

  24. arXiv:2405.05724  [pdf, other]

    cs.SI cs.CR cs.IT

    Private Online Community Detection for Censored Block Models

    Authors: Mohamed Seif, Liyan Xie, Andrea J. Goldsmith, H. Vincent Poor

    Abstract: We study the private online change detection problem for dynamic communities, using a censored block model (CBM). Focusing on the notion of edge differential privacy (DP), we seek to understand the fundamental tradeoffs between the privacy budget, detection delay, and exact recovery of community labels. We establish the theoretical lower bound on the delay in detecting changes privately…

    Submitted 9 May, 2024; originally announced May 2024.

  25. arXiv:2405.03462  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    A Lightweight Neural Architecture Search Model for Medical Image Classification

    Authors: Lunchen Xie, Eugenio Lomurno, Matteo Gambella, Danilo Ardagna, Manuel Roveri, Matteo Matteucci, Qingjiang Shi

    Abstract: Accurate classification of medical images is essential for modern diagnostics. Deep learning advancements have led clinicians to increasingly use sophisticated models to make faster and more accurate decisions, sometimes replacing human judgment. However, model development is costly and repetitive. Neural Architecture Search (NAS) provides solutions by automating the design of deep learning architectur…

    Submitted 6 May, 2024; originally announced May 2024.

  26. arXiv:2405.03152  [pdf, other]

    eess.AS cs.SD

    MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

    Authors: Bingshen Mu, Yangze Li, Qijie Shao, Kun Wei, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

    Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. H…

    Submitted 6 May, 2024; originally announced May 2024.

  27. arXiv:2405.02132  [pdf, other]

    cs.SD cs.CL eess.AS

    Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

    Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

    Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu…

    Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  28. arXiv:2404.09524  [pdf, other]

    cs.LG

    Dynamic fault detection and diagnosis of industrial alkaline water electrolyzer process with variational Bayesian dictionary learning

    Authors: Qi Zhang, Lei Xie, Weihua Xu, Hongye Su

    Abstract: Alkaline Water Electrolysis (AWE) is one of the simplest green hydrogen production methods using renewable energy. An AWE system typically yields process variables that are serially correlated and contaminated by measurement uncertainty. A novel robust dynamic variational Bayesian dictionary learning (RDVDL) monitoring approach is proposed to improve the reliability and safety of AWE operation.…

    Submitted 15 April, 2024; originally announced April 2024.

  29. arXiv:2404.09519  [pdf, other]

    cs.LG eess.SY

    Nonlinear sparse variational Bayesian learning based model predictive control with application to PEMFC temperature control

    Authors: Qi Zhang, Lei Wang, Weihua Xu, Hongye Su, Lei Xie

    Abstract: The accuracy of the underlying model predictions is crucial for the success of model predictive control (MPC) applications. If the model is unable to accurately analyze the dynamics of the controlled system, the performance and stability guarantees provided by MPC may not be achieved. Learning-based MPC can learn models from data, improving the applicability and reliability of MPC. This study deve…

    Submitted 15 April, 2024; originally announced April 2024.

  30. arXiv:2404.06564  [pdf, other]

    cs.CV

    MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection

    Authors: Haoyang He, Yuhu Bai, Jiangning Zhang, Qingdong He, Hongxu Chen, Zhenye Gan, Chengjie Wang, Xiangtai Li, Guanzhong Tian, Lei Xie

    Abstract: Recent advancements in anomaly detection have seen the efficacy of CNN- and transformer-based approaches. However, CNNs struggle with long-range dependencies, while transformers are burdened by quadratic computational complexity. Mamba-based models, with their superior long-range modeling and linear efficiency, have garnered substantial attention. This study pioneers the application of Mamba to mu…

    Submitted 14 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  31. arXiv:2404.05667  [pdf, other]

    cs.CV

    AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

    Authors: Jiannan Ge, Lingxi Xie, Hongtao Xie, Pandeng Li, Xiaopeng Zhang, Yongdong Zhang, Qi Tian

    Abstract: A serious issue that harms the performance of zero-shot visual recognition is named objective misalignment, i.e., the learning objective prioritizes improving the recognition accuracy of seen classes rather than unseen classes, while the latter is the true target to pursue. This issue becomes more significant in zero-shot image segmentation because the stronger (i.e., pixel-level) supervision brin…

    Submitted 8 April, 2024; originally announced April 2024.

  32. arXiv:2404.05466  [pdf, other]

    cs.CV

    Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

    Authors: He Wang, Pengcheng Guo, Xucheng Wan, Huan Zhou, Lei Xie

    Abstract: Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multi-encoder. Specifically, we first propose a novel multi-s…

    Submitted 30 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: 6 pages, 3 figures, Accepted at ICMEW 2024

  33. Salient Sparse Visual Odometry With Pose-Only Supervision

    Authors: Siyu Chen, Kangcheng Liu, Chen Wang, Shenghai Yuan, Jianfei Yang, Lihua Xie

    Abstract: Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks,…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: Accepted by IEEE Robotics and Automation Letters

  34. arXiv:2404.00540  [pdf, other]

    cs.CV cs.AI

    Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial Patches

    Authors: Lingxuan Wu, Xiao Yang, Yinpeng Dong, Liuwei Xie, Hang Su, Jun Zhu

    Abstract: The vulnerability of deep neural networks to adversarial patches has motivated numerous defense strategies for boosting model robustness. However, the prevailing defenses depend on a single observation or pre-established adversary information to counter adversarial patches, often failing when confronted with unseen or adaptive adversarial attacks and easily exhibiting unsatisfactory performance in dy…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 27 pages

  35. arXiv:2404.00247  [pdf, ps, other]

    eess.SY cs.AI cs.LG

    Facilitating Reinforcement Learning for Process Control Using Transfer Learning: Perspectives

    Authors: Runze Lin, Junghui Chen, Lei Xie, Hongye Su, Biao Huang

    Abstract: This paper provides insights into deep reinforcement learning (DRL) for process control from the perspective of transfer learning. We analyze the challenges of applying DRL in the field of process industries and the necessity of introducing transfer learning. Furthermore, recommendations and prospects are provided for future research directions on how transfer learning can be integrated with DRL t…

    Submitted 1 May, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Final Version of Asian Control Conference (ASCC 2024)

  36. arXiv:2403.16198  [pdf, other]

    cs.CV

    Diffusion Model is a Good Pose Estimator from 3D RF-Vision

    Authors: Junqiao Fan, Jianfei Yang, Yuecong Xu, Lihua Xie

    Abstract: Human pose estimation (HPE) from Radio Frequency vision (RF-vision) performs human sensing using RF signals that penetrate obstacles without revealing privacy (e.g., facial information). Recently, mmWave radar has emerged as a promising RF-vision sensor, providing radar point clouds by processing RF signals. However, the mmWave radar has a limited resolution with severe noise, leading to inaccurat…

    Submitted 24 March, 2024; originally announced March 2024.

  37. arXiv:2403.16071  [pdf, other]

    cs.AI cs.CV cs.MM

    Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

    Authors: Linzhi Wu, Xingyu Zhang, Yakun Zhang, Changyan Zheng, Tiejun Liu, Liang Xie, Ye Yan, Erwei Yin

    Abstract: Lip reading, the process of interpreting silent speech from visual lip movements, has gained rising attention for its wide range of realistic applications. Deep learning approaches greatly improve current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading s…

    Submitted 2 May, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: To appear in LREC-COLING 2024

    Journal ref: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

  38. arXiv:2403.15805  [pdf, other]

    cs.RO

    AirCrab: A Hybrid Aerial-Ground Manipulator with An Active Wheel

    Authors: Muqing Cao, Jiayan Zhao, Xinhang Xu, Lihua Xie

    Abstract: Inspired by the behavior of birds, we present AirCrab, a hybrid aerial-ground manipulator (HAGM) with a single active wheel and a 3-degree-of-freedom (3-DoF) manipulator. AirCrab leverages a single point of contact with the ground to reduce position drift and improve manipulation accuracy. The single active wheel enables locomotion on narrow surfaces without adding significant weight to the robot.…

    Submitted 23 March, 2024; originally announced March 2024.

  39. arXiv:2403.14849  [pdf, other]

    cs.IT cs.LG

    Output-Constrained Lossy Source Coding With Application to Rate-Distortion-Perception Theory

    Authors: Li Xie, Liangyan Li, Jun Chen, Zhongshan Zhang

    Abstract: The distortion-rate function of output-constrained lossy source coding with limited common randomness is analyzed for the special case of squared error distortion measure. An explicit expression is obtained when both source and reconstruction distributions are Gaussian. This further leads to a partial characterization of the information-theoretic limit of quadratic Gaussian rate-distortion-percept…

    Submitted 21 March, 2024; originally announced March 2024.

  40. arXiv:2403.14173  [pdf, other]

    cs.RO

    HCTO: Optimality-Aware LiDAR Inertial Odometry with Hybrid Continuous Time Optimization for Compact Wearable Mapping System

    Authors: Jianping Li, Shenghai Yuan, Muqing Cao, Thien-Minh Nguyen, Kun Cao, Lihua Xie

    Abstract: Compact wearable mapping systems (WMS) have gained significant attention due to their convenience in various applications. Specifically, they provide an efficient way to collect prior maps for 3D structure inspection and robot-based "last-mile delivery" in complex environments. However, vibrations in human motion and the uneven distribution of point cloud features in complex environments often lead t…

    Submitted 21 March, 2024; originally announced March 2024.

  41. arXiv:2403.12986  [pdf, other]

    cs.CV cs.LG

    BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning

    Authors: Qianhan Feng, Lujing Xie, Shijie Fang, Tong Lin

    Abstract: Semi-supervised Learning (SSL) reduces the need for extensive annotations in deep learning, but the more realistic challenge of imbalanced data distribution in SSL remains largely unexplored. In Class Imbalanced Semi-supervised Learning (CISSL), the bias introduced by unreliable pseudo-labels can be exacerbated by imbalanced data distributions. Most existing methods address this issue at instance-…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted paper of AAAI 2024

  42. arXiv:2403.12798  [pdf, other]

    cs.RO math.OC

    Introducing Combi-Stations in Robotic Mobile Fulfilment Systems: A Queueing-Theory-Based Efficiency Analysis

    Authors: Lin Xie, Sonja Otten

    Abstract: In the era of digital commerce, the surge in online shopping and the expectation for rapid delivery have placed unprecedented demands on warehouse operations. The traditional method of order fulfilment, where human order pickers traverse large storage areas to pick items, has become a bottleneck, consuming valuable time and resources. Robotic Mobile Fulfilment Systems (RMFS) offer a solution by us…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: 15 pages, 7 figures. arXiv admin note: text overlap with arXiv:1912.01782

  43. arXiv:2403.11496  [pdf, other

    cs.RO cs.AI

    MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception

    Authors: Thien-Minh Nguyen, Shenghai Yuan, Thien Hoang Nguyen, Pengyu Yin, Haozhi Cao, Lihua Xie, Maciej Wozniak, Patric Jensfelt, Marko Thiel, Justin Ziegenbein, Noel Blunder

    Abstract: Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), featuring a wide range of sen… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted by The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  44. arXiv:2403.06461  [pdf, other

    cs.CV

    Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation

    Authors: Haozhi Cao, Yuecong Xu, Jianfei Yang, Pengyu Yin, Xingyu Ji, Shenghai Yuan, Lihua Xie

    Abstract: Multi-modal test-time adaptation (MM-TTA) is proposed to adapt models to an unlabeled target domain by leveraging the complementary multi-modal inputs in an online manner. Previous MM-TTA methods rely on predictions of cross-modal information in each input frame, while they ignore the fact that predictions of geometric neighborhoods within consecutive frames are highly correlated, leading to unsta… ▽ More

    Submitted 15 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

  45. arXiv:2403.06124  [pdf, other

    cs.CV cs.RO

    PSS-BA: LiDAR Bundle Adjustment with Progressive Spatial Smoothing

    Authors: Jianping Li, Thien-Minh Nguyen, Shenghai Yuan, Lihua Xie

    Abstract: Accurate and consistent construction of point clouds from LiDAR scanning data is fundamental for 3D modeling applications. Current solutions, such as multiview point cloud registration and LiDAR bundle adjustment, predominantly depend on the local plane assumption, which may be inadequate in complex environments lacking of planar geometries or substantial initial pose errors. To mitigate this prob… ▽ More

    Submitted 10 March, 2024; originally announced March 2024.

  46. arXiv:2403.03645  [pdf, other

    cs.AI

    K-Link: Knowledge-Link Graph from LLMs for Enhanced Representation Learning in Multivariate Time-Series Data

    Authors: Yucheng Wang, Ruibing Jin, Min Wu, Xiaoli Li, Lihua Xie, Zhenghua Chen

    Abstract: Sourced from various sensors and organized chronologically, Multivariate Time-Series (MTS) data involves crucial spatial-temporal dependencies, e.g., correlations among sensors. To capture these dependencies, Graph Neural Networks (GNNs) have emerged as powerful tools, yet their effectiveness is restricted by the quality of graph construction from MTS data. Typically, existing approaches construct… ▽ More

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: 12 pages, 7 figures

  47. arXiv:2403.01225  [pdf, other

    cs.RO

    A Cost-Effective Cooperative Exploration and Inspection Strategy for Heterogeneous Aerial System

    Authors: Xinhang Xu, Muqing Cao, Shenghai Yuan, Thien Hoang Nguyen, Thien-Minh Nguyen, Lihua Xie

    Abstract: In this paper, we propose a cost-effective strategy for heterogeneous UAV swarm systems for cooperative aerial inspection. Unlike previous swarm inspection works, the proposed method does not rely on precise prior knowledge of the environment and can complete full 3D surface coverage of objects in any shape. In this work, agents are partitioned into teams, with each drone assign a different task,… ▽ More

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: Baseline method of CARIC at CDC 2023, Singapore

  48. arXiv:2402.19264  [pdf, other

    cs.CV

    T3DNet: Compressing Point Cloud Models for Lightweight 3D Recognition

    Authors: Zhiyuan Yang, Yunjiao Zhou, Lihua Xie, Jianfei Yang

    Abstract: 3D point cloud has been widely used in many mobile application scenarios, including autonomous driving and 3D sensing on mobile devices. However, existing 3D point cloud models tend to be large and cumbersome, making them hard to deploy on edged devices due to their high memory requirements and non-real-time latency. There has been a lack of research on how to compress 3D point cloud models into l… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 12 pages

  49. arXiv:2402.19258  [pdf, other

    cs.CV

    MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition

    Authors: Jianfei Yang, Shijie Tang, Yuecong Xu, Yunjiao Zhou, Lihua Xie

    Abstract: Human activity recognition (HAR) has been playing an increasingly important role in various domains such as healthcare, security monitoring, and metaverse gaming. Though numerous HAR methods based on computer vision have been developed to show prominent performance, they still suffer from poor robustness in adverse visual conditions in particular low illumination, which motivates WiFi-based HAR to… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 9 pages

  50. arXiv:2402.18856  [pdf, other

    eess.IV cs.CV

    Anatomy-guided fiber trajectory distribution estimation for cranial nerves tractography

    Authors: Lei Xie, Qingrun Zeng, Huajun Zhou, Guoqiang Xie, Mingchu Li, Jiahao Huang, Jianan Cui, Hao Chen, Yuanjing Feng

    Abstract: Diffusion MRI tractography is an important tool for identifying and analyzing the intracranial course of cranial nerves (CNs). However, the complex environment of the skull base leads to ambiguous spatial correspondence between diffusion directions and fiber geometry, and existing diffusion tractography methods of CNs identification are prone to producing erroneous trajectories and missing true po… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.