Skip to main content

Showing 1–50 of 346 results for author: Xie, L

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.18862  [pdf, other

    cs.SD eess.AS

    Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

    Authors: Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie

    Abstract: Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire s… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  2. arXiv:2406.17976  [pdf, other

    eess.SY

    The Role of Electric Grid Research in Addressing Climate Change

    Authors: Le Xie, Subir Majumder, Tong Huang, Qian Zhang, ** Chang, David J. Hill, Mohammad Shahidehpour

    Abstract: Addressing the urgency of climate change necessitates a coordinated and inclusive effort from all relevant stakeholders. Critical to this effort is the modeling, analysis, control, and integration of technological innovations within the electric energy system, which plays a crucial role in scaling up climate change solutions. This perspective article presents a set of research challenges and oppor… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 17 pages, 2 figures

  3. arXiv:2406.09844  [pdf, other

    cs.SD eess.AS

    Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

    Authors: Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

    Abstract: Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while kee** the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model im… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  4. arXiv:2406.08393  [pdf, other

    eess.AS cs.SD

    SCDNet: Self-supervised Learning Feature-based Speaker Change Detection

    Authors: Yue Li, Xinsheng Wang, Li Zhang, Lei Xie

    Abstract: Speaker Change Detection (SCD) is to identify boundaries among speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including Hubert, wav… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  5. arXiv:2406.08196  [pdf, other

    cs.SD eess.AS

    FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

    Authors: Yuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

    Abstract: Vocoders reconstruct speech waveforms from acoustic features and play a pivotal role in modern TTS systems. Frequent-domain GAN vocoders like Vocos and APNet2 have recently seen rapid advancements, outperforming time-domain models in inference speed while achieving comparable audio quality. However, these frequency-domain vocoders suffer from large parameter sizes, thus introducing extra memory bu… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024; 5 pages, 5 figures

  6. arXiv:2406.07846  [pdf, other

    eess.AS

    DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

    Authors: Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

    Abstract: Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of automatic speech recognition (ASR) model with short chunks makes… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  7. arXiv:2406.07498  [pdf, other

    cs.SD eess.AS

    RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention

    Authors: Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed with superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, failure to use future information and constraint receptive field of convolution layers limit the system's perfor… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  8. arXiv:2406.07422  [pdf, other

    eess.AS

    Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

    Authors: Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li

    Abstract: The multi-codebook speech codec enables the application of large language models (LLM) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermor… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  9. arXiv:2406.07256  [pdf, ps, other

    cs.SD cs.AI eess.AS

    AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection

    Authors: Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li

    Abstract: The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks like automatic speech recognition (ASR) for fluent speech. However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the large… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  10. arXiv:2406.05961  [pdf, other

    eess.AS

    BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation

    Authors: Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: Audio packet loss is an inevitable problem in real-time speech communication. A band-split packet loss concealment network (BS-PLCNet) targeting full-band signals was recently proposed. Although it performs superiorly in the ICASSP 2024 PLC Challenge, BS-PLCNet is a large model with high computational complexity of 8.95G FLOPS. This paper presents its updated version, BS-PLCNet 2, to reduce comput… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  11. arXiv:2406.05763  [pdf, other

    eess.AS

    WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

    Authors: Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

    Abstract: With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio… ▽ More

    Submitted 19 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  12. arXiv:2406.05681  [pdf, other

    cs.SD eess.AS

    Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

    Authors: Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang

    Abstract: Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timb… ▽ More

    Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, accepted by Interspeech2024

  13. arXiv:2406.05672  [pdf, other

    eess.AS

    Text-aware and Context-aware Expressive Audiobook Speech Synthesis

    Authors: Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie

    Abstract: Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware(TACA) style modeling approach… ▽ More

    Submitted 12 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  14. arXiv:2405.10786  [pdf, other

    eess.AS

    Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix

    Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution that aims to conceal the speaker's identity while preserving the naturalness and distinctiveness of the original speech. Mainstream approaches use an utterance-level vector from a pre-trained automatic speaker verification (ASV) model to represent speaker identity, which is then averaged or modified for anonymization. However, these… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  15. arXiv:2405.10570  [pdf

    eess.IV cs.AI

    Simultaneous Deep Learning of Myocardium Segmentation and T2 Quantification for Acute Myocardial Infarction MRI

    Authors: Yirong Zhou, Chengyan Wang, Mengtian Lu, Kunyuan Guo, Zi Wang, Dan Ruan, Rui Guo, Peijun Zhao, Jianhua Wang, Naiming Wu, Jianzhong Lin, Yinyin Chen, Hang **, Lianxin Xie, Lilan Wu, Liuhong Zhu, Jianjun Zhou, Congbo Cai, He Wang, Xiaobo Qu

    Abstract: In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features… ▽ More

    Submitted 29 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

    Comments: 10 pages, 8 figures, 6 tables

  16. arXiv:2405.03152  [pdf, other

    eess.AS cs.SD

    MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

    Authors: Bingshen Mu, Yangze Li, Qijie Shao, Kun Wei, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

    Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLM), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. H… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

  17. arXiv:2405.02132  [pdf, other

    cs.SD cs.CL eess.AS

    Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

    Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

    Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu… ▽ More

    Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  18. Dynamic fault detection and diagnosis for alkaline water electrolyzer with variational Bayesian Sparse principal component analysis

    Authors: Qi Zhang, Weihua Xu, Lei Xie, Hongye Su

    Abstract: Electrolytic hydrogen production serves as not only a vital source of green hydrogen but also a key strategy for addressing renewable energy consumption challenges. For the safe production of hydrogen through alkaline water electrolyzer (AWE), dependable process monitoring technology is essential. However, random noise can easily contaminate the AWE process data collected in industrial settings, p… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

    Journal ref: Journal of Process Control, 135:103173, March 2024. ISSN 0959-1524

  19. arXiv:2404.09519  [pdf, other

    cs.LG eess.SY

    Nonlinear sparse variational Bayesian learning based model predictive control with application to PEMFC temperature control

    Authors: Qi Zhang, Lei Wang, Weihua Xu, Hongye Su, Lei Xie

    Abstract: The accuracy of the underlying model predictions is crucial for the success of model predictive control (MPC) applications. If the model is unable to accurately analyze the dynamics of the controlled system, the performance and stability guarantees provided by MPC may not be achieved. Learning-based MPC can learn models from data, improving the applicability and reliability of MPC. This study deve… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

  20. arXiv:2404.01609  [pdf

    eess.SY

    Identifying the Largest RoCoF and Its Implications

    Authors: Licheng Wang, Luochen Xie, Gang Huang, Changsen Feng

    Abstract: The rate of change of frequency (RoCoF) is a critical factor in ensuring frequency security, particularly in power systems with low inertia. Currently, most RoCoF security constrained optimal inertia dispatch methods and inertia market mechanisms predominantly rely on the center of inertia (COI) model. This model, however, does not account for the disparities in post-contingency frequency dynamics… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

  21. arXiv:2404.00608  [pdf, other

    math.OC eess.SY

    Sample Complexity of Chance Constrained Optimization in Dynamic Environment

    Authors: Apurv Shukla, Qian Zhang, Le Xie

    Abstract: We study the scenario approach for solving chance-constrained optimization in time-coupled dynamic environments. Scenario generation methods approximate the true feasible region from scenarios generated independently and identically from the actual distribution. In this paper, we consider this problem in a dynamic environment, where the scenarios are assumed to be drawn sequentially from an unknow… ▽ More

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: To apper in American Control Conference 2024

  22. arXiv:2404.00247  [pdf, ps, other

    eess.SY cs.AI cs.LG

    Facilitating Reinforcement Learning for Process Control Using Transfer Learning: Perspectives

    Authors: Runze Lin, Junghui Chen, Lei Xie, Hongye Su, Biao Huang

    Abstract: This paper provides insights into deep reinforcement learning (DRL) for process control from the perspective of transfer learning. We analyze the challenges of applying DRL in the field of process industries and the necessity of introducing transfer learning. Furthermore, recommendations and prospects are provided for future research directions on how transfer learning can be integrated with DRL t… ▽ More

    Submitted 1 May, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Final Version of Asian Control Conference (ASCC 2024)

  23. Exploring the Capabilities and Limitations of Large Language Models in the Electric Energy Sector

    Authors: Subir Majumder, Lin Dong, Fatemeh Doudi, Yuting Cai, Chao Tian, Dileep Kalathi, Kevin Ding, Anupam A. Thatte, Na Li, Le Xie

    Abstract: Large Language Models (LLMs) as chatbots have drawn remarkable attention thanks to their versatile capability in natural language processing as well as in a wide range of tasks. While there has been great enthusiasm towards adopting such foundational model-based artificial intelligence tools in all sectors possible, the capabilities and limitations of such LLMs in improving the operation of the el… ▽ More

    Submitted 20 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  24. arXiv:2402.18856  [pdf, other

    eess.IV cs.CV

    Anatomy-guided fiber trajectory distribution estimation for cranial nerves tractography

    Authors: Lei Xie, Qingrun Zeng, Huajun Zhou, Guoqiang Xie, Mingchu Li, Jiahao Huang, Jianan Cui, Hao Chen, Yuan**g Feng

    Abstract: Diffusion MRI tractography is an important tool for identifying and analyzing the intracranial course of cranial nerves (CNs). However, the complex environment of the skull base leads to ambiguous spatial correspondence between diffusion directions and fiber geometry, and existing diffusion tractography methods of CNs identification are prone to producing erroneous trajectories and missing true po… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

  25. arXiv:2402.10071  [pdf, other

    eess.SP cs.IT

    Approximate Message Passing-Enhanced Graph Neural Network for OTFS Data Detection

    Authors: Wenhao Zhuang, Yuyi Mao, Hengtao He, Lei Xie, Shenghui Song, Yao Ge, Zhi Ding

    Abstract: Orthogonal time frequency space (OTFS) modulation has emerged as a promising solution to support high-mobility wireless communications, for which, cost-effective data detectors are critical. Although graph neural network (GNN)-based data detectors can achieve decent detection accuracy at reasonable computational cost, they fail to best harness prior information of transmitted data. To further mini… ▽ More

    Submitted 14 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: 8 pages, 7 figures, and 3 tables. Part of this article was submitted to IEEE for possible publication

  26. arXiv:2402.03919  [pdf, other

    cs.IT eess.SP

    Sensing Mutual Information with Random Signals in Gaussian Channels: Bridging Sensing and Communication Metrics

    Authors: Lei Xie, Fan Liu, Jia** Luo, Shenghui Song

    Abstract: Sensing performance is typically evaluated by classical radar metrics, such as Cramer-Rao bound and signal-to-clutter-plus-noise ratio. The recent development of the integrated sensing and communication (ISAC) framework motivated the efforts to unify the performance metric for sensing and communication, where mutual information (MI) was proposed as a sensing performance metric with deterministic s… ▽ More

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2311.07081

  27. arXiv:2401.15647  [pdf, other

    cs.CV cs.AI eess.IV

    UP-CrackNet: Unsupervised Pixel-Wise Road Crack Detection via Adversarial Image Restoration

    Authors: Nachuan Ma, Rui Fan, Lihua Xie

    Abstract: Over the past decade, automated methods have been developed to detect cracks more efficiently, accurately, and objectively, with the ultimate goal of replacing conventional manual visual inspection techniques. Among these methods, semantic segmentation algorithms have demonstrated promising results in pixel-wise crack detection tasks. However, training such networks requires a large amount of huma… ▽ More

    Submitted 6 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

  28. arXiv:2401.11053  [pdf, other

    eess.AS cs.SD

    StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

    Authors: Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yu** Wang

    Abstract: Recent language model (LM) advancements have showcased impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually apply offline conversion from source semantics to acoustic features, demanding the complete source speech and limiting their deployment to real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zer… ▽ More

    Submitted 31 May, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

  29. arXiv:2401.06788  [pdf, other

    eess.AS cs.AI cs.SD

    The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

    Authors: He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie

    Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce… ▽ More

    Submitted 29 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Included in CNVSRC Workshop 2023, NCMMSC 2023

  30. arXiv:2401.05431  [pdf, other

    eess.SP cs.AI cs.LG

    TRLS: A Time Series Representation Learning Framework via Spectrogram for Medical Signal Processing

    Authors: Luyuan Xie, Cong Li, Xin Zhang, Shengfang Zhai, Yuejian Fang, Qingni Shen, Zhonghai Wu

    Abstract: Representation learning frameworks in unlabeled time series have been proposed for medical signal processing. Despite the numerous excellent progresses have been made in previous works, we observe the representation extracted for the time series still does not generalize well. In this paper, we present a Time series (medical signal) Representation Learning framework via Spectrogram (TRLS) to get m… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: This paper is accept by ICASSP 2024. This is a more detailed version

  31. arXiv:2401.04389  [pdf, other

    cs.SD eess.AS

    RaD-Net: A Repairing and Denoising Network for Speech Signal Improvement

    Authors: Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: This paper introduces our repairing and denoising network (RaD-Net) for the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. We extend our previous framework based on a two-stage network and propose an upgraded model. Specifically, we replace the repairing network with COM-Net from TEA-PSE. In addition, multi-resolution discriminators and multi-band discriminators are adopted in the training… ▽ More

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: submitted to ICASSP 2024

  32. arXiv:2401.03697  [pdf, other

    cs.SD eess.AS

    An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

    Authors: Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie

    Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benifits the back-en… ▽ More

    Submitted 6 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  33. arXiv:2401.03687  [pdf, other

    eess.AS cs.SD

    BS-PLCNet: Band-split Packet Loss Concealment Network with Multi-task Learning Framework and Multi-discriminators

    Authors: Zihan Zhang, Jiayao Sun, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

    Abstract: Packet loss is a common and unavoidable problem in voice over internet phone (VoIP) systems. To deal with the problem, we propose a band-split packet loss concealment network (BS-PLCNet). Specifically, we split the full-band signal into wide-band (0-8kHz) and high-band (8-24kHz). The wide-band signals are processed by a gated convolutional recurrent network (GCRN), while the high-band counterpart… ▽ More

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: submitted to ICASSP 2024

  34. arXiv:2401.03473  [pdf, ps, other

    cs.SD cs.AI eess.AS

    ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

    Authors: He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, Binbin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

    Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours… ▽ More

    Submitted 20 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  35. MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

    Authors: He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

    Abstract: While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the… ▽ More

    Submitted 8 April, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures Accepted at ICASSP 2024

  36. arXiv:2401.01685  [pdf

    eess.IV cs.CV

    Modality Exchange Network for Retinogeniculate Visual Pathway Segmentation

    Authors: Hua Han, Cheng Li, Lei Xie, Yuan**g Feng, Alou Diakite, Shanshan Wang

    Abstract: Accurate segmentation of the retinogeniculate visual pathway (RGVP) aids in the diagnosis and treatment of visual disorders by identifying disruptions or abnormalities within the pathway. However, the complex anatomical structure and connectivity of RGVP make it challenging to achieve accurate segmentation. In this study, we propose a novel Modality Exchange Network (ME-Net) that effectively utili… ▽ More

    Submitted 3 January, 2024; originally announced January 2024.

  37. arXiv:2401.01654  [pdf, other

    eess.IV cs.LG

    LESEN: Label-Efficient deep learning for Multi-parametric MRI-based Visual Pathway Segmentation

    Authors: Alou Diakite, Cheng Li, Lei Xie, Yuan**g Feng, Hua Han, Shanshan Wang

    Abstract: Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms in situations with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN… ▽ More

    Submitted 3 January, 2024; originally announced January 2024.

  38. arXiv:2401.00475  [pdf, other

    cs.SD eess.AS

    E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

    Authors: Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie

    Abstract: This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emo… ▽ More

    Submitted 6 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

    Comments: 6 pages, 3 figures

  39. arXiv:2312.16850  [pdf, other

    cs.SD eess.AS

    Accent-VITS:accent transfer for end-to-end TTS

    Authors: Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie

    Abstract: Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS.Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable… ▽ More

    Submitted 29 December, 2023; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: Accepted by NCMMSC2023

  40. arXiv:2312.15340  [pdf, other

    eess.SY cs.LG

    Meta-Learning-Based Adaptive Stability Certificates for Dynamical Systems

    Authors: Amit Jena, Dileep Kalathil, Le Xie

    Abstract: This paper addresses the problem of Neural Network (NN) based adaptive stability certification in a dynamical system. The state-of-the-art methods, such as Neural Lyapunov Functions (NLFs), use NN-based formulations to assess the stability of a non-linear dynamical system and compute a Region of Attraction (ROA) in the state space. However, under parametric uncertainty, if the values of system par… ▽ More

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: This article has been accepted for AAAI-24 (The 38th Annual AAAI Conference on Artificial Intelligence)

  41. arXiv:2312.15067  [pdf, other

    eess.SY

    Electromagnetic Transient Model of Cryptocurrency Mining Loads for Low-Voltage Ride Through Assessment in Transmission Grids

    Authors: Anindita Samanta, Subir Majumder, Hasan Ibrahim, Prasad Enjeti, Le Xie

    Abstract: In this paper, we developed an Electromagnetic Transient (EMT) model tailored for large cryptocurrency mining loads to understand the cross-interaction of these loads with the electric grid. The load model has been built using Electromagnetic Transients Program (EMTP) software. We have cross-validated the performance of the EMT model of the load with commercial application-specific integrated circ… ▽ More

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: 5 pages, 10 figures, conference

  42. arXiv:2312.13076  [pdf, other

    cs.RO eess.SY

    How to Integrate Digital Twin and Virtual Reality in Robotics Systems? Design and Implementation for Providing Robotics Maintenance Services in Data Centers

    Authors: Lin Xie, Hanyi Li

    Abstract: In the context of Industry 4.0, the physical and digital worlds are closely connected, and robots are widely used to achieve system automation. Digital twin solutions have contributed significantly to the growth of Industry 4.0. Combining various technologies is a trend that aims to improve system performance. For example, digital twinning can be combined with virtual reality in automated systems.… ▽ More

    Submitted 20 December, 2023; originally announced December 2023.

  43. Angle-Displacement Rigidity Theory with Application to Distributed Network Localization

    Authors: Xu Fang, Xiaolei Li, Lihua Xie

    Abstract: This paper investigates the localization problem of a network in 2-D and 3-D spaces given the positions of anchor nodes in a global frame and inter-node relative measurements in local coordinate frames. It is assumed that the local frames of different nodes have different unknown orientations. First, an angle-displacement rigidity theory is developed, which can be used to localize all the free nod… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

  44. Distributed Semi-global Output Feedback Formation Maneuver Control of High-order Multi-agent Systems

    Authors: Xu Fang, Lihua Xie

    Abstract: This paper addresses the formation maneuver control problem of leader-follower multi-agent systems with high-order integrator dynamics. A distributed output feedback formation maneuver controller is proposed to achieve desired maneuvers so that the scale, orientation, translation, and shape of formation can be manipulated continuously, where the followers do not need to know or estimate the time-v… ▽ More

    Submitted 18 December, 2023; originally announced December 2023.

  45. 3-D Distributed Localization with Mixed Local Relative Measurements

    Authors: Xu Fang, Xiaolei Li, Lihua Xie

    Abstract: This paper studies 3-D distributed network localization using mixed types of local relative measurements. Each node holds a local coordinate frame without a common orientation and can only measure one type of information (relative position, distance, relative bearing, angle, or ratio-of-distance measurements) about its neighboring nodes in its local coordinate frame. A novel rigidity-theory-based… ▽ More

    Submitted 18 December, 2023; originally announced December 2023.

  46. Distributed Localization in Dynamic Networks via Complex Laplacian

    Authors: Xu Fang, Lihua Xie, Xiaolei Li

    Abstract: Different from most existing distributed localization approaches in static networks where the agents in a network are static, this paper addresses the distributed localization problem in dynamic networks where the positions of the agents are time-varying. Firstly, complex constraints for the positions of the agents are constructed based on local relative position (distance and local bearing) measu… ▽ More

    Submitted 18 December, 2023; originally announced December 2023.

  47. arXiv:2312.09760  [pdf, other

    eess.AS cs.SD

    U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias

    Authors: Ao Zhang, Pan Zhou, Kaixun Huang, Yong Zou, Ming Liu, Lei Xie

    Abstract: Open-vocabulary keyword spotting (KWS), which allows users to customize keywords, has attracted increasingly more interest. However, existing methods based on acoustic models and post-processing train the acoustic model with ASR training criteria to model all phonemes, making the acoustic model under-optimized for the KWS task. To solve this problem, we propose a novel unified two-pass open-vocabu… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by ASRU2023

  48. arXiv:2312.09747  [pdf, other

    eess.AS eess.SP

    SELM: Speech Enhancement Using Discrete Tokens and Language Models

    Authors: Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

    Abstract: Language models (LMs) have shown superior performances in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information holds potential advantages for speech enhancement tasks. In light of this, we propose SELM, a novel paradigm for speech… ▽ More

    Submitted 7 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  49. arXiv:2312.09746  [pdf, other

    cs.SD eess.AS

    Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

    Authors: Bingshen Mu, Pengcheng Guo, Dake Guo, Pan Zhou, Wei Chen, Lei Xie

    Abstract: Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition perf… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  50. arXiv:2312.06154  [pdf, other

    eess.SY

    Predictive Reliability Assessment of Distribution Grids with Residential Distributed Energy Resources

    Authors: Arun Kumar Karngala, Chanan Singh, Le Xie

    Abstract: Distribution system end users are transforming from passive to active participants, marked by the push towards widespread adoption of edge-level Distributed Energy Resources (DERs). This paper addresses the challenges in distribution system planning arising from these dynamic changes. We introduce a bottom-up probabilistic approach that integrates these edge-level DERs into the reliability evaluat… ▽ More

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: 10 Pages, 6 figures, Journal