Showing 1–25 of 25 results for author: Rao, W

Searching in archive eess.
  1. arXiv:2312.04131  [pdf, other]

    eess.AS cs.SD

    Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

    Authors: Huan Zhao, Li Zhang, Yue Li, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

    Abstract: The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM…

    Submitted 7 December, 2023; originally announced December 2023.

  2. arXiv:2310.01861  [pdf, other]

    eess.IV cs.CV cs.GR

    Shifting More Attention to Breast Lesion Segmentation in Ultrasound Videos

    Authors: Junhao Lin, Qian Dai, Lei Zhu, Huazhu Fu, Qiong Wang, Weibin Li, Wenhao Rao, Xiaoyang Huang, Liansheng Wang

    Abstract: Breast lesion segmentation in ultrasound (US) videos is essential for diagnosing and treating axillary lymph node metastasis. However, the lack of a well-established and large-scale ultrasound video dataset with high-quality annotations has posed a persistent challenge for the research community. To overcome this issue, we meticulously curated a US video breast lesion segmentation dataset comprisi…

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: 10 pages

  3. arXiv:2307.15400  [pdf, other]

    cs.SD eess.AS

    The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022

    Authors: Li Zhang, Huan Zhao, Yue Li, Bowen Pang, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

    Abstract: This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held in ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the…

    Submitted 28 July, 2023; originally announced July 2023.

  4. arXiv:2306.16250  [pdf, other]

    cs.SD eess.AS

    MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

    Authors: Jun Chen, Wei Rao, Zilin Wang, Jiuxin Lin, Yukai Ju, Shulin He, Yannan Wang, Zhiyong Wu

    Abstract: The previous SpEx+ has yielded outstanding performance in speaker extraction and attracted much attention. However, it still encounters inadequate utilization of multi-scale information and speaker embedding. To this end, this paper proposes a new effective speaker extraction system with multi-scale interfusion and conditional speaker modulation (ConSM), which is called MC-SpEx. First of all, we d…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: Accepted by InterSpeech 2023

  5. arXiv:2306.08454  [pdf, other]

    cs.SD eess.AS

    Gesper: A Restoration-Enhancement Framework for General Speech Reconstruction

    Authors: Wenzhe Liu, Yupeng Shi, Jun Chen, Wei Rao, Shulin He, Andong Li, Yannan Wang, Zhiyong Wu

    Abstract: This paper describes a real-time General Speech Reconstruction (Gesper) system submitted to the ICASSP 2023 Speech Signal Improvement (SSI) Challenge. This novel proposed system is a two-stage architecture, in which the speech restoration is performed, and then cascaded by speech enhancement. We propose a complex spectral mapping-based generative adversarial network (CSM-GAN) as the speech restora…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted by InterSpeech 2023

  6. arXiv:2305.05599  [pdf, other]

    cs.SD cs.HC eess.AS

    Inter-SubNet: Speech Enhancement with Subband Interaction

    Authors: Jun Chen, Wei Rao, Zilin Wang, Jiuxin Lin, Zhiyong Wu, Yannan Wang, Shidong Shang, Helen Meng

    Abstract: Subband-based approaches process subbands in parallel through the model with shared parameters to learn the commonality of local spectrums for noise reduction. In this way, they have achieved remarkable results with fewer parameters. However, in some complex environments, the lack of global spectral information has a negative impact on the performance of these subband-based approaches. To this end…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted by ICASSP 2023

  7. arXiv:2303.07704  [pdf, other]

    eess.AS cs.SD

    TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge

    Authors: Yukai Ju, Jun Chen, Shimin Zhang, Shulin He, Wei Rao, Weixin Zhu, Yannan Wang, Tao Yu, Shidong Shang

    Abstract: This paper introduces the Unbeatable Team's submission to the ICASSP 2023 Deep Noise Suppression (DNS) Challenge. We expand our previous work, TEA-PSE, to its upgraded version -- TEA-PSE 3.0. Specifically, TEA-PSE 3.0 incorporates a residual LSTM after squeezed temporal convolution network (S-TCN) to enhance sequence modeling capabilities. Additionally, the local-global representation (LGR) struct…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  8. arXiv:2303.00264  [pdf, other]

    cs.SD eess.AS

    Distance-based Weight Transfer from Near-field to Far-field Speaker Verification

    Authors: Li Zhang, Qing Wang, Hongji Wang, Yue Li, Wei Rao, Yannan Wang, Lei Xie

    Abstract: The scarcity of labeled far-field speech is a constraint for training superior far-field speaker verification systems. Fine-tuning the model pre-trained on large-scale near-field speech substantially outperforms training from scratch. However, the fine-tuning method suffers from two limitations--catastrophic forgetting and overfitting. In this paper, we propose a weight transfer regularization (WTR…

    Submitted 14 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

  9. arXiv:2211.05432  [pdf, other]

    cs.SD eess.AS

    Speech Enhancement with Fullband-Subband Cross-Attention Network

    Authors: Jun Chen, Wei Rao, Zilin Wang, Zhiyong Wu, Yannan Wang, Tao Yu, Shidong Shang, Helen Meng

    Abstract: FullSubNet has shown its promising performance on speech enhancement by utilizing both fullband and subband information. However, the relationship between fullband and subband in FullSubNet is achieved by simply concatenating the output of fullband model and subband units. It only supplements the subband units with a small quantity of global information and has not considered the interaction betwe…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Accepted by InterSpeech 2022. arXiv admin note: text overlap with arXiv:2203.12188

  10. arXiv:2210.15853  [pdf, other]

    cs.SD eess.AS

    Speech Enhancement with Intelligent Neural Homomorphic Synthesis

    Authors: Shulin He, Wei Rao, Jinjiang Liu, Jun Chen, Yukai Ju, Xueliang Zhang, Yannan Wang, Shidong Shang

    Abstract: Most neural network speech enhancement models ignore speech production mathematical models by directly mapping Fourier transform spectrums or waveforms. In this work, we propose a neural source filter network for speech enhancement. Specifically, we use homomorphic signal processing and cepstral analysis to obtain noisy speech's excitation and vocal tract. Unlike traditional signal processing, we…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  11. arXiv:2210.15849  [pdf, ps, other]

    cs.SD eess.AS

    Hierarchical speaker representation for target speaker extraction

    Authors: Shulin He, Huaiwen Zhang, Wei Rao, Kanghao Zhang, Yukai Ju, Yang Yang, Xueliang Zhang

    Abstract: Target speaker extraction aims to isolate a specific speaker's voice from a composite of multiple sound sources, guided by an enrollment utterance or called anchor. Current methods predominantly derive speaker embeddings from the anchor and integrate them into the separation network to separate the voice of the target speaker. However, the representation of the speaker embedding is too simplistic,…

    Submitted 4 January, 2024; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted to ICASSP 2024

  12. arXiv:2210.08802  [pdf, other]

    eess.AS cs.SD

    Spatial-DCCRN: DCCRN equipped with frame-level angle feature and hybrid filtering for multi-channel speech enhancement

    Authors: Shubo Lv, Yihui Fu, Yukai Jv, Lei Xie, Weixin Zhu, Wei Rao, Yannan Wang

    Abstract: Recently, multi-channel speech enhancement has drawn much interest due to the use of spatial information to distinguish target speech from interfering signal. To make full use of spatial information and neural network based masking estimation, we propose a multi-channel denoising neural network -- Spatial DCCRN. Firstly, we extend S-DCCRN to multi-channel scenario, aiming at performing cascaded su…

    Submitted 17 October, 2022; originally announced October 2022.

  13. arXiv:2106.03113  [pdf, other]

    eess.AS

    Improving Channel Decorrelation for Multi-Channel Target Speech Extraction

    Authors: Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long

    Abstract: Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful in extracting the target speech. We have recently proposed a channel decorrelation (CD) mechanism to extract the inter-channel differential information to enhance the reference channel encoder representation. Although the proposed mechanism has shown…

    Submitted 6 June, 2021; originally announced June 2021.

    Comments: accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2010.09191

  14. arXiv:2104.00960  [pdf, other]

    eess.AS cs.SD

    INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing

    Authors: Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yvkai Jv, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu, Shidong Shang

    Abstract: The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing. The challenge consists of two separate tasks: 1) Task 1 is multi-channel speech enhancement with single microphone array and focusing on practical application with real-time requirement and 2) Task 2 is multi-channel speech enhancement with multiple distribu…

    Submitted 2 April, 2021; originally announced April 2021.

    Comments: 5 pages, submitted to INTERSPEECH 2021

  15. arXiv:2103.16269  [pdf, other]

    eess.AS cs.HC cs.SD

    Target Speaker Verification with Selective Auditory Attention for Single and Multi-talker Speech

    Authors: Chenglin Xu, Wei Rao, Jibin Wu, Haizhou Li

    Abstract: Speaker verification has been studied mostly under the single-talker condition. It is adversely affected in the presence of interference speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech, that is able to pay selective auditory attention to the target speaker. This target speaker ver…

    Submitted 2 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: 13 pages, submitted to IEEE/ACM transaction on Audio, Speech and Language on 10 Jan. 2021

  16. arXiv:2011.08623  [pdf, other]

    cs.SD eess.AS

    Adversarial Training for Multi-domain Speaker Recognition

    Authors: Qing Wang, Wei Rao, Pengcheng Guo, Lei Xie

    Abstract: In real-life applications, the performance of speaker recognition systems always degrades when there is a mismatch between training and evaluation data. Many domain adaptation methods have been successfully used for eliminating the domain mismatches in speaker recognition. However, usually both training and evaluation data themselves can be composed of several subsets. These inner variances of eac…

    Submitted 17 November, 2020; originally announced November 2020.

    Comments: 5 pages, 2 figures

  17. arXiv:2010.10923  [pdf, other]

    eess.AS cs.SD

    Attention-based scaling adaptation for target speech extraction

    Authors: Jiangyu Han, Wei Rao, Yanhua Long, Jiaen Liang

    Abstract: The target speech extraction has attracted widespread attention in recent years. In this work, we focus on investigating the dynamic interaction between different mixtures and the target speaker to exploit the discriminative target speaker clues. We propose a special attention mechanism without introducing any additional parameters in a scaling adaptation layer to better adapt the network towards…

    Submitted 18 October, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures. Accepted by ASRU 2021

  18. arXiv:2010.03905  [pdf, other]

    eess.AS cs.SD

    HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

    Authors: Rohan Kumar Das, Ruijie Tao, Jichen Yang, Wei Rao, Cheng Yu, Haizhou Li

    Abstract: This work describes the speaker verification system developed by Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). The multimedia research has gained attention to a wide range of applications and speaker recognition is no exception to it. In contrast to the previous NIST SREs, the latest edition focuses o…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in APSIPA ASC 2020

  19. arXiv:2005.08046  [pdf, other]

    eess.AS cs.SD

    The INTERSPEECH 2020 Far-Field Speaker Verification Challenge

    Authors: Xiaoyi Qin, Ming Li, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan, Haizhou Li

    Abstract: The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent speaker verification from single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to INTERSPEECH 2020

  20. Time-domain speaker extraction network

    Authors: Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

    Abstract: Speaker extraction is to extract a target speaker's voice from multi-talker speech. It simulates humans' cocktail party effect or the selective listening ability. The prior work mostly performs speaker extraction in frequency domain, then reconstructs the signal with some phase approximation. The inaccuracy of phase estimation is inherent to the frequency domain processing, that affects the qualit…

    Submitted 29 April, 2020; originally announced April 2020.

    Comments: Published in ASRU 2019. arXiv admin note: text overlap with arXiv:2004.08326

  21. arXiv:2004.08326  [pdf, other]

    eess.AS cs.CL cs.SD

    SpEx: Multi-Scale Time Domain Speaker Extraction Network

    Authors: Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

    Abstract: Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in frequency-domain, and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. However, such an approach is adversely affected by the inherent difficulty of phase estimation. Inspi…

    Submitted 17 April, 2020; originally announced April 2020.

    Comments: ACCEPTED in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020

  22. arXiv:2002.00387  [pdf, other]

    cs.SD eess.AS

    The FFSVC 2020 Evaluation Plan

    Authors: Xiaoyi Qin, Ming Li, Hui Bu, Rohan Kumar Das, Wei Rao, Shrikanth Narayanan, Haizhou Li

    Abstract: The Far-Field Speaker Verification Challenge 2020 (FFSVC20) is designed to boost the speaker verification research with special focus on far-field distributed microphone arrays under noisy conditions in real scenarios. The objectives of this challenge are to: 1) benchmark the current speech verification technology under this challenging condition, 2) promote the development of new ideas and techno…

    Submitted 4 February, 2020; v1 submitted 2 February, 2020; originally announced February 2020.

  23. arXiv:1904.07386  [pdf, other]

    eess.AS cs.CL cs.SD

    I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

    Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, et al. (21 additional authors not shown)

    Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: 5 pages

  24. arXiv:1903.09952  [pdf, other]

    eess.AS cs.CL cs.SD

    Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

    Authors: Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

    Abstract: The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It attempts to overcome the problem of unknown number of speakers in an audio recording during source separation. The mask approximation loss of SBF is sub-optimal, which doesn't calculate direct signal reconstruction error and consider the speech context. To address these problems, this paper proposes a magnitude and temporal spe…

    Submitted 24 March, 2019; originally announced March 2019.

    Comments: Accepted in ICASSP 2019

  25. arXiv:1902.02546  [pdf, other]

    eess.AS cs.SD

    Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

    Authors: Wei Rao, Chenglin Xu, Eng Siong Chng, Haizhou Li

    Abstract: The performance of speaker verification degrades significantly when the test speech is corrupted by interference speakers. Speaker diarization does well to separate speakers if the speakers are temporally overlapped. However, if multi-talkers speak at the same time, we need the technique to separate the speech in the spectral domain. This paper proposes an overlapped multi-talker speaker verificat…

    Submitted 7 February, 2019; originally announced February 2019.

    Comments: 5 pages, 3 figures. This paper is submitted to Interspeech 2019