Showing 1–50 of 54 results for author: Weng, C

Searching in archive eess.
  1. arXiv:2405.11975  [pdf, other]

    eess.SY

    A Stochastic Sampling Approach to Privacy

    Authors: Chuanghong Weng, Ehsan Nekouei

    Abstract: This paper proposes an optimal stochastic sampling approach to privacy, in which a sensor observes a process which is correlated to private information, and a sampler decides to keep or discard the sensor's observations. The kept samples are shared with an adversary who might attempt to infer the private process. The privacy leakages are captured with the mutual information between the private pro…

    Submitted 22 May, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

  2. arXiv:2404.04947  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Gull: A Generative Multifunctional Audio Codec

    Authors: Yi Luo, Jianwei Yu, Hangting Chen, Rongzhi Gu, Chao Weng

    Abstract: We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recen…

    Submitted 7 June, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

    Comments: Demo page: https://yluo42.github.io/Gull/

  3. arXiv:2404.01784  [pdf, other]

    cs.IT eess.SP

    Learning-Based Joint Beamforming and Antenna Movement Design for Movable Antenna Systems

    Authors: Caihao Weng, Yuanbin Chen, Lipeng Zhu, Ying Wang

    Abstract: In this paper, we investigate a multi-receiver communication system enabled by movable antennas (MAs). Specifically, the transmit beamforming and the double-side antenna movement at the transceiver are jointly designed to maximize the sum-rate of all receivers under imperfect channel state information (CSI). Since the formulated problem is non-convex with highly coupled variables, conventional opt…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 13 pages, 5 figures

  4. arXiv:2312.15463  [pdf, other]

    eess.AS cs.SD

    Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: Query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain the query embedding. In this way, the separation model is optimized to adapt to the distribution of the query embedding. However, query embedding may exhibit mismatches with separation models due to in…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  5. arXiv:2311.05896  [pdf, other]

    eess.SY

    Optimal Privacy-Aware Dynamic Estimation

    Authors: Chuanghong Weng, Ehsan Nekouei, Karl H. Johansson

    Abstract: In this paper, we develop an information-theoretic framework for the optimal privacy-aware estimation of the states of a (linear or nonlinear) system. In our setup, a private process, modeled as a first-order Markov chain, derives the states of the system, and the state estimates are shared with an untrusted party who might attempt to infer the private process based on the state estimates. As the…

    Submitted 10 November, 2023; originally announced November 2023.

  6. arXiv:2309.12792  [pdf, other]

    eess.AS cs.SD

    DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

    Authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su

    Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed Du…

    Submitted 22 September, 2023; originally announced September 2023.

  7. arXiv:2309.07803  [pdf, other]

    eess.AS cs.SD

    SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

    Authors: Sipan Li, Songxiang Liu, Luwen Zhang, Xiang Li, Yanyao Bian, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pi…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICME 2023

  8. arXiv:2309.07757  [pdf, other]

    eess.AS cs.SD

    Complexity Scaling for Speech Denoising

    Authors: Hangting Chen, Jianwei Yu, Chao Weng

    Abstract: Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP2024

  9. arXiv:2308.14553  [pdf, other]

    eess.AS cs.SD

    Rep2wav: Noise Robust text-to-speech Using self-supervised representations

    Authors: Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, Lirong Dai, Jie Zhang

    Abstract: Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background…

    Submitted 3 September, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 5 pages, 2 figures

  10. Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression

    Authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng

    Abstract: Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to r…

    Submitted 10 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

    Comments: Proceedings of INTERSPEECH

  11. arXiv:2305.19269  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Make-A-Voice: Unified Voice Synthesis With Discrete Representation

    Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

    Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speak…

    Submitted 30 May, 2023; originally announced May 2023.

  12. arXiv:2305.16749  [pdf, other]

    cs.SD eess.AS

    Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

    Authors: Xiang Li, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu, Chao Weng, Helen Meng

    Abstract: Expressive human speech generally abounds with rich and flexible speech prosody variations. The speech prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of prosody prediction error. Its unimodal nature leads to a mismatch with ground truth distribution and harms the model's ability in makin…

    Submitted 7 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715), demo site at https://thuhcsi.github.io/interspeech2023-DiffVar/

  13. arXiv:2305.13957  [pdf, other]

    eess.AS

    Eeg2vec: Self-Supervised Electroencephalographic Representation Learning

    Authors: Qiushi Zhu, Xiaoying Zhao, Jie Zhang, Yu Gu, Chao Weng, Yuchen Hu

    Abstract: Recently, many efforts have been made to explore how the brain processes speech using electroencephalographic (EEG) signals, where deep learning-based approaches were shown to be applicable in this field. In order to decode speech signals from EEG signals, linear networks, convolutional neural networks (CNN) and long short-term memory networks are often used in a supervised manner. Recording EEG-s…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages

  14. arXiv:2305.02765  [pdf, other]

    cs.SD eess.AS

    HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, Yuexian Zou

    Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encode…

    Submitted 7 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: The second version of HiFi-Codec

  15. arXiv:2303.15706  [pdf, other]

    eess.SY

    Minimization of Sensor Activation in Discrete-Event Systems with Control Delays and Observation Delays

    Authors: Yunfeng Hou, Ching-Yen Weng, Peng Li

    Abstract: In discrete-event systems, to save sensor resources, the agent continuously adjusts sensor activation decisions according to a sensor activation policy based on the changing observations. However, new challenges arise for sensor activations in networked discrete-event systems, where observation delays and control delays exist between the sensor systems and the agent. In this paper, a new framework…

    Submitted 5 April, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

  16. arXiv:2301.13662  [pdf, other]

    cs.SD eess.AS

    InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng

    Abstract: Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according to human demands. Nowadays, there are two common ways to control speaking styles: (1) Pre-defining a group of speaking styles and using a categorical index to denote different speaking styles. However, there are limitations in the diversity of expressiveness, as these models can only generate the pre-defined…

    Submitted 25 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: Submitted to TASLP

  17. arXiv:2212.00406  [pdf, other]

    eess.AS

    High Fidelity Speech Enhancement with Band-split RNN

    Authors: Jianwei Yu, Yi Luo, Hangting Chen, Rongzhi Gu, Chao Weng

    Abstract: Despite the rapid progress in speech enhancement (SE) research, enhancing the quality of desired speech in environments with strong noise and interfering speakers remains challenging. In this paper, we extend the application of the recently proposed band-split RNN (BSRNN) model to full-band SE and personalized SE (PSE) tasks. To mitigate the effects of unstable high-frequency components in full-ba…

    Submitted 6 June, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

  18. arXiv:2211.02448  [pdf, other]

    cs.SD eess.AS

    NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

    Authors: Dongchao Yang, Songxiang Liu, Jianwei Yu, Helin Wang, Chao Weng, Yuexian Zou

    Abstract: Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, which faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise. (2) The TTS systems should have good generalization for unseen speaking styles. In t…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP2023

  19. arXiv:2210.07499  [pdf, other]

    cs.CL cs.SD eess.AS

    Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

    Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target…

    Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  20. arXiv:2210.05092  [pdf, other]

    cs.SD eess.AS

    The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022

    Authors: Xiaoyi Qin, Na Li, Yuke Lin, Yiwei Ding, Chao Weng, Dan Su, Ming Li

    Abstract: This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focuses on the cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate scores. The magnitude-based qualit…

    Submitted 10 October, 2022; originally announced October 2022.

  21. arXiv:2207.09983  [pdf, other]

    cs.SD cs.AI eess.AS

    Diffsound: Discrete Diffusion Model for Text-to-sound Generation

    Authors: Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses t…

    Submitted 28 April, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted by TASLP2022

  22. arXiv:2207.05929  [pdf, other]

    eess.AS cs.SD

    Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation (AISR) learning method. Since VoxCeleb is collected from the YouTube platform, t…

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech2022

  23. arXiv:2204.00821  [pdf, other]

    cs.SD eess.AS

    Improving Target Sound Extraction with Timestamp Information

    Authors: Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

    Abstract: Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events. Previous works mainly focus on the problems of weakly-labelled data, joint learning, and new classes; however, no one cares about the onset and offset times of the target sound event, which has been emphasized in the auditory scene analysis. In this pap…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: submitted to interspeech2022

  24. arXiv:2203.15614  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Lattice-Free MMI into End-to-End Speech Recognition

    Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.…

    Submitted 22 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  25. arXiv:2202.01986  [pdf, other]

    eess.AS cs.SD

    The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

    Authors: Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng

    Abstract: This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for d…

    Submitted 4 February, 2022; originally announced February 2022.

    Comments: submitted to ICASSP2022

  26. arXiv:2201.04800  [pdf, other]

    eess.SY

    Online State Estimation for Supervisor Synthesis in Discrete-Event Systems with Communication Delays and Losses

    Authors: Yunfeng Hou, Yunfeng Ji, Gang Wang, Ching-Yen Weng, Qingdu Li

    Abstract: In the context of networked discrete-event systems (DESs), communication delays and losses exist between the plant and the supervisor for observation and between the supervisor and the actuator for control. In this paper, we first introduce a new framework for supervisory control of networked DESs. Under the introduced framework, we address the state estimation problem for supervisor synthesis of…

    Submitted 6 October, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

  27. arXiv:2201.01995  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

    Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into the decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level (e.g., characters, BPE) LMs are usually used to cooperate with current E2E ASR systems.…

    Submitted 6 January, 2022; originally announced January 2022.

    Comments: 5 pages, 1 figure

  28. arXiv:2111.15016  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

    Authors: Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

    Abstract: Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint m…

    Submitted 29 November, 2021; originally announced November 2021.

  29. arXiv:2110.06534  [pdf, other]

    cs.SD eess.AS

    Simple Attention Module based Speaker Verification with Iterative noisy label detection

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Recently, attention mechanisms such as the squeeze-and-excitation module (SE) and the convolutional block attention module (CBAM) have achieved great success in deep learning-based speaker verification systems. This paper introduces an alternative effective yet simple one, i.e., simple attention module (SimAM), for speaker verification. The SimAM module is a plug-and-play module without extra modal param…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: submitted to ICASSP2022

  30. arXiv:2110.00265  [pdf, other]

    eess.SY

    A New Approach for Verification of Delay Coobservability of Discrete-Event Systems

    Authors: Yunfeng Hou, Qingdu Li, Yunfeng Ji, Gang Wang, Ching-Yen Weng

    Abstract: In decentralized networked supervisory control of discrete-event systems (DESs), the local supervisors observe event occurrences subject to observation delays to make correct control decisions. Delay coobservability describes whether these local supervisors can make sufficient observations. In this paper, we provide an efficient way to verify delay coobservability. For each controllable event, we…

    Submitted 19 May, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

  31. arXiv:2106.06909  [pdf, other]

    cs.SD cs.CL eess.AS

    GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

    Authors: Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

    Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous sp…

    Submitted 13 June, 2021; originally announced June 2021.

  32. arXiv:2106.06233  [pdf, other]

    cs.SD cs.CL eess.AS

    Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

    Authors: Jingbei Li, Yi Meng, Chenyi Li, Zhiyong Wu, Helen Meng, Chao Weng, Dan Su

    Abstract: Compared with traditional text-to-speech (TTS) systems, conversational TTS systems are required to synthesize speech with a proper speaking style conforming to the conversational context. However, state-of-the-art context modeling methods in conversational TTS only model the textual information in context with a recurrent neural network (RNN). Such methods have limited ability in modeling the int…

    Submitted 31 March, 2022; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted by ICASSP 2022

  33. arXiv:2106.04275  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition

    Authors: Max W. Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu

    Abstract: End-to-end speech recognition generally uses hand-engineered acoustic features as input and excludes the feature extraction module from its joint optimization. To extract learnable and adaptive features and mitigate information loss, we propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input. We observe improved ASR performanc…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted in Interspeech 2021

  34. arXiv:2103.16849  [pdf, other]

    eess.AS cs.SD

    TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

    Authors: Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu

    Abstract: In this paper, we exploit the effective way to leverage contextual information to improve the speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Tempo…

    Submitted 26 August, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

    Comments: Submitted to Interspeech 2021

  35. arXiv:2103.08781  [pdf, other]

    eess.AS

    Towards Robust Speaker Verification with Target Speaker Enhancement

    Authors: Chunlei Zhang, Meng Yu, Chao Weng, Dong Yu

    Abstract: This paper proposes the target speaker enhancement based speaker verification network (TASE-SVNet), an all neural model that couples target speaker enhancement and speaker embedding extraction for robust speaker verification (SV). Specifically, an enrollment speaker conditioned speech enhancement module is employed as the front-end for extracting target speaker from its mixture with interfering sp…

    Submitted 15 March, 2021; originally announced March 2021.

    Comments: Accepted by IEEE ICASSP 2021

  36. arXiv:2102.07955  [pdf, other]

    eess.AS cs.SD

    Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition

    Authors: Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

    Abstract: Multi-source localization is an important and challenging technique for multi-talker conversation analysis. This paper proposes a novel supervised learning method using deep neural networks to estimate the direction of arrival (DOA) of all the speakers simultaneously from the audio mixture. At the heart of the proposal is a source splitting mechanism that creates source-specific intermediate repre…

    Submitted 28 November, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: Submitted to Computer Speech & Language

  37. arXiv:2102.06431  [pdf, other]

    cs.SD cs.CL eess.AS

    VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

    Authors: Peng Liu, Yuewen Cao, Songxiang Liu, Na Hu, Guangzhi Li, Chao Weng, Dan Su

    Abstract: This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with Residual Attention mechanism, which refines the textual-to-acoustic alignment layer-wisely. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for residual attention module. By leveraging the coarse global al…

    Submitted 12 February, 2021; originally announced February 2021.

  38. arXiv:2012.07178  [pdf, other]

    eess.AS cs.LG

    Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning

    Authors: Wei Xia, Chunlei Zhang, Chao Weng, Meng Yu, Dong Yu

    Abstract: In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrasti…

    Submitted 14 February, 2021; v1 submitted 13 December, 2020; originally announced December 2020.

    Comments: Accepted to ICASSP2021

  39. arXiv:2011.13393  [pdf, other]

    cs.SD eess.AS

    Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

    Authors: Jiatong Shi, Chunlei Zhang, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

    Abstract: Target-speaker speech recognition aims to recognize target-speaker speech from noisy environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction and Recurrent Neural Network Transducer (RNN-T). To stabilize the joint-training, we propose a multi-stage training strategy that pre-trains and fine-tunes ea…

    Submitted 26 February, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

    Comments: Accepted by ICASSP2021

  40. arXiv:2011.00091  [pdf, other]

    eess.AS cs.CL cs.SD

    Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization

    Authors: Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Yong Xu, Shi-Xiong Zhang, Dong Yu

    Abstract: This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR), which explicitly models source speaker locations. In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn de…

    Submitted 30 October, 2020; originally announced November 2020.

    Comments: submitted to ICASSP 2021

  41. arXiv:2010.15025  [pdf, other]

    cs.SD cs.CL eess.AS

    Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input

    Authors: Xingchen Song, Zhiyong Wu, Yiheng Huang, Chao Weng, Dan Su, Helen Meng

    Abstract: Non-autoregressive (NAR) transformer models have achieved significant inference speedup but at the cost of inferior accuracy compared to autoregressive (AR) models in automatic speech recognition (ASR). Most of the NAR transformers take a fixed-length sequence filled with MASK tokens or a redundant sequence copied from encoder states as decoder input; they cannot provide efficient target-side in…

    Submitted 15 April, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021, final version

    ACM Class: I.2.7

  42. arXiv:2010.15006  [pdf, other]

    eess.AS cs.AI

    Replay and Synthetic Speech Detection with Res2net Architecture

    Authors: Xu Li, Na Li, Chao Weng, Xunying Liu, Dan Su, Dong Yu, Helen Meng

    Abstract: Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, so-called Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into mult…

    Submitted 13 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP2021

  43. arXiv:2008.03029  [pdf, other]

    eess.AS cs.CL cs.SD

    Peking Opera Synthesis via Duration Informed Attention Network

    Authors: Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu

    Abstract: Peking Opera has been the most dominant form of Chinese performing art for around 200 years. A Peking Opera singer usually exhibits a very strong personal style by introducing improvisation and expressiveness on stage, which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in Peking Opera singing voic…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Accepted by INTERSPEECH 2020

  44. arXiv:2008.03009  [pdf, other]

    eess.AS cs.SD

    DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System

    Authors: Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zi** Li, Dong Yu

    Abstract: Singing voice conversion is converting the timbre in the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker is much more difficult to collect compared with normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high quality target speaker's singing using only his/he…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Accepted by Interspeech 2020

  45. arXiv:2007.01566  [pdf, other]

    eess.AS

    Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

    Authors: Bo Wu, Meng Yu, Lianwu Chen, Yong Xu, Chao Weng, Dan Su, Dong Yu

    Abstract: Speech enhancement techniques based on deep learning have brought significant improvements in speech quality and intelligibility. Nevertheless, a large gain in speech quality measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance due to speech distortion in the enhancement stage. In this paper, a mu…

    Submitted 3 July, 2020; originally announced July 2020.

  46. arXiv:2005.03889  [pdf, other]

    eess.AS cs.SD

    Neural Spatio-Temporal Beamformer for Target Speech Separation

    Authors: Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu

    Abstract: Purely neural network (NN) based speech separation and enhancement methods, although they can achieve good objective scores, inevitably cause nonlinear speech distortions that are harmful for automatic speech recognition (ASR). On the other hand, the minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, although it can significantly reduce speech distortions, has limited…

    Submitted 31 July, 2020; v1 submitted 8 May, 2020; originally announced May 2020.

    Comments: accepted to Interspeech2020, Demo: https://yongxuustc.github.io/mtmvdr/

  47. arXiv:1912.10128  [pdf, other]

    cs.SD cs.CL eess.AS

    Learning Singing From Speech

    Authors: Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Yusong Wu, Xiang Xie, Zi** Li, Dong Yu

    Abstract: We propose an algorithm that is capable of synthesizing a high quality target speaker's singing voice given only their normal speech samples. The proposed algorithm first integrates speech and singing synthesis into a unified framework, and learns universal speaker embeddings that are shareable between speech and singing synthesis tasks. Specifically, the speaker embeddings learned from normal speech…

    Submitted 20 December, 2019; originally announced December 2019.

    Comments: Submitted to ICASSP-2020

  48. arXiv:1912.01852  [pdf, other]

    cs.SD cs.CL eess.AS

    PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network

    Authors: Chengqi Deng, Chengzhu Yu, Heng Lu, Chao Weng, Dong Yu

    Abstract: Singing voice conversion converts one singer's voice to another's without changing the singing content. Recent work shows that unsupervised singing voice conversion can be achieved with an autoencoder-based approach [1]. However, the converted singing voice can easily be out of key, showing that the existing approach cannot model the pitch information precisely. In this paper, we propose…

    Submitted 18 February, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

    Comments: Accepted by ICASSP 2020

  49. arXiv:1911.12487  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition

    Authors: Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu

    Abstract: In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for end-to-end speech recognition. Specifically, initialized with a trained RNN-T model, MBR training is conducted via minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypotheses. We also introduce a heuristic to incorporate an external neural network langu…

    Submitted 27 November, 2019; originally announced November 2019.

  50. arXiv:1910.13825  [pdf, ps, other]

    eess.AS

    Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

    Authors: Bo Wu, Meng Yu, Lianwu Chen, Chao Weng, Dan Su, Dong Yu

    Abstract: We propose an end-to-end joint optimization framework of a multi-channel neural speech extraction and deep acoustic model without mel-filterbank (FBANK) extraction for overlapped speech recognition. First, based on a multi-channel convolutional TasNet with STFT kernel, we unify the multi-channel target speech enhancement front-end network and a convolutional, long short-term memory and fully conne…

    Submitted 30 October, 2019; originally announced October 2019.