Showing 1–50 of 68 results for author: Su, D

  1. arXiv:2406.04350  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Prompt-guided Precise Audio Editing with Diffusion Models

    Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

    Abstract: Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusi…

    Submitted 11 May, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024

  2. arXiv:2406.00976  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

    Authors: Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

    Abstract: While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio wavef…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accept in ACL2024-main

  3. arXiv:2310.10992  [pdf, other]

    cs.SD eess.AS

    A High Fidelity and Low Complexity Neural Audio Coding

    Authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

    Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide…

    Submitted 17 October, 2023; originally announced October 2023.

  4. arXiv:2309.12792  [pdf, other]

    eess.AS cs.SD

    DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

    Authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su

    Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed Du…

    Submitted 22 September, 2023; originally announced September 2023.

  5. Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

    Authors: Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

    Abstract: Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech representation and text representation is inconsistent. Although the previous method up-samples the text representation to align with acoustic modality, it may not…

    Submitted 7 October, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Proceedings of Interspeech. arXiv admin note: text overlap with arXiv:2309.01437

  6. arXiv:2301.00656  [pdf, other]

    eess.AS cs.CL cs.LG

    TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR

    Authors: Lixin Cao, Jun Wang, Ben Yang, Dan Su, Dong Yu

    Abstract: Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. TriNet learns the SSL latent embedding space and incorporates it to a higher level space for predicting pseudo target vectors generated by a frozen te…

    Submitted 14 March, 2023; v1 submitted 12 December, 2022; originally announced January 2023.

    Comments: Accepted by ICASSP 2023

  7. arXiv:2212.01546  [pdf, other]

    cs.SD eess.AS

    UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

    Authors: Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie, Dan Su

    Abstract: Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating high-quality speaking and singing voice according to textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial to the applications requiring both of them. Existing methods usually suffer from some limitations, which rely on either both singing and speaking data from the same person or…

    Submitted 6 December, 2022; v1 submitted 3 December, 2022; originally announced December 2022.

  8. arXiv:2210.05092  [pdf, other]

    cs.SD eess.AS

    The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022

    Authors: Xiaoyi Qin, Na Li, Yuke Lin, Yiwei Ding, Chao Weng, Dan Su, Ming Li

    Abstract: This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focus on the cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate score. The magnitude-based qualit…

    Submitted 10 October, 2022; originally announced October 2022.

  9. arXiv:2207.05929  [pdf, other]

    eess.AS cs.SD

    Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation(AISR) learning method. Since the VoxCeleb is collected from the YouTube platform, t…

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech2022

  10. arXiv:2207.01832  [pdf, other]

    cs.SD eess.AS

    Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

    Authors: Yi Lei, Shan Yang, Jian Cong, Lei Xie, Dan Su

    Abstract: The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in zero-shot scenario exist in both stages -- acoustic modeling and vocoder, previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming…

    Submitted 5 July, 2022; originally announced July 2022.

  11. arXiv:2207.00756  [pdf, other]

    cs.SD eess.AS

    Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

    Authors: Liumeng Xue, Shan Yang, Na Hu, Dan Su, Lei Xie

    Abstract: Building a voice conversion system for noisy target speakers, such as users providing noisy samples or Internet found data, is a challenging task since the use of contaminated speech in model training will apparently degrade the conversion performance. In this paper, we leverage the advances of our recently proposed Glow-WaveGAN and propose a noise-independent speech representation learning approa…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Accepted by INTERSPEECH 2022

  12. arXiv:2206.07569  [pdf, other]

    eess.AS cs.SD

    End-to-End Voice Conversion with Information Perturbation

    Authors: Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su

    Abstract: The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is…

    Submitted 15 June, 2022; originally announced June 2022.

  13. arXiv:2206.00208  [pdf, other]

    cs.SD eess.AS

    AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

    Authors: Kun Song, Heyang Xue, Xinsheng Wang, Jian Cong, Yongmao Zhang, Lei Xie, Bing Yang, Xiong Zhang, Dan Su

    Abstract: Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been conducted towards this task, seldom work has been performed for low computational resource scenarios due to the challenges raised by the requirement of the lightweight model and less computational complexity. In this paper, a tiny…

    Submitted 2 November, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

    Comments: Accepted by ISCSLP 2022

  14. arXiv:2204.09934  [pdf, other]

    eess.AS cs.LG cs.SD

    FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

    Authors: Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

    Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hindered their applications to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of div…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: Accepted by IJCAI 2022

  15. arXiv:2204.03178  [pdf, other]

    cs.SD cs.CL eess.AS

    3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

    Authors: Zhao You, Shulin Feng, Dan Su, Dong Yu

    Abstract: Recently, Conformer based CTC/AED model has become a mainstream architecture for ASR. In this paper, based on our prior work, we identify and integrate several approaches to achieve further improvements for ASR tasks, which we denote as multi-loss, multi-path and multi-level, summarized as "3M" model. Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-E…

    Submitted 14 April, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: 5 pages, 1 figure. Submitted to INTERSPEECH 2022

  16. arXiv:2204.00990  [pdf, other]

    cs.SD eess.AS

    Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

    Authors: Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

    Abstract: Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and parameters. Previous researches usually use a speaker encoder to extract a global fixed speaker embedding from reference speech, and several attempts have tried variable-length speaker embedding. However, they neglect to transfer the personal pronunciation characteristics related to phoneme content…

    Submitted 11 November, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

  17. arXiv:2203.13508  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surro…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: Accepted in ICLR 2022. arXiv admin note: text overlap with arXiv:2108.11514

    Journal ref: International Conference on Learning Representations 2022

  18. arXiv:2202.09081  [pdf, other]

    eess.AS cs.AI cs.CV cs.MM cs.SD eess.IV

    VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

    Authors: Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

    Abstract: Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quanti…

    Submitted 18 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. Demo page is available at https://wendison.github.io/VCVTS-demo/

  19. arXiv:2202.01986  [pdf, other]

    eess.AS cs.SD

    The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

    Authors: Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng

    Abstract: This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for d…

    Submitted 4 February, 2022; originally announced February 2022.

    Comments: submitted to ICASSP2022

  20. arXiv:2201.11972  [pdf, other]

    eess.AS cs.CL cs.SD

    DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

    Authors: Songxiang Liu, Dan Su, Dong Yu

    Abstract: Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speec…

    Submitted 28 January, 2022; originally announced January 2022.

    Comments: Preprint. 16 pages

  21. arXiv:2111.11831  [pdf, other]

    eess.AS cs.CL cs.SD

    SpeechMoE2: Mixture-of-Experts Model with Improved Routing

    Authors: Zhao You, Shulin Feng, Dan Su, Dong Yu

    Abstract: Mixture-of-experts based acoustic models with dynamic routing mechanisms have proved promising results for speech recognition. The design principle of router architecture is important for the large model capacity and high computational efficiency. Our previous work SpeechMoE only uses local grapheme embedding to help routers to make route decisions. To further improve speech recognition performanc…

    Submitted 23 November, 2021; originally announced November 2021.

    Comments: 5 pages, 1 figure. Submitted to ICASSP 2022

  22. arXiv:2111.07218  [pdf, other]

    eess.AS cs.CL cs.SD

    Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning

    Authors: Songxiang Liu, Dan Su, Dong Yu

    Abstract: The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring speaking styles of an arbitrary source speaker to a target speaker's voice using very limited amount of neutral data. This is a very challenging task since the learning algorithm needs to deal with few-shot voice cloning and speaker-prosody disentanglement at the same time. Accelerating the…

    Submitted 13 November, 2021; originally announced November 2021.

    Comments: Pre-print technical report, 6 pages, 6 figures

  23. arXiv:2110.06534  [pdf, other]

    cs.SD eess.AS

    Simple Attention Module based Speaker Verification with Iterative noisy label detection

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Recently, the attention mechanism such as squeeze-and-excitation module (SE) and convolutional block attention module (CBAM) has achieved great success in deep learning-based speaker verification system. This paper introduces an alternative effective yet simple one, i.e., simple attention module (SimAM), for speaker verification. The SimAM module is a plug-and-play module without extra modal param…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: submitted to ICASSP2022

  24. arXiv:2109.06274  [pdf, other]

    eess.IV cs.CV

    Cross-Modality Domain Adaptation for Vestibular Schwannoma and Cochlea Segmentation

    Authors: Han Liu, Yubo Fan, Can Cui, Dingjie Su, Andrew McNeil, Benoit M. Dawant

    Abstract: Automatic methods to segment the vestibular schwannoma (VS) tumors and the cochlea from magnetic resonance imaging (MRI) are critical to VS treatment planning. Although supervised methods have achieved satisfactory performance in VS segmentation, they require full annotations by experts, which is laborious and time-consuming. In this work, we aim to tackle the VS and cochlea segmentation problem i…

    Submitted 8 November, 2021; v1 submitted 13 September, 2021; originally announced September 2021.

  25. arXiv:2109.03439  [pdf, other]

    eess.AS cs.CL cs.SD

    Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

    Authors: Songxiang Liu, Shan Yang, Dan Su, Dong Yu

    Abstract: Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying desired speaking style during training and require a reference utterance to obtain speaking style descriptors as conditioning on the generation of a new sentence.…

    Submitted 8 September, 2021; originally announced September 2021.

    Comments: 7 pages, preprint

  26. arXiv:2108.11514  [pdf, other]

    cs.LG cs.AI cs.SD eess.AS eess.SP

    Bilateral Denoising Diffusion Models

    Authors: Max W. Y. Lam, Jun Wang, Rongjie Huang, Dan Su, Dong Yu

    Abstract: Denoising diffusion probabilistic models (DDPMs) have emerged as competitive generative models yet brought challenges to efficient sampling. In this paper, we propose novel bilateral denoising diffusion models (BDDMs), which take significantly fewer steps to generate high-quality samples. From a bilateral modeling objective, BDDMs parameterize the forward and reverse processes with a score network…

    Submitted 14 September, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

  27. arXiv:2107.03987  [pdf]

    eess.IV cs.CV

    Atlas-Based Segmentation of Intracochlear Anatomy in Metal Artifact Affected CT Images of the Ear with Co-trained Deep Neural Networks

    Authors: Jianing Wang, Dingjie Su, Yubo Fan, Srijata Chakravorti, Jack H. Noble, Benoit M. Dawant

    Abstract: We propose an atlas-based method to segment the intracochlear anatomy (ICA) in the post-implantation CT (Post-CT) images of cochlear implant (CI) recipients that preserves the point-to-point correspondence between the meshes in the atlas and the segmented volumes. To solve this problem, which is challenging because of the strong artifacts produced by the implant, we use a pair of co-trained deep n…

    Submitted 9 July, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: 10 pages, 5 figures

  28. arXiv:2106.10831  [pdf, other]

    eess.AS cs.SD

    Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

    Authors: Jian Cong, Shan Yang, Lei Xie, Dan Su

    Abstract: Current two-stage TTS framework typically integrates an acoustic model with a vocoder -- the acoustic model predicts a low resolution intermediate representation such as Mel-spectrum while the vocoder generates waveform from the intermediate representation. Although the intermediate representation is served as a bridge, there still exists critical mismatch between the acoustic model and the vocode…

    Submitted 21 June, 2021; v1 submitted 20 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH 2021

  29. arXiv:2106.10828  [pdf, other]

    eess.AS cs.SD

    Controllable Context-aware Conversational Speech Synthesis

    Authors: Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

    Abstract: In spoken conversations, spontaneous behaviors like filled pause and prolongations always happen. Conversational partner tends to align features of their speech with their interlocutor which is known as entrainment. To produce human-like conversations, we propose a unified controllable spontaneous conversational speech synthesis framework to model the above two phenomena. Specifically, we use expl…

    Submitted 20 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH 2021

  30. arXiv:2106.06909  [pdf, other]

    cs.SD cs.CL eess.AS

    GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

    Authors: Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

    Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous sp…

    Submitted 13 June, 2021; originally announced June 2021.

  31. arXiv:2106.06233  [pdf, other]

    cs.SD cs.CL eess.AS

    Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

    Authors: Jingbei Li, Yi Meng, Chenyi Li, Zhiyong Wu, Helen Meng, Chao Weng, Dan Su

    Abstract: Comparing with traditional text-to-speech (TTS) systems, conversational TTS systems are required to synthesize speeches with proper speaking style confirming to the conversational context. However, state-of-the-art context modeling methods in conversational TTS only model the textual information in context with a recurrent neural network (RNN). Such methods have limited ability in modeling the int…

    Submitted 31 March, 2022; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted by ICASSP 2022

  32. arXiv:2106.04275  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition

    Authors: Max W. Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu

    Abstract: End-to-end speech recognition generally uses hand-engineered acoustic features as input and excludes the feature extraction module from its joint optimization. To extract learnable and adaptive features and mitigate information loss, we propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input. We observe improved ASR performanc…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted in Interspeech 2021

  33. arXiv:2105.13871  [pdf, other]

    eess.AS cs.CL cs.SD

    DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

    Authors: Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng

    Abstract: Singing voice conversion (SVC) is one promising technique which can enrich the way of human-computer interaction by endowing a computer the ability to produce high-fidelity and expressive singing voice. In this paper, we propose DiffSVC, an SVC system based on denoising diffusion probabilistic model. DiffSVC uses phonetic posteriorgrams (PPGs) as content features. A denoising module is trained in…

    Submitted 28 May, 2021; originally announced May 2021.

    Comments: Preprint. 8 pages, 2 figures and 1 table

  34. arXiv:2105.03643  [pdf, ps, other]

    eess.AS cs.SD

    Latency-Controlled Neural Architecture Search for Streaming Speech Recognition

    Authors: Liqiang He, Shulin Feng, Dan Su, Dong Yu

    Abstract: Neural architecture search (NAS) has attracted much attention and has been explored for automatic speech recognition (ASR). In this work, we focus on streaming ASR scenarios and propose the latency-controlled NAS for acoustic modeling. First, based on the vanilla neural architecture, normal cells are altered to causal cells to control the total latency of the architecture. Second, a revised operat…

    Submitted 13 September, 2021; v1 submitted 8 May, 2021; originally announced May 2021.

    Comments: Accepted to ASRU 2021

  35. arXiv:2105.03036  [pdf, other]

    cs.SD cs.CL eess.AS

    SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

    Authors: Zhao You, Shulin Feng, Dan Su, Dong Yu

    Abstract: Recently, Mixture of Experts (MoE) based Transformer has shown promising results in many domains. This is largely due to the following advantages of this architecture: firstly, MoE based Transformer can increase model capacity without computational cost increasing both at training and inference time. Besides, MoE based Transformer is a dynamic network which can adapt to the varying complexity of i…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: 5 pages, 2 figures. Submitted to Interspeech 2021

  36. arXiv:2104.06835  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

    Authors: Yixuan Zhou, Changhe Song, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

    Abstract: Exploiting rich linguistic information in raw text is crucial for expressive text-to-speech (TTS). As large scale pre-trained text representation develops, bidirectional encoder representations from Transformers (BERT) has been proven to embody semantic information and employed to TTS recently. However, original or simply fine-tuned BERT embeddings still cannot provide sufficient semantic knowledg…

    Submitted 11 November, 2022; v1 submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted by Interspeech 2022

  37. arXiv:2103.16849  [pdf, other]

    eess.AS cs.SD

    TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

    Authors: Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu

    Abstract: In this paper, we exploit the effective way to leverage contextual information to improve the speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Tempo…

    Submitted 26 August, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

    Comments: Submitted to Interspeech 2021

  38. arXiv:2103.01461  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

    Authors: Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

    Abstract: We study the cocktail party problem and propose a novel attention network called Tune-In, abbreviated for training under negative environments with interference. It firstly learns two separate spaces of speaker-knowledge and speech-stimuli based on a shared feature space, where a new block structure is designed as the building block for all spaces, and then cooperatively solves different tasks. Be…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted in AAAI 2021

  39. arXiv:2103.00819  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: One of the leading single-channel speech separation (SS) models is based on a TasNet with a dual-path segmentation technique, where the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass…

    Submitted 8 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted in ICASSP 2021

  40. arXiv:2103.00816  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Contrastive Separative Coding for Self-supervised Representation Learning

    Authors: Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

    Abstract: To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating the target signal from contrastive interfering signals. First, a multi-task separative encoder is built to extract shared separable and discriminative embedding…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted in ICASSP 2021

  41. arXiv:2102.06431  [pdf, other]

    cs.SD cs.CL eess.AS

    VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention

    Authors: Peng Liu, Yuewen Cao, Songxiang Liu, Na Hu, Guangzhi Li, Chao Weng, Dan Su

    Abstract: This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with Residual Attention mechanism, which refines the textual-to-acoustic alignment layer-wisely. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for residual attention module. By leveraging the coarse global al…

    Submitted 12 February, 2021; originally announced February 2021.

  42. arXiv:2101.05014  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: Recent research on the time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints in industrial applications. In this regard, we design a low-cost high-performance architecture, namely, globally attentive locally recurrent (GALR) network. Alike the dual-path RNN (DPRNN), w…

    Submitted 13 January, 2021; originally announced January 2021.

    Comments: Accepted in IEEE SLT 2021

  43. arXiv:2012.01837  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training

    Authors: Haohan Guo, Heng Lu, Na Hu, Chunlei Zhang, Shan Yang, Lei Xie, Dan Su, Dong Yu

    Abstract: This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It can directly generate an arbitrary singing waveform given a phonetic posteriorgram (PPG) representing content, F0 representing pitch, and a speaker embedding representing timbre. The proposed system is composed of three modules: generator $G$, the audio generation discriminator $D_{A}$, and the feat…

    Submitted 3 December, 2020; originally announced December 2020.

  44. arXiv:2011.05731  [pdf, other]

    eess.AS

    FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation

    Authors: Songxiang Liu, Yuewen Cao, Na Hu, Dan Su, Helen Meng

    Abstract: This paper presents FastSVC, a light-weight cross-domain singing voice conversion (SVC) system, which can achieve high conversion performance, with inference speed 4x faster than real-time on CPUs. FastSVC uses a Conformer-based phoneme recognizer to extract singer-agnostic linguistic features from singing signals. A feature-wise linear modulation based generator is used to synthesize waveform direc…

    Submitted 23 May, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

    Comments: Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2021

  45. arXiv:2010.15025  [pdf, other]

    cs.SD cs.CL eess.AS

    Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input

    Authors: Xingchen Song, Zhiyong Wu, Yiheng Huang, Chao Weng, Dan Su, Helen Meng

    Abstract: Non-autoregressive (NAR) transformer models have achieved significant inference speedup but at the cost of inferior accuracy compared to autoregressive (AR) models in automatic speech recognition (ASR). Most NAR transformers take a fixed-length sequence filled with MASK tokens or a redundant sequence copied from encoder states as decoder input; they cannot provide efficient target-side in…

    Submitted 15 April, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021, final version

    ACM Class: I.2.7

  46. arXiv:2010.15006  [pdf, other]

    eess.AS cs.AI

    Replay and Synthetic Speech Detection with Res2net Architecture

    Authors: Xu Li, Na Li, Chao Weng, Xunying Liu, Dan Su, Dong Yu, Helen Meng

    Abstract: Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, so-called Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into mult…

    Submitted 13 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP2021

  47. arXiv:2008.11589  [pdf, other]

    eess.AS cs.SD

    Learned Transferable Architectures Can Surpass Hand-Designed Architectures for Large Scale Speech Recognition

    Authors: Liqiang He, Dan Su, Dong Yu

    Abstract: In this paper, we explore neural architecture search (NAS) for automatic speech recognition (ASR) systems. With reference to previous work in the computer vision field, the transferability of the searched architecture is the main focus of our work. The architecture search is conducted on a small proxy dataset, and then the evaluation network, constructed with the searched architecture,…

    Submitted 8 May, 2021; v1 submitted 25 August, 2020; originally announced August 2020.

    Comments: Accepted to ICASSP 2021

  48. arXiv:2007.01566  [pdf, other]

    eess.AS

    Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

    Authors: Bo Wu, Meng Yu, Lianwu Chen, Yong Xu, Chao Weng, Dan Su, Dong Yu

    Abstract: Speech enhancement techniques based on deep learning have brought significant improvement in speech quality and intelligibility. Nevertheless, a large gain in speech quality measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance due to speech distortion in the enhancement stage. In this paper, a mu…

    Submitted 3 July, 2020; originally announced July 2020.

  49. arXiv:2006.11610  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

    Authors: Huirong Huang, Zhiyong Wu, Shiyin Kang, Dongyang Dai, Jia Jia, Tianxiao Fu, Deyi Tuo, Guangzhi Lei, Peng Liu, Dan Su, Dong Yu, Helen Meng

    Abstract: Generating 3D speech-driven talking heads has received more and more attention in recent years. Recent approaches mainly have the following limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phone…

    Submitted 20 June, 2020; originally announced June 2020.

    Comments: 5 pages, 5 figures

  50. arXiv:2006.06186  [pdf, other]

    eess.AS cs.LG cs.SD

    Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification

    Authors: Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

    Abstract: Recently, adversarial attacks on automatic speaker verification (ASV) systems have attracted widespread attention as they pose severe threats to ASV systems. However, methods to defend against such attacks are limited. Existing approaches mainly focus on retraining ASV systems with adversarial data augmentation. Also, countermeasure robustness against different attack settings is insufficiently investi…

    Submitted 7 August, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: accepted by Interspeech2020