Showing 1–17 of 17 results for author: Leng, Y

  1. arXiv:2403.03100  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di…

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  2. arXiv:2402.07729  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

    Authors: Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou

    Abstract: Recently, instruction-following audio-language models have received broad attention for human-audio interaction. However, the absence of benchmarks capable of evaluating audio-centric interaction capabilities has impeded advancements in this field. Previous models primarily focus on assessing different fundamental tasks, such as Automatic Speech Recognition (ASR), and lack an assessment of the ope…

    Submitted 12 February, 2024; originally announced February 2024.

  3. arXiv:2312.14383  [pdf, other]

    cs.MM cs.CV eess.IV

    Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal

    Authors: Yicheng Leng, Chaowei Fang, Gen Li, Yixiang Fang, Guanbin Li

    Abstract: Visible watermarks, while instrumental in protecting image copyrights, frequently distort the underlying content, complicating tasks like scene interpretation and image editing. Visible watermark removal aims to eliminate the interference of watermarks and restore the background content. However, existing methods often implement watermark component removal and background restoration tasks within a…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI2024

  4. arXiv:2309.02285  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS 2: Describing and Generating Voices with Text Prompt

    Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

    Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text…

    Submitted 11 October, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Demo page: https://speechresearch.github.io/prompttts2

  5. arXiv:2304.09116  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

    Authors: Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian

    Abstract: Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating is…

    Submitted 30 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: A large-scale text-to-speech and singing voice synthesis system with latent diffusion models. Update: NaturalSpeech 2 extension to voice conversion and speech enhancement

  6. arXiv:2303.06828  [pdf, other]

    eess.AS

    Two-step Band-split Neural Network Approach for Full-band Residual Echo Suppression

    Authors: Zihan Zhang, Shimin Zhang, Mingshuai Liu, Yanhong Leng, Zhe Han, Li Chen, Lei Xie

    Abstract: This paper describes a Two-step Band-split Neural Network (TBNN) approach for full-band acoustic echo cancellation. Specifically, after linear filtering, we split the full-band signal into wide-band (16KHz) and high-band (16-48KHz) for residual echo removal with lower modeling difficulty. The wide-band signal is processed by an updated gated convolutional recurrent network (GCRN) with U$^2$ encode…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  7. arXiv:2212.14518  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

    Authors: Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic

    Abstract: Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up by minimizing the number of…

    Submitted 29 December, 2022; originally announced December 2022.

    Comments: 13 pages, 5 figures

  8. arXiv:2212.01039  [pdf, other]

    cs.CL cs.LG eess.AS

    SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition

    Authors: Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu

    Abstract: Error correction in automatic speech recognition (ASR) aims to correct those incorrect words in sentences generated by ASR models. Since recent ASR models usually have low word error rate (WER), to avoid affecting originally correct tokens, error correction models should only modify incorrect words, and therefore detecting incorrect words is important for error correction. Previous works on error…

    Submitted 20 December, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: AAAI 2023

  9. arXiv:2211.12171  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS: Controllable Text-to-Speech with Text Descriptions

    Authors: Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan

    Abstract: Using a text description as prompt to guide the generation of text or images (e.g., GPT-3 or DALLE-2) has drawn wide attention recently. Beyond text and image generation, in this work, we explore the possibility of utilizing text descriptions to guide speech synthesis. Thus, we develop a text-to-speech (TTS) system (dubbed as PromptTTS) that takes a prompt with both style and content descriptions…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2205.14807  [pdf, other]

    eess.AS cs.LG cs.SD

    BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

    Authors: Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

    Abstract: Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing them from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which, however,…

    Submitted 29 November, 2022; v1 submitted 29 May, 2022; originally announced May 2022.

    Comments: NeurIPS 2022 camera version

  11. arXiv:2205.04421  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

    Authors: Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

    Abstract: Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing app…

    Submitted 10 May, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: 19 pages, 3 figures, 8 tables

  12. arXiv:2111.04985  [pdf]

    eess.IV cs.CV

    Bilinear pooling and metric learning network for early Alzheimer's disease identification with FDG-PET images

    Authors: Wenju Cui, Caiying Yan, Zhuangzhi Yan, Yunsong Peng, Yilin Leng, Chenlu Liu, Shuangqing Chen, Xi Jiang

    Abstract: FDG-PET reveals altered brain metabolism in individuals with mild cognitive impairment (MCI) and Alzheimer's disease (AD). Some biomarkers derived from FDG-PET by computer-aided-diagnosis (CAD) technologies have been proved that they can accurately diagnosis normal control (NC), MCI, and AD. However, the studies of identification of early MCI (EMCI) and late MCI (LMCI) with FDG-PET images are stil…

    Submitted 9 November, 2021; originally announced November 2021.

  13. arXiv:2110.03857  [pdf, other]

    eess.AS cs.CL cs.SD

    A study on the efficacy of model pre-training in developing neural text-to-speech system

    Authors: Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan, Sheng Zhao, Tan Lee

    Abstract: In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand bet…

    Submitted 7 October, 2021; originally announced October 2021.

  14. arXiv:2109.14420  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition

    Authors: Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Tao Qin, Xiang-Yang Li, Edward Lin, Tie-Yan Liu

    Abstract: Error correction is widely used in automatic speech recognition (ASR) to post-process the generated sentence, and can further reduce the word error rate (WER). Although multiple candidates are generated by an ASR system through beam search, current error correction approaches can only correct one sentence at a time, failing to leverage the voting effect from multiple candidates to better detect an…

    Submitted 29 November, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: Findings of EMNLP 2021

  15. arXiv:2105.03842  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition

    Authors: Yichong Leng, Xu Tan, Linchen Zhu, Jin Xu, Renqian Luo, Linquan Liu, Tao Qin, Xiang-Yang Li, Ed Lin, Tie-Yan Liu

    Abstract: Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution t…

    Submitted 29 November, 2022; v1 submitted 9 May, 2021; originally announced May 2021.

    Comments: NeurIPS 2021. Code URL: https://github.com/microsoft/NeuralSpeech

  16. arXiv:2103.00110  [pdf, other]

    cs.SD eess.AS

    MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network

    Authors: Yichong Leng, Xu Tan, Sheng Zhao, Frank Soong, Xiang-Yang Li, Tao Qin

    Abstract: Mean opinion score (MOS) is a popular subjective metric to assess the quality of synthesized speech, and usually involves multiple human judges to evaluate each speech utterance. To reduce the labor cost in MOS test, multiple methods have been proposed to automatically predict MOS scores. To our knowledge, for a speech utterance, all previous works only used the average of multiple scores from dif…

    Submitted 26 February, 2021; originally announced March 2021.

    Comments: Accepted by ICASSP 2021

  17. arXiv:2009.13931  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Residual acoustic echo suppression based on efficient multi-task convolutional neural network

    Authors: Xinquan Zhou, Yanhong Leng

    Abstract: Acoustic echo degrades the user experience in voice communication systems thus needs to be suppressed completely. We propose a real-time residual acoustic echo suppression (RAES) method using an efficient convolutional neural network. The double talk detector is used as an auxiliary task to improve the performance of RAES in the context of multi-task learning. The training criterion is based on a…

    Submitted 5 November, 2020; v1 submitted 29 September, 2020; originally announced September 2020.