Showing 1–10 of 10 results for author: Luan, J

Searching in archive eess.
  1. arXiv:2401.04283  [pdf, ps, other]

    eess.AS cs.SD

    FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation

    Authors: Yang Liu, Li Wan, Yun Li, Yiteng Huang, Ming Sun, James Luan, Yangyang Shi, Xin Lei

    Abstract: Despite the potential of diffusion models in speech enhancement, their deployment in Acoustic Echo Cancellation (AEC) has been restricted. In this paper, we propose DI-AEC, pioneering a diffusion-based stochastic regeneration approach dedicated to AEC. Further, we propose FADI-AEC, a fast score-based diffusion AEC framework that reduces computational demands, making it favorable for edge devices. It stan…

    Submitted 8 January, 2024; originally announced January 2024.

  2. arXiv:2212.03435  [pdf, other]

    cs.SD cs.CL eess.AS

    Improve Bilingual TTS Using Dynamic Language and Phonology Embedding

    Authors: Fengyu Yang, Jian Luan, Yujun Wang

    Abstract: In most cases, bilingual TTS needs to handle three types of input scripts: first language only, second language only, and second language embedded in the first language. In the latter two situations, the pronunciation and intonation of the second language are usually quite different due to the influence of the first language. Therefore, it is a big challenge to accurately model the pronunciation a…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP2023

  3. arXiv:2110.09780  [pdf, other]

    cs.SD eess.AS

    Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

    Authors: Fengyu Yang, Jian Luan, Yujun Wang

    Abstract: Learning an emotion embedding from reference audio is a straightforward approach for multi-emotion speech synthesis in encoder-decoder systems. But how to get a better emotion embedding and how to inject it into the TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint to help the VAE extract emotion embeddings with better cluster cohesion. Besides…

    Submitted 28 January, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: accepted by ICASSP2022

  4. arXiv:2107.03065  [pdf, other]

    cs.SD eess.AS

    Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

    Authors: Qinghua Wu, Quanbo Shen, Jian Luan, YuJun Wang

    Abstract: In multi-speaker speech synthesis, data from a number of speakers usually tend to have great diversity due to the fact that the speakers may differ largely in ages, speaking styles, emotions, and so on. It is important but challenging to improve the modeling capabilities for multi-speaker speech synthesis. To address the issue, this paper proposes a high-capability speech synthesis system, called…

    Submitted 11 February, 2022; v1 submitted 7 July, 2021; originally announced July 2021.

    Comments: Accepted by ICASSP-2022

  5. arXiv:2009.01776  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

    Authors: Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, Tie-Yan Liu

    Abstract: High-fidelity singing voices usually require a higher sampling rate (e.g., 48 kHz) to convey expression and emotion. However, a higher sampling rate produces a wider frequency band and longer waveform sequences, which poses challenges for singing voice synthesis (SVS) in both the frequency and time domains. Conventional SVS systems that adopt a small sampling rate cannot address these challenges well. In t…

    Submitted 3 September, 2020; originally announced September 2020.

  6. arXiv:2008.04658  [pdf, other]

    eess.AS cs.SD

    Transfer Learning for Improving Singing-voice Detection in Polyphonic Instrumental Music

    Authors: Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li

    Abstract: Detecting singing voice in polyphonic instrumental music is critical to music information retrieval. To train a robust vocal detector, a large dataset with frame-level vocal or non-vocal labels is essential. However, frame-level labeling is time-consuming and labor-intensive, so few well-labeled datasets are available for singing-voice detection (S-VD). Hence, we propose a d…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Accepted by INTERSPEECH 2020

  7. arXiv:2008.02490  [pdf]

    eess.AS cs.SD

    PPSpeech: Phrase based Parallel End-to-End TTS System

    Authors: Yahuan Cong, Ran Zhang, Jian Luan

    Abstract: Current end-to-end autoregressive TTS systems (e.g., Tacotron 2) have outperformed traditional parallel approaches on the quality of synthesized speech. However, they introduce new problems at the same time. Due to their autoregressive nature, the time cost of inference has to be proportional to the length of the text, which poses a great challenge for online serving. On the other hand, the style of synth…

    Submitted 6 August, 2020; originally announced August 2020.

  8. arXiv:2007.04590  [pdf, other]

    eess.AS cs.CL cs.SD

    DeepSinger: Singing Voice Synthesis with Data Mined From the Web

    Authors: Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu

    Abstract: In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a l…

    Submitted 15 July, 2020; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Accepted by KDD2020 research track

  9. arXiv:2006.10317  [pdf, other]

    eess.AS cs.LG cs.SD

    Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer

    Authors: Jie Wu, Jian Luan

    Abstract: This paper presents a high-quality singing synthesizer that is able to model a voice with limited available recordings. Based on the sequence-to-sequence singing model, we design a multi-singer framework to leverage all the existing singing data of different singers. To attenuate the issue of musical score imbalance among singers, we incorporate an adversarial task of singer classification to make…

    Submitted 18 June, 2020; originally announced June 2020.

    Comments: Submitted to INTERSPEECH2020

  10. arXiv:2006.06261  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

    Authors: Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou

    Abstract: This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling. We follow the main architecture of FastSpeech while proposing some singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues, we a…

    Submitted 11 June, 2020; originally announced June 2020.