Showing 1–7 of 7 results for author: Tuo, D

Searching in archive cs.
  1. arXiv:2308.16836 [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information

    Authors: Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu, Shiyin Kang, Helen Meng

    Abstract: This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses Bidirectional Encoder Representations from Transformers (BERT) derived semantic embeddings to improve the expressiveness of the synthesized singing voice. Based on the main architecture of the recently proposed VISinger, we put forward several specific designs for expressive singing voice synthesis. First, dif…

    Submitted 31 August, 2023; originally announced August 2023.

  2. Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

    Authors: Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

    Abstract: For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech. Although inter-utterance linguistic information can influence the speech interpretation of the target utterance, previous works on PSP mainly focus on utilizing only the intra-utterance linguistic information of the current utterance. This work proposes to use in…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: Accepted by Interspeech 2022

  3. arXiv:2306.09025 [pdf, other]

    cs.SD cs.LG eess.AS

    CoverHunter: Cover Song Identification with Refined Attention and Alignments

    Authors: Feng Liu, Deyi Tuo, Yinan Xu, Xintong Han

    Abstract: Cover song identification (CSI) focuses on finding different versions of the same music among reference anchors, given a query track. In this paper, we propose a novel system named CoverHunter that overcomes the shortcomings of existing detection schemes by exploring richer features with refined attention and alignments. CoverHunter contains three key modules: 1) A convolution-augmented tr…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: 6 pages, 3 figures

  4. arXiv:2203.12813 [pdf, other]

    cs.SD cs.CL eess.AS

    Disentangling Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion

    Authors: Xintao Zhao, Feng Liu, Changhe Song, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Helen Meng

    Abstract: Non-parallel data voice conversion (VC) has achieved considerable breakthroughs recently through the introduction of bottleneck features (BNFs) extracted by an automatic speech recognition (ASR) model. However, the selection of BNFs has a significant impact on the VC result. For example, when extracting BNFs from ASR trained with Cross Entropy loss (CE-BNFs) and feeding into neural network to train a VC system,…

    Submitted 23 March, 2022; originally announced March 2022.

    Comments: Accepted by ICASSP 2022

  5. arXiv:2203.12188 [pdf, other]

    cs.SD cs.AI eess.AS

    FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

    Authors: Jun Chen, Zilin Wang, Deyi Tuo, Zhiyong Wu, Shiyin Kang, Helen Meng

    Abstract: The previously proposed FullSubNet has achieved outstanding performance in the Deep Noise Suppression (DNS) Challenge and attracted much attention. However, it still encounters issues such as input-output mismatch and coarse processing of frequency bands. In this paper, we propose an extended single-channel real-time speech enhancement framework called FullSubNet+ with the following significant improvements.…

    Submitted 26 March, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: Accepted by ICASSP 2022

  6. arXiv:2006.11610 [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

    Authors: Huirong Huang, Zhiyong Wu, Shiyin Kang, Dongyang Dai, Jia Jia, Tianxiao Fu, Deyi Tuo, Guangzhi Lei, Peng Liu, Dan Su, Dong Yu, Helen Meng

    Abstract: Generating 3D speech-driven talking heads has received more and more attention in recent years. Recent approaches mainly have the following limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phone…

    Submitted 20 June, 2020; originally announced June 2020.

    Comments: 5 pages, 5 figures

  7. arXiv:1909.01700 [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    DurIAN: Duration Informed Attention Network For Multimodal Synthesis

    Authors: Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu

    Abstract: In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This is different from th…

    Submitted 5 September, 2019; v1 submitted 4 September, 2019; originally announced September 2019.