Showing 1–16 of 16 results for author: Cong, J

Searching in archive eess. Search in all archives.
  1. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2310.04004  [pdf, other]

    cs.SD eess.AS

    U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

    Authors: Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie

    Abstract: Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zer…

    Submitted 6 October, 2023; originally announced October 2023.

  3. arXiv:2310.01342  [pdf, other]

    cs.IT eess.SP

    Near-field Integrated Sensing and Communication: Opportunities and Challenges

    Authors: Jiayi Cong, Changsheng You, Jiapeng Li, Li Chen, Beixiong Zheng, Yuanwei Liu, Wen Wu, Yi Gong, Shi Jin, Rui Zhang

    Abstract: With the extremely large-scale array (XL-array) deployed in future wireless systems, wireless communication and sensing are expected to operate in the radiative near-field region, which needs to be characterized by the spherical rather than planar wavefronts. Unlike most existing works that considered far-field integrated sensing and communication (ISAC), we study in this article the new near-field…

    Submitted 17 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: This work is submitted to IEEE for possible publication

  4. arXiv:2309.00883  [pdf, other]

    cs.SD eess.AS

    DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin

    Authors: Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie

    Abstract: While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a D…

    Submitted 2 September, 2023; originally announced September 2023.

    Comments: Accepted by TASLP

  5. arXiv:2212.04736  [pdf, other]

    eess.IV

    FPGA-Based In-Vivo Calcium Image Decoding for Closed-Loop Feedback Applications

    Authors: Zhe Chen, Garrett J. Blair, Chengdi Cao, Jim Zhou, Daniel Aharoni, Peyman Golshani, Hugh T. Blair, Jason Cong

    Abstract: Miniaturized calcium imaging is an emerging neural recording technique that has been widely used for monitoring neural activity on a large scale at a specific brain region of rats or mice. Most existing calcium-image analysis pipelines operate offline. This results in long processing latency, making it difficult to realize closed-loop feedback stimulation for brain research. In recent work, we hav…

    Submitted 16 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: 11 pages, 15 figures

  6. arXiv:2211.01087  [pdf, other]

    cs.SD eess.AS

    DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

    Authors: Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie, Gang He, Jinfeng Bai

    Abstract: Recent development of neural vocoders based on the generative adversarial neural network (GAN) has shown obvious advantages of generating raw waveform conditioned on mel-spectrogram with fast inference speed and lightweight networks. Whereas, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech from various scenarios with unseen speakers, languages,…

    Submitted 28 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  7. arXiv:2210.17349  [pdf, other]

    cs.SD eess.AS

    Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

    Authors: Kun Song, Jian Cong, Xinsheng Wang, Yongmao Zhang, Lei Xie, Ning Jiang, Haiying Wu

    Abstract: In current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder, once trained, which is robust to imperfect mel-spectrogram predicted from the acoustic model. To this end, we propose Robust MelGAN vocoder by solving the original multi-band MelGAN's metallic sound problem and increasing its generalization ability. Specifically, we introduce a fine-grained n…

    Submitted 2 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: Accepted by ISCSLP 2022

  8. arXiv:2207.01832  [pdf, other]

    cs.SD eess.AS

    Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

    Authors: Yi Lei, Shan Yang, Jian Cong, Lei Xie, Dan Su

    Abstract: The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in zero-shot scenario exist in both stages -- acoustic modeling and vocoder, previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming…

    Submitted 5 July, 2022; originally announced July 2022.

  9. arXiv:2206.00208  [pdf, other]

    cs.SD eess.AS

    AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

    Authors: Kun Song, Heyang Xue, Xinsheng Wang, Jian Cong, Yongmao Zhang, Lei Xie, Bing Yang, Xiong Zhang, Dan Su

    Abstract: Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been conducted towards this task, seldom work has been performed for low computational resource scenarios due to the challenges raised by the requirement of the lightweight model and less computational complexity. In this paper, a tiny…

    Submitted 2 November, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

    Comments: Accepted by ISCSLP 2022

  10. arXiv:2205.04421  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

    Authors: Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

    Abstract: Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing app…

    Submitted 10 May, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: 19 pages, 3 figures, 8 tables

  11. arXiv:2110.08813  [pdf, other]

    eess.AS cs.SD

    VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis

    Authors: Yongmao Zhang, Jian Cong, Heyang Xue, Lei Xie, Pengcheng Zhu, Mengxiao Bi

    Abstract: In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveform from lyrics and musical score. Our approach is inspired by VITS, which adopts VAE-based posterior encoder augmented with normalizing flow-based prior encoder and adversarial decoder to realize complete end-to-end speech generation. VISinger follows the…

    Submitted 24 February, 2022; v1 submitted 17 October, 2021; originally announced October 2021.

    Comments: 5 pages, ICASSP 2022

  12. arXiv:2106.10831  [pdf, other]

    eess.AS cs.SD

    Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

    Authors: Jian Cong, Shan Yang, Lei Xie, Dan Su

    Abstract: Current two-stage TTS framework typically integrates an acoustic model with a vocoder -- the acoustic model predicts a low resolution intermediate representation such as Mel-spectrum while the vocoder generates waveform from the intermediate representation. Although the intermediate representation is served as a bridge, there still exists critical mismatch between the acoustic model and the vocode…

    Submitted 21 June, 2021; v1 submitted 20 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH 2021

  13. arXiv:2106.10828  [pdf, other]

    eess.AS cs.SD

    Controllable Context-aware Conversational Speech Synthesis

    Authors: Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

    Abstract: In spoken conversations, spontaneous behaviors like filled pause and prolongations always happen. Conversational partner tends to align features of their speech with their interlocutor which is known as entrainment. To produce human-like conversations, we propose a unified controllable spontaneous conversational speech synthesis framework to model the above two phenomena. Specifically, we use expl…

    Submitted 20 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH 2021

  14. arXiv:2008.04265  [pdf, other]

    eess.AS cs.SD

    Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training

    Authors: Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, Guanglu Wan

    Abstract: Data efficient voice cloning aims at synthesizing target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on base model trained from multiple speakers. The former uses a small set of target speaker data to transfer the multi-speaker model to target speaker's voice through direct model update, while in the…

    Submitted 10 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

    Comments: Accepted to INTERSPEECH 2020

  15. arXiv:2004.12592  [pdf, other]

    eess.IV cs.CV cs.LG

    Robust Screening of COVID-19 from Chest X-ray via Discriminative Cost-Sensitive Learning

    Authors: Tianyang Li, Zhongyi Han, Benzheng Wei, Yuanjie Zheng, Yanfei Hong, Jinyu Cong

    Abstract: This paper addresses the new problem of automated screening of coronavirus disease 2019 (COVID-19) based on chest X-rays, which is urgently demanded toward fast stopping the pandemic. However, robust and accurate screening of COVID-19 from chest X-rays is still a globally recognized challenge because of two bottlenecks: 1) imaging features of COVID-19 share some similarities with other pneumonia o…

    Submitted 21 May, 2020; v1 submitted 27 April, 2020; originally announced April 2020.

    Comments: Under review

  16. Hybrid beamforming for single carrier mmWave MIMO systems

    Authors: Tian Lin, Jiaqi Cong, Yu Zhu

    Abstract: Hybrid analog and digital beamforming (HBF) has been recognized as an attractive technique offering a tradeoff between hardware implementation limitation and system performance for future broadband millimeter wave (mmWave) communications. In contrast to most current works focusing on the HBF design for orthogonal frequency division multiplexing based mmWave systems, this paper investigates the HBF…

    Submitted 26 February, 2019; originally announced February 2019.

    Comments: IEEE GlobalSIP2018, Feb. 2019