Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Authors: Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Abstract: Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvements to its naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost up the inference speed, we leverage the accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates 28 times faster than the real-time with a single NVIDIA 2080Ti GPU.

Submitted 3 April, 2021; originally announced April 2021.

Comments: Submitted to INTERSPEECH 2021

arXiv:2007.05214 [pdf, other]

doi 10.1109/TASLP.2021.3049344

Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition

Authors: Hyeonseung Lee, Woo Hyun Kang, Sung Jun Cheon, Hyeongju Kim, Nam Soo Kim

Abstract: Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for better user experience. However, a common limitation of the conventional softmax-based online attention approaches is that they introduce an additional hyperparameter related to the length of the attention window, requiring multiple trials of model training for tuning the hyperparameter. In order to deal with this problem, we propose a novel softmax-free attention method and its modified formulation for online attention, which does not need any additional hyperparameter at the training phase. Through a number of ASR experiments, we demonstrate the tradeoff between the latency and performance of the proposed online attention technique can be controlled by merely adjusting a threshold at the test phase. Furthermore, the proposed methods showed competitive performance to the conventional global and online attentions in terms of word-error-rates (WERs).

Submitted 14 January, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

arXiv:2006.04598 [pdf, other]

WaveNODE: A Continuous Normalizing Flow for Speech Synthesis

Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Abstract: In recent years, various flow-based generative models have been proposed to generate high-fidelity waveforms in real-time. However, these models require either a well-trained teacher network or a number of flow steps making them memory-inefficient. In this paper, we propose a novel generative model called WaveNODE which exploits a continuous normalizing flow for speech synthesis. Unlike the conventional models, WaveNODE places no constraint on the function used for flow operation, thus allowing the usage of more flexible and complex functions. Moreover, WaveNODE can be optimized to maximize the likelihood without requiring any teacher network or auxiliary loss terms. We experimentally show that WaveNODE achieves comparable performance with fewer parameters compared to the conventional flow-based vocoders.

Submitted 2 July, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 8 pages, 4 figures, Second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020)

arXiv:1810.13113 [pdf, other]

doi 10.1145/3461778.3462078

Giving Space to Your Message: Assistive Word Segmentation for the Electronic Typing of Digital Minorities

Authors: Won Ik Cho, Sung Jun Cheon, Woo Hyun Kang, Ji Won Kim, Nam Soo Kim

Abstract: For readability and disambiguation of the written text, appropriate word segmentation is recommended for documentation, and it also holds for the digitized texts. If the language is agglutinative while far from scriptio continua, for instance in the Korean language, the problem becomes more significant. However, some device users these days find it challenging to communicate via key stroking, not only for handicap but also for being unskilled. In this study, we propose a real-time assistive technology that utilizes an automatic word segmentation, designed for digital minorities who are not familiar with electronic typing. We propose a data-driven system trained upon a spoken Korean language corpus with various non-canonical expressions and dialects, guaranteeing the comprehension of contextual information. Through quantitative and qualitative comparison with other text processing toolkits, we show the reliability of the proposed system and its fit with colloquial and non-normalized texts, which fulfills the aim of supportive technology.

Submitted 4 May, 2021; v1 submitted 31 October, 2018; originally announced October 2018.

Comments: DIS 2021 Camera-ready

Showing 1–4 of 4 results for author: Cheon, S J