-
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Authors:
Jonathan Shen,
Ruoming Pang,
Ron J. Weiss,
Mike Schuster,
Navdeep Jaitly,
Zongheng Yang,
Zhifeng Chen,
Yu Zhang,
Yuxuan Wang,
RJ Skerry-Ryan,
Rif A. Saurous,
Yannis Agiomyrgiannakis,
Yonghui Wu
Abstract:
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$, comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
Submitted 15 February, 2018; v1 submitted 15 December, 2017;
originally announced December 2017.
-
Tacotron: Towards End-to-End Speech Synthesis
Authors:
Yuxuan Wang,
RJ Skerry-Ryan,
Daisy Stanton,
Yonghui Wu,
Ron J. Weiss,
Navdeep Jaitly,
Zongheng Yang,
Ying Xiao,
Zhifeng Chen,
Samy Bengio,
Quoc Le,
Yannis Agiomyrgiannakis,
Rob Clark,
Rif A. Saurous
Abstract:
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
Submitted 6 April, 2017; v1 submitted 29 March, 2017;
originally announced March 2017.
-
AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
Authors:
Brian Patton,
Yannis Agiomyrgiannakis,
Michael Terry,
Kevin Wilson,
Rif A. Saurous,
D. Sculley
Abstract:
Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.
Submitted 28 November, 2016;
originally announced November 2016.
-
Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices
Authors:
Heiga Zen,
Yannis Agiomyrgiannakis,
Niels Egberts,
Fergus Henderson,
Przemysław Szczepaniak
Abstract:
Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices: weight quantization, multi-frame inference, and robust inference using an ε-contaminated Gaussian loss function. Experimental results in subjective listening tests show that these optimizations can make LSTM-RNN-based SPSS comparable to HMM-based SPSS in runtime speed while maintaining naturalness. Evaluations between LSTM-RNN-based SPSS and HMM-driven unit selection speech synthesis are also presented.
Submitted 22 June, 2016; v1 submitted 20 June, 2016;
originally announced June 2016.
-
Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis
Authors:
Hideki Kawahara,
Yannis Agiomyrgiannakis,
Heiga Zen
Abstract:
This paper introduces a general and flexible framework for F0 and aperiodicity (additive non-periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: an instantaneous frequency estimator and initial aperiodicity detector, an F0 trajectory tracker, and an F0 refinement and aperiodicity extractor. A preliminary implementation of the proposed framework substantially outperformed existing F0 extractors (by a factor of 10 in terms of RMS F0 estimation error) in its ability to track temporally varying F0 trajectories. The front-end aperiodicity detector consists of a complex-valued wavelet analysis filter with a highly selective temporal and spectral envelope. This detector uses a new measure that quantifies the deviation from periodicity. The measure is less sensitive to slow FM and AM and closely correlates with the signal-to-noise ratio.
Submitted 22 July, 2016; v1 submitted 25 May, 2016;
originally announced May 2016.