Showing 1–5 of 5 results for author: Agiomyrgiannakis, Y

  1. arXiv:1712.05884  [pdf, other]

    cs.CL

    Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

    Authors: Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

    Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion s…

    Submitted 15 February, 2018; v1 submitted 15 December, 2017; originally announced December 2017.

    Comments: Accepted to ICASSP 2018

  2. arXiv:1703.10135  [pdf, other]

    cs.CL cs.LG cs.SD

    Tacotron: Towards End-to-End Speech Synthesis

    Authors: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

    Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Give…

    Submitted 6 April, 2017; v1 submitted 29 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech 2017. v2 changed paper title to be consistent with our conference submission (no content change other than typo fixes)

  3. arXiv:1611.09207  [pdf, other]

    cs.CL cs.LG stat.ML

    AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

    Authors: Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A. Saurous, D. Sculley

    Abstract: Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled hum…

    Submitted 28 November, 2016; originally announced November 2016.

    Comments: 4 pages, 2 figures, 2 tables, NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop

  4. arXiv:1606.06061  [pdf, other]

    cs.SD cs.CL

    Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

    Authors: Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemysław Szczepaniak

    Abstract: Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices: weight quantization, multi-frame infere…

    Submitted 22 June, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

    Comments: 13 pages, 3 figures, Interspeech 2016 (accepted)

  5. arXiv:1605.07809  [pdf, ps, other]

    cs.SD eess.AS eess.SP

    Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis

    Authors: Hideki Kawahara, Yannis Agiomyrgiannakis, Heiga Zen

    Abstract: This paper introduces a general and flexible framework for F0 and aperiodicity (additive non-periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: an instantaneous frequency estimator and initial aperiodicity detector, an F0 trajectory tracker, and an F0 refinement and aperiodicity extractor.…

    Submitted 22 July, 2016; v1 submitted 25 May, 2016; originally announced May 2016.

    Comments: Accepted for presentation in ISCA workshop SSW9

    Journal ref: 9th ISCA Speech Synthesis Workshop, 2016, pp.221-228