Showing 1–13 of 13 results for author: Skerry-Ryan, R

Searching in archive eess.
  1. arXiv:2305.15255  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

    Authors: Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, Michelle Tadmor Ramanovich

    Abstract: We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key…

    Submitted 30 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: ICLR 2024 camera-ready

  2. arXiv:2111.05095  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speaker Generation

    Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

    Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to…

    Submitted 7 November, 2021; originally announced November 2021.

    Comments: 12 pages, 3 figures, 4 tables, appendix with 2 tables

    ACM Class: I.2.7; G.3

  3. arXiv:2103.14574  [pdf, other]

    cs.SD eess.AS

    Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, RJ Skerry-Ryan, Yonghui Wu

    Abstract: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, this model can learn token-frame alignments as well as token durations automatica…

    Submitted 29 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  4. arXiv:2011.03568  [pdf, other]

    cs.CL cs.SD eess.AS

    Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

    Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

    Abstract: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within…

    Submitted 5 February, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: 6 pages including supplement, 3 figures. accepted to ICASSP 2021

  5. arXiv:1910.10288  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

    Authors: Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

    Abstract: Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attentio…

    Submitted 22 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020

  6. arXiv:1910.01709  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Semi-Supervised Generative Modeling for Controllable Speech Synthesis

    Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

    Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model…

    Submitted 3 October, 2019; originally announced October 2019.

  7. arXiv:1907.04448  [pdf, other]

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related…

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  8. arXiv:1906.03402  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

    Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

    Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of an…

    Submitted 25 October, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

    Comments: Submitted to ICLR 2020

  9. arXiv:1906.02246  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS eess.SP

    Complex Evolution Recurrent Neural Networks (ceRNNs)

    Authors: Izhak Shafran, Tom Bagby, R. J. Skerry-Ryan

    Abstract: Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators. The literature so far does not address -- how critical is the unitary property of the model? Furthermore, uRNNs have not been evaluated on large tasks. To study these shortcomings, we propose the complex evolution R…

    Submitted 5 June, 2019; originally announced June 2019.

    Journal ref: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5854-5858, 2018

  10. arXiv:1808.10128  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

    Authors: Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

    Abstract: Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect. In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron. The idea is to allow Tacotron to utilize textual and acoustic knowledge contain…

    Submitted 30 August, 2018; originally announced August 2018.

  11. arXiv:1808.01410  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS stat.ML

    Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

    Authors: Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

    Abstract: Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicted Global Style Token (TP-GST) architecture, which treats GST combina…

    Submitted 3 August, 2018; originally announced August 2018.

    MSC Class: eess.AS

  12. arXiv:1803.09047  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

    Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synth…

    Submitted 23 March, 2018; originally announced March 2018.

  13. arXiv:1803.09017  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

    Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

    Abstract: In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to contr…

    Submitted 23 March, 2018; originally announced March 2018.