-
Learning the joint distribution of two sequences using little or no paired data
Authors:
Soroosh Mariooryad,
Matt Shannon,
Siyuan Ma,
Tom Bagby,
David Kao,
Daisy Stanton,
Eric Battenberg,
RJ Skerry-Ryan
Abstract:
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is only possible under certain conditions in the data distribution, and we discuss under what type of conditional independence assumptions that might be achieved, which guides the architecture designs. Experimental results show that even a tiny amount of paired data (5 minutes) is sufficient to learn to relate the two modalities (graphemes and phonemes here) when a massive amount of unpaired data is available, paving the path to adopting this principled approach for all seq2seq models in low data resource regimes.
Submitted 6 December, 2022;
originally announced December 2022.
-
Speaker Generation
Authors:
Daisy Stanton,
Matt Shannon,
Soroosh Mariooryad,
RJ Skerry-Ryan,
Eric Battenberg,
Tom Bagby,
David Kao
Abstract:
This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.
Submitted 7 November, 2021;
originally announced November 2021.
-
Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
Authors:
Ron J. Weiss,
RJ Skerry-Ryan,
Eric Battenberg,
Soroosh Mariooryad,
Diederik P. Kingma
Abstract:
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
Submitted 5 February, 2021; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Non-saturating GAN training as divergence minimization
Authors:
Matt Shannon,
Ben Poole,
Soroosh Mariooryad,
Tom Bagby,
Eric Battenberg,
David Kao,
Daisy Stanton,
RJ Skerry-Ryan
Abstract:
Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However, so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f-GANs and Wasserstein GANs, which are motivated in terms of approximate divergence minimization. In this paper we show that non-saturating GAN training does in fact approximately minimize a particular f-divergence. We develop general theoretical tools to compare and classify f-divergences and use these to show that the new f-divergence is qualitatively similar to reverse KL. These results help to explain the high sample quality but poor diversity often observed empirically when using this scheme.
Submitted 15 October, 2020;
originally announced October 2020.
-
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
Authors:
Eric Battenberg,
RJ Skerry-Ryan,
Soroosh Mariooryad,
Daisy Stanton,
David Kao,
Matt Shannon,
Tom Bagby
Abstract:
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
Submitted 22 April, 2020; v1 submitted 22 October, 2019;
originally announced October 2019.
-
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
Authors:
Raza Habib,
Soroosh Mariooryad,
Matt Shannon,
Eric Battenberg,
RJ Skerry-Ryan,
Daisy Stanton,
David Kao,
Tom Bagby
Abstract:
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 1% (30 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline. Audio samples are available on the web.
Submitted 3 October, 2019;
originally announced October 2019.
-
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Authors:
Eric Battenberg,
Soroosh Mariooryad,
Daisy Stanton,
RJ Skerry-Ryan,
Matt Shannon,
David Kao,
Tom Bagby
Abstract:
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.
Submitted 25 October, 2019; v1 submitted 8 June, 2019;
originally announced June 2019.
-
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Authors:
RJ Skerry-Ryan,
Eric Battenberg,
Ying Xiao,
Yuxuan Wang,
Daisy Stanton,
Joel Shor,
Ron J. Weiss,
Rob Clark,
Rif A. Saurous
Abstract:
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Submitted 23 March, 2018;
originally announced March 2018.
-
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Authors:
Yuxuan Wang,
Daisy Stanton,
Yu Zhang,
RJ Skerry-Ryan,
Eric Battenberg,
Joel Shor,
Ying Xiao,
Fei Ren,
Ye Jia,
Rif A. Saurous
Abstract:
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Submitted 23 March, 2018;
originally announced March 2018.
-
Uncovering Latent Style Factors for Expressive Speech Synthesis
Authors:
Yuxuan Wang,
RJ Skerry-Ryan,
Ying Xiao,
Daisy Stanton,
Joel Shor,
Eric Battenberg,
Rob Clark,
Rif A. Saurous
Abstract:
Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
Submitted 1 November, 2017;
originally announced November 2017.
-
Exploring Neural Transducers for End-to-End Speech Recognition
Authors:
Eric Battenberg,
Jitong Chen,
Rewon Child,
Adam Coates,
Yashesh Gaur,
Yi Li,
Hairong Liu,
Sanjeev Satheesh,
David Seetapun,
Anuroop Sriram,
Zhenyao Zhu
Abstract:
In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5'00 benchmark. On our internal diverse dataset, these trends continue - RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models - when all encoder layers are forward only, and when encoders downsample the input representation aggressively.
Submitted 24 July, 2017;
originally announced July 2017.
-
Reducing Bias in Production Speech Models
Authors:
Eric Battenberg,
Rewon Child,
Adam Coates,
Christopher Fougner,
Yashesh Gaur,
Jiaji Huang,
Heewoo Jun,
Ajay Kannan,
Markus Kliegl,
Atul Kumar,
Hairong Liu,
Vinay Rao,
Sanjeev Satheesh,
David Seetapun,
Anuroop Sriram,
Zhenyao Zhu
Abstract:
Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end speech models back into the underfitting regime and expose biases in the model that we show cannot be overcome by "scaling up", i.e., training bigger models on more data. In this work we systematically identify and address sources of bias, reducing error rates by up to 20% while remaining practical for deployment. We achieve this by utilizing improved neural architectures for streaming inference, solving optimization issues, and employing strategies that increase audio and label modelling versatility.
Submitted 11 May, 2017;
originally announced May 2017.
-
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Authors:
Dario Amodei,
Rishita Anubhai,
Eric Battenberg,
Carl Case,
Jared Casper,
Bryan Catanzaro,
Jingdong Chen,
Mike Chrzanowski,
Adam Coates,
Greg Diamos,
Erich Elsen,
Jesse Engel,
Linxi Fan,
Christopher Fougner,
Tony Han,
Awni Hannun,
Billy Jun,
Patrick LeGresley,
Libby Lin,
Sharan Narang,
Andrew Ng,
Sherjil Ozair,
Ryan Prenger,
Jonathan Raiman,
Sanjeev Satheesh, et al. (9 additional authors not shown)
Abstract:
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
Submitted 8 December, 2015;
originally announced December 2015.