-
WaveGrad: Estimating Gradients for Waveform Generation
Authors:
Nanxin Chen,
Yu Zhang,
Heiga Zen,
Ron J. Weiss,
Mohammad Norouzi,
William Chan
Abstract:
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade infere…
▽ More
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality. We find that it can generate high fidelity audio samples using as few as six iterations. Experiments reveal WaveGrad to generate high fidelity audio, outperforming adversarial non-autoregressive baselines and matching a strong likelihood-based autoregressive baseline using fewer sequential operations. Audio samples are available at https://wavegrad.github.io/.
△ Less
Submitted 9 October, 2020; v1 submitted 2 September, 2020;
originally announced September 2020.
-
Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior
Authors:
Guangzhi Sun,
Yu Zhang,
Ron J. Weiss,
Yuan Cao,
Heiga Zen,
Andrew Rosenberg,
Bhuvana Ramabhadran,
Yonghui Wu
Abstract:
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech,…
▽ More
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples. This is accomplished by discretizing the latent features using vector quantization (VQ), and separately training an autoregressive (AR) prior model over the result. We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes. Experimental results show that the proposed model significantly improves the naturalness in random sample generation. Furthermore, initial experiments demonstrate that randomly sampling from the proposed model can be used as data augmentation to improve the ASR performance.
△ Less
Submitted 6 February, 2020;
originally announced February 2020.
-
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis
Authors:
Guangzhi Sun,
Yu Zhang,
Ron J. Weiss,
Yuan Cao,
Heiga Zen,
Yonghui Wu
Abstract:
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with a…
▽ More
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure. Evaluation of reconstruction performance illustrates that the new structure does not degrade the model while allowing better interpretability. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.
△ Less
Submitted 6 February, 2020;
originally announced February 2020.
-
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
Authors:
Jonathan Shen,
Patrick Nguyen,
Yonghui Wu,
Zhifeng Chen,
Mia X. Chen,
Ye Jia,
Anjuli Kannan,
Tara Sainath,
Yuan Cao,
Chung-Cheng Chiu,
Yanzhang He,
Jan Chorowski,
Smit Hinsu,
Stella Laurenzo,
James Qin,
Orhan Firat,
Wolfgang Macherey,
Suyog Gupta,
Ankur Bapna,
Shuyuan Zhang,
Ruoming Pang,
Ron J. Weiss,
Rohit Prabhavalkar,
Qiao Liang,
Benoit Jacob
, et al. (66 additional authors not shown)
Abstract:
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…
▽ More
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
△ Less
Submitted 21 February, 2019;
originally announced February 2019.
-
Unsupervised speech representation learning using WaveNet autoencoders
Authors:
Jan Chorowski,
Ron J. Weiss,
Samy Bengio,
AƤron van den Oord
Abstract:
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g.\ phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or backgroun…
▽ More
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g.\ phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of map** them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
△ Less
Submitted 11 September, 2019; v1 submitted 25 January, 2019;
originally announced January 2019.
-
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Authors:
Quan Wang,
Hannah Muckenhirn,
Kevin Wilson,
Prashant Sridhar,
Zelin Wu,
John Hershey,
Rif A. Saurous,
Ron J. Weiss,
Ye Jia,
Ignacio Lopez Moreno
Abstract:
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embe…
▽ More
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
△ Less
Submitted 19 June, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Authors:
Jan Chorowski,
Ron J. Weiss,
Rif A. Saurous,
Samy Bengio
Abstract:
Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and t…
▽ More
Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice.
△ Less
Submitted 8 March, 2018; v1 submitted 22 December, 2017;
originally announced December 2017.
-
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Authors:
Chung-Cheng Chiu,
Tara N. Sainath,
Yonghui Wu,
Rohit Prabhavalkar,
Patrick Nguyen,
Zhifeng Chen,
Anjuli Kannan,
Ron J. Weiss,
Kanishka Rao,
Ekaterina Gonina,
Navdeep Jaitly,
Bo Li,
Jan Chorowski,
Michiel Bacchiani
Abstract:
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-theart ASR systems on dictation tasks, but it was not clear if such archite…
▽ More
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-theart ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12, 500 hour voice search task, we find that the proposed changes improve the WER from 9.2% to 5.6%, while the best conventional system achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to 5% for the conventional system.
△ Less
Submitted 23 February, 2018; v1 submitted 5 December, 2017;
originally announced December 2017.
-
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
Authors:
Ron J. Weiss,
Jan Chorowski,
Navdeep Jaitly,
Yonghui Wu,
Zhifeng Chen
Abstract:
We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention archit…
▽ More
We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points.
△ Less
Submitted 12 June, 2017; v1 submitted 24 March, 2017;
originally announced March 2017.
-
CNN Architectures for Large-Scale Audio Classification
Authors:
Shawn Hershey,
Sourish Chaudhuri,
Daniel P. W. Ellis,
Jort F. Gemmeke,
Aren Jansen,
R. Channing Moore,
Manoj Plakal,
Devin Platt,
Rif A. Saurous,
Bryan Seybold,
Malcolm Slaney,
Ron J. Weiss,
Kevin Wilson
Abstract:
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying th…
▽ More
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
△ Less
Submitted 10 January, 2017; v1 submitted 29 September, 2016;
originally announced September 2016.