
Showing 1–27 of 27 results for author: Weiss, R

Searching in archive eess.
  1. arXiv:2210.10879  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

    Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

    Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as…

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 6 pages, accepted at SLT 2022. Updated with copyright

  2. arXiv:2201.04489  [pdf]

    eess.SY

    Power-to-Gas in a gas and electricity distribution network: a sensitivity analysis of modeling approaches

    Authors: Gabriele Fambri, Cesar Diaz-Londono, Andrea Mazza, Marco Badami, Robert Weiss

    Abstract: Power-to-Gas (P2G) has been one of the most frequently discussed technologies in the last few years. This technology allows producing CO2 free fuels. Thanks to its high flexibility, it may offer services to the power system, fostering Variable Renewable Energy Sources (VRES) and the electricity demand match, mitigating the issues related to VRES overproduction. The role of P2G plants connected to…

    Submitted 17 February, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

  3. arXiv:2112.10714  [pdf, other]

    cs.LG cs.CV cs.RO eess.SY

    Learning Spatio-Temporal Specifications for Dynamical Systems

    Authors: Suhail Alsalehi, Erfan Aasi, Ron Weiss, Calin Belta

    Abstract: Learning dynamical systems properties from data provides important insights that help us understand such systems and mitigate undesired outcomes. In this work, we propose a framework for learning spatio-temporal (ST) properties as formal logic specifications from data. We introduce SVM-STL, an extension of Signal Temporal Logic (STL), capable of specifying spatial and temporal properties of…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 12 pages, submitted to L4DC 2021

    MSC Class: I.5.3; I.5.4; B.1.0

    Journal ref: PMLR 168:968-980, 2022

  4. arXiv:2111.11790  [pdf]

    eess.SY

    Techno-economic analysis of Power-to-Gas plants in a gas and electricity distribution network system with high renewable energy penetration

    Authors: Gabriele Fambri, Cesar Diaz-Londono, Andrea Mazza, Marco Badami, Teemu Sihvonen, Robert Weiss

    Abstract: Distributed generation, based on the exploitation of Renewable Energy Sources (RES), has increased in the last few decades to limit anthropogenic carbon dioxide emissions, and this trend will increase in the future. However, RES generation is not dispatchable, and an increasing share of RES may lead to inefficiencies and even problems for the electricity network. Flexible resources are needed to h…

    Submitted 23 November, 2021; originally announced November 2021.

  5. arXiv:2106.09660  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

    Abstract: This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contrasts to the original WaveGrad vocoder which conditi…

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Proceedings of INTERSPEECH

  6. arXiv:2106.00847  [pdf, other]

    eess.AS cs.SD

    Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

    Authors: Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

    Abstract: Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. F…

    Submitted 16 October, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure. WASPAA 2021

  7. arXiv:2011.03568  [pdf, other]

    cs.CL cs.SD eess.AS

    Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

    Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

    Abstract: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within…

    Submitted 5 February, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: 6 pages including supplement, 3 figures. accepted to ICASSP 2021

  8. arXiv:2010.11439  [pdf, other]

    cs.SD eess.AS

    Parallel Tacotron: Non-Autoregressive and Controllable TTS

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

    Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, a…

    Submitted 22 October, 2020; originally announced October 2020.

  9. arXiv:2009.00713  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveGrad: Estimating Gradients for Waveform Generation

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan

    Abstract: This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade infere…

    Submitted 9 October, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

  10. arXiv:2006.12701  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised Sound Separation Using Mixture Invariant Training

    Authors: Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

    Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon…

    Submitted 23 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Accepted for spotlight presentation at NeurIPS 2020

  11. arXiv:2002.03788  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

    Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech,…

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  12. arXiv:2002.03785  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

    Abstract: This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with a…

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: to appear in ICASSP 2020

  13. arXiv:1907.04448  [pdf, other]

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related…

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  14. arXiv:1904.06037  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Direct speech-to-speech translation with a sequence-to-sequence model

    Authors: Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

    Abstract: We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice).…

    Submitted 25 June, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  15. arXiv:1904.04169  [pdf, other]

    eess.AS cs.SD

    Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

    Authors: Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia

    Abstract: We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speec…

    Submitted 29 October, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  16. arXiv:1904.02882  [pdf, other]

    cs.SD eess.AS

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    Authors: Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu

    Abstract: This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than…

    Submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted for Interspeech 2019, 7 pages

  17. arXiv:1902.07178  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    A spelling correction model for end-to-end speech recognition

    Authors: Jinxi Guo, Tara N. Sainath, Ron J. Weiss

    Abstract: Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is only trained on transcribed audio-text pairs, which leads to performance degradation especially on rare words. While th…

    Submitted 19 February, 2019; originally announced February 2019.

    Comments: Accepted to ICASSP 2019

  18. arXiv:1901.08810  [pdf, other]

    cs.LG eess.AS stat.ML

    Unsupervised speech representation learning using WaveNet autoencoders

    Authors: Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord

    Abstract: We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or backgroun…

    Submitted 11 September, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: Accepted to IEEE TASLP, final version available at http://dx.doi.org/10.1109/TASLP.2019.2938863

  19. arXiv:1811.02050  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

    Authors: Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, Yonghui Wu

    Abstract: End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora…

    Submitted 10 February, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: ICASSP 2019

  20. arXiv:1810.07217  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Hierarchical Generative Modeling for Controllable Speech Synthesis

    Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

    Abstract: This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarch…

    Submitted 27 December, 2018; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: 27 pages, accepted to ICLR 2019

  21. arXiv:1810.04826  [pdf, other]

    eess.AS cs.LG eess.SP stat.ML

    VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

    Authors: Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

    Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embe…

    Submitted 19 June, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

    Comments: To appear in Interspeech 2019

  22. arXiv:1806.08002  [pdf, other]

    cs.SD eess.AS

    Synthesizing Diverse, High-Quality Audio Textures

    Authors: Joseph Antognini, Matt Hoffman, Ron J. Weiss

    Abstract: Texture synthesis techniques based on matching the Gram matrix of feature activations in neural networks have achieved spectacular success in the image domain. In this paper we extend these techniques to the audio domain. We demonstrate that synthesizing diverse audio textures is challenging, and argue that this is because audio data is relatively low-dimensional. We therefore introduce two new te…

    Submitted 20 June, 2018; originally announced June 2018.

    Comments: 10 pages, submitted to TASLP

  23. arXiv:1806.04558  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

    Authors: Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

    Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers…

    Submitted 2 January, 2019; v1 submitted 12 June, 2018; originally announced June 2018.

    Comments: NeurIPS 2018

    Journal ref: Advances in Neural Information Processing Systems 31 (2018), 4485-4495

  24. arXiv:1803.09047  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

    Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synth…

    Submitted 23 March, 2018; originally announced March 2018.

  25. arXiv:1712.08363  [pdf, other]

    cs.SD eess.AS stat.ML

    On Using Backpropagation for Speech Texture Generation and Voice Conversion

    Authors: Jan Chorowski, Ron J. Weiss, Rif A. Saurous, Samy Bengio

    Abstract: Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and t…

    Submitted 8 March, 2018; v1 submitted 22 December, 2017; originally announced December 2017.

    Comments: Accepted to ICASSP 2018

  26. arXiv:1712.01769  [pdf, other]

    cs.CL cs.SD eess.AS stat.ML

    State-of-the-art Speech Recognition With Sequence-to-Sequence Models

    Authors: Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

    Abstract: Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such archite…

    Submitted 23 February, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

    Comments: ICASSP camera-ready version

  27. arXiv:1711.01694  [pdf, other]

    eess.AS cs.AI cs.CL

    Multilingual Speech Recognition With A Single End-To-End Model

    Authors: Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, Kanishka Rao

    Abstract: Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we presen…

    Submitted 15 February, 2018; v1 submitted 5 November, 2017; originally announced November 2017.

    Comments: Accepted in ICASSP 2018