Skip to main content

Showing 1–3 of 3 results for author: Elias, I

Searching in archive eess. Search in all archives.
.
  1. arXiv:2402.18932  [pdf, other

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024

  2. arXiv:2103.14574  [pdf, other

    cs.SD eess.AS

    Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, RJ Skerry-Ryan, Yonghui Wu

    Abstract: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time War**, this model can learn token-frame alignments as well as token durations automatica… ▽ More

    Submitted 29 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  3. arXiv:2010.11439  [pdf, other

    cs.SD eess.AS

    Parallel Tacotron: Non-Autoregressive and Controllable TTS

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

    Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called \emph{Parallel Tacotron}, is highly parallelizable during both training and inference, a… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.