Showing 1–7 of 7 results for author: Kastner, K

Searching in archive eess.

  1. arXiv:2402.18932

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024

  2. arXiv:2401.04235

    cs.CL cs.SD eess.AS

    High-precision Voice Search Query Correction via Retrievable Speech-text Embedings

    Authors: Christopher Li, Gary Wang, Kyle Kastner, Heng Su, Allen Chen, Andrew Rosenberg, Zhehuai Chen, Zelin Wu, Leonid Velikovich, Pat Rondon, Diamantino Caseiro, Petar Aleksic

    Abstract: Automatic speech recognition (ASR) systems can suffer from poor recall for various reasons, such as noisy audio, lack of sufficient training data, etc. Previous work has shown that recall can be improved by retrieving rewrite candidates from a large database of likely, contextually-relevant alternatives to the hypothesis text using nearest-neighbors search over embeddings of the ASR hypothesis t…

    Submitted 8 January, 2024; originally announced January 2024.

  3. arXiv:2304.14514

    cs.CL cs.LG cs.SD eess.AS

    Understanding Shared Speech-Text Representations

    Authors: Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

    Abstract: Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First we examine the limits of speech-fr…

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: Accepted at ICASSP 2023, camera ready

  4. arXiv:2206.15276

    cs.SD cs.LG eess.AS

    R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

    Authors: Kyle Kastner, Aaron Courville

    Abstract: This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a Wave…

    Submitted 30 June, 2022; originally announced June 2022.

  5. arXiv:2112.09312

    cs.SD cs.LG eess.AS

    MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

    Authors: Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel

    Abstract: Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments…

    Submitted 17 March, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted by International Conference on Learning Representations (ICLR) 2022

  6. arXiv:1811.07426

    cs.SD cs.LG eess.AS stat.ML

    Harmonic Recomposition using Conditional Autoregressive Modeling

    Authors: Kyle Kastner, Rithesh Kumar, Tim Cooijmans, Aaron Courville

    Abstract: We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017). Recomposition (Casal & Casey, 2010) focuses on reworking existing musical pieces, adhering to structure at a high level while also re-imagining other aspects of the work. This can involve reuse of pre-existing themes or parts of the original piece, while…

    Submitted 18 November, 2018; originally announced November 2018.

    Comments: 3 pages, 2 figures. In Proceedings of The Joint Workshop on Machine Learning for Music, ICML 2018

  7. arXiv:1811.07240

    cs.LG cs.CL cs.SD eess.AS stat.ML

    Representation Mixing for TTS Synthesis

    Authors: Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville

    Abstract: Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information…

    Submitted 24 November, 2018; v1 submitted 17 November, 2018; originally announced November 2018.

    Comments: 5 pages, 3 figures