Showing 1–41 of 41 results for author: Ramabhadran, B

  1. arXiv:2406.14701

    cs.AI cs.CL cs.SD eess.AS

    Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng

    Abstract: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or…

    Submitted 20 June, 2024; originally announced June 2024.

  2. arXiv:2406.06664

    eess.AS cs.LG cs.SD

    ASTRA: Aligning Speech and Text Representations for Asr without Sampling

    Authors: Neeraj Gaur, Rohan Agrawal, Gary Wang, Parisa Haghani, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper introduces ASTRA, a novel method for improving Automatic Speech Recognition (ASR) through text injection. Unlike prevailing techniques, ASTRA eliminates the need for sampling to match sequence lengths between speech and text modalities. Instead, it leverages the inherent alignments learned within CTC/RNNT models. This approach offers the following two advantages, namely, avoiding potenti…

    Submitted 13 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: To be published in Interspeech 2024

  3. arXiv:2406.02921

    cs.CL cs.AI cs.LG cs.NE eess.AS

    Text Injection for Neural Contextual Biasing

    Authors: Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

    Abstract: Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and it…

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure

    Journal ref: Interspeech 2024, Kos Island, Greece

  4. arXiv:2402.18932

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024

  5. arXiv:2308.07486

    cs.LG cs.CL cs.SD eess.AS

    O-1: Self-training with Oracle and 1-best Hypothesis

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi

    Abstract: We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR), that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew…

    Submitted 14 August, 2023; originally announced August 2023.

  6. arXiv:2308.07393

    cs.CL cs.SD eess.AS

    Using Text Injection to Improve Recognition of Personal Identifiers in Speech

    Authors: Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran

    Abstract: Accurate recognition of specific categories, such as persons' names, dates or other identifiers is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023

    MSC Class: 68T10 ACM Class: I.2.7

  7. arXiv:2306.08133

    eess.AS cs.CL

    Large-scale Language Model Rescoring on Long-form Data

    Authors: Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley

    Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30% relative on Salient Term Error Rate (STER)…

    Submitted 5 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted in ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  8. arXiv:2304.14514

    cs.CL cs.LG cs.SD eess.AS

    Understanding Shared Speech-Text Representations

    Authors: Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

    Abstract: Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First we examine the limits of speech-fr…

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: Accepted at ICASSP 2023, camera ready

  9. arXiv:2303.05958

    cs.CL cs.SD eess.AS stat.ML

    Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss

    Authors: Mohammad Zeineldeen, Kartik Audhkhasi, Murali Karthick Baskar, Bhuvana Ramabhadran

    Abstract: This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft dis…

    Submitted 10 March, 2023; originally announced March 2023.

    Comments: Accepted at ICASSP 2023

  10. arXiv:2303.01037

    cs.CL cs.SD eess.AS

    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, et al. (2 additional authors not shown)

    Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant…

    Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 20 pages, 7 figures, 8 tables

  11. arXiv:2302.08583

    eess.AS cs.AI cs.CL cs.LG cs.SD

    JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

    Authors: Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of E2…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: 5 pages, 3 figures, in ICASSP 2023

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes island, Greece

  12. arXiv:2210.17049

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modular Hybrid Autoregressive Transducer

    Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

    Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a…

    Submitted 16 February, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 8 pages, 1 figure, in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar

  13. arXiv:2210.15447

    cs.SD cs.CL eess.AS

    Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

    Authors: Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired da…

    Submitted 15 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: To appear in ICASSP 2023

  14. arXiv:2210.10879

    cs.LG cs.CL cs.SD eess.AS

    G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

    Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

    Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as…

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 6 pages, accepted at SLT 2022. Updated with copyright

  15. arXiv:2210.10027

    cs.CL cs.SD eess.AS

    Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

    Authors: Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

    Abstract: Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech a…

    Submitted 21 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT 2022

    MSC Class: 68T10 ACM Class: I.2.7

  16. arXiv:2209.06987

    cs.SD cs.LG eess.AS

    Non-Parallel Voice Conversion for ASR Augmentation

    Authors: Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yinghui Huang, Jesse Emond, Pedro Moreno Mengibar

    Abstract: Automatic speech recognition (ASR) needs to be robust to speaker differences. Voice Conversion (VC) modifies speaker characteristics of input speech. This is an attractive feature for ASR data augmentation. In this paper, we demonstrate that voice conversion can be used as a data augmentation technique to improve ASR performance, even on LibriSpeech, which contains 2,456 speakers. For ASR augmenta…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Accepted by Interspeech 2022

    MSC Class: 68T10 ACM Class: I.2.7

  17. arXiv:2209.06096

    cs.CL cs.SD eess.AS

    Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition

    Authors: Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

    Abstract: Attention layers are an integral part of modern end-to-end automatic speech recognition systems, for instance as part of the Transformer or Conformer architecture. Attention is typically multi-headed, where each head has an independent set of learned parameters and operates on the same input feature sequence. The output of multi-headed attention is a fusion of the outputs from the individual heads…

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: Accepted for publication in Interspeech 2022

  18. arXiv:2205.08014

    eess.AS cs.SD

    Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

    Authors: Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

    Abstract: Building inclusive speech recognition systems is a crucial step towards developing technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: 5 pages, 3 tables

  19. arXiv:2204.07553

    cs.CL cs.SD eess.AS

    Improving Rare Word Recognition with LM-aware MWER Training

    Authors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

    Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use…

    Submitted 27 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: To appear in INTERSPEECH 2022

  20. arXiv:2204.03409

    cs.CL cs.SD eess.AS

    MAESTRO: Matched Speech Text Representations through Modality Matching

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

    Abstract: We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task.…

    Submitted 1 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

    MSC Class: 68T10 ACM Class: I.2.7

  21. arXiv:2202.12719

    cs.SD cs.CL eess.AS

    Ask2Mask: Guided Data Selection for Masked Speech Modeling

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Pedro Moreno

    Abstract: Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant informati…

    Submitted 24 February, 2022; originally announced February 2022.

  22. arXiv:2109.13226

    eess.AS cs.CL cs.LG cs.SD

    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

    Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da…

    Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

  23. arXiv:2108.12226

    cs.CL cs.SD eess.AS

    Injecting Text in Self-Supervised Speech Pretraining

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

    Abstract: Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speec…

    Submitted 27 August, 2021; originally announced August 2021.

    Comments: submit to ASRU 2021

    MSC Class: 68T10 ACM Class: I.2.7

  24. arXiv:2008.06121

    eess.AS cs.LG

    LSTM Acoustic Models Learn to Align and Pronounce with Graphemes

    Authors: Arindrima Datta, Guanlong Zhao, Bhuvana Ramabhadran, Eugene Weinstein

    Abstract: Automated speech recognition coverage of the world's languages continues to expand. However, standard phoneme based systems require handcrafted lexicons that are difficult and expensive to obtain. To address this problem, we propose a training methodology for a grapheme-based speech recognizer that can be trained in a purely data-driven fashion. Built with LSTM networks and trained with the cross-…

    Submitted 13 August, 2020; originally announced August 2020.

    Comments: 5 pages, 4 figures. This work was done between summer 2018 and spring 2019

  25. arXiv:2004.09571

    eess.AS cs.SD stat.ML

    Language-agnostic Multilingual Modeling

    Authors: Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark

    Abstract: Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the data-scarce languages. However, most state-of-the-art multilingual models require the encoding of language information and therefore are not as flexible or scal…

    Submitted 20 April, 2020; originally announced April 2020.

  26. arXiv:2002.03788

    eess.AS cs.LG cs.SD stat.ML

    Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

    Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech,…

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  27. arXiv:1909.11699

    cs.CL cs.SD eess.AS

    Speech Recognition with Augmented Synthesized Speech

    Authors: Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu

    Abstract: Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and…

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: Accepted for publication at ASRU 2020

  28. arXiv:1909.05330

    eess.AS cs.LG cs.SD stat.ML

    Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

    Authors: Anjuli Kannan, Arindrima Datta, Tara N. Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, Seungji Lee

    Abstract: Multilingual end-to-end (E2E) models have shown great promise in expansion of automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work presents an E2E multilingual system which is equipped to operate in…

    Submitted 11 September, 2019; originally announced September 2019.

    Comments: Accepted in Interspeech 2019

  29. arXiv:1907.04448

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related…

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  30. arXiv:1802.02656

    cs.CL cs.SD eess.AS

    Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

    Authors: Xuesong Yang, Kartik Audhkhasi, Andrew Rosenberg, Samuel Thomas, Bhuvana Ramabhadran, Mark Hasegawa-Johnson

    Abstract: The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple accents involves pooling data from several accents during training and building a single model in multi-task fashion, where tasks correspond to i…

    Submitted 7 February, 2018; originally announced February 2018.

    Comments: Accepted in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)

  31. arXiv:1712.03133

    cs.CL cs.AI cs.NE stat.ML

    Building competitive direct acoustics-to-word models for English conversational speech recognition

    Authors: Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny

    Abstract: Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making train…

    Submitted 8 December, 2017; originally announced December 2017.

    Comments: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

  32. arXiv:1709.06436

    cs.CL

    Language Modeling with Highway LSTM

    Authors: Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy

    Abstract: Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks. In this paper, we extend an LSTM by adding highway networks inside an LSTM and use the resulting Highway LSTM (HW-LSTM) model for language modeling. The added highway networks increase the depth in the time dimension. Since a typical LSTM has two internal states, a memory…

    Submitted 19 September, 2017; originally announced September 2017.

    Comments: to appear in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017)

  33. arXiv:1703.07754

    cs.CL cs.NE stat.ML

    Direct Acoustics-to-Word Models for English Conversational Speech Recognition

    Authors: Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo

    Abstract: Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words with…

    Submitted 22 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech-2017

  34. arXiv:1703.02136

    cs.CL

    English Conversational Telephone Speech Recognition by Humans and Machines

    Authors: George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall

    Abstract: One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to b…

    Submitted 6 March, 2017; originally announced March 2017.

  35. arXiv:1701.04313

    cs.CL cs.IR cs.LG cs.NE

    End-to-End ASR-free Keyword Search from Speech

    Authors: Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury

    Abstract: End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-en…

    Submitted 13 January, 2017; originally announced January 2017.

    Comments: Published in the IEEE 2017 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2017), scheduled for 5-9 March 2017 in New Orleans, Louisiana, USA

  36. arXiv:1612.01928

    cs.CL cs.CV cs.LG cs.SD stat.ML

    Invariant Representations for Noisy Speech Recognition

    Authors: Dmitriy Serdyuk, Kartik Audhkhasi, Philémon Brakel, Bhuvana Ramabhadran, Samuel Thomas, Yoshua Bengio

    Abstract: Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness to variability is a challenge in modern day neural network-based ASR systems, especially when all types of variability are not seen during training. We attempt to address this problem by encouraging the neura…

    Submitted 27 November, 2016; originally announced December 2016.

    Comments: 5 pages, 1 figure, 1 table, NIPS workshop on end-to-end speech recognition

  37. arXiv:1606.04521

    cs.LG

    Training variance and performance evaluation of neural networks in speech

    Authors: Ewout van den Berg, Bhuvana Ramabhadran, Michael Picheny

    Abstract: In this work we study variance in the results of neural network training on a wide variety of configurations in automatic speech recognition. Although this variance itself is well known, this is, to the best of our knowledge, the first paper that performs an extensive empirical study on its effects in speech recognition. We view training as sampling from a distribution and show that these distribu…

    Submitted 14 June, 2016; originally announced June 2016.

  38. arXiv:1412.7063

    cs.CL cs.LG cs.NE

    Diverse Embedding Neural Network Language Models

    Authors: Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran

    Abstract: We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function…

    Submitted 15 April, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

    Comments: Under review as workshop contribution at ICLR 2015

  39. arXiv:1312.7463

    stat.ML cs.CV cs.LG

    Generalized Ambiguity Decomposition for Understanding Ensemble Diversity

    Authors: Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran, Shrikanth S. Narayanan

    Abstract: Diversity or complementarity of experts in ensemble pattern recognition and information processing systems is widely-observed by researchers to be crucial for achieving performance improvement upon fusion. Understanding this link between ensemble diversity and fusion performance is thus an important research question. However, prior works have theoretically characterized ensemble diversity and hav…

    Submitted 28 December, 2013; originally announced December 2013.

    Comments: 32 pages, 10 figures

    ACM Class: I.5

  40. arXiv:1309.1508

    cs.LG cs.CL cs.NE math.OC stat.ML

    Accelerating Hessian-free optimization for deep neural networks by implicit preconditioning and sampling

    Authors: Tara N. Sainath, Lior Horesh, Brian Kingsbury, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

    Abstract: Hessian-free training has become a popular parallel second order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-…

    Submitted 10 December, 2013; v1 submitted 5 September, 2013; originally announced September 2013.

    Comments: This paper is not supposed to be posted publicly before the conference in December due to company policy. Another co-author was not informed of this and posted without the permission of the first author. Please remove.

    MSC Class: 65K05; 90C15; 90C90

  41. arXiv:1309.1501

    cs.LG cs.CL cs.NE math.OC stat.ML

    Improvements to deep convolutional neural networks for LVCSR

    Authors: Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

    Abstract: Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further imp…

    Submitted 10 December, 2013; v1 submitted 5 September, 2013; originally announced September 2013.

    Comments: 6 pages, 1 figure

    MSC Class: 65K05; 90C15; 90C90