Skip to main content

Showing 1–19 of 19 results for author: Rybach, D

Searching in archive cs. Search in all archives.
.
  1. arXiv:2306.08133  [pdf, ps, other

    eess.AS cs.CL

    Large-scale Language Model Rescoring on Long-form Data

    Authors: Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley

    Abstract: In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8\% relative reduction in Word Error Eate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30\% relative on Salient Term Error Rate (STER)… ▽ More

    Submitted 5 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 5 pages, accepted in ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  2. arXiv:2212.12442  [pdf, ps, other

    cs.CL cs.LG

    Alignment Entropy Regularization

    Authors: Ehsan Variani, Ke Wu, David Rybach, Cyril Allauzen, Michael Riley

    Abstract: Existing training criteria in automatic speech recognition(ASR) permit the model to freely explore more than one time alignments between the feature and label sequences. In this paper, we use entropy to measure a model's uncertainty, i.e. how it chooses to distribute the probability mass over the set of allowed alignments. Furthermore, we evaluate the effect of entropy regularization in encouragin… ▽ More

    Submitted 22 December, 2022; originally announced December 2022.

  3. arXiv:2211.15432  [pdf, other

    cs.CL

    E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

    Authors: W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

    Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated wi… ▽ More

    Submitted 5 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  4. arXiv:2205.13674  [pdf, ps, other

    cs.LG cs.AI cs.CL

    Global Normalization for Streaming Speech Recognition in a Modular Framework

    Authors: Ehsan Variani, Ke Wu, Michael Riley, David Rybach, Matt Shannon, Cyril Allauzen

    Abstract: We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming an… ▽ More

    Submitted 26 May, 2022; originally announced May 2022.

  5. arXiv:2204.10749  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

    Authors: W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

    Abstract: Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for… ▽ More

    Submitted 15 June, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

    Comments: Interspeech 2022

  6. arXiv:2204.07553  [pdf, other

    cs.CL cs.SD eess.AS

    Improving Rare Word Recognition with LM-aware MWER Training

    Authors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

    Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use… ▽ More

    Submitted 27 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: To appear in INTERSPEECH 2022

  7. arXiv:2201.06469  [pdf, ps, other

    cs.CL

    Handling Compounding in Mobile Keyboard Input

    Authors: Andreas Kabel, Keith Hall, Tom Ouyang, David Rybach, Daan van Esch, Françoise Beaufays

    Abstract: This paper proposes a framework to improve the ty** experience of mobile users in morphologically rich languages. Smartphone keyboards typically support features such as input decoding, corrections and predictions that all rely on language models. For latency reasons, these operations happen on device, so the models are of limited size and cannot easily cover all the words needed by users for th… ▽ More

    Submitted 17 January, 2022; originally announced January 2022.

    Comments: 7 pages

  8. arXiv:2104.04552  [pdf, other

    cs.CL cs.SD eess.AS

    Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

    Authors: W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

    Abstract: We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table… ▽ More

    Submitted 6 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Presented as conference paper at Interspeech 2021

  9. arXiv:2012.06749  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

    Authors: Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

    Abstract: End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since unique label histories correspond to distinct models states, such models are decoded using an approximate beam-search process which produces a tree of hypotheses. In this work, we study the influen… ▽ More

    Submitted 12 December, 2020; originally announced December 2020.

  10. arXiv:2003.12710  [pdf, other

    cs.CL cs.LG cs.SD

    A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

    Authors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho **, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman , et al. (4 additional authors not shown)

    Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that… ▽ More

    Submitted 1 May, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

    Comments: In Proceedings of IEEE ICASSP 2020

  11. arXiv:2003.07705  [pdf, ps, other

    eess.AS cs.CL cs.LG cs.SD

    Hybrid Autoregressive Transducer (hat)

    Authors: Ehsan Variani, David Rybach, Cyril Allauzen, Michael Riley

    Abstract: This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoderdecoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. This artic… ▽ More

    Submitted 12 March, 2020; originally announced March 2020.

  12. arXiv:1910.11455  [pdf, other

    eess.AS cs.CL cs.SD

    Recognizing long-form speech using streaming end-to-end models

    Authors: Arun Narayanan, Rohit Prabhavalkar, Chung-Cheng Chiu, David Rybach, Tara N. Sainath, Trevor Strohman

    Abstract: All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose… ▽ More

    Submitted 24 October, 2019; originally announced October 2019.

  13. arXiv:1908.10992  [pdf, other

    cs.CL cs.SD eess.AS

    Two-Pass End-to-End Speech Recognition

    Authors: Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu

    Abstract: The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidat… ▽ More

    Submitted 28 August, 2019; originally announced August 2019.

  14. arXiv:1902.08295  [pdf, other

    cs.LG stat.ML

    Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

    Authors: Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob , et al. (66 additional authors not shown)

    Abstract: Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w… ▽ More

    Submitted 21 February, 2019; originally announced February 2019.

  15. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition

    Authors: Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen

    Abstract: In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically reduces as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr,… ▽ More

    Submitted 23 July, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

    Comments: To appear in the proceedings of INTERSPEECH 2019

  16. arXiv:1811.06621  [pdf, other

    cs.CL

    Streaming End-to-end Speech Recognition For Mobile Devices

    Authors: Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-yiin Chang, Kanishka Rao, Alexander Gruenstein

    Abstract: End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specif… ▽ More

    Submitted 15 November, 2018; originally announced November 2018.

  17. arXiv:1712.01864  [pdf, other

    cs.CL cs.SD eess.AS stat.ML

    No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

    Authors: Tara N. Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, Yonghui Wu, Zhifeng Chen, Chung-Cheng Chiu

    Abstract: For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since th… ▽ More

    Submitted 5 December, 2017; originally announced December 2017.

  18. arXiv:1704.03987  [pdf, other

    cs.CL

    Mobile Keyboard Input Decoding with Finite-State Transducers

    Authors: Tom Ouyang, David Rybach, Françoise Beaufays, Michael Riley

    Abstract: We propose a finite-state transducer (FST) representation for the models used to decode keyboard inputs on mobile devices. Drawing from learnings from the field of speech recognition, we describe a decoding framework that can satisfy the strict memory and latency constraints of keyboard input. We extend this framework to support functionalities typically not present in speech recognition, such as… ▽ More

    Submitted 13 April, 2017; originally announced April 2017.

  19. arXiv:1603.03185  [pdf, other

    cs.CL cs.LG cs.SD

    Personalized Speech recognition on mobile devices

    Authors: Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, Carolina Parada

    Abstract: We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its… ▽ More

    Submitted 11 March, 2016; v1 submitted 10 March, 2016; originally announced March 2016.