Showing 1–37 of 37 results for author: Strohman, T

Searching in archive cs.
  1. arXiv:2401.08992  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

    Authors: Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman

    Abstract: The end-to-end ASR model is often desired in the streaming multilingual scenario since it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, e…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  2. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  3. arXiv:2310.17022  [pdf, other]

    cs.LG cs.AI cs.CL

    Controlled Decoding from Language Models

    Authors: Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami

    Abstract: KL-regularized reinforcement learning (RL) is a popular alignment framework to control the language model responses towards high reward outcomes. We pose a tokenwise RL objective and propose a modular solver for it, called controlled decoding (CD). CD exerts control through a separate prefix scorer module, which is trained to learn a value function for the reward. The prefix scorer is used at infe…

    Submitted 3 June, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: ICML 2024

  4. arXiv:2304.00171  [pdf, other]

    cs.CL cs.SD eess.AS

    Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

    Authors: Rami Botros, Anmol Gulati, Tara N. Sainath, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu

    Abstract: Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these from memory at each inference step can slow down inference. In this paper, we design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs. We explore various ideas to impr…

    Submitted 31 March, 2023; originally announced April 2023.

  5. arXiv:2303.01037  [pdf, other]

    cs.CL cs.SD eess.AS

    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, et al. (2 additional authors not shown)

    Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant…

    Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 20 pages, 7 figures, 8 tables

  6. arXiv:2302.11186  [pdf, other]

    eess.AS cs.CL cs.SD

    UML: A Universal Monolingual Output Layer for Multilingual ASR

    Authors: Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-yiin Chang

    Abstract: Word-piece models (WPMs) are commonly used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. For multilingual ASR, due to the differences in written scripts across languages, multilingual WPMs bring the challenges of having overly large output layers and scaling to more languages. In this work, we propose a universal monolingual output layer (UML) to address…

    Submitted 22 February, 2023; originally announced February 2023.

    Comments: Published as a conference paper at ICASSP 2023

  7. arXiv:2302.08917  [pdf, other]

    cs.CL cs.LG

    Massively Multilingual Shallow Fusion with Large Language Models

    Authors: Ke Hu, Tara N. Sainath, Bo Li, Nan Du, Yanping Huang, Andrew M. Dai, Yu Zhang, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman

    Abstract: While large language models (LLM) have made impressive progress in natural language processing, it remains unclear how to utilize them in improving automatic speech recognition (ASR). In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up using a mixtur…

    Submitted 17 February, 2023; originally announced February 2023.

    Comments: Accepted to IEEE ICASSP 2023

  8. arXiv:2302.01496  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Efficient Domain Adaptation for Speech Foundation Models

    Authors: Bo Li, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang, Wei Han, Trevor Strohman, Francoise Beaufays

    Abstract: Foundation models (FMs), that are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have brought large interest in the research community. Benefiting from the diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we presen…

    Submitted 2 February, 2023; originally announced February 2023.

  9. arXiv:2301.07851  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

    Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Submitted to ICASSP 2023. The project was initiated in May 2022 during a research internship at Google Research

  10. arXiv:2211.15432  [pdf, other]

    cs.CL

    E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

    Authors: W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

    Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated wi…

    Submitted 5 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  11. arXiv:2211.02712  [pdf, other]

    cs.LG cs.SD eess.AS

    Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

    Authors: Zhouyuan Huo, Khe Chai Sim, Bo Li, Dongseong Hwang, Tara N. Sainath, Trevor Strohman

    Abstract: Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks are expensive since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods) offer an alte…

    Submitted 4 November, 2022; originally announced November 2022.

  12. arXiv:2210.17049  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Modular Hybrid Autoregressive Transducer

    Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

    Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a…

    Submitted 16 February, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 8 pages, 1 figure, in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar

  13. arXiv:2210.07353  [pdf, other]

    cs.CL cs.SD eess.AS

    JOIST: A Joint Speech and Text Streaming Model For ASR

    Authors: Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

    Abstract: We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties comp…

    Submitted 13 October, 2022; originally announced October 2022.

  14. arXiv:2210.05793  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR

    Authors: Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman

    Abstract: Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train l…

    Submitted 28 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 8 pages, 2 figures

  15. arXiv:2209.06058  [pdf, other]

    eess.AS cs.CL

    Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

    Authors: Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani

    Abstract: Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascade…

    Submitted 13 September, 2022; originally announced September 2022.

  16. arXiv:2208.13916  [pdf, other]

    eess.AS cs.CL cs.SD

    A Language Agnostic Multilingual Streaming On-Device ASR System

    Authors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

    Abstract: On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device…

    Submitted 29 August, 2022; originally announced August 2022.

    Comments: Accepted in Interspeech 2022

  17. arXiv:2208.13322  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming Intended Query Detection using E2E Modeling for Continued Conversation

    Authors: Shuo-yiin Chang, Guru Prakash, Zelin Wu, Qiao Liang, Tara N. Sainath, Bo Li, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

    Abstract: In voice-enabled applications, a predetermined hotword is usually used to activate a device in order to attend to the query. However, speaking queries followed by a hotword each time introduces a cognitive burden in continued conversations. To avoid repeating a hotword, we propose a streaming end-to-end (E2E) intended query detector that identifies the utterances directed towards the device and filters…

    Submitted 28 August, 2022; originally announced August 2022.

    Comments: 5 pages, Interspeech 2022

  18. arXiv:2208.13321  [pdf, other]

    cs.CL cs.SD eess.AS

    Turn-Taking Prediction for Natural Conversational Speech

    Authors: Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

    Abstract: While a streaming voice assistant system has been used in many applications, this system typically focuses on unnatural, one-shot interactions assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, wo…

    Submitted 28 August, 2022; originally announced August 2022.

    Comments: 5 pages, Interspeech 2022

  19. arXiv:2206.14716  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Deliberation by Text-Only and Semi-Supervised Training

    Authors: Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

    Abstract: Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deli…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  20. arXiv:2204.07553  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Rare Word Recognition with LM-aware MWER Training

    Authors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

    Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use…

    Submitted 27 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: To appear in INTERSPEECH 2022

  21. arXiv:2204.06164  [pdf, other]

    eess.AS cs.LG cs.SD

    A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

    Authors: Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

    Abstract: In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separa…

    Submitted 24 June, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

    Comments: Accepted by INTERSPEECH 2022

  22. arXiv:2203.12668  [pdf, other]

    cs.LG cs.CL

    Pseudo Label Is Better Than Human Label

    Authors: Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

    Abstract: State-of-the-art automatic speech recognition (ASR) systems are trained with tens of thousands of hours of labeled speech data. Human transcription is expensive and time consuming. Factors such as the quality and consistency of the transcription can greatly affect the performance of the ASR models trained with these data. In this paper, we show that we can train a strong teacher model to produce h…

    Submitted 1 July, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: 6 pages, 2 figures, 9 tables, Proceedings of INTERSPEECH 2022

  23. arXiv:2203.05008  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

    Authors: W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

    Abstract: Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather…

    Submitted 15 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  24. arXiv:2110.03841  [pdf, ps, other]

    eess.AS cs.CL

    Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition

    Authors: Zhiyun Lu, Yanwei Pan, Thibault Doutre, Parisa Haghani, Liangliang Cao, Rohit Prabhavalkar, Chao Zhang, Trevor Strohman

    Abstract: End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. In this paper we study the effect of training utterance length on the word e…

    Submitted 1 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: submitted to INTERSPEECH 2022

  25. arXiv:2110.02220  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.NE

    Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition

    Authors: Tsendsuren Munkhdalai, Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Trevor Strohman, Françoise Beaufays

    Abstract: Fast contextual adaptation has shown to be effective in improving Automatic Speech Recognition (ASR) of rare words and when combined with an on-device personalized training, it can yield an even better recognition result. However, the traditional re-scoring approaches based on an external language model is prone to diverge during the personalized training. In this work, we introduce a model-based…

    Submitted 6 October, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 3 tables

  26. arXiv:2110.00165  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning

    Authors: Dongseong Hwang, Ananya Misra, Zhouyuan Huo, Nikhil Siddhartha, Shefali Garg, David Qiu, Khe Chai Sim, Trevor Strohman, Françoise Beaufays, Yanzhang He

    Abstract: Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance the model performance. However, the approach mostly focus on in-domain performance for public datasets. In this study, we utilize the combination of self- and semi-supervised learning methods to solve unseen domain adaptation problem in a large-scale production setting for online A…

    Submitted 15 February, 2022; v1 submitted 30 September, 2021; originally announced October 2021.

    Comments: ICASSP 2022 accepted, 5 pages, 2 figures, 5 tables

  27. arXiv:2110.00155  [pdf, other]

    cs.SD cs.LG eess.AS

    Incremental Layer-wise Self-Supervised Learning for Efficient Speech Domain Adaptation On Device

    Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

    Abstract: Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on…

    Submitted 30 September, 2021; originally announced October 2021.

    Comments: 5 pages

  28. arXiv:2104.04552  [pdf, other]

    cs.CL cs.SD eess.AS

    Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

    Authors: W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

    Abstract: We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table…

    Submitted 6 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Presented as conference paper at Interspeech 2021

  29. arXiv:2101.11577  [pdf, other]

    cs.CL

    Transformer Based Deliberation for Two-Pass Speech Recognition

    Authors: Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman

    Abstract: Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends…

    Submitted 27 January, 2021; originally announced January 2021.

  30. arXiv:2012.06749  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

    Authors: Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

    Abstract: End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since unique label histories correspond to distinct models states, such models are decoded using an approximate beam-search process which produces a tree of hypotheses. In this work, we study the influen…

    Submitted 12 December, 2020; originally announced December 2020.

  31. arXiv:2011.10798  [pdf, other]

    eess.AS cs.SD

    A Better and Faster End-to-End Model for Streaming ASR

    Authors: Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

    Abstract: End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this i…

    Submitted 11 February, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  32. arXiv:2010.14606  [pdf, other]

    eess.AS cs.CL cs.SD

    Cascaded encoders for unifying streaming and non-streaming ASR

    Authors: Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoder…

    Submitted 27 October, 2020; originally announced October 2020.

  33. arXiv:2010.11428  [pdf, other]

    eess.AS cs.CL cs.LG

    Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition

    Authors: Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C. Woodland, Liangliang Cao, Trevor Strohman

    Abstract: For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based seq…

    Submitted 23 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  34. arXiv:2003.12710  [pdf, other]

    cs.CL cs.LG cs.SD

    A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

    Authors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, et al. (4 additional authors not shown)

    Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that…

    Submitted 1 May, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

    Comments: In Proceedings of IEEE ICASSP 2020

  35. arXiv:1910.11455  [pdf, other]

    eess.AS cs.CL cs.SD

    Recognizing long-form speech using streaming end-to-end models

    Authors: Arun Narayanan, Rohit Prabhavalkar, Chung-Cheng Chiu, David Rybach, Tara N. Sainath, Trevor Strohman

    Abstract: All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose…

    Submitted 24 October, 2019; originally announced October 2019.

  36. arXiv:1908.10992  [pdf, other]

    cs.CL cs.SD eess.AS

    Two-Pass End-to-End Speech Recognition

    Authors: Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu

    Abstract: The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidat…

    Submitted 28 August, 2019; originally announced August 2019.

  37. arXiv:1808.05312  [pdf, other]

    cs.CL eess.AS

    Toward domain-invariant speech recognition via large scale training

    Authors: Arun Narayanan, Ananya Misra, Khe Chai Sim, Golan Pundak, Anshuman Tripathi, Mohamed Elfeky, Parisa Haghani, Trevor Strohman, Michiel Bacchiani

    Abstract: Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining larg…

    Submitted 15 August, 2018; originally announced August 2018.