Showing 1–29 of 29 results for author: Pang, R

Searching in archive eess.

  1. arXiv:2309.09843

    cs.CL cs.LG cs.SD eess.AS

    Instruction-Following Speech Recognition

    Authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang

    Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remai…

    Submitted 18 September, 2023; originally announced September 2023.

  2. arXiv:2304.00171

    cs.CL cs.SD eess.AS

    Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

    Authors: Rami Botros, Anmol Gulati, Tara N. Sainath, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu

    Abstract: Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these from memory at each inference step can slow down inference. In this paper, we design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs. We explore various ideas to impr…

    Submitted 31 March, 2023; originally announced April 2023.

  3. arXiv:2208.13916

    eess.AS cs.CL cs.SD

    A Language Agnostic Multilingual Streaming On-Device ASR System

    Authors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

    Abstract: On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device…

    Submitted 29 August, 2022; originally announced August 2022.

    Comments: Accepted in Interspeech 2022

  4. arXiv:2203.05008

    cs.CL cs.LG cs.SD eess.AS

    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

    Authors: W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

    Abstract: Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather…

    Submitted 15 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  5. arXiv:2109.13226

    eess.AS cs.CL cs.LG cs.SD

    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

    Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da…

    Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

  6. arXiv:2108.06209

    cs.LG cs.SD eess.AS

    W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

    Authors: Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu

    Abstract: Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens,…

    Submitted 13 September, 2021; v1 submitted 7 August, 2021; originally announced August 2021.

  7. arXiv:2104.14830

    cs.CL cs.SD eess.AS

    Scaling End-to-End Models for Large-Scale Multilingual ASR

    Authors: Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai

    Abstract: Building ASR models across many languages is a challenging multi-task learning problem due to large variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.…

    Submitted 11 September, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

    Comments: ASRU 2021

  8. arXiv:2104.14346

    cs.CL cs.SD eess.AS

    Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

    Authors: Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming mode…

    Submitted 25 April, 2021; originally announced April 2021.

  9. arXiv:2102.05610

    cs.CV eess.IV

    Searching for Fast Model Families on Datacenter Accelerators

    Authors: Sheng Li, Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le, Norman P. Jouppi

    Abstract: Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model famil…

    Submitted 10 February, 2021; originally announced February 2021.

  10. arXiv:2011.10798

    eess.AS cs.SD

    A Better and Faster End-to-End Model for Streaming ASR

    Authors: Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

    Abstract: End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this i…

    Submitted 11 February, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  11. arXiv:2010.14606

    eess.AS cs.CL cs.SD

    Cascaded encoders for unifying streaming and non-streaming ASR

    Authors: Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoder…

    Submitted 27 October, 2020; originally announced October 2020.

  12. arXiv:2010.12973

    cs.CL cs.SD eess.AS

    Unsupervised Learning of Disentangled Speech Content and Style Representation

    Authors: Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita

    Abstract: We present an approach for unsupervised learning of speech representation disentangling contents and styles. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance information; and (3) a conditional decoder that reconstructs speech given local and global latent variables. Our experiments show that (1) the local latent variab…

    Submitted 20 June, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  13. arXiv:2010.12096

    cs.SD cs.CL eess.AS

    Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

    Authors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a nov…

    Submitted 21 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

  14. arXiv:2010.11148

    eess.AS cs.AI cs.CL cs.LG cs.SD

    FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

    Authors: Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang

    Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction i…

    Submitted 3 February, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: Accepted in ICASSP 2021

  15. arXiv:2010.10504

    eess.AS cs.LG cs.SD

    Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu

    Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-e…

    Submitted 20 July, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: 11 pages, 3 figures, 5 tables. Accepted to NeurIPS SAS 2020 Workshop; v2: minor errors corrected

  16. arXiv:2010.06030

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

    Authors: Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang

    Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech reco…

    Submitted 27 January, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

    Comments: Accepted in ICLR 2021

  17. arXiv:2008.13093

    eess.AS cs.CL

    Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

    Authors: Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He

    Abstract: Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model re-scores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the qual…

    Submitted 2 September, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

    Comments: Proceedings of Interspeech, 2020

  18. arXiv:2008.10491

    eess.AS cs.LG

    Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

    Authors: Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been p…

    Submitted 25 August, 2020; v1 submitted 24 August, 2020; originally announced August 2020.

  19. arXiv:2005.10627

    eess.AS cs.CL cs.LG cs.SD

    Dynamic Sparsity Neural Networks for Automatic Speech Recognition

    Authors: Zhaofeng Wu, Ding Zhao, Qiao Liang, Jiahui Yu, Anmol Gulati, Ruoming Pang

    Abstract: In automatic speech recognition (ASR), model pruning is a widely adopted technique that reduces model size and latency to deploy neural network models on edge devices with resource constraints. However, multiple models with different sparsity levels usually need to be separately trained and deployed to heterogeneous target hardware with different resource specifications and for applications that h…

    Submitted 8 February, 2021; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: ICASSP 2021. (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  20. arXiv:2005.08100

    eess.AS cs.LG cs.SD

    Conformer: Convolution-augmented Transformer for Speech Recognition

    Authors: Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

    Abstract: Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution ne…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  21. arXiv:2005.03271

    eess.AS cs.CL

    RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

    Authors: Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

    Abstract: In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched-domains: e.g., end-to-end models trained on short segments perfo…

    Submitted 23 December, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

    Comments: SLT camera-ready version

  22. arXiv:2005.03191

    eess.AS cs.CL cs.LG cs.SD

    ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

    Authors: Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

    Abstract: Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into…

    Submitted 15 May, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  23. arXiv:2004.11544

    eess.AS

    Towards Fast and Accurate Streaming End-to-End ASR

    Authors: Bo Li, Shuo-yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu

    Abstract: End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition model into one neural network with a much smaller number of parameters than a conventional ASR system, thus making it suitable for on-device applications. For example, recurrent neural network transducer (RNN-T) as a streaming E2E model has shown promising potential for on-device ASR.…

    Submitted 12 May, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: Accepted in ICASSP 2020

  24. arXiv:2003.07962

    eess.AS cs.CL cs.SD

    Deliberation Model Based Two-Pass End-to-End Speech Recognition

    Authors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

    Abstract: End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, a…

    Submitted 17 March, 2020; originally announced March 2020.

  25. arXiv:1911.09070

    cs.CV cs.LG eess.IV

    EfficientDet: Scalable and Efficient Object Detection

    Authors: Mingxing Tan, Ruoming Pang, Quoc V. Le

    Abstract: Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multiscale feature fusion; Second, we propose a compound scal…

    Submitted 27 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

    Comments: CVPR 2020

    Journal ref: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)

  26. arXiv:1911.02242

    eess.AS cs.CL cs.LG cs.SD

    A comparison of end-to-end models for long-form speech recognition

    Authors: Chung-Cheng Chiu, Wei Han, Yu Zhang, Ruoming Pang, Sergey Kishchenko, Patrick Nguyen, Arun Narayanan, Hank Liao, Shuyuan Zhang, Anjuli Kannan, Rohit Prabhavalkar, Zhifeng Chen, Tara Sainath, Yonghui Wu

    Abstract: End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical…

    Submitted 6 November, 2019; originally announced November 2019.

    Comments: ASRU camera-ready version

  27. arXiv:1908.10992

    cs.CL cs.SD eess.AS

    Two-Pass End-to-End Speech Recognition

    Authors: Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu

    Abstract: The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidat…

    Submitted 28 August, 2019; originally announced August 2019.

  28. arXiv:1810.07217

    cs.CL cs.LG cs.SD eess.AS

    Hierarchical Generative Modeling for Controllable Speech Synthesis

    Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

    Abstract: This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarch…

    Submitted 27 December, 2018; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: 27 pages, accepted to ICLR 2019

  29. arXiv:1806.04558

    cs.CL cs.LG cs.SD eess.AS

    Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

    Authors: Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

    Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers…

    Submitted 2 January, 2019; v1 submitted 12 June, 2018; originally announced June 2018.

    Comments: NeurIPS 2018

    Journal ref: Advances in Neural Information Processing Systems 31 (2018), 4485-4495