Showing 1–14 of 14 results for author: Peyser, C

  1. arXiv:2406.02921  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE eess.AS

    Text Injection for Neural Contextual Biasing

    Authors: Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

    Abstract: Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and it…

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure

    Journal ref: Interspeech 2024, Kos Island, Greece

  2. arXiv:2308.06125  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Improving Joint Speech-Text Representations Without Alignment

    Authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

    Abstract: The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these metho…

    Submitted 11 August, 2023; originally announced August 2023.

    Journal ref: INTERSPEECH 2023

  3. arXiv:2304.11053  [pdf, other]

    cs.CL cs.SD eess.AS

    A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale

    Authors: Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, Ronny Huang, Tara Sainath

    Abstract: Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice…

    Submitted 19 April, 2023; originally announced April 2023.

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  4. arXiv:2301.04327  [pdf, other]

    cs.CL cs.AI

    Dual Learning for Large Vocabulary On-Device ASR

    Authors: Cal Peyser, Ronny Huang, Tara Sainath, Rohit Prabhavalkar, Michael Picheny, Kyunghyun Cho

    Abstract: Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address on…

    Submitted 11 January, 2023; originally announced January 2023.

  5. arXiv:2211.15432  [pdf, other]

    cs.CL

    E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

    Authors: W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman

    Abstract: We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated wi…

    Submitted 5 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  6. arXiv:2208.13191  [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Disentangled Speech Representations

    Authors: Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

    Abstract: The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint…

    Submitted 28 August, 2022; originally announced August 2022.

  7. arXiv:2204.10749  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

    Authors: W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

    Abstract: Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for…

    Submitted 15 June, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

    Comments: Interspeech 2022

  8. arXiv:2204.07553  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Rare Word Recognition with LM-aware MWER Training

    Authors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

    Abstract: Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use…

    Submitted 27 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: To appear in INTERSPEECH 2022

  9. arXiv:2203.05008  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

    Authors: W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

    Abstract: Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather…

    Submitted 15 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  10. arXiv:2104.04552  [pdf, other]

    cs.CL cs.SD eess.AS

    Lookup-Table Recurrent Language Models for Long Tail Speech Recognition

    Authors: W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

    Abstract: We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table…

    Submitted 6 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Presented as conference paper at Interspeech 2021

  11. arXiv:2008.10491  [pdf, other]

    eess.AS cs.LG

    Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

    Authors: Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been p…

    Submitted 25 August, 2020; v1 submitted 24 August, 2020; originally announced August 2020.

  12. arXiv:2005.09756  [pdf, other]

    eess.AS cs.LG cs.SD

    Improving Proper Noun Recognition in End-to-End ASR By Customization of the MWER Loss Criterion

    Authors: Cal Peyser, Tara N. Sainath, Golan Pundak

    Abstract: Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems in that a particular name may appear only rarely during training, and may have a pronunciation similar to that of a more common word. Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper noun pronunciations and a language model…

    Submitted 19 May, 2020; originally announced May 2020.

  13. arXiv:2003.12710  [pdf, other]

    cs.CL cs.LG cs.SD

    A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

    Authors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, et al. (4 additional authors not shown)

    Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that…

    Submitted 1 May, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

    Comments: In Proceedings of IEEE ICASSP 2020

  14. arXiv:1907.01372  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Improving Performance of End-to-End ASR on Numeric Sequences

    Authors: Cal Peyser, Hao Zhang, Tara N. Sainath, Zelin Wu

    Abstract: Recognizing written domain numeric utterances (e.g. I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g. I need one dollar and twenty five cents.), for which numeric sequences are composed of in-v…

    Submitted 1 July, 2019; originally announced July 2019.