Showing 1–20 of 20 results for author: Gandhe, A

Searching in archive eess.
  1. arXiv:2406.09618  [pdf, other]

    cs.CL cs.AI cs.IR cs.SD eess.AS

    Multi-Modal Retrieval For Large Language Model Based Speech Recognition

    Authors: Jari Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ariya Rastrow, Grant Strimel, Ivan Bulyko

    Abstract: Retrieval is a widely adopted approach for improving language models by leveraging external information. As the field moves towards multi-modal large language models, it is important to extend purely text-based methods to incorporate other modalities in retrieval as well, for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieva…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2401.10447  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastrow, Jia Xu, Ivan Bulyko, Andreas Stolcke

    Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dat…

    Submitted 18 January, 2024; originally announced January 2024.

  3. arXiv:2401.02921  [pdf, other]

    cs.CL eess.AS

    Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

    Authors: Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke

    Abstract: In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors ca…

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2312.15316  [pdf, other]

    cs.CL eess.AS

    Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

    Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

    Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore pro…

    Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Camera-ready version

  5. arXiv:2310.06248  [pdf, other]

    eess.AS

    Discriminative Speech Recognition Rescoring with Pre-trained Language Models

    Authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Second-pass rescoring is a critical component of competitive automatic speech recognition (ASR) systems. Large language models have demonstrated their ability in using pre-trained information for better rescoring of ASR hypotheses. Discriminative training, directly optimizing the minimum word-error-rate (MWER) criterion, typically improves rescoring. In this study, we propose and explore several di…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: ASRU 2023

  6. arXiv:2309.15223  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastrow, Ivan Bulyko

    Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limits their practical use in rescoring. Here we p…

    Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  7. arXiv:2307.06832  [pdf, other]

    eess.AS cs.CL

    Personalization for BERT-based Discriminative Speech Recognition Rescoring

    Authors: Jari Kolehmainen, Yile Gu, Aditya Gourav, Prashanth Gurunath Shivakumar, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized…

    Submitted 13 July, 2023; originally announced July 2023.

  8. arXiv:2306.15815  [pdf, other]

    eess.AS

    Scaling Laws for Discriminative Speech Recognition Rescoring Models

    Authors: Yile Gu, Prashanth Gurunath Shivakumar, Jari Kolehmainen, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems. We focus…

    Submitted 27 June, 2023; originally announced June 2023.

  9. arXiv:2306.09452  [pdf, other]

    eess.AS

    Distillation Strategies for Discriminative Speech Recognition Rescoring

    Authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Second-pass rescoring is employed in most state-of-the-art speech recognition systems. Recently, BERT-based models have gained popularity for re-ranking the n-best hypotheses by exploiting the knowledge from masked language model pre-training. Further, fine-tuning with a discriminative loss such as minimum word error rate (MWER) has been shown to perform better than likelihood-based loss. Streaming appli…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  10. Streaming Speech-to-Confusion Network Speech Recognition

    Authors: Denis Filimonov, Prabhat Pandey, Ariya Rastrow, Ankur Gandhe, Andreas Stolcke

    Abstract: In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our mo…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: Submitted to Interspeech 2023

    Journal ref: Proc. Interspeech, Aug. 2023, pp. 4099-4103

  11. arXiv:2305.05271  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition

    Authors: Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, Jing Liu, Grant P. Strimel, Ross McGowan, Athanasios Mouchtaris

    Abstract: Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare-words in End-to-End Automatic Speech Recognition (E2E ASR) systems like neural transducers. These approaches employ cross-attention to bias the model towards specific contextual entities injected as bias-phrases to the model. Prior approaches typically relied on subw…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  12. PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers

    Authors: Rahul Pandey, Roger Ren, Qi Luo, Jing Liu, Ariya Rastrow, Ankur Gandhe, Denis Filimonov, Grant Strimel, Andreas Stolcke, Ivan Bulyko

    Abstract: End-to-End (E2E) automatic speech recognition (ASR) systems used in voice assistants often have difficulties recognizing infrequent words personalized to the user, such as names and places. Rare words often have non-trivial pronunciations, and in such cases, human knowledge in the form of a pronunciation lexicon can be useful. We propose a PROnunCiation-aware conTextual adaptER (PROCTER) that dyna…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: To appear in Proc. IEEE ICASSP

    Journal ref: Proc. IEEE ICASSP, June 2023

  13. arXiv:2303.10942  [pdf, other]

    cs.CL cs.SD eess.AS

    On-the-fly Text Retrieval for End-to-End ASR Adaptation

    Authors: Bolaji Yusuf, Aditya Gourav, Ankur Gandhe, Ivan Bulyko

    Abstract: End-to-end speech recognition models are improved by incorporating external text sources, typically by fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model…

    Submitted 20 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023; Appendix added to include ablations that could not fit into the conference 4-page limit

  14. arXiv:2202.06045  [pdf, other]

    cs.CL cs.SD eess.AS

    USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

    Authors: Bolaji Yusuf, Ankur Gandhe, Alex Sokolov

    Abstract: Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training an ASR model jointly with a set of text-to-text…

    Submitted 12 February, 2022; originally announced February 2022.

    Comments: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)

  15. arXiv:2202.01094  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    RescoreBERT: Discriminative Speech Recognition Rescoring with BERT

    Authors: Liyan Xu, Yile Gu, Jari Kolehmainen, Haidar Khan, Ankur Gandhe, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko

    Abstract: Second-pass rescoring is an important component in automatic speech recognition (ASR) systems that is used to improve the outputs from a first-pass decoder by implementing a lattice rescoring or $n$-best re-ranking. While pretraining with a masked language model (MLM) objective has received great success in various natural language understanding (NLU) tasks, it has not gained traction as a rescori…

    Submitted 18 February, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

    Journal ref: Proc. IEEE ICASSP, May 2022, pp. 6117-6121

  16. arXiv:2111.10157  [pdf, other]

    cs.CL cs.SD eess.AS

    Lattention: Lattice-attention in ASR rescoring

    Authors: Prabhat Pandey, Sergio Duarte Torres, Ali Orkan Bayer, Ankur Gandhe, Volker Leutnant

    Abstract: Lattices form a compact representation of multiple hypotheses generated from an automatic speech recognition system and have been shown to improve performance of downstream tasks like spoken language understanding and speech translation, compared to using the one-best hypothesis. In this work, we look into the effectiveness of lattice cues for rescoring n-best lists in the second pass. We encode lattices…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022

    ACM Class: I.2.7

  17. arXiv:2101.03229  [pdf, other]

    cs.CL cs.SD eess.AS

    Domain-aware Neural Language Models for Speech Recognition

    Authors: Linda Liu, Yile Gu, Aditya Gourav, Ankur Gandhe, Shashank Kalmane, Denis Filimonov, Ariya Rastrow, Ivan Bulyko

    Abstract: As voice assistants become more ubiquitous, they are increasingly expected to support and perform well on a wide variety of use-cases across different domains. We present a domain-aware rescoring framework suitable for achieving domain-adaptation during second-pass rescoring in production settings. In our framework, we fine-tune a domain-general neural language model on several domains, and use an…

    Submitted 16 February, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

    Comments: ICASSP 2021

  18. arXiv:2012.00133  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving accuracy of rare words for RNN-Transducer through unigram shallow fusion

    Authors: Vijay Ravi, Yile Gu, Ankur Gandhe, Ariya Rastrow, Linda Liu, Denis Filimonov, Scott Novotney, Ivan Bulyko

    Abstract: End-to-end automatic speech recognition (ASR) systems, such as recurrent neural network transducer (RNN-T), have become popular, but rare words remain a challenge. In this paper, we propose a simple, yet effective method called unigram shallow fusion (USF) to improve rare-word recognition for RNN-T. In USF, we extract rare words from RNN-T training data based on unigram count, and apply a fixed reward when t…

    Submitted 30 November, 2020; originally announced December 2020.

  19. arXiv:2011.11715  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Multi-task Language Modeling for Improving Speech Recognition of Rare Words

    Authors: Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, Ivan Bulyko

    Abstract: End-to-end automatic speech recognition (ASR) systems are increasingly popular due to their relative architectural simplicity and competitive performance. However, even though the average accuracy of these systems may be high, the performance on rare content words often lags behind hybrid ASR systems. To address this problem, second-pass rescoring is often applied leveraging upon language modeling…

    Submitted 11 September, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2021

  20. arXiv:1912.03363  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Audio-attention discriminative language model for ASR rescoring

    Authors: Ankur Gandhe, Ariya Rastrow

    Abstract: End-to-end approaches for automatic speech recognition (ASR) benefit from directly modeling the probability of the word sequence given the input audio stream in a single neural network. However, compared to conventional ASR systems, these models typically require more data to achieve comparable results. Well-known model adaptation techniques, to account for domain and style adaptation, are not eas…

    Submitted 18 February, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

    Comments: 4 pages, 1 figure, Accepted at ICASSP 2020