Search | arXiv e-print repository

Accented Speech Recognition With Accent-specific Codebooks

Authors: Darshan Prabhu, Preethi Jyothi, Sriram Ganapathy, Vinit Unni

Abstract: Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems. Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR. In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. These learnable codebooks capture… ▽ More Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems. Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR. In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. These learnable codebooks capture accent-specific information and are integrated within the ASR encoder layers. The model is trained on accented English speech, while the test data also contained accents which were not seen during training. On the Mozilla Common Voice multi-accented dataset, we show that our proposed approach yields significant performance gains not only on the seen English accents (up to $37\%$ relative improvement in word error rate) but also on the unseen accents (up to $5\%$ relative improvement in WER). Further, we illustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We also compare the performance with other approaches based on accent adversarial training. △ Less

Submitted 26 October, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023 Main Conference (Long Paper)

arXiv:2307.05006 [pdf, ps, other]

Improving RNN-Transducers with Acoustic LookAhead

Authors: Vinit S. Unni, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi

Abstract: RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to s… ▽ More RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to strong LM biasing which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LookAhead that makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets. △ Less

Submitted 10 July, 2023; originally announced July 2023.

Comments: 5 pages, 1 fig, 7 tables, Proceedings of Interspeech 2023

arXiv:2203.02317 [pdf, other]

Adaptive Discounting of Implicit Language Models in RNN-Transducers

Authors: Vinit Unni, Shreya Khare, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi, Samarth Bharadwaj

Abstract: RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that a… ▽ More RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique AdaptLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. AdaptLMD uses a two-pronged approach: 1) Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on it's outputs. 2) Dynamically choose when to discount the implicit LM (ILM) based on rarity of recently predicted tokens and divergence between ILM and implicit acoustic model (IAM) scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task. △ Less

Submitted 21 February, 2022; originally announced March 2022.

Comments: Proceedings for ICASSP 2022

arXiv:2106.12758 [pdf, other]

doi 10.1063/5.0064215

Neural ODE to model and prognose thermoacoustic instability

Authors: Jayesh Dhadphale, Vishnu R. Unni, Abhishek Saha, R. I. Sujith

Abstract: In reacting flow systems, thermoacoustic instability characterized by high amplitude pressure fluctuations, is driven by a positive coupling between the unsteady heat release rate and the acoustic field of the combustor. When the underlying flow is turbulent, as a control parameter of the system is varied and the system approach thermoacoustic instability, the acoustic pressure oscillations synchr… ▽ More In reacting flow systems, thermoacoustic instability characterized by high amplitude pressure fluctuations, is driven by a positive coupling between the unsteady heat release rate and the acoustic field of the combustor. When the underlying flow is turbulent, as a control parameter of the system is varied and the system approach thermoacoustic instability, the acoustic pressure oscillations synchronize with heat release rate oscillations. Consequently, during the onset of thermoacoustic instability in turbulent combustors, the system dynamics transition from chaotic oscillations to periodic oscillations via a state of intermittency. Thermoacoustic systems are traditionally modeled by coupling the model for the unsteady heat source and the acoustic subsystem, each estimated independently. The response of the unsteady heat source, the flame, to acoustic fluctuations are characterized by introducing external unsteady forcing. This necessitates a powerful excitation module to obtain the nonlinear response of the flame to acoustic perturbations. Instead of characterizing individual subsystems, we introduce a neural ordinary differential equation (neural ODE) framework to model the thermoacoustic system as a whole. The neural ODE model for the thermoacoustic system uses time series of the heat release rate and the pressure fluctuations, measured simultaneously without introducing any external perturbations, to model their coupled interaction. Further, we use the parameters of neural ODE to define an anomaly measure that represents the proximity of system dynamics to limit cycle oscillations and thus provide an early warning signal for the onset of thermoacoustic instability. △ Less

Submitted 13 August, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

Comments: 31 pages, 12 figures

arXiv:2104.00235 [pdf, ps, other]

doi 10.21437/Interspeech.2021-1339

Multilingual and code-switching ASR challenges for low resource Indian languages

Authors: Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, Tejaswi Seeram, Basil Abraham

Abstract: Recently, there is increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of low amounts of labeled corpora in multiple languages. With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple language… ▽ More Recently, there is increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of low amounts of labeled corpora in multiple languages. With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics, linguistic characteristics as well as the amount of data available and how these are carefully considered in building the ASR system. In this challenge, we would like to focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English. We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively. △ Less

Submitted 31 March, 2021; originally announced April 2021.

Comments: 6 pages

Showing 1–5 of 5 results for author: Unni, V