Showing 1–16 of 16 results for author: Ronanki, S

  1. arXiv:2406.17935

    cs.CL cs.SD eess.AS

    Sequential Editing for Lifelong Training of Speech Recognition Models

    Authors: Devang Kulshreshtha, Saket Dingliwal, Brady Houston, Nikolaos Pappas, Srikanth Ronanki

    Abstract: Automatic Speech Recognition (ASR) traditionally assumes known domains, but adding data from a new domain raises concerns about computational inefficiencies linked to retraining models on both existing and new domains. Fine-tuning solely on the new domain risks Catastrophic Forgetting (CF). To address this, Lifelong Learning (LLL) algorithms have been proposed for ASR. Prior research has explored tech…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  2. arXiv:2405.08317

    cs.CL cs.SD eess.AS

    SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

    Authors: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 9+6 pages, Submitted to ACL 2024

  3. arXiv:2405.08295

    cs.CL cs.SD eess.AS

    SpeechVerse: A Large-scale Generalizable Audio Language Model

    Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel…

    Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Single column, 13 pages

  4. arXiv:2311.08402

    cs.CL cs.IR cs.SD eess.AS

    Retrieve and Copy: Scaling ASR Personalization to Large Catalogs

    Authors: Sai Muralidhar Jayanthi, Devang Kulshreshtha, Saket Dingliwal, Srikanth Ronanki, Sravan Bodapati

    Abstract: Personalization of automatic speech recognition (ASR) models is a widely studied topic because of its many practical applications. Most recently, attention-based contextual biasing techniques are used to improve the recognition of rare words and domain specific entities. However, due to performance constraints, the biasing is often limited to a few thousand entities, restricting real-world usabili…

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023

  5. arXiv:2311.02482

    cs.SD cs.AI cs.LG eess.AS

    Generalized zero-shot audio-to-intent classification

    Authors: Veera Raghavendra Elluru, Devang Kulshreshtha, Rohit Paturi, Sravan Bodapati, Srikanth Ronanki

    Abstract: Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier by making use of a self-supervised pre-trai…

    Submitted 4 November, 2023; originally announced November 2023.

  6. arXiv:2306.08175

    eess.AS cs.AI cs.LG cs.SD

    DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR

    Authors: Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full and limited past context. To address this issue, we propose the…

    Submitted 1 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

  7. arXiv:2304.09325

    eess.AS cs.SD

    Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR

    Authors: Xilai Li, Goeric Huybrechts, Srikanth Ronanki, Jeff Farris, Sravan Bodapati

    Abstract: Recently, there has been an increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training and deployment cost. The best-known approaches rely on either window-based or dynamic chunk-based attention strategy and causal convolutions to minimize the degradation due to streaming. However, the performance gap still remains relatively large between…

    Submitted 25 April, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: 5 pages, 3 figures, 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)

  8. arXiv:2211.13280

    cs.CL cs.SD eess.AS

    Device Directedness with Contextual Cues for Spoken Dialog Systems

    Authors: Dhanush Bekal, Sundararajan Srinivasan, Sravan Bodapati, Srikanth Ronanki, Katrin Kirchhoff

    Abstract: In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infu…

    Submitted 23 November, 2022; originally announced November 2022.

  9. arXiv:2210.09510

    cs.CL cs.SD eess.AS

    Towards Personalization of CTC Speech Recognition Models with Contextual Adapters and Adaptive Boosting

    Authors: Saket Dingliwal, Monica Sunkara, Sravan Bodapati, Srikanth Ronanki, Jeff Farris, Katrin Kirchhoff

    Abstract: End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption that prevents output tokens from previ…

    Submitted 13 November, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: To appear in SLT 2022

  10. arXiv:2011.15023

    cs.CL eess.AS

    Transformer-Transducers for Code-Switched Speech Recognition

    Authors: Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff

    Abstract: We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages when having a conversation. As automatic speech recognition (ASR) systems are being deployed in the real world, there is a need for practical systems that can handle multiple languages both within an utterance and across utterances. In this paper,…

    Submitted 14 February, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: Accepted at ICASSP 2021

  11. arXiv:2008.00702

    eess.AS cs.CL

    Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

    Authors: Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff

    Abstract: In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features to word-level features and perform multimodal fusion of the resulting acoustic and lexical representatio…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Accepted for Interspeech 2020

  12. arXiv:2007.02025

    cs.CL cs.SD eess.AS

    Robust Prediction of Punctuation and Truecasing for Medical ASR

    Authors: Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, Katrin Kirchhoff

    Abstract: Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or…

    Submitted 11 July, 2020; v1 submitted 4 July, 2020; originally announced July 2020.

    Comments: Accepted for ACL NLPMC workshop 2020

  13. arXiv:1911.01601

    eess.AS cs.CR cs.SD eess.SP

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, et al. (15 additional authors not shown)

    Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to imperso…

    Submitted 14 July, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114

  14. arXiv:1907.02479

    eess.AS cs.CL

    Fine-grained robust prosody transfer for single-speaker neural text-to-speech

    Authors: Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman

    Abstract: We present a neural text-to-speech system for fine-grained prosody transfer from one speaker to another. Conventional approaches for end-to-end prosody transfer typically use either fixed-dimensional or variable-length prosody embedding via a secondary attention to encode the reference signal. However, when trained on a single-speaker dataset, the conventional prosody transfer systems are not robu…

    Submitted 4 July, 2019; originally announced July 2019.

    Comments: 5 pages, 7 figures, Accepted for Interspeech 2019

  15. arXiv:1904.02790

    cs.CL cs.LG eess.AS

    In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data

    Authors: Nishant Prateek, Mateusz Łajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, Thomas Merritt, Srikanth Ronanki, Trevor Wood

    Abstract: Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data. This makes creating models for multiple styles expensive and time-consuming. In this paper, different styles of speech are analysed based on prosodic variations; from this, a model is proposed to synthesise speech in the style of a n…

    Submitted 4 April, 2019; originally announced April 2019.

    Comments: Accepted at NAACL-HLT 2019

  16. arXiv:1811.06315

    cs.CL eess.AS

    Effect of data reduction on sequence-to-sequence neural TTS

    Authors: Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Klimkov Viacheslav

    Abstract: Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings. However, these models require large amounts of data. This paper shows that the lack of data from one speaker can be compensated with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances fro…

    Submitted 23 November, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: 4 pages, 1 extra for references. Submitted to ICASSP 2019