Showing 1–7 of 7 results for author: Paturi, R

Searching in archive eess.
  1. arXiv:2406.18679

    eess.AS cs.AI cs.CL cs.LG

    Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

    Authors: Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan

    Abstract: End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short on generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  2. arXiv:2406.17266

    eess.AS cs.AI cs.CL cs.LG

    AG-LSEC: Audio Grounded Lexical Speaker Error Correction

    Authors: Rohit Paturi, Xiang Li, Sundararajan Srinivasan

    Abstract: Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines and can have speaker errors due to SD and/or ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC), in which an external language model provides lexical inf…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  3. arXiv:2405.08295

    cs.CL cs.SD eess.AS

    SpeechVerse: A Large-scale Generalizable Audio Language Model

    Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

    Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel…

    Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Single column, 13 pages

  4. arXiv:2311.02482

    cs.SD cs.AI cs.LG eess.AS

    Generalized zero-shot audio-to-intent classification

    Authors: Veera Raghavendra Elluru, Devang Kulshreshtha, Rohit Paturi, Sravan Bodapati, Srikanth Ronanki

    Abstract: Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier by making use of a self-supervised pre-trai…

    Submitted 4 November, 2023; originally announced November 2023.

  5. arXiv:2311.00697

    cs.CL eess.AS

    End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

    Authors: Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico

    Abstract: Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combin…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023. Code: https://github.com/amazon-science/stac-speech-translation

  6. arXiv:2306.09313

    eess.AS cs.AI cs.CL cs.LG

    Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction

    Authors: Rohit Paturi, Sundararajan Srinivasan, Xiang Li

    Abstract: Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words. The conventional approach reconciles outputs from independently optimized ASR and SD systems, where the SD system typically uses only acoustic information to identify the speakers in the audio stream. This approach can lead to speaker errors especially around…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  7. arXiv:2112.05863

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech

    Authors: Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff, Daniel Garcia-Romero

    Abstract: Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. Most of these approaches need an additional stitching step to stitch the separated speech chunks for long form audio. Since most of the approaches involve Permutation Invariant training (PIT), the order of separated speech chunks is nondeterministic and…

    Submitted 6 September, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

    Comments: Accepted for publication at Interspeech 2022