Showing 1–24 of 24 results for author: Umesh, S

  1. arXiv:2402.12094  [pdf, other]

    eess.AS cs.SD

    On the relationship between speech and hearing

    Authors: Srinivasan Umesh, Leon Cohen, Douglas Nelson

    Abstract: We present a framework for experimentally linking speech production and hearing. Using this approach, we describe experimental results that lead to the concept that sounds made by different individuals and perceived to be the same can be transformed into each other by a "speech scale". The speech scale is empirically determined using only speech data. We show the similarity of the speech scale to…

    Submitted 19 February, 2024; originally announced February 2024.

  2. arXiv:2312.13026  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning

    Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha

    Abstract: Continued pre-training (CP) offers multiple advantages, like target domain adaptation and the potential to exploit the continuous stream of unlabeled data available online. However, continued pre-training on out-of-domain distributions often leads to catastrophic forgetting of previously acquired knowledge, leading to sub-optimal ASR performance. This paper presents FusDom, a simple and novel meth…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted at ICASSP 2024. Code: https://github.com/cs20s030/fusdom

  3. arXiv:2312.12783  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition

    Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha

    Abstract: Continued self-supervised (SSL) pre-training for adapting existing SSL models to the target domain has been shown to be extremely effective for low-resource Automatic Speech Recognition (ASR). This paper proposes Stable Distillation, a simple and novel approach for SSL-based continued pre-training that boosts ASR performance in the target domain where both labeled and unlabeled data are limited. Stable…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024. Code: https://github.com/cs20s030/stable_distillation

  4. arXiv:2310.14654  [pdf, ps, other]

    cs.CL eess.AS

    SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras

    Authors: Nithya R, Malavika S, Jordan F, Arjun Gangwar, Metilda N J, S Umesh, Rithik Sarab, Akhilesh Kumar Dubey, Govind Divakaran, Samudra Vijaya K, Suryakanth V Gangashetty

    Abstract: India is home to a multitude of languages of which 22 languages are recognised by the Indian Constitution as official. Building speech based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate. To encourage the language technology community to build speech based applications in Indian languages, we are open sour…

    Submitted 24 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: 3 pages, About SPRING-INX Data

  5. arXiv:2308.01018  [pdf, other]

    cs.CL cs.SD eess.AS

    SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis

    Authors: Ramanan Sivaguru, Vasista Sai Lodagala, S Umesh

    Abstract: While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves scope for richer representations. As a part of this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: Accepted for publication at Interspeech 2023

  6. arXiv:2305.19584  [pdf, other]

    cs.CL eess.AS

    The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR

    Authors: Kaousheik Jayakumar, Vrunda N. Sukhadia, A Arunkumar, S. Umesh

    Abstract: Building a multilingual Automated Speech Recognition (ASR) system in a linguistically diverse country like India can be a challenging task due to the differences in scripts and the limited availability of speech data. This problem can be solved by exploiting the fact that many of these languages are phonetically similar. These languages can be converted into a Common Label Set (CLS) by mapping sim…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: 5 pages, 5 figures, submitted to INTERSPEECH 2023

  7. arXiv:2303.05668  [pdf, other]

    eess.AS cs.AI

    UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation

    Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha

    Abstract: In this paper, we introduce UnFuSeD, a novel approach to leverage self-supervised learning and reduce the need for large amounts of labeled data for audio classification. Unlike prior works, which directly fine-tune a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step. We first train…

    Submitted 17 May, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023 SASB Workshop

  8. arXiv:2211.01669  [pdf, other]

    eess.AS cs.SD eess.SP

    Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR

    Authors: Vrunda N. Sukhadia, A. Arunkumar, S. Umesh

    Abstract: This paper proposes a novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model when trained with speech pooled from two different channels (narrow and wide band). The joint encoder-decoder self-supervised model extends the HuBERT model with a Transformer decoder. HuBERT performs clustering of features and predicts the class of every input frame…

    Submitted 3 June, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: 5 pages, 5 figures

  9. arXiv:2211.01519  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    SLICER: Learning universal audio representations using low-resource self-supervised pre-training

    Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha

    Abstract: We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification. Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks in a low-resource un-labeled audio pre-training setting. Inspired by the recent…

    Submitted 17 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  10. arXiv:2211.01515  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    MAST: Multiscale Audio Spectrogram Transformers

    Authors: Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha

    Abstract: We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify and project it into an initial temporal resolution and embedding dimension, post which the multiple stages in MAST progressively expand the embedding dimension…

    Submitted 17 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  11. arXiv:2211.01338  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD eess.IV

    Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

    Authors: Anusha Prakash, Arun Kumar, Ashish Seth, Bhagyashree Mukherjee, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, K V Vikram, Mano Ranjith Kumar M, Metilda Sagaya Mary, Mohammad Wajahat, Mohana N, Mudit Batra, Navina K, Nihal John George, Nithya Ravi, Pruthwik Mishra, Sudhanshu Srivastava, Vasista Sai Lodagala, Vandan Mujadia, Kada Sai Venkata Vineeth, Vrunda Sukhadia, Dipti Sharma, Hema Murthy, Pushpak Bhattacharya, et al. (2 additional authors not shown)

    Abstract: Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages…

    Submitted 1 November, 2022; originally announced November 2022.

  12. arXiv:2211.01246  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup

    Authors: Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

    Abstract: In this paper, we propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc, for speech representation learning from unlabeled speech data. Our goal is to improve SSL for speech in domains where both unlabeled and labeled data are limited. Building on the recently introduced data2vec, we introduce additional modules to the data2vec framework that leverage the benefit of data augme…

    Submitted 13 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  13. arXiv:2210.02592  [pdf, other]

    cs.CL

    CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-supervised learning of speech representations

    Authors: Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

    Abstract: While Self-Supervised Learning has helped reap the benefit of the scale from the available unlabeled data, the learning paradigms are continuously being bettered. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those ne…

    Submitted 13 May, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  14. Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition

    Authors: A Arunkumar, Vrunda N Sukhadia, S. Umesh

    Abstract: Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these models optimizes a different loss which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL rep…

    Submitted 11 June, 2022; originally announced June 2022.

    Comments: 4 pages, 2 figures, submitted to Interspeech 2022

  15. arXiv:2203.16973  [pdf, other]

    cs.CL cs.SD eess.AS

    Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition

    Authors: Ashish Seth, Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh

    Abstract: Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings. However, the common assumption made in literature is that a considerable amount of unlabeled data is available for the same domain or language that can be leveraged for SSL pre-training, which we acknowledge is not fe…

    Submitted 17 May, 2023; v1 submitted 31 March, 2022; originally announced March 2022.

  16. arXiv:2203.16965  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations

    Authors: Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh

    Abstract: While self-supervised speech representation learning (SSL) models serve a variety of downstream tasks, these models have been observed to overfit to the domain from which the unlabelled data originates. To alleviate this issue, we propose PADA (Pruning Assisted Domain Adaptation) and zero out redundant weights from models pre-trained on large amounts of out-of-domain (OOD) data. Intuitively, this…

    Submitted 13 May, 2023; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to IEEE SLT 2022

  17. arXiv:2203.16028  [pdf, other]

    cs.CL cs.MM cs.SD eess.AS

    Span Classification with Structured Information for Disfluency Detection in Spoken Utterances

    Authors: Sreyan Ghosh, Sonal Kumar, Yaman Kumar Singla, Rajiv Ratn Shah, S. Umesh

    Abstract: Existing approaches in disfluency detection focus on solving a token-level classification task for identifying and removing disfluencies in text. Moreover, most works focus on leveraging only contextual information captured by the linear sequences in text, thus ignoring the structured information in text which is efficiently captured by dependency trees. In this paper, building on the span classif…

    Submitted 18 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

  18. arXiv:2203.13628  [pdf, other]

    cs.SD cs.CL eess.AS

    DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning

    Authors: Sreyan Ghosh, Ashish Seth, Deepak Mittal, Maneesh Singh, S. Umesh

    Abstract: Inspired by the recent progress in self-supervised learning for computer vision, in this paper we introduce DeLoRes, a new general-purpose audio representation learning approach. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute) that can generalize well across a diverse set of downstream tasks. Inspired by the Barlow Twins…

    Submitted 26 June, 2022; v1 submitted 25 March, 2022; originally announced March 2022.

    Comments: Accepted to AAAI 2022 workshop on Self-supervised Learning for Audio and Speech Processing

  19. Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models

    Authors: Vrunda N. Sukhadia, S. Umesh

    Abstract: In this paper, we investigate domain adaptation for low-resource Automatic Speech Recognition (ASR) of target-domain data, when a well-trained ASR model trained with a large dataset is available. We argue that in the encoder-decoder framework, the decoder of the well-trained ASR model is largely tuned towards the source-domain, hurting the performance of target-domain models in vanilla transfer-le…

    Submitted 29 May, 2023; v1 submitted 18 February, 2022; originally announced February 2022.

    Comments: 5 pages, 2 figures

  20. arXiv:2110.08895  [pdf, other]

    cs.SD cs.CL eess.AS

    DECAR: Deep Clustering for learning general-purpose Audio Representations

    Authors: Sreyan Ghosh, Sandesh V Katta, Ashish Seth, S. Umesh

    Abstract: We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it utilizes an offline clustering step to provide target labels that act as pseudo-labels for solving a prediction task. We develop on top of recent advances in self-supervised learning for computer vision and design a lightweight, easy-to-use self-supe…

    Submitted 14 March, 2023; v1 submitted 17 October, 2021; originally announced October 2021.

  21. arXiv:2110.07592  [pdf, other]

    cs.CL cs.SD eess.AS

    DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances

    Authors: Sreyan Ghosh, Samden Lepcha, S Sakshi, Rajiv Ratn Shah, S. Umesh

    Abstract: Toxic speech, also known as hate speech, is regarded as one of the crucial issues plaguing online social media today. Most recent work on toxic speech detection is constrained to the modality of text and written conversations with very limited work on toxicity detection from spoken utterances or using the modality of speech. In this paper, we introduce a new dataset DeToxy, the first publicly avai…

    Submitted 4 April, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Submitted to Interspeech 2022

  22. arXiv:2008.04659  [pdf, other]

    eess.AS cs.SD

    S-vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

    Authors: N J Metilda Sagaya Mary, S Umesh, Sandesh V Katta

    Abstract: One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context with layers. In this paper, we propose to derive speaker embeddings from Transformer's encoder trained for speaker classification. Self-attention, on which Transformer's encoder is built, attends to all the features over the entire utterance and might be…

    Submitted 12 December, 2021; v1 submitted 11 August, 2020; originally announced August 2020.

    Comments: Version 2, Accepted for publication in IEEE TASLP

  23. arXiv:2008.03247  [pdf, other]

    eess.AS cs.CV cs.SD

    Investigation of Speaker-adaptation methods in Transformer based ASR

    Authors: Vishwas M. Shetty, Metilda Sagaya Mary N J, S. Umesh

    Abstract: End-to-end models are fast replacing the conventional hybrid models in automatic speech recognition. Transformer, a sequence-to-sequence model, based on self-attention popularly used in machine translation tasks, has given promising results when used for automatic speech recognition. This paper explores different ways of incorporating speaker information at the encoder input while training a trans…

    Submitted 17 November, 2021; v1 submitted 7 August, 2020; originally announced August 2020.

    Comments: 5 pages, 6 figures

  24. arXiv:1307.4048  [pdf, ps, other]

    cs.LG cs.CV stat.ML

    Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition

    Authors: D. S. Pavan Kumar, N. Vishnu Prasad, Vikas Joshi, S. Umesh

    Abstract: In this paper, a modification to the training process of the popular SPLICE algorithm has been proposed for noise robust speech recognition. The modification is based on feature correlations, and enables this stereo-based algorithm to improve the performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets where clean…

    Submitted 15 July, 2013; originally announced July 2013.

    Comments: Submitted to Automatic Speech Recognition and Understanding (ASRU) 2013 Workshop