Showing 1–50 of 57 results for author: Ganapathy, S

Searching in archive eess.

  1. arXiv:2406.09494  [pdf, other]

    eess.AS cs.LG

    The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments

    Authors: Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K T, S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy

    Abstract: The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this datas…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, Interspeech 2024

  2. arXiv:2401.12850  [pdf, other]

    eess.AS cs.AI cs.SD

    Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization

    Authors: Prachi Singh, Sriram Ganapathy

    Abstract: Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in an isolated fashion. While end-to-end diarization systems attempt to learn a single mod…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: 10 pages

  3. arXiv:2401.04511  [pdf, other]

    eess.AS cs.LG cs.SD

    Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement

    Authors: Soumya Dutta, Sriram Ganapathy

    Abstract: The problem of audio-to-audio (A2A) style transfer involves replacing the style features of the source audio with those from the target audio while preserving the content related attributes of the source audio. In this paper, we propose an efficient approach, termed as Zero-shot Emotion Style Transfer (ZEST), that allows the transfer of emotional content present in the given source audio with the…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, accepted at ICASSP 2024

  4. arXiv:2311.12564  [pdf]

    eess.AS cs.LG eess.SP

    Summary of the DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments

    Authors: Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy

    Abstract: In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The DISPLACE (DIarization of SPeaker and LAnguage in Conversational E…

    Submitted 3 January, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

  5. arXiv:2309.13537  [pdf, other]

    eess.AS cs.AI cs.SD

    Speech enhancement with frequency domain auto-regressive modeling

    Authors: Anurenjan Purushothaman, Debottam Dutta, Rohit Kumar, Sriram Ganapathy

    Abstract: Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation. The task of dereverberation constitutes an important step to improve the audible quality and to reduce the error rates in applications like automatic speech recognition (ASR). We propose a unified framework of speech dereverberation for improving the speech quality and the ASR performa…

    Submitted 23 September, 2023; originally announced September 2023.

    Comments: 10 pages

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 2023

  6. arXiv:2309.10567  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multimodal Modeling For Spoken Language Identification

    Authors: Shikhar Bharadwaj, Min Ma, Shikhar Vashishth, Ankur Bapna, Sriram Ganapathy, Vera Axelrod, Siddharth Dalmia, Wei Han, Yu Zhang, Daan van Esch, Sandy Ritchie, Partha Talukdar, Jason Riesa

    Abstract: Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI,…

    Submitted 19 September, 2023; originally announced September 2023.

  7. arXiv:2307.10982  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    MASR: Multi-label Aware Speech Representation

    Authors: Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth

    Abstract: In recent years, speech representation learning is constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side-information that is often available for a given speech recording. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables th…

    Submitted 25 September, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted at ASRU 2023

  8. arXiv:2307.07325  [pdf, other]

    eess.AS cs.AI cs.LG

    Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications

    Authors: Varun Krishna, Tarun Sai, Sriram Ganapathy

    Abstract: The representation learning of speech, without textual resources, is an area of significant interest for many low resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.…

    Submitted 14 July, 2023; originally announced July 2023.

  9. arXiv:2307.00366  [pdf, other]

    eess.AS

    Enhancing the EEG Speech Match Mismatch Tasks With Word Boundaries

    Authors: Akshara Soman, Vidhi Sinha, Sriram Ganapathy

    Abstract: Recent studies have shown that the underlying neural mechanisms of human speech comprehension can be analyzed using a match-mismatch classification of the speech stimulus and the neural response. However, such studies have been conducted for fixed-duration segments without accounting for the discrete processing of speech in the brain. In this work, we establish that word boundary information plays…

    Submitted 1 July, 2023; originally announced July 2023.

    Comments: 5 pages, 4 figures, 4 tables, accepted to Interspeech 2023 conference

  10. arXiv:2306.04374  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Label Aware Speech Representation Learning For Language Identification

    Authors: Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han, Vera Axelrod, Partha Talukdar

    Abstract: Speech representation learning approaches for non-semantic tasks such as language recognition have either explored supervised embedding extraction methods using a classifier model or self-supervised representation learning approaches using raw data. In this paper, we propose a novel framework of combining self-supervised representation learning with the language label information for the pre-train…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  11. arXiv:2305.12741  [pdf, other]

    eess.AS cs.LG cs.SD q-bio.QM

    Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection

    Authors: Debarpan Bhattacharya, Neeraj Kumar Sharma, Debottam Dutta, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan

    Abstract: This paper presents the Coswara dataset, a dataset containing a diverse set of respiratory sounds and rich meta-data, recorded between April-2020 and February-2022 from 2635 individuals (1819 SARS-CoV-2 negative, 674 positive, and 142 recovered subjects). The respiratory sounds contained nine sound categories associated with variants of breathing, cough and speech. The rich metadata contained demogr…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted for publication in Nature Scientific Data

  12. arXiv:2304.06910  [pdf, other]

    eess.AS cs.CL cs.SD

    HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

    Authors: Soumya Dutta, Sriram Ganapathy

    Abstract: Emotion recognition in conversations is challenging due to the multi-modal nature of the emotion expression. We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities, i) audio data, processed through a learnable wav2vec approach and, i…

    Submitted 9 January, 2024; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: 11 pages, 6 figures

  13. arXiv:2303.00830  [pdf, other]

    eess.AS cs.SD eess.SP

    DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments

    Authors: Shikha Baghel, Shreyas Ramoji, Sidharth, Ranjana H, Prachi Singh, Somil Jain, Pratik Roy Chowdhuri, Kaustubh Kulkarni, Swapnil Padhi, Deepu Vijayasenan, Sriram Ganapathy

    Abstract: In multilingual societies, social conversations often involve code-mixed speech. The current speech technology may not be well equipped to extract information from multi-lingual multi-speaker conversations. The DISPLACE challenge entails a first-of-kind task to benchmark speaker and language diarization on the same data, as the data contains multi-speaker conversations in multilingual code-mixed s…

    Submitted 5 June, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

  14. arXiv:2302.12716  [pdf, other]

    cs.SD cs.LG eess.AS

    Supervised Hierarchical Clustering using Graph Neural Networks for Speaker Diarization

    Authors: Prachi Singh, Amrit Kaul, Sriram Ganapathy

    Abstract: Conventional methods for speaker diarization involve windowing an audio file into short segments to extract speaker embeddings, followed by an unsupervised clustering of the embeddings. This multi-step approach generates speaker assignments for each segment. In this paper, we propose a novel Supervised HierArchical gRaph Clustering algorithm (SHARC) for speaker diarization where we introduce a hie…

    Submitted 24 February, 2023; originally announced February 2023.

    Comments: 5 pages including references. Accepted in ICASSP 2023

  15. arXiv:2208.12410  [pdf, other]

    cs.SD cs.LG eess.AS

    Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer

    Authors: Shrutina Agarwal, Sriram Ganapathy, Naoya Takahashi

    Abstract: In this paper, we propose a model to perform style transfer of speech to singing voice. Contrary to the previous signal processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach for the problem of converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alig…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: accepted to INTERSPEECH 2022

  16. arXiv:2206.13365  [pdf, other]

    eess.AS cs.LG cs.SD

    Interpretable Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection

    Authors: Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, Amir H. Poorjam, Deepak Mittal, Maneesh Singh

    Abstract: In this paper, we describe an approach for representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions. The choice of these kernels allows the interpretation of the filterbanks as smooth band-pass filters. The filtered outputs are pooled, log-c…

    Submitted 27 June, 2022; originally announced June 2022.

  17. arXiv:2206.12309  [pdf, other]

    eess.AS cs.LG eess.SP

    Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals

    Authors: Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan

    Abstract: The COVID-19 outbreak resulted in multiple waves of infections that have been associated with different SARS-CoV-2 variants. Studies have reported differential impact of the variants on respiratory health of patients. We explore whether acoustic signals, collected from COVID-19 subjects, show computationally distinguishable acoustic patterns suggesting a possibility to predict the underlying virus…

    Submitted 24 June, 2022; originally announced June 2022.

    Journal ref: Interspeech, 2022

  18. arXiv:2206.05462  [pdf, other]

    eess.AS cs.LG cs.SD

    Svadhyaya system for the Second Diagnosing COVID-19 using Acoustics Challenge 2021

    Authors: Deepak Mittal, Amir H. Poorjam, Debottam Dutta, Debarpan Bhattacharya, Zemin Yu, Sriram Ganapathy, Maneesh Singh

    Abstract: This report describes the system used for detecting COVID-19 positives using three different acoustic modalities, namely speech, breathing, and cough in the second DiCOVA challenge. The proposed system is based on the combination of 4 different approaches, each focusing more on one aspect of the problem, and reaches the blind test AUCs of 86.41, 77.60, and 84.55, in the breathing, cough, and speec…

    Submitted 11 June, 2022; originally announced June 2022.

  19. arXiv:2206.05053  [pdf, other]

    cs.HC cs.LG cs.SD eess.AS eess.SP

    Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms

    Authors: Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan

    Abstract: The COVID-19 pandemic has accelerated research on design of alternative, quick and effective COVID-19 diagnosis approaches. In this paper, we describe the Coswara tool, a website application designed to enable COVID-19 detection by analysing respiratory sound samples and health symptoms. A user using this service can log into a website using any device connected to the internet, provide their curr…

    Submitted 9 June, 2022; originally announced June 2022.

    Journal ref: Interspeech, 2022

  20. arXiv:2110.01177  [pdf, other]

    eess.AS cs.SD q-bio.QM

    The Second DiCOVA Challenge: Dataset and performance analysis for COVID-19 diagnosis using acoustics

    Authors: Neeraj Kumar Sharma, Srikanth Raj Chetupalli, Debarpan Bhattacharya, Debottam Dutta, Pravin Mote, Sriram Ganapathy

    Abstract: The Second Diagnosis of COVID-19 using Acoustics (DiCOVA) Challenge aimed at accelerating the research in acoustics based detection of COVID-19, a topic at the intersection of acoustics, signal processing, machine learning, and healthcare. This paper presents the details of the challenge, which was an open call for researchers to analyze a dataset of audio recordings consisting of breathing, cough…

    Submitted 11 October, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

  21. arXiv:2109.06824  [pdf, other]

    eess.AS cs.SD

    Self-Supervised Metric Learning With Graph Clustering For Speaker Diarization

    Authors: Prachi Singh, Sriram Ganapathy

    Abstract: In this paper, we propose a novel algorithm for speaker diarization using metric learning for graph based clustering. The graph clustering algorithms use an adjacency matrix consisting of similarity scores. These scores are computed between speaker embeddings extracted from pairs of audio segments within the given recording. In this paper, we propose an approach that jointly learns the speaker emb…

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: 8 pages, Accepted in ASRU 2021

  22. arXiv:2108.05520  [pdf, other]

    eess.AS cs.SD eess.SP

    Dereverberation of Autoregressive Envelopes for Far-field Speech Recognition

    Authors: Anurenjan Purushothaman, Anirudh Sreeram, Rohit Kumar, Sriram Ganapathy

    Abstract: The task of speech recognition in far-field environments is adversely affected by the reverberant artifacts that elicit as the temporal smearing of the sub-band envelopes. In this paper, we develop a neural model for speech dereverberation using the long-term sub-band envelopes of speech. The sub-band envelopes are derived using frequency domain linear prediction (FDLP) which performs an autoregre…

    Submitted 13 August, 2021; v1 submitted 12 August, 2021; originally announced August 2021.

    Comments: arXiv admin note: text overlap with arXiv:2008.03339

  23. arXiv:2108.03975  [pdf, other]

    eess.AS

    End-to-End Speech Recognition With Joint Dereverberation Of Sub-Band Autoregressive Envelopes

    Authors: Rohit Kumar, Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy

    Abstract: The end-to-end (E2E) automatic speech recognition (ASR) systems are often required to operate in reverberant conditions, where the long-term sub-band envelopes of the speech are temporally smeared. In this paper, we develop a feature enhancement approach using a neural model operating on sub-band temporal envelopes. The temporal envelopes are modeled using the framework of frequency domain linear…

    Submitted 17 February, 2022; v1 submitted 9 August, 2021; originally announced August 2021.

    Comments: 5 pages with references, E2E ASR

  24. arXiv:2107.14793  [pdf, other]

    eess.AS cs.SD eess.SP

    A Multi-Head Relevance Weighting Framework For Learning Raw Waveform Audio Representations

    Authors: Debottam Dutta, Purvi Agrawal, Sriram Ganapathy

    Abstract: In this work, we propose a multi-head relevance weighting framework to learn audio representations from raw waveforms. The audio waveform, split into windows of short duration, is processed with a 1-D convolutional layer of cosine modulated Gaussian filters acting as a learnable filterbank. The key novelty of the proposed framework is the introduction of multi-head relevance on the learnt filterb…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: Submitted to 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2021)

  25. arXiv:2106.12763  [pdf, other]

    eess.AS cs.SD eess.IV eess.SP

    SRIB-LEAP submission to Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing

    Authors: R G Prithvi Raj, Rohit Kumar, M K Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M A Basha Shaik

    Abstract: This paper presents the details of the SRIB-LEAP submission to the ConferencingSpeech challenge 2021. The challenge involved the task of multi-channel speech enhancement to improve the quality of far field speech from microphone arrays in a video conferencing room. We propose a two stage method involving a beamformer followed by single channel enhancement. For the beamformer, we incorporated self-…

    Submitted 24 June, 2021; originally announced June 2021.

  26. arXiv:2106.10997  [pdf, other]

    eess.AS cs.SD

    Towards sound based testing of COVID-19 -- Summary of the first Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge

    Authors: Neeraj Kumar Sharma, Ananya Muguli, Prashant Krishnan, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy

    Abstract: The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed a growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing and speech sounds, and using them for building POCTs can offer fast, contactless and inexpensive testing. In view of this, over the past year, we launched the "Coswara"…

    Submitted 21 June, 2021; originally announced June 2021.

    Comments: Manuscript in review in the Elsevier Computer Speech and Language journal

  27. arXiv:2106.00639  [pdf, other]

    eess.AS cs.SD eess.SP

    Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms

    Authors: Srikanth Raj Chetupalli, Prashant Krishnan, Neeraj Sharma, Ananya Muguli, Rohit Kumar, Viral Nanda, Lancelot Mark Pinto, Prasanta Kumar Ghosh, Sriram Ganapathy

    Abstract: The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of the COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostic using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, are recorded using a…

    Submitted 5 June, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: The Manuscript is submitted to IEEE-EMBS Journal of Biomedical and Health Informatics on June 1, 2021

  28. arXiv:2105.08492  [pdf, other]

    eess.AS cs.SD eess.SP q-bio.QM

    Deep Correlation Analysis for Audio-EEG Decoding

    Authors: Jaswanth Reddy Katthi, Sriram Ganapathy

    Abstract: The electroencephalography (EEG), which is one of the easiest modes of recording brain activations in a non-invasive manner, is often distorted due to recording artifacts which adversely impact the stimulus-response analysis. The most prominent techniques thus far attempt to improve the stimulus-response correlations using linear methods. In this paper, we propose a neural network based correlati…

    Submitted 27 November, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

    Comments: Accepted to IEEE Transactions on Neural Systems and Rehabilitation Engineering

  29. Self-supervised Representation Learning With Path Integral Clustering For Speaker Diarization

    Authors: Prachi Singh, Sriram Ganapathy

    Abstract: Automatic speaker diarization techniques typically involve a two-stage processing approach where audio segments of fixed duration are converted to vector representations in the first stage. This is followed by an unsupervised clustering of the representations in the second stage. In most of the prior approaches, these two stages are performed in an isolated manner with independent optimization ste…

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: 11 pages, Accepted in IEEE Transactions on Audio, Speech and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021

  30. arXiv:2104.02359  [pdf, other]

    eess.AS

    LEAP Submission for the Third DIHARD Diarization Challenge

    Authors: Prachi Singh, Rajat Varma, Venkat Krishnamohan, Srikanth Raj Chetupalli, Sriram Ganapathy

    Abstract: The LEAP submission for DIHARD-III challenge is described in this paper. The proposed system is composed of a speech bandwidth classifier, and diarization systems fine-tuned for narrowband and wideband speech separately. We use an end-to-end speaker diarization system for the narrowband conversational telephone speech recordings. For the wideband multi-speaker recordings, we use a neural embedding…

    Submitted 14 June, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: Accepted in INTERSPEECH 2021

  31. arXiv:2104.01882  [pdf, other]

    eess.AS

    Speaker conditioned acoustic modeling for multi-speaker conversational ASR

    Authors: Srikanth Raj Chetupalli, Sriram Ganapathy

    Abstract: In this paper, we propose a novel approach for the transcription of speech conversations with natural speaker overlap, from single channel speech recordings. The proposed model is a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system. The speaker conditioned acoustic model (SCAM) in the ASR system consists of a series of embedding layers which use the…

    Submitted 29 August, 2022; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Manuscript accepted for presentation at Interspeech 2022

  32. arXiv:2103.09148  [pdf, other]

    eess.AS cs.SD

    DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics

    Authors: Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda

    Abstract: The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. This challenge is an open call for researchers to analyze a dataset of sound recordings collected from COVID-19 infected and non-COVID-19 individuals for a two-class classification. These…

    Submitted 17 June, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

    Comments: To appear in Proceedings of Interspeech, 2021

  33. arXiv:2103.06478  [pdf, other]

    eess.AS cs.SD q-bio.QM

    Deep Multiway Canonical Correlation Analysis for Multi-Subject EEG Normalization

    Authors: Jaswanth Reddy Katthi, Sriram Ganapathy

    Abstract: The normalization of brain recordings from multiple subjects responding to the natural stimuli is one of the key challenges in auditory neuroscience. The objective of this normalization is to transform the brain data in such a way as to remove the inter-subject redundancies and to boost the component related to the stimuli. In this paper, we propose a deep learning framework to improve the correla…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 5 pages, 2 figures, 2 tables, to be published in ICASSP 2021

  34. arXiv:2102.08575  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    End-to-end lyrics Recognition with Voice to Singing Style Transfer

    Authors: Sakya Basak, Shrutina Agarwal, Sriram Ganapathy, Naoya Takahashi

    Abstract: Automatic transcription of monophonic/polyphonic music is a challenging task due to the lack of availability of large amounts of transcribed data. In this paper, we propose a data augmentation method that converts natural speech to singing voice based on a vocoder based speech synthesizer. This approach, called voice to singing (V2S), performs the voice style conversion by modulating the F0 contour…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: accepted at ICASSP 2021

  35. arXiv:2102.07390  [pdf, other]

    eess.AS

    Representation Learning For Speech Recognition Using Feedback Based Relevance Weighting

    Authors: Purvi Agrawal, Sriram Ganapathy

    Abstract: In this work, we propose an acoustic embedding based approach for representation learning in speech recognition. The proposed approach involves two stages comprising acoustic filterbank learning from raw waveform, followed by modulation filterbank learning. In each stage, a relevance weighting operation is employed that acts as a feature selection module. In particular, the relevance weighting…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2011.00721, arXiv:2011.02136, arXiv:2001.07067

    Journal ref: IEEE International Conference on Acoustics, Speech, & Signal Processing (ICASSP) 2021

  36. arXiv:2012.01477  [pdf, other]

    eess.AS cs.SD

    The Third DIHARD Diarization Challenge

    Authors: Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman

    Abstract: DIHARD III was the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variability in recording equipment, noise conditions, and conversational domain. Speaker diarization was evaluated under two speech activity conditions (diarization from a reference speech activity vs. diarization from scratch) and 11 diverse domains. The domains span…

    Submitted 5 April, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

    Comments: arXiv admin note: text overlap with arXiv:1906.07839

  37. Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting

    Authors: Purvi Agrawal, Sriram Ganapathy

    Abstract: The learning of interpretable representations from raw data presents significant challenges for time series data like speech. In this work, we propose a relevance weighting scheme that allows the interpretation of the speech representations during the forward propagation of the model itself. The relevance weighting is achieved using a sub-network approach that performs the task of feature selectio…

    Submitted 29 October, 2020; originally announced November 2020.

    Comments: arXiv admin note: text overlap with arXiv:2011.00721

    Journal ref: IEEE Transactions on Audio, Speech and Language Processing, Vol. 28, pp. 2823-2836, 2020

  38. Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations

    Authors: Purvi Agrawal, Sriram Ganapathy

    Abstract: Speech recognition in noisy and channel distorted scenarios is often challenging as the current acoustic modeling schemes are not adaptive to the changes in the signal distribution in the presence of noise. In this work, we develop a novel acoustic modeling framework for noise robust speech recognition based on relevance weighting mechanism. The relevance weighting is achieved using a sub-network…

    Submitted 29 October, 2020; originally announced November 2020.

    Comments: arXiv admin note: text overlap with arXiv:2001.07067

    Journal ref: Proc. Interspeech 2020, 1649-1653 (2020)

  39. arXiv:2008.05064  [pdf]

    cs.CY cs.AI cs.HC eess.SY

    Effects of Voice-Based Synthetic Assistant on Performance of Emergency Care Provider in Training

    Authors: Praveen Damacharla, Parashar Dhakal, Sebastian Stumbo, Ahmad Y. Javaid, Subhashini Ganapathy, David A. Malek, Douglas C. Hodge, Vijay Devabhaktuni

    Abstract: As part of a perennial project, our team is actively engaged in developing new synthetic assistant (SA) technologies to assist in training combat medics and medical first responders. It is critical that medical first responders are well trained to deal with emergencies more effectively. This would require real-time monitoring and feedback for each trainee. Therefore, we introduced a voice-based SA…

    Submitted 11 August, 2020; originally announced August 2020.

    ACM Class: H.1.2; I.2.1; I.2.7

    Journal ref: Int J Artif Intell Educ, 29, 122-143, 2018

  40. arXiv:2008.04527  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Neural PLDA Modeling for End-to-End Speaker Verification

    Authors: Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy

    Abstract: While deep learning models have made significant advances in supervised classification problems, the application of these models for out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. The state-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) for computi… ▽ More

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Accepted in Interspeech 2020. GitHub Implementation Repos: https://github.com/iiscleap/E2E-NPLDA and https://github.com/iiscleap/NeuralPlda

  41. Deep Self-Supervised Hierarchical Clustering for Speaker Diarization

    Authors: Prachi Singh, Sriram Ganapathy

    Abstract: The state-of-the-art speaker diarization systems use agglomerative hierarchical clustering (AHC) which performs the clustering of previously learned neural embeddings. While the clustering approach attempts to identify speaker clusters, the AHC algorithm does not involve any further learning. In this paper, we propose a novel algorithm for hierarchical clustering which combines the speaker cluster… ▽ More

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: 5 pages, Accepted in Interspeech 2020

    Journal ref: Proc. Interspeech 2020

  42. arXiv:2008.03517  [pdf, ps, other

    eess.AS

    Context Dependent RNNLM for Automatic Transcription of Conversations

    Authors: Srikanth Raj Chetupalli, Sriram Ganapathy

    Abstract: Conversational speech, while being unstructured at an utterance level, typically has a macro topic which provides larger context spanning multiple utterances. The current language models in speech recognition systems using recurrent neural networks (RNNLM) rely mainly on the local context and exclude the larger context. In order to model the long term dependencies of words across multiple sentence… ▽ More

    Submitted 8 August, 2020; originally announced August 2020.

    Comments: Manuscript accepted for publication at INTERSPEECH 2020, Oct 25-29, Shanghai, China

  43. arXiv:2008.03339  [pdf, other

    eess.AS cs.SD eess.SP

    Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition

    Authors: Anurenjan Purushothaman, Anirudh Sreeram, Rohit Kumar, Sriram Ganapathy

    Abstract: Automatic speech recognition in reverberant conditions is a challenging task as the long-term envelopes of the reverberant speech are temporally smeared. In this paper, we propose a neural model for enhancement of sub-band temporal envelopes for dereverberation of speech. The temporal envelopes are derived using the autoregressive modeling framework of frequency domain linear prediction (FDLP). Th… ▽ More

    Submitted 7 August, 2020; originally announced August 2020.

  44. arXiv:2007.06021  [pdf, other

    eess.AS cs.LG

    NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling

    Authors: Shareef Babu Kalluri, Deepu Vijayasenan, Sriram Ganapathy, Ragesh Rajan M, Prashant Krishnan

    Abstract: Many commercial and forensic applications of speech demand the extraction of information about the speaker characteristics, which falls into the broad category of speaker profiling. The speaker characteristics needed for profiling include physical traits of the speaker like height, age, and gender of the speaker along with the native language of the speaker. Many of the datasets available have onl… ▽ More

    Submitted 12 July, 2020; originally announced July 2020.

    Comments: 5 pages, initial version submitted to Interspeech 2020

  45. arXiv:2006.05815  [pdf, other

    eess.AS cs.SD

    Third DIHARD Challenge Evaluation Plan

    Authors: Neville Ryant, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman

    Abstract: This paper introduces the third DIHARD challenge, the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises two tracks evaluating diarization performance when starting from a reference speech segmentation (track 1) and diarization from ra… ▽ More

    Submitted 2 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: Version 1.2 - Planned schedule updated - Updated numbers in tables from final versions of development/evaluation sets - Corrected typo

  46. arXiv:2005.11258  [pdf, other

    eess.AS

    LEAP Submission to CHiME-6 ASR Challenge

    Authors: Anirudh Sreeram, Anurenjan Purushothaman, Rohit Kumar, Sriram Ganapathy

    Abstract: This paper reports the LEAP submission to the CHiME-6 challenge. The CHiME-6 Automatic Speech Recognition (ASR) challenge Track 1 involved the recognition of speech in noisy and reverberant acoustic conditions in home environments with multiple-party interactions. For the challenge submission, the LEAP system used extensive data augmentation and a factorized time-delay neural network (TDNN) archit… ▽ More

    Submitted 22 May, 2020; originally announced May 2020.

  47. Coswara -- A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis

    Authors: Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, Sriram Ganapathy

    Abstract: The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold standard method for COVID-19 detection is the reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for… ▽ More

    Submitted 11 August, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

    Comments: A description of Coswara dataset to evaluate COVID-19 diagnosis using respiratory sounds

  48. arXiv:2004.01221  [pdf, other

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Towards Relevance and Sequence Modeling in Language Recognition

    Authors: Bharat Padi, Anand Mohan, Sriram Ganapathy

    Abstract: The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and for speaker recognition) ignore the sequence information by… ▽ More

    Submitted 2 April, 2020; originally announced April 2020.

    Comments: https://github.com/iiscleap/lre-relevance-weighting Accepted to IEEE Transactions on Audio, Speech and Language Processing

  49. arXiv:2002.03562  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    NPLDA: A Deep Neural PLDA Model for Speaker Verification

    Authors: Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy

    Abstract: The state-of-the-art approach for speaker verification consists of a neural network based embedding extractor along with a backend generative model such as the Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose a neural network approach for backend modeling in speaker recognition. The likelihood ratio score of the generative PLDA model is posed as a discriminative similarity f… ▽ More

    Submitted 24 May, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

    Comments: Published in Odyssey 2020, the Speaker and Language Recognition Workshop (VOiCES Special Session). Link to GitHub Implementation: https://github.com/iiscleap/NeuralPlda. arXiv admin note: substantial text overlap with arXiv:2001.07034

    Journal ref: in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, Pages 202-209

  50. arXiv:2002.02735  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    LEAP System for SRE19 CTS Challenge -- Improvements and Error Analysis

    Authors: Shreyas Ramoji, Prashant Krishnan, Bhargavram Mysore, Prachi Singh, Sriram Ganapathy

    Abstract: The NIST Speaker Recognition Evaluation - Conversational Telephone Speech (CTS) challenge 2019 was an open evaluation for the task of speaker verification in challenging conditions. In this paper, we provide a detailed account of the LEAP SRE system submitted to the CTS challenge focusing on the novel components in the back-end system modeling. All the systems used the time-delay neural network (T… ▽ More

    Submitted 24 May, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

    Comments: Published in Proc. Odyssey 2020, the Speaker and Language Recognition Workshop. Link to GitHub Implementation: https://github.com/iiscleap/NeuralPlda

    Journal ref: in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, Pages 281-288