-
The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments
Authors:
Shareef Babu Kalluri,
Prachi Singh,
Pratik Roy Chowdhuri,
Apoorva Kulkarni,
Shikha Baghel,
Pradyoth Hegde,
Swapnil Sontakke,
Deepak K T,
S. R. Mahadeva Prasanna,
Deepu Vijayasenan,
Sriram Ganapathy
Abstract:
The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this dataset. A dataset containing 158 hours of speech, consisting of both supervised and unsupervised mono-channel far-field recordings, was released for the LD and SD tracks. Further, 12 hours of close-field mono-channel recordings were provided for the ASR track, conducted on 5 Indian languages. The details of the dataset, baseline systems and the leaderboard results are highlighted in this paper. We have also compared our baseline models and the participating teams' performance on the evaluation data of DISPLACE-2023 to emphasize the advancements made in this second version of the challenge.
Submitted 13 June, 2024;
originally announced June 2024.
-
Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization
Authors:
Prachi Singh,
Sriram Ganapathy
Abstract:
Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in an isolated fashion. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The E-SHARC approach uses front-end mel-filterbank features as input and jointly learns an embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach is capable of predicting the speakers in the overlapping speech regions. The experimental evaluation on several benchmark datasets like AMI, VoxConverse and DISPLACE illustrates that the proposed E-SHARC framework improves significantly over the state-of-the-art diarization systems.
Submitted 23 January, 2024;
originally announced January 2024.
-
Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement
Authors:
Soumya Dutta,
Sriram Ganapathy
Abstract:
The problem of audio-to-audio (A2A) style transfer involves replacing the style features of the source audio with those from the target audio while preserving the content-related attributes of the source audio. In this paper, we propose an efficient approach, termed Zero-shot Emotion Style Transfer (ZEST), that replaces the emotional content of the given source audio with that embedded in the target audio while retaining the speaker and speech content from the source. The proposed system builds upon decomposing speech into semantic tokens, speaker representations and emotion embeddings. Using these factors, we propose a framework to reconstruct the pitch contour of the given speech signal and train a decoder that reconstructs the speech signal. The model is trained using a self-supervision based reconstruction loss. During conversion, the emotion embedding alone is derived from the target audio, while the rest of the factors are derived from the source audio. In our experiments, we show that, even without using parallel training data or labels from the source or target audio, the proposed ZEST model exhibits zero-shot emotion transfer capabilities under objective and subjective quality evaluations.
Submitted 9 January, 2024;
originally announced January 2024.
-
Summary of the DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments
Authors:
Shikha Baghel,
Shreyas Ramoji,
Somil Jain,
Pratik Roy Chowdhuri,
Prachi Singh,
Deepu Vijayasenan,
Sriram Ganapathy
Abstract:
In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The DISPLACE (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open call for evaluating and benchmarking speaker and language diarization technologies on this challenging condition. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations, while Track-2 addressed language diarization (LD) in a multi-speaker scenario. Both tracks were evaluated using the same underlying audio data. To facilitate this evaluation, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. Furthermore, a baseline system that mimicked the state-of-the-art was made available for both the SD and LD tasks. The challenge garnered a total of 42 world-wide registrations and received a total of 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis on the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.
Submitted 3 January, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Speech enhancement with frequency domain auto-regressive modeling
Authors:
Anurenjan Purushothaman,
Debottam Dutta,
Rohit Kumar,
Sriram Ganapathy
Abstract:
Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation. The task of dereverberation constitutes an important step to improve the audible quality and to reduce the error rates in applications like automatic speech recognition (ASR). We propose a unified framework of speech dereverberation for improving the speech quality and the ASR performance using the approach of envelope-carrier decomposition provided by an autoregressive (AR) model. The AR model is applied in the frequency domain of the sub-band speech signals to separate the envelope and carrier parts. A novel neural architecture based on dual path long short term memory (DPLSTM) model is proposed, which jointly enhances the sub-band envelope and carrier components. The dereverberated envelope-carrier signals are modulated and the sub-band signals are synthesized to reconstruct the audio signal back. The DPLSTM model for dereverberation of envelope and carrier components also allows the joint learning of the network weights for the down stream ASR task. In the ASR tasks on the REVERB challenge dataset as well as on the VOiCES dataset, we illustrate that the joint learning of speech dereverberation network and the E2E ASR model yields significant performance improvements over the baseline ASR system trained on log-mel spectrogram as well as other benchmarks for dereverberation (average relative improvements of 10-24% over the baseline system). The speech quality improvements, evaluated using subjective listening tests, further highlight the improved quality of the reconstructed audio.
Submitted 23 September, 2023;
originally announced September 2023.
-
Multimodal Modeling For Spoken Language Identification
Authors:
Shikhar Bharadwaj,
Min Ma,
Shikhar Vashishth,
Ankur Bapna,
Sriram Ganapathy,
Vera Axelrod,
Siddharth Dalmia,
Wei Han,
Yu Zhang,
Daan van Esch,
Sandy Ritchie,
Partha Talukdar,
Jason Riesa
Abstract:
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
Submitted 19 September, 2023;
originally announced September 2023.
-
MASR: Multi-label Aware Speech Representation
Authors:
Anjali Raj,
Shikhar Bharadwaj,
Sriram Ganapathy,
Min Ma,
Shikhar Vashishth
Abstract:
In recent years, speech representation learning has been framed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side-information that is often available for a given speech recording. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information. The external knowledge sources are incorporated in the form of sample-level pair-wise similarity matrices that are useful in a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks such as language identification, speech recognition and other non-semantic tasks such as speaker and emotion recognition. In these experiments, we illustrate significant performance improvements for MASR over other established benchmarks. We perform a detailed analysis on the language identification task to provide insights on how the proposed loss function enables the representations to separate closely related languages.
Submitted 25 September, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications
Authors:
Varun Krishna,
Tarun Sai,
Sriram Ganapathy
Abstract:
The representation learning of speech, without textual resources, is an area of significant interest for many low resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers. The learned "time-frequency" representations from the convolutional neural network (CNN) module are further processed with long short term memory (LSTM) layers which generate a contextual vector representation for every windowed segment. The HUC framework, allowing the categorization of the representations into a small number of phoneme-like units, is used to train the model for learning semantically rich speech representations. The targets consist of phoneme-like pseudo labels for each audio segment and these are generated with an iterative k-means algorithm. We explore techniques that improve the speaker invariance of the learned representations and illustrate the effectiveness of the proposed approach on two settings, i) completely unsupervised speech applications on the sub-tasks described as part of the ZeroSpeech 2021 challenge and ii) semi-supervised automatic speech recognition (ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi dataset. In these experiments, we achieve state-of-art results for various ZeroSpeech tasks. Further, on the ASR experiments, the HUC representations are shown to improve significantly over other established benchmarks based on Wav2vec, HuBERT and Best-RQ.
Submitted 14 July, 2023;
originally announced July 2023.
-
Enhancing the EEG Speech Match Mismatch Tasks With Word Boundaries
Authors:
Akshara Soman,
Vidhi Sinha,
Sriram Ganapathy
Abstract:
Recent studies have shown that the underlying neural mechanisms of human speech comprehension can be analyzed using a match-mismatch classification of the speech stimulus and the neural response. However, such studies have been conducted for fixed-duration segments without accounting for the discrete processing of speech in the brain. In this work, we establish that word boundary information plays a significant role in sentence processing by relating EEG to its speech input. We process the speech and the EEG signals using a network of convolution layers. Then, a word boundary-based average pooling is performed on the representations, and the inter-word context is incorporated using a recurrent layer. The experiments show that the modeling accuracy can be significantly improved (match-mismatch classification accuracy) to 93% on a publicly available speech-EEG data set, while previous efforts achieved an accuracy of 65-75% for this task.
Submitted 1 July, 2023;
originally announced July 2023.
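The word boundary-based average pooling mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the list-of-lists feature layout, and the convention that boundaries are (start, end) frame indices with an exclusive end are all assumptions made for illustration.

```python
def word_boundary_average_pool(frames, boundaries):
    """Average frame-level feature vectors within each word.

    frames:     list of equal-length feature vectors, one per time frame.
    boundaries: list of (start, end) frame indices per word, end exclusive
                (an assumed convention, not taken from the paper).
    Returns one pooled vector per word.
    """
    pooled = []
    for start, end in boundaries:
        word_frames = frames[start:end]
        if not word_frames:
            raise ValueError(f"empty word span: ({start}, {end})")
        dim = len(word_frames[0])
        # Mean over the frames of this word, computed per feature dimension.
        pooled.append(
            [sum(f[d] for f in word_frames) / len(word_frames) for d in range(dim)]
        )
    return pooled


# Toy example: five frames of 2-D features, split into two words.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
boundaries = [(0, 2), (2, 5)]
pooled = word_boundary_average_pool(frames, boundaries)
```

In the described model the pooled word-level vectors would then feed a recurrent layer; here they are simply returned, so the sequence length becomes the number of words rather than the number of frames.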
-
Label Aware Speech Representation Learning For Language Identification
Authors:
Shikhar Vashishth,
Shikhar Bharadwaj,
Sriram Ganapathy,
Ankur Bapna,
Min Ma,
Wei Han,
Vera Axelrod,
Partha Talukdar
Abstract:
Speech representation learning approaches for non-semantic tasks such as language recognition have either explored supervised embedding extraction methods using a classifier model or self-supervised representation learning approaches using raw data. In this paper, we propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task. This framework, termed as Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the downstream task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-the-art systems on language identification. We also report an analysis of the robustness of LASR approach to noisy/missing labels as well as its application to multi-lingual speech recognition tasks.
Submitted 7 June, 2023;
originally announced June 2023.
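As a rough illustration of how a triplet objective can be combined with a self-supervised loss, the sketch below uses the standard margin-based triplet loss with Euclidean distance. The abstract above does not give the exact form of the LASR objective, so the distance, the margin, the weighting factor, and all function names are assumptions for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin triplet loss (assumed form).

    positive shares the anchor's language label; negative has a different one.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def label_aware_loss(ssl_loss, anchor, positive, negative, weight=0.5, margin=1.0):
    """Hypothetical combination: self-supervised loss plus a weighted triplet term."""
    return ssl_loss + weight * triplet_loss(anchor, positive, negative, margin)


# Toy embeddings: the negative is only slightly farther than the positive,
# so the margin is violated and the triplet term is non-zero.
anchor, same_lang, other_lang = [0.0, 0.0], [0.0, 1.0], [0.0, 1.5]
loss = label_aware_loss(2.0, anchor, same_lang, other_lang)
```

When the different-language embedding is already farther from the anchor than the same-language one by more than the margin, the triplet term is zero and only the self-supervised loss remains.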
-
Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection
Authors:
Debarpan Bhattacharya,
Neeraj Kumar Sharma,
Debottam Dutta,
Srikanth Raj Chetupalli,
Pravin Mote,
Sriram Ganapathy,
Chandrakiran C,
Sahiti Nori,
Suhail K K,
Sadhana Gonuguntla,
Murali Alagesan
Abstract:
This paper presents the Coswara dataset, a dataset containing a diverse set of respiratory sounds and rich metadata, recorded between April 2020 and February 2022 from 2635 individuals (1819 SARS-CoV-2 negative, 674 positive, and 142 recovered subjects). The respiratory sounds contained nine sound categories associated with variants of breathing, cough and speech. The rich metadata contained demographic information associated with age, gender and geographic location, as well as the health information relating to the symptoms, pre-existing respiratory ailments, comorbidity and SARS-CoV-2 test status. Our study is the first of its kind to annotate the audio quality of the entire dataset (amounting to 65 hours) through manual listening. The paper summarizes the data collection procedure, demographic, symptoms and audio data information. A COVID-19 classifier based on a bi-directional long short-term memory (BLSTM) architecture is trained and evaluated on the different population sub-groups contained in the dataset to understand the bias/fairness of the model. This enabled the analysis of the impact of gender, geographic location, date of recording, and language proficiency on the COVID-19 detection performance.
Submitted 22 May, 2023;
originally announced May 2023.
-
HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition
Authors:
Soumya Dutta,
Sriram Ganapathy
Abstract:
Emotion recognition in conversations is challenging due to the multi-modal nature of the emotion expression. We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities, i) audio data, processed through a learnable wav2vec approach and, ii) text data represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed using a set of bi-directional recurrent neural network layers with self-attention that converts each utterance in a given conversation to a fixed dimensional embedding. In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, text layers as well as the multi-modal co-attention layers, are hierarchically trained for the emotion classification task. We perform experiments on three established datasets namely, IEMOCAP, MELD and CMU-MOSI, where we illustrate that the proposed model improves significantly over other benchmarks and helps achieve state-of-art results on all these datasets.
Submitted 9 January, 2024; v1 submitted 13 April, 2023;
originally announced April 2023.
-
DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
Authors:
Shikha Baghel,
Shreyas Ramoji,
Sidharth,
Ranjana H,
Prachi Singh,
Somil Jain,
Pratik Roy Chowdhuri,
Kaustubh Kulkarni,
Swapnil Padhi,
Deepu Vijayasenan,
Sriram Ganapathy
Abstract:
In multilingual societies, social conversations often involve code-mixed speech. The current speech technology may not be well equipped to extract information from multi-lingual multi-speaker conversations. The DISPLACE challenge entails a first-of-kind task to benchmark speaker and language diarization on the same data, as the data contains multi-speaker conversations in multilingual code-mixed speech. The challenge attempts to highlight outstanding issues in speaker diarization (SD) in multilingual settings with code-mixing. Further, language diarization (LD) in multi-speaker settings also introduces new challenges, where the system has to disambiguate speaker switches with code switches. For this challenge, a natural multilingual, multi-speaker conversational dataset is distributed for development and evaluation purposes. The systems are evaluated on single-channel far-field recordings. We also release a baseline system and report the highlights of the system submissions.
Submitted 5 June, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Supervised Hierarchical Clustering using Graph Neural Networks for Speaker Diarization
Authors:
Prachi Singh,
Amrit Kaul,
Sriram Ganapathy
Abstract:
Conventional methods for speaker diarization involve windowing an audio file into short segments to extract speaker embeddings, followed by an unsupervised clustering of the embeddings. This multi-step approach generates speaker assignments for each segment. In this paper, we propose a novel Supervised HierArchical gRaph Clustering algorithm (SHARC) for speaker diarization where we introduce a hierarchical structure using Graph Neural Network (GNN) to perform supervised clustering. The supervision allows the model to update the representations and directly improve the clustering performance, thus enabling a single-step approach for diarization. In the proposed work, the input segment embeddings are treated as nodes of a graph with the edge weights corresponding to the similarity scores between the nodes. We also propose an approach to jointly update the embedding extractor and the GNN model to perform end-to-end speaker diarization (E2E-SHARC). During inference, the hierarchical clustering is performed using node densities and edge existence probabilities to merge the segments until convergence. In the diarization experiments, we illustrate that the proposed E2E-SHARC approach achieves 53% and 44% relative improvements over the baseline systems on benchmark datasets like AMI and Voxconverse, respectively.
Submitted 24 February, 2023;
originally announced February 2023.
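The graph construction described in the abstract above (segment embeddings as nodes, similarity scores as edge weights) can be sketched as follows. The use of cosine similarity and the restriction to each node's k most similar neighbours are assumptions made for illustration; the abstract states only that edge weights correspond to similarity scores, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings, k=2):
    """Return {node: [(neighbour, weight), ...]} with the k most similar neighbours.

    Each segment embedding is a node; each edge weight is the cosine
    similarity between the two segment embeddings it connects.
    """
    graph = {}
    for i, e_i in enumerate(embeddings):
        scored = [
            (j, cosine_similarity(e_i, e_j))
            for j, e_j in enumerate(embeddings)
            if j != i
        ]
        # Keep the strongest edges first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = scored[:k]
    return graph


# Toy example: segments 0 and 1 resemble one speaker, 2 and 3 another.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = build_similarity_graph(segments, k=1)
```

A supervised clustering model such as the one in the paper would operate on a graph of this kind; the sketch stops at graph construction and does not attempt the GNN or the merging procedure.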
-
Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer
Authors:
Shrutina Agarwal,
Sriram Ganapathy,
Naoya Takahashi
Abstract:
In this paper, we propose a model to perform style transfer of speech to singing voice. Contrary to the previous signal processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach for the problem of converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving the speaker identity and naturalness. The proposed SymNet model comprises a symmetrical stack of three types of layers: convolutional, transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate the model training. Experiments are performed on the NUS and NHSS datasets, which consist of parallel data of speech and singing voice. In these experiments, we show that the proposed SymNet model improves the objective reconstruction quality significantly over the previously published methods and baseline architectures. Further, a subjective listening test confirms the improved quality of the audio obtained using the proposed approach (absolute improvement of 0.37 in mean opinion score measure over the baseline system).
Submitted 25 August, 2022;
originally announced August 2022.
-
Interpretable Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Authors:
Debottam Dutta,
Debarpan Bhattacharya,
Sriram Ganapathy,
Amir H. Poorjam,
Deepak Mittal,
Maneesh Singh
Abstract:
In this paper, we describe an approach for representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions. The choice of these kernels allows the interpretation of the filterbanks as smooth band-pass filters. The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism. The relevance weighting emphasizes the key regions of the time-frequency decomposition that are important for the downstream task. The subsequent layers of the model consist of a recurrent architecture and the models are trained for a COVID-19 detection task. In our experiments on the Coswara data set, we show that the proposed model achieves significant performance improvements over the baseline system as well as other representation learning approaches. Further, the approach proposed is shown to be uniformly applicable for speech and breathing signals and for transfer learning from a larger data set.
Submitted 27 June, 2022;
originally announced June 2022.
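A cosine-modulated Gaussian kernel of the kind mentioned in the abstract above can be written as g[n] = cos(mu * n) * exp(-n^2 / (2 * sigma^2)). The sketch below generates such a kernel; the exact parameterization used in the paper is not given in the abstract, so the parameter names (centre frequency in radians per sample, standard deviation in samples) and the kernel length are assumptions.

```python
import math


def cosine_modulated_gaussian(center_freq, sigma, half_length):
    """Samples of g[n] = cos(center_freq * n) * exp(-n^2 / (2 * sigma^2)).

    center_freq: modulation frequency in radians per sample (assumed unit).
    sigma:       Gaussian standard deviation in samples; a larger sigma gives
                 a narrower pass band around center_freq.
    half_length: kernel covers n = -half_length .. half_length.
    """
    return [
        math.cos(center_freq * n) * math.exp(-(n * n) / (2.0 * sigma * sigma))
        for n in range(-half_length, half_length + 1)
    ]


# One band-pass kernel centred at a quarter of the Nyquist frequency.
kernel = cosine_modulated_gaussian(center_freq=math.pi / 4, sigma=8.0, half_length=32)
```

Because the Gaussian envelope is smooth, the frequency response of each kernel is a smooth band-pass shape centred at the modulation frequency, which is what makes a learned bank of such filters easy to interpret.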
-
Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals
Authors:
Debarpan Bhattacharya,
Debottam Dutta,
Neeraj Kumar Sharma,
Srikanth Raj Chetupalli,
Pravin Mote,
Sriram Ganapathy,
Chandrakiran C,
Sahiti Nori,
Suhail K K,
Sadhana Gonuguntla,
Murali Alagesan
Abstract:
The COVID-19 outbreak resulted in multiple waves of infections that have been associated with different SARS-CoV-2 variants. Studies have reported differential impact of the variants on respiratory health of patients. We explore whether acoustic signals, collected from COVID-19 subjects, show computationally distinguishable acoustic patterns suggesting a possibility to predict the underlying virus variant. We analyze the Coswara dataset, which is collected from three subject pools, namely, i) healthy subjects, ii) COVID-19 subjects recorded during the delta variant dominant period, and iii) COVID-19 subjects recorded during the omicron surge. Our findings suggest that multiple sound categories, such as cough, breathing, and speech, indicate significant acoustic feature differences when comparing COVID-19 subjects with omicron and delta variants. The classification areas-under-the-curve are significantly above chance for differentiating subjects infected by omicron from those infected by delta. Using a score fusion from multiple sound categories, we obtained an area-under-the-curve of 89% and 52.4% sensitivity at 95% specificity. Additionally, a hierarchical three-class approach was used to classify the acoustic data into healthy and COVID-19 positive, and further COVID-19 subjects into delta and omicron variants, providing a high level of 3-class classification accuracy. These results suggest new ways for designing sound based COVID-19 diagnosis approaches.
△ Less
Submitted 24 June, 2022;
originally announced June 2022.
-
Svadhyaya system for the Second Diagnosing COVID-19 using Acoustics Challenge 2021
Authors:
Deepak Mittal,
Amir H. Poorjam,
Debottam Dutta,
Debarpan Bhattacharya,
Zemin Yu,
Sriram Ganapathy,
Maneesh Singh
Abstract:
This report describes the system used for detecting COVID-19 positives using three different acoustic modalities, namely speech, breathing, and cough in the second DiCOVA challenge. The proposed system is based on the combination of 4 different approaches, each focusing more on one aspect of the problem, and reaches the blind test AUCs of 86.41, 77.60, and 84.55, in the breathing, cough, and speech tracks, respectively, and the AUC of 85.37 in the fusion of these three tracks.
Submitted 11 June, 2022;
originally announced June 2022.
-
Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms
Authors:
Debarpan Bhattacharya,
Debottam Dutta,
Neeraj Kumar Sharma,
Srikanth Raj Chetupalli,
Pravin Mote,
Sriram Ganapathy,
Chandrakiran C,
Sahiti Nori,
Suhail K K,
Sadhana Gonuguntla,
Murali Alagesan
Abstract:
The COVID-19 pandemic has accelerated research on the design of alternative, quick and effective COVID-19 diagnosis approaches. In this paper, we describe the Coswara tool, a website application designed to enable COVID-19 detection by analysing respiratory sound samples and health symptoms. A user using this service can log into a website using any device connected to the internet, provide their current health symptom information and record a few sound samples corresponding to breathing, cough, and speech. Within a minute of analysis of this information on a cloud server, the website tool will output a COVID-19 probability score to the user. As the COVID-19 pandemic continues to demand massive and scalable population level testing, we hypothesize that the proposed tool provides a potential solution towards this.
Submitted 9 June, 2022;
originally announced June 2022.
-
The Second DiCOVA Challenge: Dataset and performance analysis for COVID-19 diagnosis using acoustics
Authors:
Neeraj Kumar Sharma,
Srikanth Raj Chetupalli,
Debarpan Bhattacharya,
Debottam Dutta,
Pravin Mote,
Sriram Ganapathy
Abstract:
The Second Diagnosis of COVID-19 using Acoustics (DiCOVA) Challenge aimed at accelerating the research in acoustics based detection of COVID-19, a topic at the intersection of acoustics, signal processing, machine learning, and healthcare. This paper presents the details of the challenge, which was an open call for researchers to analyze a dataset of audio recordings consisting of breathing, cough and speech signals. This data was collected from individuals with and without COVID-19 infection, and the task in the challenge was a two-class classification. The development set audio recordings were collected from 965 (172 COVID-19 positive) individuals, while the evaluation set contained data from 471 individuals (71 COVID-19 positive). The challenge featured four tracks, one associated with each sound category of cough, speech and breathing, and a fourth fusion track. A baseline system was also released to benchmark the participants. In this paper, we present an overview of the challenge, the rationale for the data collection and the baseline system. Further, a performance analysis for the systems submitted by the 16 participating teams in the leaderboard is also presented.
Submitted 11 October, 2021; v1 submitted 4 October, 2021;
originally announced October 2021.
-
Self-Supervised Metric Learning With Graph Clustering For Speaker Diarization
Authors:
Prachi Singh,
Sriram Ganapathy
Abstract:
In this paper, we propose a novel algorithm for speaker diarization using metric learning for graph based clustering. The graph clustering algorithms use an adjacency matrix consisting of similarity scores. These scores are computed between speaker embeddings extracted from pairs of audio segments within the given recording. In this paper, we propose an approach that jointly learns the speaker embeddings and the similarity metric using principles of self-supervised learning. The metric learning network implements a neural model of the probabilistic linear discriminant analysis (PLDA). The self-supervision is derived from the pseudo labels obtained from a previous iteration of clustering. The entire model of representation learning and metric learning is trained with a binary cross entropy loss. By combining the self-supervision based metric learning along with the graph-based clustering algorithm, we achieve significant relative improvements of 60% and 7% over the x-vector PLDA agglomerative hierarchical clustering (AHC) approach on AMI and the DIHARD datasets respectively in terms of diarization error rates (DER).
Submitted 14 September, 2021;
originally announced September 2021.
-
Dereverberation of Autoregressive Envelopes for Far-field Speech Recognition
Authors:
Anurenjan Purushothaman,
Anirudh Sreeram,
Rohit Kumar,
Sriram Ganapathy
Abstract:
The task of speech recognition in far-field environments is adversely affected by the reverberant artifacts that manifest as the temporal smearing of the sub-band envelopes. In this paper, we develop a neural model for speech dereverberation using the long-term sub-band envelopes of speech. The sub-band envelopes are derived using frequency domain linear prediction (FDLP) which performs an autoregressive estimation of the Hilbert envelopes. The neural dereverberation model estimates the envelope gain which when applied to reverberant signals suppresses the late reflection components in the far-field signal. The dereverberated envelopes are used for feature extraction in speech recognition. Further, the sequence of steps involved in envelope dereverberation, feature extraction and acoustic modeling for ASR can be implemented as a single neural processing pipeline which allows the joint learning of the dereverberation network and the acoustic model. Several experiments are performed on the REVERB challenge dataset, CHiME-3 dataset and VOiCES dataset. In these experiments, the joint learning of envelope dereverberation and acoustic model yields significant performance improvements over the baseline ASR system based on log-mel spectrogram as well as other past approaches for dereverberation (average relative improvements of 10-24% over the baseline system). A detailed analysis on the choice of hyper-parameters and the cost function involved in envelope dereverberation is also provided.
Submitted 13 August, 2021; v1 submitted 12 August, 2021;
originally announced August 2021.
-
End-to-End Speech Recognition With Joint Dereverberation Of Sub-Band Autoregressive Envelopes
Authors:
Rohit Kumar,
Anurenjan Purushothaman,
Anirudh Sreeram,
Sriram Ganapathy
Abstract:
The end-to-end (E2E) automatic speech recognition (ASR) systems are often required to operate in reverberant conditions, where the long-term sub-band envelopes of the speech are temporally smeared. In this paper, we develop a feature enhancement approach using a neural model operating on sub-band temporal envelopes. The temporal envelopes are modeled using the framework of frequency domain linear prediction (FDLP). The neural enhancement model proposed in this paper performs an envelope gain based enhancement of temporal envelopes. The model architecture consists of a combination of convolutional and long short term memory (LSTM) neural network layers. Further, the envelope dereverberation, feature extraction and acoustic modeling using transformer based E2E ASR can all be jointly optimized for the speech recognition task. The joint optimization ensures that the dereverberation model targets the ASR cost function. We perform E2E speech recognition experiments on the REVERB challenge dataset as well as on the VOiCES dataset. In these experiments, the proposed joint modeling approach yields significant improvements compared to the baseline E2E ASR system (average relative improvements of 21% on the REVERB challenge dataset and about 10% on the VOiCES dataset).
Submitted 17 February, 2022; v1 submitted 9 August, 2021;
originally announced August 2021.
-
A Multi-Head Relevance Weighting Framework For Learning Raw Waveform Audio Representations
Authors:
Debottam Dutta,
Purvi Agrawal,
Sriram Ganapathy
Abstract:
In this work, we propose a multi-head relevance weighting framework to learn audio representations from raw waveforms. The audio waveform, split into windows of short duration, is processed with a 1-D convolutional layer of cosine modulated Gaussian filters acting as a learnable filterbank. The key novelty of the proposed framework is the introduction of multi-head relevance on the learnt filterbank representations. Each head of the relevance network is modelled as a separate sub-network. These heads perform representation enhancement by generating weight masks for different parts of the time-frequency representation learnt by the parametric acoustic filterbank layer. The relevance weighted representations are fed to a neural classifier and the whole system is trained jointly for the audio classification objective. Experiments are performed on the DCASE2020 Task 1A challenge as well as the Urban Sound Classification (USC) tasks. In these experiments, the proposed approach yields relative improvements of 10% and 23% respectively for the DCASE2020 and USC datasets over the mel-spectrogram baseline. Also, the analysis of multi-head relevance weights provides insights on the learned representations.
Submitted 30 July, 2021;
originally announced July 2021.
-
SRIB-LEAP submission to Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing
Authors:
R G Prithvi Raj,
Rohit Kumar,
M K Jayesh,
Anurenjan Purushothaman,
Sriram Ganapathy,
M A Basha Shaik
Abstract:
This paper presents the details of the SRIB-LEAP submission to the ConferencingSpeech challenge 2021. The challenge involved the task of multi-channel speech enhancement to improve the quality of far field speech from microphone arrays in a video conferencing room. We propose a two stage method involving a beamformer followed by single channel enhancement. For the beamformer, we incorporated a self-attention mechanism as an inter-channel processing layer in the filter-and-sum network (FaSNet), an end-to-end time-domain beamforming system. The single channel speech enhancement is done in the log spectral domain using a convolutional neural network (CNN)-long short term memory (LSTM) based architecture. On the objective quality metric, perceptual evaluation of speech quality (PESQ), we achieved an improvement of 0.5 over the noisy data. On subjective quality evaluation, the proposed approach improved the mean opinion score (MOS) by an absolute measure of 0.9 over the noisy audio.
Submitted 24 June, 2021;
originally announced June 2021.
-
Towards sound based testing of COVID-19 -- Summary of the first Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge
Authors:
Neeraj Kumar Sharma,
Ananya Muguli,
Prashant Krishnan,
Rohit Kumar,
Srikanth Raj Chetupalli,
Sriram Ganapathy
Abstract:
The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed a growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing and speech sounds, and using them for building POCTs can offer fast, contactless and inexpensive testing. In view of this, over the past year, we launched the "Coswara" project to collect cough, breathing and speech sound recordings via worldwide crowdsourcing. With this data, a call for development of diagnostic tools was announced at Interspeech 2021 as a special session titled "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge". The goal was to bring together researchers and practitioners interested in developing acoustics-based COVID-19 POCTs by enabling them to work on the same set of development and test datasets. As part of the challenge, datasets with breathing, cough, and speech sound samples from COVID-19 and non-COVID-19 individuals were released to the participants. The challenge consisted of two tracks. Track-1 focused only on cough sounds, and participants competed in a leaderboard setting. In Track-2, breathing and speech samples were provided for the participants, without a competitive leaderboard. The challenge attracted 85 plus registrations with 29 final submissions for Track-1. This paper describes the challenge (datasets, tasks, baseline system), and presents a focused summary of the various systems submitted by the participating teams. An analysis of the results from the top four teams showed that a fusion of the scores from these teams yields an area-under-the-curve of 95.1% on the blind test data. By summarizing the lessons learned, we foresee the challenge overview in this paper to help accelerate technology for acoustic-based POCTs.
Submitted 21 June, 2021;
originally announced June 2021.
-
Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms
Authors:
Srikanth Raj Chetupalli,
Prashant Krishnan,
Neeraj Sharma,
Ananya Muguli,
Rohit Kumar,
Viral Nanda,
Lancelot Mark Pinto,
Prasanta Kumar Ghosh,
Sriram Ganapathy
Abstract:
The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of the COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostics using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, was recorded using a web-application over a period of ten months. We investigate the use of statistical descriptors of simple time-frequency features for acoustic signals and binary features for the presence of symptoms. Unlike previous works, we primarily focus on the application of simple linear classifiers like logistic regression and support vector machines for acoustic data while decision tree models are employed on the symptoms data. We show that a multi-modal integration of acoustics and symptoms classifiers achieves an area-under-curve (AUC) of 92.40, a significant improvement over any individual modality. Several ablation experiments are also provided which highlight the acoustic and symptom dimensions that are important for the task of COVID-19 diagnostics.
Submitted 5 June, 2021; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Deep Correlation Analysis for Audio-EEG Decoding
Authors:
Jaswanth Reddy Katthi,
Sriram Ganapathy
Abstract:
The electroencephalography (EEG), which is one of the easiest modes of recording brain activations in a non-invasive manner, is often distorted due to recording artifacts which adversely impacts the stimulus-response analysis. The most prominent techniques thus far attempt to improve the stimulus-response correlations using linear methods. In this paper, we propose a neural network based correlation analysis framework that significantly improves over the linear methods for auditory stimuli. A deep model is proposed for intra-subject audio-EEG analysis based on directly optimizing the correlation loss. Further, a neural network model with a shared encoder architecture is proposed for improving the inter-subject stimulus response correlations. These models attempt to suppress the EEG artifacts while preserving the components related to the stimulus. Several experiments are performed using EEG recordings from subjects listening to speech and music stimuli. In these experiments, we show that the deep models improve the Pearson correlation significantly over the linear methods (average absolute improvements of 7.4% in speech tasks and 29.3% in music tasks). We also analyze the impact of several model parameters on the stimulus-response correlation.
Submitted 27 November, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.
-
Self-supervised Representation Learning With Path Integral Clustering For Speaker Diarization
Authors:
Prachi Singh,
Sriram Ganapathy
Abstract:
Automatic speaker diarization techniques typically involve a two-stage processing approach where audio segments of fixed duration are converted to vector representations in the first stage. This is followed by an unsupervised clustering of the representations in the second stage. In most of the prior approaches, these two stages are performed in an isolated manner with independent optimization steps. In this paper, we propose a representation learning and clustering algorithm that can be iteratively performed for improved speaker diarization. The representation learning is based on principles of self-supervised learning while the clustering algorithm is a graph structural method based on path integral clustering (PIC). The representation learning step uses the cluster targets from PIC and the clustering step is performed on embeddings learned from the self-supervised deep model. This iterative approach is referred to as self-supervised clustering (SSC). The diarization experiments are performed on CALLHOME and AMI meeting datasets. In these experiments, we show that the SSC algorithm improves significantly over the baseline system (relative improvements of 13% and 59% on CALLHOME and AMI datasets respectively in terms of diarization error rate (DER)). In addition, the DER results reported in this work improve over several other recent approaches for speaker diarization.
Submitted 19 April, 2021;
originally announced April 2021.
-
LEAP Submission for the Third DIHARD Diarization Challenge
Authors:
Prachi Singh,
Rajat Varma,
Venkat Krishnamohan,
Srikanth Raj Chetupalli,
Sriram Ganapathy
Abstract:
The LEAP submission for DIHARD-III challenge is described in this paper. The proposed system is composed of a speech bandwidth classifier, and diarization systems fine-tuned for narrowband and wideband speech separately. We use an end-to-end speaker diarization system for the narrowband conversational telephone speech recordings. For the wideband multi-speaker recordings, we use a neural embedding based clustering approach, similar to the baseline system. The embeddings are extracted from a time-delay neural network (called x-vectors) followed by the graph based path integral clustering (PIC) approach. The LEAP system showed 24% and 18% relative improvements for Track-1 and Track-2 respectively over the baseline system provided by the organizers. This paper describes the challenge submission, the post-evaluation analysis and improvements observed on the DIHARD-III dataset.
Submitted 14 June, 2021; v1 submitted 6 April, 2021;
originally announced April 2021.
-
Speaker conditioned acoustic modeling for multi-speaker conversational ASR
Authors:
Srikanth Raj Chetupalli,
Sriram Ganapathy
Abstract:
In this paper, we propose a novel approach for the transcription of speech conversations with natural speaker overlap, from single channel speech recordings. The proposed model is a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system. The speaker conditioned acoustic model (SCAM) in the ASR system consists of a series of embedding layers which use the speaker activity inputs from the diarization system to derive speaker specific embeddings. The outputs of the SCAM are speaker specific senones that are used for decoding the transcripts for each speaker in the conversation. In this work, we experiment with the automatic speaker activity decisions generated using an end-to-end speaker diarization system. A joint learning approach is also proposed where the diarization model and the ASR acoustic model are jointly optimized. The experiments are performed on the mixed-channel two speaker recordings from the Switchboard corpus of telephone conversations. In these experiments, we show that the proposed acoustic model, incorporating speaker activity decisions and joint optimization, improves significantly over the ASR system with explicit source filtering (relative improvements of 12% in word error rate (WER) over the baseline system).
Submitted 29 August, 2022; v1 submitted 5 April, 2021;
originally announced April 2021.
-
DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
Authors:
Ananya Muguli,
Lancelot Pinto,
Nirmala R.,
Neeraj Sharma,
Prashant Krishnan,
Prasanta Kumar Ghosh,
Rohit Kumar,
Shrirama Bhat,
Srikanth Raj Chetupalli,
Sriram Ganapathy,
Shreyas Ramoji,
Viral Nanda
Abstract:
The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. This challenge is an open call for researchers to analyze a dataset of sound recordings collected from COVID-19 infected and non-COVID-19 individuals for a two-class classification. These recordings were collected via crowdsourcing from multiple countries, through a website application. The challenge features two tracks, one focusing on cough sounds, and the other on using a collection of breath, sustained vowel phonation, and number counting speech recordings. In this paper, we introduce the challenge and provide a detailed description of the task, and present a baseline system for the task.
Submitted 17 June, 2021; v1 submitted 16 March, 2021;
originally announced March 2021.
-
Deep Multiway Canonical Correlation Analysis for Multi-Subject EEG Normalization
Authors:
Jaswanth Reddy Katthi,
Sriram Ganapathy
Abstract:
The normalization of brain recordings from multiple subjects responding to the natural stimuli is one of the key challenges in auditory neuroscience. The objective of this normalization is to transform the brain data in such a way as to remove the inter-subject redundancies and to boost the component related to the stimuli. In this paper, we propose a deep learning framework to improve the correlation of electroencephalography (EEG) data recorded from multiple subjects engaged in an audio listening task. The proposed model extends the linear multi-way canonical correlation analysis (CCA) for audio-EEG analysis using an auto-encoder network with a shared encoder layer. The model is trained to optimize a combined loss involving correlation and reconstruction. The experiments are performed on EEG data collected from subjects listening to natural speech and music. In these experiments, we show that the proposed deep multi-way CCA (DMCCA) based model significantly improves the correlations over the linear multi-way CCA approach with absolute improvements of 0.08 and 0.29 in terms of the Pearson correlation values for speech and music tasks respectively.
Submitted 11 March, 2021;
originally announced March 2021.
-
End-to-end lyrics Recognition with Voice to Singing Style Transfer
Authors:
Sakya Basak,
Shrutina Agarwal,
Sriram Ganapathy,
Naoya Takahashi
Abstract:
Automatic transcription of monophonic/polyphonic music is a challenging task due to the lack of availability of large amounts of transcribed data. In this paper, we propose a data augmentation method that converts natural speech to singing voice based on a vocoder based speech synthesizer. This approach, called voice to singing (V2S), performs the voice style conversion by modulating the F0 contour of the natural speech with that of a singing voice. The V2S model based style transfer can generate good quality singing voice, thereby enabling the conversion of large corpora of natural speech to singing voice that is useful in building an E2E lyrics transcription system. In our experiments on monophonic singing voice data, the V2S style transfer provides a significant gain (relative improvements of 21%) for the E2E lyrics transcription system. We also discuss additional components like transfer learning and lyrics based language modeling to improve the performance of the lyrics transcription system.
Submitted 16 February, 2021;
originally announced February 2021.
-
Representation Learning For Speech Recognition Using Feedback Based Relevance Weighting
Authors:
Purvi Agrawal,
Sriram Ganapathy
Abstract:
In this work, we propose an acoustic embedding based approach for representation learning in speech recognition. The proposed approach involves two stages comprising of acoustic filterbank learning from raw waveform, followed by modulation filterbank learning. In each stage, a relevance weighting operation is employed that acts as a feature selection module. In particular, the relevance weighting…
▽ More
In this work, we propose an acoustic embedding based approach for representation learning in speech recognition. The proposed approach involves two stages comprising of acoustic filterbank learning from raw waveform, followed by modulation filterbank learning. In each stage, a relevance weighting operation is employed that acts as a feature selection module. In particular, the relevance weighting network receives embeddings of the model outputs from the previous time instants as feedback. The proposed relevance weighting scheme allows the respective feature representations to be adaptively selected before propagation to the higher layers. The application of the proposed approach for the task of speech recognition on Aurora-4 and CHiME-3 datasets gives significant performance improvements over baseline systems on raw waveform signal as well as those based on mel representations (average relative improvement of 15% over the mel baseline on Aurora-4 dataset and 7% on CHiME-3 dataset).
Submitted 15 February, 2021;
originally announced February 2021.
-
The Third DIHARD Diarization Challenge
Authors:
Neville Ryant,
Prachi Singh,
Venkat Krishnamohan,
Rajat Varma,
Kenneth Church,
Christopher Cieri,
Jun Du,
Sriram Ganapathy,
Mark Liberman
Abstract:
DIHARD III was the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variability in recording equipment, noise conditions, and conversational domain. Speaker diarization was evaluated under two speech activity conditions (diarization from a reference speech activity vs. diarization from scratch) and 11 diverse domains. The domains span a range of recording conditions and interaction types, including read audio-books, meeting speech, clinical interviews, web videos, and, for the first time, conversational telephone speech. A total of 30 organizations (forming 21 teams) from industry and academia submitted 499 valid system outputs. The evaluation results indicate that speaker diarization has improved markedly since DIHARD I, particularly for two-party interactions, but that for many domains (e.g., web video) the problem remains far from solved.
Submitted 5 April, 2021; v1 submitted 2 December, 2020;
originally announced December 2020.
-
Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting
Authors:
Purvi Agrawal,
Sriram Ganapathy
Abstract:
The learning of interpretable representations from raw data presents significant challenges for time series data like speech. In this work, we propose a relevance weighting scheme that allows the interpretation of the speech representations during the forward propagation of the model itself. The relevance weighting is achieved using a sub-network approach that performs the task of feature selection. A relevance sub-network, applied on the output of the first layer of a convolutional neural network model operating on raw speech signals, acts as an acoustic filterbank (FB) layer with relevance weighting. A similar relevance sub-network applied on the second convolutional layer performs modulation filterbank learning with relevance weighting. The full acoustic model consisting of relevance sub-networks, convolutional layers and feed-forward layers is trained for a speech recognition task on noisy and reverberant speech in the Aurora-4, CHiME-3 and VOiCES datasets. The proposed representation learning framework is also applied to the task of sound classification on the UrbanSound8K dataset. A detailed analysis of the relevance weights learned by the model reveals that the relevance weights capture information regarding the underlying speech/audio content. In addition, speech recognition and sound classification experiments reveal that the incorporation of relevance weighting in the neural network architecture improves the performance significantly.
Submitted 29 October, 2020;
originally announced November 2020.
-
Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations
Authors:
Purvi Agrawal,
Sriram Ganapathy
Abstract:
Speech recognition in noisy and channel distorted scenarios is often challenging as the current acoustic modeling schemes are not adaptive to the changes in the signal distribution in the presence of noise. In this work, we develop a novel acoustic modeling framework for noise robust speech recognition based on a relevance weighting mechanism. The relevance weighting is achieved using a sub-network approach that performs feature selection. A relevance sub-network is applied on the output of the first layer of a convolutional network model operating on raw speech signals, while a second relevance sub-network is applied on the second convolutional layer output. The relevance weights for the first layer correspond to an acoustic filterbank selection, while the relevance weights in the second layer perform modulation filter selection. The model is trained for a speech recognition task on noisy and reverberant speech. The speech recognition experiments on multiple datasets (Aurora-4, CHiME-3, VOiCES) reveal that the incorporation of relevance weighting in the neural network architecture improves the speech recognition word error rates significantly (average relative improvements of 10% over the baseline systems).
Submitted 29 October, 2020;
originally announced November 2020.
-
Effects of Voice-Based Synthetic Assistant on Performance of Emergency Care Provider in Training
Authors:
Praveen Damacharla,
Parashar Dhakal,
Sebastian Stumbo,
Ahmad Y. Javaid,
Subhashini Ganapathy,
David A. Malek,
Douglas C. Hodge,
Vijay Devabhaktuni
Abstract:
As part of a perennial project, our team is actively engaged in developing new synthetic assistant (SA) technologies to assist in training combat medics and medical first responders. It is critical that medical first responders are well trained to deal with emergencies more effectively. This would require real-time monitoring and feedback for each trainee. Therefore, we introduced a voice-based SA to augment the training process of medical first responders and enhance their performance in the field. The potential benefits of SAs include a reduction in training costs and enhanced monitoring mechanisms. Despite the increased usage of voice-based personal assistants (PAs) in day-to-day life, the associated effects are commonly neglected in human factors studies. Therefore, this paper focuses on performance analysis of the developed voice-based SA in emergency care provider training for a selected emergency treatment scenario. The research discussed in this paper follows design science in developing the proposed technology; we discuss the architecture and development at length and present working results of the voice-based SA. The empirical testing was conducted as user studies on two groups, one trained with conventional methods and the other with the help of the SA, and analyzed using statistical analysis tools. The statistical results demonstrated an improvement in the training efficacy and performance of medical responders supported by the SA. Furthermore, the paper also discusses the accuracy and time of task execution (t) and concludes with guidelines for resolving the identified problems.
Submitted 11 August, 2020;
originally announced August 2020.
-
Neural PLDA Modeling for End-to-End Speaker Verification
Authors:
Shreyas Ramoji,
Prashant Krishnan,
Sriram Ganapathy
Abstract:
While deep learning models have made significant advances in supervised classification problems, the application of these models for out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. The state-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) for computing the verification score. Recently, we proposed a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA), where the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end (E2E) fashion. This proposed end-to-end model is optimized directly from the acoustic features with a verification cost function and, during testing, the model directly outputs the likelihood ratio score. With various experiments using the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
Submitted 11 August, 2020;
originally announced August 2020.
-
Deep Self-Supervised Hierarchical Clustering for Speaker Diarization
Authors:
Prachi Singh,
Sriram Ganapathy
Abstract:
The state-of-the-art speaker diarization systems use agglomerative hierarchical clustering (AHC) which performs the clustering of previously learned neural embeddings. While the clustering approach attempts to identify speaker clusters, the AHC algorithm does not involve any further learning. In this paper, we propose a novel algorithm for hierarchical clustering which combines the speaker clustering along with a representation learning framework. The proposed approach is based on principles of self-supervised learning where the self-supervision is derived from the clustering algorithm. The representation learning network is trained with a regularized triplet loss using the clustering solution at the current step while the clustering algorithm uses the deep embeddings from the representation learning step. By combining the self-supervision based representation learning along with the clustering algorithm, we show that the proposed algorithm improves significantly (29% relative improvement) over the AHC algorithm with cosine similarity for a speaker diarization task on the CALLHOME dataset. In addition, the proposed approach also improves over the state-of-the-art system with PLDA affinity matrix, with 10% relative improvement in DER.
Submitted 10 August, 2020;
originally announced August 2020.
-
Context Dependent RNNLM for Automatic Transcription of Conversations
Authors:
Srikanth Raj Chetupalli,
Sriram Ganapathy
Abstract:
Conversational speech, while being unstructured at an utterance level, typically has a macro topic which provides larger context spanning multiple utterances. The current language models in speech recognition systems using recurrent neural networks (RNNLM) rely mainly on the local context and exclude the larger context. In order to model the long term dependencies of words across multiple sentences, we propose a novel architecture where the words from prior utterances are converted to an embedding. The relevance of these embeddings for the prediction of the next word in the current sentence is found using a gating network. The relevance weighted context embedding vector is combined in the language model to improve the next word prediction, and the entire model including the context embedding and the relevance weighting layers is jointly learned for a conversational language modeling task. Experiments are performed on two conversational datasets - AMI corpus and the Switchboard corpus. In these tasks, we illustrate that the proposed approach yields significant improvements in language model perplexity over the RNNLM baseline. In addition, the use of the proposed conversational LM for ASR rescoring results in absolute WER reduction of 1.2% on the Switchboard dataset and 1.0% on the AMI dataset over the RNNLM based ASR baseline.
Submitted 8 August, 2020;
originally announced August 2020.
-
Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition
Authors:
Anurenjan Purushothaman,
Anirudh Sreeram,
Rohit Kumar,
Sriram Ganapathy
Abstract:
Automatic speech recognition in reverberant conditions is a challenging task as the long-term envelopes of the reverberant speech are temporally smeared. In this paper, we propose a neural model for enhancement of sub-band temporal envelopes for dereverberation of speech. The temporal envelopes are derived using the autoregressive modeling framework of frequency domain linear prediction (FDLP). The neural enhancement model proposed in this paper performs an envelope gain based enhancement of temporal envelopes and it consists of a series of convolutional and recurrent neural network layers. The enhanced sub-band envelopes are used to generate features for automatic speech recognition (ASR). The ASR experiments are performed on the REVERB challenge dataset as well as the CHiME-3 dataset. In these experiments, the proposed neural enhancement approach provides significant improvements over a baseline ASR system with beamformed audio (average relative improvements of 21% on the development set and about 11% on the evaluation set in word error rates for the REVERB challenge dataset).
Submitted 7 August, 2020;
originally announced August 2020.
-
NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling
Authors:
Shareef Babu Kalluri,
Deepu Vijayasenan,
Sriram Ganapathy,
Ragesh Rajan M,
Prashant Krishnan
Abstract:
Many commercial and forensic applications of speech demand the extraction of information about the speaker characteristics, which falls into the broad category of speaker profiling. The speaker characteristics needed for profiling include physical traits of the speaker like height, age, and gender, along with the native language of the speaker. Many of the datasets available have only partial information for speaker profiling. In this paper, we attempt to overcome this limitation by developing a new dataset which has speech data from five different Indian languages along with English. The metadata information for speaker profiling applications, like linguistic information, regional information, and physical characteristics of a speaker, is also collected. We call this dataset the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The description of the dataset, potential applications, and baseline results for speaker profiling on this dataset are provided in this paper.
Submitted 12 July, 2020;
originally announced July 2020.
-
Third DIHARD Challenge Evaluation Plan
Authors:
Neville Ryant,
Kenneth Church,
Christopher Cieri,
Jun Du,
Sriram Ganapathy,
Mark Liberman
Abstract:
This paper introduces the third DIHARD challenge, the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises two tracks evaluating diarization performance when starting from a reference speech segmentation (track 1) and diarization from scratch on raw audio (track 2). We describe the task, metrics, datasets, and evaluation protocol.
Submitted 2 December, 2020; v1 submitted 4 June, 2020;
originally announced June 2020.
-
LEAP Submission to CHiME-6 ASR Challenge
Authors:
Anirudh Sreeram,
Anurenjan Purushothaman,
Rohit Kumar,
Sriram Ganapathy
Abstract:
This paper reports the LEAP submission to the CHiME-6 challenge. The CHiME-6 Automatic Speech Recognition (ASR) challenge Track 1 involved the recognition of speech in noisy and reverberant acoustic conditions in home environments with multiple-party interactions. For the challenge submission, the LEAP system used extensive data augmentation and a factorized time-delay neural network (TDNN) architecture. We also explored a neural architecture that interleaved the TDNN layers with LSTM layers. The submitted system improved over the Kaldi recipe by 2% relative in word error rate.
Submitted 22 May, 2020;
originally announced May 2020.
-
Coswara -- A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Authors:
Neeraj Sharma,
Prashant Krishnan,
Rohit Kumar,
Shreyas Ramoji,
Srikanth Raj Chetupalli,
Nirmala R.,
Prasanta Kumar Ghosh,
Sriram Ganapathy
Abstract:
The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold standard method for COVID-19 detection is the reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for an alternate diagnosis tool which overcomes these limitations, and is deployable at a large scale. The prominent symptoms of COVID-19 include cough and breathing difficulties. We foresee that respiratory sounds, when analyzed using machine learning techniques, can provide useful insights, enabling the design of a diagnostic tool. Towards this, the paper presents an early effort in creating (and analyzing) a database, called Coswara, of respiratory sounds, namely, cough, breath, and voice. The sound samples are collected via worldwide crowdsourcing using a website application. The curated dataset is released as open access. As the pandemic is evolving, the data collection and analysis is a work in progress. We believe that insights from analysis of Coswara can be effective in enabling sound based technology solutions for point-of-care diagnosis of respiratory infection, and in the near future this can help to diagnose COVID-19.
Submitted 11 August, 2020; v1 submitted 21 May, 2020;
originally announced May 2020.
-
Towards Relevance and Sequence Modeling in Language Recognition
Authors:
Bharat Padi,
Anand Mohan,
Sriram Ganapathy
Abstract:
The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and to speaker recognition) ignore the sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose a neural network framework utilizing short-sequence information in language recognition. In particular, a new model is proposed for incorporating relevance in language recognition, where parts of the speech data are weighted more based on their relevance for the language recognition task. This relevance weighting is achieved using the bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches: the first uses segment level i-vector/x-vector representations that are aggregated in the neural model, and the second models the acoustic features directly in an end-to-end neural model. Experiments are performed using the language recognition task in the NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data as well as on the RATS language recognition corpus. In these experiments on noisy LRE tasks as well as the RATS dataset, the proposed approach yields significant improvements over the conventional i-vector/x-vector based language recognition approaches as well as over other previous models incorporating sequence information.
Submitted 2 April, 2020;
originally announced April 2020.
-
NPLDA: A Deep Neural PLDA Model for Speaker Verification
Authors:
Shreyas Ramoji,
Prashant Krishnan,
Sriram Ganapathy
Abstract:
The state-of-the-art approach for speaker verification consists of a neural network based embedding extractor along with a backend generative model such as the Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose a neural network approach for backend modeling in speaker recognition. The likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. The proposed model, termed neural PLDA (NPLDA), is initialized using the generative PLDA model parameters. The loss function for the NPLDA model is an approximation of the minimum detection cost function (DCF). The speaker recognition experiments using the NPLDA model are performed on the speaker verification task in the VOiCES datasets as well as the SITW challenge dataset. In these experiments, the NPLDA model optimized using the proposed loss function improves significantly over the state-of-the-art PLDA based speaker verification system.
Submitted 24 May, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
LEAP System for SRE19 CTS Challenge -- Improvements and Error Analysis
Authors:
Shreyas Ramoji,
Prashant Krishnan,
Bhargavram Mysore,
Prachi Singh,
Sriram Ganapathy
Abstract:
The NIST Speaker Recognition Evaluation - Conversational Telephone Speech (CTS) challenge 2019 was an open evaluation for the task of speaker verification in challenging conditions. In this paper, we provide a detailed account of the LEAP SRE system submitted to the CTS challenge focusing on the novel components in the back-end system modeling. All the systems used the time-delay neural network (TDNN) based x-vector embeddings. The x-vector system in our SRE19 submission used a large pool of training speakers (about 14k speakers). Following the x-vector extraction, we explored a neural network approach to backend score computation that was optimized for a speaker verification cost. The system combination of generative and neural PLDA models resulted in significant improvements for the SRE evaluation dataset. We also found additional gains for the SRE systems based on score normalization and calibration. Subsequent to the evaluations, we have performed a detailed analysis of the submitted systems. The analysis revealed the incremental gains obtained for different training dataset combinations as well as the modeling methods.
Submitted 24 May, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.