Search | arXiv e-print repository

SPIRE-SIES: A Spontaneous Indian English Speech Corpus

Authors: Abhayjeet Singh, Charu Shah, Rajashri Varadaraj, Sonakshi Chauhan, Prasanta Kumar Ghosh

Abstract: In this paper, we present a 170.83 hour Indian English spontaneous speech dataset. Lack of Indian English speech data is one of the major hindrances in develo** robust speech systems which are adapted to the Indian speech style. Moreover this scarcity is even more for spontaneous speech. This corpus is crowd sourced over varied Indian nativities, genders and age groups. Traditional spontaneous s… ▽ More In this paper, we present a 170.83 hour Indian English spontaneous speech dataset. Lack of Indian English speech data is one of the major hindrances in develo** robust speech systems which are adapted to the Indian speech style. Moreover this scarcity is even more for spontaneous speech. This corpus is crowd sourced over varied Indian nativities, genders and age groups. Traditional spontaneous speech collection strategies involve capturing of speech during interviewing or conversations. In this study, we use images as stimuli to induce spontaneity in speech. Transcripts for 23 hours is generated and validated which can serve as a spontaneous speech ASR benchmark. Quality of the corpus is validated with voice activity detection based segmentation, gender verification and image semantic correlation. Which determines a relationship between image stimulus and recorded speech using caption keywords derived from Image2Text model and high occurring words derived from whisper ASR generated transcripts. △ Less

Submitted 1 December, 2023; originally announced December 2023.

Comments: 6 pages, 7 plots, 3 tables, Accepted at O-COCOSDA 2023

arXiv:2310.08846 [pdf, other]

Speaking rate attention-based duration prediction for speed control TTS

Authors: Jesuraj Bandekar, Sathvik Udupa, Abhayjeet Singh, Anjali Jayakumar, Deekshitha G, Sandhya Badiger, Saurabh Kumar, Pooja VH, Prasanta Kumar Ghosh

Abstract: With the advent of high-quality speech synthesis, there is a lot of interest in controlling various prosodic attributes of speech. Speaking rate is an essential attribute towards modelling the expressivity of speech. In this work, we propose a novel approach to control the speaking rate for non-autoregressive TTS. We achieve this by conditioning the speaking rate inside the duration predictor, all… ▽ More With the advent of high-quality speech synthesis, there is a lot of interest in controlling various prosodic attributes of speech. Speaking rate is an essential attribute towards modelling the expressivity of speech. In this work, we propose a novel approach to control the speaking rate for non-autoregressive TTS. We achieve this by conditioning the speaking rate inside the duration predictor, allowing implicit speaking rate control. We show the benefits of this approach by synthesising audio at various speaking rate factors and measuring the quality of speaking rate-controlled synthesised speech. Further, we study the effect of the speaking rate distribution of the training data towards effective rate control. Finally, we fine-tune a baseline pretrained TTS model to obtain speaking rate control TTS. We provide various analyses to showcase the benefits of using this proposed approach, along with objective as well as subjective metrics. We find that the proposed methods have higher subjective scores and lower speaker rate errors across many speaking rate factors over the baseline. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: \c{opyright} 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2307.07948 [pdf, ps, other]

Model Adaptation for ASR in low-resource Indian Languages

Authors: Abhayjeet Singh, Arjun Singh Mehta, Ashish Khuraishi K S, Deekshitha G, Gauri Date, Jai Nanavati, Jesuraja Bandekar, Karnalius Basumatary, Karthika P, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Savitha, Prasanta Kumar Ghosh, Prashanthi V, Priyanka Pai, Raoul Nanavati, Rohan Saxena, Sai Praneeth Reddy Mora, Srinivasa Raghavan

Abstract: Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple… ▽ More Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple dialects like in Indian languages. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure. This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages. In such scenarios, it is important to understand the extent to which each modality, like acoustics and text, is important in building a reliable ASR. It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora. Or, due to the availability of various pretrained acoustic models, the vice-versa could also be true. In this proposed special session, we encourage the community to explore these ideas with the data in two low-resource Indian languages of Bengali and Bhojpuri. These approaches are not limited to Indian languages, the solutions are potentially applicable to various languages spoken around the world. △ Less

Submitted 16 July, 2023; originally announced July 2023.

Comments: ASRU Special session overview paper

arXiv:2305.00242 [pdf, other]

Analysis of vocal breath sounds before and after administering Bronchodilator in Asthmatic patients

Authors: Shivani Yadav, Dipanjan Gope, Uma Maheswari K., Prasanta Kumar Ghosh

Abstract: Asthma is one of the chronic inflammatory diseases of the airways, which causes chest tightness, wheezing, breathlessness, and cough. Spirometry is an effort-dependent test used to monitor and diagnose lung conditions like Asthma. Vocal breath sound (VBS) based analysis can be an alternative to spirometry as VBS characteristics change depending on the lung condition. VBS test consumes less time, a… ▽ More Asthma is one of the chronic inflammatory diseases of the airways, which causes chest tightness, wheezing, breathlessness, and cough. Spirometry is an effort-dependent test used to monitor and diagnose lung conditions like Asthma. Vocal breath sound (VBS) based analysis can be an alternative to spirometry as VBS characteristics change depending on the lung condition. VBS test consumes less time, and it also requires less effort, unlike spirometry. In this work, VBS characteristics are analyzed before and after administering bronchodilator in a subject-dependent manner using linear discriminant analysis (LDA). We find that features learned through LDA show a significant difference between VBS recorded before and after administering bronchodilator in all 30 subjects considered in this work, whereas the baseline features could achieve a significant difference between VBS only for 26 subjects. We also observe that all frequency ranges do not contribute equally to the discrimination between pre and post bronchodilator conditions. From experiments, we find that two frequency ranges, namely 400-500Hz and 1480-1900Hz, maximally contribute to the discrimination of all the subjects. The study presented in this paper analyzes the pre and post-bronchodilator effect on the inhalation sound recorded at the mouth in a subject-dependent manner. The findings of this work suggest that, inhalation sound recorded at mouth can be a good stimulus to discriminate pre and post-bronchodilator conditions in asthmatic subjects. Inhale sound-based pre and post-bronchodilator discrimination can be of potential use in clinical settings. △ Less

Submitted 29 April, 2023; originally announced May 2023.

arXiv:2304.03758 [pdf, ps, other]

An unsupervised segmentation of vocal breath sounds

Authors: Shivani Yadav, Dipanjan Gope, Uma Maheswari K., Prasanta Kumar Ghosh

Abstract: Breathing is an essential part of human survival, which carries information about a person's physiological and psychological state. Generally, breath boundaries are marked by experts before using for any task. An unsupervised algorithm for breath boundary detection has been proposed for breath sounds recorded at the mouth also referred as vocal breath sounds (VBS) in this work. Breath sounds recor… ▽ More Breathing is an essential part of human survival, which carries information about a person's physiological and psychological state. Generally, breath boundaries are marked by experts before using for any task. An unsupervised algorithm for breath boundary detection has been proposed for breath sounds recorded at the mouth also referred as vocal breath sounds (VBS) in this work. Breath sounds recorded at the mouth are used in this work because they are easy and contactless to record than tracheal breath sounds and lung breath sounds. The periodic nature of breath signal energy is used to segment the breath boundaries. Dynamic programming with the prior information of the number of breath phases($P$) and breath phase duration($d$) is used to find the boundaries. In this work, 367 breath boundaries from 60 subjects (31 healthy, 29 patients) having 307 breaths are predicted. With the proposed method, M ($89\%$), I ($13\%$), D ($11\%$) and S ($79\%$) is found. The proposed method shows better performance than the baselines used in this work. Even the classification performance between asthmatic and healthy subjects using estimated boundaries by the proposed method is comparable with the ground truth boundaries. △ Less

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2211.06371

Vocal Breath Sound Based Gender Classification

Authors: Mohammad Shaique Solanki, Ashutosh M Bharadwaj, Jeevan K, Prasanta Kumar Ghosh

Abstract: Voiced speech signals such as continuous speech are known to have acoustic features such as pitch(F0), and formant frequencies(F1, F2, F3) which can be used for gender classification. However, gender classification studies using non-speech signals such as vocal breath sounds have not been explored as they lack typical gender-specific acoustic features. In this work, we explore whether vocal breath… ▽ More Voiced speech signals such as continuous speech are known to have acoustic features such as pitch(F0), and formant frequencies(F1, F2, F3) which can be used for gender classification. However, gender classification studies using non-speech signals such as vocal breath sounds have not been explored as they lack typical gender-specific acoustic features. In this work, we explore whether vocal breath sounds encode gender information and if so, to what extent it can be used for automatic gender classification. In this study, we explore the use of data-driven and knowledge-based features from vocal breath sounds as well as the classifier complexity for gender classification. We also explore the importance of the location and duration of breath signal segments to be used for automatic classification. Experiments with 54.23 minutes of male and 51.83 minutes of female breath sounds reveal that knowledge-based features, namely MFCC statistics, with low-complexity classifier perform comparably to the data-driven features with classifiers of higher complexity. Breath segments with an average duration of 3 seconds are found to be the best choice irrespective of the location which avoids the need for breath cycle boundary annotation. △ Less

Submitted 25 May, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

Comments: Some updates in the paper. Will new version after updares

arXiv:2210.16881 [pdf, other]

Real-Time MRI Video synthesis from time aligned phonemes with sequence-to-sequence networks

Authors: Sathvik Udupa, Prasanta Kumar Ghosh

Abstract: Real-Time Magnetic resonance imaging (rtMRI) of the midsagittal plane of the mouth is of interest for speech production research. In this work, we focus on estimating utterance level rtMRI video from the spoken phoneme sequence. We obtain time-aligned phonemes from forced alignment, to obtain frame-level phoneme sequences which are aligned with rtMRI frames. We propose a sequence-to-sequence learn… ▽ More Real-Time Magnetic resonance imaging (rtMRI) of the midsagittal plane of the mouth is of interest for speech production research. In this work, we focus on estimating utterance level rtMRI video from the spoken phoneme sequence. We obtain time-aligned phonemes from forced alignment, to obtain frame-level phoneme sequences which are aligned with rtMRI frames. We propose a sequence-to-sequence learning model with a transformer phoneme encoder and convolutional frame decoder. We then modify the learning by using intermediary features obtained from sampling from a pretrained phoneme-conditioned variational autoencoder (CVAE). We train on 8 subjects in a subject-specific manner and demonstrate the performance with a subjective test. We also use an auxiliary task of air tissue boundary (ATB) segmentation to obtain the objective scores on the proposed models. We show that the proposed method is able to generate realistic rtMRI video for unseen utterances, and adding CVAE is beneficial for learning the sequence-to-sequence map** for subjects where the map** is hard to learn. △ Less

Submitted 30 October, 2022; originally announced October 2022.

Comments: submitted to ICASSP 2023

arXiv:2210.16871 [pdf, other]

Improved acoustic-to-articulatory inversion using representations from pretrained self-supervised learning models

Authors: Sathvik Udupa, Siddarth C, Prasanta Kumar Ghosh

Abstract: In this work, we investigate the effectiveness of pretrained Self-Supervised Learning (SSL) features for learning the map** for acoustic to articulatory inversion (AAI). Signal processing-based acoustic features such as MFCCs have been predominantly used for the AAI task with deep neural networks. With SSL features working well for various other speech tasks such as speech recognition, emotion c… ▽ More In this work, we investigate the effectiveness of pretrained Self-Supervised Learning (SSL) features for learning the map** for acoustic to articulatory inversion (AAI). Signal processing-based acoustic features such as MFCCs have been predominantly used for the AAI task with deep neural networks. With SSL features working well for various other speech tasks such as speech recognition, emotion classification, etc., we experiment with its efficacy for AAI. We train on SSL features with transformer neural networks-based AAI models of 3 different model complexities and compare its performance with MFCCs in subject-specific (SS), pooled and fine-tuned (FT) configurations with data from 10 subjects, and evaluate with correlation coefficient (CC) score on the unseen sentence test set. We find that acoustic feature reconstruction objective-based SSL features such as TERA and DeCoAR work well for AAI, with SS CCs of these SSL features reaching close to the best FT CCs of MFCC. We also find the results consistent across different model sizes. △ Less

Submitted 30 October, 2022; originally announced October 2022.

Comments: submitted to ICASSP 2023

arXiv:2203.06004 [pdf, other]

An error correction scheme for improved air-tissue boundary in real-time MRI video for speech production

Authors: Anwesha Roy, Varun Belagali, Prasanta Kumar Ghosh

Abstract: The best performance in Air-tissue boundary (ATB) segmentation of real-time Magnetic Resonance Imaging (rtMRI) videos in speech production is known to be achieved by a 3-dimensional convolutional neural network (3D-CNN) model. However, the evaluation of this model, as well as other ATB segmentation techniques reported in the literature, is done using Dynamic Time War** (DTW) distance between the… ▽ More The best performance in Air-tissue boundary (ATB) segmentation of real-time Magnetic Resonance Imaging (rtMRI) videos in speech production is known to be achieved by a 3-dimensional convolutional neural network (3D-CNN) model. However, the evaluation of this model, as well as other ATB segmentation techniques reported in the literature, is done using Dynamic Time War** (DTW) distance between the entire original and predicted contours. Such an evaluation measure may not capture local errors in the predicted contour. Careful analysis of predicted contours reveals errors in regions like the velum part of contour1 (ATB comprising of upper lip, hard palate, and velum) and tongue base section of contour2 (ATB covering jawline, lower lip, tongue base, and epiglottis), which are not captured in a global evaluation metric like DTW distance. In this work, we automatically detect such errors and propose a correction scheme for the same. We also propose two new evaluation metrics for ATB segmentation separately in contour1 and contour2 to explicitly capture two types of errors in these contours. The proposed detection and correction strategies result in an improvement of these two evaluation metrics by 61.8% and 61.4% for contour1 and by 67.8% and 28.4% for contour2. Traditional DTW distance, on the other hand, improves by 44.6% for contour1 and 4.0% for contour2. △ Less

Submitted 8 March, 2022; originally announced March 2022.

Comments: accepted for ICASSP 2022

arXiv:2112.04151 [pdf, ps, other]

A study on native American English speech recognition by Indian listeners with varying word familiarity level

Authors: Abhayjeet Singh, Achuth Rao MV, Rakesh Vaideeswaran, Chiranjeevi Yarra, Prasanta Kumar Ghosh

Abstract: In this study, listeners of varied Indian nativities are asked to listen and recognize TIMIT utterances spoken by American speakers. We have three kinds of responses from each listener while they recognize an utterance: 1. Sentence difficulty ratings, 2. Speaker difficulty ratings, and 3. Transcription of the utterance. From these transcriptions, word error rate (WER) is calculated and used as a m… ▽ More In this study, listeners of varied Indian nativities are asked to listen and recognize TIMIT utterances spoken by American speakers. We have three kinds of responses from each listener while they recognize an utterance: 1. Sentence difficulty ratings, 2. Speaker difficulty ratings, and 3. Transcription of the utterance. From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.The sentences selected in this study are categorized into three groups: Easy, Medium and Hard, based on the frequency ofoccurrence of the words in them. We observe that the sentence, speaker difficulty ratings and the WERs increase from easy to hard categories of sentences. We also compare the human speech recognition performance with that using three automatic speech recognition (ASR) under following three combinations of acoustic model (AM) and language model(LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text, ASR2) AM using recordings from native American speakers and LM built ontext from LIBRI speech corpus, and ASR3) AM using recordings from native American speakers and LM build on LIBRI speech and TIMIT text. We observe that HSR performance is similar to that of ASR1 whereas ASR3 achieves the best performance. Speaker nativity wise analysis shows that utterances from speakers of some nativity are more difficult to recognize by Indian listeners compared to few other nativities △ Less

Submitted 8 December, 2021; originally announced December 2021.

Comments: 6 pages, 5 figues, COCOSDA 2021

arXiv:2106.00639 [pdf, other]

Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms

Authors: Srikanth Raj Chetupalli, Prashant Krishnan, Neeraj Sharma, Ananya Muguli, Rohit Kumar, Viral Nanda, Lancelot Mark Pinto, Prasanta Kumar Ghosh, Sriram Ganapathy

Abstract: The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostic using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, are recorded using a… ▽ More The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostic using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, are recorded using a web-application over a period of ten months. We investigate the use of statistical descriptors of simple time-frequency features for acoustic signals and binary features for the presence of symptoms. Unlike previous works, we primarily focus on the application of simple linear classifiers like logistic regression and support vector machines for acoustic data while decision tree models are employed on the symptoms data. We show that a multi-modal integration of acoustics and symptoms classifiers achieves an area-under-curve (AUC) of 92.40, a significant improvement over any individual modality. Several ablation experiments are also provided which highlight the acoustic and symptom dimensions that are important for the task of COVID-19 diagnostics. △ Less

Submitted 5 June, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

Comments: The Manuscript is submitted to IEEE-EMBS Journal of Biomedical and Health Informatics on June 1, 2021

arXiv:2104.05017 [pdf, other]

Estimating articulatory movements in speech production with transformer networks

Authors: Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Abstract: We estimate articulatory movements in speech production from different modalities - acoustics and phonemes. Acoustic-to articulatory inversion (AAI) is a sequence-to-sequence task. On the other hand, phoneme to articulatory (PTA) motion estimation faces a key challenge in reliably aligning the text and the articulatory movements. To address this challenge, we explore the use of a transformer archi… ▽ More We estimate articulatory movements in speech production from different modalities - acoustics and phonemes. Acoustic-to articulatory inversion (AAI) is a sequence-to-sequence task. On the other hand, phoneme to articulatory (PTA) motion estimation faces a key challenge in reliably aligning the text and the articulatory movements. To address this challenge, we explore the use of a transformer architecture - FastSpeech, with explicit duration modelling to learn hard alignments between the phonemes and articulatory movements. We also train a transformer model on AAI. We use correlation coefficient (CC) and root mean squared error (rMSE) to assess the estimation performance in comparison to existing methods on both tasks. We observe 154%, 11.8% & 4.8% relative improvement in CC with subject-dependent, pooled and fine-tuning strategies, respectively, for PTA estimation. Additionally, on the AAI task, we obtain 1.5%, 3% and 3.1% relative gain in CC on the same setups compared to the state-of-the-art baseline. We further present the computational benefits of having transformer architecture as representation blocks. △ Less

Submitted 12 June, 2021; v1 submitted 11 April, 2021; originally announced April 2021.

Comments: accepted for oral presentation at INTERSPEECH 2021

arXiv:2104.00235 [pdf, ps, other]

doi 10.21437/Interspeech.2021-1339

Multilingual and code-switching ASR challenges for low resource Indian languages

Authors: Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, Tejaswi Seeram, Basil Abraham

Abstract: Recently, there is increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of low amounts of labeled corpora in multiple languages. With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple language… ▽ More Recently, there is increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of low amounts of labeled corpora in multiple languages. With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics, linguistic characteristics as well as the amount of data available and how these are carefully considered in building the ASR system. In this challenge, we would like to focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English. We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively. △ Less

Submitted 31 March, 2021; originally announced April 2021.

Comments: 6 pages

arXiv:2103.09148 [pdf, other]

DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics

Authors: Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda

Abstract: The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. This challenge is an open call for researchers to analyze a dataset of sound recordings collected from COVID-19 infected and non-COVID-19 individuals for a two-class classification. These… ▽ More The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. This challenge is an open call for researchers to analyze a dataset of sound recordings collected from COVID-19 infected and non-COVID-19 individuals for a two-class classification. These recordings were collected via crowdsourcing from multiple countries, through a website application. The challenge features two tracks, one focusing on cough sounds, and the other on using a collection of breath, sustained vowel phonation, and number counting speech recordings. In this paper, we introduce the challenge and provide a detailed description of the task, and present a baseline system for the task. △ Less

Submitted 17 June, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

Comments: To appear in Proceedings of Interspeech, 2021

arXiv:2006.11536 [pdf, other]

Speaker conditioned acoustic-to-articulatory inversion using x-vectors

Authors: Aravind Illa, Prasanta Kumar Ghosh

Abstract: Speech production involves the movement of various articulators, including tongue, jaw, and lips. Estimating the movement of the articulators from the acoustics of speech is known as acoustic-to-articulatory inversion (AAI). Recently, it has been shown that instead of training AAI in a speaker specific manner, pooling the acoustic-articulatory data from multiple speakers is beneficial. Further, ad… ▽ More Speech production involves the movement of various articulators, including tongue, jaw, and lips. Estimating the movement of the articulators from the acoustics of speech is known as acoustic-to-articulatory inversion (AAI). Recently, it has been shown that instead of training AAI in a speaker specific manner, pooling the acoustic-articulatory data from multiple speakers is beneficial. Further, additional conditioning with speaker specific information by one-hot encoding at the input of AAI along with acoustic features benefits the AAI performance in a closed-set speaker train and test condition. In this work, we carry out an experimental study on the benefit of using x-vectors for providing speaker specific information to condition AAI. Experiments with 30 speakers have shown that the AAI performance benefits from the use of x-vectors in a closed set seen speaker condition. Further, x-vectors also generalizes well for unseen speaker evaluation. △ Less

Submitted 20 June, 2020; originally announced June 2020.

arXiv:2006.03107 [pdf, other]

Attention and Encoder-Decoder based models for transforming articulatory movements at different speaking rates

Authors: Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Abstract: While speaking at different rates, articulators (like tongue, lips) tend to move differently and the enunciations are also of different durations. In the past, affine transformation and DNN have been used to transform articulatory movements from neutral to fast(N2F) and neutral to slow(N2S) speaking rates [1]. In this work, we improve over the existing transformation techniques by modeling rate sp… ▽ More While speaking at different rates, articulators (like tongue, lips) tend to move differently and the enunciations are also of different durations. In the past, affine transformation and DNN have been used to transform articulatory movements from neutral to fast(N2F) and neutral to slow(N2S) speaking rates [1]. In this work, we improve over the existing transformation techniques by modeling rate specific durations and their transformation using AstNet, an encoder-decoder framework with attention. In the current work, we propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories. For modeling duration variations across speaking rates, we deploy attention network, which eliminates the needto align trajectories in different rates using DTW. We performa phoneme specific duration analysis to examine how well duration is transformed using the proposed AstNet. As the range of articulatory motions is correlated with speaking rate, we also analyze amplitude of the transformed articulatory movements at different rates compared to their original counterparts, to examine how well the proposed AstNet predicts the extent of articulatory movements in N2F and N2S. We observe that AstNet could model both duration and extent of articulatory movements better than the existing transformation techniques resulting in more accurate transformed articulatory trajectories. △ Less

Submitted 20 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: 5 pages, 4 figures, InterSpeech 2020

arXiv:2005.10548 [pdf, other]

doi 10.21437/Interspeech.2020-2768

Coswara -- A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis

Authors: Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, Sriram Ganapathy

Abstract: The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold standard method for COVID-19 detection is the reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for… ▽ More The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold standard method for COVID-19 detection is the reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for an alternate diagnosis tool which overcomes these limitations, and is deployable at a large scale. The prominent symptoms of COVID-19 include cough and breathing difficulties. We foresee that respiratory sounds, when analyzed using machine learning techniques, can provide useful insights, enabling the design of a diagnostic tool. Towards this, the paper presents an early effort in creating (and analyzing) a database, called Coswara, of respiratory sounds, namely, cough, breath, and voice. The sound samples are collected via worldwide crowdsourcing using a website application. The curated dataset is released as open access. As the pandemic is evolving, the data collection and analysis is a work in progress. We believe that insights from analysis of Coswara can be effective in enabling sound based technology solutions for point-of-care diagnosis of respiratory infection, and in the near future this can help to diagnose COVID-19. △ Less

Submitted 11 August, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: A description of Coswara dataset to evaluate COVID-19 diagnosis using respiratory sounds

arXiv:1910.14375 [pdf, other]

A comparative study of estimating articulatory movements from phoneme sequences and acoustic features

Authors: Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Abstract: Unlike phoneme sequences, movements of speech articulators (lips, tongue, jaw, velum) and the resultant acoustic signal are known to encode not only the linguistic message but also carry para-linguistic information. While several works exist for estimating articulatory movement from acoustic signals, little is known to what extent articulatory movements can be predicted only from linguistic inform… ▽ More Unlike phoneme sequences, movements of speech articulators (lips, tongue, jaw, velum) and the resultant acoustic signal are known to encode not only the linguistic message but also carry para-linguistic information. While several works exist for estimating articulatory movement from acoustic signals, little is known to what extent articulatory movements can be predicted only from linguistic information, i.e., phoneme sequence. In this work, we estimate articulatory movements from three different input representations: R1) acoustic signal, R2) phoneme sequence, R3) phoneme sequence with timing information. While an attention network is used for estimating articulatory movement in the case of R2, BLSTM network is used for R1 and R3. Experiments with ten subjects' acoustic-articulatory data reveal that the estimation techniques achieve an average correlation coefficient of 0.85, 0.81, and 0.81 in the case of R1, R2, and R3 respectively. This indicates that attention network, although uses only phoneme sequence (R2) without any timing information, results in an estimation performance similar to that using rich acoustic signal (R1), suggesting that articulatory motion is primarily driven by the linguistic message. The correlation coefficient is further improved to 0.88 when R1 and R3 are used together for estimating articulatory movements. △ Less

Submitted 19 February, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

Comments: 5 pages, 5 figures, accepted in ICASSP 2020

Showing 1–18 of 18 results for author: Ghosh, P K