Search | arXiv e-print repository

A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system

Authors: Sunil Kumar Kopparapu, Ashish Panda

Abstract: Unlike hybrid speech recognition systems, where the use of tokens was restricted to phones, biphones or triphones, the choice of tokens in end-to-end ASR systems is derived from the text corpus of the training data. The use of tokenization algorithms like Byte Pair Encoding (BPE) and WordPiece is popular in identifying the tokens that are used in the overall training process of the speech recognition system. Popular toolkits, like ESPNet, use a pre-defined vocabulary size (number of tokens) for these tokenization algorithms, but there is no discussion on how the vocabulary size was derived. In this paper, we build a cost function, assuming the tokenization process to be a black-box, to enable choosing the number of tokens which might most benefit building an end-to-end ASR. We show through experiments on the LibriSpeech 100 hour set that the performance of an end-to-end ASR system improves when the number of tokens is chosen carefully.

Submitted 29 April, 2024; originally announced June 2024.

Comments: 5 pages, 4 figures

arXiv:2306.08012 [pdf, other]

A Novel Scheme to classify Read and Spontaneous Speech

Authors: Sunil Kumar Kopparapu

Abstract: The COVID-19 pandemic has led to an increased use of remote telephonic interviews, making it important to distinguish between scripted and spontaneous speech in audio recordings. In this paper, we propose a novel scheme for identifying read and spontaneous speech. Our approach uses a pre-trained DeepSpeech audio-to-alphabet recognition engine to generate a sequence of alphabets from the audio. From these alphabets, we derive features that allow us to discriminate between read and spontaneous speech. Our experimental results show that even a small set of self-explanatory features can classify the two types of speech very effectively.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 14 pages, 8 figures

arXiv:2304.03169

Selective Data Augmentation for Robust Speech Translation

Authors: Rajul Acharya, Ashish Panda, Sunil Kumar Kopparapu

Abstract: Speech translation (ST) systems translate speech in one language to text in another language. End-to-end ST systems (e2e-ST) have gained popularity over cascade systems because of their enhanced performance due to reduced latency and computational cost. Though resource intensive, e2e-ST systems have the inherent ability to retain para- and non-linguistic characteristics of the speech, unlike cascade systems. In this paper, we propose to use an e2e architecture for English-Hindi (en-hi) ST. We use two imperfect machine translation (MT) services to translate Libri-trans en text into hi text. While each service gives MT data individually to generate parallel ST data, we propose a data augmentation strategy of noisy MT data to aid robust ST. The main contribution of this paper is the proposal of a data augmentation strategy. We show that this results in better ST (BLEU score) compared to brute force augmentation of MT data. We observed an absolute improvement of 1.59 BLEU score with our approach.

Submitted 25 April, 2023; v1 submitted 22 March, 2023; originally announced April 2023.

Comments: Did not realize that the experiments and the analysis based on the experiments were incomplete

arXiv:2210.06354 [pdf, other]

Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity

Authors: Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into a natural language (NL) text that describes the audio events, source of the events and their relationships. Unlike NL text generation tasks, which rely on metrics like BLEU, ROUGE, METEOR based on lexical semantics for evaluation, the AAC evaluation metric requires an ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used for evaluation of AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in NL text and image captioning literature.

Submitted 3 October, 2022; originally announced October 2022.

Comments: 9 pages, 8 figures

arXiv:2203.13259 [pdf, other]

Computing Optimal Location of Microphone for Improved Speech Recognition

Authors: Karan Nathwani, Bhavya Dixit, Sunil Kumar Kopparapu

Abstract: It was shown in our earlier work that the measurement error in the microphone position affected the room impulse response (RIR), which in turn affected the single-channel close microphone and multi-channel distant microphone speech recognition. In this paper, as an extension, we systematically study how to identify the optimal location of the microphone, given an approximate and hence erroneous location of the microphone in 3D space. The primary idea is to use a Monte-Carlo technique to generate a large number of random microphone positions around the erroneous microphone position and select the microphone position that results in the best performance of a general purpose automatic speech recognition (gp-asr) system. We experiment with clean and noisy speech and show that the optimal location of the microphone is unique and is affected by noise.

Submitted 24 March, 2022; originally announced March 2022.

Comments: 5 pages

arXiv:2202.03271 [pdf, ps, other]

Spectro Temporal EEG Biomarkers For Binary Emotion Classification

Authors: Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Electroencephalogram (EEG) is one of the most reliable physiological signals for emotion detection. Being non-stationary in nature, EEGs are better analysed by spectro temporal representations. Standard features like Discrete Wavelet Transformation (DWT) can represent temporal changes in spectral dynamics of an EEG, but are insufficient to extract information the other way around, i.e. spectral changes in temporal dynamics. On the other hand, Empirical mode decomposition (EMD) based features can be useful to bridge the above mentioned gap. Towards this direction, we extract two novel features on top of EMD, namely, (a) marginal Hilbert spectrum (MHS) and (b) Holo-Hilbert spectral analysis (HHSA) based on EMD, to better represent emotions in 2D arousal-valence (A-V) space. The usefulness of these features for EEG emotion classification is investigated through extensive experiments using state-of-the-art classifiers. In addition, experiments conducted on the DEAP dataset for binary emotion classification in both A-V space reveal the efficacy of the proposed features over the standard set of temporal and spectral features.

Submitted 2 February, 2022; originally announced February 2022.

arXiv:2201.12352 [pdf, other]

Automatic Audio Captioning using Attention weighted Event based Embeddings

Authors: Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Automatic Audio Captioning (AAC) refers to the task of translating audio into a natural language that describes the audio events, source of the events and their relationships. The limited number of samples in current AAC datasets has set up a trend to incorporate transfer learning with Audio Event Detection (AED) as a parent task. Towards this direction, in this paper, we propose an encoder-decoder architecture with light-weight (i.e. with fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature with computationally intensive architectures. Further, we provide evidence of the ability of the non-uniform attention weighted encoding generated as a part of our model to help the decoder glance over specific sections of the audio while generating each token.

Submitted 28 January, 2022; originally announced January 2022.

arXiv:2201.09470 [pdf, other]

Synthetic speech detection using meta-learning with prototypical loss

Authors: Monisankha Pal, Aditya Raikar, Ashish Panda, Sunil Kumar Kopparapu

Abstract: Recent works on speech spoofing countermeasures still lack generalization ability to unseen spoofing attacks. This is one of the key issues of ASVspoof challenges especially with the rapid development of diverse and high-quality spoofing algorithms. In this work, we address the generalizability of spoofing detection by proposing prototypical loss under the meta-learning paradigm to mimic the unseen test scenario during training. Prototypical loss with metric-learning objectives can learn the embedding space directly and emerges as a strong alternative to prevailing classification loss functions. We propose an anti-spoofing system based on squeeze-excitation Residual network (SE-ResNet) architecture with prototypical loss. We demonstrate that the proposed single system without any data augmentation can achieve competitive performance to the recent best anti-spoofing systems on ASVspoof 2019 logical access (LA) task. Furthermore, the proposed system with data augmentation outperforms the ASVspoof 2021 challenge best baseline both in the progress and evaluation phase of the LA task. On ASVspoof 2019 and 2021 evaluation set LA scenario, we attain a relative 68.4% and 3.6% improvement in min-tDCF compared to the challenge best baselines, respectively.

Submitted 24 January, 2022; originally announced January 2022.

arXiv:2103.13823 [pdf, ps, other]

A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios

Authors: Ayush Tripathi, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Imbalance in the proportion of training samples belonging to different classes often degrades the performance of conventional classifiers. This is primarily due to the tendency of the classifier to be biased towards the majority classes in the imbalanced dataset. In this paper, we propose a novel three step technique to address imbalanced data. As a first step, we significantly oversample the minority class distribution by employing the traditional Synthetic Minority OverSampling Technique (SMOTE) algorithm using the neighborhood of the minority class samples, and in the next step we partition the generated samples using a Gaussian-Mixture Model based clustering algorithm. In the final step, synthetic data samples are chosen based on the weight associated with the cluster, the weight itself being determined by the distribution of the majority class samples. Extensive experiments on several standard datasets from diverse domains show the usefulness of the proposed technique in comparison with the original SMOTE and its state-of-the-art variants.

Submitted 26 March, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: 8 pages

Journal ref: ICPR 2020

arXiv:2103.06157 [pdf, other]

doi 10.1016/j.csl.2021.101213

Automatic Speaker Independent Dysarthric Speech Intelligibility Assessment System

Authors: Ayush Tripathi, Swapnil Bhosale, Sunil Kumar Kopparapu

Abstract: Dysarthria is a condition which hampers the ability of an individual to control the muscles that play a major role in speech delivery. The loss of fine control over muscles that assist the movement of lips, vocal cords, tongue and diaphragm results in abnormal speech delivery. One can assess the severity level of dysarthria by analyzing the intelligibility of speech spoken by an individual. Continuous intelligibility assessment helps speech language pathologists not only study the impact of medication but also allows them to plan personalized therapy. It helps the clinicians immensely if the intelligibility assessment system is reliable, automatic, simple for (a) patients to undergo and (b) clinicians to interpret. Lack of availability of dysarthric data has resulted in the development of speaker dependent automatic intelligibility assessment systems which require patients to speak a large number of utterances. In this paper, we propose (a) a cost minimization procedure to select an optimal (small) number of utterances that need to be spoken by the dysarthric patient, (b) four different speaker independent intelligibility assessment systems which require the patient to speak a small number of words, and (c) the assessment score is close to the perceptual score that the Speech Language Pathologist (SLP) can relate to. The need for a small number of utterances to be spoken by the patient and the score being relatable to the SLP benefits both the dysarthric patient and the clinician from a usability perspective.

Submitted 10 March, 2021; originally announced March 2021.

Comments: 29 pages, 2 figures, Computer Speech & Language 2021

arXiv:2102.08074 [pdf, other]

Semi Supervised Learning For Few-shot Audio Classification By Episodic Triplet Mining

Authors: Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Few-shot learning aims to generalize to unseen classes that appear during testing but are unavailable during training. Prototypical networks incorporate few-shot metric learning by constructing a class prototype in the form of a mean vector of the embedded support points within a class. The performance of prototypical networks in extreme few-shot scenarios (like one-shot) degrades drastically, mainly due to the desuetude of variations within the clusters while constructing prototypes. In this paper, we propose to replace the typical prototypical loss function with an Episodic Triplet Mining (ETM) technique. The conventional triplet selection leads to overfitting, because of all possible combinations being used during training. We incorporate episodic training for mining the semi hard positive and the semi hard negative triplets to overcome the overfitting. We also propose an adaptation to make use of unlabeled training samples for better modeling. Experiments on two different audio processing tasks, namely speaker recognition and audio event detection, show improved performance and hence the efficacy of ETM over the prototypical loss function and other meta-learning frameworks. Further, we show improved performance when unlabeled training samples are used.

Submitted 16 February, 2021; originally announced February 2021.

Comments: 5 pages

arXiv:2002.12788 [pdf, other]

Identification of Dementia Using Audio Biomarkers

Authors: Rupayan Chakraborty, Meghna Pandharipande, Chitralekha Bhat, Sunil Kumar Kopparapu

Abstract: Dementia is a syndrome, generally of a chronic nature, characterized by a deterioration in cognitive function, especially in the geriatric population, and is severe enough to impact their daily activities. Early diagnosis of dementia is essential to provide timely treatment to alleviate the effects and sometimes to slow the progression of dementia. Speech has been known to provide an indication of a person's cognitive state. The objective of this work is to use speech processing and machine learning techniques to automatically identify the stage of dementia, such as mild cognitive impairment (MCI) or Alzheimer's disease (AD). Non-linguistic acoustic parameters are used for this purpose, making this a language independent approach. We analyze the patients' audio excerpts from clinician-participant conversations taken from the Pitt corpus of the DementiaBank database, to identify the speech parameters that best distinguish between MCI, AD and healthy (HC) speech. We analyze the contribution of various types of acoustic features, such as spectral, temporal and cepstral, along with their feature-level fusion and selection, towards the identification of dementia stage. Additionally, we compare the performance of using feature-level fusion and score-level fusion. An accuracy of 82% is achieved using score-level fusion, with an absolute improvement of 5% over feature-level fusion.

Submitted 27 February, 2020; originally announced February 2020.

Comments: 5 pages, 3 figures

arXiv:1912.11151 [pdf, other]

A Cycle-GAN Approach to Model Natural Perturbations in Speech for ASR Applications

Authors: Sri Harsha Dumpala, Imran Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Naturally introduced perturbations in the audio signal, caused by emotional and physical states of the speaker, can significantly degrade the performance of Automatic Speech Recognition (ASR) systems. In this paper, we propose a front-end based on Cycle-Consistent Generative Adversarial Network (CycleGAN) which transforms naturally perturbed speech into normal speech, and hence improves the robustness of an ASR system. The CycleGAN model is trained on non-parallel examples of perturbed and normal speech. Experiments on spontaneous laughter-speech and creaky-speech datasets show that the performance of four different ASR systems improves by using speech obtained from the CycleGAN based front-end, as compared to directly using the original perturbed speech. Visualization of the features of the laughter perturbed speech and those generated by the proposed front-end further demonstrates the effectiveness of our approach.

Submitted 18 December, 2019; originally announced December 2019.

Comments: 7 pages, 3 figures, ICASSP-2019

arXiv:1911.01421 [pdf, ps, other]

A Deep Learning approach for Hindi Named Entity Recognition

Authors: Bansi Shah, Sunil Kumar Kopparapu

Abstract: Named Entity Recognition is one of the most important text processing requirements in many NLP tasks. In this paper we use a deep architecture to accomplish the task of recognizing named entities in a given Hindi text sentence. Bidirectional Long Short Term Memory (BiLSTM) based techniques have been used for the NER task in literature. In this paper, we first tune a BiLSTM in a low-resource scenario to work for Hindi NER and propose two enhancements, namely (a) de-noising auto-encoder (DAE) LSTM and (b) conditioning LSTM, which show improvement in the NER task compared to the BiLSTM approach. We use pre-trained word embeddings to represent the words in the corpus, and the NER tags of the words are as defined by the annotated corpora used. Experiments have been performed to analyze the performance of different word embeddings and batch sizes, which is essential for training deep models.

Submitted 5 November, 2019; originally announced November 2019.

Comments: 7 pages; work done during internship at TCS

arXiv:1712.05608 [pdf, other]

A Novel Approach for Effective Learning in Low Resourced Scenarios

Authors: Sri Harsha Dumpala, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Deep learning based discriminative methods, being the state-of-the-art machine learning techniques, are ill-suited for learning from lower amounts of data. In this paper, we propose a novel framework, called simultaneous two sample learning (s2sL), to effectively learn the class discriminative characteristics, even from a very low amount of data. In s2sL, more than one sample (here, two samples) are simultaneously considered to both train and test the classifier. We demonstrate our approach for speech/music discrimination and emotion classification through experiments. Further, we also show the effectiveness of the s2sL approach for classification in low-resource scenarios and for imbalanced data.

Submitted 15 December, 2017; originally announced December 2017.

Comments: Presented at NIPS 2017 Machine Learning for Audio Signal Processing (ML4Audio) Workshop, Dec. 2017

arXiv:1710.06923 [pdf, other]

Adapting general-purpose speech recognition engine output for domain-specific natural language question answering

Authors: C. Anantaram, Sunil Kumar Kopparapu

Abstract: Speech-based natural language question-answering interfaces to enterprise systems are gaining a lot of attention. General-purpose speech engines can be integrated with NLP systems to provide such interfaces. Usually, general-purpose speech engines are trained on a large 'general' corpus. However, when such engines are used for specific domains, they may not recognize domain-specific words well, and may produce erroneous output. Further, the accent and the environmental conditions in which the speaker speaks a sentence may induce the speech engine to inaccurately recognize certain words. The subsequent natural language question-answering does not produce the requisite results as the question does not accurately represent what the speaker intended. Thus, the speech engine's output may need to be adapted for a domain before further natural language processing is carried out. We present two mechanisms for such an adaptation, one based on evolutionary development and the other based on machine learning, and show how we can repair the speech-output to make the subsequent natural language question-answering better.

Submitted 12 October, 2017; originally announced October 2017.

Comments: 20 pages

arXiv:1705.09289 [pdf, other]

Improved I-vector-based Speaker Recognition for Utterances with Speaker Generated Non-speech sounds

Authors: Sri Harsha Dumpala, Ashish Panda, Sunil Kumar Kopparapu

Abstract: Conversational speech not only contains several variants of neutral speech but is also prominently interlaced with several speaker generated non-speech sounds such as laughter and breath. A robust speaker recognition system should be capable of recognizing a speaker irrespective of these variations in his speech. An understanding of whether the speaker-specific information represented by these variations is similar or not helps build a good speaker recognition system. In this paper, speaker variations captured by neutral speech of a speaker are analyzed by considering speech-laugh (a variant of neutral speech) and laughter (non-speech) sounds of the speaker. We study an i-vector-based speaker recognition system trained only on neutral speech and evaluate its performance on speech-laugh and laughter. Further, we analyze the effect of including laughter sounds during training of an i-vector-based speaker recognition system. Our experimental results show that the inclusion of laughter sounds during training seems to provide complementary speaker-specific information which results in an overall improved performance of the speaker recognition system, especially on the utterances with speech-laugh segments.

Submitted 25 May, 2017; originally announced May 2017.

arXiv:1704.07055 [pdf, other]

k-FFNN: A priori knowledge infused Feed-forward Neural Networks

Authors: Sri Harsha Dumpala, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract: Recurrent neural networks (RNN) are being extensively used over feed-forward neural networks (FFNN) because of their inherent capability to capture temporal relationships that exist in sequential data such as speech. This aspect of RNN is advantageous especially when there is no a priori knowledge about the temporal correlations within the data. However, RNNs require a large amount of data to learn these temporal correlations, limiting their advantage in low resource scenarios. It is not immediately clear (a) how a priori temporal knowledge can be used in a FFNN architecture and (b) how a FFNN performs when provided with this knowledge about temporal correlations (assuming available) during training. The objective of this paper is to explore k-FFNN, namely a FFNN architecture that can incorporate the a priori knowledge of the temporal relationships within the data sequence during training, and compare k-FFNN performance with RNN in a low resource scenario. We evaluate the performance of k-FFNN and RNN by extensive experimentation on MediaEval 2016 audio data ("Emotional Impact of Movies" task). Experimental results show that the performance of k-FFNN is comparable to RNN, and in some scenarios k-FFNN performs better than RNN when temporal knowledge is injected into the FFNN architecture. The main contributions of this paper are (a) fusing a priori knowledge into the FFNN architecture to construct a k-FFNN and (b) analyzing the performance of k-FFNN with respect to RNN for different sizes of training data.

Submitted 24 April, 2017; originally announced April 2017.

arXiv:1601.02605 [pdf, other]

A Mobile Phone based Speech Therapist

Authors: Vinod K. Pandey, Arun Pande, Sunil Kumar Kopparapu

Abstract: Patients with articulatory disorders often have difficulty in speaking. These patients need several speech therapy sessions to enable them to speak normally. These therapy sessions are conducted by a specialized speech therapist. The goal of speech therapy is to develop good speech habits as well as to teach how to articulate sounds the right way. Speech therapy is critical for continuous improvement to regain normal speech. Speech therapy sessions require a patient to travel to a hospital or a speech therapy center for extended periods of time regularly; this makes the process of speech therapy not only time consuming but also very expensive. Additionally, there is a severe shortage of trained speech therapists around the globe in general and in developing countries in particular. In this paper, we propose a low cost mobile speech therapist, a system that enables speech therapy using a mobile phone, which eliminates the need for the patient to travel frequently to a speech therapist in a faraway hospital. The proposed system, which is being built, enables both synchronous and asynchronous interaction between the speech therapist and the patient anytime, anywhere.

Submitted 11 January, 2016; originally announced January 2016.

Comments: 6 pages, 6 figures, SimPe. [2011] Remote Speech Therapist Vinod Pandey, Arun Pande, Sunil Kopparapu SiMPE 2011, Stockholm, Sweden, Aug 2011

arXiv:1601.02543 [pdf, other]

Evaluating the Performance of a Speech Recognition based System

Authors: Vinod Kumar Pandey, Sunil Kumar Kopparapu

Abstract: Speech based solutions have taken center stage with growth in the services industry where there is a need to cater to a very large number of people from all strata of society. While natural language speech interfaces are the talk of the research community, in practice menu based speech solutions thrive. Typically in a menu based speech solution the user is required to respond by speaking from a closed set of words when prompted by the system. A sequence of human speech responses to the IVR prompts results in the completion of a transaction. A transaction is deemed successful if the speech solution can correctly recognize all the spoken utterances of the user whenever prompted by the system. The usual mechanism to evaluate the performance of a speech solution is to do an extensive test of the system by putting it to use by actual people and then evaluating the performance by analyzing the logs for successful transactions. This kind of evaluation could lead to dissatisfied test users, especially if the performance of the system were to result in a poor transaction completion rate. To negate this, the Wizard of Oz approach is adopted during evaluation of a speech system. Overall, this kind of evaluation is an expensive proposition both in terms of time and cost. In this paper, we propose a method to evaluate the performance of a speech solution without actually putting it to use by people. We first describe the methodology and then show experimentally that this can be used to identify the performance bottlenecks of the speech solution even before the system is actually used, thus saving evaluation time and expenses.

Submitted 11 January, 2016; originally announced January 2016.

Comments: 7 pages, 2 figures, ACC 2011

Journal ref: ACC (3) 2011: 230-238

arXiv:1504.01496 [pdf, other]

doi 10.1007/978-90-481-3658-2_18

Voice based self help System: User Experience Vs Accuracy

Authors: Sunil Kumar Kopparapu

Abstract: In general, self help systems are being increasingly deployed by service based industries because they are capable of delivering better customer service, and increasingly the switch is to voice based self help systems because they provide a natural interface for a human to interact with a machine. A speech based self help system ideally needs a speech recognition engine to convert spoken speech to text and, in addition, a language processing engine to take care of any misrecognitions by the speech recognition engine. Any off-the-shelf speech recognition engine is generally a combination of acoustic processing and speech grammar. While this is the norm, we believe that ideally a speech recognition application should have, in addition to a speech recognition engine, a separate language processing engine to give the system better performance. In this paper, we discuss ways in which the speech recognition engine and the language processing engine can be combined to give a better user experience.

Submitted 7 April, 2015; originally announced April 2015.

Comments: 5 pages; 1 figure

arXiv:1504.01488 [pdf, other]

On-line Handwritten Devanagari Character Recognition using Fuzzy Directional Features

Authors: Sunil Kumar Kopparapu, Lajish VL

Abstract: This paper describes a new feature set for use in the recognition of on-line handwritten Devanagari script based on Fuzzy Directional Features. Experiments are conducted for the automatic recognition of isolated handwritten character primitives (sub-character units). Initially we describe the proposed feature set, called the Fuzzy Directional Features (FDF), and then show how these features can be effectively utilized for writer independent character recognition. Experimental results show that the FDF set performs well for a writer independent data set at stroke level recognition. The main contribution of this paper is the introduction of a novel feature set and the experimental establishment of its ability to recognize handwritten Devanagari script.

Submitted 7 April, 2015; originally announced April 2015.

Comments: 6 pages; 2009

arXiv:1504.01476 [pdf, other]

Mobile Phone Based Vehicle License Plate Recognition for Road Policing

Authors: Lajish V. L., Sunil Kumar Kopparapu

Abstract: Identification of a vehicle is generally done by traffic police through the vehicle license plate. Automatic vehicle license plate recognition has several applications in intelligent traffic management systems. The security situation across the globe, and particularly in India, demands a need to equip the traffic police with a system that enables them to get instant details of a vehicle. The system should be easy to use, should be mobile, and work 24 x 7. In this paper, we describe a mobile phone based, client-server architected, license plate recognition system. While we use state of the art image processing and pattern recognition algorithms tuned for Indian conditions to automatically recognize non-uniform license plates, the main contribution is in creating an end to end usable solution. The client application runs on a mobile device and a server application, with access to a vehicle information database, is hosted centrally. The solution captures the license plate image with the phone camera and passes it to the server; on the server the license plate number is recognized; the data associated with the number plate is then sent back to the mobile device, instantaneously. We describe the end to end system architecture in detail. A working prototype of the proposed system has been implemented in the lab environment.

Submitted 7 April, 2015; originally announced April 2015.

Comments: 7 pages; PReMI Experiential Workshop, Delhi

arXiv:1503.07284 [pdf]

doi 10.1109/ICSIP.2010.5697471

A Rule-Based Short Query Intent Identification System

Authors: Arijit De, Sunil Kumar Kopparapu

Abstract: Using SMS (Short Message System), cell phones can be used to query for information about various topics. In an SMS based search system, one of the key problems is to identify a domain (broad topic) associated with the user query, so that a more comprehensive search can be carried out by the domain specific search engine. In this paper we use a rule based approach, called Short Query Intent Identification System (SQIIS), to identify the domain. We construct two different rule-bases using different strategies to suit query intent identification. We evaluate the two rule-bases experimentally.

Submitted 25 March, 2015; originally announced March 2015.

Comments: 5 pages, 2010 International Conference on Signal and Image Processing (ICSIP)

arXiv:1501.02887 [pdf, other]

Online Handwritten Devanagari Stroke Recognition Using Extended Directional Features

Authors: Lajish VL, Sunil Kumar Kopparapu

Abstract: This paper describes a new feature set, called the extended directional features (EDF), for use in the recognition of online handwritten strokes. We use EDF specifically to recognize strokes that form a basis for producing Devanagari script, which is the most widely used Indian language script. It should be noted that stroke recognition in handwritten script is equivalent to phoneme recognition in speech signals and is generally very poor and of the order of 20% for singing voice. Experiments are conducted for the automatic recognition of isolated handwritten strokes. Initially we describe the proposed feature set, namely EDF, and then show how this feature can be effectively utilized for writer independent script recognition through stroke recognition. Experimental results show that the extended directional feature set performs well, with about 65+% stroke level recognition accuracy for a writer independent data set.

Submitted 11 January, 2015; originally announced January 2015.

Comments: 8th International Conference on Signal Processing and Communication Systems 15 - 17 December 2014, Gold Coast, Australia

arXiv:1410.7382 [pdf, other]

Modified Mel Filter Bank to Compute MFCC of Subsampled Speech

Authors: Kiran Kumar Bhuvanagiri, Sunil Kumar Kopparapu

Abstract: Mel Frequency Cepstral Coefficients (MFCCs) are the most popularly used speech features in most speech and speaker recognition applications. In this work, we propose a modified Mel filter bank to extract MFCCs from subsampled speech. We also propose a stronger metric which effectively captures the correlation between the MFCCs of original speech and the MFCCs of resampled speech. It is found that the proposed method of filter bank construction performs distinguishably well and gives recognition performance on resampled speech close to recognition accuracies on original speech.

Submitted 25 October, 2014; originally announced October 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1410.6903

arXiv:1410.6909 [pdf, other]

A Framework for On-Line Devanagari Handwritten Character Recognition

Authors: Sunil Kumar Kopparapu, Lajish V. L

Abstract: The main challenge in on-line handwritten character recognition in Indian languages is the large size of the character set, the larger similarity between different characters in the script and the huge variation in writing style. In this paper we propose a framework for on-line handwritten script recognition taking cues from speech signal processing literature. The framework is based on identifying strokes, which in turn lead to recognition of handwritten on-line characters, rather than the conventional character identification. Though the framework is described for Devanagari script, the framework is general and can be applied to any language. The proposed platform consists of pre-processing, feature extraction, recognition and post processing like the conventional character recognition but applied to strokes. The on-line Devanagari character recognition reduces to one of recognizing one of 69 primitives, and recognition of a character is performed by recognizing a sequence of such primitives. We further show the impact of noise removal on on-line raw data, which is usually noisy. The use of Fuzzy Directional Features to enhance the accuracy of stroke recognition is also described. The recognition results are compared with commonly used directional features in literature using several classifiers.

Submitted 25 October, 2014; originally announced October 2014.

Comments: 29 pages

arXiv:1410.6905 [pdf, other]

doi 10.1109/TENCON.2009.5396003

On the use of Stress information in Speech for Speaker Recognition

Authors: Laxmi Narayana M., Sunil Kumar Kopparapu

Abstract: The performance of a speaker recognition system decreases when the speaker is under stress or emotion. In this paper we explore and identify a mechanism that enables use of inherent stress-in-speech or speaking style information present in the speech of a person as additional cues for speaker recognition. We quantify the inherent stress present in the speech of a speaker mainly using 3 features, namely, pitch, amplitude and duration (together called PAD). We experimentally observe that the PAD vectors of similar phones in different words of a speaker are close to each other in the three dimensional (PAD) space, confirming that the way a speaker stresses different syllables in their speech is unique to them. Thus we propose the use of PAD based speaking style of a speaker as an additional feature for speaker recognition applications.

Submitted 25 October, 2014; originally announced October 2014.

arXiv:1410.6903 [pdf, other]

Choice of Mel Filter Bank in Computing MFCC of a Resampled Speech

Authors: Laxmi Narayana M., Sunil Kumar Kopparapu

Abstract: Mel Frequency Cepstral Coefficients (MFCCs) are the most popularly used speech features in most speech and speaker recognition applications. In this paper, we study the effect of resampling a speech signal on these speech features. We first derive a relationship between the MFCC parameters of the resampled speech and the MFCC parameters of the original speech. We propose six methods of calculating the MFCC parameters of downsampled speech by transforming the Mel filter bank used to compute MFCC of the original speech. We then experimentally compute the MFCC parameters of the downsampled speech using the proposed methods and compute the Pearson coefficient between the MFCC parameters of the downsampled speech and that of the original speech to identify the most effective choice of Mel-filter band that enables the computed MFCC of the resampled speech to be as close as possible to the original speech sample MFCC.

Submitted 25 October, 2014; originally announced October 2014.

arXiv:1406.2464 [pdf, other]

doi 10.1109/ISIEA.2010.5679370

Music and Vocal Separation Using Multi-Band Modulation Based Features

Authors: Sunil Kumar Kopparapu, Meghna Pandharipande, G Sita

Abstract: The potential use of non-linear speech features has not been investigated for music analysis although other commonly used speech features like Mel Frequency Cepstral Coefficients (MFCC) and pitch have been used extensively. In this paper, we assume an audio signal to be a sum of modulated sinusoids and then use the energy separation algorithm to decompose the audio into amplitude and frequency modulation components using the non-linear Teager-Kaiser energy operator. We first identify the distribution of these non-linear features for music only and voice only segments in the audio signal in different Mel spaced frequency bands and show that they have the ability to discriminate. The proposed method based on the Kullback-Leibler divergence measure is evaluated using a set of Indian classical songs from three different artists. Experimental results show that the discrimination ability is evident in certain low and mid frequency bands (200 - 1500 Hz).

Submitted 10 June, 2014; originally announced June 2014.

Comments: 5 pages, 5 figures, 2010 IEEE Symposium on Industrial Electronics Applications (ISIEA)

arXiv:1406.1280 [pdf, other]

Basis Identification for Automatic Creation of Pronunciation Lexicon for Proper Names

Authors: Sunil Kumar Kopparapu, M Laxminarayana

Abstract: Development of a proper names pronunciation lexicon is usually a manual effort which cannot be avoided. Grapheme to phoneme (G2P) conversion modules, in literature, are usually rule based and work best for non-proper names in a particular language. Proper names are foreign to a G2P module. We follow an optimization approach to enable automatic construction of a proper names pronunciation lexicon. The idea is to construct a small orthogonal set of words (basis) which can span the set of names in a given database. We propose two algorithms for the construction of this basis. The transcription lexicon of all the proper names in a database can be produced by the manual transcription of only the small set of basis words. We first construct a cost function and show that the minimization of the cost function results in a basis. We derive conditions for convergence of this cost function and validate them experimentally on a very large proper name database. Experiments show the transcription can be achieved by transcribing a small set of basis words. The algorithms proposed are generic and independent of language; however, performance is better if the proper names have the same origin, namely, the same language or geographical region.

Submitted 5 June, 2014; originally announced June 2014.

arXiv:1403.6901 [pdf, other]

Automatic Segmentation of Broadcast News Audio using Self Similarity Matrix

Authors: Sapna Soni, Ahmed Imran, Sunil Kumar Kopparapu

Abstract: Generally audio news broadcast on radio is composed of music, commercials, news from correspondents and recorded statements in addition to the actual news read by the newsreader. When news transcripts are available, automatic segmentation of audio news broadcast to time align the audio with the text transcription to build frugal speech corpora is essential. We address the problem of identifying segments in the audio news broadcast corresponding to the news read by the newsreader so that they can be mapped to the text transcripts. The existing techniques produce sub-optimal solutions when used to extract newsreader read segments. In this paper, we propose a new technique which is able to identify the acoustic change points reliably using an acoustic Self Similarity Matrix (SSM). We describe the two pass technique in detail and verify its performance on real audio news broadcasts of All India Radio for different languages.

Submitted 26 March, 2014; originally announced March 2014.

Comments: 4 pages, 5 images

Showing 1–32 of 32 results for author: Kopparapu, S K