Search | arXiv e-print repository

Automatic Speech Recognition for Hindi

Abstract: Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.

Submitted 26 June, 2024; originally announced June 2024.
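
The abstract above does not say which VAD algorithm the web interface uses. As a rough illustration only, the following sketch shows one common and very simple approach, a short-time energy threshold, applied to 16 kHz mono samples. The function names, the 20 ms frame length, and the threshold value are all assumptions made for this example, not details from the paper.

```python
import math

SAMPLE_RATE = 16000  # 16 kHz mono, as stated in the abstract


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len=320, threshold=1e-3):
    """Return one boolean per frame: True where the energy exceeds the threshold.

    frame_len=320 corresponds to 20 ms at 16 kHz; both defaults are
    illustrative assumptions.
    """
    return [e > threshold for e in frame_energies(samples, frame_len)]


# Synthetic input: 20 ms of silence, 20 ms of a 440 Hz tone, 20 ms of silence.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(320)]
decisions = energy_vad(silence + tone + silence)
```

Only the frames flagged as speech would need to be sent to the recognition engine, which is where the bandwidth and computation savings mentioned in the abstract would come from.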

arXiv:2312.09599 [pdf]

Brain-scale Theta Band Functional Connectivity As A Signature of Slow Breathing and Breath-hold Phases

Authors: Anusha A. S., Pradeep Kumar G., A. G. Ramakrishnan

Abstract: The study reported herein attempts to understand the neural mechanisms engaged in the conscious control of breathing and breath-hold. The variations in the electroencephalogram (EEG) based functional connectivity (FC) of the human brain during consciously controlled breathing at 2 cycles per minute (cpm), and breath-hold have been investigated and reported here. An experimental protocol involving controlled breathing and breath-hold sessions, synchronized to a visual metronome, was designed and administered to 20 healthy subjects (9 females and 11 males). EEG data were collected during these sessions using the 61-channel eego mylab system from ANT Neuro. Further, FC was estimated for all possible pairs of EEG time series data, for 7 EEG bands. Feature selection using a genetic algorithm (GA) was performed to identify a subset of functional connections that would best distinguish the inhale, exhale, inhale-hold, and exhale-hold phases using a random committee classifier. The best accuracy of 93.36% was obtained when 1161 theta-band functional connections were fed as input to the classifier, highlighting the efficacy of the theta-band functional connectome in distinguishing these phases of the respiratory cycle. This functional network was further characterized using graph measures, and observations illustrated a statistically significant difference in the efficiency of information exchange through the network during different respiratory phases.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2310.17138 [pdf, other]

A Classifier Using Global Character Level and Local Sub-unit Level Features for Hindi Online Handwritten Character Recognition

Authors: Anand Sharma, A. G. Ramakrishnan

Abstract: A classifier is developed that defines a joint distribution of global character features, number of sub-units and local sub-unit features to model Hindi online handwritten characters. The classifier uses latent variables to model the structure of sub-units. The classifier uses histograms of points, orientations, and dynamics of orientations (HPOD) features to represent characters at global character level and local sub-unit level and is independent of character stroke order and stroke direction variations. The parameters of the classifier are estimated using the maximum likelihood method. Different classifiers and features used in other studies are considered in this study for classification performance comparison with the developed classifier. The classifiers considered are Second Order Statistics (SOS), Sub-space (SS), Fisher Discriminant (FD), Feedforward Neural Network (FFN) and Support Vector Machines (SVM) and the features considered are Spatio Temporal (ST), Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), Spatial (SP) and Histograms of Oriented Gradients (HOG). Hindi character datasets used for training and testing the developed classifier consist of samples of handwritten characters from 96 different character classes. There are 12832 samples with an average of 133 samples per character class in the training set and 2821 samples with an average of 29 samples per character class in the testing set. The developed classifier has the highest accuracy of 93.5% on the testing set compared to that of the classifiers trained on different features extracted from the same training set and evaluated on the same testing set considered in this study.

Submitted 26 October, 2023; originally announced October 2023.

Comments: 23 pages, 8 jpg figures. arXiv admin note: text overlap with arXiv:2310.08222

arXiv:2310.08222 [pdf, other]

Structural analysis of Hindi online handwritten characters for character recognition

Authors: Anand Sharma, A. G. Ramakrishnan

Abstract: Direction properties of online strokes are used to analyze them in terms of homogeneous regions or sub-strokes with points satisfying common geometric properties. Such sub-strokes are called sub-units. These properties are used to extract sub-units from Hindi ideal online characters. These properties along with some heuristics are used to extract sub-units from Hindi online handwritten characters. A method is developed to extract point stroke, clockwise curve stroke, counter-clockwise curve stroke and loop stroke segments as sub-units from Hindi online handwritten characters. These extracted sub-units are close in structure to the sub-units of the corresponding Hindi online ideal characters. Importance of local representation of online handwritten characters in terms of sub-units is assessed by training a classifier with sub-unit level local and character level global features extracted from characters for character recognition. The classifier has the recognition accuracy of 93.5% on the testing set. This accuracy is the highest when compared with that of the classifiers trained only with global features extracted from characters in the same training set and evaluated on the same testing set. Sub-unit extraction algorithm and the sub-unit based character classifier are tested on Hindi online handwritten character dataset. This dataset consists of samples from 96 different characters. There are 12832 and 2821 samples in the training and testing sets, respectively.

Submitted 12 October, 2023; originally announced October 2023.

Comments: 34 pages, 36 jpg figures

arXiv:2309.02067 [pdf, other]

Histograms of Points, Orientations, and Dynamics of Orientations Features for Hindi Online Handwritten Character Recognition

Authors: Anand Sharma, A. G. Ramakrishnan

Abstract: A set of features independent of character stroke direction and order variations is proposed for online handwritten character recognition. A method is developed that maps features like co-ordinates of points, orientations of strokes at points, and dynamics of orientations of strokes at points spatially as a function of co-ordinate values of the points and computes histograms of these features from different regions in the spatial map. Different features like spatio-temporal, discrete Fourier transform, discrete cosine transform, discrete wavelet transform, spatial, and histograms of oriented gradients used in other studies for training classifiers for character recognition are considered. The classifier chosen for classification performance comparison, when trained with different features, is support vector machines (SVM). The character datasets used for training and testing the classifiers consist of online handwritten samples of 96 different Hindi characters. There are 12832 and 2821 samples in training and testing datasets, respectively. SVM classifiers trained with the proposed features have the highest classification accuracy of 92.9% when compared to the performances of SVM classifiers trained with the other features and tested on the same testing dataset. Therefore, the proposed features have better character discriminative capability than the other features considered for comparison.

Submitted 5 September, 2023; originally announced September 2023.

Comments: 21 pages, 12 jpg figures

arXiv:2109.05494 [pdf, other]

Unsupervised Domain Adaptation Schemes for Building ASR in Low-resource Languages

Authors: Anoop C S, Prathosh A P, A G Ramakrishnan

Abstract: Building an automatic speech recognition (ASR) system from scratch requires a large amount of annotated speech data, which is difficult to collect in many languages. However, there are cases where the low-resource language shares a common acoustic space with a high-resource language having enough annotated data to build an ASR. In such cases, we show that the domain-independent acoustic models learned from the high-resource language through unsupervised domain adaptation (UDA) schemes can enhance the performance of the ASR in the low-resource language. We use the specific example of Hindi in the source domain and Sanskrit in the target domain. We explore two architectures: i) domain adversarial training using gradient reversal layer (GRL) and ii) domain separation networks (DSN). The GRL and DSN architectures give absolute improvements of 6.71% and 7.32%, respectively, in word error rate over the baseline deep neural network model when trained on just 5.5 hours of data in the target domain. We also show that choosing a proper language (Telugu) in the source domain can bring further improvement. The results suggest that UDA schemes can be helpful in the development of ASR systems for low-resource languages, mitigating the hassle of collecting large amounts of annotated speech data.

Submitted 16 September, 2021; v1 submitted 12 September, 2021; originally announced September 2021.

Comments: Submitted to ASRU 2021

arXiv:2003.10433 [pdf, ps, other]

doi 10.1109/INDICON47234.2019.9028925

Decoding Imagined Speech using Wavelet Features and Deep Neural Networks

Authors: Jerrin Thomas Panachakel, A. G. Ramakrishnan

Abstract: This paper proposes a novel approach that uses deep neural networks for classifying imagined speech, significantly increasing the classification accuracy. The proposed approach employs only the EEG channels over specific areas of the brain for classification, and derives distinct feature vectors from each of those channels. This gives us more data to train a classifier, enabling us to use deep learning approaches. Wavelet and temporal domain features are extracted from each channel. The final class label of each test trial is obtained by applying a majority voting on the classification results of the individual channels considered in the trial. This approach is used for classifying all the 11 prompts in the KaraOne dataset of imagined speech. The proposed architecture and the approach of treating the data have resulted in an average classification accuracy of 57.15%, which is an improvement of around 35% over the state-of-the-art results.

Submitted 18 March, 2020; originally announced March 2020.

Comments: Preprint of the paper presented in 2019 IEEE 16th India Council International Conference (INDICON). arXiv admin note: substantial text overlap with arXiv:2003.09374
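
The per-trial decision rule described in the abstract above, majority voting over the labels predicted for the individual EEG channels, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name and the placeholder labels "A", "B", "C" are hypothetical, and the tie-breaking behaviour (first label encountered wins) is an assumption the abstract does not address.

```python
from collections import Counter


def majority_vote(channel_labels):
    """Return the label predicted by the largest number of channels.

    channel_labels holds one predicted class label per EEG channel of a
    single trial. On a tie, Counter.most_common keeps insertion order, so
    the label seen first wins (an assumption made for this sketch).
    """
    if not channel_labels:
        raise ValueError("at least one channel prediction is required")
    return Counter(channel_labels).most_common(1)[0][0]


# Hypothetical per-channel predictions for one trial.
trial_label = majority_vote(["A", "B", "A", "C", "A"])
```

Treating each channel as its own training sample multiplies the amount of training data, while the vote combines the channel-level decisions back into a single label per trial.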

arXiv:2003.10212 [pdf, other]

An Improved EEG Acquisition Protocol Facilitates Localized Neural Activation

Authors: Jerrin Thomas Panachakel, Nandagopal Netrakanti Vinayak, Maanvi Nunna, A. G. Ramakrishnan, Kanishka Sharma

Abstract: This work proposes improvements in the electroencephalogram (EEG) recording protocols for motor imagery through the introduction of actual motor movement and/or somatosensory cues. The results obtained demonstrate the advantage of requiring the subjects to perform motor actions following the trials of imagery. By introducing motor actions in the protocol, the subjects are able to perform actual motor planning, rather than just visualizing the motor movement, thus greatly improving the ease with which the motor movements can be imagined. This study also probes the added advantage of administering somatosensory cues to the subject, as opposed to the conventional auditory/visual cues. These changes in the protocol show promise in terms of the aptness of the spatial filters obtained on the data, on application of the well-known common spatial pattern (CSP) algorithms. The regions highlighted by the spatial filters are more localized and consistent across the subjects when the protocol is augmented with somatosensory stimuli. Hence, we suggest that this may prove to be a better EEG acquisition protocol for detecting brain activation in response to intended motor commands in (clinically) paralyzed/locked-in patients.

Submitted 13 March, 2020; originally announced March 2020.

Comments: Preprint of the paper presented at ComNet 2019

arXiv:2003.09374 [pdf, other]

A Novel Deep Learning Architecture for Decoding Imagined Speech from EEG

Authors: Jerrin Thomas Panachakel, A. G. Ramakrishnan, T. V. Ananthapadmanabha

Abstract: The recent advances in the field of deep learning have not been fully utilised for decoding imagined speech primarily because of the unavailability of sufficient training samples to train a deep network. In this paper, we present a novel architecture that employs deep neural network (DNN) for classifying the words "in" and "cooperate" from the corresponding EEG signals in the ASU imagined speech dataset. Nine EEG channels, which best capture the underlying cortical activity, are chosen using common spatial pattern (CSP) and are treated as independent data vectors. Discrete wavelet transform (DWT) is used for feature extraction. To the best of our knowledge, so far DNN has not been employed as a classifier in decoding imagined speech. Treating the selected EEG channels corresponding to each imagined word as independent data vectors helps in providing sufficient number of samples to train a DNN. For each test trial, the final class label is obtained by applying a majority voting on the classification results of the individual channels considered in the trial. We have achieved accuracies comparable to the state-of-the-art results. The results can be further improved by using a higher-density EEG acquisition system in conjunction with other deep learning techniques such as long short-term memory.

Submitted 18 March, 2020; originally announced March 2020.

Comments: Preprint of the paper presented at IEEE AIBEC 2019, Austria

arXiv:1902.05411 [pdf, other]

Improving Facial Emotion Recognition Systems Using Gradient and Laplacian Images

Authors: Ram Krishna Pandey, Souvik Karmakar, A G Ramakrishnan, Nabagata Saha

Abstract: In this work, we have proposed several enhancements to improve the performance of any facial emotion recognition (FER) system. We believe that the changes in the positions of the fiducial points and the intensities capture the crucial information regarding the emotion of a face image. We propose the use of the gradient and the Laplacian of the input image together with the original input into a convolutional neural network (CNN). These modifications help the network learn additional information from the gradient and Laplacian of the images. However, the plain CNN is not able to extract this information from the raw images. We have performed a number of experiments on two well known datasets KDEF and FERplus. Our approach enhances the already high performance of state-of-the-art FER systems by 3 to 5%.

Submitted 12 February, 2019; originally announced February 2019.

arXiv:1812.02475 [pdf, other]

Binary Document Image Super Resolution for Improved Readability and OCR Performance

Authors: Ram Krishna Pandey, K Vignesh, A G Ramakrishnan, Chandrahasa B

Abstract: There is a need for information retrieval from large collections of low-resolution (LR) binary document images, which can be found in digital libraries across the world, where the high-resolution (HR) counterpart is not available. This gives rise to the problem of binary document image super-resolution (BDISR). The objective of this paper is to address the interesting and challenging problem of super resolution of binary Tamil document images for improved readability and better optical character recognition (OCR). We propose multiple deep neural network architectures to address this problem and analyze their performance. The proposed models are all single image super-resolution techniques, which learn a generalized spatial correspondence between the LR and HR binary document images. We employ convolutional layers for feature extraction followed by transposed convolution and sub-pixel convolution layers for upscaling the features. Since the outputs of the neural networks are gray scale, we utilize the advantage of power law transformation as a post-processing technique to improve the character level pixel connectivity. The performance of our models is evaluated by comparing the OCR accuracies and the mean opinion scores given by human evaluators on LR images and the corresponding model-generated HR images.

Submitted 6 December, 2018; originally announced December 2018.
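
The power law (gamma) transformation named in the abstract above as a post-processing step is the standard point operation s = r^γ applied to normalised intensities. The sketch below shows it on an 8-bit grayscale image stored as nested lists. The function name and the example gamma of 2.0 are assumptions for illustration; the abstract does not state the value the authors used.

```python
def power_law(pixels, gamma, max_val=255):
    """Apply s = max_val * (r / max_val) ** gamma to every pixel.

    pixels is a list of rows of integer intensities in [0, max_val].
    With dark text on a light background, gamma > 1 darkens mid-gray
    pixels, which can help join broken character strokes.
    """
    return [
        [round(max_val * (p / max_val) ** gamma) for p in row]
        for row in pixels
    ]


# Hypothetical grayscale network output: black, dark gray, white.
enhanced = power_law([[0, 64, 255]], gamma=2.0)
```

Black and white pixels are left unchanged by the mapping, so only the intermediate gray levels produced by the network are pushed toward one of the two binary values.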

arXiv:1812.02447 [pdf, other]

Pitch-synchronous DCT features: A pilot study on speaker identification

Authors: Amit Meghanani, A G Ramakrishnan

Abstract: We propose a new feature, namely, pitch-synchronous discrete cosine transform (PS-DCT), for the task of speaker identification. These features are obtained directly from the voiced segments of the speech signal, without any pre-emphasis or windowing. The feature vectors are vector quantized, to create one separate codebook for each speaker during training. The performance of the PS-DCT features is shown to be good, and hence it can be used to supplement other features for the speaker identification task. Speaker identification is also performed using Mel-frequency cepstral coefficient (MFCC) features and combined with the proposed features to improve its performance. For this pilot study, 30 speakers (14 female and 16 male) have been picked randomly from the TIMIT database for the speaker identification task. On this data, both the proposed features and MFCC give an identification accuracy of 90% and 96.7% for codebook sizes of 16 and 32, respectively, and the combined features achieve 100% performance. Apart from the speaker identification task, this work also shows the capability of DCT to capture discriminative information from the speech signal with minimal pre-processing.

Submitted 6 December, 2018; originally announced December 2018.

arXiv:1809.00961 [pdf, other]

MSCE: An edge preserving robust loss function for improving super-resolution algorithms

Authors: Ram Krishna Pandey, Nabagata Saha, Samarjit Karmakar, A G Ramakrishnan

Abstract: With the recent advancement in the deep learning technologies such as CNNs and GANs, there is significant improvement in the quality of the images reconstructed by deep learning based super-resolution (SR) techniques. In this work, we propose a robust loss function based on the preservation of edges obtained by the Canny operator. This loss function, when combined with the existing loss function such as mean square error (MSE), gives better SR reconstruction measured in terms of PSNR and SSIM. Our proposed loss function guarantees improved performance on any existing algorithm using MSE loss function, without any increase in the computational complexity during testing.

Submitted 25 August, 2018; originally announced September 2018.

Comments: Accepted in ICONIP-2018

arXiv:1808.09432 [pdf, other]

Using Monte Carlo dropout for non-stationary noise reduction from speech

Authors: Nazreen P. M., A. G. Ramakrishnan

Abstract: In this work, we propose the use of dropout as a Bayesian estimator for increasing the generalizability of a deep neural network (DNN) for speech enhancement. By using Monte Carlo (MC) dropout, we show that the DNN performs better enhancement in unseen noise and SNR conditions. The DNN is trained on speech corrupted with Factory2, M109, Babble, Leopard and Volvo noises at SNRs of 0, 5 and 10 dB. Speech samples are obtained from the TIMIT database and noises from NOISEX-92. In another experiment, we train five DNN models separately on speech corrupted with Factory2, M109, Babble, Leopard and Volvo noises, at 0, 5 and 10 dB SNRs. The model precision (estimated using MC dropout) is used as a proxy for squared error to dynamically select the best of the DNN models based on their performance on each frame of test data. We propose an algorithm with a threshold on the model precision to switch between classifier based model selection scheme and model precision based selection scheme. Testing is done on speech corrupted with unseen noises White, Pink and Factory1 and all five seen noises.

Submitted 28 August, 2018; originally announced August 2018.

Comments: This article draws from our previous work arXiv:1806.00516

arXiv:1807.05927 [pdf, other]

Computationally Efficient Approaches for Image Style Transfer

Authors: Ram Krishna Pandey, Samarjit Karmakar, A G Ramakrishnan

Abstract: In this work, we have investigated various style transfer approaches and (i) examined how the stylized reconstruction changes with the change of loss function and (ii) provided a computationally efficient solution for the same. We have used elegant techniques like depth-wise separable convolution in place of convolution and nearest neighbor interpolation in place of transposed convolution. Further, we have also added multiple interpolations in place of transposed convolution. The results obtained are perceptually similar in quality, while being computationally very efficient. The decrease in the computational complexity of our architecture is validated by the decrease in the testing time by 26.1%, 39.1%, and 57.1%, respectively.

Submitted 16 July, 2018; originally announced July 2018.
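
To give a sense of why the depth-wise separable convolution mentioned in the abstract above reduces computation, the sketch below compares weight counts for a standard convolution and its depth-wise separable counterpart. The formulas are the standard ones (biases ignored); the function names and the 3x3, 64-to-128-channel layer are hypothetical choices for this example and are not taken from the paper.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depth-wise separable convolution (no bias).

    Depth-wise stage: one kernel x kernel filter per input channel.
    Point-wise stage: a 1x1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
ratio = separable / standard
```

For this hypothetical layer the separable version needs 8768 weights against 73728 for the standard one, roughly 12% of the original. The testing-time reductions reported in the abstract are measured results for the authors' full architectures and cannot be derived from this count alone.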

arXiv:1807.05813 [pdf, other]

Subjective and objective experiments on the influence of speaker's gender on the unvoiced segments

Authors: A Madhavaraj, T V Ananthapadmanabha, A G Ramakrishnan

Abstract: Subjective and objective experiments are conducted to understand the extent to which a speaker's gender influences the acoustics of unvoiced (U) sounds. U segments of utterances are replaced by the corresponding segments of a speaker of opposite gender to prepare modified utterances. Humans are asked to judge if the modified utterance is spoken by one or two speakers. The experiments show that human subjects are unable to distinguish the modified from the original. Thus, listeners are able to identify the U segments irrespective of the gender, which may be based on some speaker-independent invariant acoustic cues. To test if this finding is purely a perceptual phenomenon, objective experiments are also conducted. Gender specific HMM based phoneme recognition systems are trained using the TIMIT training set and tested on (a) utterances spoken by the same gender (b) utterances spoken by the opposite gender and (c) the modified utterances of the test set. As expected, the performance is the highest for case (a) and the lowest for case (b). The performance degrades only slightly for case (c). This result shows that the speaker's gender does not influence the acoustics of U sounds as strongly as it does those of voiced sounds.

Submitted 16 July, 2018; originally announced July 2018.

Comments: 2 Figures, 5 Pages

arXiv:1806.00516 [pdf, other]

DNN Based Speech Enhancement for Unseen Noises Using Monte Carlo Dropout

Authors: Nazreen P M, A G Ramakrishnan

Abstract: In this work, we propose the use of dropouts as a Bayesian estimator for increasing the generalizability of a deep neural network (DNN) for speech enhancement. By using Monte Carlo (MC) dropout, we show that the DNN performs better enhancement in unseen noise and SNR conditions. The DNN is trained on speech corrupted with Factory2, M109, Babble, Leopard and Volvo noises at SNRs of 0, 5 and 10 dB and tested on speech with white, pink and factory1 noises. Speech samples are obtained from the TIMIT database and noises from NOISEX-92. In another experiment, we train five DNN models separately on speech corrupted with Factory2, M109, Babble, Leopard and Volvo noises, at 0, 5 and 10 dB SNRs. The model precision (estimated using MC dropout) is used as a proxy for squared error to dynamically select the best of the DNN models based on their performance on each frame of test data.

Submitted 1 June, 2018; originally announced June 2018.
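
The model-selection idea in the abstract above (use the precision estimated by MC dropout to pick a model per frame) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the "models" are single linear units rather than DNNs, and the names `stochastic_forward`, `mc_dropout_stats` and `select_model` are hypothetical. It only shows the mechanism of keeping dropout active at test time, sampling several outputs, and preferring the model whose outputs vary least (highest precision).

```python
import random
from statistics import mean, pvariance


def stochastic_forward(weights, x, drop_prob, rng):
    """One forward pass of a toy linear unit with dropout left ON (inverted dropout)."""
    keep = 1.0 - drop_prob
    total = 0.0
    for w, xi in zip(weights, x):
        if rng.random() < keep:       # keep this input with probability `keep`
            total += w * xi / keep    # rescale so the expected output is unchanged
    return total


def mc_dropout_stats(weights, x, drop_prob=0.5, passes=200, seed=0):
    """Mean and variance of the output over repeated stochastic passes."""
    rng = random.Random(seed)
    outputs = [stochastic_forward(weights, x, drop_prob, rng) for _ in range(passes)]
    return mean(outputs), pvariance(outputs)


def select_model(models, x, **kwargs):
    """Index of the model with the lowest MC-dropout variance (highest precision)."""
    variances = [mc_dropout_stats(w, x, **kwargs)[1] for w in models]
    return min(range(len(models)), key=variances.__getitem__)


if __name__ == "__main__":
    frame = [1.0, 1.0]                       # one illustrative input frame
    candidate_models = [[5.0, 5.0], [0.1, 0.1]]  # two toy weight vectors
    print(select_model(candidate_models, frame))
```

In the paper's setting the selection would be made per frame of test data among five separately trained DNNs; the toy version only mirrors that selection rule.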

arXiv:1805.09400 [pdf, other]

A hybrid approach of interpolations and CNN to obtain super-resolution

Authors: Ram Krishna Pandey, A G Ramakrishnan

Abstract: We propose a novel architecture that learns an end-to-end mapping function to improve the spatial resolution of the input natural images. The model is unique in forming a nonlinear combination of three traditional interpolation techniques using the convolutional neural network. Another proposed architecture uses a skip connection with nearest neighbor interpolation, achieving almost similar results. The architectures have been carefully designed to ensure that the reconstructed images lie precisely in the manifold of high-resolution images, thereby preserving the high-frequency components with fine details. We have compared with the state of the art and recent deep learning based natural image super-resolution techniques and found that our methods are able to preserve the sharp details in the image, while also obtaining comparable or better PSNR than them. Since our methods use only traditional interpolations and a shallow CNN with fewer and smaller filters, the computational cost is kept low. We have reported the results of two proposed architectures on five standard datasets, for an upscale factor of 2. Our methods generalize well in most cases, which is evident from the better results obtained with increasingly complex datasets. For 4-times upscaling, we have designed similar architectures for comparing with other methods.

Submitted 23 May, 2018; originally announced May 2018.

Report number: TIP-19077-2018

arXiv:1805.09233 [pdf, other]

Segmentation of Liver Lesions with Reduced Complexity Deep Models

Authors: Ram Krishna Pandey, Aswin Vasan, A G Ramakrishnan

Abstract: We propose a computationally efficient architecture that learns to segment lesions from CT images of the liver. The proposed architecture uses bilinear interpolation with sub-pixel convolution at the last layer to upscale the coarse features in a bottleneck architecture. Since bilinear interpolation and sub-pixel convolution do not have any learnable parameters, our overall model is faster and has a smaller memory footprint than the traditional U-Net. We evaluate our proposed architecture on the highly competitive dataset of the 2017 Liver Tumor Segmentation (LiTS) Challenge. Our method achieves competitive results while reducing the number of learnable parameters roughly by a factor of 13.8 compared to the original U-Net model.

Submitted 23 May, 2018; originally announced May 2018.

arXiv:1701.08835 [pdf, other]

Language Independent Single Document Image Super-Resolution using CNN for improved recognition

Authors: Ram Krishna Pandey, A G Ramakrishnan

Abstract: Recognition of document images has important applications in restoring old and classical texts. The problem involves quality improvement before passing the image to a properly trained OCR to get accurate recognition of the text. Image enhancement and quality improvement constitute important steps, as subsequent recognition depends upon the quality of the input image. There are scenarios when high resolution images are not available, and our experiments show that the OCR accuracy reduces significantly with decrease in the spatial resolution of document images. Thus the only option is to improve the resolution of such document images. The goal is to construct a high resolution image, given a single low resolution binary image, which constitutes the problem of single image super-resolution. Most of the previous work in super-resolution deals with natural images, which have more information content than document images. Here, we use a Convolutional Neural Network to learn the mapping between low and the corresponding high resolution images. We experiment with different numbers of layers, parameter settings and non-linear functions to build a fast end-to-end framework for document image super-resolution. Our proposed model shows a very good PSNR improvement of about 4 dB on 75 dpi Tamil images, resulting in a 3% improvement of word level accuracy by the OCR. It takes less time than the recent sparse based natural image super-resolution technique, making it useful for real-time document recognition applications.

Submitted 30 January, 2017; originally announced January 2017.

arXiv:1609.09764 [pdf, ps, other]

Adaptive dictionary based approach for background noise and speaker classification and subsequent source separation

Authors: K V Vijay Girish, A G Ramakrishnan, T V Ananthapadmanabha

Abstract: A judicious combination of dictionary learning methods, block sparsity and a source recovery algorithm is used in a hierarchical manner to identify the noises and the speakers from a noisy conversation between two people. Conversations are simulated using speech from two speakers, each with a different background noise, with varied SNR values, down to -10 dB. Ten each of randomly chosen male and female speakers from the TIMIT database and all the noise sources from the NOISEX database are used for the simulations. For speaker identification, the relative value of weights recovered is used to select an appropriately small subset of the test data, assumed to contain speech. This novel choice of using varied amounts of test data results in an improvement in the speaker recognition rate of around 15% at SNR of 0 dB. Speech and noise are separated using dictionaries of the estimated speaker and noise, and an improvement of signal to distortion ratios of up to 10% is achieved at SNR of 0 dB. K-medoid and cosine similarity based dictionary learning methods lead to better recognition of the background noise and the speaker. Experiments are also conducted on cases where either the background noise or the speaker is outside the set of trained dictionaries. In such cases, adaptive dictionary learning leads to performance comparable to the other case of complete dictionaries.

Submitted 28 October, 2016; v1 submitted 30 September, 2016; originally announced September 2016.

Comments: 12 pages

arXiv:1609.05104 [pdf, other]

Intrinsic normalization and extrinsic denormalization of formant data of vowels

Authors: T. V. Ananthapadmanabha, A. G. Ramakrishnan

Abstract: Using a known speaker-intrinsic normalization procedure, formant data are scaled by the reciprocal of the geometric mean of the first three formant frequencies. This reduces the influence of the talker but results in a distorted vowel space. The proposed speaker-extrinsic procedure re-scales the normalized values by the mean formant values of vowels. When tested on the formant data of vowels published by Peterson and Barney, the combined approach leads to well separated clusters by reducing the spread due to talkers. The proposed procedure performs better than two top-ranked normalization procedures based on the accuracy of vowel classification as the objective measure.

Submitted 10 December, 2016; v1 submitted 16 September, 2016; originally announced September 2016.

Comments: 18 pages, 8 figures. Title has been revised. Appendix has been added to include more figures and to clarify 'hypothesize-test' procedure, JASA-EL, 2016
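
The intrinsic step described in the abstract above (scaling by the reciprocal of the geometric mean of F1, F2 and F3) can be sketched as follows. The function name `intrinsic_normalize` and the formant values are illustrative assumptions, and the extrinsic re-scaling step is left out because the abstract does not specify it in enough detail.

```python
def intrinsic_normalize(f1, f2, f3):
    """Divide each formant (Hz) by the geometric mean of the first three formants."""
    geometric_mean = (f1 * f2 * f3) ** (1.0 / 3.0)
    return (f1 / geometric_mean, f2 / geometric_mean, f3 / geometric_mean)


if __name__ == "__main__":
    # Illustrative formant values in Hz (not taken from the paper).
    token = (300.0, 2300.0, 3000.0)
    # The same vowel with all formants scaled by 1.2, as a crude stand-in
    # for a talker with a shorter vocal tract.
    scaled_token = tuple(1.2 * f for f in token)

    print(intrinsic_normalize(*token))
    print(intrinsic_normalize(*scaled_token))
```

Because a uniform scaling of all three formants also scales their geometric mean by the same factor, both calls yield the same normalized values, which is the sense in which this step reduces the influence of the talker.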

arXiv:1510.07774 [pdf, ps, other]

A dictionary learning and source recovery based approach to classify diverse audio sources

Authors: K V Vijay Girish, T V Ananthapadmanabha, A G Ramakrishnan

Abstract: A dictionary learning based audio source classification algorithm is proposed to classify a sample audio signal as one amongst a finite set of different audio sources. A cosine similarity measure is used to select the atoms during dictionary learning. Based on three objective measures proposed, namely, signal to distortion ratio (SDR), the number of non-zero weights and the sum of weights, a frame-wise source classification accuracy of 98.2% is obtained for twelve different sources. One hundred percent accuracy has been obtained using moving SDR accumulated over six successive frames for ten of the audio sources tested, while the two other sources require accumulation of 10 and 14 frames.

Submitted 27 October, 2015; originally announced October 2015.

Comments: 5 pages, 5 figures

ACM Class: H.5.1

arXiv:1506.04828 [pdf, ps]

Significance of the levels of spectral valleys with application to front/back distinction of vowel sounds

Authors: T. V. Ananthapadmanabha, A. G. Ramakrishnan, Shubham Sharma

Abstract: An objective critical distance (OCD) has been defined as that spacing between adjacent formants, when the level of the valley between them reaches the mean spectral level. The measured OCD lies in the same range (viz., 3-3.5 bark) as the critical distance determined by subjective experiments for similar experimental conditions. The level of spectral valley serves a purpose similar to that of the spacing between the formants with an added advantage that it can be measured from the spectral envelope without an explicit knowledge of formant frequencies. Based on the relative spacing of formant frequencies, the level of the spectral valley, VI (between F1 and F2) is much higher than the level of VII (spectral valley between F2 and F3) for back vowels and vice-versa for front vowels. Classification of vowels into front/back distinction with the difference (VI-VII) as an acoustic feature, tested using TIMIT, NTIMIT, Tamil and Kannada language databases gives, on the average, an accuracy of about 95%, which is comparable to the accuracy (90.6%) obtained using a neural network classifier trained and tested using MFCC as the feature vector for TIMIT database. The acoustic feature (VI-VII) has also been tested for its robustness on the TIMIT database for additive white and babble noise and an accuracy of about 95% has been obtained for SNRs down to 25 dB for both types of noise.

Submitted 5 October, 2015; v1 submitted 16 June, 2015; originally announced June 2015.

Comments: 39 pages, 6 figures, submitted to JASA

arXiv:1411.1267 [pdf, ps, other]

An Interesting Property of LPCs for Sonorant Vs Fricative Discrimination

Authors: T. V. Ananthapadmanabha, A. G. Ramakrishnan, Pradeep Balachandran

Abstract: Linear prediction (LP) technique estimates an optimum all-pole filter of a given order for a frame of speech signal. The coefficients of the all-pole filter, 1/A(z), are referred to as LP coefficients (LPCs). The gain of the inverse of the all-pole filter, A(z), at z = 1, i.e., at frequency = 0, A(1), corresponds to the sum of LPCs, which has the property of being lower (higher) than a threshold for the sonorants (fricatives). When the inverse-tan of A(1), denoted as T(1), is used as a feature and tested on the sonorant and fricative frames of the entire TIMIT database, an accuracy of 99.07% is obtained. Hence, we refer to T(1) as sonorant-fricative discrimination index (SFDI). This property has also been tested for its robustness for additive white noise and on the telephone quality speech of the NTIMIT database. These results are comparable to, or in some respects, better than the state-of-the-art methods proposed for a similar task. Such a property may be used for segmenting a speech signal or for non-uniform frame-rate analysis.

Submitted 5 November, 2014; originally announced November 2014.

Comments: 5 pages including references
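
The feature described in the abstract above is simple enough to sketch: A(1) is the sum of the coefficients of A(z), and T(1) is its inverse tangent. The sketch below assumes the coefficients are passed as the full polynomial of A(z), including the leading 1, with the convention A(z) = 1 + a1 z^-1 + ... + ap z^-p; the function name `sfdi` and the two first-order example filters are illustrative, and no decision threshold is shown because the abstract does not state one.

```python
import math


def sfdi(inverse_filter_coeffs):
    """Return T(1) = arctan(A(1)), where A(1) is the sum of the coefficients of A(z).

    `inverse_filter_coeffs` is assumed to hold [1, a1, ..., ap] for
    A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    """
    a_at_one = sum(inverse_filter_coeffs)  # A(z) evaluated at z = 1 (frequency 0)
    return math.atan(a_at_one)


if __name__ == "__main__":
    # Toy first-order inverse filters (not taken from the paper):
    # a low-frequency-dominated frame gives A(z) = 1 - 0.9 z^-1, so A(1) = 0.1;
    # a high-frequency-dominated frame gives A(z) = 1 + 0.9 z^-1, so A(1) = 1.9.
    low_frequency_frame = [1.0, -0.9]
    high_frequency_frame = [1.0, 0.9]
    print(sfdi(low_frequency_frame), sfdi(high_frequency_frame))
```

The low-frequency-dominated frame yields the smaller value, consistent with the abstract's statement that the sum of LPCs is lower for sonorants and higher for fricatives.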

arXiv:1411.0370 [pdf, ps, other]

Detection of transitions between broad phonetic classes in a speech signal

Authors: T V Ananthapadmanabha, K V Vijay Girish, A G Ramakrishnan

Abstract: Detection of transitions between broad phonetic classes in a speech signal is an important problem which has applications such as landmark detection and segmentation. The proposed hierarchical method detects silence to non-silence transitions, high amplitude (mostly sonorants) to low amplitude (mostly fricatives/affricates/stop bursts) transitions and vice-versa. A subset of the extremum (minimum or maximum) samples between every pair of successive zero-crossings is selected above a second pass threshold, from each bandpass filtered speech signal frame. Relative to the mid-point (reference) of a frame, locations of the first and the last extrema lie on either side, if the speech signal belongs to a homogeneous segment; else, both these locations lie on the left or the right side of the reference, indicating a transition frame. When tested on the entire TIMIT database, of the transitions detected, 93.6% are within a tolerance of 20 ms from the hand labeled boundaries. Sonorant, unvoiced non-sonorant and silence classes and their respective onsets are detected with an accuracy of about 83.5% for the same tolerance. The results are as good as, and in some respects better than the state-of-the-art methods for similar tasks.

Submitted 3 November, 2014; originally announced November 2014.

Comments: 12 pages, 5 figures

arXiv:1407.6315 [pdf, ps, other]

Quadratically constrained quadratic programming for classification using particle swarms and applications

Authors: Deepak Kumar, A G Ramakrishnan

Abstract: Particle swarm optimization is used in several combinatorial optimization problems. In this work, particle swarms are used to solve quadratic programming problems with quadratic constraints. The approach of particle swarms is an example for interior point methods in optimization as an iterative technique. This approach is novel and deals with classification problems without the use of a traditional classifier. Our method determines the optimal hyperplane or classification boundary for a data set. In a binary classification problem, we constrain each class as a cluster, which is enclosed by an ellipsoid. The estimation of the optimal hyperplane between the two clusters is posed as a quadratically constrained quadratic problem. The optimization problem is solved in distributed format using modified particle swarms. Our method has the advantage of using the direction towards optimal solution rather than searching the entire feasible region. Our results on the Iris, Pima, Wine, and Thyroid datasets show that the proposed method works better than a neural network and the performance is close to that of SVM.

Submitted 23 July, 2014; originally announced July 2014.

Comments: 17 pages, 3 figures

arXiv:1407.1285 [pdf, other]

Compressed EEG Acquisition with Limited Channels using Estimated Signal Correlation

Authors: J V Satyanarayana, A G Ramakrishnan

Abstract: Nearby scalp channels in multi-channel EEG data exhibit high correlation. A question that naturally arises is whether it is required to record signals from all the electrodes in a group of closely spaced electrodes in a typical measurement setup. One could save on the number of channels that are recorded, if it were possible to reconstruct the omitted channels to the accuracy needed for identifying the relevant information (say, spectral content in the signal) required to carry out a preliminary diagnosis. We address this problem from a compressed sensing perspective and propose a measurement and reconstruction scheme. Working with a publicly available EEG database, we put our scheme to experiment and illustrate that if it is only a matter of estimating the frequency content of the signal in various EEG bands, then all the channels need not be recorded. We have achieved an average error below 15% between the original and reconstructed signals with respect to estimation of the spectral content in the delta, theta and alpha bands. We have demonstrated that channels in the 10-10 system of electrode placement can be estimated, with an error less than 10%, using recordings on the sparser 10-20 system.

Submitted 4 July, 2014; originally announced July 2014.

arXiv:1208.6137 [pdf, ps, other]

Benchmarking recognition results on word image datasets

Authors: Deepak Kumar, M N Anil Prasad, A G Ramakrishnan

Abstract: We have benchmarked the maximum obtainable recognition accuracy on various word image datasets using manual segmentation and a currently available commercial OCR. We have developed a Matlab program, with a graphical user interface, for semi-automated pixel level segmentation of word images. We discuss the advantages of pixel level annotation. We have covered five databases adding up to over 3600 word images. These word images have been cropped from camera captured scene, born-digital and street view images. We recognize the segmented word image using the trial version of Nuance Omnipage OCR. We also discuss how the degradations introduced during acquisition or inaccuracies introduced during creation of word images affect the recognition of the word present in the image. Word images for different kinds of degradations and correction for slant and curvy nature of words are also discussed. The word recognition rates obtained on ICDAR 2003, Sign evaluation, Street view, Born-digital and ICDAR 2011 datasets are 83.9%, 89.3%, 79.6%, 88.5% and 86.7% respectively.

Submitted 30 August, 2012; originally announced August 2012.

Comments: 16 pages, 4 figures

ACM Class: I.7; I.7.5; I.4.6; I.4.8; I.2.10

Showing 1–29 of 29 results for author: Ramakrishnan, A G