Search | arXiv e-print repository

Improved far-field speech recognition using Joint Variational Autoencoder

Authors: Shashi Kumar, Shakti P. Rath, Abhishek Pandey

Abstract: Automatic Speech Recognition (ASR) systems suffer considerably when source speech is corrupted with noise or room impulse responses (RIR). Typically, speech enhancement is applied in both mismatched and matched scenario training and testing. In matched setting, acoustic model (AM) is trained on dereverberated far-field features while in mismatched setting, AM is fixed. In recent past, map** spee… ▽ More Automatic Speech Recognition (ASR) systems suffer considerably when source speech is corrupted with noise or room impulse responses (RIR). Typically, speech enhancement is applied in both mismatched and matched scenario training and testing. In matched setting, acoustic model (AM) is trained on dereverberated far-field features while in mismatched setting, AM is fixed. In recent past, map** speech features from far-field to close-talk using denoising autoencoder (DA) has been explored. In this paper, we focus on matched scenario training and show that the proposed joint VAE based map** achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to AM trained directly on far-field filterbank features. △ Less

Submitted 24 April, 2022; originally announced April 2022.

Comments: 5 pages, 2 figures, 3 tables

arXiv:2112.01025 [pdf, other]

A Mixture of Expert Based Deep Neural Network for Improved ASR

Authors: Vishwanath Pratap Singh, Shakti P. Rath, Abhishek Pandey

Abstract: This paper presents a novel deep learning architecture for acoustic model in the context of Automatic Speech Recognition (ASR), termed as MixNet. Besides the conventional layers, such as fully connected layers in DNN-HMM and memory cells in LSTM-HMM, the model uses two additional layers based on Mixture of Experts (MoE). The first MoE layer operating at the input is based on pre-defined broad phon… ▽ More This paper presents a novel deep learning architecture for acoustic model in the context of Automatic Speech Recognition (ASR), termed as MixNet. Besides the conventional layers, such as fully connected layers in DNN-HMM and memory cells in LSTM-HMM, the model uses two additional layers based on Mixture of Experts (MoE). The first MoE layer operating at the input is based on pre-defined broad phonetic classes and the second layer operating at the penultimate layer is based on automatically learned acoustic classes. In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification. The ASR accuracy is expected to improve if the conventional architecture of acoustic model is modified to make them more suitable to account for such overlaps. MixNet is developed kee** this in mind. Analysis conducted by means of scatter diagram verifies that MoE indeed improves the separation between classes that translates to better ASR accuracy. Experiments are conducted on a large vocabulary ASR task which show that the proposed architecture provides 13.6% and 10.0% relative reduction in word error rates compared to the conventional models, namely, DNN and LSTM respectively, trained using sMBR criteria. In comparison to an existing method developed for phone-classification (by Eigen et al), our proposed method yields a significant improvement. △ Less

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2112.01023 [pdf, other]

A higher order Minkowski loss for improved prediction ability of acoustic model in ASR

Authors: Vishwanath Pratap Singh, Shakti P. Rath, Abhishek Pandey

Abstract: Conventional automatic speech recognition (ASR) system uses second-order minkowski loss during inference time which is suboptimal as it incorporates only first order statistics in posterior estimation [2]. In this paper we have proposed higher order minkowski loss (4th Order and 6th Order) during inference time, without any changes during training time. The main contribution of the paper is to sho… ▽ More Conventional automatic speech recognition (ASR) system uses second-order minkowski loss during inference time which is suboptimal as it incorporates only first order statistics in posterior estimation [2]. In this paper we have proposed higher order minkowski loss (4th Order and 6th Order) during inference time, without any changes during training time. The main contribution of the paper is to show that higher order loss uses higher order statistics in posterior estimation, which improves the prediction ability of acoustic model in ASR system. We have shown mathematically that posterior probability obtained due to higher order loss is function of second order posterior and thus the method can be incorporated in standard ASR system in an easy manner. It is to be noted that all changes are proposed during test(inference) time, we do not make any change in any training pipeline. Multiple baseline systems namely, TDNN1, TDNN2, DNN and LSTM are developed to verify the improvement incurred due to higher order minkowski loss. All experiments are conducted on LibriSpeech dataset and performance metrics are word error rate (WER) on "dev-clean", "test-clean", "dev-other" and "test-other" datasets. △ Less

Submitted 2 December, 2021; originally announced December 2021.

arXiv:0804.2346 [pdf]

Theory and Applications of Two-dimensional, Null-boundary, Nine-Neighborhood, Cellular Automata Linear rules

Authors: Pabitra Pal Choudhury, Birendra Kumar Nayak, Sudhakar Sahoo, Sunil Pankaj Rath

Abstract: This paper deals with the theory and application of 2-Dimensional, nine-neighborhood, null- boundary, uniform as well as hybrid Cellular Automata (2D CA) linear rules in image processing. These rules are classified into nine groups depending upon the number of neighboring cells influences the cell under consideration. All the Uniform rules have been found to be rendering multiple copies of a giv… ▽ More This paper deals with the theory and application of 2-Dimensional, nine-neighborhood, null- boundary, uniform as well as hybrid Cellular Automata (2D CA) linear rules in image processing. These rules are classified into nine groups depending upon the number of neighboring cells influences the cell under consideration. All the Uniform rules have been found to be rendering multiple copies of a given image depending on the groups to which they belong where as Hybrid rules are also shown to be characterizing the phenomena of zooming in, zooming out, thickening and thinning of a given image. Further, using hybrid CA rules a new searching algorithm is developed called Sweepers algorithm which is found to be applicable to simulate many inter disciplinary research areas like migration of organisms towards a single point destination, Single Attractor and Multiple Attractor Cellular Automata Theory, Pattern Classification and Clustering Problem, Image compression, Encryption and Decryption problems, Density Classification problem etc. △ Less

Submitted 15 April, 2008; originally announced April 2008.

Comments: 17 pages, 41 figures, a portion of this paper is accepted in the journal as well as proceedings at WSEAS,2006

Report number: Tech.Report No. ASD/2005/4, 13 May 2005

Showing 1–4 of 4 results for author: Rath, S P