Search | arXiv e-print repository

arXiv:2010.07895 [pdf, other]

Deep Convolutional Neural Network-based Inverse Filtering Approach for Speech De-reverberation

Authors: Hanwook Chung, Vikrant Singh Tomar, Benoit Champagne

Abstract: In this paper, we introduce a spectral-domain inverse filtering approach for single-channel speech de-reverberation using deep convolutional neural network (CNN). The main goal is to better handle realistic reverberant conditions where the room impulse response (RIR) filter is longer than the short-time Fourier transform (STFT) analysis window. To this end, we consider the convolutive transfer fun… ▽ More In this paper, we introduce a spectral-domain inverse filtering approach for single-channel speech de-reverberation using deep convolutional neural network (CNN). The main goal is to better handle realistic reverberant conditions where the room impulse response (RIR) filter is longer than the short-time Fourier transform (STFT) analysis window. To this end, we consider the convolutive transfer function (CTF) model for the reverberant speech signal. In the proposed framework, the CNN architecture is trained to directly estimate the inverse filter of the CTF model. Among various choices for the CNN structure, we consider the U-net which consists of a fully-convolutional auto-encoder network with skip-connections. Experimental results show that the proposed method provides better de-reverberation performance than the prevalent benchmark algorithms under various reverberation conditions. △ Less

Submitted 15 October, 2020; originally announced October 2020.

arXiv:1904.03670 [pdf, other]

Speech Model Pre-training for End-to-End Spoken Language Understanding

Authors: Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, Yoshua Bengio

Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is firs… ▽ More Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training. △ Less

Submitted 25 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

Comments: Accepted to Interspeech 2019

arXiv:1811.10736 [pdf, other]

DONUT: CTC-based Query-by-Example Keyword Spotting

Authors: Loren Lugosch, Samuel Myer, Vikrant Singh Tomar

Abstract: Keyword spotting--or wakeword detection--is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, a CTC-based algorithm for online query-by-example keyword spotting that enables custom wakeword detection. The algorithm works by recording a sm… ▽ More Keyword spotting--or wakeword detection--is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, a CTC-based algorithm for online query-by-example keyword spotting that enables custom wakeword detection. The algorithm works by recording a small number of training examples from the user, generating a set of label sequence hypotheses from these training examples, and detecting the wakeword by aggregating the scores of all the hypotheses given a new audio recording. Our method combines the generalization and interpretability of CTC-based keyword spotting with the user-adaptation and convenience of a conventional query-by-example system. DONUT has low computational requirements and is well-suited for both learning and inference on embedded systems without requiring private user data to be uploaded to the cloud. △ Less

Submitted 26 November, 2018; originally announced November 2018.

Comments: Accepted to NeurIPS 2018 Workshop on Interpretability and Robustness for Audio, Speech, and Language

arXiv:1807.04353 [pdf, other]

Efficient keyword spotting using time delay neural networks

Authors: Samuel Myer, Vikrant Singh Tomar

Abstract: This paper describes a novel method of live keyword spotting using a two-stage time delay neural network. The model is trained using transfer learning: initial training with phone targets from a large speech corpus is followed by training with keyword targets from a smaller data set. The accuracy of the system is evaluated on two separate tasks. The first is the freely available Google Speech Comm… ▽ More This paper describes a novel method of live keyword spotting using a two-stage time delay neural network. The model is trained using transfer learning: initial training with phone targets from a large speech corpus is followed by training with keyword targets from a smaller data set. The accuracy of the system is evaluated on two separate tasks. The first is the freely available Google Speech Commands dataset. The second is an in-house task specifically developed for keyword spotting. The results show significant improvements in false accept and false reject rates in both clean and noisy environments when compared with previously known techniques. Furthermore, we investigate various techniques to reduce computation in terms of multiplications per second of audio. Compared to recently published work, the proposed system provides up to 89% savings on computational complexity. △ Less

Submitted 28 August, 2018; v1 submitted 11 July, 2018; originally announced July 2018.

Comments: Will appear in Interspeech 2018

arXiv:1807.02465 [pdf, other]

Tone Recognition Using Lifters and CTC

Authors: Loren Lugosch, Vikrant Singh Tomar

Abstract: In this paper, we present a new method for recognizing tones in continuous speech for tonal languages. The method works by converting the speech signal to a cepstrogram, extracting a sequence of cepstral features using a convolutional neural network, and predicting the underlying sequence of tones using a connectionist temporal classification (CTC) network. The performance of the proposed method i… ▽ More In this paper, we present a new method for recognizing tones in continuous speech for tonal languages. The method works by converting the speech signal to a cepstrogram, extracting a sequence of cepstral features using a convolutional neural network, and predicting the underlying sequence of tones using a connectionist temporal classification (CTC) network. The performance of the proposed method is evaluated on a freely available Mandarin Chinese speech corpus, AISHELL-1, and is shown to outperform the existing techniques in the literature in terms of tone error rate (TER). △ Less

Submitted 6 July, 2018; originally announced July 2018.

Comments: Accepted to Interspeech 2018

Showing 1–5 of 5 results for author: Tomar, V S