Search | arXiv e-print repository

arXiv:1901.10680

Effective weakly supervised semantic frame induction using expression sharing in hierarchical hidden Markov models

Authors: Janneke van de Loo, Jort F. Gemmeke, Guy De Pauw, Bart Ons, Walter Daelemans, Hugo Van hamme

Abstract: We present a framework for the induction of semantic frames from utterances in the context of an adaptive command-and-control interface. The system is trained on an individual user's utterances and the corresponding semantic frames representing controls. During training, no prior information on the alignment between utterance segments and frame slots and values is available. In addition, semantic frames in the training data can contain information that is not expressed in the utterances. To tackle this weakly supervised classification task, we propose a framework based on Hidden Markov Models (HMMs). Structural modifications, resulting in a hierarchical HMM, and an extension called expression sharing are introduced to minimize the amount of training time and effort required for the user. The dataset used for the present study is PATCOR, which contains commands uttered in the context of a vocally guided card game, Patience. Experiments were carried out on orthographic and phonetic transcriptions of commands, segmented on different levels of n-gram granularity. The experimental results show positive effects of all the studied system extensions, with some effect differences between the different input representations. Moreover, evaluation experiments on held-out data with the optimal system configuration show that the extended system is able to achieve high accuracies with relatively small amounts of training data.

Submitted 30 January, 2019; originally announced January 2019.

arXiv:1609.09430

CNN Architectures for Large-Scale Audio Classification

Authors: Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson

Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

Submitted 10 January, 2017; v1 submitted 29 September, 2016; originally announced September 2016.

Comments: Accepted for publication at ICASSP 2017. Changes: Added definitions of mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on changes in the latest Audio Set revision. Changed wording to fit the 4-page limit with the new additions.

arXiv:0903.3198

TR02: State dependent oracle masks for improved dynamical features

Authors: J. F. Gemmeke, B. Cranen

Abstract: Using the AURORA-2 digit recognition task, we show that recognition accuracies obtained with classical, SNR based oracle masks can be substantially improved by using a state-dependent mask estimation technique.

Submitted 18 March, 2009; originally announced March 2009.

arXiv:0901.2416

TR01: Time-continuous Sparse Imputation

Authors: J. F. Gemmeke, B. Cranen

Abstract: An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing) prior to decoding, and to replace the missing ones by clean speech estimates. We present a novel method to obtain such clean speech estimates. Unlike previous imputation frameworks which work on a frame-by-frame basis, our method focuses on exploiting information from a large time-context. Using a sliding window approach, denoised speech representations are constructed using a sparse representation of the reliable features in an overcomplete basis of fixed-length exemplar fragments. We demonstrate the potential of our approach with experiments on the AURORA-2 connected digit database.

Submitted 16 January, 2009; originally announced January 2009.

Comments: 9 pages, 5 figures, Technical Report

Showing 1–4 of 4 results for author: Gemmeke, J F