Search | arXiv e-print repository

arXiv:2006.02774

A study on more realistic room simulation for far-field keyword spotting

Authors: Eric Bezzam, Robin Scheibler, Cyril Cadoux, Thibault Gisselbrecht

Abstract: We investigate the impact of more realistic room simulation for training far-field keyword spotting systems without fine-tuning on in-domain data. To this end, we study the impact of incorporating the following factors in the room impulse response (RIR) generation: air absorption, surface- and frequency-dependent coefficients of real materials, and stochastic ray tracing. Through an ablation study, a wake word task is used to measure the impact of these factors in comparison with a ground-truth set of measured RIRs. On a hold-out set of re-recordings under clean and noisy far-field conditions, we demonstrate up to $35.8\%$ relative improvement over the commonly-used (single absorption coefficient) image source method. Source code is made available in the Pyroomacoustics package, allowing others to incorporate these techniques in their work.

Submitted 18 November, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: 7 pages, 4 figures, accepted at APSIPA 2020, room impulse response generation code can be found at https://github.com/ebezzam/room-simulation

arXiv:2002.10851

Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks

Authors: Théodore Bluche, Maël Primet, Thibault Gisselbrecht

Abstract: We explore a keyword-based spoken language understanding system, in which the intent of the user can directly be derived from the detection of a sequence of keywords in the query. In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model. We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords, without training data specific to those keywords. The model, based on a quantized long short-term memory (LSTM) neural network, trained with connectionist temporal classification (CTC), weighs less than 500KB. Our approach takes advantage of some properties of the predictions of CTC-trained networks to calibrate the confidence scores and implement a fast detection algorithm. The proposed system outperforms a standard keyword-filler model approach.

Submitted 25 February, 2020; originally announced February 2020.

arXiv:1912.07575

Predicting detection filters for small footprint open-vocabulary keyword spotting

Authors: Theodore Bluche, Thibault Gisselbrecht

Abstract: In this paper, we propose a fully-neural approach to open-vocabulary keyword spotting, that allows the users to include a customizable voice interface to their device and that does not require task-specific data. We present a keyword detection neural network weighing less than 250KB, in which the topmost layer performing keyword detection is predicted by an auxiliary network, that may be run offline to generate a detector for any keyword. We show that the proposed model outperforms acoustic keyword spotting baselines by a large margin on two tasks of detecting keywords in utterances and three tasks of detecting isolated speech commands. We also propose a method to fine-tune the model when specific training data is available for some keywords, which yields a performance similar to a standard speech command neural network while keeping the ability of the model to be applied to new keywords.

Submitted 29 September, 2020; v1 submitted 16 December, 2019; originally announced December 2019.

Comments: Submitted to Interspeech 2020

arXiv:1811.07684

Efficient keyword spotting using dilated convolutions and gating

Authors: Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, Thibaut Lavril

Abstract: We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting as opposed to recurrent networks that model long-term temporal dependencies using internal states. We propose a model inspired by the recent success of dilated convolutions in sequence modeling applications, which allows deeper architectures to be trained in resource-constrained configurations. Gated activations and residual connections are also added, following a similar configuration to WaveNet. In addition, we apply a custom target labeling that back-propagates loss from specific frames of interest, therefore yielding higher accuracy and requiring only the detection of the end of the keyword. Our experimental results show that our model outperforms a max-pooling loss trained recurrent neural network using LSTM cells, with a significant decrease in false rejection rate. The underlying dataset - "Hey Snips" utterances recorded by over 2.2K different speakers - has been made publicly available to establish an open reference for wake-word detection.

Submitted 18 February, 2019; v1 submitted 19 November, 2018; originally announced November 2018.

Comments: Accepted for publication to ICASSP 2019

arXiv:1810.12735

Spoken Language Understanding on the Edge

Authors: Alaa Saade, Alice Coucke, Alexandre Caulier, Joseph Dureau, Adrien Ball, Théodore Bluche, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet

Abstract: We consider the problem of performing Spoken Language Understanding (SLU) on small devices typical of IoT applications. Our contributions are twofold. First, we outline the design of an embedded, private-by-design SLU system and show that it has performance on par with cloud-based commercial solutions. Second, we release the datasets used in our experiments in the interest of reproducibility and in the hope that they can prove useful to the SLU community.

Submitted 2 October, 2019; v1 submitted 30 October, 2018; originally announced October 2018.

Comments: arXiv admin note: text overlap with arXiv:1805.10190

arXiv:1810.05512

Federated Learning for Keyword Spotting

Authors: David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, Joseph Dureau

Abstract: We propose a practical approach based on federated learning to solve out-of-domain issues with continuously running embedded speech-based models such as wake word detectors. We conduct an extensive empirical study of the federated averaging algorithm for the "Hey Snips" wake word based on a crowdsourced dataset that mimics a federation of wake word users. We empirically demonstrate that using an adaptive averaging strategy inspired by Adam in place of standard weighted model averaging greatly reduces the number of communication rounds required to reach our target performance. The associated upstream communication costs per user are estimated at 8 MB, which is reasonable in the context of smart home voice assistants. Additionally, the dataset used for these experiments is being open sourced with the aim of fostering further transparent research in the application of federated learning to speech data.

Submitted 18 February, 2019; v1 submitted 9 October, 2018; originally announced October 2018.

Comments: Accepted for publication to ICASSP 2019

arXiv:1808.10725

Bandit algorithms for real-time data capture on large social medias

Authors: Thibault Gisselbrecht

Abstract: We study the problem of real-time data capture on social media. Due to the different limitations imposed by those media, but also to the very large amount of information, it is impossible to collect all the data produced by social networks such as Twitter. Therefore, to be able to gather enough relevant information related to a predefined need, it is necessary to focus on a subset of the information sources. In this work, we focus on user-centered data capture and consider each account of a social network as a source that can be listened to at each iteration of a data capture process, in order to collect the corresponding produced contents. This process, whose aim is to maximize the quality of the information gathered, is constrained by the number of users that can be monitored simultaneously. The problem of selecting a subset of accounts to listen to over time is a sequential decision problem under constraints, which we formalize as a bandit problem with multiple selections. Therefore, we propose several bandit models to identify the most relevant users in real time. First, we study the case of the stochastic bandit, in which each user corresponds to a stationary distribution. Then, we introduce two contextual bandit models, one stationary and the other non-stationary, in which the utility of each user can be estimated by assuming some underlying structure in the reward space. The first approach introduces the notion of profile, which corresponds to the average behavior of a user. The second approach takes into account the activity of a user in order to predict their future behavior. Finally, we are interested in models that are able to tackle complex temporal dependencies between users, with the use of a latent space within which the information transits from one iteration to the other. Each of the proposed approaches is validated on both artificial and real datasets.

Submitted 28 August, 2018; originally announced August 2018.

Comments: in French

arXiv:1805.10190

Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces

Authors: Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, Joseph Dureau

Abstract: This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real-time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy.

Submitted 6 December, 2018; v1 submitted 25 May, 2018; originally announced May 2018.

Comments: 29 pages, 9 figures, 17 tables

Showing 1–8 of 8 results for author: Gisselbrecht, T