-
Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants
Authors:
Chloé Sekkat,
Fanny Leroy,
Salima Mdhaffar,
Blake Perry Smith,
Yannick Estève,
Joseph Dureau,
Alice Coucke
Abstract:
Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.
Submitted 14 May, 2024;
originally announced May 2024.
-
Conditioned Text Generation with Transfer for Closed-Domain Dialogue Systems
Authors:
Stéphane d'Ascoli,
Alice Coucke,
Francesco Caltagirone,
Alexandre Caulier,
Marc Lelarge
Abstract:
Scarcity of training data for task-oriented dialogue systems is a well-known problem that is usually tackled with costly and time-consuming manual data annotation. An alternative solution is to rely on automatic text generation which, although less accurate than human supervision, has the advantage of being cheap and fast. Our contribution is twofold. First, we show how to optimally train and control the generation of intent-specific sentences using a conditional variational autoencoder. Then we introduce a new protocol called query transfer that makes it possible to leverage a large unlabelled dataset, possibly containing irrelevant queries, to extract relevant information. Comparison with two different baselines shows that this method, in the appropriate regime, consistently improves the diversity of the generated queries without compromising their quality. We also demonstrate the effectiveness of our generation method as a data augmentation technique for language modelling tasks.
Submitted 3 November, 2020;
originally announced November 2020.
-
Small footprint Text-Independent Speaker Verification for Embedded Systems
Authors:
Julien Balian,
Raffaele Tavarone,
Mathieu Poumeyrol,
Alice Coucke
Abstract:
Deep neural network approaches to speaker verification have proven successful, but typical computational requirements of State-Of-The-Art (SOTA) systems make them unsuited for embedded applications. In this work, we present a two-stage model architecture orders of magnitude smaller than common solutions (237.5K learning parameters, 11.5 MFLOPS) reaching a competitive result of 3.31% Equal Error Rate (EER) on the well-established VoxCeleb1 verification test set. We demonstrate the possibility of running our solution on small devices typical of IoT systems such as the Raspberry Pi 3B with a latency smaller than 200 ms on a 5 s long utterance. Additionally, we evaluate our model on the acoustically challenging VOiCES corpus. We report a limited increase in EER of 2.6 percentage points with respect to the best scoring model of the 2019 VOiCES from a Distance Challenge, against a reduction of 25.6 times in the number of learning parameters.
Submitted 21 April, 2021; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Conditioned Query Generation for Task-Oriented Dialogue Systems
Authors:
Stéphane d'Ascoli,
Alice Coucke,
Francesco Caltagirone,
Alexandre Caulier,
Marc Lelarge
Abstract:
Scarcity of training data for task-oriented dialogue systems is a well-known problem that is usually tackled with costly and time-consuming manual data annotation. An alternative solution is to rely on automatic text generation which, although less accurate than human supervision, has the advantage of being cheap and fast. In this paper we propose a novel controlled data generation method that could be used as a training augmentation framework for closed-domain dialogue. Our contribution is twofold. First, we show how to optimally train and control the generation of intent-specific sentences using a conditional variational autoencoder. Then we introduce a novel protocol called query transfer that makes it possible to leverage a broad, unlabelled dataset to extract relevant information. Comparison with two different baselines shows that our method, in the appropriate regime, consistently improves the diversity of the generated queries without compromising their quality.
Submitted 9 November, 2019;
originally announced November 2019.
-
Inference of compressed Potts graphical models
Authors:
Francesca Rizzato,
Alice Coucke,
Eleonora de Leonardis,
J. P. Barton,
Jérôme Tubiana,
Remi Monasson,
Simona Cocco
Abstract:
We consider the problem of inferring a graphical Potts model on a population of variables, with a non-uniform number of Potts colors (symbols) across variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and, hence, requires the introduction of regularization. We study here a double regularization scheme, in which the number of colors available to each variable is reduced, and interaction networks are made sparse. To achieve this color compression scheme, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach with two inference algorithms, the Adaptive Cluster Expansion (ACE) and the PseudoLikelihood Maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multi-sequence alignments of protein families, with similar results.
Submitted 3 January, 2020; v1 submitted 30 July, 2019;
originally announced July 2019.
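The color compression step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compress_colors`, the single-site (one column) interface, and the threshold value used in the example are all assumptions made for illustration. Symbols whose empirical frequency on a site exceeds the threshold keep their own state, and every other symbol is pooled into one shared state.

```python
from collections import Counter


def compress_colors(column, threshold):
    """Compress the Potts colors observed on one site (hypothetical helper).

    column    -- list of symbols observed on this site, one per sample
    threshold -- frequency cutoff (an assumed parameter); symbols with
                 empirical frequency strictly above it are kept
    Returns (encoded, kept): the re-indexed column and the kept symbols.
    """
    counts = Counter(column)
    total = len(column)
    # Sort the kept symbols so that index assignment is deterministic.
    kept = sorted(s for s, c in counts.items() if c / total > threshold)
    index = {s: i for i, s in enumerate(kept)}
    pooled = len(kept)  # single state shared by all low-frequency symbols
    encoded = [index.get(s, pooled) for s in column]
    return encoded, kept


# Example: A and B are frequent on this site, C and D are rare.
encoded, kept = compress_colors(list("AAAABBBCD"), threshold=0.2)
```

In this example the site goes from four colors to three (A, B, and one pooled state), which is how the scheme reduces the number of parameters attached to rare symbols.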
-
Efficient keyword spotting using dilated convolutions and gating
Authors:
Alice Coucke,
Mohammed Chlieh,
Thibault Gisselbrecht,
David Leroy,
Mathieu Poumeyrol,
Thibaut Lavril
Abstract:
We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting as opposed to recurrent networks that model long-term temporal dependencies using internal states. We propose a model inspired by the recent success of dilated convolutions in sequence modeling applications, allowing deeper architectures to be trained in resource-constrained configurations. Gated activations and residual connections are also added, following a similar configuration to WaveNet. In addition, we apply a custom target labeling that back-propagates loss from specific frames of interest, therefore yielding higher accuracy and requiring only that the end of the keyword be detected. Our experimental results show that our model outperforms a max-pooling loss trained recurrent neural network using LSTM cells, with a significant decrease in false rejection rate. The underlying dataset - "Hey Snips" utterances recorded by over 2.2K different speakers - has been made publicly available to establish an open reference for wake-word detection.
Submitted 18 February, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Spoken Language Understanding on the Edge
Authors:
Alaa Saade,
Alice Coucke,
Alexandre Caulier,
Joseph Dureau,
Adrien Ball,
Théodore Bluche,
David Leroy,
Clément Doumouro,
Thibault Gisselbrecht,
Francesco Caltagirone,
Thibaut Lavril,
Maël Primet
Abstract:
We consider the problem of performing Spoken Language Understanding (SLU) on small devices typical of IoT applications. Our contributions are twofold. First, we outline the design of an embedded, private-by-design SLU system and show that it has performance on par with cloud-based commercial solutions. Second, we release the datasets used in our experiments in the interest of reproducibility and in the hope that they can prove useful to the SLU community.
Submitted 2 October, 2019; v1 submitted 30 October, 2018;
originally announced October 2018.
-
Federated Learning for Keyword Spotting
Authors:
David Leroy,
Alice Coucke,
Thibaut Lavril,
Thibault Gisselbrecht,
Joseph Dureau
Abstract:
We propose a practical approach based on federated learning to solve out-of-domain issues with continuously running embedded speech-based models such as wake word detectors. We conduct an extensive empirical study of the federated averaging algorithm for the "Hey Snips" wake word based on a crowdsourced dataset that mimics a federation of wake word users. We empirically demonstrate that using an adaptive averaging strategy inspired by Adam in place of standard weighted model averaging greatly reduces the number of communication rounds required to reach our target performance. The associated upstream communication costs per user are estimated at 8 MB, which is reasonable in the context of smart home voice assistants. Additionally, the dataset used for these experiments is being open sourced with the aim of fostering further transparent research in the application of federated learning to speech data.
Submitted 18 February, 2019; v1 submitted 9 October, 2018;
originally announced October 2018.
-
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces
Authors:
Alice Coucke,
Alaa Saade,
Adrien Ball,
Théodore Bluche,
Alexandre Caulier,
David Leroy,
Clément Doumouro,
Thibault Gisselbrecht,
Francesco Caltagirone,
Thibaut Lavril,
Maël Primet,
Joseph Dureau
Abstract:
This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real-time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy.
Submitted 6 December, 2018; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Deep Representation for Patient Visits from Electronic Health Records
Authors:
Jean-Baptiste Escudié,
Alaa Saade,
Alice Coucke,
Marc Lelarge
Abstract:
We show how to learn low-dimensional representations (embeddings) of patient visits from the corresponding electronic health record (EHR) where International Classification of Diseases (ICD) diagnosis codes are removed. We expect that these embeddings will be useful for the construction of predictive statistical models anticipated to drive personalized medicine and improve healthcare quality. These embeddings are learned using a deep neural network trained to predict ICD diagnosis categories. We show that our embeddings capture relevant clinical information and can be used directly as input to standard machine learning algorithms like multi-output classifiers for ICD code prediction. We also show that important medical information corresponds to particular directions in our embedding space.
Submitted 26 March, 2018;
originally announced March 2018.
-
An interplay of migratory and division forces as a generic mechanism for stem cell patterns
Authors:
Edouard Hannezo,
Alice Coucke,
Jean-François Joanny
Abstract:
In many adult tissues, stem cells and differentiated cells are not homogeneously distributed: stem cells are arranged in periodic "niches", and differentiated cells are constantly produced and migrate out of these niches. In this article, we provide a general theoretical framework to study mixtures of dividing and actively migrating particles, which we apply to biological tissues. We show in particular that the interplay between the stresses arising from active cell migration and stem cell division gives rise to robust stem cell patterns. The instability of the tissue leads to spatial patterns which are either steady or oscillating in time. The wavelength of the instability has an order of magnitude consistent with the biological observations. We also discuss the implications of these results for future in vitro and in vivo experiments.
Submitted 18 December, 2015;
originally announced December 2015.