-
A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding
Authors:
Yingzhi Wang,
Abdelmoumene Boumadane,
Abdelwahab Heba
Abstract:
Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, they have not been totally proven to produce better performance on tasks other than ASR. In this work, we explored partial fine-tuning and entire fine-tuning on wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recogni…
▽ More
Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, they have not been totally proven to produce better performance on tasks other than ASR. In this work, we explored partial fine-tuning and entire fine-tuning on wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding. With simple proposed downstream frameworks, the best scores reached 79.58% weighted accuracy on speaker-dependent setting and 73.01% weighted accuracy on speaker-independent setting for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, 89.38% accuracy for Intent Classification and 78.92% F1 for Slot Filling on SLURP, showing the strength of fine-tuned wav2vec 2.0 and HuBERT on learning prosodic, voice-print and semantic representations.
△ Less
Submitted 3 October, 2022; v1 submitted 4 November, 2021;
originally announced November 2021.
-
Pretext Tasks selection for multitask self-supervised speech representation learning
Authors:
Salah Zaiem,
Titouan Parcollet,
Slim Essid,
Abdel Heba
Abstract:
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features where engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a pseudo-labels) has proven to be a particularl…
▽ More
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features where engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations which prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable with the increase of the number of pretext tasks. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
△ Less
Submitted 11 November, 2022; v1 submitted 1 July, 2021;
originally announced July 2021.
-
SpeechBrain: A General-Purpose Speech Toolkit
Authors:
Mirco Ravanelli,
Titouan Parcollet,
Peter Plantinga,
Aku Rouhe,
Samuele Cornell,
Loren Lugosch,
Cem Subakan,
Nauman Dawalatabad,
Abdelwahab Heba,
Jianyuan Zhong,
Ju-Chieh Chou,
Sung-Lin Yeh,
Szu-Wei Fu,
Chien-Feng Liao,
Elena Rastorgueva,
François Grondin,
William Aris,
Hwidong Na,
Yan Gao,
Renato De Mori,
Yoshua Bengio
Abstract:
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing…
▽ More
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.
△ Less
Submitted 8 June, 2021;
originally announced June 2021.
-
Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers
Authors:
Loren Lugosch,
Piyush Papreja,
Mirco Ravanelli,
Abdelwahab Heba,
Titouan Parcollet
Abstract:
This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made availab…
▽ More
This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made available as part of the SpeechBrain toolkit.
△ Less
Submitted 30 September, 2021; v1 submitted 4 April, 2021;
originally announced April 2021.
-
Twist your logic with TouIST
Authors:
Khaled Skander Ben Slimane,
Alexis Comte,
Olivier Gasquet,
Abdelwahab Heba,
Olivier Lezaud,
Frederic Maris,
Mael Valais
Abstract:
SAT provers are powerful tools for solving real-sized logic problems, but using them requires solid programming knowledge and may be seen w.r.t.\ logic like assembly language w.r.t.\ programming. Something like a high level language was missing to ease various users to take benefit of these tools. {\sc \texttt {TouIST}}\ aims at filling this gap. It is devoted to propositional logic and its main f…
▽ More
SAT provers are powerful tools for solving real-sized logic problems, but using them requires solid programming knowledge and may be seen w.r.t.\ logic like assembly language w.r.t.\ programming. Something like a high level language was missing to ease various users to take benefit of these tools. {\sc \texttt {TouIST}}\ aims at filling this gap. It is devoted to propositional logic and its main features are 1) to offer a high-level logic langage for expressing succintly complex formulas (e.g.\ formulas describing Sudoku rules, planification problems,\ldots) and 2) to find models to these formulas by using the adequate powerful prover, which the user has no need to know about. It consists in a friendly interface that offers several syntactic facilities and which is connected with some sufficiently powerful provers allowing to automatically solve big instances of difficult problems (such as time-tables or Sudokus). It can interact with various provers: pure SAT solver but also SMT provers (SAT modulo theories - like linear theory of reals, etc) and thus may also be used by beginners for experiencing with pure propositional problems up to graduate students or even researchers for solving planification problems involving big sets of fluents and numerical constraints on them.
△ Less
Submitted 13 July, 2015;
originally announced July 2015.