Showing 1–31 of 31 results for author: Parcollet, T

Searching in archive eess.
  1. arXiv:2407.00756  [pdf, other]

    eess.AS cs.SD

    Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes as mere frozen feature extractors or model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in low-resource scenarios. This work explores middle-…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 5 Pages

  2. arXiv:2407.00463  [pdf, other]

    cs.LG cs.AI cs.CL cs.HC eess.AS

    Open-Source Conversational AI with SpeechBrain 1.0

    Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar , et al. (5 additional authors not shown)

    Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presen…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Submitted to JMLR (Machine Learning Open Source Software)

  3. arXiv:2310.07279  [pdf, other]

    cs.SD cs.CL eess.AS

    Enhancing expressivity transfer in textless speech-to-speech translation

    Authors: Jarod Duret, Benjamin O'Brien, Yannick Estève, Titouan Parcollet

    Abstract: Textless speech-to-speech translation systems are rapidly advancing, thanks to the integration of self-supervised learning techniques. However, existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages. Expressivity plays a vital role in conveying emotions, nuances, and cultural subtleties, thereby enhancing communic…

    Submitted 11 October, 2023; originally announced October 2023.

    Journal ref: ASRU, Dec 2023, Taipei, Taiwan

  4. arXiv:2309.05472  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech

    Authors: Titouan Parcollet, Ha Nguyen, Solene Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Esteve, Mickael Rouvier, Jerome Goulian, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

    Abstract: Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-…

    Submitted 18 March, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Published in Computer Science and Language. Preprint allowed

  5. arXiv:2308.14456  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has bee…

    Submitted 21 February, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 18 Pages

  6. arXiv:2307.07421  [pdf, other]

    cs.CL cs.SD eess.AS

    SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

    Authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

    Abstract: Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, pr…

    Submitted 17 January, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

  7. arXiv:2306.17199  [pdf, other]

    eess.AS cs.CL cs.SD

    Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data

    Authors: Jarod Duret, Titouan Parcollet, Yannick Estève

    Abstract: We propose a method for speech-to-speech emotion-preserving translation that operates at the level of discrete speech units. Our approach relies on the use of multilingual emotion embedding that can capture affective information in a language-independent manner. We show that this embedding can be used to predict the pitch and duration of speech units in a target language, allowing us to resynthesiz…

    Submitted 29 June, 2023; originally announced June 2023.

    Journal ref: Speech Synthesis Workshop (SSW), Aug 2023, Grenoble, France

  8. arXiv:2306.00481  [pdf, other]

    eess.AS cs.LG

    Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail while facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method, designe…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 6 pages, INTERSPEECH 2023

  9. arXiv:2306.00452  [pdf, ps, other]

    eess.AS cs.LG

    Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. Howe…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 6 pages

    Journal ref: INTERSPEECH 2023

  10. arXiv:2305.18281  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

    Authors: Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlicek

    Abstract: State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Confo…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Florian Mai and Juan Zuluaga-Gomez contributed equally. To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023

  11. arXiv:2303.06740  [pdf, other]

    eess.AS cs.LG

    Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

    Authors: Salah Zaiem, Robin Algayres, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: Submitted to ICASSP "Self-supervision in Audio, Speech and Beyond" workshop

  12. arXiv:2302.10144  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Stabilising and accelerating light gated recurrent units for automatic speech recognition

    Authors: Adel Moumen, Titouan Parcollet

    Abstract: The light gated recurrent unit (Li-GRU) is well known for achieving impressive results in automatic speech recognition (ASR) tasks while being lighter and faster to train than a standard gated recurrent unit (GRU). However, the unbounded nature of its rectified linear unit on the candidate recurrent gate induces an important gradient exploding phenomenon disrupting the training process and preve…

    Submitted 16 February, 2023; originally announced February 2023.

  13. arXiv:2209.15575  [pdf, other]

    cs.SD cs.LG eess.AS

    Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio

    Authors: Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Pedro P. B. de Gusmao, Nicholas D. Lane

    Abstract: Self-supervised learning (SSL) has proven vital in speech and audio-related applications. The paradigm trains a general model on unlabeled data that can later be used to solve specific downstream tasks. This type of model is costly to train as it requires manipulating long input sequences that can only be handled by powerful centralised servers. Surprisingly, despite many attempts to increase trai…

    Submitted 22 November, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

  14. arXiv:2204.04170  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework, various data augmentation techniques are usually exploited to help enforce desired invariances within the learned representations, improving performance on various audio tasks thanks to mo…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  15. arXiv:2204.02804  [pdf, other]

    cs.SD cs.LG eess.AS

    Federated Self-supervised Speech Representations: Are We There Yet?

    Authors: Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Abhinav Mehrotra, Nicholas D. Lane

    Abstract: The ubiquity of microphone-enabled devices has led to large amounts of unlabelled audio data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of speech representations. In this paper, we provide a first-of-its-kind systemat…

    Submitted 19 July, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

  16. arXiv:2204.00803  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-end model for named entity recognition from speech without paired training data

    Authors: Salima Mdhaffar, Jarod Duret, Titouan Parcollet, Yannick Estève

    Abstract: Recent works showed that end-to-end neural approaches tend to become very popular for spoken language understanding (SLU). Through the term end-to-end, one considers the use of a single model optimized to extract semantic information directly from the speech signal. A major issue for such models is the lack of paired audio and textual data with semantic annotation. In this paper, we propose an app…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  17. arXiv:2107.00594  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Pretext Tasks selection for multitask self-supervised speech representation learning

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid, Abdel Heba

    Abstract: Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularl…

    Submitted 11 November, 2022; v1 submitted 1 July, 2021; originally announced July 2021.

  18. arXiv:2106.04624  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    SpeechBrain: A General-Purpose Speech Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

    Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Preprint

  19. arXiv:2104.14297  [pdf, other]

    cs.SD cs.LG eess.AS

    End-to-End Speech Recognition from Federated Acoustic Models

    Authors: Yan Gao, Titouan Parcollet, Salah Zaiem, Javier Fernandez-Marques, Pedro P. B. de Gusmao, Daniel J. Beutel, Nicholas D. Lane

    Abstract: Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has attracted a lot of attention recently. However, the FL scenarios often presented in the literature are artificial and fail to capture the complexity of real FL systems. In this paper, we construct a challenging and realistic ASR federated experimental setup consisting of clients with heterogeneous data di…

    Submitted 9 July, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

  20. LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech

    Authors: Solene Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Esteve, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

    Abstract: Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing. Recent works also investigated SSL from speech. They were notably successful in improving performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient spee…

    Submitted 10 June, 2021; v1 submitted 23 April, 2021; originally announced April 2021.

    Comments: Will be presented at Interspeech 2021

    Journal ref: Proc. Interspeech 2021

  21. arXiv:2104.07388  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Conditional independence for pretext task selection in Self-supervised speech representation learning

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. A common pretext task consists in pretraining an SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech data where various meaningful signal processing fea…

    Submitted 1 July, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: 5 pages, Accepted for presentation at Interspeech 2021

  22. arXiv:2104.01604  [pdf, other]

    cs.CL eess.AS

    Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers

    Authors: Loren Lugosch, Piyush Papreja, Mirco Ravanelli, Abdelwahab Heba, Titouan Parcollet

    Abstract: This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made availab…

    Submitted 30 September, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

    Comments: Accepted to NeurIPS 2021 - Datasets and Benchmarks Track

  23. arXiv:2012.04454  [pdf, other]

    eess.AS cs.AI cs.CR

    Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation

    Authors: Paul-Gauthier Noé, Mohammad Mohammadamini, Driss Matrouf, Titouan Parcollet, Andreas Nautsch, Jean-François Bonastre

    Abstract: In speech technologies, speaker's voice representation is used in many applications such as speech recognition, voice conversion, speech synthesis and, obviously, user authentication. Modern vocal representations of the speaker are based on neural embeddings. In addition to the targeted information, these representations usually contain sensitive information about the speaker, like the age, sex, p…

    Submitted 16 June, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  24. arXiv:2005.09310  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition

    Authors: Yan Gao, Titouan Parcollet, Nicholas Lane

    Abstract: Knowledge distillation has been widely used to compress existing deep learning models while preserving the performance on a wide range of applications. In the specific context of Automatic Speech Recognition (ASR), distillation from ensembles of acoustic models has recently shown promising results in increasing recognition performance. In this paper, we propose an extension of multi-teacher distil…

    Submitted 3 July, 2021; v1 submitted 19 May, 2020; originally announced May 2020.

  25. arXiv:2005.08566  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Quaternion Neural Networks for Multi-channel Distant Speech Recognition

    Authors: Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas Lane, Mohamed Morchid

    Abstract: Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal…

    Submitted 19 May, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: 4 pages

  26. arXiv:2002.04569  [pdf, other]

    cs.SD eess.AS

    CGCNN: Complex Gabor Convolutional Neural Network on raw speech

    Authors: Paul-Gauthier Noé, Titouan Parcollet, Mohamed Morchid

    Abstract: Convolutional Neural Networks (CNN) have been used in Automatic Speech Recognition (ASR) to learn representations directly from the raw signal instead of hand-crafted acoustic features, providing a richer and lossless input signal. Recent research proposes to inject prior acoustic knowledge to the first convolutional layer by integrating the shape of the impulse responses in order to increase bot…

    Submitted 11 February, 2020; originally announced February 2020.

  27. arXiv:1906.08043  [pdf, other]

    eess.AS cs.CL cs.SD

    Real to H-space Encoder for Speech Recognition

    Authors: Titouan Parcollet, Mohamed Morchid, Georges Linarès, Renato De Mori

    Abstract: Deep neural networks (DNNs) and more precisely recurrent neural networks (RNNs) are at the core of modern automatic speech recognition systems, due to their efficiency to process input sequences. Recently, it has been shown that different input representations, based on multidimensional algebras, such as complex and quaternion numbers, are able to bring to neural networks a more natural, compressi…

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: Accepted at INTERSPEECH 2019

  28. arXiv:1811.09678  [pdf, other]

    eess.AS cs.SD stat.ML

    Speech recognition with quaternion neural networks

    Authors: Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori

    Abstract: Neural network architectures are at the core of powerful automatic speech recognition (ASR) systems. However, while recent research focuses on novel model architectures, the acoustic input features remain almost unchanged. Traditional ASR systems rely on multidimensional acoustic features such as the Mel filter bank energies along with the first, and second order derivatives to characterize ti…

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: NIPS 2018 (IRASL). arXiv admin note: text overlap with arXiv:1806.04418

  29. arXiv:1811.07453  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE

    The PyTorch-Kaldi Speech Recognition Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Yoshua Bengio

    Abstract: The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to…

    Submitted 15 February, 2019; v1 submitted 18 November, 2018; originally announced November 2018.

    Comments: Accepted at ICASSP 2019

  30. arXiv:1811.02566  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP stat.ML

    Bidirectional Quaternion Long-Short Term Memory Recurrent Neural Networks for Speech Recognition

    Authors: Titouan Parcollet, Mohamed Morchid, Georges Linarès, Renato De Mori

    Abstract: Recurrent neural networks (RNN) are at the core of modern automatic speech recognition (ASR) systems. In particular, long-short term memory (LSTM) recurrent neural networks have achieved state-of-the-art results in many speech recognition tasks, due to their efficient representation of long and short term dependencies in sequences of inter-dependent features. Nonetheless, internal dependencies wit…

    Submitted 6 November, 2018; originally announced November 2018.

    Comments: Submitted at ICASSP 2019. arXiv admin note: text overlap with arXiv:1806.04418

  31. arXiv:1806.07789  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

    Authors: Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, Yoshua Bengio

    Abstract: Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN), made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives…

    Submitted 20 June, 2018; originally announced June 2018.

    Comments: Accepted at INTERSPEECH 2018