
Showing 1–50 of 63 results for author: Ravanelli, M

  1. arXiv:2407.00463  [pdf, other]

    cs.LG cs.AI cs.CL cs.HC eess.AS

    Open-Source Conversational AI with SpeechBrain 1.0

    Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, et al. (5 additional authors not shown)

    Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presen…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Submitted to JMLR (Machine Learning Open Source Software)

  2. arXiv:2406.14294  [pdf, other]

    cs.SD cs.AI eess.AS

    DASB - Discrete Audio and Speech Benchmark

    Authors: Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently…

    Submitted 21 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 9 pages, 5 tables

  3. arXiv:2406.10735  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

    Authors: Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and N…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures, 2 tables, Accepted at Interspeech 2024

  4. arXiv:2406.10422  [pdf, other]

    eess.AS cs.SD eess.SP

    Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

    Authors: Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan

    Abstract: In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces saliency maps that result in more faithful explanations com…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  5. arXiv:2405.17615  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Listenable Maps for Zero-Shot Audio Classifiers

    Authors: Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

    Abstract: Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot…

    Submitted 27 May, 2024; originally announced May 2024.

  6. arXiv:2403.13086  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Listenable Maps for Audio Classifiers

    Authors: Francesco Paissan, Mirco Ravanelli, Cem Subakan

    Abstract: Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and li…

    Submitted 19 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted to ICML 2024 (Oral)

  7. arXiv:2402.16830  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

    Authors: Luca Zampierin, Ghouthi Boukli Hacene, Bac Nguyen, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To enhance its efficiency, previous works often leverage the use of compression techniques. A notable recent attempt is DPHuBERT, which applies joint knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted at the Self-supervision in Audio, Speech and Beyond (SASB) Workshop at ICASSP 2024

  8. arXiv:2402.02754  [pdf, other]

    cs.SD cs.LG eess.AS

    Focal Modulation Networks for Interpretable Sound Classification

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli

    Abstract: The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the pr…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 XAI-SA Workshop

  9. arXiv:2402.01098  [pdf, other]

    cs.LG stat.ML

    Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent

    Authors: Luca Della Libera, Jacopo Andreoli, Davide Dalle Pezze, Mirco Ravanelli, Gian Antonio Susto

    Abstract: A crucial task in predictive maintenance is estimating the remaining useful life of physical systems. In the last decade, deep learning has improved considerably upon traditional model-based and statistical approaches in terms of predictive performance. However, in order to optimally plan maintenance operations, it is also important to quantify the uncertainty inherent to the predictions. This iss…

    Submitted 1 February, 2024; originally announced February 2024.

    Comments: 26 pages, 3 figures

  10. arXiv:2401.02297  [pdf, other]

    cs.CL

    Are LLMs Robust for Spoken Dialogues?

    Authors: Seyed Mahed Mousavi, Gabriel Roccabruna, Simone Alghisi, Massimo Rizzoli, Mirco Ravanelli, Giuseppe Riccardi

    Abstract: Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tracking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In…

    Submitted 4 January, 2024; originally announced January 2024.

  11. arXiv:2312.03694  [pdf, other]

    eess.AS

    Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

    Authors: Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti, Mirco Ravanelli

    Abstract: The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fine-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning…

    Submitted 11 January, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: The code is available at: https://github.com/umbertocappellazzo/PETL_AST

  12. arXiv:2310.17864  [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  13. arXiv:2310.16931  [pdf, other]

    cs.CL cs.AI

    CL-MASR: A Continual Learning Benchmark for Multilingual ASR

    Authors: Luca Della Libera, Pooneh Mousavi, Salah Zaiem, Cem Subakan, Mirco Ravanelli

    Abstract: Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new lang…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: 16 pages, 5 figures, 5 tables

  14. arXiv:2310.12858  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio Editing with Non-Rigid Text Prompts

    Authors: Francesco Paissan, Luca Della Libera, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan

    Abstract: In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-pro…

    Submitted 12 June, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: Accepted to INTERSPEECH 2024

  15. arXiv:2310.04292  [pdf, other]

    cs.LG

    Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

    Authors: Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, et al. (10 additional authors not shown)

    Abstract: Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by…

    Submitted 18 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

  16. arXiv:2308.14456  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has bee…

    Submitted 21 February, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 18 Pages

  17. arXiv:2308.04611  [pdf, other]

    cs.LG physics.ao-ph

    Deep Learning Driven Detection of Tsunami Related Internal Gravity Waves: a path towards open-ocean natural hazards detection

    Authors: Valentino Constantinou, Michela Ravanelli, Hamlin Liu, Jacob Bortnik

    Abstract: Tsunamis can trigger internal gravity waves (IGWs) in the ionosphere, perturbing the Total Electron Content (TEC) - referred to as Traveling Ionospheric Disturbances (TIDs) that are detectable through the Global Navigation Satellite System (GNSS). The GNSS are constellations of satellites providing signals from Earth orbit - Europe's Galileo, the United States' Global Positioning System (GPS), Rus…

    Submitted 8 August, 2023; originally announced August 2023.

  18. arXiv:2307.00134  [pdf, other]

    cs.LG

    Generalization Limits of Graph Neural Networks in Identity Effects Learning

    Authors: Giuseppe Alessio D'Inverno, Simone Brugiapaglia, Mirco Ravanelli

    Abstract: Graph Neural Networks (GNNs) have emerged as a powerful tool for data-driven learning on various graph domains. They are usually based on a message-passing mechanism and have gained increasing popularity for their intuitive formulation, which is closely linked to the Weisfeiler-Lehman (WL) test for graph isomorphism to which they have been proven equivalent in terms of expressive power. In this wo…

    Submitted 31 October, 2023; v1 submitted 30 June, 2023; originally announced July 2023.

    Comments: 13 pages, 10 figures

  19. arXiv:2306.12991  [pdf, other]

    cs.CL

    Speech Emotion Diarization: Which Emotion Appears When?

    Authors: Yingzhi Wang, Mirco Ravanelli, Alya Yacoubi

    Abstract: Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers…

    Submitted 20 October, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

    Comments: Accepted to ASRU 2023

  20. arXiv:2306.04054  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain

    Authors: Sangeet Sagar, Mirco Ravanelli, Bernd Kiefer, Ivana Kruijff Korbayova, Josef van Genabith

    Abstract: Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech d…

    Submitted 25 September, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

  21. arXiv:2306.00452  [pdf, ps, other]

    eess.AS cs.LG

    Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. Howe…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 6 pages

    Journal ref: INTERSPEECH 2023

  22. arXiv:2304.04858  [pdf, other]

    cs.LG cs.CV

    Simulated Annealing in Early Layers Leads to Better Generalization

    Authors: Amirmohammad Sarfi, Zahra Karimpour, Muawiz Chaudhary, Nasir M. Khalid, Mirco Ravanelli, Sudhir Mudur, Eugene Belilovsky

    Abstract: Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal inno…

    Submitted 10 April, 2023; originally announced April 2023.

  23. arXiv:2303.12659  [pdf, other]

    cs.AI cs.LG cs.SD eess.AS

    Posthoc Interpretation via Quantization

    Authors: Francesco Paissan, Cem Subakan, Mirco Ravanelli

    Abstract: In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input…

    Submitted 27 May, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

    Comments: Francesco Paissan and Cem Subakan contributed equally

  24. arXiv:2303.06740  [pdf, other]

    eess.AS cs.LG

    Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

    Authors: Salah Zaiem, Robin Algayres, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: Submitted to ICASSP "Self-supervision in Audio, Speech and Beyond" workshop

  25. arXiv:2207.13703  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

    Authors: Artem Ploujnikov, Mirco Ravanelli

    Abstract: End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: 5 pages, submitted to INTERSPEECH 2022

  26. arXiv:2206.09507  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Resource-Efficient Separation Transformer

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

    Abstract: Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-bas…

    Submitted 15 January, 2024; v1 submitted 19 June, 2022; originally announced June 2022.

    Comments: Accepted to ICASSP 2024

  27. arXiv:2205.07390  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Learning Representations for New Sound Classes With Continual Self-Supervised Learning

    Authors: Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

    Abstract: In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically rel…

    Submitted 13 December, 2022; v1 submitted 15 May, 2022; originally announced May 2022.

    Comments: Accepted to IEEE Signal Processing Letters

  28. arXiv:2202.02884  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Exploring Self-Attention Mechanisms for Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Francois Grondin, Mirko Bronzi

    Abstract: Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we…

    Submitted 27 May, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

    Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  29. arXiv:2111.05703  [pdf, other]

    eess.AS cs.SD

    OSSEM: one-shot speaker adaptive speech enhancement using meta learning

    Authors: Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli

    Abstract: Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified tra…

    Submitted 10 November, 2021; originally announced November 2021.

  30. arXiv:2110.10812  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    REAL-M: Towards Speech Separation on Real Mixtures

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin

    Abstract: In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  31. arXiv:2110.05866  [pdf]

    cs.SD cs.CL eess.AS

    MetricGAN-U: Unsupervised speech enhancement/ dereverberation based only on noisy/ reverberated speech

    Authors: Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao

    Abstract: Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training. Consequently, several noisy speeches recorded in daily life cannot be used to train the model. Although certain unsupervised learning frameworks have also been proposed to solve the pair constraint, they still require clean s…

    Submitted 12 October, 2021; originally announced October 2021.

  32. arXiv:2107.10790  [pdf, other]

    eess.SP cs.AI cs.HC cs.LG

    Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity

    Authors: Juan Manuel Mayor-Torres, Mirco Ravanelli, Sara E. Medina-DeVilliers, Matthew D. Lerner, Giuseppe Riccardi

    Abstract: Machine learning methods, such as deep learning, show promising results in the medical domain. However, the lack of interpretability of these algorithms may hinder their applicability to medical decision support systems. This paper studies an interpretable deep learning technique, called SincNet. SincNet is a convolutional neural network that efficiently learns customized band-pass filters through…

    Submitted 18 July, 2021; originally announced July 2021.

  33. arXiv:2106.04624  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    SpeechBrain: A General-Purpose Speech Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

    Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Preprint

  34. arXiv:2104.03538  [pdf]

    cs.SD cs.AI eess.AS

    MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

    Authors: Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao

    Abstract: The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discr…

    Submitted 4 June, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

    Comments: Accepted by Interspeech 2021

  35. arXiv:2104.01604  [pdf, other]

    cs.CL eess.AS

    Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers

    Authors: Loren Lugosch, Piyush Papreja, Mirco Ravanelli, Abdelwahab Heba, Titouan Parcollet

    Abstract: This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made availab…

    Submitted 30 September, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

    Comments: Accepted to NeurIPS 2021 - Datasets and Benchmarks Track

  36. ECAPA-TDNN Embeddings for Speaker Diarization

    Authors: Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na

    Abstract: Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, f…

    Submitted 3 April, 2021; originally announced April 2021.

  37. arXiv:2103.00336  [pdf, other]

    cs.LG cs.AI

    Transformers with Competitive Ensembles of Independent Mechanisms

    Authors: Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco Ravanelli, Yoshua Bengio

    Abstract: An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as ir…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: Under Review, ICML 2021

  38. arXiv:2010.13154  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Attention is All You Need in Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong

    Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a nov…

    Submitted 8 March, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  39. arXiv:2010.09930  [pdf, other]

    cs.SD eess.AS

    BIRD: Big Impulse Response Dataset

    Authors: François Grondin, Jean-Samuel Lauzon, Simon Michaud, Mirco Ravanelli, François Michaud

    Abstract: This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources…

    Submitted 19 October, 2020; originally announced October 2020.

  40. arXiv:2006.04603  [pdf, other]

    eess.IV cs.CV cs.LG

    BS-Net: learning COVID-19 pneumonia severity on a large Chest X-Ray dataset

    Authors: Alberto Signoroni, Mattia Savardi, Sergio Benini, Nicola Adami, Riccardo Leonardi, Paolo Gibellini, Filippo Vaccher, Marco Ravanelli, Andrea Borghesi, Roberto Maroldi, Davide Farina

    Abstract: In this work we design an end-to-end deep learning architecture for predicting, on Chest X-ray (CXR) images, a multi-regional score conveying the degree of lung compromise in COVID-19 patients. This semi-quantitative scoring system, namely the Brixia score, is applied in serial monitoring of such patients, showing significant prognostic value, in one of the hospitals that experienced one of the highe…

    Submitted 3 April, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 28 pages, 11 figures, preprint of accepted paper to Medical Image Analysis, Project page with Code and Dataset Available at https://brixia.github.io/

    MSC Class: 68T45 ACM Class: I.2.10; I.5; I.4; J.3

  41. arXiv:2005.08566  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Quaternion Neural Networks for Multi-channel Distant Speech Recognition

    Authors: Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas Lane, Mohamed Morchid

    Abstract: Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal…

    Submitted 19 May, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: 4 pages

  42. arXiv:2001.09239  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Multi-task self-supervised learning for Robust Speech Recognition

    Authors: Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, Yoshua Bengio

    Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require ma…

    Submitted 17 April, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: In Proc. of ICASSP 2020

  43. arXiv:1910.09463  [pdf, other]

    eess.AS

    Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models

    Authors: Loren Lugosch, Brett Meyer, Derek Nowrouzezahrai, Mirco Ravanelli

    Abstract: End-to-end models are an attractive new approach to spoken language understanding (SLU) in which the meaning of an utterance is inferred directly from the raw audio without employing the standard pipeline composed of a separately trained speech recognizer and natural language understanding module. The downside of end-to-end SLU is that in-domain speech data must be recorded to train the model. In…

    Submitted 21 October, 2019; originally announced October 2019.

  44. arXiv:1904.03670  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Speech Model Pre-training for End-to-End Spoken Language Understanding

    Authors: Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, Yoshua Bengio

    Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is firs…

    Submitted 25 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  45. arXiv:1904.03416  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

    Authors: Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

    Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This…

    Submitted 6 April, 2019; originally announced April 2019.

  46. arXiv:1812.05920  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Speech and Speaker Recognition from Raw Waveform with SincNet

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw audio samples directly. Differently from standard hand-crafted features such as MFCCs or FBANK, the raw waveform can potentially help neural networks discover…

    Submitted 15 February, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

    Comments: arXiv admin note: substantial text overlap with arXiv:1811.09725, arXiv:1808.00158

  47. arXiv:1812.00271  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Learning Speaker Representations with Mutual Information

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some recent studies have shown that an implicit optimizati…

    Submitted 5 April, 2019; v1 submitted 1 December, 2018; originally announced December 2018.

    Comments: Submitted to Interspeech 2019

  48. arXiv:1811.09725  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE

    Interpretable Convolutional Filters with SincNet

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations, that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability,…

    Submitted 9 August, 2019; v1 submitted 23 November, 2018; originally announced November 2018.

    Comments: In Proceedings of NIPS@IRASL 2018. arXiv admin note: substantial text overlap with arXiv:1808.00158

  49. arXiv:1811.09678  [pdf, other]

    eess.AS cs.SD stat.ML

    Speech recognition with quaternion neural networks

    Authors: Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori

    Abstract: Neural network architectures are at the core of powerful automatic speech recognition (ASR) systems. However, while recent research focuses on novel model architectures, the acoustic input features remain almost unchanged. Traditional ASR systems rely on multidimensional acoustic features such as the Mel filter bank energies alongside the first- and second-order derivatives to characterize ti…

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: NIPS 2018 (IRASL). arXiv admin note: text overlap with arXiv:1806.04418

  50. arXiv:1811.07453  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE

    The PyTorch-Kaldi Speech Recognition Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Yoshua Bengio

    Abstract: The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to…

    Submitted 15 February, 2019; v1 submitted 18 November, 2018; originally announced November 2018.

    Comments: Accepted at ICASSP 2019