
Showing 1–49 of 49 results for author: Ravanelli, M

  1. arXiv:2406.14294  [pdf, other]

    cs.SD cs.AI eess.AS

    DASB - Discrete Audio and Speech Benchmark

    Authors: Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently…

    Submitted 21 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 9 pages, 5 tables

  2. arXiv:2406.10735  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

    Authors: Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and N…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures, 2 tables, Accepted at Interspeech 2024

  3. arXiv:2406.10422  [pdf, other]

    eess.AS cs.SD eess.SP

    Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

    Authors: Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan

    Abstract: In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces saliency maps that result in more faithful explanations com…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2405.17615  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Listenable Maps for Zero-Shot Audio Classifiers

    Authors: Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

    Abstract: Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot…

    Submitted 27 May, 2024; originally announced May 2024.

  5. arXiv:2403.13086  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Listenable Maps for Audio Classifiers

    Authors: Francesco Paissan, Mirco Ravanelli, Cem Subakan

    Abstract: Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and li…

    Submitted 19 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted to ICML 2024 (Oral)

  6. arXiv:2402.16830  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

    Authors: Luca Zampierin, Ghouthi Boukli Hacene, Bac Nguyen, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To enhance its efficiency, previous works often leverage the use of compression techniques. A notable recent attempt is DPHuBERT, which applies joint knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted at the Self-supervision in Audio, Speech and Beyond (SASB) Workshop at ICASSP 2024

  7. arXiv:2402.02754  [pdf, other]

    cs.SD cs.LG eess.AS

    Focal Modulation Networks for Interpretable Sound Classification

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli

    Abstract: The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the pr…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 XAI-SA Workshop

  8. arXiv:2312.03694  [pdf, other]

    eess.AS

    Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

    Authors: Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti, Mirco Ravanelli

    Abstract: The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fine-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning…

    Submitted 11 January, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: The code is available at: https://github.com/umbertocappellazzo/PETL_AST

  9. arXiv:2310.17864  [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  10. arXiv:2310.12858  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio Editing with Non-Rigid Text Prompts

    Authors: Francesco Paissan, Luca Della Libera, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan

    Abstract: In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-pro…

    Submitted 12 June, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: Accepted to INTERSPEECH 2024

  11. arXiv:2308.14456  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has bee…

    Submitted 21 February, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 18 Pages

  12. arXiv:2306.04054  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain

    Authors: Sangeet Sagar, Mirco Ravanelli, Bernd Kiefer, Ivana Kruijff Korbayova, Josef van Genabith

    Abstract: Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech d…

    Submitted 25 September, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

  13. arXiv:2306.00452  [pdf, ps, other]

    eess.AS cs.LG

    Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. Howe…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 6 pages

    Journal ref: INTERSPEECH 2023

  14. arXiv:2303.12659  [pdf, other]

    cs.AI cs.LG cs.SD eess.AS

    Posthoc Interpretation via Quantization

    Authors: Francesco Paissan, Cem Subakan, Mirco Ravanelli

    Abstract: In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input…

    Submitted 27 May, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

    Comments: Francesco Paissan and Cem Subakan contributed equally

  15. arXiv:2303.06740  [pdf, other]

    eess.AS cs.LG

    Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

    Authors: Salah Zaiem, Robin Algayres, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: Submitted to ICASSP "Self-supervision in Audio, Speech and Beyond" workshop

  16. arXiv:2207.13703  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

    Authors: Artem Ploujnikov, Mirco Ravanelli

    Abstract: End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: 5 pages, submitted to INTERSPEECH 2022

  17. arXiv:2206.09507  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Resource-Efficient Separation Transformer

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

    Abstract: Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-bas…

    Submitted 15 January, 2024; v1 submitted 19 June, 2022; originally announced June 2022.

    Comments: Accepted to ICASSP 2024

  18. arXiv:2205.07390  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Learning Representations for New Sound Classes With Continual Self-Supervised Learning

    Authors: Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

    Abstract: In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically rel…

    Submitted 13 December, 2022; v1 submitted 15 May, 2022; originally announced May 2022.

    Comments: Accepted to IEEE Signal Processing Letters

  19. arXiv:2202.02884  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Exploring Self-Attention Mechanisms for Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Francois Grondin, Mirko Bronzi

    Abstract: Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we…

    Submitted 27 May, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

    Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  20. arXiv:2111.05703  [pdf, other]

    eess.AS cs.SD

    OSSEM: one-shot speaker adaptive speech enhancement using meta learning

    Authors: Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli

    Abstract: Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified tra…

    Submitted 10 November, 2021; originally announced November 2021.

  21. arXiv:2110.10812  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    REAL-M: Towards Speech Separation on Real Mixtures

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin

    Abstract: In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  22. arXiv:2110.05866  [pdf]

    cs.SD cs.CL eess.AS

    MetricGAN-U: Unsupervised speech enhancement/ dereverberation based only on noisy/ reverberated speech

    Authors: Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao

    Abstract: Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training. Consequently, several noisy speeches recorded in daily life cannot be used to train the model. Although certain unsupervised learning frameworks have also been proposed to solve the pair constraint, they still require clean s…

    Submitted 12 October, 2021; originally announced October 2021.

  23. arXiv:2107.10790  [pdf, other]

    eess.SP cs.AI cs.HC cs.LG

    Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity

    Authors: Juan Manuel Mayor-Torres, Mirco Ravanelli, Sara E. Medina-DeVilliers, Matthew D. Lerner, Giuseppe Riccardi

    Abstract: Machine learning methods, such as deep learning, show promising results in the medical domain. However, the lack of interpretability of these algorithms may hinder their applicability to medical decision support systems. This paper studies an interpretable deep learning technique, called SincNet. SincNet is a convolutional neural network that efficiently learns customized band-pass filters through…

    Submitted 18 July, 2021; originally announced July 2021.

  24. arXiv:2106.04624  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    SpeechBrain: A General-Purpose Speech Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

    Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Preprint

  25. arXiv:2104.03538  [pdf]

    cs.SD cs.AI eess.AS

    MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

    Authors: Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao

    Abstract: The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory. Objective evaluation metrics which consider human perception can hence serve as a bridge to reduce the gap. Our previously proposed MetricGAN was designed to optimize objective metrics by connecting the metric with a discr…

    Submitted 4 June, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

    Comments: Accepted by Interspeech 2021

  26. arXiv:2104.01604  [pdf, other]

    cs.CL eess.AS

    Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers

    Authors: Loren Lugosch, Piyush Papreja, Mirco Ravanelli, Abdelwahab Heba, Titouan Parcollet

    Abstract: This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baseline models, the code for which has been made availab…

    Submitted 30 September, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

    Comments: Accepted to NeurIPS 2021 - Datasets and Benchmarks Track

  27. ECAPA-TDNN Embeddings for Speaker Diarization

    Authors: Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na

    Abstract: Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, f…

    Submitted 3 April, 2021; originally announced April 2021.

  28. arXiv:2010.13154  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Attention is All You Need in Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong

    Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a nov…

    Submitted 8 March, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  29. arXiv:2010.09930  [pdf, other]

    cs.SD eess.AS

    BIRD: Big Impulse Response Dataset

    Authors: François Grondin, Jean-Samuel Lauzon, Simon Michaud, Mirco Ravanelli, François Michaud

    Abstract: This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources…

    Submitted 19 October, 2020; originally announced October 2020.

  30. arXiv:2006.04603  [pdf, other]

    eess.IV cs.CV cs.LG

    BS-Net: learning COVID-19 pneumonia severity on a large Chest X-Ray dataset

    Authors: Alberto Signoroni, Mattia Savardi, Sergio Benini, Nicola Adami, Riccardo Leonardi, Paolo Gibellini, Filippo Vaccher, Marco Ravanelli, Andrea Borghesi, Roberto Maroldi, Davide Farina

    Abstract: In this work we design an end-to-end deep learning architecture for predicting, on Chest X-rays images (CXR), a multi-regional score conveying the degree of lung compromise in COVID-19 patients. Such semi-quantitative scoring system, namely Brixia score, is applied in serial monitoring of such patients, showing significant prognostic value, in one of the hospitals that experienced one of the highe…

    Submitted 3 April, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 28 pages, 11 figures, preprint of accepted paper to Medical Image Analysis, Project page with Code and Dataset Available at https://brixia.github.io/

    MSC Class: 68T45 ACM Class: I.2.10; I.5; I.4; J.3

  31. arXiv:2005.08566  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Quaternion Neural Networks for Multi-channel Distant Speech Recognition

    Authors: Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas Lane, Mohamed Morchid

    Abstract: Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal…

    Submitted 19 May, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: 4 pages

  32. arXiv:2001.09239  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Multi-task self-supervised learning for Robust Speech Recognition

    Authors: Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, Yoshua Bengio

    Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require ma…

    Submitted 17 April, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: In Proc. of ICASSP 2020

  33. arXiv:1910.09463  [pdf, other]

    eess.AS

    Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models

    Authors: Loren Lugosch, Brett Meyer, Derek Nowrouzezahrai, Mirco Ravanelli

    Abstract: End-to-end models are an attractive new approach to spoken language understanding (SLU) in which the meaning of an utterance is inferred directly from the raw audio without employing the standard pipeline composed of a separately trained speech recognizer and natural language understanding module. The downside of end-to-end SLU is that in-domain speech data must be recorded to train the model. In…

    Submitted 21 October, 2019; originally announced October 2019.

  34. arXiv:1904.03670  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Speech Model Pre-training for End-to-End Spoken Language Understanding

    Authors: Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, Yoshua Bengio

    Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is firs…

    Submitted 25 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  35. arXiv:1904.03416  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

    Authors: Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

    Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This…

    Submitted 6 April, 2019; originally announced April 2019.

  36. arXiv:1812.05920  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Speech and Speaker Recognition from Raw Waveform with SincNet

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw audio samples directly. Differently from standard hand-crafted features such as MFCCs or FBANK, the raw waveform can potentially help neural networks discover…

    Submitted 15 February, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

    Comments: arXiv admin note: substantial text overlap with arXiv:1811.09725, arXiv:1808.00158

  37. arXiv:1812.00271  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Learning Speaker Representations with Mutual Information

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some recent studies have shown that an implicit optimizati…

    Submitted 5 April, 2019; v1 submitted 1 December, 2018; originally announced December 2018.

    Comments: Submitted to Interspeech 2019

  38. arXiv:1811.09725  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE

    Interpretable Convolutional Filters with SincNet

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations, that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability,…

    Submitted 9 August, 2019; v1 submitted 23 November, 2018; originally announced November 2018.

    Comments: In Proceedings of NIPS@IRASL 2018. arXiv admin note: substantial text overlap with arXiv:1808.00158

  39. arXiv:1811.09678  [pdf, other]

    eess.AS cs.SD stat.ML

    Speech recognition with quaternion neural networks

    Authors: Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori

    Abstract: Neural network architectures are at the core of powerful automatic speech recognition systems (ASR). However, while recent researches focus on novel model architectures, the acoustic input features remain almost unchanged. Traditional ASR systems rely on multidimensional acoustic features such as the Mel filter bank energies alongside with the first, and second order derivatives to characterize ti…

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: NIPS 2018 (IRASL). arXiv admin note: text overlap with arXiv:1806.04418

  40. arXiv:1811.07453  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE

    The PyTorch-Kaldi Speech Recognition Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Yoshua Bengio

    Abstract: The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawn tremendous interest within the machine learning community thanks to…

    Submitted 15 February, 2019; v1 submitted 18 November, 2018; originally announced November 2018.

    Comments: Accepted at ICASSP 2019

  41. arXiv:1808.00158  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Speaker Recognition from Raw Waveform with SincNet

    Authors: Mirco Ravanelli, Yoshua Bengio

    Abstract: Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network t…

    Submitted 9 August, 2019; v1 submitted 29 July, 2018; originally announced August 2018.

    Comments: In Proceedings of SLT 2018

  42. arXiv:1805.10498  [pdf, other]

    eess.AS cs.LG cs.NE cs.SD

    Automatic context window composition for distant speech recognition

    Authors: Mirco Ravanelli, Maurizio Omologo

    Abstract: Distant speech recognition is being revolutionized by deep learning, which has contributed to significantly outperforming previous HMM-GMM systems. A key aspect behind the rapid rise and success of DNNs is their ability to better manage large time contexts. In this regard, asymmetric context windows that embed more past than future frames have been recently used with feed-forward neural networks. Th…

    Submitted 26 May, 2018; originally announced May 2018.

    Comments: This is a preprint version of the paper published on Speech Communication Journal, 2018. Please see https://www.sciencedirect.com/science/article/pii/S0167639318300128 for the published version of this article

  43. arXiv:1804.05374  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.NE

    Twin Regularization for online speech recognition

    Authors: Mirco Ravanelli, Dmitriy Serdyuk, Yoshua Bengio

    Abstract: Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, which is known to be very helpful for robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models…

    Submitted 11 June, 2018; v1 submitted 15 April, 2018; originally announced April 2018.

    Comments: Accepted at INTERSPEECH 2018

  44. arXiv:1803.10225  [pdf, other]

    eess.AS cs.NE cs.SD eess.SP

    Light Gated Recurrent Units for Speech Recognition

    Authors: Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio

    Abstract: A field that has directly benefited from the recent advances in deep learning is Automatic Speech Recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech reco…

    Submitted 26 March, 2018; originally announced March 2018.

    Comments: Copyright 2018 IEEE

    Journal ref: IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92-102, April 2018

  45. arXiv:1712.06086  [pdf, other]

    cs.CL cs.SD eess.AS

    Deep Learning for Distant Speech Recognition

    Authors: Mirco Ravanelli

    Abstract: Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and robust human-machine speech interaction still appea…

    Submitted 17 December, 2017; originally announced December 2017.

    Comments: PhD Thesis Unitn, 2017

  46. arXiv:1711.09470  [pdf, other]

    eess.AS cs.SD

    Realistic multi-microphone data simulation for distant speech recognition

    Authors: Mirco Ravanelli, Piergiorgio Svaizer, Maurizio Omologo

    Abstract: The availability of realistic simulated corpora is of key importance for the future progress of distant speech recognition technology. The reliability, flexibility and low computational cost of a data simulation process may ultimately allow researchers to train, tune and test different techniques in a variety of acoustic scenarios, avoiding the laborious effort of directly recording real data from…

    Submitted 26 November, 2017; originally announced November 2017.

    Comments: Proc. of Interspeech 2016

  47. arXiv:1710.04288  [pdf, other]

    eess.AS cs.SD

    Audio Concept Classification with Hierarchical Deep Neural Networks

    Authors: Mirco Ravanelli, Benjamin Elizalde, Karl Ni, Gerald Friedland

    Abstract: Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Al…

    Submitted 11 October, 2017; originally announced October 2017.

    Journal ref: EUSIPCO 2014

  48. arXiv:1710.03538  [pdf, other]

    eess.AS cs.CL cs.SD

    Contaminated speech training methods for robust DNN-HMM distant speech recognition

    Authors: Mirco Ravanelli, Maurizio Omologo

    Abstract: Despite the significant progress made in recent years, state-of-the-art speech recognition technologies provide satisfactory performance only in the close-talking condition. Robustness of distant speech recognition in adverse acoustic conditions, on the other hand, remains a crucial open issue for future applications of human-machine interaction. To this end, several advances in speech enhance…

    Submitted 10 October, 2017; originally announced October 2017.

    Journal ref: INTERSPEECH 2015

  49. arXiv:1710.02560  [pdf, other]

    eess.AS cs.CL cs.SD

    The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments

    Authors: Mirco Ravanelli, Maurizio Omologo

    Abstract: This paper introduces the contents and the possible usage of the DIRHA-ENGLISH multi-microphone corpus, recently realized under the EC DIRHA project. The reference scenario is a domestic environment equipped with a large number of microphones and microphone arrays distributed in space. The corpus is composed of both real and simulated material, and it includes 12 US and 12 UK English native spea…

    Submitted 6 October, 2017; originally announced October 2017.

    Comments: ASRU 2015