
Showing 1–22 of 22 results for author: Subakan, C

Searching in archive eess.
  1. arXiv:2407.00463

    cs.LG cs.AI cs.CL cs.HC eess.AS

    Open-Source Conversational AI with SpeechBrain 1.0

    Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, et al. (5 additional authors not shown)

    Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presen…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Submitted to JMLR (Machine Learning Open Source Software)

  2. arXiv:2406.14294

    cs.SD cs.AI eess.AS

    DASB - Discrete Audio and Speech Benchmark

    Authors: Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently…

    Submitted 21 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 9 pages, 5 tables

  3. arXiv:2406.10735

    cs.SD cs.AI cs.CL eess.AS

    How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

    Authors: Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and N…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures, 2 tables, Accepted at Interspeech 2024

  4. arXiv:2406.10422

    eess.AS cs.SD eess.SP

    Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

    Authors: Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan

    Abstract: In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces saliency maps that result in more faithful explanations com…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  5. arXiv:2405.17615

    cs.SD cs.LG eess.AS eess.SP

    Listenable Maps for Zero-Shot Audio Classifiers

    Authors: Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

    Abstract: Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot…

    Submitted 27 May, 2024; originally announced May 2024.

  6. arXiv:2403.13086

    cs.SD cs.LG eess.AS eess.SP

    Listenable Maps for Audio Classifiers

    Authors: Francesco Paissan, Mirco Ravanelli, Cem Subakan

    Abstract: Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and li…

    Submitted 19 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted to ICML 2024 (Oral)

  7. arXiv:2402.02754

    cs.SD cs.LG eess.AS

    Focal Modulation Networks for Interpretable Sound Classification

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli

    Abstract: The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the pr…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 XAI-SA Workshop

  8. arXiv:2310.12858

    cs.SD cs.LG eess.AS

    Audio Editing with Non-Rigid Text Prompts

    Authors: Francesco Paissan, Luca Della Libera, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan

    Abstract: In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-pro…

    Submitted 12 June, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: Accepted to INTERSPEECH 2024

  9. arXiv:2305.18283

    cs.CL cs.AI cs.LG eess.AS

    CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice

    Authors: Juan Zuluaga-Gomez, Sara Ahmed, Danielius Visockas, Cem Subakan

    Abstract: Despite the recent advancements in Automatic Speech Recognition (ASR), the recognition of accented speech still remains a dominant problem. In order to create more inclusive ASR systems, research has shown that the integration of accent information, as part of a larger ASR framework, can lead to the mitigation of accented speech errors. We address multilingual accent classification through the ECA…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023

  10. arXiv:2305.01864

    cs.SD cs.LG eess.AS

    Unsupervised Improvement of Audio-Text Cross-Modal Representations

    Authors: Zhepei Wang, Cem Subakan, Krishna Subramani, Junkai Wu, Tiago Tavares, Fabio Ayres, Paris Smaragdis

    Abstract: Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-t…

    Submitted 31 July, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: Accepted to WASPAA 2023

  11. arXiv:2305.01578

    cs.SD cs.AI cs.CL eess.AS

    Self-supervised learning for infant cry analysis

    Authors: Arsenii Gorin, Cem Subakan, Sajjad Abdoli, Junhao Wang, Samantha Latremouille, Charles Onu

    Abstract: In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications of more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and t…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: Accepted to IEEE ICASSP 2023 workshop Self-supervision in Audio, Speech and Beyond

  12. arXiv:2305.00969

    cs.SD cs.AI cs.CL eess.AS

    CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds

    Authors: David Budaghyan, Charles C. Onu, Arsenii Gorin, Cem Subakan, Doina Precup

    Abstract: This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, which is a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 p…

    Submitted 21 March, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

    Comments: ICASSP 2024

  13. arXiv:2303.12659

    cs.AI cs.LG cs.SD eess.AS

    Posthoc Interpretation via Quantization

    Authors: Francesco Paissan, Cem Subakan, Mirco Ravanelli

    Abstract: In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input…

    Submitted 27 May, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

    Comments: Francesco Paissan and Cem Subakan contributed equally

  14. arXiv:2206.09507

    eess.AS cs.LG cs.SD eess.SP

    Resource-Efficient Separation Transformer

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

    Abstract: Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-bas…

    Submitted 15 January, 2024; v1 submitted 19 June, 2022; originally announced June 2022.

    Comments: Accepted to ICASSP 2024

  15. arXiv:2205.07390

    eess.AS cs.LG cs.SD eess.SP

    Learning Representations for New Sound Classes With Continual Self-Supervised Learning

    Authors: Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

    Abstract: In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically rel…

    Submitted 13 December, 2022; v1 submitted 15 May, 2022; originally announced May 2022.

    Comments: Accepted to IEEE Signal Processing Letters

  16. arXiv:2202.02884

    eess.AS cs.LG cs.SD eess.SP

    Exploring Self-Attention Mechanisms for Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Francois Grondin, Mirko Bronzi

    Abstract: Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we…

    Submitted 27 May, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

    Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  17. arXiv:2110.10812

    eess.AS cs.LG cs.SD eess.SP

    REAL-M: Towards Speech Separation on Real Mixtures

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin

    Abstract: In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  18. arXiv:2106.04624

    eess.AS cs.AI cs.LG cs.SD

    SpeechBrain: A General-Purpose Speech Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

    Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Preprint

  19. arXiv:2010.13154

    eess.AS cs.LG cs.SD eess.SP

    Attention is All You Need in Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong

    Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a nov…

    Submitted 8 March, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  20. arXiv:1910.09804

    cs.LG cs.CL cs.SD eess.AS stat.ML

    Two-Step Sound Source Separation: Training on Learned Latent Targets

    Authors: Efthymios Tzinis, Shrikant Venkataramani, Zhepei Wang, Cem Subakan, Paris Smaragdis

    Abstract: In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invari…

    Submitted 23 October, 2019; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  21. arXiv:1906.00654

    cs.LG cs.SD eess.AS stat.ML

    Continual Learning of New Sound Classes using Generative Replay

    Authors: Zhepei Wang, Cem Subakan, Efthymios Tzinis, Paris Smaragdis, Laurent Charlin

    Abstract: Continual learning consists in incrementally training a model on a sequence of datasets and testing on the union of all datasets. In this paper, we examine continual learning for the problem of sound classification, in which we wish to refine already trained models to learn new sound classes. In practice one does not want to maintain all past training data and retrain from scratch, but naively upd…

    Submitted 3 June, 2019; originally announced June 2019.

  22. arXiv:1709.07908

    cs.SD eess.AS

    Neural Network Alternatives to Convolutive Audio Models for Source Separation

    Authors: Shrikant Venkataramani, Y. Cem Subakan, Paris Smaragdis

    Abstract: The Convolutive Non-Negative Matrix Factorization model factorizes a given audio spectrogram using frequency templates with a temporal dimension. In this paper, we present a convolutional auto-encoder model that acts as a neural network alternative to convolutive NMF. Using the modeling flexibility granted by neural networks, we also explore the idea of using a Recurrent Neural Network in the encoder.…

    Submitted 20 September, 2017; originally announced September 2017.

    Comments: Published in MLSP 2017