Skip to main content

Showing 1–30 of 30 results for author: Cornell, S

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.13471  [pdf, other

    eess.AS cs.SD

    Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement

    Authors: Chenda Li, Samuele Cornell, Shinji Watanabe, Yanmin Qian

    Abstract: Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminat… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  2. arXiv:2406.08056  [pdf, ps, other

    eess.AS cs.SD

    DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels

    Authors: Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Martín-Morató, Manu Harju, Annamaria Mesaros, Romain Serizel

    Abstract: The Detection and Classification of Acoustic Scenes and Events Challenge Task 4 aims to advance sound event detection (SED) systems in domestic environments by leveraging training data with different supervision uncertainty. Participants are challenged in exploring how to best use training data from different domains and with varying annotation granularity (strong/weak temporal resolution, soft/ha… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.04660  [pdf, other

    eess.AS cs.SD

    URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

    Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

    Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

  4. arXiv:2310.17864  [pdf, other

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, **chuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by develo** impactful features. Here, we survey TorchAudio's devel… ▽ More

    Submitted 26 October, 2023; originally announced October 2023.

  5. arXiv:2310.14976  [pdf

    cs.LG

    Reinforcement learning in large, structured action spaces: A simulation study of decision support for spinal cord injury rehabilitation

    Authors: Nathan Phelps, Stephanie Marrocco, Stephanie Cornell, Dalton L. Wolfe, Daniel J. Lizotte

    Abstract: Reinforcement learning (RL) has helped improve decision-making in several applications. However, applying traditional RL is challenging in some applications, such as rehabilitation of people with a spinal cord injury (SCI). Among other factors, using RL in this domain is difficult because there are many possible treatments (i.e., large action space) and few patients (i.e., limited training data).… ▽ More

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: 31 pages, 7 figures

  6. arXiv:2310.01688  [pdf, other

    eess.AS cs.CL cs.SD

    One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

    Authors: Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini

    Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarizat… ▽ More

    Submitted 2 October, 2023; originally announced October 2023.

  7. arXiv:2307.12231  [pdf, other

    cs.SD cs.CL eess.AS

    Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

    Authors: Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe

    Abstract: Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and comp… ▽ More

    Submitted 23 July, 2023; originally announced July 2023.

    Comments: Accepted to IEEE WASPAA 2023

  8. arXiv:2306.13734  [pdf, other

    eess.AS cs.CL cs.SD

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

    Authors: Samuele Cornell, Matthew Wiesner, Shinji Watanabe, Desh Raj, Xuankai Chang, Paola Garcia, Matthew Maciejewski, Yoshiki Masuyama, Zhong-Qiu Wang, Stefano Squartini, Sanjeev Khudanpur

    Abstract: The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate… ▽ More

    Submitted 14 July, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

  9. arXiv:2305.18074  [pdf, other

    eess.AS cs.SD eess.SP

    An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings

    Authors: Luca Serafini, Samuele Cornell, Giovanni Morrone, Enrico Zovato, Alessio Brutti, Stefano Squartini

    Abstract: We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied the inference-time computational requirements and diarization accuracy on fou… ▽ More

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 52 pages, 10 figures

  10. arXiv:2304.08707  [pdf, other

    eess.AS cs.SD

    Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

    Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

    Abstract: We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain. The model maintains an information highway to flow an over-complete input representation through multiple FSB-LSTM modules. Each FSB-LSTM module consists of a full-band bl… ▽ More

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: in ICASSP 2023

  11. End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

    Authors: Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

    Abstract: Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain,… ▽ More

    Submitted 22 May, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: 16 pages, 7 figures

    Journal ref: Speech Communication 161 (2024) 103081

  12. arXiv:2302.07928  [pdf, other

    eess.AS cs.SD eess.SP

    Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

    Authors: Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

    Abstract: This paper describes our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of target speech enhancement for hearing-aid (HA) devices in noisy-reverberant environments with multiple interferers such as music and competing speakers. Our approach builds upon the powerful iterative neural/beamforming enhancement (iNeuBe) framework introduced in our recent work, and this p… ▽ More

    Submitted 15 February, 2023; originally announced February 2023.

  13. arXiv:2211.12433  [pdf, other

    cs.SD eess.AS

    TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

    Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

    Abstract: We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral map**, where the real and imaginary (RI)… ▽ More

    Submitted 4 August, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: In IEEE/ACM Transactions on Audio, Speech, and Language Processing. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.html, and the code is available at https://github.com/espnet/espnet/pull/5395

  14. arXiv:2210.10742  [pdf, other

    cs.SD eess.AS

    End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

    Authors: Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

    Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end ar… ▽ More

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  15. arXiv:2210.07856  [pdf, other

    eess.AS cs.SD

    Description and analysis of novelties introduced in DCASE Task 4 2022 on the baseline system

    Authors: Francesca Ronchini, Samuele Cornell, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Daniel P. W. Ellis

    Abstract: The aim of the Detection and Classification of Acoustic Scenes and Events Challenge Task 4 is to evaluate systems for the detection of sound events in domestic environments using an heterogeneous dataset. The systems need to be able to correctly detect the sound events present in a recorded audio clip, as well as localize the events in time. This year's task is a follow-up of DCASE 2021 Task 4, wi… ▽ More

    Submitted 14 October, 2022; originally announced October 2022.

    Journal ref: Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022)

  16. arXiv:2209.03952  [pdf, other

    cs.SD eess.AS

    TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation

    Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

    Abstract: We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal inf… ▽ More

    Submitted 15 March, 2023; v1 submitted 8 September, 2022; originally announced September 2022.

    Comments: in IEEE ICASSP 2023

  17. arXiv:2207.09514  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

    Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

    Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front… ▽ More

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  18. arXiv:2206.09507  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Resource-Efficient Separation Transformer

    Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

    Abstract: Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-bas… ▽ More

    Submitted 15 January, 2024; v1 submitted 19 June, 2022; originally announced June 2022.

    Comments: Accepted to ICASSP 2024

  19. arXiv:2202.12298  [pdf, other

    eess.AS cs.SD

    Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

    Authors: Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe

    Abstract: This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of speech enhancement with 3D Ambisonic microphones. The core of our approach combines Deep Neural Network (DNN) driven complex spectral map** with linear beamformers such as the multi-frame multi-channel Wiener filter. Our proposed system has two DNNs and a linear beamformer in between. Both DNNs are trained to… ▽ More

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: to be published in IEEE ICASSP 2022

  20. arXiv:2202.02884  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Exploring Self-Attention Mechanisms for Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Francois Grondin, Mirko Bronzi

    Abstract: Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we… ▽ More

    Submitted 27 May, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

    Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  21. arXiv:2111.10639  [pdf, other

    cs.SD cs.LG eess.AS

    Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection

    Authors: Samuele Cornell, Thomas Balestri, Thibaud Sénéchal

    Abstract: In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio. In these instances, the performance of tasks such as keyword-spotting (KWS) and device-directed speech detection (DDD) can degrade significantly. To address this problem, we propose an implicit acoustic echo cancellation (iAEC) framework where a neural network is trained to exploit t… ▽ More

    Submitted 4 October, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

    Comments: To be presented at SLT 2022

  22. arXiv:2111.04614  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Learning Filterbanks for End-to-End Acoustic Beamforming

    Authors: Samuele Cornell, Manuel Pariente, François Grondin, Stefano Squartini

    Abstract: Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial… ▽ More

    Submitted 19 February, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

    Comments: accepted at ICASSP 2022

  23. arXiv:2110.10812  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    REAL-M: Towards Speech Separation on Real Mixtures

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin

    Abstract: In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life… ▽ More

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  24. arXiv:2109.14061  [pdf, other

    eess.AS cs.LG cs.SD

    The impact of non-target events in synthetic soundscapes for sound event detection

    Authors: Francesca Ronchini, Romain Serizel, Nicolas Turpault, Samuele Cornell

    Abstract: Detection and Classification Acoustic Scene and Events Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect the performance. In this paper, we focus on th… ▽ More

    Submitted 28 September, 2021; originally announced September 2021.

    Journal ref: Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)

  25. arXiv:2106.04624  [pdf, other

    eess.AS cs.AI cs.LG cs.SD

    SpeechBrain: A General-Purpose Speech Toolkit

    Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

    Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing… ▽ More

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Preprint

  26. arXiv:2104.02819  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Learning to Rank Microphones for Distant Speech Recognition

    Authors: Samuele Cornell, Alessio Brutti, Marco Matassoni, Stefano Squartini

    Abstract: Fully exploiting ad-hoc microphone networks for distant speech recognition is still an open issue. Empirical evidence shows that being able to select the best microphone leads to significant improvements in recognition without any additional effort on front-end processing. Current channel selection techniques either rely on signal, decoder or posterior-based features. Signal-based features are ine… ▽ More

    Submitted 13 April, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

  27. arXiv:2010.13154  [pdf, other

    eess.AS cs.LG cs.SD eess.SP

    Attention is All You Need in Speech Separation

    Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong

    Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a nov… ▽ More

    Submitted 8 March, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  28. arXiv:2005.04132  [pdf, other

    eess.AS cs.SD

    Asteroid: the PyTorch-based audio source separation toolkit for researchers

    Authors: Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent

    Abstract: This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Aste… ▽ More

    Submitted 8 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  29. arXiv:1911.02388  [pdf, other

    eess.AS cs.LG cs.SD

    The Speed Submission to DIHARD II: Contributions & Lessons Learned

    Authors: Md Sahidullah, Jose Patino, Samuele Cornell, Ruiqing Yin, Sunit Sivasankaran, Hervé Bredin, Pavel Korshunov, Alessio Brutti, Romain Serizel, Emmanuel Vincent, Nicholas Evans, Sébastien Marcel, Stefano Squartini, Claude Barras

    Abstract: This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization syst… ▽ More

    Submitted 6 November, 2019; originally announced November 2019.

  30. arXiv:1910.10400  [pdf, other

    cs.SD cs.LG eess.AS eess.SP

    Filterbank design for end-to-end speech separation

    Authors: Manuel Pariente, Samuele Cornell, Antoine Deleforge, Emmanuel Vincent

    Abstract: Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corres… ▽ More

    Submitted 28 February, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: ICASSP 2020