Showing 1–50 of 68 results for author: Delcroix, M

  1. arXiv:2406.18972  [pdf, ps, other]

    eess.AS cs.CL

    Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over

    Authors: Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

    Abstract: Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LL… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 5 pages

  2. arXiv:2404.14860  [pdf, other]

    eess.AS cs.SD

    Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

    Authors: Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

    Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE f… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 13 pages, 6 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  3. arXiv:2402.13200  [pdf, other]

    eess.AS cs.SD

    Probing Self-supervised Learning Models with Target Speech Extraction

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

    Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction c… ▽ More

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  4. arXiv:2402.13199  [pdf, other]

    eess.AS cs.SD

    Target Speech Extraction with Pre-trained Self-supervised Learning Models

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocky

    Abstract: Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixt… ▽ More

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  5. arXiv:2402.03058  [pdf, other]

    eess.AS cs.SD

    Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers

    Authors: Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo

    Abstract: Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to s… ▽ More

    Submitted 17 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: accepted at Interspeech 2024

  6. arXiv:2401.17632  [pdf, other]

    cs.CL cs.SD eess.AS

    What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

    Authors: Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima

    Abstract: Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these model… ▽ More

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  7. arXiv:2401.05111  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

    Authors: Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima

    Abstract: The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately. However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method.… ▽ More

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, Accepted to IEEE ICASSP 2024

  8. arXiv:2312.14609  [pdf, ps, other]

    eess.AS cs.CL

    BLSTM-Based Confidence Estimation for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Naohiro Tawara, Takatomo Kano, Marc Delcroix

    Abstract: Confidence estimation, in which we estimate the reliability of each recognized token (e.g., word, sub-word, and character) in automatic speech recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an important function for developing ASR applications. In this study, we perform confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR systems show high performanc… ▽ More

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2021

  9. arXiv:2312.12764  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

    Authors: Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, Shoko Araki

    Abstract: We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring on automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are trained with… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2022

  10. arXiv:2311.11599  [pdf, other]

    eess.AS

    How does end-to-end speech recognition training impact speech enhancement artifacts?

    Authors: Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

    Abstract: Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise a… ▽ More

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: 5 pages, 1 figure, 1 table

  11. arXiv:2311.11595  [pdf, ps, other]

    eess.AS

    Neural network-based virtual microphone estimation with virtual microphone and beamformer-level multi-task loss

    Authors: Hanako Segawa, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Shoko Araki, Takeshi Yamada, Shoji Makino

    Abstract: Array processing performance depends on the number of microphones available. Virtual microphone estimation (VME) has been proposed to increase the number of microphone signals artificially. Neural network-based VME (NN-VME) trains an NN with a VM-level loss to predict a signal at a microphone location that is available during training but not at inference. However, this training objective may not… ▽ More

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: 5 pages, 2 figures, 1 table

  12. arXiv:2310.11010  [pdf, ps, other]

    eess.AS cs.CL

    Iterative Shallow Fusion of Backward Language Model for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix

    Abstract: We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR… ▽ More

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted to ICASSP 2023

  13. arXiv:2310.02732  [pdf, ps, other]

    eess.AS cs.SD

    Discriminative Training of VBx Diarization

    Authors: Dominik Klement, Mireia Diez, Federico Landini, Lukáš Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara

    Abstract: Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) for speaker distribution modeling, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper presents a new framework fo… ▽ More

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  14. arXiv:2309.16482  [pdf, ps, other]

    eess.AS cs.SD

    Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

    Authors: Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination… ▽ More

    Submitted 6 May, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted at HSCMA Satellite Workshop at ICASSP 2024

  15. arXiv:2309.12656  [pdf, other]

    eess.AS cs.SD

    NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization

    Authors: Naohiro Tawara, Marc Delcroix, Atsushi Ando, Atsunori Ogawa

    Abstract: This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using d… ▽ More

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: 5 pages, 5 figures, Submitted to ICASSP 2024

  16. arXiv:2308.03987  [pdf, other]

    eess.AS cs.SD

    Target Speech Extraction with Conditional Diffusion Model

    Authors: Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani

    Abstract: Diffusion model-based speech enhancement has received increased attention since it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech enhancement, such as speech denoising, dereverberation, and source separation. In this paper, we investigate their use for target speech extraction (TSE), which co… ▽ More

    Submitted 16 August, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: 5 pages, 4 figures, Interspeech2023

  17. arXiv:2307.11394  [pdf, other]

    cs.CL eess.AS

    MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC-WER and MIMO-WER along other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that only words are identified as correct when the temporal alignment is plaus… ▽ More

    Submitted 25 January, 2024; v1 submitted 21 July, 2023; originally announced July 2023.

    Comments: Presented at the CHiME7 workshop 2023

  18. arXiv:2306.08374  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma

    Abstract: Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks, such as speech and speaker recognition. More recently, speech SSL models have also been shown to be beneficial in advancing spoken language understanding tasks, implying that the SSL models have the potential to learn not only acoustic but also linguistic information. In this paper,… ▽ More

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  19. arXiv:2306.04233  [pdf, other]

    cs.CL cs.SD eess.AS

    Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because it, in contrast to the conventional cascade approach, can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model… ▽ More

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  20. arXiv:2305.15971  [pdf, other]

    eess.AS

    Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data

    Authors: Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami

    Abstract: Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture. It is a promising approach for streaming applications because it does not incur the extra computation costs of a target speech extraction frontend, which is a critical barrier to quick response. TS-RNNT is trained end-to-end given the input speech (i… ▽ More

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  21. arXiv:2305.14723  [pdf, other]

    eess.AS cs.SD

    Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

    Authors: Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

    Abstract: Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks by leveraging massive unlabeled audio data. The noise robustness of the SSL is one of the important challenges to expanding its application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits… ▽ More

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 4 pages, 2 figures, Accepted to Interspeech 2023

  22. arXiv:2305.13580  [pdf, other]

    eess.AS cs.SD

    Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

    Authors: Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki

    Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC generates thus multiple streams of embeddi… ▽ More

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  23. arXiv:2303.00978  [pdf, other]

    cs.CL eess.AS

    Leveraging Large Text Corpora for End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura

    Abstract: End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  24. Neural Target Speech Extraction: An Overview

    Authors: Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

    Abstract: Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characte… ▽ More

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: Submitted to IEEE Signal Processing Magazine on Apr. 25, 2022, and accepted on Jan. 12, 2023

  25. On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distanc… ▽ More

    Submitted 21 July, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Presented at ICASSP 2023

  26. arXiv:2209.04175  [pdf, other]

    eess.AS cs.SD

    Streaming Target-Speaker ASR with Neural Transducer

    Authors: Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki

    Abstract: Although recent advances in deep learning technology have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to tackle this problem is to use a cascade of a speech separation or target speech extraction front-end with an ASR back-end. However, the extra compu… ▽ More

    Submitted 19 September, 2022; v1 submitted 9 September, 2022; originally announced September 2022.

    Comments: Accepted to Interspeech 2022

  27. arXiv:2208.07091  [pdf, other]

    cs.SD cs.LG eess.AS

    Analysis of impact of emotions on target speech extraction and speech separation

    Authors: Ján Švec, Kateřina Žmolíková, Martin Kocour, Marc Delcroix, Tsubasa Ochiai, Ladislav Mošner, Jan Černocký

    Abstract: Recently, the performance of blind speech separation (BSS) and target speech extraction (TSE) has greatly progressed. Most works, however, focus on relatively well-controlled conditions using, e.g., read speech. The performance may degrade in more realistic situations. One of the factors causing such degradation may be intrinsic speaker variability, such as emotions, occurring commonly in realisti… ▽ More

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: Accepted to IWAENC 2022

  28. arXiv:2207.13888  [pdf, other]

    eess.AS cs.SD

    Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

    Abstract: Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level… ▽ More

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022 (5 pages, 1 figure)

  29. arXiv:2207.11964  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    ConceptBeam: Concept Driven Target Speech Extraction

    Authors: Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

    Abstract: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we… ▽ More

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM Multimedia 2022

  30. arXiv:2206.08174  [pdf, other]

    eess.AS cs.SD eess.SP

    Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

    Abstract: Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice characteristics of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., characteristics mismatch between target speech and an enrollment utter… ▽ More

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, 3 tables, Submitted to Interspeech 2022

  31. arXiv:2205.03568  [pdf, other]

    eess.AS cs.SD

    Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking

    Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in pr… ▽ More

    Submitted 7 May, 2022; originally announced May 2022.

    Comments: 11 pages, 7 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  32. arXiv:2204.04811  [pdf, other]

    eess.AS cs.SD

    Listen only to me! How well can target speech extraction handle false alarms?

    Authors: Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani

    Abstract: Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE addresses thus the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the recent development of neural networks for speech enha… ▽ More

    Submitted 14 July, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  33. arXiv:2204.03895  [pdf, other]

    eess.AS cs.SD

    SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki

    Abstract: In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing th… ▽ More

    Submitted 2 November, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on Feb. 10th, 2022, and accepted on Oct. 20, 2022

  34. arXiv:2202.06524  [pdf, other]

    eess.AS cs.LG cs.SD

    Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model

    Authors: Keisuke Kinoshita, Marc Delcroix, Tomoharu Iwata

    Abstract: Speaker diarization has been investigated extensively as an important central task for meeting analysis. Recent trend shows that integration of end-to-end neural (EEND)-and clustering-based diarization is a promising approach to handle realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and achieved state-of-the-art results on various tasks. How… ▽ More

    Submitted 14 February, 2022; originally announced February 2022.

    Comments: Accepted to IEEE ICASSP-2022, 5 pages, 2 figures

  35. arXiv:2201.06685  [pdf, other]

    eess.AS cs.SD

    How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR

    Authors: Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

    Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with single-channel speech enhancement (SE). In this paper, we investigate the causes of ASR performance degradation by decomposing the SE errors using orthogonal projection-based decomposition (OPD). OPD decomposes the SE errors into noise and artifact components. The artifact component is defined as t… ▽ More

    Submitted 30 March, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: 5 pages, 5 figures, submitted to Interspeech 2022

  36. Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya

    Abstract: The combination of a deep neural network (DNN)-based speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end is a widely used approach to implement overlapping speech recognition. However, the SE front-end generates processing artifacts that can degrade the ASR performance. We previously found that such performance degradation can occur even under fully overlapping co… ▽ More

    Submitted 11 January, 2022; originally announced January 2022.

    Comments: 5 pages, 2 figures

    Journal ref: In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6287-6291

  37. arXiv:2111.08201  [pdf, other]

    eess.AS cs.CL

    Attention-based Multi-hypothesis Fusion for Speech Summarization

    Authors: Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

    Abstract: Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). With this cascade approach, we can exploit state-of-the-art models and large training datasets for both subtasks, i.e., Transformer for ASR and Bidirectional Encoder Representations from Transformers (BERT) for TS. However, ASR errors direct… ▽ More

    Submitted 15 November, 2021; originally announced November 2021.

  38. arXiv:2111.00009  [pdf, other]

    eess.AS cs.LG cs.SD

    Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

    Authors: Martin Kocour, Kateřina Žmolíková, Lucas Ondel, Ján Švec, Marc Delcroix, Tsubasa Ochiai, Lukáš Burget, Jan Černocký

    Abstract: In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied on each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the aco… ▽ More

    Submitted 15 April, 2022; v1 submitted 31 October, 2021; originally announced November 2021.

    Comments: submitted to Interspeech 2022

  39. arXiv:2110.15581  [pdf, other]

    eess.AS cs.SD

    SA-SDR: A novel loss function for separation of meeting style data

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function. The basic SDR is, however, undefined if the network reconstructs the reference signal perfectly or if the reference signal contains silence, e.g., when a two-output separator processes a single-speaker recording. Many modifications to the plain SD… ▽ More

    Submitted 21 April, 2022; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: accepted at ICASSP 2022

  40. Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by…

    Submitted 20 September, 2021; v1 submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at INTERSPEECH 2021

  41. arXiv:2107.14445  [pdf, other]

    eess.AS cs.SD

    Speeding Up Permutation Invariant Training for Source Separation

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, used for both utterance-level separation with utterance-level PIT (uPIT) and separation of long recordings with the recently proposed Graph-PIT. When implemented naively, both suffer from an exponential complexity in the number of utterances to separate, rendering them unusable for…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at 14th ITG Conference on Speech Communication

  42. arXiv:2106.07144  [pdf, other]

    eess.AS cs.SD

    Few-shot learning of new sound classes for target sound extraction

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki

    Abstract: Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector that represents the desired AE class. With this approach, embedding vectors associated with the AE classes are directly optimized for the extraction of sound classes seen du…

    Submitted 13 June, 2021; originally announced June 2021.

    Comments: To appear in Interspeech 2021

  43. arXiv:2106.03903  [pdf, other]

    cs.SD cs.LG eess.AS

    PILOT: Introducing Transformers for Probabilistic Sound Event Localization

    Authors: Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  44. Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo

    Abstract: Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate 'processing artifacts' as a side effect of the enha…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure

    Journal ref: in Proc. Interspeech 2021, 1149-1153

  45. arXiv:2105.09040  [pdf, other]

    eess.AS cs.SD

    Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech

    Authors: Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

    Abstract: Recently, we proposed a novel speaker diarization method called End-to-End-Neural-Diarization-vector clustering (EEND-vector clustering) that integrates clustering-based and end-to-end neural network-based diarization approaches into one framework. The proposed method combines advantages of both frameworks, i.e. high diarization performance and handling of overlapped speech based on EEND, and robu…

    Submitted 31 August, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

    Comments: 5 pages, 1 figure, Interspeech 2021. (Update to include a reference to the code)

  46. arXiv:2103.00417  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

    Authors: Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utili…

    Submitted 28 February, 2021; originally announced March 2021.

    Comments: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020

  47. arXiv:2102.11634  [pdf, other]

    eess.AS cs.SD

    Dual-Path Modeling for Long Recording Speech Separation in Meetings

    Authors: Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

    Abstract: The continuous speech separation (CSS) is a task to separate the speech sources from a long, partially overlapped recording, which involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a size-fixed window and process each window separately. Though effective, this extension fails to…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: Accepted by ICASSP 2021

  48. arXiv:2102.11588  [pdf, other]

    cs.SD cs.AI cs.CL cs.CV cs.LG eess.AS eess.IV

    Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

    Authors: Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura

    Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the…

    Submitted 24 February, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: 4 pages, 6 figures, ICASSP 2021

  49. End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

    Authors: Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

    Abstract: Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel mult…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2021

  50. arXiv:2102.01326  [pdf, other]

    eess.AS cs.LG cs.SD

    Multimodal Attention Fusion for Target Speaker Extraction

    Authors: Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers a more stable performance than…

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: 7 pages, 5 figures

    Journal ref: in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 778-784