Showing 1–16 of 16 results for author: von Neumann, T

Searching in archive eess.
  1. arXiv:2309.16482  [pdf, ps, other]

    eess.AS cs.SD

    Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

    Authors: Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination…

    Submitted 6 May, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted at HSCMA Satellite Workshop at ICASSP 2024

  2. arXiv:2309.08454  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

    Authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

    Abstract: Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  3. arXiv:2307.11394  [pdf, other]

    cs.CL eess.AS

    MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC-WER and MIMO-WER, along with other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that only words are identified as correct when the temporal alignment is plaus…

    Submitted 25 January, 2024; v1 submitted 21 July, 2023; originally announced July 2023.

    Comments: Presented at the CHiME7 workshop 2023

  4. On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distanc…

    Submitted 21 July, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Presented at ICASSP 2023

  5. arXiv:2209.11494  [pdf, other]

    eess.AS

    MMS-MSG: A Multi-purpose Multi-Speaker Mixture Signal Generator

    Authors: Tobias Cord-Landwehr, Thilo von Neumann, Christoph Boeddeker, Reinhold Haeb-Umbach

    Abstract: The scope of speech enhancement has changed from a monolithic view of single, independent tasks, to a joint processing of complex conversational speech recordings. Training and evaluation of these single tasks requires synthetic data with access to intermediate signals that is as close as possible to the evaluation scenario. As such data often is not available, many works instead use specialized d…

    Submitted 23 September, 2022; originally announced September 2022.

    Comments: Accepted at IWAENC 2022

  6. arXiv:2207.13888  [pdf, other]

    eess.AS cs.SD

    Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

    Abstract: Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level…

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022 (5 pages, 1 figure)

  7. arXiv:2205.00944  [pdf, other]

    eess.AS cs.SD

    A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network

    Authors: Tobias Gburrek, Christoph Boeddeker, Thilo von Neumann, Tobias Cord-Landwehr, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

    Abstract: We propose a system that transcribes the conversation of a typical meeting scenario that is captured by a set of initially unsynchronized microphone arrays at unknown positions. It consists of subsystems for signal synchronization, including both sampling rate and sampling time offset estimation, diarization based on speaker and microphone array position estimation, multi-channel speech enhancemen…

    Submitted 2 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022

  8. arXiv:2204.01338  [pdf, other]

    cs.SD eess.AS

    An Initialization Scheme for Meeting Separation with Spatial Mixture Models

    Authors: Christoph Boeddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach

    Abstract: Spatial mixture model (SMM) supported acoustic beamforming has been extensively used for the separation of simultaneously active speakers. However, it has hardly been considered for the separation of meeting data, which are characterized by long recordings and only partially overlapping speech. In this contribution, we show that the fact that often only a single speaker is active can be utilized fo…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  9. arXiv:2111.07578  [pdf, other]

    eess.AS cs.SD

    Monaural source separation: From anechoic to reverberant environments

    Authors: Tobias Cord-Landwehr, Christoph Boeddeker, Thilo von Neumann, Catalin Zorila, Rama Doddipatla, Reinhold Haeb-Umbach

    Abstract: Impressive progress in neural network-based single-channel speech source separation has been made in recent years. But those improvements have been mostly reported on anechoic data, a situation that is hardly met in practice. Taking the SepFormer as a starting point, which achieves state-of-the-art performance on anechoic mixtures, we gradually modify it to optimize its performance on reverberant…

    Submitted 10 May, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: Submitted to IWAENC 2022

  10. arXiv:2110.15581  [pdf, other]

    eess.AS cs.SD

    SA-SDR: A novel loss function for separation of meeting style data

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function. The basic SDR is, however, undefined if the network reconstructs the reference signal perfectly or if the reference signal contains silence, e.g., when a two-output separator processes a single-speaker recording. Many modifications to the plain SD…

    Submitted 21 April, 2022; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: Accepted at ICASSP 2022

  11. Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by…

    Submitted 20 September, 2021; v1 submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at INTERSPEECH 2021

  12. arXiv:2107.14445  [pdf, other]

    eess.AS cs.SD

    Speeding Up Permutation Invariant Training for Source Separation

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, used for both utterance-level separation with utterance-level PIT (uPIT) and separation of long recordings with the recently proposed Graph-PIT. When implemented naively, both suffer from an exponential complexity in the number of utterances to separate, rendering them unusable for…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at 14th ITG Conference on Speech Communication

  13. arXiv:2006.13579  [pdf, other]

    eess.AS cs.SD

    Multi-path RNN for hierarchical modeling of long sequential data and its application to speaker stream separation

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Recently, source separation performance was greatly improved by time-domain audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for long sequential data. While DPRNN is quite efficient in modeling sequential data of the length of an utterance, i.e., about 5 to 10 seconds of data, it is harder to apply it to longer sequences such as…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: 5 pages, 4 figures

  14. Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

    Authors: Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi…

    Submitted 21 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: 5 pages, INTERSPEECH 2020

  15. End-to-end training of time domain audio separation and recognition

    Authors: Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audi…

    Submitted 13 April, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: 5 pages, 1 figure, to appear in ICASSP 2020

  16. arXiv:1902.07881  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    All-neural online source separation, counting, and diarization for meeting analysis

    Authors: Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural ap…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in ICASSP 2019