Showing 1–46 of 46 results for author: Haeb-Umbach, R

  1. arXiv:2311.15597  [pdf, ps, other]

    eess.AS cs.SD

    Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks

    Authors: Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

    Abstract: We propose a diarization system, that estimates "who spoke when" based on spatial information, to be used as a front-end of a meeting transcription system running on the signals gathered from an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting the spatial diversity for diarization and signal enhancement is challenging, because the micr…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted at Asilomar Conference on Signals, Systems, and Computers 2023

  2. arXiv:2309.16482  [pdf, ps, other]

    eess.AS cs.SD

    Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

    Authors: Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination…

    Submitted 6 May, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted at HSCMA Satellite Workshop at ICASSP 2024

  3. arXiv:2309.08454  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

    Authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

    Abstract: Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  4. arXiv:2308.10682  [pdf, other]

    cs.SD cs.CL eess.AS

    LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices

    Authors: Joerg Schmalenstroeer, Tobias Gburrek, Reinhold Haeb-Umbach

    Abstract: We present LibriWASN, a data set whose design follows closely the LibriCSS meeting recognition data set, with the marked difference that the data is recorded with devices that are randomly positioned on a meeting table and whose sampling clocks are not synchronized. Nine different devices, five smartphones with a single recording channel and four microphone arrays, are used to record a total of 29…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted for presentation at the ITG conference on Speech Communication 2023

  5. arXiv:2307.11394  [pdf, other]

    cs.CL eess.AS

    MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC-WER and MIMO-WER along other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that only words are identified as correct when the temporal alignment is plaus…

    Submitted 25 January, 2024; v1 submitted 21 July, 2023; originally announced July 2023.

    Comments: Presented at the CHiME7 workshop 2023

  6. arXiv:2306.15440  [pdf, ps, other]

    eess.AS cs.SD

    Post-Processing Independent Evaluation of Sound Event Detection Systems

    Authors: Janek Ebbers, Reinhold Haeb-Umbach, Romain Serizel

    Abstract: Due to the high variation in the application requirements of sound event detection (SED) systems, it is not sufficient to evaluate systems only in a single operating mode. Therefore, the community recently adopted the polyphonic sound detection score (PSDS) as an evaluation metric, which is the normalized area under the PSD receiver operating characteristic (PSD-ROC). It summarizes the system perf…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: submitted to DCASE Workshop 2023

  7. arXiv:2306.12173  [pdf, other]

    cs.CL cs.LG

    Mixture Encoder for Joint Speech Separation and Recognition

    Authors: Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

    Abstract: Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate speakers and recognize each of them with a single-speaker ASR system. End-to-end models process overlapped speech directly in a single, powerful neural network…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  8. A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures

    Authors: Tobias Cord-Landwehr, Christoph Boeddeker, Cătălin Zorilă, Rama Doddipatla, Reinhold Haeb-Umbach

    Abstract: We introduce a monaural neural speaker embeddings extractor that computes an embedding for each speaker present in a speech mixture. To allow for supervised training, a teacher-student approach is employed: the teacher computes the target embeddings from each speaker's utterance before the utterances are added to form the mixture, and the student embedding extractor is then tasked to reproduce tho…

    Submitted 19 September, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Proceedings of INTERSPEECH

  9. arXiv:2303.03849  [pdf, other]

    eess.AS cs.SD

    TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

    Authors: Christoph Boeddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux

    Abstract: Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that pro…

    Submitted 1 January, 2024; v1 submitted 7 March, 2023; originally announced March 2023.

    Comments: Submitted to IEEE/ACM TASLP

  10. On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distanc…

    Submitted 21 July, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Presented at ICASSP 2023

  11. arXiv:2207.13888  [pdf, other]

    eess.AS cs.SD

    Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

    Abstract: Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level…

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022 (5 pages, 1 figure)

  12. arXiv:2205.00944  [pdf, other]

    eess.AS cs.SD

    A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network

    Authors: Tobias Gburrek, Christoph Boeddeker, Thilo von Neumann, Tobias Cord-Landwehr, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

    Abstract: We propose a system that transcribes the conversation of a typical meeting scenario that is captured by a set of initially unsynchronized microphone arrays at unknown positions. It consists of subsystems for signal synchronization, including both sampling rate and sampling time offset estimation, diarization based on speaker and microphone array position estimation, multi-channel speech enhancemen…

    Submitted 2 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022

  13. arXiv:2204.01338  [pdf, other]

    cs.SD eess.AS

    An Initialization Scheme for Meeting Separation with Spatial Mixture Models

    Authors: Christoph Boeddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach

    Abstract: Spatial mixture model (SMM) supported acoustic beamforming has been extensively used for the separation of simultaneously active speakers. However, it has hardly been considered for the separation of meeting data, that are characterized by long recordings and only partially overlapping speech. In this contribution, we show that the fact that often only a single speaker is active can be utilized fo…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  14. arXiv:2201.13148  [pdf, other]

    eess.AS cs.SD

    Threshold Independent Evaluation of Sound Event Detection Scores

    Authors: Janek Ebbers, Romain Serizel, Reinhold Haeb-Umbach

    Abstract: Performing an adequate evaluation of sound event detection (SED) systems is far from trivial and is still subject to ongoing research. The recently proposed polyphonic sound detection (PSD)-receiver operating characteristic (ROC) and PSD score (PSDS) make an important step into the direction of an evaluation of SED systems which is independent from a certain decision threshold. This allows to obta…

    Submitted 31 January, 2022; originally announced January 2022.

    Comments: accepted for ICASSP 2022

  15. arXiv:2111.07578  [pdf, other]

    eess.AS cs.SD

    Monaural source separation: From anechoic to reverberant environments

    Authors: Tobias Cord-Landwehr, Christoph Boeddeker, Thilo von Neumann, Catalin Zorila, Rama Doddipatla, Reinhold Haeb-Umbach

    Abstract: Impressive progress in neural network-based single-channel speech source separation has been made in recent years. But those improvements have been mostly reported on anechoic data, a situation that is hardly met in practice. Taking the SepFormer as a starting point, which achieves state-of-the-art performance on anechoic mixtures, we gradually modify it to optimize its performance on reverberant…

    Submitted 10 May, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: Submitted to IWAENC 2022

  16. arXiv:2110.15581  [pdf, other]

    eess.AS cs.SD

    SA-SDR: A novel loss function for separation of meeting style data

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function. The basic SDR is, however, undefined if the network reconstructs the reference signal perfectly or if the reference signal contains silence, e.g., when a two-output separator processes a single-speaker recording. Many modifications to the plain SD…

    Submitted 21 April, 2022; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: accepted at ICASSP 2022

  17. arXiv:2110.12820  [pdf, ps, other]

    eess.AS cs.SD

    On Synchronization of Wireless Acoustic Sensor Networks in the Presence of Time-varying Sampling Rate Offsets and Speaker Changes

    Authors: Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

    Abstract: A wireless acoustic sensor network records audio signals with sampling time and sampling rate offsets between the audio streams, if the analog-digital converters (ADCs) of the network devices are not synchronized. Here, we introduce a new sampling rate offset model to simulate time-varying sampling frequencies caused, for example, by temperature changes of ADC crystal oscillators, and propose an e…

    Submitted 25 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  18. Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by…

    Submitted 20 September, 2021; v1 submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at INTERSPEECH 2021

  19. arXiv:2107.14445  [pdf, other]

    eess.AS cs.SD

    Speeding Up Permutation Invariant Training for Source Separation

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, used for both utterance-level separation with utterance-level PIT (uPIT) and separation of long recordings with the recently proposed Graph-PIT. When implemented naively, both suffer from an exponential complexity in the number of utterances to separate, rendering them unusable for…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at 14th ITG Conference on Speech Communication

  20. arXiv:2106.05627  [pdf, other]

    cs.SD eess.AS

    A Comparison and Combination of Unsupervised Blind Source Separation Techniques

    Authors: Christoph Boeddeker, Frederik Rautenberg, Reinhold Haeb-Umbach

    Abstract: Unsupervised blind source separation methods do not require a training phase and thus cannot suffer from a train-test mismatch, which is a common concern in neural network based source separation. The unsupervised techniques can be categorized in two classes, those building upon the sparsity of speech in the Short-Time Fourier transform domain and those exploiting non-Gaussianity or non-stationari…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: Submitted to ITG 2021

  21. arXiv:2106.02472  [pdf, other]

    cs.SD eess.AS

    A Database for Research on Detection and Enhancement of Speech Transmitted over HF links

    Authors: Jens Heitkaemper, Joerg Schmalenstroeer, Joerg Ullmann, Valentin Ion, Reinhold Haeb-Umbach

    Abstract: In this paper we present an open database for the development of detection and enhancement algorithms of speech transmitted over HF radio channels. It consists of audio samples recorded by various receivers at different locations across Europe, all monitoring the same single-sideband modulated transmission from a base station in Paderborn, Germany. Transmitted and received speech signals are preci…

    Submitted 21 July, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted to ITG 2021

  22. arXiv:2105.01786  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery

    Authors: Thomas Glarner, Janek Ebbers, Reinhold Häb-Umbach

    Abstract: Discovering speaker independent acoustic units purely from spoken input is known to be a hard problem. In this work we propose an unsupervised speaker normalization technique prior to unit discovery. It is based on separating speaker related from content induced variations in a speech signal with an adversarial contrastive predictive coding approach. This technique does neither require transcribed…

    Submitted 4 May, 2021; originally announced May 2021.

    Comments: Submitted to Interspeech 2021

  23. arXiv:2103.06581  [pdf, ps, other]

    eess.AS cs.SD

    Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-supervised Sound Event Detection

    Authors: Janek Ebbers, Reinhold Haeb-Umbach

    Abstract: In this paper we present our system for the detection and classification of acoustic scenes and events (DCASE) 2020 Challenge Task 4: Sound event detection and separation in domestic environments. We introduce two new models: the forward-backward convolutional recurrent neural network (FBCRNN) and the tag-conditioned convolutional neural network (CNN). The FBCRNN employs two recurrent neural netwo…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: Accepted by the DCASE 2020 Workshop; the presented system received the reproducible system award for the DCASE 2020 Challenge Task 4

  24. arXiv:2103.01599  [pdf, other]

    cs.SD eess.AS

    Open Range Pitch Tracking for Carrier Frequency Difference Estimation from HF Transmitted Speech

    Authors: Joerg Schmalenstroeer, Jens Heitkaemper, Joerg Ullmann, Reinhold Haeb-Umbach

    Abstract: In this paper we investigate the task of detecting carrier frequency differences from demodulated single sideband signals by examining the pitch contours of the received baseband speech signal in the short-time spectral domain. From the detected pitch frequency trajectory and its harmonics a carrier frequency difference, which is caused by demodulating the radio signal with the wrong carrier frequ…

    Submitted 3 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

    Comments: Submitted to EUSIPCO 2021

  25. End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

    Authors: Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

    Abstract: Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel mult…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2021

  26. arXiv:2012.12809  [pdf, other]

    cs.CV cs.AI eess.SP

    Warping of Radar Data into Camera Image for Cross-Modal Supervision in Automotive Applications

    Authors: Christopher Grimm, Tai Fei, Ernst Warsitz, Ridha Farhoud, Tobias Breddermann, Reinhold Haeb-Umbach

    Abstract: We present an approach to automatically generate semantic labels for real recordings of automotive range-Doppler (RD) radar spectra. Such labels are required when training a neural network for object recognition from radar data. The automatic labeling approach rests on the simultaneous recording of camera and lidar data in addition to the radar spectrum. By warping radar spectra into the camera im…

    Submitted 20 June, 2022; v1 submitted 23 December, 2020; originally announced December 2020.

  27. arXiv:2012.06142  [pdf, other]

    eess.AS cs.SD

    Iterative Geometry Calibration from Distance Estimates for Wireless Acoustic Sensor Networks

    Authors: Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

    Abstract: In this paper we present an approach to geometry calibration in wireless acoustic sensor networks, whose nodes are assumed to be equipped with a compact microphone array. The proposed approach solely works with estimates of the distances between acoustic sources and the nodes that record these sources. It consists of an iterative weighted least squares localization procedure, which is initialized…

    Submitted 14 April, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

    Comments: Accepted for ICASSP 2021

  28. arXiv:2011.15003  [pdf, other]

    cs.SD eess.AS

    Convolutive Transfer Function Invariant SDR training criteria for Multi-Channel Reverberant Speech Separation

    Authors: Christoph Boeddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach

    Abstract: Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective functio…

    Submitted 8 June, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: Accepted by ICASSP 2021

  29. arXiv:2006.13769  [pdf, other]

    eess.AS cs.SD

    Deep Neural Network based Distance Estimation for Geometry Calibration in Acoustic Sensor Networks

    Authors: Tobias Gburrek, Joerg Schmalenstroeer, Andreas Brendel, Walter Kellermann, Reinhold Haeb-Umbach

    Abstract: We present an approach to deep neural network based (DNN-based) distance estimation in reverberant rooms for supporting geometry calibration tasks in wireless acoustic sensor networks. Signal diffuseness information from acoustic signals is aggregated via the coherent-to-diffuse power ratio to obtain a distance-related feature, which is mapped to a source-to-microphone distance estimate by means o…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: Accepted for EUSIPCO 2020

  30. arXiv:2006.13579  [pdf, other]

    eess.AS cs.SD

    Multi-path RNN for hierarchical modeling of long sequential data and its application to speaker stream separation

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Recently, the source separation performance was greatly improved by time-domain audio source separation based on dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for a long sequential data. While DPRNN is quite efficient in modeling a sequential data of the length of an utterance, i.e., about 5 to 10 second data, it is harder to apply it to longer sequences such as…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: 5 pages, 4 figures

  31. Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

    Authors: Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi…

    Submitted 21 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: 5 pages, INTERSPEECH 2020

  32. arXiv:2005.12963  [pdf, ps, other]

    eess.AS cs.SD

    Contrastive Predictive Coding Supported Factorized Variational Autoencoder for Unsupervised Learning of Disentangled Speech Representations

    Authors: Janek Ebbers, Michael Kuhlmann, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

    Abstract: In this work we address disentanglement of style and content in speech signals. We propose a fully convolutional variational autoencoder employing two encoders: a content encoder and a style encoder. To foster disentanglement, we propose adversarial contrastive predictive coding. This new disentanglement method does neither need parallel data nor any supervision. We show that the proposed techniqu…

    Submitted 11 March, 2021; v1 submitted 26 May, 2020; originally announced May 2020.

    Comments: Accepted by ICASSP 2021

  33. arXiv:2005.09913  [pdf, other]

    eess.AS cs.SD

    Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments

    Authors: Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

    Abstract: Speech activity detection (SAD), which often rests on the fact that the noise is "more" stationary than speech, is particularly challenging in non-stationary environments, because the time variance of the acoustic scene makes it difficult to discriminate speech from noise. We propose two approaches to SAD, where one is based on statistical signal processing, while the other utilizes neural network…

    Submitted 28 July, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  34. Jointly optimal denoising, dereverberation, and source separation

    Authors: Tomohiro Nakatani, Christoph Boeddeker, Keisuke Kinoshita, Rintaro Ikeshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: This paper proposes methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way. Conventionally, cascade configuration composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response beamformer has been used as…

    Submitted 2 August, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 12 Feb 2020, Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 14 July 2020

  35. End-to-end training of time domain audio separation and recognition

    Authors: Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audi…

    Submitted 13 April, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: 5 pages, 1 figure, to appear in ICASSP 2020

  36. arXiv:1911.08895  [pdf, other]

    cs.SD cs.CL eess.AS

    Demystifying TasNet: A Dissecting Approach

    Authors: Jens Heitkaemper, Darius Jakobeit, Christoph Boeddeker, Lukas Drude, Reinhold Haeb-Umbach

    Abstract: In recent years time domain speech separation has excelled over frequency domain separation in single channel scenarios and noise-free environments. In this paper we dissect the gains of the time-domain audio separation network (TasNet) approach by gradually replacing components of an utterance-level permutation invariant training (u-PIT) based separation system in the frequency domain until the T…

    Submitted 5 February, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

    Comments: Accepted to ICASSP 2020

  37. arXiv:1910.13934  [pdf, other]

    cs.SD cs.CL eess.AS

    SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition

    Authors: Lukas Drude, Jens Heitkaemper, Christoph Boeddeker, Reinhold Haeb-Umbach

    Abstract: We present a multi-channel database of overlapping speech for training, evaluation, and detailed analysis of source separation and extraction algorithms: SMS-WSJ -- Spatialized Multi-Speaker Wall Street Journal. It consists of artificially mixed speech taken from the WSJ database, but unlike earlier databases we consider all WSJ0+1 utterances and take care of strictly separating the speaker sets p…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  38. arXiv:1910.13707  [pdf, other]

    cs.SD cs.CL eess.AS

    Jointly optimal dereverberation and beamforming

    Authors: Christoph Boeddeker, Tomohiro Nakatani, Keisuke Kinoshita, Reinhold Haeb-Umbach

    Abstract: We previously proposed an optimal (in the maximum likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a WPE dereverberation filter and a conventional MPDR beamformer. However, it has not been fully investigated which components in the convolutional beamformer yield such superiority. To th…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  39. arXiv:1909.12208  [pdf, ps, other]

    cs.CL eess.AS

    An Investigation into the Effectiveness of Enhancement in ASR Training and Test for CHiME-5 Dinner Party Transcription

    Authors: Catalin Zorila, Christoph Boeddeker, Rama Doddipatla, Reinhold Haeb-Umbach

    Abstract: Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available. However, there has been a longstanding debate whether enhancement should also be carried out on the ASR training data. In an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner…

    Submitted 26 September, 2019; originally announced September 2019.

    Comments: Accepted for ASRU 2019

    MSC Class: 68T10

  40. arXiv:1905.12230  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR

    Authors: Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, Reinhold Haeb-Umbach

    Abstract: In this paper, we present Hitachi and Paderborn University's joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As a…

    Submitted 26 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: Accepted to INTERSPEECH 2019

  41. arXiv:1904.01578  [pdf, other]

    cs.SD cs.LG stat.ML

    Unsupervised training of neural mask-based beamforming

    Authors: Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach

    Abstract: We present an unsupervised training approach for a neural network-based mask estimator in an acoustic beamforming application. The network is trained to maximize a likelihood criterion derived from a spatial mixture model of the observations. It is trained from scratch without requiring any parallel data consisting of degraded input and clean training targets. Thus, training can be carried out on…

    Submitted 8 April, 2019; v1 submitted 2 April, 2019; originally announced April 2019.

    Comments: Correction to Eq. 11: Hermite symbol was on the wrong variable. Replaces y with the normalized version

  42. arXiv:1904.01340  [pdf, other]

    cs.LG eess.SP stat.ML

    Unsupervised training of a deep clustering model for multichannel blind source separation

    Authors: Lukas Drude, Daniel Hasenklever, Reinhold Haeb-Umbach

    Abstract: We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system. We argue that previous work on deep clustering requires strong supervision and elaborate on why this is a limitat…

    Submitted 2 April, 2019; originally announced April 2019.

  43. arXiv:1902.07881  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    All-neural online source separation, counting, and diarization for meeting analysis

    Authors: Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural ap…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in ICASSP 2019

  44. arXiv:1712.09718  [pdf, other]

    stat.CO cs.LG eess.SY

    Directional Statistics and Filtering Using libDirectional

    Authors: Gerhard Kurz, Igor Gilitschenski, Florian Pfaff, Lukas Drude, Uwe D. Hanebeck, Reinhold Haeb-Umbach, Roland Y. Siegwart

    Abstract: In this paper, we present libDirectional, a MATLAB library for directional statistics and directional estimation. It supports a variety of commonly used distributions on the unit circle, such as the von Mises, wrapped normal, and wrapped Cauchy distributions. Furthermore, various distributions on higher-dimensional manifolds such as the unit hypersphere and the hypertorus are available. Based on t…

    Submitted 27 December, 2017; originally announced December 2017.

    Comments: Version accepted for publication in the Journal of Statistical Software

  45. arXiv:1701.00392  [pdf, other]

    math.NA cs.CE

    On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming

    Authors: Christoph Boeddeker, Patrick Hanebrink, Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach

    Abstract: This report describes the computation of gradients by algorithmic differentiation for statistically optimum beamforming operations. Especially the derivation of complex-valued functions is a key component of this approach. Therefore the real-valued algorithmic differentiation is extended via the complex-valued chain rule. In addition to the basic mathematic operations the derivative of the eigenva…

    Submitted 6 February, 2019; v1 submitted 2 January, 2017; originally announced January 2017.

    Comments: Technical Report

  46. arXiv:1504.03128  [pdf, ps, other]

    cs.SD

    Absolute Geometry Calibration of Distributed Microphone Arrays in an Audio-Visual Sensor Network

    Authors: Florian Jacob, Reinhold Haeb-Umbach

    Abstract: Joint audio-visual speaker tracking requires that the locations of microphones and cameras are known and that they are given in a common coordinate system. Sensor self-localization algorithms, however, are usually separately developed for either the acoustic or the visual modality and return their positions in a modality specific coordinate system, often with an unknown rotation, scaling and trans…

    Submitted 13 April, 2015; originally announced April 2015.