Showing 1–31 of 31 results for author: Habets, E A P

  1. arXiv:2406.06403

    cs.CL cs.LG cs.SD eess.AS

    Meta Learning Text-to-Speech Synthesis in over 7000 Languages

    Authors: Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

    Abstract: In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech syn…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: accepted at Interspeech 2024

  2. arXiv:2405.17364

    eess.AS

    Speech Loudness in Broadcasting and Streaming

    Authors: Matteo Torcoli, Mhd Modar Halimeh, Thomas Leitz, Yannik Grewe, Michael Kratschmer, Bernhard Neugebauer, Adrian Murtaza, Harald Fuchs, Emanuël A. P. Habets

    Abstract: The introduction and regulation of loudness in broadcasting and streaming brought clear benefits to the audience, e.g., a level of uniformity across programs and channels. Yet, speech loudness is frequently reported as being too low in certain passages, which can hinder the full understanding and enjoyment of movies and TV programs. This paper proposes expanding the set of loudness-based measures…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted for presentation at the Audio Engineering Society (AES) 156th Convention, June 2024, Madrid, Spain

  3. arXiv:2402.06246

    eess.AS cs.SD

    Data-driven Joint Detection and Localization of Acoustic Reflectors

    Authors: H. Nazim Bicer, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: Room geometry inference algorithms rely on the localization of acoustic reflectors to identify boundary surfaces of an enclosure. Rooms with highly absorptive walls or walls at large distances from the measurement setup pose challenges for such algorithms. As it is not always possible to localize all walls, we present a data-driven method to jointly detect and localize acoustic reflectors that cor…

    Submitted 9 February, 2024; originally announced February 2024.

    Comments: 4+1(bib) Pages. Accepted to ICASSP Satellite Workshop - HSCMA 2024

  4. arXiv:2401.00197

    eess.AS

    ODAQ: Open Dataset of Audio Quality

    Authors: Matteo Torcoli, Chih-Wei Wu, Sascha Dick, Phillip A. Williams, Mhd Modar Halimeh, William Wolcott, Emanuel A. P. Habets

    Abstract: Research into the prediction and analysis of perceived audio quality is hampered by the scarcity of openly available datasets of audio signals accompanied by corresponding subjective quality scores. To address this problem, we present the Open Dataset of Audio Quality (ODAQ), a new dataset containing the results of a MUSHRA listening test conducted with expert listeners from 2 international labora…

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: Accepted paper. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Seoul, Korea, April 2024

  5. arXiv:2309.03486

    eess.AS cs.SD

    Simulating room transfer functions between transducers mounted on audio devices using a modified image source method

    Authors: Zeyu Xu, Adrian Herzog, Alexander Lodermeyer, Emanuël A. P. Habets, Albert G. Prinn

    Abstract: The image source method (ISM) is often used to simulate room acoustics due to its ease of use and computational efficiency. The standard ISM is limited to simulations of room impulse responses between point sources and omnidirectional receivers. In this work, the ISM is extended using spherical harmonic directivity coefficients to include acoustic diffraction effects due to source and receiver tra…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: The following article has been submitted to the Journal of the Acoustical Society of America (JASA). After it is published, it will be found at http://asa.scitation.org/journal/jas

  6. arXiv:2308.14611

    eess.AS cs.SD

    Data-driven 3D Room Geometry Inference with a Linear Loudspeaker Array and a Single Microphone

    Authors: Cagdas Tuna, Altan Akat, H. Nazim Bicer, Andreas Walther, Emanuël A. P. Habets

    Abstract: Knowing the room geometry may be very beneficial for many audio applications, including sound reproduction, acoustic scene analysis, and sound source localization. Room geometry inference (RGI) deals with the problem of reflector localization (RL) based on a set of room impulse responses (RIRs). Motivated by the increasing popularity of commercially available soundbars, this article presents a dat…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted for publication in Forum Acusticum 2023

  7. arXiv:2303.13453

    eess.AS cs.SD

    Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV

    Authors: Matteo Torcoli, Emanuël A. P. Habets

    Abstract: In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This i…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece

  8. arXiv:2303.08702

    eess.AS cs.SD

    Beamformer-Guided Target Speaker Extraction

    Authors: Mohamed Elminshawi, Srikanth Raj Chetupalli, Emanuël A. P. Habets

    Abstract: We propose a Beamformer-guided Target Speaker Extraction (BG-TSE) method to extract a target speaker's voice from a multi-channel recording informed by the direction of arrival of the target. The proposed method employs a front-end beamformer steered towards the target speaker to provide an auxiliary signal to a single-channel TSE system. By allowing for time-varying embeddings in the single-chann…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Submitted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)

  9. arXiv:2303.07143

    eess.AS cs.LG cs.SD

    Multi-Microphone Speaker Separation by Spatial Regions

    Authors: Julian Wechsler, Srikanth Raj Chetupalli, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: We consider the task of region-based source separation of reverberant multi-microphone recordings. We assume pre-defined spatial regions with a single active source per region. The objective is to estimate the signals from the individual spatial regions as captured by a reference microphone while retaining a correspondence between signals and spatial regions. We propose a data-driven approach usin…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Submitted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing

  10. arXiv:2302.11205

    eess.AS cs.SD

    Contrastive Representation Learning for Acoustic Parameter Estimation

    Authors: Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: A study is presented in which a contrastive learning approach is used to extract low-dimensional representations of the acoustic environment from single-channel, reverberant speech signals. Convolution of room impulse responses (RIRs) with anechoic source signals is leveraged as a data augmentation technique that offers considerable flexibility in the design of the upstream task. We evaluate the e…

    Submitted 13 March, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: Accepted for ICASSP 2023, Camera-ready version

  11. arXiv:2212.13442

    eess.IV cs.MM cs.SD eess.AS

    Audiovisual Database with 360 Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior, and QoE Evaluation Research

    Authors: Thomas Robotham, Ashutosh Singla, Olli S. Rummukainen, Alexander Raake, Emanuël A. P. Habets

    Abstract: Research into multi-modal perception, human cognition, behavior, and attention can benefit from high-fidelity content that may recreate real-life-like scenes when rendered on head-mounted displays. Moreover, aspects of audiovisual perception, cognitive processes, and behavior may complement questionnaire-based Quality of Experience (QoE) evaluation of interactive virtual environments. Currently, t…

    Submitted 27 December, 2022; originally announced December 2022.

    Comments: 6 pages, 2 figures, accepted and presented at the 2022 14th International Conference on Quality of Multimedia Experience (QoMEX). Database is publicly accessible at https://qoevave.github.io/database/

  12. arXiv:2208.03023

    eess.AS cs.SD

    AID: Open-source Anechoic Interferer Dataset

    Authors: Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: A dataset of anechoic recordings of various sound sources encountered in domestic environments is presented. The dataset is intended to be a resource of non-stationary, environmental noise signals that, when convolved with acoustic impulse responses, can be used to simulate complex acoustic scenes. Additionally, a Python library is provided to generate random mixtures of the recordings in the data…

    Submitted 5 August, 2022; originally announced August 2022.

    Comments: Accepted for publication at IWAENC 2022

  13. Dialogue Enhancement and Listening Effort in Broadcast Audio: A Multimodal Evaluation

    Authors: Matteo Torcoli, Thomas Robotham, Emanuël A. P. Habets

    Abstract: Dialogue enhancement (DE) plays a vital role in broadcasting, enabling the personalization of the relative level between foreground speech and background music and effects. DE has been shown to improve the quality of experience, intelligibility, and self-reported listening effort (LE). A physiological indicator of LE known from audiology studies is pupil size. The relation between pupil size and L…

    Submitted 3 August, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: Paper accepted to 14th International Conference on Quality of Multimedia Experience (QoMEX), Lippstadt, Germany, 2022 - version 2 fixes some typos

  14. arXiv:2206.13808

    eess.AS cs.SD

    Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

    Authors: Ahmad Aloradi, Wolfgang Mack, Mohamed Elminshawi, Emanuël A. P. Habets

    Abstract: Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional embedding from a speech utterance that encodes the speaker's voice characteristics. A speaker is verified if his/her voice embedding is sufficiently similar to the emb…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: To be presented at EUSIPCO 2022

  15. arXiv:2206.06184

    eess.AS eess.SP

    AmbiSep: Ambisonic-to-Ambisonic Reverberant Speech Separation Using Transformer Networks

    Authors: Adrian Herzog, Srikanth Raj Chetupalli, Emanuël A. P. Habets

    Abstract: Consider a multichannel Ambisonic recording containing a mixture of several reverberant speech signals. Retrieving the reverberant Ambisonic signals corresponding to the individual speech sources blindly from the mixture is a challenging task as it requires to estimate multiple signal channels for each source. In this work, we propose AmbiSep, a deep neural network-based plane-wave domain masking…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: Preprint submitted to IWAENC 2022 (https://iwaenc2022.org)

  16. Blind Reverberation Time Estimation in Dynamic Acoustic Conditions

    Authors: Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: The estimation of reverberation time from real-world signals plays a central role in a wide range of applications. In many scenarios, acoustic conditions change over time which in turn requires the estimate to be updated continuously. Previously proposed methods involving deep neural networks were mostly designed and tested under the assumption of static acoustic conditions. In this work, we show…

    Submitted 23 February, 2022; originally announced February 2022.

    Comments: accepted for publication in ICASSP 2022

  17. arXiv:2202.00733

    eess.AS cs.SD

    New Insights on Target Speaker Extraction

    Authors: Mohamed Elminshawi, Wolfgang Mack, Srikanth Raj Chetupalli, Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: Speaker extraction (SE) aims to segregate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. Several forms of auxiliary information have been employed in single-channel SE, such as a speech snippet enrolled from the target speaker or visual information corresponding to the spoken utterance. The effectiveness of the auxiliary information in…

    Submitted 15 September, 2023; v1 submitted 1 February, 2022; originally announced February 2022.

  18. Signal-Aware Direction-of-Arrival Estimation Using Attention Mechanisms

    Authors: Wolfgang Mack, Julian Wechsler, Emanuël A. P. Habets

    Abstract: The direction-of-arrival (DOA) of sound sources is an essential acoustic parameter used, e.g., for multi-channel speech enhancement or source tracking. Complex acoustic scenarios consisting of sources-of-interest, interfering sources, reverberation, and noise make the estimation of the DOAs corresponding to the sources-of-interest a challenging task. Recently proposed attention mechanisms allow DO…

    Submitted 3 January, 2022; originally announced January 2022.

  19. arXiv:2011.04569

    eess.AS cs.AI cs.SD

    Informed Source Extraction With Application to Acoustic Echo Reduction

    Authors: Mohamed Elminshawi, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference snippet uttered by the target speaker into a single embedding vector that encapsulates the characteristics of the target speaker. However, such modeling delibera…

    Submitted 26 October, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

    Comments: Published at ITG 2021

    Report number: 978-3-8007-5627-8

  20. Efficient Training Data Generation for Phase-Based DOA Estimation

    Authors: Fabian Hübner, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: Deep learning (DL) based direction of arrival (DOA) estimation is an active research topic and currently represents the state-of-the-art. Usually, DL-based DOA estimators are trained with recorded data or computationally expensive generated data. Both data types require significant storage and excessive time to, respectively, record or generate. We propose a low complexity online data generation m…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021

  21. arXiv:2011.04359

    eess.AS cs.CV cs.LG cs.SD eess.IV

    An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments

    Authors: Shrishti Saha Shetu, Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: Audio-visual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately using different sub-networks, and then the learned feature…

    Submitted 9 November, 2020; originally announced November 2020.

  22. arXiv:2005.00145

    eess.AS cs.LG cs.SD

    Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching

    Authors: Alessandro Ilic Mezza, Emanuël A. P. Habets, Meinard Müller, Augusto Sarti

    Abstract: The performance of machine learning algorithms is known to be negatively affected by possible mismatches between training (source) and test (target) data distributions. In fact, this problem emerges whenever an acoustic scene classification system which has been trained on data recorded by a given device is applied to samples acquired under different acoustic conditions or captured by mismatched r…

    Submitted 30 April, 2020; originally announced May 2020.

    Comments: 5 pages, 1 figure, 3 tables, submitted to EUSIPCO 2020

  23. Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters

    Authors: Wolfgang Mack, Emanuël A. P. Habets

    Abstract: Signal extraction from a single-channel mixture with additional undesired signals is most commonly performed using time-frequency (TF) masks. Typically, the mask is estimated with a deep neural network (DNN), and element-wise applied to the complex mixture short-time Fourier transform (STFT) representation to perform the extraction. Ideal mask magnitudes are zero for solely undesired signals in a…

    Submitted 9 December, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

  24. Modal Decomposition of Feedback Delay Networks

    Authors: Sebastian J. Schlecht, Emanuël A. P. Habets

    Abstract: Feedback delay networks (FDNs) belong to a general class of recursive filters which are widely used in sound synthesis and physical modeling applications. We present a numerical technique to compute the modal decomposition of the FDN transfer function. The proposed pole finding algorithm is based on the Ehrlich-Aberth iteration for matrix polynomials and has improved computational performance of u…

    Submitted 25 January, 2019; originally announced January 2019.

  25. Multi-scale aggregation of phase information for reducing computational cost of CNN based DOA estimation

    Authors: Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: In a recent work on direction-of-arrival (DOA) estimation of multiple speakers with convolutional neural networks (CNNs), the phase component of short-time Fourier transform (STFT) coefficients of the microphone signal is given as input and small filters are used to learn the phase relations between neighboring microphones. Due to this chosen filter size, $M-1$ convolution layers are required to a…

    Submitted 20 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: text overlap with arXiv:1807.11722

  26. arXiv:1810.09708

    eess.AS cs.SD eess.SP

    On the difference-to-sum power ratio of speech and wind noise based on the Corcos model

    Authors: Daniele Mirabilii, Emanuël A. P. Habets

    Abstract: The difference-to-sum power ratio was proposed and used to suppress wind noise under specific acoustic conditions. In this contribution, a general formulation of the difference-to-sum power ratio associated with a mixture of speech and wind noise is proposed and analyzed. In particular, it is assumed that the complex coherence of convective turbulence can be modelled by the Corcos model. In contra…

    Submitted 23 October, 2018; originally announced October 2018.

    Comments: 5 pages, 3 figures, IEEE-ICSEE Eilat-Israel conference (special session)

  27. arXiv:1807.11722

    eess.AS cs.LG cs.SD

    Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained with Noise Signals

    Authors: Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: Supervised learning based methods for source localization, being data driven, can be adapted to different acoustic conditions via training and have been shown to be robust to adverse acoustic environments. In this paper, a convolutional neural network (CNN) based supervised learning method for estimating the direction-of-arrival (DOA) of multiple speakers is proposed. Multi-speaker DOA estimation…

    Submitted 31 July, 2018; originally announced July 2018.

  28. Simulating Multi-channel Wind Noise Based on the Corcos Model

    Authors: Daniele Mirabilii, Emanuël A. P. Habets

    Abstract: A novel multi-channel artificial wind noise generator based on a fluid dynamics model, namely the Corcos model, is proposed. In particular, the model is used to approximate the complex coherence function of wind noise signals measured with closely-spaced microphones in the free-field and for time-invariant wind stream direction and speed. Preliminary experiments focus on a spatial analysis of reco…

    Submitted 23 July, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

    Comments: 5 pages, 2 figures, IWAENC 2018

  29. Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

    Authors: Fabian-Robert Stöter, Soumitro Chakrabarty, Bernd Edler, Emanuël A. P. Habets

    Abstract: The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. Building upon powerful machine learning methodology, we develop a Deep Neural Network (DNN) that estimates a speaker count. While DNNs efficiently m…

    Submitted 15 February, 2018; v1 submitted 12 December, 2017; originally announced December 2017.

    Comments: Accepted in ICASSP 2018

  30. arXiv:1712.04276

    cs.SD eess.AS stat.ML

    Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise

    Authors: Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: The problem of multi-speaker localization is formulated as a multi-class multi-label classification problem, which is solved using a convolutional neural network (CNN) based source localization method. Utilizing the common assumption of disjoint speaker activities, we propose a novel method to train the CNN using synthesized noise signals. The proposed localization method is evaluated for two spea…

    Submitted 12 December, 2017; originally announced December 2017.

    Comments: Presented at Machine Learning for Audio Processing (ML4Audio) Workshop at NIPS 2017

  31. On Lossless Feedback Delay Networks

    Authors: Sebastian J. Schlecht, Emanuel A. P. Habets

    Abstract: Lossless Feedback Delay Networks (FDNs) are commonly used as a design prototype for artificial reverberation algorithms. The lossless property is dependent on the feedback matrix, which connects the output of a set of delays to their inputs, and the lengths of the delays. Both, unitary and triangular feedback matrices are known to constitute lossless FDNs, however, the most general class of lossle…

    Submitted 24 June, 2016; originally announced June 2016.