-
Microphone Subset Selection for the Weighted Prediction Error Algorithm using a Group Sparsity Penalty
Authors:
Anselm Lohmann,
Toon van Waterschoot,
Joerg Bitzer,
Simon Doclo
Abstract:
Reverberation can severely degrade the quality of speech signals recorded using microphones in an enclosure. In acoustic sensor networks with spatially distributed microphones, a similar dereverberation performance may be achieved using only a subset of all available microphones. In this paper, we propose to perform microphone subset selection for the weighted prediction error (WPE) multi-channel dereverberation algorithm using the popular convex relaxation method, by introducing a group sparsity penalty on the prediction filter coefficients. We show that the resulting problem can be solved efficiently using the accelerated proximal gradient algorithm. Experimental evaluation using measured impulse responses shows that the performance of the proposed method is close to the optimal performance obtained by exhaustive search, for both frequency-dependent and frequency-independent microphone subset selection. Furthermore, with frequency-independent microphone subset selection, the performance using only a few microphones is only marginally worse than using all available microphones.
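As a rough illustration of why a group sparsity penalty selects microphones, the sketch below shows the proximal step that a proximal gradient method would apply for a penalty of the form threshold times the sum of per-microphone l2 norms (block soft-thresholding). The abstract does not state the exact penalty, its weighting, or the grouping, so the one-group-per-microphone layout and the names `group_soft_threshold` and `selected_microphones` are assumptions made for this example, not the authors' implementation.

```python
import math

def group_soft_threshold(groups, threshold):
    """Block soft-thresholding: proximal operator of threshold * sum of group l2 norms.

    groups: one list of (real or complex) filter coefficients per microphone.
    This grouping is an assumption for illustration.
    """
    result = []
    for coeffs in groups:
        norm = math.sqrt(sum(abs(c) ** 2 for c in coeffs))
        # A group whose norm is at or below the threshold is zeroed as a whole,
        # which corresponds to deselecting that microphone.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        result.append([scale * c for c in coeffs])
    return result

def selected_microphones(groups):
    """Indices of microphones that still have at least one nonzero coefficient."""
    return [m for m, coeffs in enumerate(groups) if any(c != 0 for c in coeffs)]

# Hypothetical prediction filter coefficients for three microphones.
filters = [
    [3.0 + 4.0j, 0.0j],   # large group norm: shrunk but kept
    [0.1 + 0.0j, 0.2j],   # small group norm: removed entirely
    [0.0j, 2.0 + 0.0j],   # medium group norm: shrunk but kept
]
shrunk = group_soft_threshold(filters, threshold=1.0)
active = selected_microphones(shrunk)
```

In a full solver this step would alternate with a gradient step on the data term; increasing `threshold` removes more microphones.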
Submitted 16 January, 2024;
originally announced January 2024.
-
Time-Variant Overlap-Add in Partitions
Authors:
Hagen Jaeger,
Uwe Simmer,
Jörg Bitzer,
Matthias Blau
Abstract:
Virtual and augmented realities are increasingly popular tools in many domains such as architecture, production, training and education, (psycho)therapy, gaming, and others. For a convincing rendering of sound in virtual and augmented environments, audio signals must be convolved in real-time with impulse responses that change from one moment in time to another. Key requirements for the implementation of such time-variant real-time convolution algorithms are short latencies, moderate computational cost and memory footprint, and no perceptible switching artifacts. In this engineering report, we introduce a partitioned convolution algorithm that is able to quickly switch between impulse responses without introducing perceptible artifacts, while maintaining a constant computational load and low memory usage. Implementations in several popular programming languages are freely available via GitHub.
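For orientation, here is a minimal sketch of the static building block behind such algorithms: uniformly partitioned convolution with overlap-add. The abstract does not describe how the report switches between impulse responses, so that part is not shown. The partitions are convolved directly in the time domain to keep the example short (real-time implementations typically use FFTs), and the function names are hypothetical.

```python
def convolve(x, h):
    """Direct linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def partitioned_overlap_add(x, h, block):
    """Convolve x with h block by block, with h split into partitions of length `block`."""
    partitions = [h[k:k + block] for k in range(0, len(h), block)]
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        x_block = x[start:start + block]
        for p, h_part in enumerate(partitions):
            # Partition p acts as a short filter delayed by p * block samples.
            segment = convolve(x_block, h_part)
            offset = start + p * block
            for n, value in enumerate(segment):
                y[offset + n] += value  # overlap-add of the segment tails
    return y

# Compare the block-wise result against direct convolution.
signal = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0, 2.0]
impulse_response = [0.5, 0.25, -0.125, 0.0625, 0.03125]
blockwise = partitioned_overlap_add(signal, impulse_response, block=2)
reference = convolve(signal, impulse_response)
max_error = max(abs(a - b) for a, b in zip(blockwise, reference))
```

Because each input block only needs the partitions one block at a time, the latency is set by `block` rather than by the length of the impulse response.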
Submitted 30 September, 2023;
originally announced October 2023.
-
Long-term Conversation Analysis: Exploring Utility and Privacy
Authors:
Francesco Nespoli,
Jule Pohlhausen,
Patrick A. Naylor,
Joerg Bitzer
Abstract:
The analysis of conversations recorded in everyday life requires privacy protection. In this contribution, we explore a privacy-preserving feature extraction method based on input feature dimension reduction, spectral smoothing and the low-cost speaker anonymization technique based on the McAdams coefficient. We assess the utility of the feature extraction methods with a voice activity detection and a speaker diarization system, while privacy protection is determined with a speech recognition and a speaker verification model. We show that the combination of the McAdams coefficient and spectral smoothing maintains the utility while improving privacy.
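To make the McAdams-coefficient step more concrete: in the commonly described form of this technique, the angle of each complex pole of a frame's linear-prediction filter is raised to the power of the coefficient, while the pole radius and any real poles are kept. The sketch below applies only that pole transformation. The LPC analysis and resynthesis around it are omitted, and the function name `mcadams_shift` and the example values are assumptions for illustration, not the configuration used in the paper.

```python
import cmath

def mcadams_shift(poles, alpha):
    """Raise the angle of each complex pole to the power alpha, keeping its radius."""
    shifted = []
    for p in poles:
        radius, angle = cmath.polar(p)
        if abs(p.imag) < 1e-12:
            shifted.append(p)  # real poles are left untouched
            continue
        # Transform the angle magnitude and restore the sign so that
        # complex-conjugate pole pairs stay conjugate.
        new_angle = (abs(angle) ** alpha) * (1 if angle > 0 else -1)
        shifted.append(cmath.rect(radius, new_angle))
    return shifted

# Hypothetical poles: one conjugate pair and one real pole.
poles = [cmath.rect(0.9, 0.5), cmath.rect(0.9, -0.5), 0.5 + 0.0j]
anonymized = mcadams_shift(poles, alpha=0.8)
```

With a coefficient below 1, pole angles below 1 rad move up and angles above 1 rad move down, which shifts the formant positions.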
Submitted 28 June, 2023;
originally announced June 2023.
-
Two-Stage Voice Anonymization for Enhanced Privacy
Authors:
Francesco Nespoli,
Daniel Barreda,
Joerg Bitzer,
Patrick A. Naylor
Abstract:
In recent years, the need for privacy preservation when manipulating or storing personal data, including speech, has become a major issue. In this paper, we present a system addressing the speaker-level anonymization problem. We propose and evaluate a two-stage anonymization pipeline exploiting a state-of-the-art anonymization model described in the Voice Privacy Challenge 2022 in combination with a zero-shot voice conversion architecture able to capture speaker characteristics from a few seconds of speech. We show this architecture can lead to strong privacy preservation while preserving pitch information. Finally, we propose a new compressed metric to evaluate anonymization systems in privacy scenarios with different constraints on privacy and utility.
Submitted 28 June, 2023;
originally announced June 2023.
-
Dereverberation in Acoustic Sensor Networks Using Weighted Prediction Error With Microphone-dependent Prediction Delays
Authors:
Anselm Lohmann,
Toon van Waterschoot,
Joerg Bitzer,
Simon Doclo
Abstract:
In recent decades, several multi-microphone speech dereverberation algorithms have been proposed, including the weighted prediction error (WPE) algorithm. In the WPE algorithm, a prediction delay is required to reduce the correlation between the prediction signals and the direct component in the reference microphone signal. In compact arrays with closely spaced microphones, the prediction delay is often chosen to be microphone-independent. In acoustic sensor networks with spatially distributed microphones, large time-differences-of-arrival (TDOAs) of the speech source between the reference microphone and other microphones typically occur. Hence, when using a microphone-independent prediction delay, the reference and prediction signals may still be significantly correlated, leading to distortion in the dereverberated output signal. In order to decorrelate the signals, in this paper we propose to apply TDOA compensation with respect to the reference microphone, resulting in microphone-dependent prediction delays for the WPE algorithm. We consider optimal TDOA compensation using crossband filtering in the short-time Fourier transform domain as well as band-to-band and integer delay approximations. Simulation results for different reverberation times using oracle as well as estimated TDOAs clearly show the benefit of using microphone-dependent prediction delays.
Submitted 18 January, 2023;
originally announced January 2023.
-
Geometry-aware DoA Estimation using a Deep Neural Network with mixed-data input features
Authors:
Ulrik Kowalk,
Simon Doclo,
Joerg Bitzer
Abstract:
Unlike model-based direction of arrival (DoA) estimation algorithms, supervised learning-based DoA estimation algorithms based on deep neural networks (DNNs) are usually trained for one specific microphone array geometry, resulting in poor performance when applied to a different array geometry. In this paper we illustrate the fundamental difference between supervised learning-based and model-based algorithms leading to this sensitivity. Aiming at designing a supervised learning-based DoA estimation algorithm that generalizes well to different array geometries, we propose a geometry-aware DoA estimation algorithm. The algorithm uses a fully connected DNN and takes mixed data as input features, namely the time lags maximizing the generalized cross-correlation with phase transform and the microphone coordinates, which are assumed to be known. Experimental results for a reverberant scenario demonstrate the flexibility of the proposed algorithm towards different array geometries and show that the proposed algorithm outperforms model-based algorithms such as steered response power with phase transform.
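The time-lag feature mentioned above can be sketched as follows: compute the cross-spectrum of a microphone pair, discard its magnitude (the phase transform), transform back, and take the lag with the largest value. This is a minimal single-frame sketch using a naive DFT so that it runs without external libraries. The function names, the zero-padding choice, and the test signal are assumptions for illustration and not the feature extraction used in the paper.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def gcc_phat_lag(x1, x2, max_lag):
    """Lag in samples that maximizes GCC-PHAT; positive means x2 is delayed relative to x1."""
    n = len(x1) + len(x2)  # zero-padding to avoid circular wrap-around
    X1 = dft(list(x1) + [0.0] * (n - len(x1)))
    X2 = dft(list(x2) + [0.0] * (n - len(x2)))
    cross = [b * a.conjugate() for a, b in zip(X1, X2)]
    # Phase transform: keep only the phase of the cross-spectrum.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0.0 for c in cross]
    cc = [v.real for v in dft(whitened, inverse=True)]
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cc[lag % n])

# Hypothetical two-microphone frame: the second signal is the first delayed by 3 samples.
source = [1.0, -0.5, 0.25, 0.8, -0.3]
mic_a = source + [0.0] * 5
mic_b = [0.0] * 3 + source + [0.0] * 2
lag_ab = gcc_phat_lag(mic_a, mic_b, max_lag=5)
lag_ba = gcc_phat_lag(mic_b, mic_a, max_lag=5)
```

One such lag per microphone pair, concatenated with the microphone coordinates, would give a mixed-data input vector of the kind the abstract describes.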
Submitted 9 December, 2022;
originally announced December 2022.
-
Signal-informed DNN-based DOA Estimation combining an External Microphone and GCC-PHAT Features
Authors:
Ulrik Kowalk,
Simon Doclo,
Joerg Bitzer
Abstract:
Aiming at estimating the direction of arrival (DOA) of a desired speaker in a multi-talker environment using a microphone array, in this paper we propose a signal-informed method exploiting the availability of an external microphone attached to the desired speaker. The proposed method applies a binary mask to the GCC-PHAT input features of a convolutional neural network, where the binary mask is computed based on the power distribution of the external microphone signal. Experimental results for a reverberant scenario with up to four interfering speakers demonstrate that the signal-informed masking improves the localization accuracy, without requiring any knowledge about the interfering speakers.
Submitted 11 June, 2022;
originally announced June 2022.