
Universal Score-based Speech Enhancement with High Content Preservation

Authors: Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

Abstract: We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote learning high quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large scale dataset of speech degraded by noise, reverberation, and various distortions. The results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines for a wide range of qualitative and intelligibility metrics.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 5 pages, 5 figures, accepted at Interspeech 2024

arXiv:2406.04660

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

arXiv:2310.17864

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2303.06806

Neural Diarization with Non-autoregressive Intermediate Attractors

Authors: Yusuke Fujita, Tatsuya Komatsu, Robin Scheibler, Yusuke Kida, Tetsuji Ogawa

Abstract: End-to-end neural diarization (EEND) with encoder-decoder-based attractors (EDA) is a promising method to handle the whole speaker diarization problem simultaneously with a single neural network. While the EEND model can produce all frame-level speaker labels simultaneously, it disregards output label dependency. In this work, we propose a novel EEND model that introduces the label dependency between frames. The proposed method generates non-autoregressive intermediate attractors to produce speaker labels at the lower layers and conditions the subsequent layers with these labels. While the proposed model works in a non-autoregressive manner, the speaker labels are refined by referring to the whole sequence of intermediate labels. The experiments with the two-speaker CALLHOME dataset show that the intermediate labels with the proposed non-autoregressive intermediate attractors boost the diarization performance. The proposed method with the deeper network benefits more from the intermediate labels, resulting in better performance and training throughput than EEND-EDA.

Submitted 12 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2210.17327

Diffusion-based Generative Speech Source Separation

Authors: Robin Scheibler, Youna Ji, Soo-Whan Chung, Jaeuk Byun, Soyeon Choe, Min-Seok Choi

Abstract: We propose DiffSep, a new single channel source separation method based on score-matching of a stochastic differential equation (SDE). We craft a tailored continuous time diffusion-mixing process starting from the separated sources and converging to a Gaussian distribution centered on their mixture. This formulation lets us apply the machinery of score-based generative modelling. First, we train a neural network to approximate the score function of the marginal probabilities of the diffusion-mixing process. Then, we use it to solve the reverse time SDE that progressively separates the sources starting from their mixture. We propose a modified training strategy to handle model mismatch and source permutation ambiguity. Experiments on the WSJ0 2mix dataset demonstrate the potential of the method. Furthermore, the method is also suitable for speech enhancement and shows performance competitive with prior work on the VoiceBank-DEMAND dataset.

Submitted 2 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: 5 pages, 3 figures, 2 tables. Submitted to ICASSP 2023

arXiv:2207.09514

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front-ends with other tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU). To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research. In addition to these new tasks, we also use CHiME-4 and WSJ0-2Mix to benchmark multi- and single-channel SE approaches. Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR, especially in the multi-channel scenario. The code is available online at https://github.com/ESPnet/ESPnet. The multi-channel ST and SLU datasets, which are another contribution of this work, are released on HuggingFace.

Submitted 19 July, 2022; originally announced July 2022.

Comments: To appear in Interspeech 2022

arXiv:2204.00218

End-to-End Multi-speaker ASR with Independent Vector Analysis

Authors: Robin Scheibler, Wangyou Zhang, Xuankai Chang, Shinji Watanabe, Yanmin Qian

Abstract: We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition. We propose a frontend for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm. It uses the fast and stable iterative source steering algorithm together with a neural source model. The parameters from the ASR module and the neural source model are optimized jointly from the ASR loss itself. We demonstrate competitive performance with previous systems using neural beamforming frontends. First, we explore the trade-offs when using various numbers of channels for training and testing. Second, we demonstrate that the proposed IVA frontend performs well on noisy data, even when trained on clean mixtures only. Furthermore, it extends without retraining to the separation of more speakers, which is demonstrated on mixtures of three and four speakers.

Submitted 1 April, 2022; originally announced April 2022.

Comments: Submitted to INTERSPEECH2022. 5 pages, 2 figures, 3 tables

arXiv:2204.00210

Spatial Loss for Unsupervised Multi-channel Source Separation

Authors: Kohei Saijo, Robin Scheibler

Abstract: We propose a spatial loss for unsupervised multi-channel source separation. The proposed loss exploits the duality of direction of arrival (DOA) and beamforming: the steering and beamforming vectors should be aligned for the target source, but orthogonal for interfering ones. The spatial loss encourages consistency between the mixing and demixing systems from a classic DOA estimator and a neural separator, respectively. With the proposed loss, we train the neural separators based on minimum variance distortionless response (MVDR) beamforming and independent vector analysis (IVA). We also investigate the effectiveness of combining our spatial loss and a signal loss, which uses the outputs of blind source separation as the reference. We evaluate our proposed method on synthetic and recorded (LibriCSS) mixtures. We find that the spatial loss is most effective to train IVA-based separators. For the neural MVDR beamformer, it performs best when combined with a signal loss. On synthetic mixtures, the proposed unsupervised loss leads to the same performance as a supervised loss in terms of word error rate. On LibriCSS, we obtain close to state-of-the-art performance without any labeled training data.

Submitted 1 April, 2022; originally announced April 2022.

Comments: Submitted to INTERSPEECH2022

arXiv:2202.08456

MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition

Authors: Jin Sakuma, Tatsuya Komatsu, Robin Scheibler

Abstract: We propose multi-layer perceptron (MLP)-based architectures suitable for variable length input. MLP-based architectures, recently proposed for image classification, can only be used for inputs of a fixed, pre-defined size. However, many types of data are naturally variable in length, for example, acoustic signals. We propose three approaches to extend MLP-based architectures for use with sequences of arbitrary length. The first one uses a circular convolution applied in the Fourier domain, the second applies a depthwise convolution, and the final relies on a shift operation. We evaluate the proposed architectures on an automatic speech recognition task with the Librispeech and Tedlium2 corpora. The best proposed MLP-based architecture improves WER by 1.0 / 0.9% and 0.9 / 0.5% on the Librispeech dev-clean/dev-other and test-clean/test-other sets, and by 0.8 / 1.1% on the Tedlium2 dev/test sets, using 86.4% of the size of the self-attention-based architecture.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 8 pages, 4 figures
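
The first approach named in the abstract above, a circular convolution applied in the Fourier domain, can be illustrated with a minimal NumPy sketch. This is not the paper's layer: the function names, the (time, channel) layout, and the filter length are assumptions chosen for illustration. It only shows why the same filter can be applied to sequences of any length.

```python
import numpy as np

def circular_conv_fft(x, h):
    """Circularly convolve x (T, D) with a per-channel filter h (K, D) along time.

    Hypothetical helper: the filter is zero-padded to the sequence length T,
    so the same h can be reused for any T >= K.
    """
    T = x.shape[0]
    X = np.fft.rfft(x, n=T, axis=0)
    H = np.fft.rfft(h, n=T, axis=0)  # zero-pads h from K to T samples
    # Pointwise product in frequency == circular convolution in time
    return np.fft.irfft(X * H, n=T, axis=0)

def circular_conv_direct(x, h):
    """Reference implementation: y[t] = sum_k h[k] * x[(t - k) mod T]."""
    y = np.zeros_like(x, dtype=float)
    for k in range(h.shape[0]):
        y += h[k] * np.roll(x, k, axis=0)
    return y

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3))       # one filter, reused below
for T in (7, 10):                     # two different sequence lengths
    x = rng.standard_normal((T, 3))
    assert np.allclose(circular_conv_fft(x, h), circular_conv_direct(x, h))
```

The FFT route costs O(T log T) per channel, and the parameter count depends only on the filter length, not on the input length.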

arXiv:2110.06545

Independence-based Joint Dereverberation and Separation with Neural Source Model

Authors: Kohei Saijo, Robin Scheibler

Abstract: We propose an independence-based joint dereverberation and separation method with a neural source model. We introduce a neural network in the framework of time-decorrelation iterative source steering, which is an extension of independent vector analysis to joint dereverberation and separation. The network is trained in an end-to-end manner with a permutation invariant loss on the time-domain separation output signals. Our proposed method can be applied in any situation with at least as many microphones as sources, regardless of their number. In experiments, we demonstrate that our method results in high performance in terms of both speech quality metrics and word error rate (WER), even for mixtures with a different number of speakers than training. Furthermore, the model, trained on synthetic mixtures, without any modifications, greatly reduces the WER on the recorded dataset LibriCSS.

Submitted 1 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: Submitted to INTERSPEECH2022

arXiv:2110.06440

SDR -- Medium Rare with Fast Computations

Authors: Robin Scheibler

Abstract: We revisit the widely used bss_eval metrics for source separation with an eye out for performance. We propose a fast algorithm fixing shortcomings of publicly available implementations. First, we show that the metrics are fully specified by the squared cosine of just two angles between estimate and reference subspaces. Second, large linear systems are involved. However, they are structured, and we apply a fast iterative method based on conjugate gradient descent. The complexity of this step is thus reduced by a factor quadratic in the distortion filter size used in bss_eval, usually 512. In experiments, we assess speed and numerical accuracy. Not only is the loss of accuracy due to the approximate solver acceptable for most applications, but the speed-up is up to two orders of magnitude in some, not so extreme, cases. We confirm that our implementation can train neural networks, and find that longer distortion filters may be beneficial.

Submitted 12 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures, 2 tables. Submitted to ICASSP 2022
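
The squared-cosine characterization mentioned in the abstract above can be checked numerically in the simplest special case. The sketch below assumes a distortion filter of length 1, where the reference subspace is a single vector and the metric is the scale-invariant SDR. It is not the paper's algorithm, which handles long filters with a conjugate gradient solver, and the function names are illustrative.

```python
import numpy as np

def si_sdr_direct(ref, est):
    """Scale-invariant SDR in dB via explicit projection of est onto ref."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # least-squares scale
    target = alpha * ref                          # part of est explained by ref
    residual = est - target                       # everything else
    return 10 * np.log10(np.dot(target, target) / np.dot(residual, residual))

def si_sdr_from_cosine(ref, est):
    """Same quantity from the squared cosine of the angle between ref and est."""
    cos2 = np.dot(ref, est) ** 2 / (np.dot(ref, ref) * np.dot(est, est))
    return 10 * np.log10(cos2 / (1.0 - cos2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = 0.5 * ref + 0.1 * rng.standard_normal(16000)  # scaled reference plus noise
assert np.isclose(si_sdr_direct(ref, est), si_sdr_from_cosine(ref, est))
```

The two functions agree because the projected energy is cos² times the energy of the estimate and the residual energy is (1 - cos²) times it, so the estimate's energy cancels in the ratio.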

arXiv:2106.01011

doi 10.1109/ICASSP39728.2021.9414798

Refinement of Direction of Arrival Estimators by Majorization-Minimization Optimization on the Array Manifold

Authors: Robin Scheibler, Masahito Togami

Abstract: We propose a generalized formulation of direction of arrival estimation that includes many existing methods such as steered response power, subspace, coherent and incoherent, as well as speech sparsity-based methods. Unlike most conventional methods that rely exclusively on grid search, we introduce a continuous optimization algorithm to refine DOA estimates beyond the resolution of the initial grid. The algorithm is derived from the majorization-minimization (MM) technique. We derive two surrogate functions, one quadratic and one linear. Both lead to efficient iterative algorithms that do not require hyperparameters, such as step size, and ensure that the DOA estimates never leave the array manifold, without the need for a projection step. In numerical experiments, we show that the accuracy after a few iterations of the MM algorithm nearly removes dependency on the resolution of the initial grid used. We find that the quadratic surrogate function leads to very fast convergence, but the simplicity of the linear algorithm is very attractive, and the performance gap small.

Submitted 2 June, 2021; originally announced June 2021.

Comments: 5 pages, 2 figures, 2 tables. Presented at IEEE ICASSP 2021

Journal ref: Proc. IEEE ICASSP, pp. 436-440, June, 2021

arXiv:2102.06322

doi 10.1109/ICASSP39728.2021.9413478

Joint Dereverberation and Separation with Iterative Source Steering

Authors: Taishi Nakashima, Robin Scheibler, Masahito Togami, Nobutaka Ono

Abstract: We propose a new algorithm for joint dereverberation and blind source separation (DR-BSS). Our work builds upon the ILRMA-T framework that applies a unified filter combining dereverberation and separation. One drawback of this framework is that it requires several matrix inversions, an operation inherently costly and with potential stability issues. We leverage the recently introduced iterative source steering (ISS) updates to propose two algorithms mitigating this issue. Albeit derived from first principles, the first algorithm turns out to be a natural combination of weighted prediction error (WPE) dereverberation and ISS-based BSS, applied alternatingly. In this case, we manage to reduce the number of matrix inversions to only one per iteration and source. The second algorithm updates the ILRMA-T matrix using only sequential ISS updates requiring no matrix inversion at all. Its implementation is straightforward and memory efficient. Numerical experiments demonstrate that both methods achieve the same final performance as ILRMA-T in terms of several relevant objective metrics. In the important case of two sources, the number of iterations required is also similar.

Submitted 31 May, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: 5 pages, 2 figures, accepted at ICASSP 2021

arXiv:2011.05540

Surrogate Source Model Learning for Determined Source Separation

Authors: Robin Scheibler, Masahito Togami

Abstract: We propose to learn surrogate functions of universal speech priors for determined blind speech separation. Deep speech priors are highly desirable due to their high modelling power, but are not compatible with state-of-the-art independent vector analysis based on majorization-minimization (AuxIVA), since deriving the required surrogate function is not easy, nor always possible. Instead, we do away with exact majorization and directly approximate the surrogate. Taking advantage of iterative source steering (ISS) updates, we back propagate the permutation invariant separation loss through multiple iterations of AuxIVA. ISS lends itself well to this task due to its lower complexity and lack of matrix inversion. Experiments show large improvements in terms of scale-invariant signal-to-distortion ratio (SDR) and word error rate compared to baseline methods. Training is done on two-speaker mixtures and we experiment with two losses, SDR and coherence. We find that the learnt approximate surrogate generalizes well on mixtures of three and four speakers without any modification. We also demonstrate generalization to a different variation of the AuxIVA update equations. The SDR loss leads to fastest convergence in iterations, while coherence leads to the lowest word error rate (WER). We obtain as much as 36% reduction in WER.

Submitted 10 November, 2020; originally announced November 2020.

Comments: 5 pages, 3 figures, 1 table. Submitted to ICASSP 2021

arXiv:2009.05288

Generalized Minimal Distortion Principle for Blind Source Separation

Authors: Robin Scheibler

Abstract: We revisit the source image estimation problem from blind source separation (BSS). We generalize the traditional minimum distortion principle to maximum likelihood estimation with a model for the residual spectrograms. Because residual spectrograms typically contain other sources, we propose to use a mixed-norm model that lets us finely tune sparsity in time and frequency. We propose to carry out the minimization of the mixed-norm via majorization-minimization optimization, leading to an iteratively reweighted least-squares algorithm. The algorithm balances efficiency and ease of implementation well. We assess the performance of the proposed method as applied to two well-known determined BSS algorithms and one joint BSS-dereverberation algorithm. We find that it is possible to tune the parameters to improve separation by up to 2 dB, with no increase in distortion, and at little computational cost. The method thus provides a cheap and easy way to boost the performance of blind source separation.

Submitted 11 September, 2020; originally announced September 2020.

Comments: 5 pages, 1 figure, 2 tables, Accepted at INTERSPEECH 2020
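
For context, the traditional minimum distortion principle that this abstract generalizes is commonly implemented as a per-frequency least-squares rescaling of each separated source onto a reference microphone. The sketch below shows only that classical baseline, not the mixed-norm method of the paper. The function name, the (frequency, frame) STFT layout, and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def projection_back(Y, X_ref, eps=1e-12):
    """Rescale a separated source Y (F, N) to match a reference channel X_ref (F, N).

    For each frequency f, solves min_z sum_n |X_ref[f, n] - z * Y[f, n]|^2,
    whose closed-form solution is z = <Y, X_ref> / ||Y||^2.
    """
    num = np.sum(np.conj(Y) * X_ref, axis=-1)   # complex inner product per frequency
    den = np.sum(np.abs(Y) ** 2, axis=-1)       # energy of the separated source
    z = num / np.maximum(den, eps)              # eps avoids division by zero
    return z[:, None] * Y

rng = np.random.default_rng(0)
F, N = 5, 40
Y = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))
scale = rng.standard_normal(F) + 1j * rng.standard_normal(F)
X_ref = scale[:, None] * Y                      # reference is a rescaled copy of Y
assert np.allclose(projection_back(Y, X_ref), X_ref)
```

When the reference also contains other sources, the least-squares fit is pulled toward them. That is the situation the residual model in the abstract is meant to address.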

arXiv:2008.10048

doi 10.1109/TSP.2021.3072228

Independent Vector Analysis via Log-Quadratically Penalized Quadratic Minimization

Authors: Robin Scheibler

Abstract: We propose a new algorithm for blind source separation (BSS) using independent vector analysis (IVA). This is an improvement over the popular auxiliary function based IVA (AuxIVA) with iterative projection (IP) or iterative source steering (ISS). We introduce iterative projection with adjustment (IPA), where we update one demixing filter and jointly adjust all the other sources along its current direction. Each update involves solving a non-convex minimization problem that we term log-quadratically penalized quadratic minimization (LQPQM), which we think is of interest beyond this work. In the general case, we show that its global minimum corresponds to the largest root of a univariate function, reminiscent of modified eigenvalue problems. We propose a simple procedure based on Newton-Raphson to efficiently compute it. Numerical experiments demonstrate the effectiveness of the proposed method. First, we show that it efficiently decreases the value of the surrogate function. In further experiments on synthetic mixtures, we study the probability of finding the true demixing matrix and convergence speed. We show that the proposed method combines high success rate and fast convergence. Finally, we validate the performance on a reverberant blind speech separation task. We find that all the AuxIVA-based methods perform similarly in terms of acoustic BSS metrics. However, AuxIVA-IPA converges faster. We measure up to 8.5 times speed-up in terms of runtime compared to the next best AuxIVA-based method, depending on the number of channels and the signal-to-noise ratio (SNR).

Submitted 18 May, 2021; v1 submitted 23 August, 2020; originally announced August 2020.

Comments: 16 pages, 6 figures, 4 tables

Journal ref: IEEE Transactions on Signal Processing, Vol. 69, pp. 2509 - 2524, April 2021

arXiv:2006.02774

A study on more realistic room simulation for far-field keyword spotting

Authors: Eric Bezzam, Robin Scheibler, Cyril Cadoux, Thibault Gisselbrecht

Abstract: We investigate the impact of more realistic room simulation for training far-field keyword spotting systems without fine-tuning on in-domain data. To this end, we study the impact of incorporating the following factors in the room impulse response (RIR) generation: air absorption, surface- and frequency-dependent coefficients of real materials, and stochastic ray tracing. Through an ablation study, a wake word task is used to measure the impact of these factors in comparison with a ground-truth set of measured RIRs. On a hold-out set of re-recordings under clean and noisy far-field conditions, we demonstrate up to 35.8% relative improvement over the commonly-used (single absorption coefficient) image source method. Source code is made available in the Pyroomacoustics package, allowing others to incorporate these techniques in their work.

Submitted 18 November, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: 7 pages, 4 figures, accepted at APSIPA 2020, room impulse response generation code can be found at https://github.com/ebezzam/room-simulation

arXiv:2004.03926

MM Algorithms for Joint Independent Subspace Analysis with Application to Blind Single and Multi-Source Extraction

Authors: Robin Scheibler, Nobutaka Ono

Abstract: In this work, we propose efficient algorithms for joint independent subspace analysis (JISA), an extension of independent component analysis that deals with parallel mixtures, where not all the components are independent. We derive an algorithmic framework for JISA based on the majorization-minimization (MM) optimization technique (JISA-MM). We use a well-known inequality for super-Gaussian sources to derive a surrogate function of the negative log-likelihood of the observed data. The minimization of this surrogate function leads to a variant of the hybrid exact-approximate diagonalization problem, but where multiple demixing vectors are grouped together. In the spirit of auxiliary function based independent vector analysis (AuxIVA), we propose several updates that can be applied alternately to one, or jointly to two, groups of demixing vectors. Recently, blind extraction of one or more sources has gained interest as a reasonable way of exploiting larger microphone arrays to achieve better separation. In particular, several MM algorithms have been proposed for overdetermined IVA (OverIVA). By applying JISA-MM, we are not only able to rederive these in a general manner, but also find several new algorithms. We run extensive numerical experiments to evaluate their performance, and compare it to that of full separation with AuxIVA. We find that algorithms using pairwise updates of two sources, or of one source and the background have the fastest convergence, and are able to separate target sources quickly and precisely from the background. In addition, we characterize the performance of all algorithms under a large number of noise, reverberation, and background mismatch conditions.

Submitted 8 April, 2020; originally announced April 2020.

Comments: 15 pages, 4 figures

arXiv:1910.10654 [pdf, other]

Fast Independent Vector Extraction by Iterative SINR Maximization

Authors: Robin Scheibler, Nobutaka Ono

Abstract: We propose fast independent vector extraction (FIVE), a new algorithm that blindly extracts a single non-Gaussian source from a Gaussian background. The algorithm iteratively computes beamforming weights maximizing the signal-to-interference-and-noise ratio for an approximate noise covariance matrix. We demonstrate that this procedure minimizes the negative log-likelihood of the input data according to a well-defined probabilistic model. The minimization is carried out via the auxiliary function technique whereas, unlike related methods, the auxiliary function is globally minimized at every iteration. Numerical experiments are carried out to assess the performance of FIVE. We find that it is vastly superior to competing methods in terms of convergence speed, and has high potential for real-time applications.

Submitted 23 October, 2019; originally announced October 2019.

Comments: 5 pages, 4 figures, submitted to ICASSP 2020

arXiv:1905.07880 [pdf, other]

Independent Vector Analysis with more Microphones than Sources

Authors: Robin Scheibler, Nobutaka Ono

Abstract: We extend frequency-domain blind source separation based on independent vector analysis to the case where there are more microphones than sources. The signal is modelled as non-Gaussian sources in a Gaussian background. The proposed algorithm is based on a parametrization of the demixing matrix decreasing the number of parameters to estimate. Furthermore, orthogonal constraints between the signal and background subspaces are imposed to regularize the separation. The problem can then be posed as a constrained likelihood maximization. We propose efficient alternating updates guaranteed to converge to a stationary point of the cost function. The performance of the algorithm is assessed on simulated signals. We find that the separation performance is on par with that of the conventional determined algorithm at a fraction of the computational cost.

Submitted 7 August, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

Comments: Accepted to WASPAA 2019, 5 pages, 3 figures

arXiv:1904.02334 [pdf, other]

doi 10.1109/ICASSP.2019.8682594

Multi-modal Blind Source Separation with Microphones and Blinkies

Authors: Robin Scheibler, Nobutaka Ono

Abstract: We propose a blind source separation algorithm that jointly exploits measurements by a conventional microphone array and an ad hoc array of low-rate sound power sensors called blinkies. While providing less information than microphones, blinkies circumvent some difficulties of microphone arrays in terms of manufacturing, synchronization, and deployment. The algorithm is derived from a joint probabilistic model of the microphone and sound power measurements. We assume the separated sources to follow a time-varying spherical Gaussian distribution, and the non-negative power measurement space-time matrix to have a low-rank structure. We show that alternating updates similar to those of independent vector analysis and Itakura-Saito non-negative matrix factorization decrease the negative log-likelihood of the joint distribution. The proposed algorithm is validated via numerical experiments. Its median separation performance is found to be up to 8 dB more than that of independent vector analysis, with significantly reduced variability.

Submitted 3 April, 2019; originally announced April 2019.

Comments: Accepted at IEEE ICASSP 2019, Brighton, UK. 5 pages. 3 figures

arXiv:1711.06805 [pdf, other]

doi 10.1109/ICASSP.2018.8461345

Separake: Source Separation with a Little Help From Echoes

Authors: Robin Scheibler, Diego Di Carlo, Antoine Deleforge, Ivan Dokmanić

Abstract: It is commonly believed that multipath hurts various audio processing algorithms. At odds with this belief, we show that multipath in fact helps sound source separation, even with very simple propagation models. Unlike most existing methods, we neither ignore the room impulse responses, nor we attempt to estimate them fully. We rather assume that we know the positions of a few virtual microphones generated by echoes and we show how this gives us enough spatial diversity to get a performance boost over the anechoic case. We show improvements for two standard algorithms---one that uses only magnitudes of the transfer functions, and one that also uses the phases. Concretely, we show that multichannel non-negative matrix factorization aided with a small number of echoes beats the vanilla variant of the same algorithm, and that with magnitude information only, echoes enable separation where it was previously impossible.

Submitted 17 November, 2017; originally announced November 2017.

arXiv:1710.04196 [pdf, other]

doi 10.1109/ICASSP.2018.8461310

Pyroomacoustics: A Python package for audio room simulations and array processing algorithms

Authors: Robin Scheibler, Eric Bezzam, Ivan Dokmanić

Abstract: We present pyroomacoustics, a software package aimed at the rapid development and testing of audio array processing algorithms. The content of the package can be divided into three main components: an intuitive Python object-oriented interface to quickly construct different simulation scenarios involving multiple sound sources and microphones in 2D and 3D rooms; a fast C implementation of the image source model for general polyhedral rooms to efficiently generate room impulse responses and simulate the propagation between sources and receivers; and finally, reference implementations of popular algorithms for beamforming, direction finding, and adaptive filtering. Together, they form a package with the potential to speed up the time to market of new algorithms by significantly reducing the implementation overhead in the performance evaluation step.

Submitted 11 October, 2017; originally announced October 2017.

Comments: 5 pages, 5 figures, describes a software package

Showing 1–23 of 23 results for author: Scheibler, R