Search | arXiv e-print repository

A Phoneme-Scale Assessment of Multichannel Speech Enhancement Algorithms

Authors: Nasser-Eddine Monir, Paul Magron, Romain Serizel

Abstract: In the intricate acoustic landscapes where speech intelligibility is challenged by noise and reverberation, multichannel speech enhancement emerges as a promising solution for individuals with hearing loss. Such algorithms are commonly evaluated at the utterance level. However, this approach overlooks the granular acoustic nuances revealed by phoneme-specific analysis, potentially obscuring key insights into their performance. This paper presents an in-depth phoneme-scale evaluation of three state-of-the-art multichannel speech enhancement algorithms. These algorithms (FasNet, MVDR, and Tango) are extensively evaluated across different noise conditions and spatial setups, employing realistic acoustic simulations with measured room impulse responses, and leveraging the diversity offered by multiple microphones in a binaural hearing setup. The study emphasizes the fine-grained phoneme-level analysis, revealing that while some phonemes like plosives are heavily impacted by environmental acoustics and are challenging for the algorithms to handle, others like nasals and sibilants see substantial improvements after enhancement. These investigations demonstrate important improvements in phoneme clarity in noisy conditions, with insights that could drive the development of more personalized and phoneme-aware hearing aid technologies.

Submitted 24 January, 2024; originally announced January 2024.

Comments: This is the preprint of the paper that we submitted to the Trends in Hearing Journal

arXiv:2303.01864 [pdf, ps, other]

Spectrogram Inversion for Audio Source Separation via Consistency, Mixing, and Magnitude Constraints

Authors: Paul Magron, Tuomas Virtanen

Abstract: Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is treated as an optimization problem involving one or several terms in order to promote estimates that comply with a consistency property, a mixing constraint, and/or a target magnitude objective. Nonetheless, it is still unclear which set of constraints and problem formulation is the most appropriate in practice. In this paper, we design a general framework for deriving spectrogram inversion algorithms, which is based on formulating optimization problems by combining these objectives either as soft penalties or hard constraints. We solve these by means of algorithms that perform alternating projections on the subsets corresponding to each objective/constraint. Our framework encompasses existing techniques from the literature as well as novel algorithms. We investigate the potential of these approaches for a speech enhancement task. In particular, one of our novel algorithms outperforms other approaches in a realistic setting where the magnitudes are estimated beforehand using a neural network.

Submitted 30 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

arXiv:2206.13768 [pdf, ps, other]

doi 10.1016/j.sigpro.2022.108905

Algorithms for audio inpainting based on probabilistic nonnegative matrix factorization

Authors: Ondřej Mokrý, Paul Magron, Thomas Oberlin, Cédric Févotte

Abstract: Audio inpainting, i.e., the task of restoring missing or occluded audio signal samples, usually relies on sparse representations or autoregressive modeling. In this paper, we propose to structure the spectrogram with nonnegative matrix factorization (NMF) in a probabilistic framework. First, we treat the missing samples as latent variables, and derive two expectation-maximization algorithms for estimating the parameters of the model, depending on whether we formulate the problem in the time- or time-frequency domain. Then, we treat the missing samples as parameters, and we address this novel problem by deriving an alternating minimization scheme. We assess the potential of these algorithms for the task of restoring short- to middle-length gaps in music signals. Experiments reveal great convergence properties of the proposed methods, as well as competitive performance when compared to state-of-the-art audio inpainting techniques.

Submitted 5 January, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

arXiv:2204.09741 [pdf, ps, other]

doi 10.1109/LSP.2022.3187368

A majorization-minimization algorithm for nonnegative binary matrix factorization

Authors: Paul Magron, Cédric Févotte

Abstract: This paper tackles the problem of decomposing binary data using matrix factorization. We consider the family of mean-parametrized Bernoulli models, a class of generative models that are well suited for modeling binary data and enable interpretability of the factors. We factorize the Bernoulli parameter and consider an additional Beta prior on one of the factors to further improve the model's expressive power. While similar models have been proposed in the literature, they only exploit the Beta prior as a proxy to ensure a valid Bernoulli parameter in a Bayesian setting; in practice it reduces to a uniform or uninformative prior. Besides, estimation in these models has focused on costly Bayesian inference. In this paper, we propose a simple yet very efficient majorization-minimization algorithm for maximum a posteriori estimation. Our approach leverages the Beta prior whose parameters can be tuned to improve performance in matrix completion tasks. Experiments conducted on three public binary datasets show that our approach offers an excellent trade-off between prediction performance, computational complexity, and interpretability.

Submitted 20 April, 2022; originally announced April 2022.

arXiv:2204.01360 [pdf, other]

doi 10.1109/LSP.2022.3189275

Learning the Proximity Operator in Unfolded ADMM for Phase Retrieval

Authors: Pierre-Hugo Vial, Paul Magron, Thomas Oberlin, Cédric Févotte

Abstract: This paper considers the phase retrieval (PR) problem, which aims to reconstruct a signal from phaseless measurements such as magnitude or power spectrograms. PR is generally handled as a minimization problem involving a quadratic loss. Recent works have considered alternative discrepancy measures, such as the Bregman divergences, but it is still challenging to tailor the optimal loss for a given setting. In this paper we propose a novel strategy to automatically learn the optimal metric for PR. We unfold a recently introduced ADMM algorithm into a neural network, and we emphasize that the information about the loss used to formulate the PR problem is conveyed by the proximity operator involved in the ADMM updates. Therefore, we replace this proximity operator with trainable activation functions: learning these in a supervised setting is then equivalent to learning an optimal metric for PR. Experiments conducted with speech signals show that our approach outperforms the baseline ADMM, using a light and interpretable neural architecture.

Submitted 4 April, 2022; originally announced April 2022.

Comments: 10 pages, 5 figures, submitted to IEEE SPL

arXiv:2203.15758 [pdf, other]

A Sparsity-promoting Dictionary Model for Variational Autoencoders

Authors: Mostafa Sadeghi, Paul Magron

Abstract: Structuring the latent space in probabilistic deep generative models, e.g., variational autoencoders (VAEs), is important to yield more expressive models and interpretable representations, and to avoid overfitting. One way to achieve this objective is to impose a sparsity constraint on the latent variables, e.g., via a Laplace prior. However, such approaches usually complicate the training phase, and they sacrifice the reconstruction quality to promote sparsity. In this paper, we propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model, which assumes that each latent code can be written as a sparse linear combination of a dictionary's columns. In particular, we leverage a computationally efficient and tuning-free method, which relies on a zero-mean Gaussian latent prior with learnable variances. We derive a variational inference scheme to train the model. Experiments on speech generative modeling demonstrate the advantage of the proposed approach over competing techniques, since it promotes sparsity while not deteriorating the output speech quality.

Submitted 17 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Proc. of Interspeech 2022

arXiv:2102.12369 [pdf, ps, other]

Neural content-aware collaborative filtering for cold-start music recommendation

Authors: Paul Magron, Cédric Févotte

Abstract: State-of-the-art music recommender systems are based on collaborative filtering, which builds upon learning similarities between users and songs from the available listening data. These approaches inherently face the cold-start problem, as they cannot recommend novel songs with no listening history. Content-aware recommendation addresses this issue by incorporating content information about the songs on top of collaborative filtering. However, methods falling in this category rely on a shallow user/item interaction that originates from a matrix factorization framework. In this work, we introduce neural content-aware collaborative filtering, a unified framework which alleviates these limits, and extends the recently introduced neural collaborative filtering to its content-aware counterpart. We propose a generative model which leverages deep learning both for extracting content information from low-level acoustic features and for modeling the interaction between user and song embeddings. The deep content feature extractor can either directly predict the item embedding, or serve as a regularization prior, yielding two variants (strict and relaxed) of our model. Experimental results show that the proposed method reaches state-of-the-art results for a cold-start music recommendation task. We notably observe that exploiting deep neural networks for learning refined user/item interactions outperforms approaches using a simpler interaction model in a content-aware framework.

Submitted 20 July, 2022; v1 submitted 24 February, 2021; originally announced February 2021.

arXiv:2011.12818 [pdf, other]

Phase retrieval with Bregman divergences: Application to audio signal recovery

Authors: Pierre-Hugo Vial, Paul Magron, Thomas Oberlin, Cédric Févotte

Abstract: Phase retrieval aims to recover a signal from magnitude or power spectra measurements. It is often addressed by considering a minimization problem involving a quadratic cost function. We propose a different formulation based on Bregman divergences, which encompass divergences that are appropriate for audio signal processing applications. We derive a fast gradient algorithm to solve this problem.

Submitted 25 November, 2020; originally announced November 2020.

Comments: in Proceedings of iTWIST'20, Paper-ID: 16, Nantes, France, December, 2-4, 2020

arXiv:2010.10276 [pdf, ps, other]

Leveraging the structure of musical preference in content-aware music recommendation

Authors: Paul Magron, Cédric Févotte

Abstract: State-of-the-art music recommendation systems are based on collaborative filtering, which predicts a user's interest from their listening habits and similarities with other users' profiles. These approaches are agnostic to the song content, and therefore face the cold-start problem: they cannot recommend novel songs without listening history. To tackle this issue, content-aware recommendation incorporates information about the songs that can be used for recommending new items. Most methods falling in this category exploit either user-annotated tags, acoustic features or deeply-learned features. Consequently, these content features do not have a clear musical meaning, thus they are not necessarily relevant from a musical preference perspective. In this work, we propose instead to leverage a model of musical preference which originates from the field of music psychology. From low-level acoustic features we extract three factors (arousal, valence and depth), which have been shown appropriate for describing musical taste. Then we integrate those into a collaborative filtering framework for content-aware music recommendation. Experiments conducted on large-scale data show that this approach is able to address the cold-start problem, while using a compact and meaningful set of musical features.

Submitted 9 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

arXiv:2010.10255 [pdf, ps, other]

Phase recovery with Bregman divergences for audio source separation

Authors: Paul Magron, Pierre-Hugo Vial, Thomas Oberlin, Cédric Févotte

Abstract: Time-frequency audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a phase recovery algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has shown good performance in several recent works. This algorithm minimizes a quadratic reconstruction error between magnitude spectrograms. However, this loss does not properly account for some perceptual properties of audio, and alternative discrepancy measures such as beta-divergences have been preferred in many settings. In this paper, we propose to reformulate phase recovery in audio source separation as a minimization problem involving Bregman divergences. To optimize the resulting objective, we derive a projected gradient descent algorithm. Experiments conducted on a speech enhancement task show that this approach outperforms MISI for several alternative losses, which highlights their relevance for audio source separation applications.

Submitted 9 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

arXiv:2010.00392 [pdf, other]

doi 10.1109/JSTSP.2021.3051870

Phase retrieval with Bregman divergences and application to audio signal recovery

Authors: Pierre-Hugo Vial, Paul Magron, Thomas Oberlin, Cédric Févotte

Abstract: Phase retrieval (PR) aims to recover a signal from the magnitudes of a set of inner products. This problem arises in many audio signal processing applications which operate on a short-time Fourier transform magnitude or power spectrogram, and discard the phase information. Recovering the missing phase from the resulting modified spectrogram is indeed necessary in order to synthesize time-domain signals. PR is commonly addressed by considering a minimization problem involving a quadratic loss function. In this paper, we adopt a different standpoint. Indeed, the quadratic loss does not properly account for some perceptual properties of audio, and alternative discrepancy measures such as beta-divergences have been preferred in many settings. Therefore, we formulate PR as a new minimization problem involving Bregman divergences. Since these divergences are not symmetric with respect to their two input arguments in general, they lead to two different formulations of the problem. To optimize the resulting objective, we derive two algorithms based on accelerated gradient descent and alternating direction method of multipliers. Experiments conducted on audio signal recovery from spectrograms that are either exact or estimated from noisy observations highlight the potential of our proposed methods for audio restoration. In particular, leveraging some of these Bregman divergences induces better performance than the quadratic loss when performing PR from spectrograms under very noisy conditions.

Submitted 13 January, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: 23 pages, 3 figures, accepted for publication in the IEEE Journal of Selected Topics in Signal Processing

arXiv:1911.03128 [pdf, ps, other]

doi 10.1109/LSP.2020.2970310

Online Spectrogram Inversion for Low-Latency Audio Source Separation

Authors: Paul Magron, Tuomas Virtanen

Abstract: Audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a spectrogram inversion algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has been exploited successfully in several recent works. However, this algorithm suffers from two drawbacks, which we address in this paper. First, it has originally been introduced in a heuristic fashion: we propose here a rigorous optimization framework in which MISI is derived, thus proving the convergence of this algorithm. Besides, while MISI operates offline, we propose here an online version of MISI called oMISI, which is suitable for low-latency source separation, an important requirement, e.g., for hearing aid applications. oMISI also allows one to use alternative phase initialization schemes exploiting the temporal structure of audio signals. Experiments conducted on a speech separation task show that oMISI performs as well as its offline counterpart, thus demonstrating its potential for real-time source separation.

Submitted 24 February, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

arXiv:1907.08506 [pdf, other]

Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling

Authors: Konstantinos Drossos, Shayan Gharib, Paul Magron, Tuomas Virtanen

Abstract: A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets: the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 9% and 2% in F1 score (higher is better) and a decrease of 7% and 2% in ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. By contrast, with our method there is a decrease of 4% in F1 score and an increase of 7% in ER for the TUT-SED Synthetic 2016 dataset.

Submitted 6 November, 2019; v1 submitted 19 July, 2019; originally announced July 2019.

Comments: Fixed the display of URLs at footnote, updated the results

arXiv:1904.10678 [pdf, ps, other]

Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification

Authors: Konstantinos Drossos, Paul Magron, Tuomas Virtanen

Abstract: A challenging problem in the deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and a previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.

Submitted 6 November, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

Comments: Updated indices at Eq 6

arXiv:1807.11298 [pdf, other]

Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery

Authors: Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, Tuomas Virtanen

Abstract: Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the-art kernel additive model approach.

Submitted 30 July, 2018; originally announced July 2018.

arXiv:1802.03156 [pdf, ps, other]

Complex ISNMF: a Phase-Aware Model for Monaural Audio Source Separation

Authors: Paul Magron, Tuomas Virtanen

Abstract: This paper introduces a phase-aware probabilistic model for audio source separation. Classical source models in the short-term Fourier transform domain use circularly-symmetric Gaussian or Poisson random variables. This is equivalent to assuming that the phase of each source is uniformly distributed, which is not suitable for exploiting the underlying structure of the phase. Drawing on preliminary works, we introduce here a Bayesian anisotropic Gaussian source model in which the phase is no longer uniform. Such a model permits us to favor a phase value that originates from a signal model through a Markov chain prior structure. The variances of the latent variables are structured with nonnegative matrix factorization (NMF). The resulting model is called complex Itakura-Saito NMF (ISNMF) since it generalizes the ISNMF model to the case of non-isotropic variables. It combines the advantages of ISNMF, which uses a distortion measure adapted to audio and yields a set of estimates which preserve the overall energy of the mixture, and of complex NMF, which enables one to account for some phase constraints. We derive a generalized expectation-maximization algorithm to estimate the model parameters. Experiments conducted on a musical source separation task in a semi-informed setting show that the proposed approach outperforms state-of-the-art phase-aware separation techniques.

Submitted 30 September, 2018; v1 submitted 9 February, 2018; originally announced February 2018.

arXiv:1608.01953 [pdf, ps, other]

Model-based STFT phase recovery for audio source separation

Authors: Paul Magron, Roland Badeau, Bertrand David

Abstract: For audio source separation applications, it is common to estimate the magnitude of the short-time Fourier transform (STFT) of each source. In order to then synthesize time-domain signals, it is necessary to recover the phase of the corresponding complex-valued STFT. Most authors in this field choose a Wiener-like filtering approach which boils down to using the phase of the original mixture. In this paper, a different standpoint is adopted. Many music events are partially composed of slowly varying sinusoids and the STFT phase increment over time of those frequency components takes a specific form. This allows phase recovery by an unwrapping technique once a short-term frequency estimate has been obtained. Herein, a novel iterative source separation procedure is proposed which builds upon these results. It consists in minimizing the mixing error by means of the auxiliary function method. This procedure is initialized by exploiting the unwrapping technique in order to generate estimates that benefit from a temporal continuity property. Experiments conducted on realistic music pieces show that, given accurate magnitude estimates, this procedure outperforms the state-of-the-art consistent Wiener filter.

Submitted 27 February, 2018; v1 submitted 5 August, 2016; originally announced August 2016.

arXiv:1608.01844 [pdf, ps, other]

Lévy NMF for robust nonnegative source separation

Authors: Paul Magron, Roland Badeau, Antoine Liutkus

Abstract: Source separation, which consists in decomposing data into meaningful structured components, is an active research topic in many areas, such as music and image signal processing, applied physics and text mining. In this paper, we introduce the Positive $α$-stable (P$α$S) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to model random variables that are both nonnegative and impulsive. Considering the Lévy distribution, the only P$α$S distribution whose density is tractable, we propose a mixture model called Lévy Nonnegative Matrix Factorization (Lévy NMF). This model accounts for low-rank structures in nonnegative data that possibly has high variability or is corrupted by very adverse noise. The model parameters are estimated in a maximum-likelihood sense. We also derive an estimator of the sources given the parameters, which extends the validity of the generalized Wiener filtering to the P$α$S case. Experiments on synthetic data show that Lévy NMF compares favorably with state-of-the-art techniques in terms of robustness to impulsive noise. The analysis of two types of realistic signals is also considered: musical spectrograms and fluorescence spectra of chemical species. The results highlight the potential of the Lévy NMF model for decomposing nonnegative data.

Submitted 8 November, 2016; v1 submitted 5 August, 2016; originally announced August 2016.

arXiv:1605.07469 [pdf, ps, other]

doi 10.1109/ICASSP.2015.7177936

Phase recovery in NMF for audio source separation: an insightful benchmark

Authors: Paul Magron, Roland Badeau, Bertrand David

Abstract: Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of audio signals in the Time-Frequency (TF) domain. In applications such as source separation, the phase recovery for each extracted component is a major issue since it often leads to audible artifacts. In this paper, we present a methodology for evaluating various NMF-based source separation techniques involving phase reconstruction. For each model considered, a comparison between two approaches (blind separation without prior information and oracle separation with supervised model learning) is performed, in order to inquire about the room for improvement for the estimation methods. Experimental results show that the High Resolution NMF (HRNMF) model is particularly promising, because it is able to take phases and correlations over time into account with a great expressive power.

Submitted 24 May, 2016; originally announced May 2016.

Comments: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015

arXiv:1605.07468 [pdf, ps, other]

doi 10.1109/WASPAA.2015.7336935

Phase reconstruction of spectrograms based on a model of repeated audio events

Authors: Paul Magron, Roland Badeau, Bertrand David

Abstract: Phase recovery of modified spectrograms is a major issue in audio signal processing applications, such as source separation. This paper introduces a novel technique for estimating the phases of components in complex mixtures within onset frames in the Time-Frequency (TF) domain. We propose to exploit the phase repetitions from one onset frame to another. We introduce a reference phase which characterizes a component independently of its activation times. The onset phases of a component are then modeled as the sum of this reference and an offset which is linearly dependent on the frequency. We derive a complex mixture model within onset frames and we provide two algorithms for the estimation of the model phase parameters. The model is estimated on experimental data and this technique is integrated into an audio source separation framework. The results demonstrate that this model is a promising tool for exploiting phase repetitions, and point out its potential for separating overlapping components in complex mixtures.

Submitted 24 May, 2016; originally announced May 2016.

Comments: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2015

arXiv:1605.07467 [pdf, ps, other]

Phase reconstruction of spectrograms with linear unwrapping: application to audio signal restoration

Authors: Paul Magron, Roland Badeau, Bertrand David

Abstract: This paper introduces a novel technique for reconstructing the phase of modified spectrograms of audio signals. From the analysis of mixtures of sinusoids we obtain relationships between phases of successive time frames in the Time-Frequency (TF) domain. To obtain similar relationships over frequencies, in particular within onset frames, we study an impulse model. Instantaneous frequencies and attack times are estimated locally to encompass the class of non-stationary signals such as vibratos. These techniques ensure both the vertical coherence of partials (over frequencies) and the horizontal coherence (over time). The method is tested on a variety of data and demonstrates better performance than traditional consistency-based approaches. We also introduce an audio restoration framework and observe that our technique outperforms traditional methods.

Submitted 24 May, 2016; originally announced May 2016.

Comments: European Signal Processing Conference (EUSIPCO) 2015

arXiv:1605.07466 [pdf, ps, other]

doi 10.1109/ICASSP.2016.7471634

Complex NMF under phase constraints based on signal modeling: application to audio source separation

Authors: Paul Magron, Roland Badeau, Bertrand David

Abstract: Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of audio signals in the Time-Frequency (TF) domain. In the source separation framework, the phase recovery for each extracted component is necessary for synthesizing time-domain signals. The Complex NMF (CNMF) model aims to jointly estimate the spectrogram and the phase of the sources, but requires constraining the phase in order to produce satisfactory-sounding results. We propose to incorporate phase constraints based on signal models within the CNMF framework: a \textit{phase unwrapping} constraint that enforces a form of temporal coherence, and a constraint based on the \textit{repetition} of audio events, which models the phases of the sources within onset frames. We also provide an algorithm for estimating the model parameters. The experimental results highlight the interest of including such constraints in the CNMF framework for separating overlapping components in complex audio mixtures.

Submitted 24 May, 2016; originally announced May 2016.

Comments: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016

Showing 1–22 of 22 results for author: Magron, P