-
Unsupervised Music Source Separation Using Differentiable Parametric Source Models
Authors:
Kilian Schulze-Forster,
Gaël Richard,
Liam Kelley,
Clement S. J. Doire,
Roland Badeau
Abstract:
Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical s…
▽ More
Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models' parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep learning based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
△ Less
Submitted 31 January, 2023; v1 submitted 24 January, 2022;
originally announced January 2022.
-
Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections
Authors:
Kimia Nadjahi,
Alain Durmus,
Pierre E. Jacob,
Roland Badeau,
Umut Şimşekli
Abstract:
The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of meas…
▽ More
The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
△ Less
Submitted 4 January, 2022; v1 submitted 29 June, 2021;
originally announced June 2021.
-
Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance
Authors:
Kimia Nadjahi,
Alain Durmus,
Umut Şimşekli,
Roland Badeau
Abstract:
Minimum expected distance estimation (MEDE) algorithms have been widely used for probabilistic models with intractable likelihood functions and they have become increasingly popular due to their use in implicit generative modeling (e.g. Wasserstein generative adversarial networks, Wasserstein autoencoders). Emerging from computational optimal transport, the Sliced-Wasserstein (SW) distance has bec…
▽ More
Minimum expected distance estimation (MEDE) algorithms have been widely used for probabilistic models with intractable likelihood functions and they have become increasingly popular due to their use in implicit generative modeling (e.g. Wasserstein generative adversarial networks, Wasserstein autoencoders). Emerging from computational optimal transport, the Sliced-Wasserstein (SW) distance has become a popular choice in MEDE thanks to its simplicity and computational benefits. While several studies have reported empirical success on generative modeling with SW, the theoretical properties of such estimators have not yet been established. In this study, we investigate the asymptotic properties of estimators that are obtained by minimizing SW. We first show that convergence in SW implies weak convergence of probability measures in general Wasserstein spaces. Then we show that estimators obtained by minimizing SW (and also an approximate version of SW) are asymptotically consistent. We finally prove a central limit theorem, which characterizes the asymptotic distribution of the estimators and establish a convergence rate of $\sqrt{n}$, where $n$ denotes the number of observed data points. We illustrate the validity of our theory on both synthetic data and neural networks.
△ Less
Submitted 24 March, 2020; v1 submitted 11 June, 2019;
originally announced June 2019.
-
Generalized Sliced Wasserstein Distances
Authors:
Soheil Kolouri,
Kimia Nadjahi,
Umut Simsekli,
Roland Badeau,
Gustavo K. Rohde
Abstract:
The Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW) distance, have recently drawn attention from the machine learning community. The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervise…
▽ More
The Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW) distance, have recently drawn attention from the machine learning community. The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervised learning. In this paper, we first clarify the mathematical connection between the SW distance and the Radon transform. We then utilize the generalized Radon transform to define a new family of distances for probability measures, which we call generalized sliced-Wasserstein (GSW) distances. We also show that, similar to the SW distance, the GSW distance can be extended to a maximum GSW (max-GSW) distance. We then provide the conditions under which GSW and max-GSW distances are indeed distances. Finally, we compare the numerical performance of the proposed distances on several generative modeling tasks, including SW flows and SW auto-encoders.
△ Less
Submitted 1 February, 2019;
originally announced February 2019.
-
Model-based STFT phase recovery for audio source separation
Authors:
Paul Magron,
Roland Badeau,
Bertrand David
Abstract:
For audio source separation applications, it is common to estimate the magnitude of the short-time Fourier transform (STFT) of each source. In order to further synthesizing time-domain signals, it is necessary to recover the phase of the corresponding complex-valued STFT. Most authors in this field choose a Wiener-like filtering approach which boils down to using the phase of the original mixture.…
▽ More
For audio source separation applications, it is common to estimate the magnitude of the short-time Fourier transform (STFT) of each source. In order to further synthesizing time-domain signals, it is necessary to recover the phase of the corresponding complex-valued STFT. Most authors in this field choose a Wiener-like filtering approach which boils down to using the phase of the original mixture. In this paper, a different standpoint is adopted. Many music events are partially composed of slowly varying sinusoids and the STFT phase increment over time of those frequency components takes a specific form. This allows phase recovery by an unwrap** technique once a short-term frequency estimate has been obtained. Herein, a novel iterative source separation procedure is proposed which builds upon these results. It consists in minimizing the mixing error by means of the auxiliary function method. This procedure is initialized by exploiting the unwrap** technique in order to generate estimates that benefit from a temporal continuity property. Experiments conducted on realistic music pieces show that, given accurate magnitude estimates, this procedure outperforms the state-of-the-art consistent Wiener filter.
△ Less
Submitted 27 February, 2018; v1 submitted 5 August, 2016;
originally announced August 2016.
-
Lévy NMF for robust nonnegative source separation
Authors:
Paul Magron,
Roland Badeau,
Antoine Liutkus
Abstract:
Source separation, which consists in decomposing data into meaningful structured components, is an active research topic in many areas, such as music and image signal processing, applied physics and text mining. In this paper, we introduce the Positive $α$-stable (P$α$S) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to mo…
▽ More
Source separation, which consists in decomposing data into meaningful structured components, is an active research topic in many areas, such as music and image signal processing, applied physics and text mining. In this paper, we introduce the Positive $α$-stable (P$α$S) distributions to model the latent sources, which are a subclass of the stable distributions family. They notably permit us to model random variables that are both nonnegative and impulsive. Considering the Lévy distribution, the only P$α$S distribution whose density is tractable, we propose a mixture model called Lévy Nonnegative Matrix Factorization (Lévy NMF). This model accounts for low-rank structures in nonnegative data that possibly has high variability or is corrupted by very adverse noise. The model parameters are estimated in a maximum-likelihood sense. We also derive an estimator of the sources given the parameters, which extends the validity of the generalized Wiener filtering to the P$α$S case. Experiments on synthetic data show that Lévy NMF compares favorably with state-of-the art techniques in terms of robustness to impulsive noise. The analysis of two types of realistic signals is also considered: musical spectrograms and fluorescence spectra of chemical species. The results highlight the potential of the Lévy NMF model for decomposing nonnegative data.
△ Less
Submitted 8 November, 2016; v1 submitted 5 August, 2016;
originally announced August 2016.
-
Nonnegative tensor factorization with frequency modulation cues for blind audio source separation
Authors:
Elliot Creager,
Noah D. Stein,
Roland Badeau,
Philippe Depalle
Abstract:
We present Vibrato Nonnegative Tensor Factorization, an algorithm for single-channel unsupervised audio source separation with an application to separating instrumental or vocal sources with nonstationary pitch from music recordings. Our approach extends Nonnegative Matrix Factorization for audio modeling by including local estimates of frequency modulation as cues in the separation. This permits…
▽ More
We present Vibrato Nonnegative Tensor Factorization, an algorithm for single-channel unsupervised audio source separation with an application to separating instrumental or vocal sources with nonstationary pitch from music recordings. Our approach extends Nonnegative Matrix Factorization for audio modeling by including local estimates of frequency modulation as cues in the separation. This permits the modeling and unsupervised separation of vibrato or glissando musical sources, which is not possible with the basic matrix factorization formulation.
The algorithm factorizes a sparse nonnegative tensor comprising the audio spectrogram and local frequency-slope-to-frequency ratios, which are estimated at each time-frequency bin using the Distributed Derivative Method. The use of local frequency modulations as separation cues is motivated by the principle of common fate partial grou** from Auditory Scene Analysis, which hypothesizes that each latent source in a mixture is characterized perceptually by coherent frequency and amplitude modulations shared by its component partials. We derive multiplicative factor updates by Minorization-Maximization, which guarantees convergence to a local optimum by iteration. We then compare our method to the baseline on two separation tasks: one considers synthetic vibrato notes, while the other considers vibrato string instrument recordings.
△ Less
Submitted 31 May, 2016;
originally announced June 2016.
-
Phase recovery in NMF for audio source separation: an insightful benchmark
Authors:
Paul Magron,
Roland Badeau,
Bertrand David
Abstract:
Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of audio signals in the Time-Frequency (TF) domain. In applications such as source separation, the phase recovery for each extracted component is a major issue since it often leads to audible artifacts. In this paper, we present a methodology for evaluating various NMF-based source separation techniques involving ph…
▽ More
Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of audio signals in the Time-Frequency (TF) domain. In applications such as source separation, the phase recovery for each extracted component is a major issue since it often leads to audible artifacts. In this paper, we present a methodology for evaluating various NMF-based source separation techniques involving phase reconstruction. For each model considered, a comparison between two approaches (blind separation without prior information and oracle separation with supervised model learning) is performed, in order to inquire about the room for improvement for the estimation methods. Experimental results show that the High Resolution NMF (HRNMF) model is particularly promising, because it is able to take phases and correlations over time into account with a great expressive power.
△ Less
Submitted 24 May, 2016;
originally announced May 2016.
-
Phase reconstruction of spectrograms based on a model of repeated audio events
Authors:
Paul Magron,
Roland Badeau,
Bertrand David
Abstract:
Phase recovery of modified spectrograms is a major issue in audio signal processing applications, such as source separation. This paper introduces a novel technique for estimating the phases of components in complex mixtures within onset frames in the Time-Frequency (TF) domain. We propose to exploit the phase repetitions from one onset frame to another. We introduce a reference phase which charac…
▽ More
Phase recovery of modified spectrograms is a major issue in audio signal processing applications, such as source separation. This paper introduces a novel technique for estimating the phases of components in complex mixtures within onset frames in the Time-Frequency (TF) domain. We propose to exploit the phase repetitions from one onset frame to another. We introduce a reference phase which characterizes a component independently of its activation times. The onset phases of a component are then modeled as the sum of this reference and an offset which is linearly dependent on the frequency. We derive a complex mixture model within onset frames and we provide two algorithms for the estimation of the model phase parameters. The model is estimated on experimental data and this technique is integrated into an audio source separation framework. The results demonstrate that this model is a promising tool for exploiting phase repetitions, and point out its potential for separating overlap** components in complex mixtures.
△ Less
Submitted 24 May, 2016;
originally announced May 2016.
-
Phase reconstruction of spectrograms with linear unwrap**: application to audio signal restoration
Authors:
Paul Magron,
Roland Badeau,
Bertrand David
Abstract:
This paper introduces a novel technique for reconstructing the phase of modified spectrograms of audio signals. From the analysis of mixtures of sinusoids we obtain relationships between phases of successive time frames in the Time-Frequency (TF) domain. To obtain similar relationships over frequencies, in particular within onset frames, we study an impulse model. Instantaneous frequencies and att…
▽ More
This paper introduces a novel technique for reconstructing the phase of modified spectrograms of audio signals. From the analysis of mixtures of sinusoids we obtain relationships between phases of successive time frames in the Time-Frequency (TF) domain. To obtain similar relationships over frequencies, in particular within onset frames, we study an impulse model. Instantaneous frequencies and attack times are estimated locally to encompass the class of non-stationary signals such as vibratos. These techniques ensure both the vertical coherence of partials (over frequencies) and the horizontal coherence (over time). The method is tested on a variety of data and demonstrates better performance than traditional consistency-based approaches. We also introduce an audio restoration framework and observe that our technique outperforms traditional methods.
△ Less
Submitted 24 May, 2016;
originally announced May 2016.
-
Complex NMF under phase constraints based on signal modeling: application to audio source separation
Authors:
Paul Magron,
Roland Badeau,
Bertrand David
Abstract:
Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of audio signals in the Time-Frequency (TF) domain. In the source separation framework, the phase recovery for each extracted component is necessary for synthesizing time-domain signals. The Complex NMF (CNMF) model aims to jointly estimate the spectrogram and the phase of the sources, but requires to constrain the…
▽ More
Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of audio signals in the Time-Frequency (TF) domain. In the source separation framework, the phase recovery for each extracted component is necessary for synthesizing time-domain signals. The Complex NMF (CNMF) model aims to jointly estimate the spectrogram and the phase of the sources, but requires to constrain the phase in order to produce satisfactory sounding results. We propose to incorporate phase constraints based on signal models within the CNMF framework: a \textit{phase unwrap**} constraint that enforces a form of temporal coherence, and a constraint based on the \textit{repetition} of audio events, which models the phases of the sources within onset frames. We also provide an algorithm for estimating the model parameters. The experimental results highlight the interest of including such constraints in the CNMF framework for separating overlap** components in complex audio mixtures.
△ Less
Submitted 24 May, 2016;
originally announced May 2016.