Showing 1–40 of 40 results for author: Girin, L

Searching in archive cs.
  1. arXiv:2405.20101  [pdf, other]

    cs.SD cs.CL eess.AS

    Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting

    Authors: Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber

    Abstract: Most speech self-supervised learning (SSL) models are trained with a pretext task which consists in predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In th…

    Submitted 30 May, 2024; originally announced May 2024.

  2. arXiv:2312.04167  [pdf, other]

    cs.LG

    Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation

    Authors: Xiaoyu Lin, Laurent Girin, Xavier Alameda-Pineda

    Abstract: In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discret…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2202.09315

  3. arXiv:2306.07820  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised speech enhancement with deep dynamical generative speech and noise models

    Authors: Xiaoyu Lin, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda

    Abstract: This work builds on a previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both. This DDGM can…

    Submitted 13 June, 2023; originally announced June 2023.

  4. arXiv:2305.03582  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    A multimodal dynamical variational autoencoder for audiovisual speech representation learning

    Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

    Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an…

    Submitted 20 February, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 14 figures, https://samsad35.github.io/site-mdvae/

  5. arXiv:2303.09404  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Modeling with a Hierarchical Transformer Dynamical VAE

    Authors: Xiaoyu Lin, Xiaoyu Bie, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda

    Abstract: The dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a corresponding sequence of latent vectors. In almost all the DVAEs of the literature, the temporal dependencies within each sequence and across the two sequences are modeled with recurrent neural networks. In this paper, we propose to…

    Submitted 10 May, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  6. arXiv:2207.01718  [pdf, other]

    cs.CL eess.AS

    BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model

    Authors: Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber

    Abstract: Several recent studies have tested the use of transformer language model representations to infer prosodic features for text-to-speech synthesis (TTS). While these studies have explored prosody in general, in this work, we look specifically at the prediction of contrastive focus on personal pronouns. This is a particularly challenging task as it often requires semantic, discursive and/or pragmatic…

    Submitted 4 July, 2022; originally announced July 2022.

    Comments: 5 pages

  7. Learning and controlling the source-filter representation of speech with a variational autoencoder

    Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

    Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent facto…

    Submitted 21 March, 2023; v1 submitted 14 April, 2022; originally announced April 2022.

    Comments: 23 pages, 7 figures, companion website: https://samsad35.github.io/site-sfvae/

    Journal ref: Speech Communication, vol. 148, 2023

  8. arXiv:2204.02269  [pdf, other]

    cs.SD cs.CL eess.AS

    Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation

    Authors: Marc-Antoine Georges, Julien Diard, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber

    Abstract: We propose a computational model of speech production combining a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, a DNN-based internal forward model predicting the sensory consequences of articulatory commands, and an internal inverse model based on a recurrent neural network recovering articulatory c…

    Submitted 5 April, 2022; originally announced April 2022.

  9. arXiv:2204.01565  [pdf, other]

    cs.CV

    HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE

    Authors: Xiaoyu Bie, Wen Guo, Simon Leglaive, Laurent Girin, Francesc Moreno-Noguer, Xavier Alameda-Pineda

    Abstract: Studies on the automatic processing of 3D human pose data have flourished in the recent past. In this paper, we are interested in the generation of plausible and diverse future human poses following an observed 3D pose sequence. Current methods address this problem by injecting random variables from a single latent space into a deterministic motion prediction framework, which precludes the inheren…

    Submitted 4 April, 2022; originally announced April 2022.

  10. arXiv:2202.09315  [pdf, other]

    cs.LG cs.CV

    Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder

    Authors: Xiaoyu Lin, Laurent Girin, Xavier Alameda-Pineda

    Abstract: In this paper, we present an unsupervised probabilistic model and associated estimation algorithm for multi-object tracking (MOT) based on a dynamical variational autoencoder (DVAE), called DVAE-UMOT. The DVAE is a latent-variable deep generative model that can be seen as an extension of the variational autoencoder for the modeling of temporal sequences. It is included in DVAE-UMOT to model the ob…

    Submitted 21 February, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

  11. arXiv:2109.03465  [pdf, other]

    cs.SD cs.LG eess.AS

    A Survey of Sound Source Localization with Deep Learning Methods

    Authors: Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, Alexandre Guérin

    Abstract: This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network…

    Submitted 17 June, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

    Comments: Accepted for publication in The Journal of the Acoustical Society of America

  12. arXiv:2107.11066  [pdf, other]

    cs.SD eess.AS

    SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain

    Authors: Pierre-Amaury Grumiaux, Srdan Kitic, Prerak Srivastava, Laurent Girin, Alexandre Guérin

    Abstract: In this work, we propose a novel self-attention based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network, we investigate the benefit of replacing the recurrent layers by self-attention encoders, inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data,…

    Submitted 23 July, 2021; originally announced July 2021.

    Comments: Accepted to Workshop on Applications of Signal Processing to Audio and Acoustics

  13. arXiv:2106.12271  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders

    Authors: Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin

    Abstract: Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to model time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech sp…

    Submitted 30 September, 2022; v1 submitted 23 June, 2021; originally announced June 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2993-3007, 2022

  14. arXiv:2106.06500  [pdf, ps, other]

    cs.SD eess.AS

    A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling

    Authors: Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda

    Abstract: The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data, th…

    Submitted 14 June, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2008.12595

  15. arXiv:2105.01897  [pdf, other]

    cs.SD eess.AS

    Improved feature extraction for CRNN-based multiple sound source localization

    Authors: Pierre-Amaury Grumiaux, Srdan Kitic, Laurent Girin, Alexandre Guérin

    Abstract: In this work, we propose to extend a state-of-the-art multi-source localization system based on a convolutional recurrent neural network and Ambisonics signals. We significantly improve the performance of the baseline network by changing the layout between convolutional and pooling layers. We propose several configurations with more convolutional layers and smaller pooling sizes in-between, so tha…

    Submitted 5 May, 2021; originally announced May 2021.

    Comments: 5 pages, 2 figures. Accepted to EUSIPCO 2021

  16. arXiv:2104.03204  [pdf, other]

    cs.SD cs.CL eess.AS

    Learning robust speech representation with an articulatory-regularized variational autoencoder

    Authors: Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber

    Abstract: It is increasingly considered that human speech perception and production both rely on articulatory representations. In this paper, we investigate whether this type of representation could improve the performances of a deep generative model (here a variational autoencoder) trained to encode and decode acoustic speech features. First we develop an articulatory model able to associate articulatory p…

    Submitted 7 April, 2021; originally announced April 2021.

  17. arXiv:2102.09914  [pdf, other]

    cs.CL eess.AS

    Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input

    Authors: Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier

    Abstract: The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We c…

    Submitted 15 June, 2021; v1 submitted 19 February, 2021; originally announced February 2021.

    Comments: 4 pages

  18. arXiv:2101.01977  [pdf, other]

    cs.SD eess.AS

    Multichannel CRNN for Speaker Counting: an Analysis of Performance

    Authors: Pierre-Amaury Grumiaux, Srdan Kitic, Laurent Girin, Alexandre Guérin

    Abstract: Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least it can be a strong advantage, in addition to enabling a low latency processing. In a previous work…

    Submitted 6 January, 2021; originally announced January 2021.

    Comments: Presented at Forum Acusticum 2020

  19. arXiv:2012.03574  [pdf, other]

    cs.SD cs.RO eess.AS

    Reverberant Sound Localization with a Robot Head Based on Direct-Path Relative Transfer Function

    Authors: Xiaofei Li, Laurent Girin, Fabien Badeig, Radu Horaud

    Abstract: This paper addresses the problem of sound-source localization (SSL) with a robot head, which remains a challenge in real-world environments. In particular we are interested in locating speech sources, as they are of high interest for human-robot interaction. The microphone-pair response corresponding to the direct-path sound propagation is a function of the source direction. In practice, this resp…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems,

  20. arXiv:2009.02035  [pdf, other]

    eess.AS cs.CL

    What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

    Authors: Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber

    Abstract: In incremental text to speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e. when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this…

    Submitted 4 September, 2020; originally announced September 2020.

    Comments: 5 pages, 4 figures

  21. arXiv:2008.12595  [pdf, other]

    cs.LG stat.ML

    Dynamical Variational Autoencoders: A Comprehensive Review

    Authors: Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, Xavier Alameda-Pineda

    Abstract: Variational autoencoders (VAEs) are powerful deep generative models widely used to represent high-dimensional complex data through a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, the input data vectors are processed independently. Recently, a series of papers have presented different extensions of the VAE to process sequential data, which model not only…

    Submitted 4 July, 2022; v1 submitted 28 August, 2020; originally announced August 2020.

    Journal ref: Foundations and Trends in Machine Learning, Vol. 15, No. 1-2, pp 1-175, 2021

  22. arXiv:2003.07839  [pdf, other]

    cs.SD eess.AS

    High-Resolution Speaker Counting In Reverberant Rooms Using CRNN With Ambisonics Features

    Authors: Pierre-Amaury Grumiaux, Srdjan Kitic, Laurent Girin, Alexandre Guérin

    Abstract: Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least it can be a strong advantage, in addition to enabling a low latency processing. For that purpose,…

    Submitted 17 March, 2020; originally announced March 2020.

    Comments: 5 pages, 1 figure

  23. arXiv:1910.10942  [pdf, other]

    cs.LG cs.AI cs.NE cs.SD eess.AS

    A Recurrent Variational Autoencoder for Speech Enhancement

    Authors: Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

    Abstract: This paper presents a generative approach to speech enhancement based on a recurrent variational autoencoder (RVAE). The deep generative speech model is trained using clean speech signals only, and it is combined with a nonnegative matrix factorization noise model for speech enhancement. We propose a variational expectation-maximization algorithm where the encoder of the RVAE is fine-tuned at test…

    Submitted 10 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, Barcelona, Spain

  24. arXiv:1908.02590  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders

    Authors: Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

    Abstract: Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. One advantage of this generative approach is that it does not require pairs of clean and noisy speech signals at training. In…

    Submitted 26 May, 2020; v1 submitted 7 August, 2019; originally announced August 2019.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 28, 2020

  25. Expectation-Maximization for Speech Source Separation Using Convolutive Transfer Function

    Authors: Xiaofei Li, Laurent Girin, Radu Horaud

    Abstract: This paper addresses the problem of under-determined speech source separation from multichannel microphone signals, i.e. the convolutive mixtures of multiple sources. The time-domain signals are first transformed to the short-time Fourier transform (STFT) domain. To represent the room filters in the STFT domain, instead of the widely-used narrowband assumption, we propose to use a more accurate m…

    Submitted 10 April, 2019; originally announced April 2019.

    Journal ref: CAAI Transactions on Intelligent Technologies, 2019

  26. arXiv:1904.05166  [pdf, other]

    eess.SP cs.SD eess.AS

    Audio-noise Power Spectral Density Estimation Using Long Short-term Memory

    Authors: Xiaofei Li, Simon Leglaive, Laurent Girin, Radu Horaud

    Abstract: We propose a method using a long short-term memory (LSTM) network to estimate the noise power spectral density (PSD) of single-channel audio signals represented in the short time Fourier transform (STFT) domain. An LSTM network common to all frequency bands is trained, which processes each frequency band individually by mapping the noisy STFT magnitude sequence to its corresponding noise PSD seque…

    Submitted 10 April, 2019; originally announced April 2019.

    Comments: Submitted to IEEE Signal Processing Letters

    Journal ref: IEEE Signal Processing Letters, 2019, 26 (6), 918-922

  27. arXiv:1902.03926  [pdf, other]

    cs.SD eess.AS stat.ML

    Speech enhancement with variational autoencoders and alpha-stable distributions

    Authors: Simon Leglaive, Umut Simsekli, Antoine Liutkus, Laurent Girin, Radu Horaud

    Abstract: This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge of the noisy recording environment. In this context, our contribution is to propose a noise model based on alpha-stable distributions, inste…

    Submitted 8 February, 2019; originally announced February 2019.

    Comments: 5 pages, 3 figures, audio examples and code available online: https://team.inria.fr/perception/research/icassp2019-asvae/. arXiv admin note: text overlap with arXiv:1811.06713

    Report number: hal-02005106

    Journal ref: IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 541-545

  28. arXiv:1902.01605  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    A variance modeling framework based on variational autoencoders for speech enhancement

    Authors: Simon Leglaive, Laurent Girin, Radu Horaud

    Abstract: In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the co…

    Submitted 5 February, 2019; originally announced February 2019.

    Comments: 6 pages, 3 figures

    Report number: hal-01832826v1

    Journal ref: Proc. of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, Denmark, September 2018

  29. Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering

    Authors: Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud

    Abstract: This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, and for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel CTFs are adaptively identified based on the cross-…

    Submitted 9 November, 2020; v1 submitted 20 December, 2018; originally announced December 2018.

    Journal ref: ACM/IEEE Transactions on Audio, Speech, and Language Processing, 27(9) 2019

  30. arXiv:1812.04417  [pdf, other]

    cs.SD eess.AS

    A cascaded multiple-speaker localization and tracking system

    Authors: Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud

    Abstract: This paper presents an online multiple-speaker localization and tracking method, as the INRIA-Perception contribution to the LOCATA Challenge 2018. First, the recursive least-square method is used to adaptively estimate the direct-path relative transfer function as an interchannel localization feature. The feature is assumed to associate with a single speaker at each time-frequency bin. Second, a…

    Submitted 11 December, 2018; originally announced December 2018.

    Comments: In Proceedings of the LOCATA Challenge Workshop - a satellite event of IWAENC 2018 (arXiv:1811.08482)

    Report number: LOCATAchallenge/2018/06

  31. arXiv:1811.06713  [pdf, other]

    cs.SD eess.AS stat.ML

    Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization

    Authors: Simon Leglaive, Laurent Girin, Radu Horaud

    Abstract: In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recordin…

    Submitted 30 April, 2019; v1 submitted 16 November, 2018; originally announced November 2018.

    Comments: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae/

    Report number: hal-02005102

    Journal ref: IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 101-105

  32. arXiv:1809.10961  [pdf, other]

    cs.CV cs.MM stat.ML

    Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

    Authors: Yutong Ban, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

    Abstract: In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status -- e…

    Submitted 29 October, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

  33. Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments

    Authors: Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud

    Abstract: We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving…

    Submitted 26 February, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

    Comments: IEEE Journal of Selected Topics in Signal Processing, 2019

  34. arXiv:1806.04096  [pdf, other]

    eess.AS cs.SD

    Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models

    Authors: Fanny Roche, Thomas Hueber, Samuel Limier, Laurent Girin

    Abstract: This study investigates the use of non-linear unsupervised dimensionality reduction techniques to compress a music dataset into a low-dimensional representation which can be used in turn for the synthesis of new sounds. We systematically compare (shallow) autoencoders (AEs), deep autoencoders (DAEs), recurrent autoencoders (with Long Short-Term Memory cells -- LSTM-AEs) and variational autoencoder…

    Submitted 24 May, 2019; v1 submitted 11 June, 2018; originally announced June 2018.

    Comments: SMC 2019

  35. Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function

    Authors: Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud

    Abstract: This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform the speech separation and enhancement task in the short-time Fourier transform domain, using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, CTF has far fewer taps, consequently it…

    Submitted 26 February, 2018; v1 submitted 21 November, 2017; originally announced November 2017.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio Speech and Language Processing 27(3), 645-659, 2019

  36. Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization with Spatial Sparsity Regularization

    Authors: Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud

    Abstract: This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the GMM-based objective function, given an observed set of binaural features, both the number…

    Submitted 17 May, 2017; v1 submitted 3 November, 2016; originally announced November 2016.

    Comments: 16 pages, 4 figures, 4 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 25(10), pp 1997 - 2012, October 2017

  37. A Variational EM Algorithm for the Separation of Time-Varying Convolutive Audio Mixtures

    Authors: Dionyssos Kounades-Bastian, Laurent Girin, Xavier Alameda-Pineda, Sharon Gannot, Radu Horaud

    Abstract: This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a K…

    Submitted 15 April, 2016; v1 submitted 15 October, 2015; originally announced October 2015.

    Comments: 13 pages, 4 figures, 2 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(8), 1408-1423, 2016

  38. Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization

    Authors: Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot

    Abstract: This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberations. The direct-path relative transfer functio…

    Submitted 27 June, 2016; v1 submitted 10 September, 2015; originally announced September 2015.

    Comments: 15 pages, 7 figures, 5 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(11), 2171 - 2186, 2016

  39. arXiv:1408.2700  [pdf, other]

    cs.SD cs.MM stat.AP stat.ML

    Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression

    Authors: Antoine Deleforge, Radu Horaud, Yoav Schechner, Laurent Girin

    Abstract: This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation, nor on monaural segregation. The method starts with a training stage that establishes a lo…

    Submitted 15 April, 2016; v1 submitted 12 August, 2014; originally announced August 2014.

    Comments: 15 pages, 8 figures

    Journal ref: IEEE Transactions on Audio, Speech, and Language Processing 23(4), 718-731, April, 2015

  40. arXiv:1402.3689  [pdf, other]

    cs.SD cs.RO

    Sound Representation and Classification Benchmark for Domestic Robots

    Authors: Maxime Janvier, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

    Abstract: We address the problem of sound representation and classification and present results of a comparative study in the context of a domestic robotic scenario. A dataset of sounds was recorded in realistic conditions (background noise, presence of several sound sources, reverberations, etc.) using the humanoid robot NAO. An extended benchmark is carried out to test a variety of representations combine…

    Submitted 15 February, 2014; originally announced February 2014.

    Comments: 8 pages, 2 figures