Showing 1–50 of 66 results for author: Alameda-Pineda, X

  1. arXiv:2405.18839  [pdf, other]

    cs.CV

    MEGA: Masked Generative Autoencoder for Human Mesh Recovery

    Authors: Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer

    Abstract: Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; howeve… ▽ More

    Submitted 31 May, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  2. arXiv:2404.07560  [pdf, other]

    cs.RO cs.AI

    Socially Pertinent Robots in Gerontological Healthcare

    Authors: Xavier Alameda-Pineda, Angus Addlesee, Daniel Hernández García, Chris Reinke, Soraya Arias, Federica Arrigoni, Alex Auternaud, Lauriane Blavette, Cigdem Beyan, Luis Gomez Camara, Ohad Cohen, Alessandro Conti, Sébastien Dacunha, Christian Dondrup, Yoav Ellinson, Francesco Ferro, Sharon Gannot, Florian Gras, Nancie Gunson, Radu Horaud, Moreno D'Incà, Imad Kimouche, Séverin Lemaignan, Oliver Lemon, Cyril Liotard, et al. (19 additional authors not shown)

    Abstract: Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilitie… ▽ More

    Submitted 11 April, 2024; originally announced April 2024.

  3. arXiv:2312.08291  [pdf, other]

    cs.CV

    VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

    Authors: Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Antonio Agudo, Francesc Moreno-Noguer

    Abstract: Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work in… ▽ More

    Submitted 31 May, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  4. arXiv:2312.04167  [pdf, other]

    cs.LG

    Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation

    Authors: Xiaoyu Lin, Laurent Girin, Xavier Alameda-Pineda

    Abstract: In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discret… ▽ More

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2202.09315

  5. arXiv:2311.16148  [pdf, other]

    cs.NE cs.LG

    Univariate Radial Basis Function Layers: Brain-inspired Deep Neural Layers for Low-Dimensional Inputs

    Authors: Daniel Jost, Basavasagar Patil, Xavier Alameda-Pineda, Chris Reinke

    Abstract: Deep Neural Networks (DNNs) became the standard tool for function approximation with most of the introduced architectures being developed for high-dimensional input data. However, many real-world problems have low-dimensional inputs for which standard Multi-Layer Perceptrons (MLPs) are the default choice. An investigation into specialized architectures is missing. We propose a novel DNN layer call… ▽ More

    Submitted 2 February, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

  6. arXiv:2308.09610  [pdf, other]

    cs.CV

    On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers

    Authors: Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci

    Abstract: State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts, drastically reducing catastrophic forgetting. However, there is a tradeoff between the number of learned parameters and the performance, making such models computationally expensive. In this work, we aim to reduce this cost while maintaining competitive perfor… ▽ More

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: In The First Workshop on Visual Continual Learning (ICCVW 2023); Oral

  7. arXiv:2307.03270  [pdf, other]

    cs.GR cs.CV cs.LG cs.SD eess.AS

    A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

    Authors: Louis Airale, Dominique Vaufreydaz, Xavier Alameda-Pineda

    Abstract: Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a… ▽ More

    Submitted 4 July, 2023; originally announced July 2023.

  8. arXiv:2306.07820  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised speech enhancement with deep dynamical generative speech and noise models

    Authors: Xiaoyu Lin, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda

    Abstract: This work builds on a previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both. This DDGM can… ▽ More

    Submitted 13 June, 2023; originally announced June 2023.

  9. arXiv:2306.07483  [pdf, other]

    cs.CV

    Semi-supervised learning made simple with self-supervised clustering

    Authors: Enrico Fini, Pietro Astolfi, Karteek Alahari, Xavier Alameda-Pineda, Julien Mairal, Moin Nabi, Elisa Ricci

    Abstract: Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations. However, in many real-world scenarios, labels are partially available, motivating a recent line of work on semi-supervised methods inspired by self-supervised principles. In this paper, we propose a conceptually simple yet empirically powerful approach to turn clustering-based… ▽ More

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 - Code available at https://github.com/pietroastolfi/suave-daino

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023) 3187-3197

  10. arXiv:2306.05846  [pdf, other]

    cs.CV cs.AI

    Motion-DVAE: Unsupervised learning for fast human motion denoising

    Authors: Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Renaud Séguier

    Abstract: Pose and motion priors are crucial for recovering realistic and accurate human motion from noisy observations. Substantial progress has been made on pose and shape estimation from images, and recent works showed impressive results using priors to refine frame-wise predictions. However, a lot of motion priors only model transitions between consecutive poses and are used in time-consuming optimizati… ▽ More

    Submitted 30 November, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

  11. arXiv:2305.03582  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    A multimodal dynamical variational autoencoder for audiovisual speech representation learning

    Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

    Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an… ▽ More

    Submitted 20 February, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 14 figures, https://samsad35.github.io/site-mdvae/

  12. arXiv:2303.09404  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Modeling with a Hierarchical Transformer Dynamical VAE

    Authors: Xiaoyu Lin, Xiaoyu Bie, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda

    Abstract: The dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a corresponding sequence of latent vectors. In almost all the DVAEs of the literature, the temporal dependencies within each sequence and across the two sequences are modeled with recurrent neural networks. In this paper, we propose to… ▽ More

    Submitted 10 May, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  13. arXiv:2211.00990  [pdf, ps, other]

    cs.SD cs.CV cs.LG eess.AS

    A weighted-variance variational autoencoder model for speech enhancement

    Authors: Ali Golmakani, Mostafa Sadeghi, Xavier Alameda-Pineda, Romain Serizel

    Abstract: We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. In contrast to this commonly used approach, we propose a weigh… ▽ More

    Submitted 26 October, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  14. arXiv:2211.00987  [pdf, other]

    cs.CV

    Autoregressive GAN for Semantic Unconditional Head Motion Generation

    Authors: Louis Airale, Xavier Alameda-Pineda, Stéphane Lathuilière, Dominique Vaufreydaz

    Abstract: In this work, we address the task of unconditional head motion generation to animate still human faces in a low-dimensional semantic space from a single reference pose. Different from traditional audio-conditioned talking head generation that seldom puts emphasis on realistic head motions, we devise a GAN-based architecture that learns to synthesize rich head motion sequences over long duration wh… ▽ More

    Submitted 17 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  15. arXiv:2207.01567  [pdf, other]

    cs.CV cs.AI

    Back to MLP: A Simple Baseline for Human Motion Prediction

    Authors: Wen Guo, Yuming Du, Xi Shen, Vincent Lepetit, Xavier Alameda-Pineda, Francesc Moreno-Noguer

    Abstract: This paper tackles the problem of human motion prediction, consisting in forecasting future body poses from historically observed sequences. State-of-the-art approaches provide good results, however, they rely on deep learning architectures of arbitrary complexity, such as Recurrent Neural Networks (RNN), Transformers or Graph Convolutional Networks (GCN), typically requiring multiple training stage… ▽ More

    Submitted 5 October, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Accepted to WACV 2023; Code available at https://github.com/dulucas/siMLPe

  16. arXiv:2206.03211  [pdf, other]

    cs.RO cs.AI

    Variational Meta Reinforcement Learning for Social Robotics

    Authors: Anand Ballou, Xavier Alameda-Pineda, Chris Reinke

    Abstract: With the increasing presence of robots in our every-day environments, improving their social skills is of utmost importance. Nonetheless, social robotics still faces many challenges. One bottleneck is that robotic behaviors need to be often adapted as social norms depend strongly on the environment. For example, a robot should navigate more carefully around patients in a hospital compared to worke… ▽ More

    Submitted 3 August, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

    Comments: 16 pages, 15 figures

  17. arXiv:2204.12366  [pdf, other]

    cs.MM

    Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining

    Authors: Hanyu Xuan, Yihong Xu, Shuo Chen, Zhiliang Wu, Jian Yang, Yan Yan, Xavier Alameda-Pineda

    Abstract: The recent success of audio-visual representation learning can be largely attributed to their pervasive property of audio-visual synchronization, which can be used as self-annotated supervision. As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm. Existing AVID methods construct the contrastive set by random sampling… ▽ More

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 7 pages, 4 figures, accepted at IJCAI 2022

  18. Learning and controlling the source-filter representation of speech with a variational autoencoder

    Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

    Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspiring from the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent facto… ▽ More

    Submitted 21 March, 2023; v1 submitted 14 April, 2022; originally announced April 2022.

    Comments: 23 pages, 7 figures, companion website: https://samsad35.github.io/site-sfvae/

    Journal ref: Speech Communication, vol. 148, 2023

  19. arXiv:2204.02810  [pdf, other]

    cs.CV cs.SD eess.AS

    Expression-preserving face frontalization improves visually assisted speech processing

    Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda

    Abstract: Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a frontalization methodology that preserves non-rigid facial deformations in order to boost the performance of visually assisted speech communication. The method alternates between the estimation of (i) the rigid transformation (scale, rotation, and translatio… ▽ More

    Submitted 15 December, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: arXiv admin note: text overlap with arXiv:2202.00538

    Journal ref: International Journal of Computer Vision 131 (5), 1122-1140, 2023

  20. arXiv:2204.01565  [pdf, other]

    cs.CV

    HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE

    Authors: Xiaoyu Bie, Wen Guo, Simon Leglaive, Laurent Girin, Francesc Moreno-Noguer, Xavier Alameda-Pineda

    Abstract: Studies on the automatic processing of 3D human pose data have flourished in the recent past. In this paper, we are interested in the generation of plausible and diverse future human poses following an observed 3D pose sequence. Current methods address this problem by injecting random variables from a single latent space into a deterministic motion prediction framework, which precludes the inheren… ▽ More

    Submitted 4 April, 2022; originally announced April 2022.

  21. Uncertainty-aware Contrastive Distillation for Incremental Semantic Segmentation

    Authors: Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Moin Nabi, Xavier Alameda-Pineda, Elisa Ricci

    Abstract: A fundamental and challenging problem in deep learning is catastrophic forgetting, i.e. the tendency of neural networks to fail to preserve the knowledge acquired from old tasks when learning new tasks. This problem has been widely investigated in the research community and several Incremental Learning (IL) approaches have been proposed in the past years. While earlier works in computer vision hav… ▽ More

    Submitted 20 May, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

    Comments: TPAMI

  22. arXiv:2202.09315  [pdf, other]

    cs.LG cs.CV

    Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder

    Authors: Xiaoyu Lin, Laurent Girin, Xavier Alameda-Pineda

    Abstract: In this paper, we present an unsupervised probabilistic model and associated estimation algorithm for multi-object tracking (MOT) based on a dynamical variational autoencoder (DVAE), called DVAE-UMOT. The DVAE is a latent-variable deep generative model that can be seen as an extension of the variational autoencoder for the modeling of temporal sequences. It is included in DVAE-UMOT to model the ob… ▽ More

    Submitted 21 February, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

  23. arXiv:2202.00538  [pdf, other]

    cs.SD cs.CV eess.AS

    The impact of removing head movements on audio-visual speech enhancement

    Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar

    Abstract: This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although being a common conversational feature, head movements have been ignored by past and recent studies: they challenge today's learning-based methods as they often degrade the performance of models that are trained on clean, frontal, and steady face images. To alleviate this problem, we propose to… ▽ More

    Submitted 2 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

  24. arXiv:2202.00432  [pdf, other]

    cs.CV

    Continual Attentive Fusion for Incremental Learning in Semantic Segmentation

    Authors: Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Hao Tang, Xavier Alameda-Pineda, Elisa Ricci

    Abstract: Over the past years, semantic segmentation, as many other tasks in computer vision, benefited from the progress in deep neural networks, resulting in significantly improved performance. However, deep architectures trained with gradient-based techniques suffer from catastrophic forgetting, which is the tendency to forget previously learned knowledge while learning new tasks. Aiming at devising stra… ▽ More

    Submitted 1 February, 2022; originally announced February 2022.

  25. arXiv:2112.04215  [pdf, other]

    cs.CV cs.LG

    Self-Supervised Models are Continual Learners

    Authors: Enrico Fini, Victor G. Turrisi da Costa, Xavier Alameda-Pineda, Elisa Ricci, Karteek Alahari, Julien Mairal

    Abstract: Self-supervised models have been shown to produce comparable or better visual representations than their supervised counterparts when trained offline on unlabeled data at scale. However, their efficacy is catastrophically reduced in a Continual Learning (CL) scenario where data is presented to the model sequentially. In this paper, we show that self-supervised loss functions can be seamlessly conv… ▽ More

    Submitted 1 April, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

  26. arXiv:2111.03110  [pdf, other]

    cs.LG cs.AI

    Successor Feature Neural Episodic Control

    Authors: David Emukpere, Xavier Alameda-Pineda, Chris Reinke

    Abstract: A longstanding goal in reinforcement learning is to build intelligent agents that show fast learning and a flexible transfer of skills akin to humans and animals. This paper investigates the integration of two frameworks for tackling those goals: episodic control and successor features. Episodic control is a cognitively inspired approach relying on episodic memory, an instance-based memory model o… ▽ More

    Submitted 2 August, 2023; v1 submitted 4 November, 2021; originally announced November 2021.

  27. arXiv:2110.15701  [pdf, other]

    cs.LG cs.AI

    Successor Feature Representations

    Authors: Chris Reinke, Xavier Alameda-Pineda

    Abstract: Transfer in Reinforcement Learning aims to improve learning performance on target tasks using knowledge from experienced source tasks. Successor Representations (SR) and their extension Successor Features (SF) are prominent transfer mechanisms in domains where reward functions change between tasks. They reevaluate the expected return of previously learned policies in a new target task to transfer… ▽ More

    Submitted 2 August, 2023; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: published in Transactions on Machine Learning Research (05/2023), source code: https://gitlab.inria.fr/robotlearn/sfr_learning, [v2] added experiments with learned features, [v3] renamed paper and changed scope, [v4] published version

  28. arXiv:2106.12271  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders

    Authors: Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin

    Abstract: Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to model time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech sp… ▽ More

    Submitted 30 September, 2022; v1 submitted 23 June, 2021; originally announced June 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2993-3007, 2022

  29. arXiv:2106.06500  [pdf, ps, other]

    cs.SD eess.AS

    A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling

    Authors: Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda

    Abstract: The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data, th… ▽ More

    Submitted 14 June, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2008.12595

  30. arXiv:2105.08825  [pdf, other]

    cs.CV

    Multi-Person Extreme Motion Prediction

    Authors: Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda, Francesc Moreno-Noguer

    Abstract: Human motion prediction aims to forecast future poses given a sequence of past 3D skeletons. While this problem has recently received increasing attention, it has mostly been tackled for single humans in isolation. In this paper, we explore this problem when dealing with humans performing collaborative tasks, we seek to predict the future motion of two interacted persons given two sequences of the… ▽ More

    Submitted 19 June, 2022; v1 submitted 18 May, 2021; originally announced May 2021.

    Comments: CVPR 2022, update results of MSR in Table 3

  31. arXiv:2103.15145  [pdf, other]

    cs.CV

    TransCenter: Transformers with Dense Representations for Multiple-Object Tracking

    Authors: Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, Xavier Alameda-Pineda

    Abstract: Transformers have proven superior performance for a wide variety of tasks since they were introduced. In recent years, they have drawn attention from the vision community in tasks such as image classification and object detection. Despite this wave, an accurate and efficient multiple-object tracking (MOT) method based on transformers is yet to be designed. We argue that the direct application of a… ▽ More

    Submitted 30 September, 2022; v1 submitted 28 March, 2021; originally announced March 2021.

    Comments: 17 pages, 10 figures, updated results and add comparisons

  32. SocialInteractionGAN: Multi-person Interaction Sequence Generation

    Authors: Louis Airale, Dominique Vaufreydaz, Xavier Alameda-Pineda

    Abstract: Prediction of human actions in social interactions has important applications in the design of social robots or artificial avatars. In this paper, we focus on a unimodal representation of interactions and propose to tackle interaction generation in a data-driven fashion. In particular, we model human interaction generation as a discrete multi-sequence generation problem and present SocialInteracti… ▽ More

    Submitted 12 September, 2022; v1 submitted 10 March, 2021; originally announced March 2021.

    Comments: IEEE Transactions on Affective Computing, Institute of Electrical and Electronics Engineers, 2022

  33. arXiv:2103.03510  [pdf, other]

    cs.CV

    Variational Structured Attention Networks for Deep Visual Representation Learning

    Authors: Guanglei Yang, Paolo Rota, Xavier Alameda-Pineda, Dan Xu, Mingli Ding, Elisa Ricci

    Abstract: Convolutional neural networks have enabled major progresses in addressing pixel-level prediction tasks such as semantic segmentation, depth estimation, surface normal prediction and so on, benefiting from their powerful capabilities in visual representation learning. Typically, state of the art models integrate attention mechanisms for improved deep feature representations. Recently, some works ha… ▽ More

    Submitted 15 December, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: Accepted at IEEE Transactions on Image Processing (TIP)

  34. arXiv:2102.04144  [pdf, ps, other]

    eess.AS cs.CV cs.SD

    Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement

    Authors: Mostafa Sadeghi, Xavier Alameda-Pineda

    Abstract: Recently, audio-visual speech enhancement has been tackled in the unsupervised settings based on variational auto-encoders (VAEs), where during training only clean data is used to train a generative model for speech, which at test time is combined with a noise model, e.g. nonnegative matrix factorization (NMF), whose parameters are learned without supervision. Consequently, the proposed model is a… ▽ More

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  35. arXiv:2101.02843  [pdf, other]

    cs.CV eess.IV

    Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

    Authors: Dan Xu, Xavier Alameda-Pineda, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe

    Abstract: Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e. structured multi-scale features learning and fusion. In contrast to previous works directly considering multi-sc… ▽ More

    Submitted 13 March, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

    Comments: Regular paper accepted at TPAMI 2020. arXiv admin note: text overlap with arXiv:1801.00524

  36. arXiv:2010.05302  [pdf, other]

    cs.CV

    PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation

    Authors: Wen Guo, Enric Corona, Francesc Moreno-Noguer, Xavier Alameda-Pineda

    Abstract: Recent literature addressed the monocular 3D pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent pose instances to estimate. However, in many every-day situations, people are interacting, and the pose of an individual depends on the pose of his/her interactees. In this paper, we investigate how to exploit this dependency to enhance curre… ▽ More

    Submitted 11 October, 2020; originally announced October 2020.

    Comments: Accepted at WACV 2021

  37. arXiv:2008.12595  [pdf, other]

    cs.LG stat.ML

    Dynamical Variational Autoencoders: A Comprehensive Review

    Authors: Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, Xavier Alameda-Pineda

    Abstract: Variational autoencoders (VAEs) are powerful deep generative models widely used to represent high-dimensional complex data through a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, the input data vectors are processed independently. Recently, a series of papers have presented different extensions of the VAE to process sequential data, which model not only… ▽ More

    Submitted 4 July, 2022; v1 submitted 28 August, 2020; originally announced August 2020.

    Journal ref: Foundations and Trends in Machine Learning, Vol. 15, No. 1-2, pp 1-175, 2021

  38. arXiv:2008.07191  [pdf, other]

    eess.AS cs.LG cs.SD

    Deep Variational Generative Models for Audio-visual Speech Separation

    Authors: Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, Xavier Alameda-Pineda

    Abstract: In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lips movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a… ▽ More

    Submitted 31 August, 2021; v1 submitted 17 August, 2020; originally announced August 2020.

    Comments: Accepted to the 31st IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Oct. 25-28, 2021, Gold Coast, Queensland, Australia

  39. arXiv:2008.04200  [pdf, other]

    cs.CV cs.CL

    Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach

    Authors: Yahui Liu, Marco De Nadai, Deng Cai, Huayang Li, Xavier Alameda-Pineda, Nicu Sebe, Bruno Lepri

    Abstract: Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to… ▽ More

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: Submitted to ACM MM '20, October 12-16, 2020, Seattle, WA, USA

  40. arXiv:2006.01668  [pdf, other]

    cs.LG cs.CV stat.ML

    Variational Inference and Learning of Piecewise-linear Dynamical Systems

    Authors: Xavier Alameda-Pineda, Vincent Drouard, Radu Horaud

    Abstract: Modeling the temporal behavior of data is of primordial importance in many scientific and engineering fields. Baseline methods assume that both the dynamic and observation equations follow linear-Gaussian models. However, there are many real-world processes that cannot be characterized by a single linear behavior. Alternatively, it is possible to consider a piecewise-linear model which, combined w… ▽ More

    Submitted 2 November, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

    Comments: Submitted to IEEE Transactions on Neural Networks and Learning Systems

  41. Unsupervised Performance Analysis of 3D Face Alignment with a Statistically Robust Confidence Test

    Authors: Mostafa Sadeghi, Xavier Alameda-Pineda, Radu Horaud

    Abstract: This paper addresses the problem of analysing the performance of 3D face alignment (3DFA), or facial landmark localization. This task is usually supervised, based on annotated datasets. Nevertheless, in the particular case of 3DFA, the annotation process is rarely error-free, which strongly biases the results. Alternatively, unsupervised performance analysis (UPA) is investigated. The core ingredi…

    Submitted 30 October, 2023; v1 submitted 14 April, 2020; originally announced April 2020.

    Journal ref: Neurocomputing 564, 2024

  42. arXiv:2003.06788  [pdf, other]

    cs.CV

    GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling

    Authors: Yahui Liu, Marco De Nadai, Jian Yao, Nicu Sebe, Bruno Lepri, Xavier Alameda-Pineda

    Abstract: Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem…

    Submitted 21 March, 2020; v1 submitted 15 March, 2020; originally announced March 2020.

    Comments: 27 pages, 17 figures

  43. arXiv:1912.10647  [pdf, other]

    eess.AS cs.CV cs.SD

    Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

    Authors: Mostafa Sadeghi, Xavier Alameda-Pineda

    Abstract: In this paper, we are interested in unsupervised (unknown noise) audio-visual speech enhancement based on variational autoencoders (VAEs), where the probability distribution of clean speech spectra is simulated using an encoder-decoder architecture. The trained generative model (decoder) is then combined with a noise model at test time to estimate the clean speech. In the speech enhancement phase…

    Submitted 8 March, 2021; v1 submitted 23 December, 2019; originally announced December 2019.

    Comments: IEEE Transactions on Signal Processing

  44. arXiv:1911.03930  [pdf, other]

    eess.AS cs.LG cs.SD

    Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders

    Authors: Mostafa Sadeghi, Xavier Alameda-Pineda

    Abstract: Recently, an audio-visual speech generative model based on variational autoencoder (VAE) has been proposed, which is combined with a nonnegative matrix factorization (NMF) model for noise variance to perform unsupervised speech enhancement. When visual data is clean, speech enhancement with audio-visual VAE shows a better performance than with audio-only VAE, which is trained on audio-only data. H…

    Submitted 10 November, 2019; originally announced November 2019.

  45. arXiv:1910.10942  [pdf, other]

    cs.LG cs.AI cs.NE cs.SD eess.AS

    A Recurrent Variational Autoencoder for Speech Enhancement

    Authors: Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

    Abstract: This paper presents a generative approach to speech enhancement based on a recurrent variational autoencoder (RVAE). The deep generative speech model is trained using clean speech signals only, and it is combined with a nonnegative matrix factorization noise model for speech enhancement. We propose a variational expectation-maximization algorithm where the encoder of the RVAE is fine-tuned at test…

    Submitted 10 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, Barcelona, Spain

  46. arXiv:1908.02590  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders

    Authors: Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

    Abstract: Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. One advantage of this generative approach is that it does not require pairs of clean and noisy speech signals at training. In…

    Submitted 26 May, 2020; v1 submitted 7 August, 2019; originally announced August 2019.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 28, 2020

  47. arXiv:1906.06618  [pdf, other]

    cs.CV

    How To Train Your Deep Multi-Object Tracker

    Authors: Yihong Xu, Aljosa Osep, Yutong Ban, Radu Horaud, Laura Leal-Taixe, Xavier Alameda-Pineda

    Abstract: The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP).…

    Submitted 23 April, 2020; v1 submitted 15 June, 2019; originally announced June 2019.

    Comments: 14 pages, 9 figures, 6 tables

  48. arXiv:1904.01308  [pdf, other]

    cs.CV

    CANU-ReID: A Conditional Adversarial Network for Unsupervised person Re-IDentification

    Authors: Guillaume Delorme, Yihong Xu, Stephane Lathuilière, Radu Horaud, Xavier Alameda-Pineda

    Abstract: Unsupervised person re-ID is the task of identifying people on a target data set for which the ID labels are unavailable during training. In this paper, we propose to unify two trends in unsupervised person re-ID: clustering & fine-tuning and adversarial learning. On one side, clustering groups training images into pseudo-ID labels, and uses them to fine-tune the feature extractor. On the other si…

    Submitted 28 April, 2020; v1 submitted 2 April, 2019; originally announced April 2019.

  49. Tracking Multiple Audio Sources with the von Mises Distribution and Variational EM

    Authors: Yutong Ban, Xavier Alameda-Pineda, Christine Evers, Radu Horaud

    Abstract: In this paper we address the problem of simultaneously tracking several moving audio sources, namely the problem of estimating source trajectories from a sequence of observed features. We propose to use the von Mises distribution to model audio-source directions of arrival with circular random variables. This leads to a Bayesian filtering formulation which is intractable because of the combinatori…

    Submitted 10 April, 2019; v1 submitted 19 December, 2018; originally announced December 2018.

    Comments: IEEE Signal Processing Letters, 2019

  50. arXiv:1812.04417  [pdf, other]

    cs.SD eess.AS

    A cascaded multiple-speaker localization and tracking system

    Authors: Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud

    Abstract: This paper presents an online multiple-speaker localization and tracking method, as the INRIA-Perception contribution to the LOCATA Challenge 2018. First, the recursive least-square method is used to adaptively estimate the direct-path relative transfer function as an interchannel localization feature. The feature is assumed to associate with a single speaker at each time-frequency bin. Second, a…

    Submitted 11 December, 2018; originally announced December 2018.

    Comments: In Proceedings of the LOCATA Challenge Workshop - a satellite event of IWAENC 2018 (arXiv:1811.08482)

    Report number: LOCATAchallenge/2018/06