Showing 1–19 of 19 results for author: Haro, G

  1. arXiv:2306.00489

    cs.SD cs.AI eess.AS

    Speech inpainting: Context-based speech synthesis guided by video

    Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

    Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech23

  2. arXiv:2211.12334

    cs.CV cs.AI cs.MM

    A Graph-Based Method for Soccer Action Spotting Using Unsupervised Player Classification

    Authors: Alejandro Cartas, Coloma Ballester, Gloria Haro

    Abstract: Action spotting in soccer videos is the task of identifying the specific time when a certain key action of the game occurs. Lately, it has received a large amount of attention and powerful methods have been introduced. Action spotting involves understanding the dynamics of the game, the complexity of events, and the variation of video sequences. Most approaches have focused on the latter, given th…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Accepted at the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports 2022)

  3. arXiv:2204.02090

    cs.CV cs.IR cs.SD eess.AS

    VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

    Authors: Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro

    Abstract: In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchr…

    Submitted 30 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Paper accepted to Interspeech 2022; Project Page: https://ipcv.github.io/VocaLiST/

  4. arXiv:2203.04099

    cs.SD cs.CV cs.LG eess.AS

    VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

    Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

    Abstract: This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produ…

    Submitted 19 July, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

    Comments: Accepted to ECCV 2022

  5. arXiv:2106.00359

    cs.LG cs.CV eess.IV

    Learning Football Body-Orientation as a Matter of Classification

    Authors: Adrià Arbués-Sangüesa, Adrián Martín, Paulino Granero, Coloma Ballester, Gloria Haro

    Abstract: Orientation is a crucial skill for football players that becomes a differential factor in a large set of events, especially the ones involving passes. However, existing orientation estimation methods, which are based on computer-vision techniques, still have a lot of room for improvement. To the best of our knowledge, this article presents the first deep learning model for estimating orientation d…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: Accepted in the AI for Sports Analytics Workshop at IJCAI 2021

  6. arXiv:2104.09946

    cs.SD cs.LG eess.AS

    A cappella: Audio-visual Singing Voice Separation

    Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

    Abstract: The task of isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose an audio-visu…

    Submitted 18 October, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

    Comments: Paper accepted at The 32nd British Machine Vision Conference, BMVC 2021

  7. Self-Supervised Small Soccer Player Detection and Tracking

    Authors: Samuel Hurault, Coloma Ballester, Gloria Haro

    Abstract: In a soccer game, the information provided by detecting and tracking brings crucial clues to further analyze and understand some tactical aspects of the game, including individual and team actions. State-of-the-art tracking algorithms achieve impressive results in scenarios on which they have been trained for, but they fail in challenging ones such as soccer games. This is frequently due to the pl…

    Submitted 20 November, 2020; originally announced November 2020.

    Comments: In Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (MMSports '20)

    Journal ref: Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (2020) 9-18

  8. arXiv:2006.07931

    eess.AS cs.DB cs.SD

    Solos: A Dataset for Audio-Visual Music Analysis

    Authors: Juan F. Montesinos, Olga Slizovskaia, Gloria Haro

    Abstract: In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 differ…

    Submitted 6 August, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

    Comments: Rephrased some sentences. Explanation about OpenPose. Minor grammatical errors

  9. arXiv:2004.07209

    cs.CV

    Using Player's Body-Orientation to Model Pass Feasibility in Soccer

    Authors: Adrià Arbués-Sangüesa, Adrián Martín, Javier Fernández, Coloma Ballester, Gloria Haro

    Abstract: Given a monocular video of a soccer match, this paper presents a computational model to estimate the most feasible pass at any given time. The method leverages offensive player's orientation (plus their location) and opponents' spatial configuration to compute the feasibility of pass events within players of the same team. Orientation data is gathered from body pose estimations that are properly p…

    Submitted 15 April, 2020; originally announced April 2020.

    Comments: Accepted at the Computer Vision in Sports Workshop at CVPR 2020

  10. Conditioned Source Separation for Music Instrument Performances

    Authors: Olga Slizovskaia, Gloria Haro, Emilia Gómez

    Abstract: In music source separation, the number of sources may vary for each piece and some of the sources may belong to the same family of instruments, thus sharing timbral characteristics and making the sources more correlated. This leads to additional challenges in the source separation problem. This paper proposes a source separation method for multiple musical instruments sounding simultaneously and e…

    Submitted 7 July, 2021; v1 submitted 8 April, 2020; originally announced April 2020.

    Comments: 14 pages, 5 figures, under review

  11. arXiv:2004.02541

    eess.AS cs.CV cs.LG

    Vocoder-Based Speech Synthesis from Silent Videos

    Authors: Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

    Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and…

    Submitted 15 August, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

    Comments: Accepted to Interspeech 2020

  12. arXiv:2003.10414

    cs.SD cs.IR cs.LG cs.MM

    Multi-channel U-Net for Music Source Separation

    Authors: Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro, Emilia Gómez

    Abstract: A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated for estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separat…

    Submitted 4 September, 2020; v1 submitted 23 March, 2020; originally announced March 2020.

    Comments: The paper has been accepted at IEEE MMSP2020. Project Page: https://vskadandale.github.io/multi-channel-unet

  13. arXiv:2003.00943

    cs.CV

    Always Look on the Bright Side of the Field: Merging Pose and Contextual Data to Estimate Orientation of Soccer Players

    Authors: Adrià Arbués-Sangüesa, Adrián Martín, Javier Fernández, Carlos Rodríguez, Gloria Haro, Coloma Ballester

    Abstract: Although orientation has proven to be a key skill of soccer players in order to succeed in a broad spectrum of plays, body orientation is a yet-little-explored area in sports analytics' research. Despite being an inherently ambiguous concept, player orientation can be defined as the projection (2D) of the normal vector placed in the center of the upper-torso of players (3D). This research presents…

    Submitted 18 May, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

    Comments: Article accepted in the International Conference on Image Processing (ICIP 2020); Appendix was not included in the original manuscript

  14. arXiv:1907.04637

    cs.CV

    Multi-Person tracking by multi-scale detection in Basketball scenarios

    Authors: Adrià Arbués-Sangüesa, Gloria Haro, Coloma Ballester

    Abstract: Tracking data is a powerful tool for basketball teams in order to extract advanced semantic information and statistics that might lead to a performance boost. However, multi-person tracking is a challenging task to solve in single-camera video sequences, given the frequent occlusions and cluttering that occur in a restricted scenario. In this paper, a novel multi-scale detection method is presente…

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: Accepted in IMVIP 2019

  15. arXiv:1907.01813

    cs.SD cs.LG eess.AS

    A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

    Authors: Olga Slizovskaia, Emilia Gómez, Gloria Haro

    Abstract: The explainability of Convolutional Neural Networks (CNNs) is a particularly challenging task in all areas of application, and it is notably under-researched in music and audio domain. In this paper, we approach explainability by exploiting the knowledge we have on hand-crafted audio features. Our study focuses on a well-defined MIR task, the recognition of musical instruments from user-generated…

    Submitted 3 July, 2019; originally announced July 2019.

    Comments: The 2018 Joint Workshop on Machine Learning for Music, The Federated Artificial Intelligence Meeting (FAIM), Joint workshop program of ICML, IJCAI/ECAI, and AAMAS, Stockholm, Sweden, Saturday, July 14th, 2018

  16. arXiv:1906.02042

    cs.CV

    Single-Camera Basketball Tracker through Pose and Semantic Feature Fusion

    Authors: Adrià Arbués-Sangüesa, Coloma Ballester, Gloria Haro

    Abstract: Tracking sports players is a widely challenging scenario, specially in single-feed videos recorded in tight courts, where cluttering and occlusions cannot be avoided. This paper presents an analysis of several geometric and semantic visual features to detect and track basketball players. An ablation study is carried out and then used to remark that a robust tracker can be built with Deep Learning…

    Submitted 10 July, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

    Comments: Accepted in the International Conference on Artificial Intelligence in Sports 2019 (ICAIS)

  17. arXiv:1811.01850

    cs.SD cs.LG eess.AS

    End-to-End Sound Source Separation Conditioned On Instrument Labels

    Authors: Olga Slizovskaia, Leo Kim, Gloria Haro, Emilia Gomez

    Abstract: Can we perform an end-to-end music source separation with a variable number of sources using a deep learning model? We present an extension of the Wave-U-Net model which allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation…

    Submitted 9 May, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: 5 pages, 2 figures, 2 tables, ICASSP 2019

  18. arXiv:1602.08960

    cs.CV

    FALDOI: A new minimization strategy for large displacement variational optical flow

    Authors: Roberto P. Palomares, Enric Meinhardt-Llopis, Coloma Ballester, Gloria Haro

    Abstract: We propose a large displacement optical flow method that introduces a new strategy to compute a good local minimum of any optical flow energy functional. The method requires a given set of discrete matches, which can be extremely sparse, and an energy functional which locally guides the interpolation from those matches. In particular, the matches are used to guide a structured coordinate-descent o…

    Submitted 29 September, 2016; v1 submitted 29 February, 2016; originally announced February 2016.

    MSC Class: 68U10; 49M29; 65K10

  19. A Computational Model for Amodal Completion

    Authors: Maria Oliver, Gloria Haro, Mariella Dimiccoli, Baptiste Mazin, Coloma Ballester

    Abstract: This paper presents a computational model to recover the most likely interpretation of the 3D scene structure from a planar image, where some objects may occlude others. The estimated scene interpretation is obtained by integrating some global and local cues and provides both the complete disoccluded objects that form the scene and their ordering according to depth. Our method first computes sever…

    Submitted 29 March, 2016; v1 submitted 26 November, 2015; originally announced November 2015.