Search | arXiv e-print repository

Speech inpainting: Context-based speech synthesis guided by video

Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment in a way that it is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted in Interspeech23

arXiv:2211.12334 [pdf, other]

doi 10.1145/3552437.3555691

A Graph-Based Method for Soccer Action Spotting Using Unsupervised Player Classification

Authors: Alejandro Cartas, Coloma Ballester, Gloria Haro

Abstract: Action spotting in soccer videos is the task of identifying the specific time when a certain key action of the game occurs. Lately, it has received a large amount of attention and powerful methods have been introduced. Action spotting involves understanding the dynamics of the game, the complexity of events, and the variation of video sequences. Most approaches have focused on the latter, given that their models exploit the global visual features of the sequences. In this work, we focus on the former by (a) identifying and representing the players, referees, and goalkeepers as nodes in a graph, and by (b) modeling their temporal interactions as sequences of graphs. For the player identification, or player classification task, we obtain an accuracy of 97.72% in our annotated benchmark. For the action spotting task, our method obtains an overall performance of 57.83% average-mAP by combining it with other audiovisual modalities. This performance surpasses similar graph-based methods and has competitive results with heavy computing methods. Code and data are available at https://github.com/IPCV/soccer_action_spotting.

Submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted at the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports 2022)

arXiv:2204.02090 [pdf, other]

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Authors: Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro

Abstract: In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While the existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code, and the pre-trained models are available on https://ipcv.github.io/VocaLiST/

Submitted 30 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Paper accepted to Interspeech 2022; Project Page: https://ipcv.github.io/VocaLiST/

arXiv:2203.04099 [pdf, other]

VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

Abstract: This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparison to state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation in the task of singing voice separation. The demos, code, and weights are available in https://ipcv.github.io/VoViT/

Submitted 19 July, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted to ECCV 2022

arXiv:2106.00359 [pdf, other]

Learning Football Body-Orientation as a Matter of Classification

Authors: Adrià Arbués-Sangüesa, Adrián Martín, Paulino Granero, Coloma Ballester, Gloria Haro

Abstract: Orientation is a crucial skill for football players that becomes a differential factor in a large set of events, especially the ones involving passes. However, existing orientation estimation methods, which are based on computer-vision techniques, still have a lot of room for improvement. To the best of our knowledge, this article presents the first deep learning model for estimating orientation directly from video footage. By approaching this challenge as a classification problem where classes correspond to orientation bins, and by introducing a cyclic loss function, a well-known convolutional network is refined to provide player orientation data. The model is trained by using ground-truth orientation data obtained from wearable EPTS devices, which are individually compensated with respect to the perceived orientation in the current frame. The obtained results outperform previous methods; in particular, the absolute median error is less than 12 degrees per player. An ablation study is included in order to show the potential generalization to any kind of football video footage.

Submitted 1 June, 2021; originally announced June 2021.

Comments: Accepted in the AI for Sports Analytics Workshop at IJCAI 2021

arXiv:2104.09946 [pdf, other]

A cappella: Audio-visual Singing Voice Separation

Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

Abstract: The task of isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose an audio-visual convolutional network based on graphs which achieves state-of-the-art singing voice separation results on our dataset and compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. We evaluate the models in the following challenging setups: i) presence of overlapping voices in the audio mixtures, ii) the target voice set to lower volume levels in the mix, and iii) a combination of i) and ii), the third being the most challenging evaluation setup. We demonstrate that our model outperforms the baseline models in the singing voice separation task in the most challenging evaluation setup. The code, the pre-trained models, and the dataset are publicly available at https://ipcv.github.io/Acappella/

Submitted 18 October, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

Comments: Paper accepted at The 32nd British Machine Vision Conference, BMVC 2021

arXiv:2011.10336 [pdf, other]

doi 10.1145/3422844.3423054

Self-Supervised Small Soccer Player Detection and Tracking

Authors: Samuel Hurault, Coloma Ballester, Gloria Haro

Abstract: In a soccer game, the information provided by detecting and tracking brings crucial clues to further analyze and understand some tactical aspects of the game, including individual and team actions. State-of-the-art tracking algorithms achieve impressive results in scenarios on which they have been trained, but they fail in challenging ones such as soccer games. This is frequently due to the players' small relative size and the similar appearance among players of the same team. Although a straightforward solution would be to retrain these models by using a more specific dataset, the lack of such publicly available annotated datasets entails searching for other effective solutions. In this work, we propose a self-supervised pipeline which is able to detect and track low-resolution soccer players under different recording conditions without any need of ground-truth data. Extensive quantitative and qualitative experimental results are presented evaluating its performance. We also present a comparison to several state-of-the-art methods showing that both the proposed detector and the proposed tracker achieve top-tier results, in particular in the presence of small players.

Submitted 20 November, 2020; originally announced November 2020.

Comments: In Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (MMSports '20)

Journal ref: Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (2020) 9-18

arXiv:2006.07931 [pdf, other]

Solos: A Dataset for Audio-Visual Music Analysis

Authors: Juan F. Montesinos, Olga Slizovskaia, Gloria Haro

Abstract: In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner since a large number of its recordings are auditions and manually checked recordings, ensuring there is no background noise or effects added in the video post-processing. Besides, it is, to the best of our knowledge, the only dataset that contains the whole set of instruments present in the URMP dataset, a high-quality dataset of 44 audio-visual recordings of multi-instrument classical music pieces with individual audio tracks. URMP was intended to be used for source separation; thus, we evaluate the performance on the URMP dataset of two different source-separation models trained on Solos. The dataset is publicly available at https://juanfmontesinos.github.io/Solos/

Submitted 6 August, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

Comments: Rephrased some sentences. Explanation about OpenPose. Minor grammatical errors

arXiv:2004.07209 [pdf, other]

Using Player's Body-Orientation to Model Pass Feasibility in Soccer

Authors: Adrià Arbués-Sangüesa, Adrián Martín, Javier Fernández, Coloma Ballester, Gloria Haro

Abstract: Given a monocular video of a soccer match, this paper presents a computational model to estimate the most feasible pass at any given time. The method leverages offensive players' orientation (plus their location) and opponents' spatial configuration to compute the feasibility of pass events within players of the same team. Orientation data is gathered from body pose estimations that are properly projected onto the 2D game field; moreover, a geometrical solution is provided, through the definition of a feasibility measure, to determine which players are better oriented towards each other. After analyzing more than 6000 pass events, results show that, by including orientation as a feasibility measure, a robust computational model can be built, reaching more than 0.7 Top-3 accuracy. Finally, the combination of the orientation feasibility measure with the recently introduced Expected Possession Value metric is studied; promising results are obtained, thus showing that existing models can be refined by using orientation as a key feature. These models could help both coaches and analysts to have a better understanding of the game and to improve the players' decision-making process.

Submitted 15 April, 2020; originally announced April 2020.

Comments: Accepted at the Computer Vision in Sports Workshop at CVPR 2020

arXiv:2004.03873 [pdf, other]

doi 10.1109/TASLP.2021.3082331

Conditioned Source Separation for Music Instrument Performances

Authors: Olga Slizovskaia, Gloria Haro, Emilia Gómez

Abstract: In music source separation, the number of sources may vary for each piece and some of the sources may belong to the same family of instruments, thus sharing timbral characteristics and making the sources more correlated. This leads to additional challenges in the source separation problem. This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation. We explore conditioning techniques at different levels of a primary source separation network and utilize two extra modalities of data, namely presence or absence of instruments in the mixture, and the corresponding video stream data.

Submitted 7 July, 2021; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: 14 pages, 5 figures, under review

arXiv:2004.02541 [pdf, other]

Vocoder-Based Speech Synthesis from Silent Videos

Authors: Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.

Submitted 15 August, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: Accepted to Interspeech 2020

arXiv:2003.10414 [pdf, other]

Multi-channel U-Net for Music Source Separation

Authors: Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro, Emilia Gómez

Abstract: A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated to estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve a performance comparable to that of the dedicated models. We propose a multi-channel U-Net (M-U-Net) trained using a weighted multi-task loss as an alternative to the C-U-Net. We investigate two weighting strategies for our multi-task loss: 1) Dynamic Weighted Average (DWA), and 2) Energy Based Weighting (EBW). DWA determines the weights by tracking the rate of change of loss of each task during training. EBW aims to neutralize the effect of the training bias arising from the difference in energy levels of each of the sources in a mixture. Our methods provide three-fold advantages compared to C-U-Net: 1) Fewer effective training iterations per epoch, 2) Fewer trainable network parameters (no control parameters), and 3) Faster processing at inference. Our methods achieve performance comparable to that of C-U-Net and the dedicated U-Nets at a much lower training cost.

Submitted 4 September, 2020; v1 submitted 23 March, 2020; originally announced March 2020.

Comments: The paper has been accepted at IEEE MMSP2020. Project Page: https://vskadandale.github.io/multi-channel-unet

arXiv:2003.00943 [pdf, other]

Always Look on the Bright Side of the Field: Merging Pose and Contextual Data to Estimate Orientation of Soccer Players

Authors: Adrià Arbués-Sangüesa, Adrián Martín, Javier Fernández, Carlos Rodríguez, Gloria Haro, Coloma Ballester

Abstract: Although orientation has proven to be a key skill of soccer players in order to succeed in a broad spectrum of plays, body orientation is a yet-little-explored area in sports analytics research. Despite being an inherently ambiguous concept, player orientation can be defined as the projection (2D) of the normal vector placed in the center of the upper-torso of players (3D). This research presents a novel technique to obtain player orientation from monocular video recordings by mapping pose parts (shoulders and hips) in a 2D field by combining OpenPose with a super-resolution network, and merging the obtained estimation with contextual information (ball position). Results have been validated with players-held EPTS devices, obtaining a median error of 27 degrees/player. Moreover, three novel types of orientation maps are proposed in order to make raw orientation data easy to visualize and understand, thus allowing further analysis at team- or player-level.

Submitted 18 May, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: Article accepted in the International Conference on Image Processing (ICIP 2020); Appendix was not included in the original manuscript

arXiv:1907.04637 [pdf, other]

Multi-Person tracking by multi-scale detection in Basketball scenarios

Authors: Adrià Arbués-Sangüesa, Gloria Haro, Coloma Ballester

Abstract: Tracking data is a powerful tool for basketball teams in order to extract advanced semantic information and statistics that might lead to a performance boost. However, multi-person tracking is a challenging task to solve in single-camera video sequences, given the frequent occlusions and cluttering that occur in a restricted scenario. In this paper, a novel multi-scale detection method is presented, which is later used to extract geometric and content features, resulting in a multi-person video tracking system. Having built a dataset from scratch together with its ground truth (more than 10k bounding boxes), standard metrics are evaluated, obtaining notable results both in terms of detection (F1-score) and tracking (MOTA). The presented system could be used as a source of data gathering in order to extract useful statistics and semantic analyses a posteriori.

Submitted 10 July, 2019; originally announced July 2019.

Comments: Accepted in IMVIP 2019

arXiv:1907.01813 [pdf, other]

A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

Authors: Olga Slizovskaia, Emilia Gómez, Gloria Haro

Abstract: The explainability of Convolutional Neural Networks (CNNs) is a particularly challenging task in all areas of application, and it is notably under-researched in the music and audio domain. In this paper, we approach explainability by exploiting the knowledge we have on hand-crafted audio features. Our study focuses on a well-defined MIR task, the recognition of musical instruments from user-generated music recordings. We compute the similarity between a set of traditional audio features and representations learned by CNNs. We also propose a technique for measuring the similarity between activation maps and audio features which are typically presented in the form of a matrix, such as chromagrams or spectrograms. We observe that some neurons' activations correspond to well-known classical audio features. In particular, for shallow layers, we found similarities between activations and harmonic and percussive components of the spectrum. For deeper layers, we compare chromagrams with high-level activation maps as well as loudness and onset rate with deep-learned embeddings.

Submitted 3 July, 2019; originally announced July 2019.

Comments: The 2018 Joint Workshop on Machine Learning for Music, The Federated Artificial Intelligence Meeting (FAIM), Joint workshop program of ICML, IJCAI/ECAI, and AAMAS, Stockholm, Sweden, Saturday, July 14th, 2018

arXiv:1906.02042 [pdf, other]

Single-Camera Basketball Tracker through Pose and Semantic Feature Fusion

Authors: Adrià Arbués-Sangüesa, Coloma Ballester, Gloria Haro

Abstract: Tracking sports players is a widely challenging scenario, especially in single-feed videos recorded in tight courts, where cluttering and occlusions cannot be avoided. This paper presents an analysis of several geometric and semantic visual features to detect and track basketball players. An ablation study is carried out and then used to remark that a robust tracker can be built with Deep Learning features, without the need of extracting contextual ones, such as proximity or color similarity, nor applying camera stabilization techniques. The presented tracker consists of: (1) a detection step, which uses a pretrained deep learning model to estimate the players' pose, followed by (2) a tracking step, which leverages pose and semantic information from the output of a convolutional layer in a VGG network. Its performance is analyzed in terms of MOTA over a basketball dataset with more than 10k instances.

Submitted 10 July, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

Comments: Accepted in the International Conference on Artificial Intelligence in Sports 2019 (ICAIS)

arXiv:1811.01850 [pdf, other]

End-to-End Sound Source Separation Conditioned On Instrument Labels

Authors: Olga Slizovskaia, Leo Kim, Gloria Haro, Emilia Gomez

Abstract: Can we perform an end-to-end music source separation with a variable number of sources using a deep learning model? We present an extension of the Wave-U-Net model which allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation results. This approach leads to other types of conditioning such as audio-visual source separation and score-informed source separation.

Submitted 9 May, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

Comments: 5 pages, 2 figures, 2 tables, ICASSP 2019

arXiv:1602.08960 [pdf, other]

FALDOI: A new minimization strategy for large displacement variational optical flow

Authors: Roberto P. Palomares, Enric Meinhardt-Llopis, Coloma Ballester, Gloria Haro

Abstract: We propose a large displacement optical flow method that introduces a new strategy to compute a good local minimum of any optical flow energy functional. The method requires a given set of discrete matches, which can be extremely sparse, and an energy functional which locally guides the interpolation from those matches. In particular, the matches are used to guide a structured coordinate-descent of the energy functional around these keypoints. It results in a two-step minimization method at the finest scale which is very robust to the inevitable outliers of the sparse matcher and able to capture large displacements of small objects. Its benefits over other variational methods that also rely on a set of sparse matches are its robustness against very few matches, high levels of noise and outliers. We validate our proposal using several optical flow variational models. The results consistently outperform the coarse-to-fine approaches and achieve good qualitative and quantitative performance on the standard optical flow benchmarks.

Submitted 29 September, 2016; v1 submitted 29 February, 2016; originally announced February 2016.

MSC Class: 68U10; 49M29; 65K10

arXiv:1511.08418 [pdf, other]

doi 10.1007/s10851-016-0652-x

A Computational Model for Amodal Completion

Authors: Maria Oliver, Gloria Haro, Mariella Dimiccoli, Baptiste Mazin, Coloma Ballester

Abstract: This paper presents a computational model to recover the most likely interpretation of the 3D scene structure from a planar image, where some objects may occlude others. The estimated scene interpretation is obtained by integrating some global and local cues and provides both the complete disoccluded objects that form the scene and their ordering according to depth. Our method first computes several distal scenes which are compatible with the proximal planar image. To compute these different hypothesized scenes, we propose a perceptually inspired object disocclusion method, which works by minimizing the Euler's elastica as well as by incorporating the relatability of partially occluded contours and the convexity of the disoccluded objects. Then, to estimate the preferred scene we rely on a Bayesian model and define probabilities taking into account the global complexity of the objects in the hypothesized scenes as well as the effort of bringing these objects in their relative position in the planar image, which is also measured by an Euler's elastica-based quantity. The model is illustrated with numerical experiments on both synthetic and real images, showing the ability of our model to reconstruct the occluded objects and the preferred perceptual order among them. We also present results on images of the Berkeley dataset with provided figure-ground ground-truth labeling.

Submitted 29 March, 2016; v1 submitted 26 November, 2015; originally announced November 2015.

Showing 1–19 of 19 results for author: Haro, G