Search | arXiv e-print repository

Segmenting the motion components of a video: A long-term unsupervised model

Authors: Etienne Meunier, Patrick Bouthemy

Abstract: Human beings have the ability to continuously analyze a video and immediately extract the motion components. We want to adopt this paradigm to provide a coherent and stable motion segmentation over the video sequence. In this perspective, we propose a novel long-term spatio-temporal model operating in a totally unsupervised way. It takes as input the volume of consecutive optical flow (OF) fields, and delivers a volume of segments of coherent motion over the video. More specifically, we have designed a transformer-based network, where we leverage a mathematically well-founded framework, the Evidence Lower Bound (ELBO), to derive the loss function. The loss function combines a flow reconstruction term involving spatio-temporal parametric motion models combining, in a novel way, polynomial (quadratic) motion models for the spatial dimensions and B-splines for the time dimension of the video sequence, and a regularization term enforcing temporal consistency on the segments. We report experiments on four VOS benchmarks, demonstrating competitive quantitative results, while performing motion segmentation on a whole sequence in one go. We also highlight through visual results the key contributions on temporal consistency brought by our method.

Submitted 17 April, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2201.02074

doi 10.1109/TPAMI.2022.3198480

EM-driven unsupervised learning for efficient motion segmentation

Authors: Etienne Meunier, Anaïs Badoual, Patrick Bouthemy

Abstract: In this paper, we present a CNN-based fully unsupervised method for motion segmentation from optical flow. We assume that the input optical flow can be represented as a piecewise set of parametric motion models, typically, affine or quadratic motion models. The core idea of our work is to leverage the Expectation-Maximization (EM) framework in order to design in a well-founded manner a loss function and a training procedure of our motion segmentation neural network that does not require either ground-truth or manual annotation. However, in contrast to the classical iterative EM, once the network is trained, we can provide a segmentation for any unseen optical flow field in a single inference step and without estimating any motion models. We investigate different loss functions including robust ones and propose a novel efficient data augmentation technique on the optical flow field, applicable to any network taking optical flow as input. In addition, our method is able by design to segment multiple motions. Our motion segmentation network was tested on four benchmarks, DAVIS2016, SegTrackV2, FBMS59, and MoCA, and performed very well, while being fast at test time.

Submitted 6 October, 2022; v1 submitted 6 January, 2022; originally announced January 2022.

Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
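
To make the piecewise parametric assumption in the abstract above concrete, here is a minimal sketch of an affine motion model and a flow reconstruction loss weighted by soft segment masks. The function names, the ordering of the six affine parameters, the dictionary-based data layout and the choice of an L1 penalty are all assumptions made for illustration; this is not the authors' implementation, and it omits the network that would predict the masks.

```python
def affine_flow(theta, x, y):
    """Flow (u, v) predicted at pixel (x, y) by a 6-parameter affine model.

    The parameter ordering (a1..a6) is an assumption for this sketch.
    """
    a1, a2, a3, a4, a5, a6 = theta
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)


def reconstruction_loss(flow, masks, thetas):
    """Sum over pixels and segments of mask weight times L1 flow residual.

    flow:   dict mapping (x, y) -> observed flow (u, v)
    masks:  dict mapping (x, y) -> list of K soft weights summing to 1
    thetas: list of K affine parameter tuples, one per segment
    """
    total = 0.0
    for (x, y), (u, v) in flow.items():
        for weight, theta in zip(masks[(x, y)], thetas):
            pu, pv = affine_flow(theta, x, y)
            total += weight * (abs(u - pu) + abs(v - pv))
    return total


# Toy example: two pixels, two constant-motion segments (hypothetical data).
flow = {(0, 0): (1.0, 0.0), (1, 0): (0.0, 0.0)}
thetas = [(1.0, 0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)]
good_masks = {(0, 0): [1.0, 0.0], (1, 0): [0.0, 1.0]}
bad_masks = {(0, 0): [0.0, 1.0], (1, 0): [1.0, 0.0]}

good_loss = reconstruction_loss(flow, good_masks, thetas)
bad_loss = reconstruction_loss(flow, bad_masks, thetas)
```

In this toy case the masks that assign each pixel to the segment whose model explains its flow give a lower loss than the swapped masks, which is the signal such a loss would provide during training.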

arXiv:2012.09573

doi 10.1109/TCSVT.2021.3078804

Trajectory saliency detection using consistency-oriented latent codes from a recurrent auto-encoder

Authors: L. Maczyta, P. Bouthemy, O. Le Meur

Abstract: In this paper, we are concerned with the detection of progressive dynamic saliency from video sequences. More precisely, we are interested in saliency related to motion and likely to appear progressively over time. It can be relevant to trigger alarms, to dedicate additional processing or to detect specific events. Trajectories represent the best way to support progressive dynamic saliency detection. Accordingly, we will talk about trajectory saliency. A trajectory will be qualified as salient if it deviates from normal trajectories that share a common motion pattern related to a given context. First, we need a compact yet discriminative representation of trajectories. We adopt a (nearly) unsupervised learning-based approach. The latent code estimated by a recurrent auto-encoder provides the desired representation. In addition, we enforce consistency for normal (similar) trajectories through the auto-encoder loss function. The distance of the trajectory code to a prototype code accounting for normality is the means to detect salient trajectories. We validate our trajectory saliency detection method on synthetic and real trajectory datasets, and highlight the contributions of its different components. We show that our method outperforms existing methods on several scenarios drawn from the publicly available dataset of pedestrian trajectories acquired in a railway station (Alahi 2014).

Submitted 19 May, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

arXiv:1903.04842

doi 10.1109/ICIP.2019.8803542

Unsupervised motion saliency map estimation based on optical flow inpainting

Authors: L. Maczyta, P. Bouthemy, O. Le Meur

Abstract: The paper addresses the problem of motion saliency in videos, that is, identifying regions that undergo motion departing from their context. We propose a new unsupervised paradigm to compute motion saliency maps. The key ingredient is the flow inpainting stage. Candidate regions are determined from the optical flow boundaries. The residual flow in these regions is given by the difference between the optical flow and the flow inpainted from the surrounding areas. It provides the cue for motion saliency. The method is flexible and general by relying on motion information only. Experimental results on the DAVIS 2016 benchmark demonstrate that the method compares favourably with state-of-the-art video saliency methods.

Submitted 4 November, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

Journal ref: International Conference on Image Processing (ICIP) 2019
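
The residual flow described in the abstract above can be sketched as follows. This assumes the inpainted flow has already been computed by some other means (the inpainting stage itself is not shown), and the function name, the dictionary-based data layout and the use of the Euclidean magnitude as the saliency value are assumptions for illustration rather than the authors' implementation.

```python
import math


def residual_saliency(flow, inpainted, region):
    """Per-pixel magnitude of (observed flow - inpainted flow) inside a region.

    flow, inpainted: dicts mapping (x, y) -> (u, v)
    region:          iterable of (x, y) pixels in a candidate region
    """
    saliency = {}
    for p in region:
        du = flow[p][0] - inpainted[p][0]
        dv = flow[p][1] - inpainted[p][1]
        saliency[p] = math.hypot(du, dv)
    return saliency


# Toy example (hypothetical data): one pixel moves against a static
# surrounding, the other matches the flow inpainted from its surroundings.
flow = {(0, 0): (3.0, 4.0), (1, 0): (1.0, 1.0)}
inpainted = {(0, 0): (0.0, 0.0), (1, 0): (1.0, 1.0)}
saliency = residual_saliency(flow, inpainted, [(0, 0), (1, 0)])
```

A pixel whose observed flow agrees with the inpainted flow gets a residual of zero, while a pixel moving differently from its surroundings gets a large one.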

arXiv:1804.06504

Learning how to be robust: Deep polynomial regression

Authors: Juan-Manuel Perez-Rua, Tomas Crivelli, Patrick Bouthemy, Patrick Perez

Abstract: Polynomial regression is a recurrent problem with a large number of applications. In computer vision it often appears in motion analysis. Whatever the application, standard methods for regression of polynomial models tend to deliver biased results when the input data is heavily contaminated by outliers. Moreover, the problem is even harder when outliers have strong structure. Departing from problem-tailored heuristics for robust estimation of parametric models, we explore deep convolutional neural networks. Our work aims to find a generic approach for training deep regression models without the explicit need of supervised annotation. We bypass the need for a tailored loss function on the regression parameters by attaching to our model a differentiable hard-wired decoder corresponding to the polynomial operation at hand. We demonstrate the value of our findings by comparing with standard robust regression methods. Furthermore, we demonstrate how to use such models for a real computer vision problem, i.e., video stabilization. The qualitative and quantitative experiments show that neural networks are able to learn robustness for general polynomial regression, with results that well surpass the scores of traditional robust estimation methods.

Submitted 23 May, 2018; v1 submitted 17 April, 2018; originally announced April 2018.

Comments: 18 pages, conference

arXiv:1607.02003

Tubelets: Unsupervised action proposals from spatiotemporal super-voxels

Authors: Mihir Jain, Jan van Gemert, Hervé Jégou, Patrick Bouthemy, Cees G. M. Snoek

Abstract: This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an approach to generate action proposals from spatiotemporal super-voxels in an unsupervised manner; we call them Tubelets. Second, along with the static features from individual frames, our approach advantageously exploits motion. We introduce independent motion evidence as a feature to characterize how the action deviates from the background and explicitly incorporate such motion information in various stages of the proposal generation. Finally, we introduce spatiotemporal refinement of Tubelets, for more precise localization of actions, and pruning to keep the number of Tubelets limited. We demonstrate the suitability of our approach by extensive experiments for action proposal quality and action localization on three public datasets: UCF Sports, MSR-II and UCF101. For action proposal quality, our unsupervised proposals beat all other existing approaches on the three datasets. For action localization, we show top performance on both the trimmed videos of UCF Sports and UCF101 as well as the untrimmed videos of MSR-II.

Submitted 7 July, 2016; originally announced July 2016.

Comments: submitted to International Journal of Computer Vision

arXiv:1407.5759

Aggregation of local parametric candidates with exemplar-based occlusion handling for optical flow

Authors: Denis Fortun, Patrick Bouthemy, Charles Kervrann

Abstract: Handling large displacements, motion details and occlusions all together remains an open issue for reliable computation of optical flow in a video sequence. We propose a two-step aggregation paradigm to address this problem. The idea is to supply local motion candidates at every pixel in a first step, and then to combine them to determine the global optical flow field in a second step. We exploit local parametric estimations combined with patch correspondences and we experimentally demonstrate that they are sufficient to produce highly accurate motion candidates. The aggregation step is designed as the discrete optimization of a global regularized energy. The occlusion map is estimated jointly with the flow field throughout the two steps. We propose a generic exemplar-based approach for occlusion filling with motion vectors. We achieve state-of-the-art results in computer vision benchmarks, with particularly significant improvements in the case of large displacements and occlusions.

Submitted 22 July, 2014; originally announced July 2014.

Comments: Submission, IEEE Transactions on Image Processing (2014)

Showing 1–7 of 7 results for author: Bouthemy, P