Search | arXiv e-print repository

Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos

Authors: Duc Pham, Matthew Hansen, Félicie Dhellemmes, Jens Krause, Pia Bideau

Abstract: Easily accessible sensors, like drones with diverse onboard sensors, have greatly expanded studying animal behavior in natural environments. Yet, analyzing vast, unlabeled video data, often spanning hours, remains a challenge for machine learning, especially in computer vision. Existing approaches often analyze only a few frames. Our focus is on long-term animal behavior analysis. To address this… ▽ More Easily accessible sensors, like drones with diverse onboard sensors, have greatly expanded studying animal behavior in natural environments. Yet, analyzing vast, unlabeled video data, often spanning hours, remains a challenge for machine learning, especially in computer vision. Existing approaches often analyze only a few frames. Our focus is on long-term animal behavior analysis. To address this challenge, we utilize classical probabilistic methods for state estimation, such as particle filtering. By incorporating recent advancements in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. Particle filters offer a provably optimal algorithmic structure for recursively adding new incoming information. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in 2D, instead it tracks the position and spatial expansion of the fish school in world coordinates by fusing video data and the drone's on board sensor information (GPS and IMU). The presented framework for the first time allows researchers to study collective behavior of fish schools in its natural social and environmental context in a non-invasive and scalable way. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: CVPRW: Workshop paper appearing in CV4Animals

arXiv:2404.10904 [pdf, other]

Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition

Authors: Marah Halawa, Florian Blume, Pia Bideau, Martin Maier, Rasha Abdel Rahman, Olaf Hellwich

Abstract: Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems… ▽ More Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multitask multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: First, a multi-modal contrastive loss, that pulls diverse data modalities of the same video together in the representation space. Second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space. Finally, a multi-modal data reconstruction loss. We conduct a comprehensive study on this multimodal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as source code publicly △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: The paper will appear in the CVPR 2024 workshops proceedings

arXiv:2207.08664 [pdf, other]

Action-based Contrastive Learning for Trajectory Prediction

Authors: Marah Halawa, Olaf Hellwich, Pia Bideau

Abstract: Trajectory prediction is an essential task for successful human robot interaction, such as in autonomous driving. In this work, we address the problem of predicting future pedestrian trajectories in a first person view setting with a moving camera. To that end, we propose a novel action-based contrastive learning loss, that utilizes pedestrian action information to improve the learned trajectory e… ▽ More Trajectory prediction is an essential task for successful human robot interaction, such as in autonomous driving. In this work, we address the problem of predicting future pedestrian trajectories in a first person view setting with a moving camera. To that end, we propose a novel action-based contrastive learning loss, that utilizes pedestrian action information to improve the learned trajectory embeddings. The fundamental idea behind this new loss is that trajectories of pedestrians performing the same action should be closer to each other in the feature space than the trajectories of pedestrians with significantly different actions. In other words, we argue that behavioral information about pedestrian action influences their future trajectory. Furthermore, we introduce a novel sampling strategy for trajectories that is able to effectively increase negative and positive contrastive samples. Additional synthetic trajectory samples are generated using a trained Conditional Variational Autoencoder (CVAE), which is at the core of several models developed for trajectory prediction. Results show that our proposed contrastive framework employs contextual information about pedestrian behavior, i.e. action, effectively, and it learns a better trajectory representation. Thus, integrating the proposed contrastive framework within a trajectory prediction model improves its results and outperforms state-of-the-art methods on three trajectory prediction benchmarks [31, 32, 26]. △ Less

Submitted 18 July, 2022; originally announced July 2022.

Comments: This paper will appear in the proceedings of The European Conference on Computer Vision (ECCV 2022)

arXiv:2203.00115 [pdf, other]

The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

Authors: Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

Abstract: Both a good understanding of geometrical concepts and a broad familiarity with objects lead to our excellent perception of moving objects. The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer and even camouflage. How humans perceive moving objects so reliably is a longstanding research question in comp… ▽ More Both a good understanding of geometrical concepts and a broad familiarity with objects lead to our excellent perception of moving objects. The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer and even camouflage. How humans perceive moving objects so reliably is a longstanding research question in computer vision and borrows findings from related areas such as psychology, cognitive science and physics. One approach to the problem is to teach a deep network to model all of these effects. This contrasts with the strategy used by human vision, where cognitive processes and body design are tightly coupled and each is responsible for certain aspects of correctly identifying moving objects. Similarly from the computer vision perspective, there is evidence that classical, geometry-based techniques are better suited to the "motion-based" parts of the problem, while deep networks are more suitable for modeling appearance. In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly. We present a novel probabilistic model to estimate the camera's rotation given the motion field. We then rectify the flow field to obtain a rotation-compensated motion field for subsequent segmentation. This strategy of first estimating camera motion, and then allowing a network to learn the remaining parts of the problem, yields improved results on the widely used DAVIS benchmark as well as the recently published motion segmentation data set MoCA (Moving Camouflaged Animals). △ Less

Submitted 28 February, 2022; originally announced March 2022.

arXiv:2106.06887 [pdf, other]

doi 10.1109/ICCV48922.2021.01324

The Spatio-Temporal Poisson Point Process: A Simple Model for the Alignment of Event Camera Data

Authors: Cheng Gu, Erik Learned-Miller, Daniel Sheldon, Guillermo Gallego, Pia Bideau

Abstract: Event cameras, inspired by biological vision systems, provide a natural and data efficient representation of visual information. Visual information is acquired in the form of events that are triggered by local brightness changes. Each pixel location of the camera's sensor records events asynchronously and independently with very high temporal resolution. However, because most brightness changes ar… ▽ More Event cameras, inspired by biological vision systems, provide a natural and data efficient representation of visual information. Visual information is acquired in the form of events that are triggered by local brightness changes. Each pixel location of the camera's sensor records events asynchronously and independently with very high temporal resolution. However, because most brightness changes are triggered by relative motion of the camera and the scene, the events recorded at a single sensor location seldom correspond to the same world point. To extract meaningful information from event cameras, it is helpful to register events that were triggered by the same underlying world point. In this work we propose a new model of event data that captures its natural spatio-temporal structure. We start by develo** a model for aligned event data. That is, we develop a model for the data as though it has been perfectly registered already. In particular, we model the aligned data as a spatio-temporal Poisson point process. Based on this model, we develop a maximum likelihood approach to registering events that are not yet aligned. That is, we find transformations of the observed events that make them as likely as possible under our model. In particular we extract the camera rotation that leads to the best event alignment. We show new state of the art accuracy for rotational velocity estimation on the DAVIS 240C dataset. In addition, our method is also faster and has lower computational complexity than several competing methods. △ Less

Submitted 12 June, 2021; originally announced June 2021.

Journal ref: IEEE International Conference on Computer Vision 2021

arXiv:2106.06073 [pdf, other]

A modular framework for object-based saccadic decisions in dynamic scenes

Authors: Nicolas Roth, Pia Bideau, Olaf Hellwich, Martin Rolfs, Klaus Obermayer

Abstract: Visually exploring the world around us is not a passive process. Instead, we actively explore the world and acquire visual information over time. Here, we present a new model for simulating human eye-movement behavior in dynamic real-world scenes. We model this active scene exploration as a sequential decision making process. We adapt the popular drift-diffusion model (DDM) for perceptual decision… ▽ More Visually exploring the world around us is not a passive process. Instead, we actively explore the world and acquire visual information over time. Here, we present a new model for simulating human eye-movement behavior in dynamic real-world scenes. We model this active scene exploration as a sequential decision making process. We adapt the popular drift-diffusion model (DDM) for perceptual decision making and extend it towards multiple options, defined by objects present in the scene. For each possible choice, the model integrates evidence over time and a decision (saccadic eye movement) is triggered as soon as evidence crosses a decision threshold. Drawing this explicit connection between decision making and object-based scene perception is highly relevant in the context of active viewing, where decisions are made continuously while interacting with an external environment. We validate our model with a carefully designed ablation study and explore influences of our model parameters. A comparison on the VidCom dataset supports the plausibility of the proposed approach. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted for presentation at EPIC@CVPR2021 workshop, 4 pages, 2 figures

arXiv:1610.10033 [pdf, other]

A Detailed Rubric for Motion Segmentation

Authors: Pia Bideau, Erik Learned-Miller

Abstract: Motion segmentation is currently an active area of research in computer Vision. The task of comparing different methods of motion segmentation is complicated by the fact that researchers may use subtly different definitions of the problem. Questions such as "Which objects are moving?", "What is background?", and "How can we use motion of the camera to segment objects, whether they are static or mo… ▽ More Motion segmentation is currently an active area of research in computer Vision. The task of comparing different methods of motion segmentation is complicated by the fact that researchers may use subtly different definitions of the problem. Questions such as "Which objects are moving?", "What is background?", and "How can we use motion of the camera to segment objects, whether they are static or moving?" are clearly related to each other, but lead to different algorithms, and imply different versions of the ground truth. This report has two goals. The first is to offer a precise definition of motion segmentation so that the intent of an algorithm is as well-defined as possible. The second is to report on new versions of three previously existing data sets that are compatible with this definition. We hope that this more detailed definition, and the three data sets that go with it, will allow more meaningful comparisons of certain motion segmentation methods. △ Less

Submitted 31 October, 2016; originally announced October 2016.

arXiv:1604.00136 [pdf, other]

It's Moving! A Probabilistic Model for Causal Motion Segmentation in Moving Camera Videos

Authors: Pia Bideau, Erik Learned-Miller

Abstract: The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. In addition to all of this, the ability to detect motion is nearly instantaneous. While there has been much recent progress in motion segmentation, it still appears we are far from human capabilities. In this work, we derive fro… ▽ More The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. In addition to all of this, the ability to detect motion is nearly instantaneous. While there has been much recent progress in motion segmentation, it still appears we are far from human capabilities. In this work, we derive from first principles a new likelihood function for assessing the probability of an optical flow vector given the 3D motion direction of an object. This likelihood uses a novel combination of the angle and magnitude of the optical flow to maximize the information about the true motions of objects. Using this new likelihood and several innovations in initialization, we develop a motion segmentation algorithm that beats current state-of-the-art methods by a large margin. We compare to five state-of-the-art methods on two established benchmarks, and a third new data set of camouflaged animals, which we introduce to push motion segmentation to the next level. △ Less

Submitted 1 April, 2016; originally announced April 2016.

Showing 1–8 of 8 results for author: Bideau, P