-
Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind
Authors:
Chiara Plizzari,
Shubham Goel,
Toby Perrett,
Jacob Chalk,
Angjoo Kanazawa,
Dima Damen
Abstract:
As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out-of-view of the camera - hence keeping in mind what is out of sight. We test LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. E.g., for one long egocentric video, we estimate the 3D location of 50 active objects. Of these, 60% can be correctly positioned in 3D after 2 minutes of leaving the camera view.
Submitted 7 April, 2024;
originally announced April 2024.
-
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Authors:
Hanyuan Wang,
Majid Mirmehdi,
Dima Damen,
Toby Perrett
Abstract:
Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git
Submitted 27 November, 2023;
originally announced November 2023.
-
What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
Authors:
Chiara Plizzari,
Toby Perrett,
Barbara Caputo,
Dima Damen
Abstract:
We propose and address a new generalisation problem: can a model trained for action recognition successfully classify actions when they are performed within a previously unseen scenario and in a previously unseen location? To answer this question, we introduce the Action Recognition Generalisation Over scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We demonstrate recognition models struggle to generalise over 10 proposed test splits, each of an unseen scenario in an unseen location. We thus propose CIR, a method to represent each video as a Cross-Instance Reconstruction of videos from other domains. Reconstructions are paired with text narrations to guide the learning of a domain generalisable representation. We provide extensive analysis and ablations on ARGO1M that show CIR outperforms prior domain generalisation works on all test splits. Code and data: https://chiaraplizz.github.io/what-can-a-cook/.
Submitted 24 August, 2023; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Use Your Head: Improving Long-Tail Video Recognition
Authors:
Toby Perrett,
Saptarshi Sinha,
Tilo Burghardt,
Majid Mirmehdi,
Dima Damen
Abstract:
This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces overfitting to instances from few-shot classes by reconstructing them as weighted combinations of samples from head classes. LMR then employs label mixing to learn robust decision boundaries. It achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
Submitted 3 April, 2023;
originally announced April 2023.
-
Refining Action Boundaries for One-stage Detection
Authors:
Hanyuan Wang,
Majid Mirmehdi,
Dima Damen,
Toby Perrett
Abstract:
Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined boundaries with higher confidence. We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 action detection as well as the standard THUMOS14 action detection benchmarks, and achieve improvement on the ActivityNet-1.3 benchmark.
Submitted 25 October, 2022;
originally announced October 2022.
-
Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things
Authors:
Alessandro Masullo,
Toby Perrett,
Tilo Burghardt,
Ian Craddock,
Dima Damen,
Majid Mirmehdi
Abstract:
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) which takes advantage of learning using privileged information (LUPI). We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time. We evaluate the proposed model on inertial data from a wearable accelerometer device, using RGB videos and skeletons as privileged modalities, and show an average accuracy improvement of 6.6% on the UTD-MHAD dataset and 5.5% on the Berkeley MHAD dataset, reaching a new state-of-the-art for inertial-only classification accuracy on these datasets. We validate our framework through several ablation studies.
Submitted 14 July, 2022;
originally announced July 2022.
-
An Evaluation of OCR on Egocentric Data
Authors:
Valentin Popescu,
Dima Damen,
Toby Perrett
Abstract:
In this paper, we evaluate state-of-the-art OCR methods on Egocentric data. We annotate text in EPIC-KITCHENS images, and demonstrate that existing OCR methods struggle with rotated text, which is frequently observed on objects being handled. We introduce a simple rotate-and-merge procedure which can be applied to pre-trained OCR models that halves the normalized edit distance error. This suggests that future OCR attempts should incorporate rotation into model design and training procedures.
Submitted 11 June, 2022;
originally announced June 2022.
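The abstract above reports results as a normalized edit distance error. As a rough illustration of how such a metric is commonly computed, the sketch below divides the Levenshtein distance by the length of the longer string. This normalization and the function names `levenshtein` and `normalized_edit_distance` are assumptions made for illustration; the paper may define the metric differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted: str, truth: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, truth) / longest
```

Under this definition, a perfect transcription scores 0 and a completely wrong one scores 1, so halving the error means roughly half as many character-level corrections per annotated string.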
-
TVNet: Temporal Voting Network for Action Localization
Authors:
Hanyuan Wang,
Dima Damen,
Majid Mirmehdi,
Toby Perrett
Abstract:
We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. This incorporates a novel Voting Evidence Module to locate temporal boundaries more accurately, where temporal contextual evidence is accumulated to predict frame-level probabilities of start and end action boundaries. Our action-independent evidence module is incorporated within a pipeline to calculate confidence scores and action classes. We achieve an average mAP of 34.6% on ActivityNet-1.3, particularly outperforming previous methods at the highest IoU of 0.95. TVNet also achieves mAP of 56.0% when combined with PGCN and 59.1% with MUSES at 0.5 IoU on THUMOS14 and outperforms prior work at all thresholds. Our code is available at https://github.com/hanielwang/TVNet.
Submitted 2 January, 2022;
originally announced January 2022.
-
Temporal-Relational CrossTransformers for Few-Shot Action Recognition
Authors:
Toby Perrett,
Alessandro Masullo,
Tilo Burghardt,
Majid Mirmehdi,
Dima Damen
Abstract:
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared.
Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
Submitted 28 March, 2021; v1 submitted 15 January, 2021;
originally announced January 2021.
-
Meta-Learning with Context-Agnostic Initialisations
Authors:
Toby Perrett,
Alessandro Masullo,
Tilo Burghardt,
Majid Mirmehdi,
Dima Damen
Abstract:
Meta-learning approaches have addressed few-shot problems by finding initialisations suited for fine-tuning to target tasks. Often there are additional properties within training data (which we refer to as context), not relevant to the target task, which act as a distractor to meta-learning, particularly when the target task contains examples from a novel context not seen during training. We address this oversight by incorporating a context-adversarial component into the meta-learning process. This produces an initialisation for fine-tuning to target which is both context-agnostic and task-generalised. We evaluate our approach on three commonly used meta-learning algorithms and two problems. We demonstrate our context-agnostic meta-learning improves results in each case. First, we report on Omniglot few-shot character classification, using alphabets as context. An average improvement of 4.3% is observed across methods and tasks when classifying characters from an unseen alphabet. Second, we evaluate on a dataset for personalised energy expenditure predictions from video, using participant knowledge as context. We demonstrate that context-agnostic meta-learning decreases the average mean square error by 30%.
Submitted 22 October, 2020; v1 submitted 29 July, 2020;
originally announced July 2020.
-
Rescaling Egocentric Vision
Authors:
Dima Damen,
Hazel Doughty,
Giovanni Maria Farinella,
Antonino Furnari,
Evangelos Kazakos,
Jian Ma,
Davide Moltisanti,
Jonathan Munro,
Toby Perrett,
Will Price,
Michael Wray
Abstract:
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time" - i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task, provide baselines and evaluation metrics.
Submitted 17 September, 2021; v1 submitted 23 June, 2020;
originally announced June 2020.
-
The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
Authors:
Dima Damen,
Hazel Doughty,
Giovanni Maria Farinella,
Sanja Fidler,
Antonino Furnari,
Evangelos Kazakos,
Davide Moltisanti,
Jonathan Munro,
Toby Perrett,
Will Price,
Michael Wray
Abstract:
Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos after recording, thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions, e.g. 'closing a tap' from 'opening' it up.
Submitted 29 April, 2020;
originally announced May 2020.
-
Sit-to-Stand Analysis in the Wild using Silhouettes for Longitudinal Health Monitoring
Authors:
Alessandro Masullo,
Tilo Burghardt,
Toby Perrett,
Dima Damen,
Majid Mirmehdi
Abstract:
We present the first fully automated Sit-to-Stand or Stand-to-Sit (StS) analysis framework for long-term monitoring of patients in free-living environments using video silhouettes. Our method adopts a coarse-to-fine time localisation approach, where a deep learning classifier identifies possible StS sequences from silhouettes, and a smart peak detection stage provides fine localisation based on 3D bounding boxes. We tested our method on data from real homes of participants and monitored patients undergoing total hip or knee replacement. Our results show 94.4% overall accuracy in the coarse localisation and an error of 0.026 m/s in the speed of ascent measurement, highlighting important trends in the recuperation of patients who underwent surgery.
Submitted 3 October, 2019;
originally announced October 2019.
-
DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Authors:
Toby Perrett,
Dima Damen
Abstract:
Domain alignment in convolutional networks aims to learn the degree of layer-specific feature alignment beneficial to the joint learning of source and target datasets. While increasingly popular in convolutional networks, there have been no previous attempts to achieve domain alignment in recurrent networks. Similar to spatial features, both source and target domains are likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is able to learn temporal dependencies from two domains concurrently. It performs cross-contaminated batch normalisation on both input-to-hidden and hidden-to-hidden weights, and learns the parameters for cross-contamination, for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on frame-level action recognition using three datasets, taking a pair at a time, and report an average increase in accuracy of 3.5%. The proposed DDLSTM architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
Submitted 18 April, 2019;
originally announced April 2019.
-
Colouring Graphs with Sparse Neighbourhoods: Bounds and Applications
Authors:
Marthe Bonamy,
Thomas Perrett,
Luke Postle
Abstract:
Let $G$ be a graph with chromatic number $χ$, maximum degree $Δ$ and clique number $ω$. Reed's conjecture states that $χ \leq \lceil (1-\varepsilon)(Δ + 1) + \varepsilon ω \rceil$ for all $\varepsilon \leq 1/2$. It was shown by King and Reed that, provided $Δ$ is large enough, the conjecture holds for $\varepsilon \leq 1/130,000$. In this article, we show that the same statement holds for $\varepsilon \leq 1/26$, thus making a significant step towards Reed's conjecture. We derive this result from a general technique to bound the chromatic number of a graph where no vertex has many edges in its neighbourhood. Our improvements to this method also lead to improved bounds on the strong chromatic index of general graphs. We prove that $χ'_s(G) \leq 1.835 Δ(G)^2$ provided $Δ(G)$ is large enough.
Submitted 15 October, 2018;
originally announced October 2018.
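To make the bound in the abstract above concrete, the following sketch evaluates $\lceil (1-\varepsilon)(Δ + 1) + \varepsilon ω \rceil$ for given values of $Δ$, $ω$ and $\varepsilon$. The function name `reed_bound` and the sample values are hypothetical, chosen only for illustration; the paper's result holds only when $Δ$ is large enough, so a small example shows the arithmetic rather than the theorem.

```python
from fractions import Fraction
from math import ceil


def reed_bound(max_degree: int, clique_number: int, eps) -> int:
    """Evaluate ceil((1 - eps) * (max_degree + 1) + eps * clique_number) exactly."""
    # Fractions avoid floating-point error right at the ceiling boundary.
    eps = Fraction(eps)
    return ceil((1 - eps) * (max_degree + 1) + eps * clique_number)


# Illustrative values only: maximum degree 10, clique number 3.
conjectured = reed_bound(10, 3, Fraction(1, 2))   # eps = 1/2, the conjectured range
proved = reed_bound(10, 3, Fraction(1, 26))       # eps = 1/26, the range in this abstract
```

A larger $\varepsilon$ pulls the bound away from the trivial $Δ + 1$ and towards $ω$, which is why moving from $1/130,000$ to $1/26$ is a meaningful step.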
-
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
Authors:
Dima Damen,
Hazel Doughty,
Giovanni Maria Farinella,
Sanja Fidler,
Antonino Furnari,
Evangelos Kazakos,
Davide Moltisanti,
Jonathan Munro,
Toby Perrett,
Will Price,
Michael Wray
Abstract:
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. Dataset and Project page: http://epic-kitchens.github.io
Submitted 31 July, 2018; v1 submitted 8 April, 2018;
originally announced April 2018.
-
Gallai's path decomposition conjecture for graphs of small maximum degree
Authors:
Marthe Bonamy,
Thomas Perrett
Abstract:
Gallai's path decomposition conjecture states that the edges of any connected graph on n vertices can be decomposed into at most (n+1)/2 paths. We confirm that conjecture for all graphs with maximum degree at most five.
Submitted 20 September, 2016;
originally announced September 2016.
-
Cost-based Feature Transfer for Vehicle Occupant Classification
Authors:
Toby Perrett,
Majid Mirmehdi,
Eduardo Dias
Abstract:
Knowledge of human presence and interaction in a vehicle is of growing interest to vehicle manufacturers for design and safety purposes. We present a framework to perform the tasks of occupant detection and occupant classification for automatic child locks and airbag suppression. It operates for all passenger seats, using a single overhead camera. A transfer learning technique is introduced to make full use of training data from all seats whilst still maintaining some control over the bias, necessary for a system designed to penalize certain misclassifications more than others. An evaluation is performed on a challenging dataset with both weighted and unweighted classifiers, demonstrating the effectiveness of the transfer process.
Submitted 22 December, 2015;
originally announced December 2015.