-
An Evaluation of Action Recognition Models on EPIC-Kitchens
Authors:
Will Price,
Dima Damen
Abstract:
We benchmark contemporary action recognition models (TSN, TRN, and TSM) on the recently introduced EPIC-Kitchens dataset and release pretrained models on GitHub (https://github.com/epic-kitchens/action-models) for others to build upon. In contrast to popular action recognition datasets like Kinetics, Something-Something, UCF101, and HMDB51, EPIC-Kitchens is shot from an egocentric perspective and…
▽ More
We benchmark contemporary action recognition models (TSN, TRN, and TSM) on the recently introduced EPIC-Kitchens dataset and release pretrained models on GitHub (https://github.com/epic-kitchens/action-models) for others to build upon. In contrast to popular action recognition datasets like Kinetics, Something-Something, UCF101, and HMDB51, EPIC-Kitchens is shot from an egocentric perspective and captures daily actions in-situ. In this report, we aim to understand how well these models can tackle the challenges present in this dataset, such as its long tail class distribution, unseen environment test set, and multiple tasks (verb, noun and, action classification). We discuss the models' shortcomings and avenues for future research.
△ Less
Submitted 2 August, 2019;
originally announced August 2019.
-
Learning Visual Actions Using Multiple Verb-Only Labels
Authors:
Michael Wray,
Dima Damen
Abstract:
This work introduces verb-only representations for both recognition and retrieval of visual actions, in video. Current methods neglect legitimate semantic ambiguities between verbs, instead choosing unambiguous subsets of verbs along with objects to disambiguate the actions. We instead propose multiple verb-only labels, which we learn through hard or soft assignment as a regression. This enables l…
▽ More
This work introduces verb-only representations for both recognition and retrieval of visual actions, in video. Current methods neglect legitimate semantic ambiguities between verbs, instead choosing unambiguous subsets of verbs along with objects to disambiguate the actions. We instead propose multiple verb-only labels, which we learn through hard or soft assignment as a regression. This enables learning a much larger vocabulary of verbs, including contextual overlaps of these verbs. We collect multi-verb annotations for three action video datasets and evaluate the verb-only labelling representations for action recognition and cross-modal retrieval (video-to-text and text-to-video). We demonstrate that multi-label verb-only representations outperform conventional single verb labels. We also explore other benefits of a multi-verb representation including cross-dataset retrieval and verb type manner and result verb types) retrieval.
△ Less
Submitted 1 August, 2019; v1 submitted 25 July, 2019;
originally announced July 2019.
-
DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Authors:
Toby Perrett,
Dima Damen
Abstract:
Domain alignment in convolutional networks aims to learn the degree of layer-specific feature alignment beneficial to the joint learning of source and target datasets. While increasingly popular in convolutional networks, there have been no previous attempts to achieve domain alignment in recurrent networks. Similar to spatial features, both source and target domains are likely to exhibit temporal…
▽ More
Domain alignment in convolutional networks aims to learn the degree of layer-specific feature alignment beneficial to the joint learning of source and target datasets. While increasingly popular in convolutional networks, there have been no previous attempts to achieve domain alignment in recurrent networks. Similar to spatial features, both source and target domains are likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is able to learn temporal dependencies from two domains concurrently. It performs cross-contaminated batch normalisation on both input-to-hidden and hidden-to-hidden weights, and learns the parameters for cross-contamination, for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on frame-level action recognition using three datasets, taking a pair at a time, and report an average increase in accuracy of 3.5%. The proposed DDLSTM architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
△ Less
Submitted 18 April, 2019;
originally announced April 2019.
-
Action Recognition from Single Timestamp Supervision in Untrimmed Videos
Authors:
Davide Moltisanti,
Sanja Fidler,
Dima Damen
Abstract:
Recognising actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos, however it is challenged when the number of different actions in training videos increases. W…
▽ More
Recognising actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos, however it is challenged when the number of different actions in training videos increases. We propose a method that is supervised by single timestamps located around each action instance, in untrimmed videos. We replace expensive action bounds with sampling distributions initialised from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments. We evaluate our method on three datasets for fine-grained recognition, with increasing number of different actions per video, and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision. Our update method improves top-1 test accuracy by up to 5.4%. across the evaluated datasets.
△ Less
Submitted 9 April, 2019;
originally announced April 2019.
-
The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos
Authors:
Hazel Doughty,
Walterio Mayol-Cuevas,
Dima Damen
Abstract:
We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a meth…
▽ More
We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a method which assesses the relative overall level of skill in a long video by attending to its skill-relevant parts. Our approach trains temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate our approach on the EPIC-Skills dataset and additionally annotate a larger dataset from YouTube videos for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% pairwise accuracy, and as much as 12% on individual tasks. We also demonstrate our model's ability to attend to rank-aware parts of the video.
△ Less
Submitted 10 April, 2019; v1 submitted 13 December, 2018;
originally announced December 2018.
-
CaloriNet: From silhouettes to calorie estimation in private environments
Authors:
Alessandro Masullo,
Tilo Burghardt,
Dima Damen,
Sion Hannuna,
Victor Ponce-López,
Majid Mirmehdi
Abstract:
We propose a novel deep fusion architecture, CaloriNet, for the online estimation of energy expenditure for free living monitoring in private environments, where RGB data is discarded and replaced by silhouettes. Our fused convolutional neural network architecture is trainable end-to-end, to estimate calorie expenditure, using temporal foreground silhouettes alongside accelerometer data. The netwo…
▽ More
We propose a novel deep fusion architecture, CaloriNet, for the online estimation of energy expenditure for free living monitoring in private environments, where RGB data is discarded and replaced by silhouettes. Our fused convolutional neural network architecture is trainable end-to-end, to estimate calorie expenditure, using temporal foreground silhouettes alongside accelerometer data. The network is trained and cross-validated on a publicly available dataset, SPHERE_RGBD + Inertial_calorie. Results show state-of-the-art minimum error on the estimation of energy expenditure (calories per minute), outperforming alternative, standard and single-modal techniques.
△ Less
Submitted 21 June, 2018;
originally announced June 2018.
-
Semantically Selective Augmentation for Deep Compact Person Re-Identification
Authors:
Víctor Ponce-López,
Tilo Burghardt,
Sion Hannunna,
Dima Damen,
Alessandro Masullo,
Majid Mirmehdi
Abstract:
We present a deep person re-identification approach that combines semantically selective, deep data augmentation with clustering-based network compression to generate high performance, light and fast inference networks. In particular, we propose to augment limited training data via sampling from a deep convolutional generative adversarial network (DCGAN), whose discriminator is constrained by a se…
▽ More
We present a deep person re-identification approach that combines semantically selective, deep data augmentation with clustering-based network compression to generate high performance, light and fast inference networks. In particular, we propose to augment limited training data via sampling from a deep convolutional generative adversarial network (DCGAN), whose discriminator is constrained by a semantic classifier to explicitly control the domain specificity of the generation process. Thereby, we encode information in the classifier network which can be utilized to steer adversarial synthesis, and which fuels our CondenseNet ID-network training. We provide a quantitative and qualitative analysis of the approach and its variants on a number of datasets, obtaining results that outperform the state-of-the-art on the LIMA dataset for long-term monitoring in indoor living spaces.
△ Less
Submitted 18 June, 2018; v1 submitted 11 June, 2018;
originally announced June 2018.
-
A Guide to the SPHERE 100 Homes Study Dataset
Authors:
Atis Elsts,
Tilo Burghardt,
Dallan Byrne,
Massimo Camplani,
Dima Damen,
Xenofon Fafoutis,
Sion Hannuna,
William Harwin,
Michael Holmes,
Balazs Janko,
Victor Ponce Lopez,
Alessandro Masullo,
Majid Mirmehdi,
George Oikonomou,
Robert Piechocki,
R. Simon Sherratt,
Emma Tonkin,
Niall Twomey,
Antonis Vafeas,
Przemyslaw Woznowski,
Ian Craddock
Abstract:
The SPHERE project has developed a multi-modal sensor platform for health and behavior monitoring in residential environments. So far, the SPHERE platform has been deployed for data collection in approximately 50 homes for duration up to one year. This technical document describes the format and the expected content of the SPHERE dataset(s) under preparation. It includes a list of some data qualit…
▽ More
The SPHERE project has developed a multi-modal sensor platform for health and behavior monitoring in residential environments. So far, the SPHERE platform has been deployed for data collection in approximately 50 homes for duration up to one year. This technical document describes the format and the expected content of the SPHERE dataset(s) under preparation. It includes a list of some data quality problems (both known to exist in the dataset(s) and potential ones), their workarounds, and other information important to people working with the SPHERE data, software, and hardware. This document does not aim to be an exhaustive descriptor of the SPHERE dataset(s); it also does not aim to discuss or validate the potential scientific uses of the SPHERE data.
△ Less
Submitted 30 October, 2018; v1 submitted 30 May, 2018;
originally announced May 2018.
-
Action Completion: A Temporal Model for Moment Detection
Authors:
Farnoosh Heidarivincheh,
Majid Mirmehdi,
Dima Damen
Abstract:
We introduce completion moment detection for actions - the problem of locating the moment of completion, when the action's goal is confidently considered achieved. The paper proposes a joint classification-regression recurrent model that predicts completion from a given frame, and then integrates frame-level contributions to detect sequence-level completion moment. We introduce a recurrent voting…
▽ More
We introduce completion moment detection for actions - the problem of locating the moment of completion, when the action's goal is confidently considered achieved. The paper proposes a joint classification-regression recurrent model that predicts completion from a given frame, and then integrates frame-level contributions to detect sequence-level completion moment. We introduce a recurrent voting node that predicts the frame's relative position of the completion moment by either classification or regression. The method is also capable of detecting incompletion. For example, the method is capable of detecting a missed ball-catch, as well as the moment at which the ball is safely caught. We test the method on 16 actions from three public datasets, covering sports as well as daily actions. Results show that when combining contributions from frames prior to the completion moment as well as frames post completion, the completion moment is detected within one second in 89% of all tested sequences.
△ Less
Submitted 23 July, 2018; v1 submitted 17 May, 2018;
originally announced May 2018.
-
Towards an Unequivocal Representation of Actions
Authors:
Michael Wray,
Davide Moltisanti,
Dima Damen
Abstract:
This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguish differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs (Fig. 1), relying on the objects to di…
▽ More
This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguish differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs (Fig. 1), relying on the objects to disambiguate interactions. We deviate from single-verb labels and introduce a map** between observations and multiple verb labels - in order to create an Unequivocal Representation of Actions. The new representation benefits from increased vocabulary and a soft assignment to an enriched space of verb labels. We learn these representations as multi-output regression, using a two-stream fusion CNN. The proposed approach outperforms conventional single-verb labels (also known as majority voting) on three egocentric datasets for both recognition and retrieval.
△ Less
Submitted 10 May, 2018;
originally announced May 2018.
-
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
Authors:
Dima Damen,
Hazel Doughty,
Giovanni Maria Farinella,
Sanja Fidler,
Antonino Furnari,
Evangelos Kazakos,
Davide Moltisanti,
Jonathan Munro,
Toby Perrett,
Will Price,
Michael Wray
Abstract:
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen…
▽ More
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. Dataset and Project page: http://epic-kitchens.github.io
△ Less
Submitted 31 July, 2018; v1 submitted 8 April, 2018;
originally announced April 2018.
-
Detecting the Moment of Completion: Temporal Models for Localising Action Completion
Authors:
Farnoosh Heidarivincheh,
Majid Mirmehdi,
Dima Damen
Abstract:
Action completion detection is the problem of modelling the action's progression towards localising the moment of completion - when the action's goal is confidently considered achieved. In this work, we assess the ability of two temporal models, namely Hidden Markov Models (HMM) and Long-Short Term Memory (LSTM), to localise completion for six object interactions: switch, plug, open, pull, pick an…
▽ More
Action completion detection is the problem of modelling the action's progression towards localising the moment of completion - when the action's goal is confidently considered achieved. In this work, we assess the ability of two temporal models, namely Hidden Markov Models (HMM) and Long-Short Term Memory (LSTM), to localise completion for six object interactions: switch, plug, open, pull, pick and drink. We use a supervised approach, where annotations of pre-completion and post-completion frames are available per action, and fine-tuned CNN features are used to train temporal models. Tested on the Action-Completion-2016 dataset, we detect completion within 10 frames of annotations for ~75% of completed action sequences using both temporal models. Results show that fine-tuned CNN features outperform hand-crafted features for localisation, and that observing incomplete instances is necessary when incomplete sequences are also present in the test set.
△ Less
Submitted 6 October, 2017;
originally announced October 2017.
-
Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination
Authors:
Hazel Doughty,
Dima Damen,
Walterio Mayol-Cuevas
Abstract:
We present a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate the problem as pairwise (who's better?) and overall (who's best?) ranking of video collections, using supervised deep ranking. We propose a novel loss function that learns discriminative features when a pair of videos exhibit variance in skill,…
▽ More
We present a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate the problem as pairwise (who's better?) and overall (who's best?) ranking of video collections, using supervised deep ranking. We propose a novel loss function that learns discriminative features when a pair of videos exhibit variance in skill, and learns shared features when a pair of videos exhibit comparable skill levels. Results demonstrate our method is applicable across tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 83% for four datasets. We demonstrate the robustness of our approach via sensitivity analysis of its parameters. We see this work as effort toward the automated organization of how-to video collections and overall, generic skill determination in video.
△ Less
Submitted 29 March, 2018; v1 submitted 29 March, 2017;
originally announced March 2017.
-
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video
Authors:
Davide Moltisanti,
Michael Wray,
Walterio Mayol-Cuevas,
Dima Damen
Abstract:
Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to cha…
▽ More
Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to changes in labeled temporal bounds, for object interaction recognition. As boundaries are trespassed, a drop of up to 10% is observed for both Improved Dense Trajectories and Two-Stream Convolutional Neural Network.
We demonstrate that such disagreement stems from a limited understanding of the distinct phases of an action, and propose annotating based on the Rubicon Boundaries, inspired by a similarly named cognitive model, for consistent temporal bounds of object interactions. Evaluated on a public dataset, we report a 4% increase in overall accuracy, and an increase in accuracy for 55% of classes when Rubicon Boundaries are used for temporal annotations.
△ Less
Submitted 26 July, 2017; v1 submitted 27 March, 2017;
originally announced March 2017.
-
Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition
Authors:
Michael Wray,
Davide Moltisanti,
Walterio Mayol-Cuevas,
Dima Damen
Abstract:
This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the map** between observations and interaction c…
▽ More
This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the map** between observations and interaction classes, as well as class overlaps, towards a probabilistic multi-label classifier that emulates human annotators. Given a video segment containing an object interaction, we model the probability for a verb, out of a list of possible verbs, to be used to annotate that interaction. The proba- bility is learnt from crowdsourced annotations, and is tested on two public datasets, comprising 1405 video sequences for which we provide annotations on 90 verbs. We outper- form conventional single-label classification by 11% and 6% on the two datasets respectively, and show that learning from annotation probabilities outperforms majority voting and enables discovery of co-occurring labels.
△ Less
Submitted 21 April, 2017; v1 submitted 24 March, 2017;
originally announced March 2017.
-
Automated capture and delivery of assistive task guidance with an eyewear computer: The GlaciAR system
Authors:
Teesid Leelasawassuk,
Dima Damen,
Walterio Mayol-Cuevas
Abstract:
In this paper we describe and evaluate a mixed reality system that aims to augment users in task guidance applications by combining automated and unsupervised information collection with minimally invasive video guides. The result is a self-contained system that we call GlaciAR (Glass-enabled Contextual Interactions for Augmented Reality), that operates by extracting contextual interactions from o…
▽ More
In this paper we describe and evaluate a mixed reality system that aims to augment users in task guidance applications by combining automated and unsupervised information collection with minimally invasive video guides. The result is a self-contained system that we call GlaciAR (Glass-enabled Contextual Interactions for Augmented Reality), that operates by extracting contextual interactions from observing users performing actions. GlaciAR is able to i) automatically determine moments of relevance based on a head motion attention model, ii) automatically produce video guidance information, iii) trigger these video guides based on an object detection method, iv) learn without supervision from observing multiple users and v) operate fully on-board a current eyewear computer (Google Glass). We describe the components of GlaciAR together with evaluations on how users are able to use the system to achieve three tasks. We see this work as a first step toward the development of systems that aim to scale up the notoriously difficult authoring problem in guidance systems and where people's natural abilities are enhanced via minimally invasive visual guidance.
△ Less
Submitted 28 December, 2016;
originally announced January 2017.
-
SEMBED: Semantic Embedding of Egocentric Action Videos
Authors:
Michael Wray,
Davide Moltisanti,
Walterio Mayol-Cuevas,
Dima Damen
Abstract:
We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion a…
▽ More
We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely annotated egocentric videos, outperforming SVM classification by more than 5%.
△ Less
Submitted 29 July, 2016; v1 submitted 28 July, 2016;
originally announced July 2016.
-
Calorie Counter: RGB-Depth Visual Estimation of Energy Expenditure at Home
Authors:
Lili Tao,
Tilo Burghardt,
Majid Mirmehdi,
Dima Damen,
Ashley Cooper,
Sion Hannuna,
Massimo Camplani,
Adeline Paiement,
Ian Craddock
Abstract:
We present a new framework for vision-based estimation of calorific expenditure from RGB-D data - the first that is validated on physical gas exchange measurements and applied to daily living scenarios. Deriving a person's energy expenditure from sensors is an important tool in tracking physical activity levels for health and lifestyle monitoring. Most existing methods use metabolic lookup tables…
▽ More
We present a new framework for vision-based estimation of calorific expenditure from RGB-D data - the first that is validated on physical gas exchange measurements and applied to daily living scenarios. Deriving a person's energy expenditure from sensors is an important tool in tracking physical activity levels for health and lifestyle monitoring. Most existing methods use metabolic lookup tables (METs) for a manual estimate or systems with inertial sensors which ultimately require users to wear devices. In contrast, the proposed pose-invariant and individual-independent vision framework allows for a remote estimation of calorific expenditure. We introduce, and evaluate our approach on, a new dataset called SPHERE-calorie, for which visual estimates can be compared against simultaneously obtained, indirect calorimetry measures based on gas exchange. % based on per breath gas exchange. We conclude from our experiments that the proposed vision pipeline is suitable for home monitoring in a controlled environment, with calorific expenditure estimates above accuracy levels of commonly used manual estimations via METs. With the dataset released, our work establishes a baseline for future research for this little-explored area of computer vision.
△ Less
Submitted 27 July, 2016;
originally announced July 2016.
-
Multiple Human Tracking in RGB-D Data: A Survey
Authors:
Massimo Camplani,
Adeline Paiement,
Majid Mirmehdi,
Dima Damen,
Sion Hannuna,
Tilo Burghardt,
Lili Tao
Abstract:
Multiple human tracking (MHT) is a fundamental task in many computer vision applications. Appearance-based approaches, primarily formulated on RGB data, are constrained and affected by problems arising from occlusions and/or illumination variations. In recent years, the arrival of cheap RGB-Depth (RGB-D) devices has {led} to many new approaches to MHT, and many of these integrate color and depth c…
▽ More
Multiple human tracking (MHT) is a fundamental task in many computer vision applications. Appearance-based approaches, primarily formulated on RGB data, are constrained and affected by problems arising from occlusions and/or illumination variations. In recent years, the arrival of cheap RGB-Depth (RGB-D) devices has {led} to many new approaches to MHT, and many of these integrate color and depth cues to improve each and every stage of the process. In this survey, we present the common processing pipeline of these methods and review their methodology based (a) on how they implement this pipeline and (b) on what role depth plays within each stage of it. We identify and introduce existing, publicly available, benchmark datasets and software resources that fuse color and depth data for MHT. Finally, we present a brief comparative evaluation of the performance of those works that have applied their methods to these datasets.
△ Less
Submitted 14 June, 2016;
originally announced June 2016.
-
You-Do, I-Learn: Unsupervised Multi-User egocentric Approach Towards Video-Based Guidance
Authors:
Dima Damen,
Teesid Leelasawassuk,
Walterio Mayol-Cuevas
Abstract:
This paper presents an unsupervised approach towards automatically extracting video-based guidance on object usage, from egocentric video and wearable gaze tracking, collected from multiple users while performing tasks. The approach i) discovers task relevant objects, ii) builds a model for each, iii) distinguishes different ways in which each discovered object has been used and iv) discovers the…
▽ More
This paper presents an unsupervised approach towards automatically extracting video-based guidance on object usage, from egocentric video and wearable gaze tracking, collected from multiple users while performing tasks. The approach i) discovers task relevant objects, ii) builds a model for each, iii) distinguishes different ways in which each discovered object has been used and iv) discovers the dependencies between object interactions. The work investigates using appearance, position, motion and attention, and presents results using each and a combination of relevant features. Moreover, an online scalable approach is presented and is compared to offline results. The paper proposes a method for selecting a suitable video guide to be displayed to a novice user indicating how to use an object, purely triggered by the user's gaze. The potential assistive mode can also recommend an object to be used next based on the learnt sequence of object interactions. The approach was tested on a variety of daily tasks such as initialising a printer, preparing a coffee and setting up a gym machine.
△ Less
Submitted 19 March, 2016; v1 submitted 16 October, 2015;
originally announced October 2015.