Showing 1–50 of 70 results for author: Damen, D

Searching in archive cs.
  1. arXiv:2404.14735  [pdf, other]

    cs.RO

    Rank2Reward: Learning Shaped Reward Functions from Passive Video

    Authors: Daniel Yang, Davin Tjia, Jacob Berg, Dima Damen, Pulkit Agrawal, Abhishek Gupta

    Abstract: Teaching robots novel skills with demonstrations via human-in-the-loop data collection techniques like kinesthetic teaching or teleoperation puts a heavy burden on human supervisors. In contrast to this paradigm, it is often significantly easier to provide raw, action-free visual data of tasks being performed. Moreover, this data can even be mined from video datasets or the web. Ideally, this data…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: ICRA 2024

  2. arXiv:2404.09933  [pdf, other]

    cs.CV

    HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

    Authors: Siddhant Bansal, Michael Wray, Dima Damen

    Abstract: Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Project Page: https://sid2697.github.io/hoi-ref/

  3. arXiv:2404.05559  [pdf, other]

    cs.CV

    TIM: A Time Interval Machine for Audio-Visual Action Recognition

    Authors: Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen

    Abstract: Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modalit…

    Submitted 9 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project Webpage: https://jacobchalk.github.io/TIM-Project

  4. arXiv:2404.05072  [pdf, other]

    cs.CV

    Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

    Authors: Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen

    Abstract: As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We int…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: 21 pages including references and appendix. Project Webpage: http://dimadamen.github.io/OSNOM/

  5. arXiv:2403.18074  [pdf, other]

    cs.CV eess.IV

    Every Shot Counts: Using Exemplars for Repetition Counting in Videos

    Authors: Saptarshi Sinha, Alexandros Stergiou, Dima Damen

    Abstract: Video repetition counting infers the number of repetitions of recurring actions or motion within a video. We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Project website: https://sinhasaptarshi.github.io/escounts

  6. arXiv:2402.02335  [pdf]

    cs.CV cs.IR

    Video Editing for Video Retrieval

    Authors: Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

    Abstract: Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative cheaper source of annotations, single timestamps, for vid…

    Submitted 3 February, 2024; originally announced February 2024.

  7. arXiv:2312.15719  [pdf, other]

    cs.CV

    Get a Grip: Reconstructing Hand-Object Stable Grasps in Egocentric Videos

    Authors: Zhifan Zhu, Dima Damen

    Abstract: We propose the task of Hand-Object Stable Grasp Reconstruction (HO-SGR), the reconstruction of frames during which the hand is stably holding the object. We first develop the stable grasp definition based on the intuition that the in-contact area between the hand and object should remain stable. By analysing the 3D ARCTIC dataset, we identify stable grasp durations and showcase that objects in sta…

    Submitted 7 April, 2024; v1 submitted 25 December, 2023; originally announced December 2023.

    Comments: webpage: https://zhifanzhu.github.io/getagrip

  8. arXiv:2312.13090  [pdf, other]

    cs.CV

    Perception Test 2023: A Summary of the First Challenge And Outcome

    Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

    Abstract: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark. The challenge had six tracks covering low-level and high-level tasks, with both a language and non-language interface, across video, audio,…

    Submitted 20 December, 2023; originally announced December 2023.

  9. arXiv:2312.07322  [pdf, other]

    cs.CV

    GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

    Authors: Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

    Abstract: We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and autom…

    Submitted 2 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  10. arXiv:2312.00598  [pdf, other]

    cs.CV cs.AI

    Learning from One Continuous Video Stream

    Authors: João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman

    Abstract: We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of str…

    Submitted 28 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: CVPR camera ready version

  11. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  12. arXiv:2311.16446  [pdf, other]

    cs.CV

    Centre Stage: Centricity-based Audio-Visual Temporal Action Detection

    Authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett

    Abstract: Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries.…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted to VUA workshop at BMVC 2023

  13. arXiv:2310.17395  [pdf, other]

    cs.CV

    Learning Temporal Sentence Grounding From Narrated EgoVideos

    Authors: Kevin Flanagan, Dima Damen, Michael Wray

    Abstract: The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens presents a new challenge for the task of Temporal Sentence Grounding (TSG). Compared to traditional benchmarks on which this task is evaluated, these datasets offer finer-grained sentences to ground in notably longer videos. In this paper, we develop an approach for learning to ground sentences in these datasets using only…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: Accepted in BMVC 2023

  14. arXiv:2308.07123  [pdf, other]

    cs.CV

    An Outlook into the Future of Egocentric Vision

    Authors: Chiara Plizzari, Gabriele Goletto, Antonino Furnari, Siddhant Bansal, Francesco Ragusa, Giovanni Maria Farinella, Dima Damen, Tatiana Tommasi

    Abstract: What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward facing cameras and digital overlays, is expected to be integrated in our every day lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through e…

    Submitted 7 February, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1w

  15. arXiv:2306.08731  [pdf, other]

    cs.CV

    EPIC Fields: Marrying 3D Geometry and Video Understanding

    Authors: Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Laina, Diane Larlus, Dima Damen, Andrea Vedaldi

    Abstract: Neural rendering is fuelling a unification of learning, 3D geometry and video understanding that has been waiting for more than two decades. Progress, however, is still hampered by a lack of suitable datasets and benchmarks. To address this gap, we introduce EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the c…

    Submitted 1 February, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Published at NeurIPS 2023. 24 pages, 15 figures. Project Webpage: http://epic-kitchens.github.io/epic-fields

  16. arXiv:2306.08713  [pdf, other]

    cs.CV

    What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations

    Authors: Chiara Plizzari, Toby Perrett, Barbara Caputo, Dima Damen

    Abstract: We propose and address a new generalisation problem: can a model trained for action recognition successfully classify actions when they are performed within a previously unseen scenario and in a previously unseen location? To answer this question, we introduce the Action Recognition Generalisation Over scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from the large-scale E…

    Submitted 24 August, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted at ICCV 2023. Project page: https://chiaraplizz.github.io/what-can-a-cook/

  17. arXiv:2305.13786  [pdf, other]

    cs.CV cs.AI cs.LG

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira

    Abstract: We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning…

    Submitted 30 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  18. arXiv:2304.01143  [pdf, other]

    cs.CV

    Use Your Head: Improving Long-Tail Video Recognition

    Authors: Toby Perrett, Saptarshi Sinha, Tilo Burghardt, Majid Mirmehdi, Dima Damen

    Abstract: This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sam…

    Submitted 3 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  19. arXiv:2302.00646  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Epic-Sounds: A Large-scale Dataset of Actions That Sound

    Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman

    Abstract: We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through groupi…

    Submitted 1 February, 2023; originally announced February 2023.

    Comments: 6 pages, 4 figures

  20. arXiv:2210.14284  [pdf, other]

    cs.CV

    Refining Action Boundaries for One-stage Detection

    Authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett

    Abstract: Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined bounda…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: Accepted to AVSS 2022. Our code is available at https://github.com/hanielwang/Refining_Boundary_Head.git

  21. arXiv:2210.11328  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Play It Back: Iterative Attention for Audio Recognition

    Authors: Alexandros Stergiou, Dima Damen

    Abstract: A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories, often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that through selective repetition attends over the most discr…

    Submitted 12 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: Accepted at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

  22. arXiv:2210.04341  [pdf, other]

    cs.CV

    ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

    Authors: Adriano Fragomeni, Michael Wray, Dima Damen

    Abstract: In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra); an encoder architecture that models the interaction between…

    Submitted 9 October, 2022; originally announced October 2022.

    Comments: Accepted in ACCV 2022

  23. arXiv:2209.13064  [pdf, other]

    cs.CV cs.AI cs.LG

    EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

    Authors: Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen

    Abstract: We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transf…

    Submitted 26 September, 2022; originally announced September 2022.

    Comments: 10 pages main, 38 pages appendix. Accepted at NeurIPS 2022 Track on Datasets and Benchmarks Data, code and leaderboards from: http://epic-kitchens.github.io/VISOR

  24. arXiv:2207.06789  [pdf, other]

    cs.CV

    Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things

    Authors: Alessandro Masullo, Toby Perrett, Tilo Burghardt, Ian Craddock, Dima Damen, Majid Mirmehdi

    Abstract: We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) which takes advantage of learning using privileged information (LUPI). We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalit…

    Submitted 14 July, 2022; originally announced July 2022.

  25. arXiv:2207.01622  [pdf, other]

    cs.CV

    Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Especially, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP from pretraining dataset, pre…

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Preprint. 4 pages, 2 figures, 5 tables. Code: https://github.com/showlab/EgoVLP. The Ego4D challenge technical report of EgoVLP arXiv:2206.01670. See EPIC challenge technical report arXiv:2207.01334 for overlap

  26. arXiv:2206.05496  [pdf, other]

    cs.CV

    An Evaluation of OCR on Egocentric Data

    Authors: Valentin Popescu, Dima Damen, Toby Perrett

    Abstract: In this paper, we evaluate state-of-the-art OCR methods on Egocentric data. We annotate text in EPIC-KITCHENS images, and demonstrate that existing OCR methods struggle with rotated text, which is frequently observed on objects being handled. We introduce a simple rotate-and-merge procedure which can be applied to pre-trained OCR models that halves the normalized edit distance error. This suggests…

    Submitted 11 June, 2022; originally announced June 2022.

    Comments: Extended Abstract, EPIC workshop at CVPR 22

  27. arXiv:2206.01670  [pdf, other]

    cs.CV cs.AI

    Egocentric Video-Language Pretraining

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create…

    Submitted 12 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted by NeurIPS 2022. Double champions at Ego4D and EPIC-Kitchens, CVPR 2022 challenges. 23 pages, 13 figures, 12 tables. Code: https://github.com/showlab/EgoVLP

  28. arXiv:2204.13340  [pdf, other]

    cs.CV

    The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction

    Authors: Alexandros Stergiou, Dima Damen

    Abstract: Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predic…

    Submitted 1 April, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

    Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023

  29. arXiv:2204.09015  [pdf, other]

    cs.CV

    Dual-Domain Image Synthesis using Segmentation-Guided GAN

    Authors: Dena Bazazian, Andrew Calway, Dima Damen

    Abstract: We introduce a segmentation-guided approach to synthesise images that integrate features from two distinct domains. Images synthesised by our dual-domain model belong to one domain within the semantic mask, and to another in the rest of the image - smoothly integrated. We build on the successes of few-shot StyleGAN and single-shot semantic segmentation to minimise the amount of training required i…

    Submitted 19 April, 2022; originally announced April 2022.

    Comments: CVPR2022 Workshops. 14 pages, 19 figures

  30. arXiv:2201.04906  [pdf, other]

    cs.CV

    Hand-Object Interaction Reasoning

    Authors: Jian Ma, Dima Damen

    Abstract: This paper proposes an interaction reasoning network for modelling spatio-temporal relationships between hands and objects in video. The proposed interaction unit utilises a Transformer module to reason about each acting hand, and its spatio-temporal relation to the other hand as well as objects being interacted with. We show that modelling two-handed interactions are critical for action recogniti…

    Submitted 13 January, 2022; originally announced January 2022.

  31. arXiv:2201.00434  [pdf, other]

    cs.CV

    TVNet: Temporal Voting Network for Action Localization

    Authors: Hanyuan Wang, Dima Damen, Majid Mirmehdi, Toby Perrett

    Abstract: We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. This incorporates a novel Voting Evidence Module to locate temporal boundaries, more accurately, where temporal contextual evidence is accumulated to predict frame-level probabilities of start and end action boundaries. Our action-independent evidence module is incorporated within a pipeline to calculate conf…

    Submitted 2 January, 2022; originally announced January 2022.

    Comments: 9 pages, 7 figures, 11 tables

  32. arXiv:2112.10194  [pdf, other]

    cs.CV

    UnweaveNet: Unweaving Activity Stories

    Authors: Will Price, Carl Vondrick, Dima Damen

    Abstract: Our lives can be seen as a complex weaving of activities; we switch from one activity to another, to maximise our achievements or in reaction to demands placed upon us. Observing a video of unscripted daily activities, we parse the video into its constituent activity threads through a process we call unweaving. To accomplish this, we introduce a video representation explicitly capturing activity t…

    Submitted 4 April, 2022; v1 submitted 19 December, 2021; originally announced December 2021.

    Comments: Accepted at IEEE/CVF Computer Vision and Pattern Recognition (CVPR) 2022

  33. arXiv:2111.01024  [pdf, other]

    cs.CV cs.SD eess.AS

    With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

    Authors: Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

    Abstract: In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action s…

    Submitted 1 November, 2021; originally announced November 2021.

    Comments: Accepted at BMVC 2021

  34. arXiv:2110.12812  [pdf, other]

    cs.CV cs.LG

    Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval

    Authors: Jonathan Munro, Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

    Abstract: Given a gallery of uncaptioned video sequences, this paper considers the task of retrieving videos based on their relevance to an unseen text query. To compensate for the lack of annotations, we rely instead on a related video gallery composed of video-caption pairs, termed the source gallery, albeit with a domain gap between its videos and those in the target gallery. We thus introduce the proble…

    Submitted 25 October, 2021; originally announced October 2021.

    Comments: 15 pages

  35. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  36. arXiv:2103.10095  [pdf, other]

    cs.CV

    On Semantic Similarity in Video Retrieval

    Authors: Michael Wray, Hazel Doughty, Dima Damen

    Abstract: Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not indicative of models' retrieval capabilities. We propose a move to semantic similarity video retrieval, where (i) multiple videos/captions can be deemed eq…

    Submitted 18 March, 2021; originally announced March 2021.

    Comments: Accepted in CVPR 2021. Project Page: https://mwray.github.io/SSVR/

  37. arXiv:2103.03516  [pdf, other]

    cs.SD cs.CV eess.AS

    Slow-Fast Auditory Streams For Audio Recognition

    Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

    Abstract: We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the…

    Submitted 5 March, 2021; originally announced March 2021.

    Comments: Accepted for presentation at ICASSP 2021

  38. arXiv:2101.06184  [pdf, other]

    cs.CV

    Temporal-Relational CrossTransformers for Few-Shot Action Recognition

    Authors: Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen

    Abstract: We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video represent…

    Submitted 28 March, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

    Comments: Accepted in CVPR 2021

  39. arXiv:2011.12372  [pdf, other]

    cs.CV cs.LG

    Play Fair: Frame Attributions in Video Models

    Authors: Will Price, Dima Damen

    Abstract: In this paper, we introduce an attribution method for explaining action recognition models. Such models fuse information from multiple frames within a video, through score aggregation or relational reasoning. We break down a model's class score into the sum of contributions from each frame, fairly. Our method adapts an axiomatic solution to fair reward distribution in cooperative games, known as t…

    Submitted 24 November, 2020; originally announced November 2020.

    Comments: Code available at: https://github.com/willprice/play-fair/ and supporting website at: https://play-fair.willprice.dev/

  40. arXiv:2008.09890  [pdf, other]

    cs.CV

    Supervision Levels Scale (SLS)

    Authors: Dima Damen, Michael Wray

    Abstract: We propose a three-dimensional discrete and incremental scale to encode a method's level of supervision - i.e. the data and labels used when training a model to achieve a given performance. We capture three aspects of supervision, that are known to give methods an advantage while requiring additional costs: pre-training, training labels and training data. The proposed three-dimensional scale can b…

    Submitted 22 August, 2020; originally announced August 2020.

  41. arXiv:2007.14658  [pdf, other]

    cs.CV

    Meta-Learning with Context-Agnostic Initialisations

    Authors: Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen

    Abstract: Meta-learning approaches have addressed few-shot problems by finding initialisations suited for fine-tuning to target tasks. Often there are additional properties within training data (which we refer to as context), not relevant to the target task, which act as a distractor to meta-learning, particularly when the target task contains examples from a novel context not seen during training. We addre…

    Submitted 22 October, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

    Comments: Accepted at ACCV 2020

  42. Rescaling Egocentric Vision

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a nov…

    Submitted 17 September, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: Accepted at the International Journal of Computer Vision (IJCV). Dataset available from: http://epic-kitchens.github.io/

  43. arXiv:2005.00343  [pdf, other]

    cs.CV

    The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions.…

    Submitted 29 April, 2020; originally announced May 2020.

    Comments: Preprint for paper at IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1804.02748

  44. arXiv:2001.09691  [pdf, other]

    cs.CV

    Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

    Authors: Jonathan Munro, Dima Damen

    Abstract: Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and t…

    Submitted 19 March, 2020; v1 submitted 27 January, 2020; originally announced January 2020.

    Comments: Accepted to CVPR 2020 for an oral presentation

  45. arXiv:1912.06617  [pdf, other]

    cs.CV

    Action Modifiers: Learning from Adverbs in Instructional Videos

    Authors: Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

    Abstract: We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' wi…

    Submitted 24 March, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: CVPR 2020

  46. arXiv:1910.09920  [pdf, other]

    cs.CV

    Weakly-Supervised Completion Moment Detection using Temporal Attention

    Authors: Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen

    Abstract: Monitoring the progression of an action towards completion offers fine grained insight into the actor's behaviour. In this work, we target detecting the completion moment of actions, that is the moment when the action's goal has been successfully accomplished. This has potential applications from surveillance to assistive living and human-robot interactions. Previous effort required human annotati…

    Submitted 22 October, 2019; originally announced October 2019.

  47. Sit-to-Stand Analysis in the Wild using Silhouettes for Longitudinal Health Monitoring

    Authors: Alessandro Masullo, Tilo Burghardt, Toby Perrett, Dima Damen, Majid Mirmehdi

    Abstract: We present the first fully automated Sit-to-Stand or Stand-to-Sit (StS) analysis framework for long-term monitoring of patients in free-living environments using video silhouettes. Our method adopts a coarse-to-fine time localisation approach, where a deep learning classifier identifies possible StS sequences from silhouettes, and a smart peak detection stage provides fine localisation based on 3D…

    Submitted 3 October, 2019; originally announced October 2019.

  48. arXiv:1909.09422  [pdf, other]

    cs.CV

    Retro-Actions: Learning 'Close' by Time-Reversing 'Open' Videos

    Authors: Will Price, Dima Damen

    Abstract: We investigate video transforms that result in class-homogeneous label-transforms. These are video transforms that consistently maintain or modify the labels of all videos in each class. We propose a general approach to discover invariant classes, whose transformed examples maintain their label; pairs of equivariant classes, whose transformed examples exchange their labels; and novel-generating cl…

    Submitted 20 September, 2019; originally announced September 2019.

    Comments: ICCVW 2019, 8 pages, 7 figures, 6 tables. https://video-reversal.willprice.dev/

  49. arXiv:1908.08498  [pdf, other]

    cs.CV

    EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

    Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

    Abstract: We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previ…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: Accepted for presentation at ICCV 2019

  50. arXiv:1908.03477  [pdf, other]

    cs.CV

    Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

    Authors: Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

    Abstract: We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space, that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS t…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Accepted for presentation at ICCV. Project Page: https://mwray.github.io/FGAR