Skip to main content

Showing 1–22 of 22 results for author: Wray, M

Searching in archive cs. Search in all archives.
.
  1. arXiv:2404.09933  [pdf, other

    cs.CV

    HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

    Authors: Siddhant Bansal, Michael Wray, Dima Damen

    Abstract: Large Vision Language Models (VLMs) are now the de facto state-of-the-art for a number of tasks including visual question answering, recognising objects, and spatial referral. In this work, we propose the HOI-Ref task for egocentric images that aims to understand interactions between hands and objects using VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M question-answer… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Project Page: https://sid2697.github.io/hoi-ref/

  2. arXiv:2402.11057  [pdf, other

    cs.CV

    Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos

    Authors: Shijia Feng, Michael Wray, Brian Sullivan, Youngkyoon Jang, Casimir Ludwig, Iain Gilchrist, Walterio Mayol-Cuevas

    Abstract: Determining when people are struggling from video enables a finer-grained understanding of actions and opens opportunities for building intelligent support visual interfaces. In this paper, we present a new dataset with three assembly activities and corresponding performance baselines for the determination of struggle from video. Three real-world problem-solving activities including assembling plu… ▽ More

    Submitted 28 February, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  3. arXiv:2402.02335  [pdf

    cs.CV cs.IR

    Video Editing for Video Retrieval

    Authors: Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

    Abstract: Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative cheaper source of annotations, single timestamps, for vid… ▽ More

    Submitted 3 February, 2024; originally announced February 2024.

  4. arXiv:2312.07322  [pdf, other

    cs.CV

    GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

    Authors: Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

    Abstract: We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and autom… ▽ More

    Submitted 2 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  5. arXiv:2311.18259  [pdf, other

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, **g Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, **g Huang, Md Mohaiminul Islam, Suyog Jain , et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from… ▽ More

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  6. arXiv:2310.17395  [pdf, other

    cs.CV

    Learning Temporal Sentence Grounding From Narrated EgoVideos

    Authors: Kevin Flanagan, Dima Damen, Michael Wray

    Abstract: The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens presents a new challenge for the task of Temporal Sentence Grounding (TSG). Compared to traditional benchmarks on which this task is evaluated, these datasets offer finer-grained sentences to ground in notably longer videos. In this paper, we develop an approach for learning to ground sentences in these datasets using only… ▽ More

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: Accepted in BMVC 2023

  7. arXiv:2210.04341  [pdf, other

    cs.CV

    ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

    Authors: Adriano Fragomeni, Michael Wray, Dima Damen

    Abstract: In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra); an encoder architecture that models the interaction between… ▽ More

    Submitted 9 October, 2022; originally announced October 2022.

    Comments: Accepted in ACCV 2022

  8. arXiv:2207.01622  [pdf, other

    cs.CV

    Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

    Authors: Kevin Qinghong Lin, Alex **peng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Especially, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP from pretraining dataset, pre… ▽ More

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Preprint. 4 pages, 2 figures, 5 tables. Code: https://github.com/showlab/EgoVLP. The Ego4D challenge technical report of EgoVLP arXiv:2206.01670. See EPIC challenge technical report arXiv:2207.01334 for overlap

  9. arXiv:2206.01670  [pdf, other

    cs.CV cs.AI

    Egocentric Video-Language Pretraining

    Authors: Kevin Qinghong Lin, Alex **peng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create… ▽ More

    Submitted 12 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted by NeurIPS 2022. Double champions at Ego4D and EPIC-Kitchens, CVPR 2022 challenges. 23 pages, 13 figures, 12 tables. Code: https://github.com/showlab/EgoVLP

  10. arXiv:2110.12812  [pdf, other

    cs.CV cs.LG

    Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval

    Authors: Jonathan Munro, Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

    Abstract: Given a gallery of uncaptioned video sequences, this paper considers the task of retrieving videos based on their relevance to an unseen text query. To compensate for the lack of annotations, we rely instead on a related video gallery composed of video-caption pairs, termed the source gallery, albeit with a domain gap between its videos and those in the target gallery. We thus introduce the proble… ▽ More

    Submitted 25 October, 2021; originally announced October 2021.

    Comments: 15 pages

  11. arXiv:2110.07058  [pdf, other

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do , et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons… ▽ More

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  12. arXiv:2103.10095  [pdf, other

    cs.CV

    On Semantic Similarity in Video Retrieval

    Authors: Michael Wray, Hazel Doughty, Dima Damen

    Abstract: Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not indicative of models' retrieval capabilities. We propose a move to semantic similarity video retrieval, where (i) multiple videos/captions can be deemed eq… ▽ More

    Submitted 18 March, 2021; originally announced March 2021.

    Comments: Accepted in CVPR 2021. Project Page: https://mwray.github.io/SSVR/

  13. arXiv:2008.09890  [pdf, other

    cs.CV

    Supervision Levels Scale (SLS)

    Authors: Dima Damen, Michael Wray

    Abstract: We propose a three-dimensional discrete and incremental scale to encode a method's level of supervision - i.e. the data and labels used when training a model to achieve a given performance. We capture three aspects of supervision, that are known to give methods an advantage while requiring additional costs: pre-training, training labels and training data. The proposed three-dimensional scale can b… ▽ More

    Submitted 22 August, 2020; originally announced August 2020.

  14. Rescaling Egocentric Vision

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a nov… ▽ More

    Submitted 17 September, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: Accepted at the International Journal of Computer Vision (IJCV). Dataset available from: http://epic-kitchens.github.io/

  15. arXiv:2005.00343  [pdf, other

    cs.CV

    The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions.… ▽ More

    Submitted 29 April, 2020; originally announced May 2020.

    Comments: Preprint for paper at IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1804.02748

  16. arXiv:1908.03477  [pdf, other

    cs.CV

    Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

    Authors: Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

    Abstract: We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space, that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS t… ▽ More

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Accepted for presentation at ICCV. Project Page: https://mwray.github.io/FGAR

  17. arXiv:1907.11117  [pdf, other

    cs.CV

    Learning Visual Actions Using Multiple Verb-Only Labels

    Authors: Michael Wray, Dima Damen

    Abstract: This work introduces verb-only representations for both recognition and retrieval of visual actions, in video. Current methods neglect legitimate semantic ambiguities between verbs, instead choosing unambiguous subsets of verbs along with objects to disambiguate the actions. We instead propose multiple verb-only labels, which we learn through hard or soft assignment as a regression. This enables l… ▽ More

    Submitted 1 August, 2019; v1 submitted 25 July, 2019; originally announced July 2019.

    Comments: Accepted at BMVC 2019. More information can be found at https://mwray.github.io/MVOL/. Annotations can be found at https://github.com/mwray/Multi-Verb-Labels

  18. arXiv:1805.04026  [pdf, other

    cs.CV

    Towards an Unequivocal Representation of Actions

    Authors: Michael Wray, Davide Moltisanti, Dima Damen

    Abstract: This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguish differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs (Fig. 1), relying on the objects to di… ▽ More

    Submitted 10 May, 2018; originally announced May 2018.

  19. arXiv:1804.02748  [pdf, other

    cs.CV

    Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen… ▽ More

    Submitted 31 July, 2018; v1 submitted 8 April, 2018; originally announced April 2018.

    Comments: European Conference on Computer Vision (ECCV) 2018 Dataset and Project page: http://epic-kitchens.github.io

  20. arXiv:1703.09026  [pdf, other

    cs.CV

    Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

    Authors: Davide Moltisanti, Michael Wray, Walterio Mayol-Cuevas, Dima Damen

    Abstract: Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to cha… ▽ More

    Submitted 26 July, 2017; v1 submitted 27 March, 2017; originally announced March 2017.

    Comments: ICCV 2017

  21. arXiv:1703.08338  [pdf, other

    cs.CV

    Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition

    Authors: Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen

    Abstract: This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the map** between observations and interaction c… ▽ More

    Submitted 21 April, 2017; v1 submitted 24 March, 2017; originally announced March 2017.

  22. arXiv:1607.08414  [pdf, other

    cs.CV

    SEMBED: Semantic Embedding of Egocentric Action Videos

    Authors: Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen

    Abstract: We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion a… ▽ More

    Submitted 29 July, 2016; v1 submitted 28 July, 2016; originally announced July 2016.