Skip to main content

Showing 1–4 of 4 results for author: Fragomeni, A

.
  1. arXiv:2402.02335  [pdf

    cs.CV cs.IR

    Video Editing for Video Retrieval

    Authors: Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

    Abstract: Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative cheaper source of annotations, single timestamps, for vid… ▽ More

    Submitted 3 February, 2024; originally announced February 2024.

  2. arXiv:2210.04341  [pdf, other

    cs.CV

    ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

    Authors: Adriano Fragomeni, Michael Wray, Dima Damen

    Abstract: In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra); an encoder architecture that models the interaction between… ▽ More

    Submitted 9 October, 2022; originally announced October 2022.

    Comments: Accepted in ACCV 2022

  3. arXiv:2110.07058  [pdf, other

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do , et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons… ▽ More

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  4. 3D Hand Pose and Shape Estimation from RGB Images for Keypoint-Based Hand Gesture Recognition

    Authors: Danilo Avola, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Adriano Fragomeni, Daniele Pannone

    Abstract: Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimations can be computed from single RGB images, especially when a multi-task learning approach is used to force the system to consider the shape of the hand when its pose i… ▽ More

    Submitted 9 May, 2022; v1 submitted 28 September, 2021; originally announced September 2021.