
Showing 1–17 of 17 results for author: Doughty, H

  1. arXiv:2401.04716  [pdf, other]

    cs.CV

    Low-Resource Vision Challenges for Foundation Models

    Authors: Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

    Abstract: Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we address this gap and explore the challenges of low-resource image tasks with vision foundation models. We first collect a benchmark of genuinely low-resource image dat…

    Submitted 11 April, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

    Comments: Accepted at CVPR2024

  2. arXiv:2310.19776  [pdf, other]

    cs.CV cs.AI cs.IT cs.LG

    Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery

    Authors: Sarah Rastegar, Hazel Doughty, Cees G. M. Snoek

    Abstract: In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category?…

    Submitted 18 January, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023

    ACM Class: I.2.1.b; I.2.6.g; I.5.4.b; I.4

  3. arXiv:2306.12795  [pdf, other]

    cs.CV cs.LG cs.MM

    Learning Unseen Modality Interaction

    Authors: Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

    Abstract: Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploi…

    Submitted 25 October, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

    Comments: Published at NeurIPS 2023

  4. arXiv:2303.11003  [pdf, other]

    cs.CV cs.AI

    Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization

    Authors: Fida Mohammad Thoker, Hazel Doughty, Cees Snoek

    Abstract: We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos which…

    Submitted 28 September, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: Accepted in ICCV 2023

  5. arXiv:2212.02053  [pdf, other]

    cs.CV

    Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight

    Authors: Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

    Abstract: This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards the lower color contrast at test-time. To compensate for the lack of la…

    Submitted 27 August, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

    Comments: Under review

  6. arXiv:2203.14240  [pdf, other]

    cs.CV

    Audio-Adaptive Activity Recognition Across Video Domains

    Authors: Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G. M. Snoek

    Abstract: This paper strives for activity recognition under domain shift, for example caused by change of scenery or camera viewpoint. The leading approaches reduce the shift in activity appearance by adversarial training and self-supervised learning. Different from these vision-focused works we leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicat…

    Submitted 29 March, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

    Comments: Accepted at CVPR 2022

  7. arXiv:2203.14221  [pdf, other]

    cs.CV

    How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?

    Authors: Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, Cees Snoek

    Abstract: Despite the recent success of video self-supervised learning models, there is much still to be understood about their generalization capability. In this paper, we investigate how sensitive video self-supervised learning is to the current conventional benchmark and whether methods generalize beyond the canonical evaluation setting. We do this across four different factors of sensitivity: domain, sa…

    Submitted 30 July, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

    Comments: Accepted in ECCV 2022

  8. arXiv:2203.12344  [pdf, other]

    cs.CV

    How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs

    Authors: Hazel Doughty, Cees G. M. Snoek

    Abstract: We aim to understand how actions are performed and identify subtle differences, such as 'fold firmly' vs. 'fold gently'. To this end, we propose a method which recognizes adverbs across different actions. However, such fine-grained annotations are difficult to obtain and their long-tailed nature makes it challenging to recognize adverbs in rare action-adverb compositions. Our approach therefore us…

    Submitted 10 June, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: Accepted to CVPR 2022

  9. arXiv:2108.03656  [pdf, other]

    cs.CV

    Skeleton-Contrastive 3D Action Representation Learning

    Authors: Fida Mohammad Thoker, Hazel Doughty, Cees G. M. Snoek

    Abstract: This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations i…

    Submitted 8 August, 2021; originally announced August 2021.

    Comments: Accepted in ACM Multimedia 2021

  10. arXiv:2103.10095  [pdf, other]

    cs.CV

    On Semantic Similarity in Video Retrieval

    Authors: Michael Wray, Hazel Doughty, Dima Damen

    Abstract: Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not indicative of models' retrieval capabilities. We propose a move to semantic similarity video retrieval, where (i) multiple videos/captions can be deemed eq…

    Submitted 18 March, 2021; originally announced March 2021.

    Comments: Accepted in CVPR 2021. Project Page: https://mwray.github.io/SSVR/

  11. arXiv:2101.03787  [pdf, other]

    cs.CV

    WiCV 2020: The Seventh Women In Computer Vision Workshop

    Authors: Hazel Doughty, Nour Karessli, Kathryn Leonard, Boyi Li, Carianne Martinez, Azadeh Mobasher, Arsha Nagrani, Srishti Yadav

    Abstract: In this paper we present the details of the Women in Computer Vision Workshop - WiCV 2020, organized alongside virtual CVPR 2020. This event aims at encouraging women researchers in the field of computer vision. It provides a voice to a minority (female) group in the computer vision community and focuses on increasing the visibility of these researchers, both in academia and industry. WiCV believ…

    Submitted 11 January, 2021; originally announced January 2021.

  12. Rescaling Egocentric Vision

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a nov…

    Submitted 17 September, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: Accepted at the International Journal of Computer Vision (IJCV). Dataset available from: http://epic-kitchens.github.io/

  13. arXiv:2005.00343  [pdf, other]

    cs.CV

    The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions.…

    Submitted 29 April, 2020; originally announced May 2020.

    Comments: Preprint for paper at IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1804.02748

  14. arXiv:1912.06617  [pdf, other]

    cs.CV

    Action Modifiers: Learning from Adverbs in Instructional Videos

    Authors: Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

    Abstract: We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' wi…

    Submitted 24 March, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: CVPR 2020

  15. arXiv:1812.05538  [pdf, other]

    cs.CV

    The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos

    Authors: Hazel Doughty, Walterio Mayol-Cuevas, Dima Damen

    Abstract: We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a meth…

    Submitted 10 April, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

    Comments: CVPR 2019

  16. arXiv:1804.02748  [pdf, other]

    cs.CV

    Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen…

    Submitted 31 July, 2018; v1 submitted 8 April, 2018; originally announced April 2018.

    Comments: European Conference on Computer Vision (ECCV) 2018. Dataset and Project page: http://epic-kitchens.github.io

  17. arXiv:1703.09913  [pdf, other]

    cs.CV

    Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination

    Authors: Hazel Doughty, Dima Damen, Walterio Mayol-Cuevas

    Abstract: We present a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate the problem as pairwise (who's better?) and overall (who's best?) ranking of video collections, using supervised deep ranking. We propose a novel loss function that learns discriminative features when a pair of videos exhibit variance in skill,…

    Submitted 29 March, 2018; v1 submitted 29 March, 2017; originally announced March 2017.

    Comments: CVPR 2018