Showing 1–13 of 13 results for author: Moltisanti, D

  1. arXiv:2405.07723  [pdf, other]

    cs.CV

    Coarse or Fine? Recognising Action End States without Labels

    Authors: Davide Moltisanti, Hakan Bilen, Laura Sevilla-Lara, Frank Keller

    Abstract: We focus on the problem of recognising the end state of an action in an image, which is critical for understanding what action is performed and in which manner. We study this focusing on the task of predicting the coarseness of a cut, i.e., deciding whether an object was cut "coarsely" or "finely". No dataset with these annotated end states is available, so we propose an augmentation method to syn…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: The Eleventh Workshop on Fine-Grained Visual Categorization (CVPR 24)

  2. arXiv:2311.15964  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Efficient Pre-training for Localized Instruction Generation of Videos

    Authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

    Abstract: Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leverag…

    Submitted 23 May, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: updated version

  3. arXiv:2303.15086  [pdf, other]

    cs.CV

    Learning Action Changes by Measuring Verb-Adverb Textual Relationships

    Authors: Davide Moltisanti, Frank Keller, Hakan Bilen, Laura Sevilla-Lara

    Abstract: The goal of this work is to understand the way actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our appro…

    Submitted 23 May, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

    Comments: CVPR 23. Version 2 updates some results due to an errata (see code repository for more details). Code and dataset available at https://github.com/dmoltisanti/air-cvpr23

  4. arXiv:2210.04933  [pdf, other]

    cs.CV

    An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition

    Authors: Kiyoon Kim, Davide Moltisanti, Oisin Mac Aodha, Laura Sevilla-Lara

    Abstract: Precisely naming the action depicted in a video can be a challenging and oftentimes ambiguous task. In contrast to object instances represented as nouns (e.g. dog, cat, chair, etc.), in the case of actions, human annotators typically lack a consensus as to what constitutes a specific action (e.g. jogging versus running). In practice, a given video can contain multiple valid positive annotations fo…

    Submitted 10 October, 2022; originally announced October 2022.

    Comments: BMVC 2022

  5. arXiv:2207.10120  [pdf, other]

    cs.CV

    BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis

    Authors: Davide Moltisanti, Jinyi Wu, Bo Dai, Chen Change Loy

    Abstract: Generative models for audio-conditioned dance motion synthesis map music features to dance movements. Models are trained to associate motion patterns to audio patterns, usually without an explicit knowledge of the human body. This approach relies on a few assumptions: strong music-dance correlation, controlled motion data and relatively simple poses and movements. These characteristics are found i…

    Submitted 22 July, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: ECCV 2022. Dataset available at https://github.com/dmoltisanti/brace

  6. Rescaling Egocentric Vision

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a nov…

    Submitted 17 September, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: Accepted at the International Journal of Computer Vision (IJCV). Dataset available from: http://epic-kitchens.github.io/

  7. arXiv:2005.00343  [pdf, other]

    cs.CV

    The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions.…

    Submitted 29 April, 2020; originally announced May 2020.

    Comments: Preprint for paper at IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1804.02748

  8. arXiv:1904.04689  [pdf, other]

    cs.CV

    Action Recognition from Single Timestamp Supervision in Untrimmed Videos

    Authors: Davide Moltisanti, Sanja Fidler, Dima Damen

    Abstract: Recognising actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos, however it is challenged when the number of different actions in training videos increases. W…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: CVPR 2019

  9. arXiv:1805.04026  [pdf, other]

    cs.CV

    Towards an Unequivocal Representation of Actions

    Authors: Michael Wray, Davide Moltisanti, Dima Damen

    Abstract: This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguish differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs (Fig. 1), relying on the objects to di…

    Submitted 10 May, 2018; originally announced May 2018.

  10. arXiv:1804.02748  [pdf, other]

    cs.CV

    Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    Authors: Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

    Abstract: First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen…

    Submitted 31 July, 2018; v1 submitted 8 April, 2018; originally announced April 2018.

    Comments: European Conference on Computer Vision (ECCV) 2018. Dataset and Project page: http://epic-kitchens.github.io

  11. arXiv:1703.09026  [pdf, other]

    cs.CV

    Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

    Authors: Davide Moltisanti, Michael Wray, Walterio Mayol-Cuevas, Dima Damen

    Abstract: Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to cha…

    Submitted 26 July, 2017; v1 submitted 27 March, 2017; originally announced March 2017.

    Comments: ICCV 2017

  12. arXiv:1703.08338  [pdf, other]

    cs.CV

    Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition

    Authors: Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen

    Abstract: This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the mapping between observations and interaction c…

    Submitted 21 April, 2017; v1 submitted 24 March, 2017; originally announced March 2017.

  13. arXiv:1607.08414  [pdf, other]

    cs.CV

    SEMBED: Semantic Embedding of Egocentric Action Videos

    Authors: Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas, Dima Damen

    Abstract: We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion a…

    Submitted 29 July, 2016; v1 submitted 28 July, 2016; originally announced July 2016.