Showing 1–21 of 21 results for author: Pérez-Rúa, J

  1. arXiv:2312.16218  [pdf, other]

    cs.CV

    Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks

    Authors: Christian Simon, Sen He, Juan-Manuel Perez-Rua, Mengmeng Xu, Amine Benhalloum, Tao Xiang

    Abstract: Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the…

    Submitted 5 January, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

  2. arXiv:2312.04557  [pdf, other]

    cs.CV

    GenTron: Diffusion Transformers for Image and Video Generation

    Authors: Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua

    Abstract: In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-bas…

    Submitted 2 June, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: CVPR2024 Camera Ready. Website: https://www.shoufachen.com/gentron_website/

  3. arXiv:2310.05922  [pdf, other]

    cs.CV

    FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

    Authors: Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He

    Abstract: Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be…

    Submitted 29 February, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted by ICLR2024. Project page: https://flatten-video-editing.github.io/

  4. arXiv:2304.02934  [pdf, other]

    cs.CV

    Boundary-Denoising for Video Activity Localization

    Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Pérez-Rúa, Bernard Ghanem

    Abstract: Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and…

    Submitted 6 April, 2023; originally announced April 2023.

  5. arXiv:2211.14905  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Multi-Modal Few-Shot Temporal Action Detection

    Authors: Sauradip Nag, Mengmeng Xu, Xiatian Zhu, Juan-Manuel Perez-Rua, Bernard Ghanem, Yi-Zhe Song, Tao Xiang

    Abstract: Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot…

    Submitted 27 March, 2023; v1 submitted 27 November, 2022; originally announced November 2022.

    Comments: Technical Report

  6. arXiv:2211.10528  [pdf, other]

    cs.CV

    Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

    Authors: Mengmeng Xu, Yanghao Li, Cheng-Yang Fu, Bernard Ghanem, Tao Xiang, Juan-Manuel Perez-Rua

    Abstract: This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. Then, we directly tackle such biases at both frame and object set levels. Concretely, our me…

    Submitted 6 April, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

    Comments: We ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge

  7. arXiv:2208.01949  [pdf, other]

    cs.CV

    Negative Frames Matter in Egocentric Visual Query 2D Localization

    Authors: Mengmeng Xu, Cheng-Yang Fu, Yanghao Li, Bernard Ghanem, Juan-Manuel Perez-Rua, Tao Xiang

    Abstract: The recently released Ego4D dataset and benchmark significantly scales and diversifies the first-person visual perception data. In Ego4D, the Visual Queries 2D Localization task aims to retrieve objects appeared in the past from the recording in the first-person view. This task requires a system to spatially and temporally localize the most recent appearance of a given object query, where query is…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: First place winning solution for VQ2D task in CVPR-2022 Ego4D Challenge. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr

  8. arXiv:2203.13903  [pdf, other]

    cs.CV

    Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection

    Authors: Li Yin, Juan M Perez-Rua, Kevin J Liang

    Abstract: We study the challenging incremental few-shot object detection (iFSD) setting. Recently, hypernetwork-based approaches have been studied in the context of continuous and finetune-free iFSD with limited success. We take a closer look at important design choices of such methods, leading to several key improvements and resulting in a more accurate and flexible framework, which we call Sylph. In parti…

    Submitted 4 April, 2022; v1 submitted 25 March, 2022; originally announced March 2022.

    Comments: Accepted to CVPR 2022

  9. arXiv:2110.02902  [pdf, ps, other]

    cs.CV

    SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021

    Authors: Swathikiran Sudhakaran, Adrian Bulat, Juan-Manuel Perez-Rua, Alex Falcon, Sergio Escalera, Oswald Lanz, Brais Martinez, Georgios Tzimiropoulos

    Abstract: This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021. To participate in the challenge we deployed spatio-temporal feature extraction and aggregation models we have developed recently: GSF and XViT. GSF is an efficient spatio-temporal feature extracting module that can be plugged into 2D CNNs for video action recognition. XViT is a…

    Submitted 6 October, 2021; originally announced October 2021.

    Comments: Ranked third in the EPIC-Kitchens-100 Action Recognition Challenge @ CVPR 2021

  10. arXiv:2106.11173  [pdf, other]

    cs.CV

    TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification

    Authors: Andrés Villa, Juan-Manuel Perez-Rua, Victor Escorcia, Vladimir Araujo, Juan Carlos Niebles, Alvaro Soto

    Abstract: Recently, few-shot video classification has received an increasing interest. Current approaches mostly focus on effectively exploiting the temporal dimension in videos to improve learning under low data regimes. However, most works have largely ignored that videos are often accompanied by rich textual descriptions that can also be an essential source of information to handle few-shot recognition c…

    Submitted 15 December, 2021; v1 submitted 21 June, 2021; originally announced June 2021.

    Comments: 18 pages including references, 6 figures, and 3 tables

  11. arXiv:2106.05968  [pdf, other]

    cs.CV cs.AI cs.LG

    Space-time Mixing Attention for Video Transformer

    Authors: Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

    Abstract: This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model the complexity of which scales linear…

    Submitted 11 June, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

    Comments: Updated results on SSv2

  12. arXiv:2103.15233  [pdf, other]

    cs.CV

    Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

    Authors: Mengmeng Xu, Juan-Manuel Perez-Rua, Xiatian Zhu, Bernard Ghanem, Brais Martinez

    Abstract: Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder -- trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is n…

    Submitted 29 October, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

    Comments: To appear at NeurIPS 2021. 15 pages, 1 figure

  13. arXiv:2101.08085  [pdf, other]

    cs.CV

    Few-shot Action Recognition with Prototype-centered Attentive Learning

    Authors: Xiatian Zhu, Antoine Toisoul, Juan-Manuel Perez-Rua, Li Zhang, Brais Martinez, Tao Xiang

    Abstract: Few-shot action recognition aims to recognize action classes with few training samples. Most existing methods adopt a meta-learning approach with episodic training. In each episode, the few samples in a meta-training task are split into support and query sets. The former is used to build a classifier, which is then evaluated on the latter using a query-centered loss for model updating. There are h…

    Submitted 28 March, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

    Comments: 10 pages, 4 figures

    Journal ref: BMVC 2021

  14. arXiv:2011.10830  [pdf, other]

    cs.CV

    Boundary-sensitive Pre-training for Temporal Localization in Videos

    Authors: Mengmeng Xu, Juan-Manuel Perez-Rua, Victor Escorcia, Brais Martinez, Xiatian Zhu, Li Zhang, Bernard Ghanem, Tao Xiang

    Abstract: Many video analysis tasks require temporal localization thus detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore no suitable datasets exist for temporal boundary-sensitive pre-training. In this pape…

    Submitted 26 March, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

    Comments: 11 pages, 4 figures

  15. arXiv:2007.01883  [pdf, other]

    cs.CV

    Egocentric Action Recognition by Video Attention and Temporal Context

    Authors: Juan-Manuel Perez-Rua, Antoine Toisoul, Brais Martinez, Victor Escorcia, Li Zhang, Xiatian Zhu, Tao Xiang

    Abstract: We present the submission of Samsung AI Centre Cambridge to the CVPR2020 EPIC-Kitchens Action Recognition Challenge. In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label given an input trimmed video clip. That is, a `verb' and a `noun' together define a compositional `action' class. The challenging aspects of this real-li…

    Submitted 3 July, 2020; originally announced July 2020.

    Comments: EPIC-Kitchens challenges@CVPR 2020

  16. arXiv:2004.01278  [pdf, other]

    cs.CV

    Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention

    Authors: Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine Toisoul, Victor Escorcia, Tao Xiang

    Abstract: Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is challenging for two reasons. First, an effective attention module needs to learn what (objects and their local motion patterns), where (spatially), and when (temporally…

    Submitted 2 April, 2020; originally announced April 2020.

  17. arXiv:2003.04668  [pdf, other]

    cs.CV

    Incremental Few-Shot Object Detection

    Authors: Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy Hospedales, Tao Xiang

    Abstract: Most existing object detection methods rely on the availability of abundant labelled training samples per class and offline model training in a batch mode. These requirements substantially limit their scalability to open-ended accommodation of novel classes with limited labelled training data. We present a study aiming to go beyond these limitations by considering the Incremental Few-Shot Detectio…

    Submitted 12 March, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

  18. arXiv:1903.06496  [pdf, other]

    cs.LG cs.CV cs.NE

    MFAS: Multimodal Fusion Architecture Search

    Authors: Juan-Manuel Pérez-Rúa, Valentin Vielzeuf, Stéphane Pateux, Moez Baccouche, Frédéric Jurie

    Abstract: We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonst…

    Submitted 15 March, 2019; originally announced March 2019.

    Comments: CVPR 2019, Jun 2019, Long Beach, United States http://cvpr2019.thecvf.com/

  19. arXiv:1808.00391  [pdf, other]

    cs.CV

    Efficient Progressive Neural Architecture Search

    Authors: Juan-Manuel Perez-Rua, Moez Baccouche, Stephane Pateux

    Abstract: This paper addresses the difficult problem of finding an optimal neural architecture design for a given image classification task. We propose a method that aggregates two main results of the previous state-of-the-art in neural architecture search. These are, appealing to the strong sampling efficiency of a search scheme based on sequential model-based optimization (SMBO), and increasing training e…

    Submitted 1 August, 2018; originally announced August 2018.

    Comments: Accepted for publication by the BMVA (BMVC 2018)

  20. arXiv:1804.06504  [pdf, other]

    cs.CV

    Learning how to be robust: Deep polynomial regression

    Authors: Juan-Manuel Perez-Rua, Tomas Crivelli, Patrick Bouthemy, Patrick Perez

    Abstract: Polynomial regression is a recurrent problem with a large number of applications. In computer vision it often appears in motion analysis. Whatever the application, standard methods for regression of polynomial models tend to deliver biased results when the input data is heavily contaminated by outliers. Moreover, the problem is even harder when outliers have strong structure. Departing from proble…

    Submitted 23 May, 2018; v1 submitted 17 April, 2018; originally announced April 2018.

    Comments: 18 pages, conference

  21. arXiv:1612.01495  [pdf, other]

    cs.CV

    ROAM: a Rich Object Appearance Model with Application to Rotoscoping

    Authors: Ondrej Miksik, Juan-Manuel Pérez-Rúa, Philip H. S. Torr, Patrick Pérez

    Abstract: Rotoscoping, the detailed delineation of scene elements through a video shot, is a painstaking task of tremendous importance in professional post-production pipelines. While pixel-wise segmentation techniques can help for this task, professional rotoscoping tools rely on parametric curves that offer the artists a much better interactive control on the definition, editing and manipulation of the se…

    Submitted 5 December, 2016; originally announced December 2016.