Showing 1–33 of 33 results for author: Weinzaepfel, P

Searching in archive cs.
  1. arXiv:2404.12942

    cs.CV

    Purposer: Putting Human Motion Generation in Context

    Authors: Nicolas Ugrinovic, Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, Gregory Rogez, Francesc Moreno-Noguer

    Abstract: We present a novel method to generate human motion to populate 3D indoor scenes. It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds. State-of-the-art methods are either models specialized to one single setting, require vast amounts of high-quality and diverse training data, or are u…

    Submitted 19 April, 2024; originally announced April 2024.

  2. arXiv:2402.14654

    cs.CV

    Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

    Authors: Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas

    Abstract: We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and spatial location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person centers, using features produced by a standard…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: https://github.com/naver/multi-hmr

  3. arXiv:2402.09237

    cs.CV

    Weatherproofing Retrieval for Localization with Generative AI and Geometric Consistency

    Authors: Yannis Kalantidis, Mert Bülent Sarıyıldız, Rafael S. Rezende, Philippe Weinzaepfel, Diane Larlus, Gabriela Csurka

    Abstract: State-of-the-art visual localization approaches generally rely on a first image retrieval step whose role is crucial. Yet, retrieval often struggles when facing varying conditions, due to e.g. weather or time of day, with dramatic consequences on the visual localization accuracy. In this paper, we improve this retrieval step and tailor it to the final localization task. Among the several changes w…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: Accepted at ICLR 2024. Project Page: https://europe.naverlabs.com/ret4loc

  4. arXiv:2311.09104

    cs.CV

    Cross-view and Cross-pose Completion for 3D Human Understanding

    Authors: Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez

    Abstract: Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, colle…

    Submitted 18 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  5. arXiv:2310.00632

    cs.CV

    Win-Win: Training High-Resolution Vision Transformers from Two Windows

    Authors: Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel

    Abstract: Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter so…

    Submitted 22 March, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  6. arXiv:2309.16634

    cs.CV

    End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon

    Authors: Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel, Christian Wolf

    Abstract: Most recent work in goal-oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectN…

    Submitted 28 September, 2023; originally announced September 2023.

  7. arXiv:2309.10748

    cs.CV cs.AI cs.GR cs.LG cs.RO

    SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction

    Authors: Anilkumar Swamy, Vincent Leroy, Philippe Weinzaepfel, Fabien Baradel, Salma Galaaoui, Romain Bregier, Matthieu Armando, Jean-Sebastien Franco, Gregory Rogez

    Abstract: Recent hand-object interaction datasets show limited real object variability and rely on fitting the MANO parametric model to obtain groundtruth hand shapes. To go beyond these limitations and spur further research, we introduce the SHOWMe dataset which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes. Following recent work, we consider a rigid hand-object sce…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: Paper and Appendix, Accepted in ACVR workshop at ICCV conference

  8. arXiv:2309.08480

    cs.CV

    PoseFix: Correcting 3D Human Poses with Natural Language

    Authors: Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez

    Abstract: Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help for assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections betwee…

    Submitted 17 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Published in ICCV 2023

  9. arXiv:2307.11702

    cs.CV

    SACReg: Scene-Agnostic Coordinate Regression for Visual Localization

    Authors: Jerome Revaud, Yohann Cabon, Romain Brégier, JongMin Lee, Philippe Weinzaepfel

    Abstract: Scene coordinates regression (SCR), i.e., predicting 3D coordinates for every pixel of a given image, has recently shown promising potential. However, existing methods remain limited to small scenes memorized during training, and thus hardly scale to realistic datasets and scenarios. In this paper, we propose a generalized SCR model trained once to be deployed in new test scenes, regardless of the…

    Submitted 30 November, 2023; v1 submitted 21 July, 2023; originally announced July 2023.

  10. arXiv:2211.10408

    cs.CV

    CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow

    Authors: Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, Jérôme Revaud

    Abstract: Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-v…

    Submitted 18 August, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

    Comments: ICCV 2023

  11. arXiv:2211.07304

    cs.RO

    Multi-Finger Grasping Like Humans

    Authors: Yuming Du, Philippe Weinzaepfel, Vincent Lepetit, Romain Brégier

    Abstract: Robots with multi-fingered grippers could perform advanced manipulation tasks for us if we were able to properly specify to them what to do. In this study, we take a step in that direction by making a robot grasp an object like a grasping demonstration performed by a human. We propose a novel optimization-based approach for transferring human grasp demonstrations to any multi-fingered grippers, wh…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: presented at IROS 2022 conference

    Journal ref: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  12. arXiv:2210.11795

    cs.CV

    PoseScript: Linking 3D Human Poses and Natural Language

    Authors: Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez

    Abstract: Natural language plays a critical role in many computer vision applications, such as image captioning, visual question answering, and cross-modal retrieval, to provide fine-grained semantic information. Unfortunately, while human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. To address this issue, we have introduced the PoseScript dataset.…

    Submitted 19 January, 2024; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: Extended version of the ECCV 2022 paper

  13. arXiv:2210.10716

    cs.CV

    CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

    Authors: Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, Jérôme Revaud

    Abstract: Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm. A pretext task is constructed by masking patches in an input image, and this masked content is then predicted by a neural network using visible patches as sole input. This pre-training leads to state-of-the-art performance when finetuned for high-level semantic tasks, e.g. image classification and object d…

    Submitted 12 January, 2023; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022

  14. arXiv:2210.10542

    cs.CV

    PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting

    Authors: Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, Grégory Rogez

    Abstract: We address the problem of action-conditioned generation of human motion sequences. Existing work falls into two categories: forecast models conditioned on observed past motions, or generative models conditioned on action labels and duration only. In contrast, we generate motion conditioned on observations of arbitrary length, including none. To solve this generalized problem, we propose PoseGPT, a…

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: ECCV'22 Conference paper

  15. arXiv:2208.10211

    cs.CV

    PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling

    Authors: Fabien Baradel, Romain Brégier, Thibault Groueix, Philippe Weinzaepfel, Yannis Kalantidis, Grégory Rogez

    Abstract: Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are really hard and expensive to obtain. Although transformers have been recently utilized for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a…

    Submitted 19 October, 2022; v1 submitted 22 August, 2022; originally announced August 2022.

    Comments: Accepted to TPAMI 2022

  16. Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark

    Authors: Martin Humenberger, Yohann Cabon, Noé Pion, Philippe Weinzaepfel, Donghwan Lee, Nicolas Guérin, Torsten Sattler, Gabriela Csurka

    Abstract: Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two purposes: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is co…

    Submitted 31 May, 2022; originally announced May 2022.

    Comments: International Journal of Computer Vision (2022). arXiv admin note: text overlap with arXiv:2011.11946

  17. arXiv:2201.13182

    cs.CV

    Learning Super-Features for Image Retrieval

    Authors: Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, Yannis Kalantidis

    Abstract: Methods that combine local and global features have recently shown excellent performance on multiple challenging deep image retrieval benchmarks, but their use of local features raises at least two issues. First, these local features simply boil down to the localized map activations of a neural network, and hence can be extremely redundant. Second, they are typically trained with a global loss tha…

    Submitted 31 January, 2022; originally announced January 2022.

    Comments: ICLR 2022

  18. arXiv:2112.12004

    cs.CV

    Barely-Supervised Learning: Semi-Supervised Learning with very few labeled images

    Authors: Thomas Lucas, Philippe Weinzaepfel, Gregory Rogez

    Abstract: This paper tackles the problem of semi-supervised learning when the set of labeled samples is limited to a small number of images per class, typically fewer than 10, a problem that we refer to as barely-supervised learning. We analyze in depth the behavior of a state-of-the-art semi-supervised method, FixMatch, which relies on a weakly-augmented version of an image to obtain supervision signal for a…

    Submitted 22 December, 2021; originally announced December 2021.

  19. arXiv:2110.09243

    cs.CV

    Leveraging MoCap Data for Human Mesh Recovery

    Authors: Fabien Baradel, Thibault Groueix, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis, Grégory Rogez

    Abstract: Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are really hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tuning image-based models with…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: 3DV 2021

  20. arXiv:2105.08941

    cs.CV

    Large-scale Localization Datasets in Crowded Indoor Spaces

    Authors: Donghwan Lee, Soohyun Ryu, Suyong Yeon, Yonghan Lee, Deokhwa Kim, Cheolho Han, Yohann Cabon, Philippe Weinzaepfel, Nicolas Guérin, Gabriela Csurka, Martin Humenberger

    Abstract: Estimating the precise location of a camera using visual localization enables interesting applications such as augmented reality or robot navigation. This is particularly useful in indoor environments where other localization technologies, such as GNSS, fail. Indoor spaces impose interesting challenges on visual localization algorithms: occlusions due to people, textureless surfaces, large viewpoi…

    Submitted 19 May, 2021; originally announced May 2021.

  21. arXiv:2012.09696

    cs.RO cs.LG

    Multi-FinGAN: Generative Coarse-To-Fine Sampling of Multi-Finger Grasps

    Authors: Jens Lundell, Enric Corona, Tran Nguyen Le, Francesco Verdoja, Philippe Weinzaepfel, Gregory Rogez, Francesc Moreno-Noguer, Ville Kyrki

    Abstract: While there exist many methods for manipulating rigid objects with parallel-jaw grippers, grasping with multi-finger robotic hands remains a quite unexplored research topic. Reasoning and planning collision-free trajectories on the additional degrees of freedom of several fingers represents an important challenge that, so far, involves computationally costly and slow processes. In this work, we p…

    Submitted 15 March, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

    Comments: Accepted to IEEE Conference on Robotics and Automation 2021 (ICRA). Code is available at https://irobotics.aalto.fi/multi-fingan/

  22. arXiv:2012.02743

    cs.CV

    SMPLy Benchmarking 3D Human Pose Estimation in the Wild

    Authors: Vincent Leroy, Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Grégory Rogez

    Abstract: Predicting 3D human pose from images has seen great recent improvements. Novel approaches that can even predict both pose and shape from a single input image have been introduced, often relying on a parametric model of the human body such as SMPL. While qualitative results for such methods are often shown for images captured in-the-wild, a proper benchmark in such conditions is still missing, as i…

    Submitted 4 December, 2020; originally announced December 2020.

    Comments: 3DV 2020 Oral presentation

  23. arXiv:2010.01028

    cs.CV cs.LG

    Hard Negative Mixing for Contrastive Learning

    Authors: Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, Diane Larlus

    Abstract: Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucia…

    Submitted 4 December, 2020; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: Accepted at NeurIPS 2020. Project page with pretrained models: https://europe.naverlabs.com/mochi

  24. arXiv:2008.09457

    cs.CV

    DOPE: Distillation Of Part Experts for whole-body 3D pose estimation in the wild

    Authors: Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, Grégory Rogez

    Abstract: We introduce DOPE, the first method to detect and estimate whole-body 3D human poses, including bodies, hands and faces, in the wild. Achieving this level of detail is key for a number of applications that require understanding the interactions of the people with each other or with the environment. The main challenge is the lack of in-the-wild data with labeled whole-body 3D poses. In previous wo…

    Submitted 21 August, 2020; originally announced August 2020.

    Comments: ECCV 2020

  25. arXiv:2003.13764

    cs.CV

    Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction

    Authors: Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, MingXiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, et al. (10 additional authors not shown)

    Abstract: We study how well different types of approaches generalise in the task of 3D hand pose estimation under single hand scenarios and hand-object interaction. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is highly dimensional, it is inherently not feasible to cover the whole…

    Submitted 10 September, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

    Comments: European Conference on Computer Vision (ECCV), 2020

  26. arXiv:1912.07249

    cs.CV

    Mimetics: Towards Understanding Human Actions Out of Context

    Authors: Philippe Weinzaepfel, Grégory Rogez

    Abstract: Recent methods for video action recognition have reached outstanding performance on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis field leads to the prediction playing tennis irrespective of the actions performed in the video. In contrast, humans have a more complete unde…

    Submitted 2 February, 2021; v1 submitted 16 December, 2019; originally announced December 2019.

  27. arXiv:1906.06195

    cs.CV

    R2D2: Repeatable and Reliable Detector and Descriptor

    Authors: Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, Martin Humenberger

    Abstract: Interest point detection and local feature description are fundamental steps in many computer vision applications. Classical methods for these tasks are based on a detect-then-describe paradigm where separate handcrafted methods are used to first identify repeatable keypoints and then represent them with a local descriptor. Neural networks trained with metric learning losses have recently caught u…

    Submitted 17 June, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

  28. LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images

    Authors: Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid

    Abstract: We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-R…

    Submitted 13 January, 2019; v1 submitted 1 March, 2018; originally announced March 2018.

    Comments: journal version of the CVPR 2017 paper, accepted to appear in IEEE Trans. PAMI

  29. arXiv:1705.01861

    cs.CV

    Action Tubelet Detector for Spatio-Temporal Action Localization

    Authors: Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid

    Abstract: Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bou…

    Submitted 21 August, 2017; v1 submitted 4 May, 2017; originally announced May 2017.

    Comments: 9 pages

  30. arXiv:1605.05197

    cs.CV

    Human Action Localization with Sparse Spatial Supervision

    Authors: Philippe Weinzaepfel, Xavier Martin, Cordelia Schmid

    Abstract: We introduce an approach for spatio-temporal human action localization using sparse spatial supervision. Our method leverages the large amount of annotated humans available today and extracts human tubes by combining a state-of-the-art human detector with a tracking-by-detection approach. Given these high-quality human tubes and temporal supervision, we select positive and negative tubes with very…

    Submitted 23 May, 2017; v1 submitted 17 May, 2016; originally announced May 2016.

  31. arXiv:1506.07656

    cs.CV

    DeepMatching: Hierarchical Deformable Dense Matching

    Authors: Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

    Abstract: We introduce a novel matching algorithm, called DeepMatching, to compute dense correspondences between images. DeepMatching relies on a hierarchical, multi-layer, correlational architecture designed for matching images and was inspired by deep convolutional approaches. The proposed matching algorithm can handle non-rigid deformations and repetitive textures and efficiently determines dense corresp…

    Submitted 8 October, 2015; v1 submitted 25 June, 2015; originally announced June 2015.

  32. arXiv:1506.01929

    cs.CV

    Learning to track for spatio-temporal action localization

    Authors: Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

    Abstract: We propose an effective approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame-level and scores them with a combination of static and motion CNN features. It then tracks high-scoring proposals throughout the video using a tracking-by-detection approach. Our tracker relies simultaneously on instance-level and class-level detectors. Th…

    Submitted 27 September, 2015; v1 submitted 5 June, 2015; originally announced June 2015.

  33. arXiv:1501.02565

    cs.CV

    EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow

    Authors: Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

    Abstract: We propose a novel approach for optical flow estimation, targeted at large displacements with significant occlusions. It consists of two steps: i) dense matching by edge-preserving interpolation from a sparse set of matches; ii) variational energy minimization initialized with the dense matches. The sparse-to-dense interpolation relies on an appropriate choice of the distance, namely an edge-awa…

    Submitted 19 May, 2015; v1 submitted 12 January, 2015; originally announced January 2015.