
Showing 1–6 of 6 results for author: Guhur, P

Searching in archive cs.
  1. arXiv:2211.09646

    cs.CV

    Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred by the text, such as "the left most chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this e…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted in NeurIPS 2022; Project website: https://cshizhe.github.io/projects/vil3dref.html

  2. arXiv:2209.04899

    cs.RO cs.AI cs.CL cs.CV cs.LG

    Instruction-driven history-aware policies for robotic manipulations

    Authors: Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

    Abstract: In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that tak…

    Submitted 17 December, 2022; v1 submitted 11 September, 2022; originally announced September 2022.

    Comments: Accepted in CoRL 2022 (oral); project page at https://guhur.github.io/hiveformer/

  3. arXiv:2208.11781

    cs.CV cs.AI

    Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: ECCV 2022

  4. arXiv:2202.11742

    cs.CV

    Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build…

    Submitted 23 February, 2022; originally announced February 2022.

  5. arXiv:2110.13309

    cs.CV cs.AI

    History Aware Multimodal Transformer for Vision-and-Language Navigation

    Authors: Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev

    Abstract: Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT ef…

    Submitted 17 August, 2023; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: Accepted in NeurIPS 2021; project page at https://cshizhe.github.io/projects/vln_hamt.html; corrected a typo

  6. arXiv:2108.09105

    cs.CV cs.AI cs.CL cs.HC cs.LG

    Airbert: In-domain Pretraining for Vision-and-Language Navigation

    Authors: Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid

    Abstract: Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization, however, the…

    Submitted 20 August, 2021; originally announced August 2021.

    Comments: To be published in ICCV 2021. Webpage at https://airbert-vln.github.io/ links to our dataset, code and models