Showing 1–50 of 86 results for author: Laptev, I

  1. arXiv:2406.10221  [pdf, other]

    cs.CV cs.AI cs.CL

    Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding

    Authors: Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev

    Abstract: Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets off…

    Submitted 14 June, 2024; originally announced June 2024.

  2. arXiv:2406.09250  [pdf, other]

    cs.CV cs.AI cs.LG

    MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    Authors: Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Karthik Nandakumar, Ivan Laptev

    Abstract: Vision-Language Models (VLMs) are becoming increasingly vulnerable to adversarial attacks as various novel attack strategies are being proposed against these models. While existing defenses excel in unimodal contexts, they currently fall short in safeguarding VLMs against adversarial threats. To mitigate this vulnerability, we propose a novel, yet elegantly simple approach for detecting adversaria…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2404.15709  [pdf, other]

    cs.CV cs.LG cs.RO

    ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

    Authors: Zerui Chen, Shizhe Chen, Cordelia Schmid, Ivan Laptev

    Abstract: In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses. Though prior work has demonstrated that human videos can benefit policy learning, performance improvement has been limited by physically implausible trajectories extracted from videos. Moreover, reliance on privileged object information such as ground-truth…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Project Page: https://zerchen.github.io/projects/vividex.html

  4. arXiv:2404.01491  [pdf, other]

    cs.CV

    SUGAR: Pre-training 3D Visual Representations for Robotics

    Authors: Shizhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

    Abstract: Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet, prevailing approaches focus on pre-training 2D representations, being sub-optimal to deal with occlusions and accurately localize objects in complex 3D scenes. Meanwhile, 3D representation learning has been limited to single-object understanding. To address these limitations, we introd…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project webpage: https://cshizhe.github.io/projects/robot_sugar.html

  5. arXiv:2312.07322  [pdf, other]

    cs.CV

    GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

    Authors: Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

    Abstract: We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and autom…

    Submitted 2 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  6. arXiv:2309.15596  [pdf, other]

    cs.RO cs.CV

    PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

    Authors: Shizhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

    Abstract: The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties in combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose a 3D point cloud based…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to CoRL 2023. Project website: https://www.di.ens.fr/willow/research/polarnet/

  7. arXiv:2309.13952  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    VidChapters-7M: Video Chapters at Scale

    Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at NeurIPS 2023 Track on Datasets and Benchmarks; Project Webpage: https://antoyang.github.io/vidchapters.html ; 31 pages; 8 figures

  8. arXiv:2308.05602  [pdf, other]

    cs.CV cs.RO

    Object Goal Navigation with Recursive Implicit Maps

    Authors: Shizhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

    Abstract: Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments. Classical methods explicitly build maps of environments and require extensive engineering while lacking semantic information for object-oriented exploration. On the other hand, end-to-end learning methods alleviate manual map design and predict actions using implicit representations. Su…

    Submitted 10 August, 2023; originally announced August 2023.

    Comments: Accepted to IROS 2023

  9. arXiv:2307.15320  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Robust Visual Sim-to-Real Transfer for Robotic Manipulation

    Authors: Ricardo Garcia, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

    Abstract: Learning visuomotor policies in simulation is much safer and cheaper than in the real world. However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots. One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR). While previous work mainly evaluates DR for disembodied tasks, such as pose…

    Submitted 28 July, 2023; originally announced July 2023.

  10. arXiv:2305.06289  [pdf, other]

    cs.RO cs.CV cs.LG

    Learning Video-Conditioned Policies for Unseen Manipulation Tasks

    Authors: Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

    Abstract: The ability to specify robot commands by a non-expert user is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is by a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challen…

    Submitted 10 May, 2023; originally announced May 2023.

    Comments: ICRA 2023. See the project webpage at https://www.di.ens.fr/willow/research/vip/

  11. arXiv:2304.11970  [pdf, other]

    cs.CV

    gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

    Authors: Zerui Chen, Shizhe Chen, Cordelia Schmid, Ivan Laptev

    Abstract: Signed distance functions (SDFs) are an attractive framework that has recently shown promising results for 3D shape reconstruction from images. SDFs seamlessly generalize to different shape resolutions and topologies but lack explicit modelling of the underlying 3D geometry. In this work, we exploit the hand structure and use it as guidance for SDF-based shape reconstruction. In particular, we addr…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: Accepted by CVPR 2023. Project Page: https://zerchen.github.io/projects/gsdf.html

  12. arXiv:2304.06372  [pdf, other]

    cs.RO

    Contact Models in Robotics: a Comparative Analysis

    Authors: Quentin Le Lidec, Wilson Jallet, Louis Montaut, Ivan Laptev, Cordelia Schmid, Justin Carpentier

    Abstract: Physics simulation is ubiquitous in robotics. Whether in model-based approaches (e.g., trajectory optimization), or model-free algorithms (e.g., reinforcement learning), physics simulators are a central component of modern control pipelines in robotics. Over the past decades, several robotic simulators have been developed, each with dedicated contact modeling assumptions and algorithmic solutions.…

    Submitted 21 June, 2024; v1 submitted 13 April, 2023; originally announced April 2023.

  13. arXiv:2302.14115  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

    Authors: Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, w…

    Submitted 21 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: CVPR 2023 Camera-Ready; Project Webpage: https://antoyang.github.io/vid2seq.html ; 18 pages; 6 figures

  14. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

    Authors: Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

    Abstract: One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations, but also by the lack of specific evaluation and training d…

    Submitted 26 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023

  15. arXiv:2212.07372  [pdf, other]

    cs.CV eess.IV

    Image Compression with Product Quantized Masked Image Modeling

    Authors: Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou

    Abstract: Recent neural compression methods have been based on the popular hyperprior framework. It relies on Scalar Quantization and offers a very strong compression performance. This contrasts with recent advances in image generation and representation learning, where Vector Quantization is more commonly employed. In this work, we attempt to bring these lines of research closer by revisiting vector quanti…

    Submitted 6 November, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

  16. arXiv:2211.13500  [pdf, other]

    cs.CV

    Multi-Task Learning of Object State Changes from Uncurated Videos

    Authors: Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

    Abstract: We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos. We introduce three principal contributions. First, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions such as pouring…

    Submitted 24 November, 2022; originally announced November 2022.

  17. arXiv:2211.09646  [pdf, other]

    cs.CV

    Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the leftmost chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this e…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted in NeurIPS 2022; Project website: https://cshizhe.github.io/projects/vil3dref.html

  18. arXiv:2209.09006  [pdf, other]

    cs.RO cs.LG

    Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control

    Authors: Quentin Le Lidec, Wilson Jallet, Ivan Laptev, Cordelia Schmid, Justin Carpentier

    Abstract: Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages. On one hand, RL approaches are able to learn global control policies directly from data, but generally require large sample sizes to properly converge towards feasible policies. On the other hand, TO methods are able to exploit gradient-based information extracted from simulators to quickly conver…

    Submitted 16 February, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

  19. arXiv:2209.04899  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    Instruction-driven history-aware policies for robotic manipulations

    Authors: Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

    Abstract: In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that tak…

    Submitted 17 December, 2022; v1 submitted 11 September, 2022; originally announced September 2022.

    Comments: Accepted in CoRL 2022 (oral); project page at https://guhur.github.io/hiveformer/

  20. arXiv:2208.11781  [pdf, other]

    cs.CV cs.AI

    Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: ECCV 2022

  21. arXiv:2207.12909  [pdf, other]

    cs.CV

    AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

    Authors: Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

    Abstract: Recent work achieved impressive progress towards joint reconstruction of hands and manipulated objects from monocular color images. Existing methods focus on two alternative representations in terms of either parametric meshes or signed distance fields (SDFs). On one side, parametric models can benefit from prior knowledge at the cost of limited shape deformations and mesh resolutions. Mesh models…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: Accepted by ECCV 2022. Project Page: https://zerchen.github.io/projects/alignsdf.html

  22. arXiv:2206.11884  [pdf, other]

    cs.RO

    Augmenting differentiable physics with randomized smoothing

    Authors: Quentin Le Lidec, Louis Montaut, Cordelia Schmid, Ivan Laptev, Justin Carpentier

    Abstract: In the past few years, following the differentiable programming paradigm, there has been a growing interest in computing the gradient information of physical processes (e.g., physical simulation, image rendering). However, such processes may be non-differentiable or yield uninformative gradients (i.e., null almost everywhere). When faced with the former pitfalls, gradients estimated via analytical…

    Submitted 23 June, 2022; originally announced June 2022.

  23. arXiv:2206.08155  [pdf, other]

    cs.CV cs.CL cs.LG

    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language…

    Submitted 10 October, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022 Camera-Ready; Project Webpage: https://antoyang.github.io/frozenbilm.html; 25 pages; 5 figures

  24. arXiv:2205.05019  [pdf, other]

    cs.CV cs.CL cs.LG

    Learning to Answer Visual Questions from Web Videos

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question genera…

    Submitted 11 May, 2022; v1 submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted at the TPAMI Special Issue on the Best Papers of ICCV 2021. Journal extension of the conference paper arXiv:2012.00451. 16 pages, 13 figures

  25. arXiv:2205.04725  [pdf, other]

    cs.CV cs.AI cs.LG

    Weakly-supervised segmentation of referring expressions

    Authors: Robin Strudel, Ivan Laptev, Cordelia Schmid

    Abstract: Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been addressed in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefo…

    Submitted 12 May, 2022; v1 submitted 10 May, 2022; originally announced May 2022.

  26. arXiv:2203.16434  [pdf, other]

    cs.CV cs.CL cs.LG

    TubeDETR: Spatio-Temporal Video Grounding with Transformers

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our m…

    Submitted 9 June, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Updated vIoU results compared to the CVPR'22 camera-ready version; 17 pages; 8 figures

  27. arXiv:2203.11637  [pdf, other]

    cs.CV

    Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

    Authors: Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

    Abstract: Human actions often induce changes of object states such as "cutting an apple", "cleaning shoes" or "pouring coffee". In this paper, we seek to temporally localize object states (e.g. "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision. The contributions of this work are threefold. First, we develop a…

    Submitted 22 March, 2022; originally announced March 2022.

    Comments: To be published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  28. arXiv:2203.03986  [pdf, other]

    cs.RO math.OC

    Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

    Authors: Quentin Le Lidec, Fabian Schramm, Louis Montaut, Cordelia Schmid, Ivan Laptev, Justin Carpentier

    Abstract: Optimal control (OC) algorithms such as Differential Dynamic Programming (DDP) take advantage of the derivatives of the dynamics to efficiently control physical systems. Yet, in the presence of nonsmooth dynamical systems, this class of algorithms is likely to fail due, for instance, to the presence of discontinuities in the dynamics derivatives or because of non-informative gradients. On the cont…

    Submitted 22 January, 2024; v1 submitted 8 March, 2022; originally announced March 2022.

  29. arXiv:2202.11742  [pdf, other]

    cs.CV

    Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

    Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

    Abstract: Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground language in visual scenes, but should also explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build…

    Submitted 23 February, 2022; originally announced February 2022.

  30. arXiv:2112.10740  [pdf, other]

    cs.CV

    Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

    Authors: Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave

    Abstract: Pre-training models on large scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are…

    Submitted 20 December, 2021; originally announced December 2021.

  31. arXiv:2111.01591  [pdf, other]

    cs.CV

    Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

    Authors: Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

    Abstract: In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces exerted on the human body. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate t…

    Submitted 2 November, 2021; originally announced November 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:1904.02683

  32. arXiv:2110.13309  [pdf, other]

    cs.CV cs.AI

    History Aware Multimodal Transformer for Vision-and-Language Navigation

    Authors: Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev

    Abstract: Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT ef…

    Submitted 17 August, 2023; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: Accepted in NeurIPS 2021; project page at https://cshizhe.github.io/projects/vln_hamt.html; corrected a typo

  33. arXiv:2110.09107  [pdf, other]

    cs.CV cs.LG

    Differentiable Rendering with Perturbed Optimizers

    Authors: Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier

    Abstract: Reasoning about 3D scenes from their 2D image projections is one of the core problems in computer vision. Solutions to this inverse and ill-posed problem typically involve a search for models that best explain observed image data. Notably, images depend both on the properties of observed scenes and on the process of image formation. Hence, if optimization techniques should be used to explain image…

    Submitted 18 October, 2021; originally announced October 2021.

  34. arXiv:2109.04409  [pdf, other]

    cs.CV

    Reconstructing and grounding narrated instructional videos in 3D

    Authors: Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

    Abstract: Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructiona…

    Submitted 10 September, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

  35. arXiv:2108.09105  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC cs.LG

    Airbert: In-domain Pretraining for Vision-and-Language Navigation

    Authors: Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid

    Abstract: Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization, however, the…

    Submitted 20 August, 2021; originally announced August 2021.

    Comments: To be published at ICCV 2021. Webpage is at https://airbert-vln.github.io/ linking to our dataset, codes and models

  36. arXiv:2108.07044  [pdf, other]

    cs.CV

    Towards unconstrained joint hand-object reconstruction from RGB videos

    Authors: Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid

    Abstract: Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos. Reconstructing hand-object manipulations holds a great potential for robotics and learning from human demonstrations. The supervised learning approach to this problem, however, requires 3D supervision and remains limited to constrained laboratory settings and simulators for which 3D ground truth is av…

    Submitted 12 March, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Project website: https://hassony2.github.io/homan.html

  37. arXiv:2107.00541  [pdf, other]

    cs.LG cs.RO

    Goal-Conditioned Reinforcement Learning with Imagined Subgoals

    Authors: Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

    Abstract: Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning. In this work, we propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks. Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the…

    Submitted 1 July, 2021; originally announced July 2021.

    Comments: ICML 2021. See the project webpage at https://www.di.ens.fr/willow/research/ris/

  38. arXiv:2106.09681  [pdf, other]

    cs.CV cs.LG

    XCiT: Cross-Covariance Image Transformers

    Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou

    Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic comple…

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

  39. arXiv:2105.05633  [pdf, other]

    cs.CV cs.AI cs.LG

    Segmenter: Transformer for Semantic Segmentation

    Authors: Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

    Abstract: Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based methods, our approach allows modeling global context already at the first layer and throughout the network. We build on the recent Vision Tra…

    Submitted 2 September, 2021; v1 submitted 12 May, 2021; originally announced May 2021.

    Comments: ICCV 2021. Code available at https://github.com/rstrudel/segmenter

  40. arXiv:2103.16553  [pdf, other]

    cs.CV

    Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-…

    Submitted 30 March, 2021; originally announced March 2021.

    Comments: Accepted to CVPR 2021

  41. arXiv:2102.05644  [pdf, other]

    cs.CV

    Training Vision Transformers for Image Retrieval

    Authors: Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou

    Abstract: Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy…

    Submitted 10 February, 2021; originally announced February 2021.

  42. arXiv:2012.00451  [pdf, other]

    cs.CV cs.CL cs.LG

    Just Ask: Learning to Answer Questions from Millions of Narrated Videos

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question genera…

    Submitted 12 August, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

    Comments: Accepted at ICCV 2021 (Oral); 20 pages; 14 figures

  43. arXiv:2011.06813  [pdf, other]

    cs.RO cs.CV cs.LG

    Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

    Authors: Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

    Abstract: Humans are adept at learning new tasks by watching a few instructional videos. On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain. In this paper, we explore a method that facilitates learning object manipulation skills directly from videos. Leveraging recent advances in 2D visual recog…

    Submitted 13 November, 2020; originally announced November 2020.

    Comments: CoRL 2020, code at https://github.com/makarandtapaswi/Real2Sim_CoRL2020, project page at https://data.ciirc.cvut.cz/public/projects/2020Real2Sim/

  44. arXiv:2008.11174  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG stat.ML

    Learning Obstacle Representations for Neural Motion Planning

    Authors: Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

    Abstract: Motion planning and obstacle avoidance is a key challenge in robotics applications. While previous work succeeds to provide excellent solutions for known environments, sensor-based motion planning in new and dynamic environments remains difficult. In this work we address sensor-based motion planning from a learning perspective. Motivated by recent advances in visual recognition, we argue the impor…

    Submitted 7 November, 2020; v1 submitted 25 August, 2020; originally announced August 2020.

    Comments: CoRL 2020. See the project webpage at https://www.di.ens.fr/willow/research/nmp_repr/

  45. arXiv:2008.01018  [pdf, other]

    cs.CV

    RareAct: A video dataset of unusual interactions

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes". RareAct aims at evaluating the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions which were obtained by comb…

    Submitted 3 August, 2020; originally announced August 2020.

  46. arXiv:2008.00744  [pdf, other]

    cs.CV

    The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

    Authors: Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shizhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao

    Abstract: We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval, the task of searching for content within a corpus of videos using natural language queries. This report summarizes the re…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Individual reports, dataset information, rules, and released source code can be found at the competition webpage (https://www.robots.ox.ac.uk/~vgg/challenges/video-pentathlon)

  47. arXiv:2005.00069  [pdf, other]

    cs.CV cs.LG eess.IV

    Occlusion resistant learning of intuitive physics from videos

    Authors: Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

    Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to t…

    Submitted 30 April, 2020; originally announced May 2020.

  48. arXiv:2004.13449  [pdf, other]

    cs.CV

    Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction

    Authors: Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid

    Abstract: Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training sa…

    Submitted 28 April, 2020; originally announced April 2020.

    Comments: CVPR 2020. See the project webpage at https://hassony2.github.io/handobjectconsist.html

  49. arXiv:2004.07950  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Learning visual policies for building 3D shape categories

    Authors: Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

    Abstract: Manipulation and assembly tasks require non-trivial planning of actions depending on the environment and the final goal. Previous work in this domain often assembles particular instances of objects from known sets of primitives. In contrast, we aim to handle varying sets of primitives and to construct different objects of a shape category. Given a single object instance of a category, e.g. an arch…

    Submitted 30 September, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

    Comments: IROS 2020

  50. arXiv:2003.13158  [pdf, other]

    cs.CV

    Learning Interactions and Relationships between Movie Characters

    Authors: Anna Kukleva, Makarand Tapaswi, Ivan Laptev

    Abstract: Interactions between people are often governed by their relationships. On the flip side, social relationships are built upon several interactions. Two strangers are more likely to greet and introduce themselves while becoming friends over time. We are fascinated by this interplay between interactions and relationships, and believe that it is an important aspect of understanding social situations.…

    Submitted 29 March, 2020; originally announced March 2020.

    Comments: CVPR 2020 (Oral)