Showing 1–50 of 68 results for author: Sivic, J

  1. arXiv:2404.10457  [pdf, other]

    cs.LG

    Revealing data leakage in protein interaction benchmarks

    Authors: Anton Bushuiev, Roman Bushuiev, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic

    Abstract: In recent years, there has been remarkable progress in machine learning for protein-protein interactions. However, prior work has predominantly focused on improving learning algorithms, with less attention paid to evaluation strategies and data preparation. Here, we demonstrate that further development of machine learning methods may be hindered by the quality of existing train-test splits. Specif… ▽ More

    Submitted 16 April, 2024; originally announced April 2024.

  2. arXiv:2401.09413  [pdf, other]

    cs.CV

    POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

    Authors: Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic

    Abstract: We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of… ▽ More

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: accepted to NeurIPS 2023

  3. arXiv:2312.07322  [pdf, other]

    cs.CV

    GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

    Authors: Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

    Abstract: We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and autom… ▽ More

    Submitted 2 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  4. arXiv:2312.04966  [pdf, other]

    cs.CV

    Customizing Motion in Text-to-Video Diffusion Models

    Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell

    Abstract: We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First,… ▽ More

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project page: https://joaanna.github.io/customizing_motion/

  5. arXiv:2312.02985  [pdf, other]

    cs.CV

    FocalPose++: Focal Length and Object Pose Estimation via Render and Compare

    Authors: Martin Cífka, Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Vladimir Petrik, Josef Sivic

    Abstract: We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are threefold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Se… ▽ More

    Submitted 15 November, 2023; originally announced December 2023.

    Comments: 21 pages, 18 figures. arXiv admin note: substantial text overlap with arXiv:2204.05145

  6. arXiv:2311.05344  [pdf, other]

    cs.RO

    Visually Guided Model Predictive Robot Control via 6D Object Pose Localization and Tracking

    Authors: Mederic Fourmy, Vojtech Priban, Jan Kristof Behrens, Nicolas Mansard, Josef Sivic, Vladimir Petrik

    Abstract: The objective of this work is to enable manipulation tasks with respect to the 6D pose of a dynamically moving object using a camera mounted on a robot. Examples include maintaining a constant relative 6D pose of the robot arm with respect to the object, grasping the dynamically moving object, or co-manipulating the object together with a human. Fast and accurate 6D pose estimation is crucial to a… ▽ More

    Submitted 9 November, 2023; originally announced November 2023.

  7. arXiv:2310.18515  [pdf, other]

    cs.LG

    Learning to design protein-protein interactions with enhanced generalization

    Authors: Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic

    Abstract: Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-r… ▽ More

    Submitted 16 March, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

  8. arXiv:2309.13952  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    VidChapters-7M: Video Chapters at Scale

    Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner… ▽ More

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at NeurIPS 2023 Track on Datasets and Benchmarks; Project Webpage: https://antoyang.github.io/vidchapters.html ; 31 pages; 8 figures

  9. arXiv:2306.10169  [pdf, other]

    cs.CV cs.CL cs.LG

    Meta-Personalizing Vision-Language Models to Find Named Instances in Video

    Authors: Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni

    Abstract: Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as ``My dog Biscuit'' appears. We present the following three contributions to address this problem. First, we describe a metho… ▽ More

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted to CVPR 2023. Project webpage: https://danielchyeh.github.io/metaper/

  10. arXiv:2306.09327  [pdf, other]

    cs.CV

    Language-Guided Music Recommendation for Video via Prompt Analogies

    Authors: Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell

    Abstract: We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music. This work addresses this challenge with the following three contributions. First, we propose… ▽ More

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 (Highlight paper). Project page: https://www.danielbmckee.com/language-guided-music-for-video

  11. arXiv:2302.14115  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

    Authors: Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, w… ▽ More

    Submitted 21 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: CVPR 2023 Camera-Ready; Project Webpage: https://antoyang.github.io/vid2seq.html ; 18 pages; 6 figures

  12. arXiv:2212.06870  [pdf, other]

    cs.CV cs.RO

    MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

    Authors: Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, Josef Sivic

    Abstract: We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which c… ▽ More

    Submitted 13 December, 2022; originally announced December 2022.

    Comments: CoRL 2022

  13. arXiv:2211.13500  [pdf, other]

    cs.CV

    Multi-Task Learning of Object State Changes from Uncurated Videos

    Authors: Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

    Abstract: We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos. We introduce three principal contributions. First, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions such as pouring… ▽ More

    Submitted 24 November, 2022; originally announced November 2022.

  14. arXiv:2210.02549  [pdf, other]

    cs.LG

    Benchmarking Learning Efficiency in Deep Reservoir Computing

    Authors: Hugo Cisneros, Josef Sivic, Tomas Mikolov

    Abstract: It is common to evaluate the performance of a machine learning model by measuring its predictive power on a test dataset. This approach favors complicated models that can smoothly fit complex functions and generalize well from training data points. Although essential components of intelligence, speed and data efficiency of this learning process are rarely reported or compared between different can… ▽ More

    Submitted 29 September, 2022; originally announced October 2022.

    Comments: Conference on Lifelong Learning Agents, Aug 2022, Montreal, Canada

  15. arXiv:2209.09012  [pdf, other]

    cs.RO

    Differentiable Collision Detection: a Randomized Smoothing Approach

    Authors: Louis Montaut, Quentin Le Lidec, Antoine Bambade, Vladimir Petrik, Josef Sivic, Justin Carpentier

    Abstract: Collision detection appears as a canonical operation in a large range of robotics applications from robot control to simulation, including motion planning and estimation. While the seminal works on the topic date back to the 80s, it is only recently that the question of properly differentiating collision detection has emerged as a central issue, thanks notably to the ongoing and various efforts ma… ▽ More

    Submitted 19 September, 2022; originally announced September 2022.

    Comments: 7 pages, 6 figures, 2 tables

  16. Imitrob: Imitation Learning Dataset for Training and Evaluating 6D Object Pose Estimators

    Authors: Jiri Sedlar, Karla Stepanova, Radoslav Skoviera, Jan K. Behrens, Matus Tuna, Gabriela Sejnova, Josef Sivic, Robert Babuska

    Abstract: This paper introduces a dataset for training and evaluating methods for 6D pose estimation of hand-held tools in task demonstrations captured by a standard RGB camera. Despite the significant progress of 6D pose estimation methods, their performance is usually limited for heavily occluded objects, which is a common case in imitation learning, where the object is typically partially occluded by the… ▽ More

    Submitted 5 April, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

    Comments: The dataset and code are publicly available at http://imitrob.ciirc.cvut.cz/imitrobdataset.php

    Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2788-2795, 2023

  17. arXiv:2208.01960  [pdf, other]

    cs.RO cs.CV cs.LG

    Learning Object Manipulation Skills from Video via Approximate Differentiable Physics

    Authors: Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi

    Abstract: We aim to teach robots to perform simple object manipulation tasks by watching a single video demonstration. Towards this goal, we propose an optimization approach that outputs a coarse and temporally evolving 3D scene to mimic the action demonstrated in the input video. Similar to previous work, a differentiable renderer ensures perceptual fidelity between the 3D scene and the 2D video. Our key n… ▽ More

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted for IROS2022, code at https://github.com/petrikvladimir/video_skills_learning_with_approx_physics, project page at https://data.ciirc.cvut.cz/public/projects/2022Real2SimPhysics/

  18. arXiv:2206.08155  [pdf, other]

    cs.CV cs.CL cs.LG

    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language… ▽ More

    Submitted 10 October, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022 Camera-Ready; Project Webpage: https://antoyang.github.io/frozenbilm.html; 25 pages; 5 figures

  19. arXiv:2205.09663  [pdf, other]

    cs.RO

    Collision Detection Accelerated: An Optimization Perspective

    Authors: Louis Montaut, Quentin Le Lidec, Vladimir Petrik, Josef Sivic, Justin Carpentier

    Abstract: Collision detection between two convex shapes is an essential feature of any physics engine or robot motion planner. It has often been tackled as a computational geometry problem, with the Gilbert, Johnson and Keerthi (GJK) algorithm being the most common approach today. In this work we leverage the fact that collision detection is fundamentally a convex optimization problem. In particular, we est… ▽ More

    Submitted 20 May, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: RSS 2022, 12 pages, 9 figures, 2 tables

    Journal ref: Robotics: Science and Systems 2022

  20. arXiv:2205.05019  [pdf, other]

    cs.CV cs.CL cs.LG

    Learning to Answer Visual Questions from Web Videos

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question genera… ▽ More

    Submitted 11 May, 2022; v1 submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted at the TPAMI Special Issue on the Best Papers of ICCV 2021. Journal extension of the conference paper arXiv:2012.00451. 16 pages, 13 figures

  21. arXiv:2204.05145  [pdf, other]

    cs.CV

    Focal Length and Object Pose Estimation via Render and Compare

    Authors: Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic

    Abstract: We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are twofold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Second… ▽ More

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: Accepted to CVPR2022. Code available at http://github.com/ponimatkin/focalpose

  22. arXiv:2203.16434  [pdf, other]

    cs.CV cs.CL cs.LG

    TubeDETR: Spatio-Temporal Video Grounding with Transformers

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our m… ▽ More

    Submitted 9 June, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Updated vIoU results compared to the CVPR'22 camera-ready version; 17 pages; 8 figures

  23. arXiv:2203.11637  [pdf, other]

    cs.CV

    Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

    Authors: Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

    Abstract: Human actions often induce changes of object states such as "cutting an apple", "cleaning shoes" or "pouring coffee". In this paper, we seek to temporally localize object states (e.g. "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision. The contributions of this work are threefold. First, we develop a… ▽ More

    Submitted 22 March, 2022; originally announced March 2022.

    Comments: To be published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  24. arXiv:2203.11160  [pdf, other]

    cs.CV

    Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation

    Authors: Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic

    Abstract: This work investigates learning pixel-wise semantic image segmentation in urban scenes without any manual annotation, just from the raw non-curated data collected by cars which, equipped with cameras and LiDAR sensors, drive around a city. Our contributions are threefold. First, we propose a novel method for cross-modal unsupervised learning of semantic image segmentation by leveraging synchronize… ▽ More

    Submitted 21 February, 2024; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: v2: improved quality of images. See the project webpage https://vobecant.github.io/DriveAndSegment/ for the code and more

  25. arXiv:2111.03088  [pdf, other]

    cs.RO

    Learning to Manipulate Tools by Aligning Simulation to Video Demonstration

    Authors: Kateryna Zorina, Justin Carpentier, Josef Sivic, Vladimír Petrík

    Abstract: A seamless integration of robots into human environments requires robots to learn how to use existing human tools. Current approaches for learning tool manipulation skills mostly rely on expert demonstrations provided in the target robot environment, for example, by manually guiding the robot manipulator or by teleoperation. In this work, we introduce an automated approach that replaces an expert… ▽ More

    Submitted 4 November, 2021; originally announced November 2021.

    Comments: Accepted to IEEE Robotics and Automation Letters (RA-L)

  26. arXiv:2111.01591  [pdf, other]

    cs.CV

    Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

    Authors: Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

    Abstract: In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces exerted on the human body. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate t… ▽ More

    Submitted 2 November, 2021; originally announced November 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:1904.02683

  27. arXiv:2110.03562  [pdf, other]

    cs.CV

    Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

    Authors: Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell

    Abstract: We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to joi… ▽ More

    Submitted 7 October, 2021; originally announced October 2021.

  28. arXiv:2109.04409  [pdf, other]

    cs.CV

    Reconstructing and grounding narrated instructional videos in 3D

    Authors: Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

    Abstract: Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructiona… ▽ More

    Submitted 10 September, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

  29. arXiv:2106.14195  [pdf, other]

    cs.CV cs.AI cs.CG cs.LG cs.LO

    Learning to solve geometric construction problems from images

    Authors: J. Macke, J. Sedlar, M. Olsak, J. Urban, J. Sivic

    Abstract: We describe a purely image-based method for finding geometric constructions with a ruler and compass in the Euclidea geometric game. The method is based on adapting the Mask R-CNN state-of-the-art image processing neural architecture and adding a tree-based search procedure to it. In a supervised setting, the method learns to solve all 68 kinds of geometric construction problems from the first six… ▽ More

    Submitted 27 June, 2021; originally announced June 2021.

    Comments: 16 pages, 7 figures, 3 tables

  30. arXiv:2104.09359  [pdf, other]

    cs.CV cs.RO

    Single-view robot pose and joint angle estimation via render & compare

    Authors: Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

    Abstract: We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image. This is an important problem to grant mobile and itinerant autonomous systems the ability to interact with other robots using only visual information in non-instrumented environments, especially in the context of collaborative robotics. It is also chall… ▽ More

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: Accepted at CVPR 2021 (Oral)

  31. Visualizing computation in large-scale cellular automata

    Authors: Hugo Cisneros, Josef Sivic, Tomas Mikolov

    Abstract: Emergent processes in complex systems such as cellular automata can perform computations of increasing complexity, and could possibly lead to artificial evolution. Such a feat would require scaling up current simulation sizes to allow for enough computational capacity. Understanding complex computations happening in cellular automata and other systems capable of emergence poses many challenges, es… ▽ More

    Submitted 1 April, 2021; originally announced April 2021.

    Journal ref: Artificial Life Conference Proceedings 2020 (pp. 239-247). MIT Press

  32. arXiv:2103.16553  [pdf, other]

    cs.CV

    Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-… ▽ More

    Submitted 30 March, 2021; originally announced March 2021.

    Comments: Accepted to CVPR 2021

  33. arXiv:2012.08274  [pdf, other]

    cs.CV

    Artificial Dummies for Urban Dataset Augmentation

    Authors: Antonín Vobecký, David Hurych, Michal Uřičář, Patrick Pérez, Josef Šivic

    Abstract: Existing datasets for training pedestrian detectors in images suffer from limited appearance and pose variation. The most challenging scenarios are rarely included because they are too difficult to capture due to safety reasons, or they are very unlikely to happen. The strict safety requirements in assisted and autonomous driving applications call for an extra high detection accuracy also in these… ▽ More

    Submitted 15 December, 2020; originally announced December 2020.

    Comments: Accepted to AAAI 2021

  34. arXiv:2012.00451  [pdf, other]

    cs.CV cs.CL cs.LG

    Just Ask: Learning to Answer Questions from Millions of Narrated Videos

    Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

    Abstract: Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question genera… ▽ More

    Submitted 12 August, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

    Comments: Accepted at ICCV 2021 (Oral); 20 pages; 14 figures

  35. arXiv:2011.06813  [pdf, other]

    cs.RO cs.CV cs.LG

    Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

    Authors: Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

    Abstract: Humans are adept at learning new tasks by watching a few instructional videos. On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain. In this paper, we explore a method that facilitates learning object manipulation skills directly from videos. Leveraging recent advances in 2D visual recog… ▽ More

    Submitted 13 November, 2020; originally announced November 2020.

    Comments: CoRL 2020, code at https://github.com/makarandtapaswi/Real2Sim_CoRL2020, project page at https://data.ciirc.cvut.cz/public/projects/2020Real2Sim/

  36. arXiv:2008.08465  [pdf, other]

    cs.CV

    CosyPose: Consistent multi-view multi-object 6D pose estimation

    Authors: Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

    Abstract: We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in or… ▽ More

    Submitted 19 August, 2020; originally announced August 2020.

    Comments: ECCV 2020

  37. arXiv:2008.01018  [pdf, other]

    cs.CV

    RareAct: A video dataset of unusual interactions

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes". RareAct aims at evaluating the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions which were obtained by comb… ▽ More

    Submitted 3 August, 2020; originally announced August 2020.

  38. arXiv:2005.00069  [pdf, other]

    cs.CV cs.LG eess.IV

    Occlusion resistant learning of intuitive physics from videos

    Authors: Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

    Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to t… ▽ More

    Submitted 30 April, 2020; originally announced May 2020.

  39. arXiv:2004.10566  [pdf, other]

    cs.CV

    Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

    Authors: Ignacio Rocco, Relja Arandjelović, Josef Sivic

    Abstract: In this work we target the problem of estimating accurately localised correspondences between a pair of images. We adopt the recent Neighbourhood Consensus Networks that have demonstrated promising performance for difficult correspondence problems and propose modifications to overcome their main limitations: large memory consumption, large inference time and poorly localised correspondences. Our p… ▽ More

    Submitted 22 April, 2020; originally announced April 2020.

  40. arXiv:1912.06430  [pdf, other]

    cs.CV

    End-to-End Learning of Visual Representations from Uncurated Instructional Videos

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narra… ▽ More

    Submitted 23 August, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: CVPR'2020 Oral

  41. Evolving Structures in Complex Systems

    Authors: Hugo Cisneros, Josef Sivic, Tomas Mikolov

    Abstract: In this paper we propose an approach for measuring growth of complexity of emerging patterns in complex systems such as cellular automata. We discuss several ways how a metric for measuring the complexity growth can be defined. This includes approaches based on compression algorithms and artificial neural networks. We believe such a metric can be useful for designing systems that could exhibit ope… ▽ More

    Submitted 18 March, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: IEEE Symposium Series on Computational Intelligence 2019 (IEEE SSCI 2019)

    Journal ref: Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence

  42. arXiv:1908.04598  [pdf, other]

    cs.CV

    Is This The Right Place? Geometric-Semantic Pose Verification for Indoor Visual Localization

    Authors: Hajime Taira, Ignacio Rocco, Jiri Sedlar, Masatoshi Okutomi, Josef Sivic, Tomas Pajdla, Torsten Sattler, Akihiko Torii

    Abstract: Visual localization in large and complex indoor scenes, dominated by weakly textured rooms and repeating geometric patterns, is a challenging problem with high practical relevance for applications such as Augmented Reality and robotics. To handle the ambiguities arising in this scenario, a common strategy is, first, to generate multiple estimates for the camera pose from which a given query image…

    Submitted 2 September, 2019; v1 submitted 13 August, 2019; originally announced August 2019.

  43. arXiv:1908.00722  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Learning to combine primitive skills: A step towards versatile robotic manipulation

    Authors: Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. Traditional task and motion planning (TAMP) methods can solve complex tasks but require full state observability and are not adapted to dynamic scene changes. Recent learning methods can operate directly on visual inputs but typically require many demonstrations and/or task-specif…

    Submitted 20 June, 2020; v1 submitted 2 August, 2019; originally announced August 2019.

    Comments: ICRA 2020. See the project webpage at https://www.di.ens.fr/willow/research/rlbc/

    Journal ref: IEEE ROBOTICS AND AUTOMATION LETTERS, JULY 2020. 4637-4643

  44. arXiv:1907.12763  [pdf, other]

    cs.CV cs.CL

    Finding Moments in Video Collections Using Natural Language

    Authors: Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell

    Abstract: We introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. To address these challenges, we propose SpatioTemporal Alignment with Language (STAL), a model that represents…

    Submitted 23 February, 2022; v1 submitted 30 July, 2019; originally announced July 2019.

  45. arXiv:1906.03327  [pdf, other]

    cs.CV

    HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

    Authors: Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

    Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narration…

    Submitted 31 July, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

    Comments: Accepted at ICCV 2019

  46. arXiv:1905.03561  [pdf, other]

    cs.CV

    D2-Net: A Trainable CNN for Joint Detection and Description of Local Features

    Authors: Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler

    Abstract: In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts b…

    Submitted 9 May, 2019; originally announced May 2019.

    Comments: Accepted at CVPR 2019

  47. arXiv:1904.10348  [pdf, other]

    cs.RO cs.CV

    Monte-Carlo Tree Search for Efficient Visually Guided Rearrangement Planning

    Authors: Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic

    Abstract: We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera. To do so, we introduce a complete pipeline relying on two key contributions. First, we introduce an efficient and scalable rearrangement planni…

    Submitted 1 April, 2020; v1 submitted 23 April, 2019; originally announced April 2019.

    Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

  48. Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video

    Authors: Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

    Abstract: In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person and the object, contact positions, and forces and torques actuated by the human limbs. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate the motion and th…

    Submitted 17 June, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

  49. arXiv:1903.08225  [pdf, other]

    cs.CV

    Cross-task weakly supervised learning from instructional videos

    Authors: Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

    Abstract: In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: `pour egg' should be tra…

    Submitted 29 April, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

    Comments: 18 pages, 17 figures, to be published in proceedings of the CVPR, 2019

  50. arXiv:1901.08335  [pdf, other]

    cs.HC cs.RO

    Teaching robots to imitate a human with no on-teacher sensors. What are the key challenges?

    Authors: Radoslav Skoviera, Karla Stepanova, Michael Tesar, Gabriela Sejnova, Jiri Sedlar, Michal Vavrecka, Robert Babuska, Josef Sivic

    Abstract: In this paper, we consider the problem of learning object manipulation tasks from human demonstration using RGB or RGB-D cameras. We highlight the key challenges in capturing sufficiently good data with no tracking devices - starting from sensor selection and accurate 6DoF pose estimation to natural language processing. In particular, we focus on two showcases: gluing task with a glue gun and simp…

    Submitted 24 January, 2019; originally announced January 2019.

    Journal ref: The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018, Workshop on: Towards Intelligent Social Robots: From Naive Robots to Robot Sapiens http://intelligent-social-robots-ws.com/materials/