
Showing 51–69 of 69 results for author: Sivic, J

  1. arXiv:1901.08335  [pdf, other]

    cs.HC cs.RO

    Teaching robots to imitate a human with no on-teacher sensors. What are the key challenges?

    Authors: Radoslav Skoviera, Karla Stepanova, Michael Tesar, Gabriela Sejnova, Jiri Sedlar, Michal Vavrecka, Robert Babuska, Josef Sivic

    Abstract: In this paper, we consider the problem of learning object manipulation tasks from human demonstration using RGB or RGB-D cameras. We highlight the key challenges in capturing sufficiently good data with no tracking devices - starting from sensor selection and accurate 6DoF pose estimation to natural language processing. In particular, we focus on two showcases: a gluing task with a glue gun and simp…

    Submitted 24 January, 2019; originally announced January 2019.

    Journal ref: The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018, Workshop on: Towards Intelligent Social Robots: From Naive Robots to Robot Sapiens http://intelligent-social-robots-ws.com/materials/

  2. arXiv:1812.05736  [pdf, other]

    cs.CV

    Detecting unseen visual relations using analogies

    Authors: Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

    Abstract: We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training. This is an important set-up due to the combinatorial nature of visual relations: collecting sufficient training data for all possible triplets would be ver…

    Submitted 22 September, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

  3. arXiv:1810.10510  [pdf, other]

    cs.CV cs.LG

    Neighbourhood Consensus Networks

    Authors: Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic

    Abstract: We address the problem of finding reliable dense correspondences between a pair of images. This is a challenging task due to strong appearance differences between the corresponding scene elements and ambiguities generated by repetitive patterns. The contributions of this work are threefold. First, inspired by the classic idea of disambiguating feature matches using semi-local constraints, we devel…

    Submitted 29 November, 2018; v1 submitted 24 October, 2018; originally announced October 2018.

    Comments: In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018)

  4. arXiv:1809.01337  [pdf, other]

    cs.CV cs.CL

    Localizing Moments in Video with Temporal Language

    Authors: Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

    Abstract: Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in tex…

    Submitted 5 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  5. arXiv:1804.02516  [pdf, other]

    cs.CV

    Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

    Authors: Antoine Miech, Ivan Laptev, Josef Sivic

    Abstract: Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, w…

    Submitted 16 January, 2020; v1 submitted 7 April, 2018; originally announced April 2018.

    Comments: The paper had a major update in January 2020 after we found a bug in the codebase that affected many results

  6. arXiv:1803.10368  [pdf, other]

    cs.CV

    InLoc: Indoor Visual Localization with Dense Matching and View Synthesis

    Authors: Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, Akihiko Torii

    Abstract: We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor environments. The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii)…

    Submitted 8 April, 2018; v1 submitted 27 March, 2018; originally announced March 2018.

  7. arXiv:1712.06861  [pdf, other]

    cs.CV cs.LG

    End-to-end weakly-supervised semantic alignment

    Authors: Ignacio Rocco, Relja Arandjelović, Josef Sivic

    Abstract: We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category. This is a challenging task due to large intra-class variation, changes in viewpoint and background clutter. We present the following three principal contributions. First, we develop a convolutional neural network architecture for semantic a…

    Submitted 24 April, 2018; v1 submitted 19 December, 2017; originally announced December 2017.

    Comments: In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018)

  8. arXiv:1708.01641  [pdf, other]

    cs.CV

    Localizing Moments in Video with Natural Language

    Authors: Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

    Abstract: We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global vid…

    Submitted 4 August, 2017; originally announced August 2017.

    Comments: ICCV 2017

  9. arXiv:1707.09472  [pdf, other]

    cs.CV

    Weakly-supervised learning of visual relations

    Authors: Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

    Abstract: This paper introduces a novel approach for modeling visual relations between pairs of objects. We call a relation a triplet of the form (subject, predicate, object) where the predicate is typically a preposition (e.g. 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging as the objects have different spatial configura…

    Submitted 29 July, 2017; originally announced July 2017.

  10. arXiv:1707.09092  [pdf, ps, other]

    cs.CV

    Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions

    Authors: Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, Tomas Pajdla

    Abstract: Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds. Practical visual localization approaches need to be robust to a wide variety of viewing conditions, including day-night changes, as well as weather and seasonal variations, while providing highly accurate 6 degree-of-freedom (6DOF) camera pose estimate…

    Submitted 4 April, 2018; v1 submitted 27 July, 2017; originally announced July 2017.

    Comments: Accepted to CVPR 2018 as a spotlight

  11. arXiv:1707.09074  [pdf, other]

    cs.CV

    Learning from Video and Text via Large-Scale Discriminative Clustering

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic

    Abstract: Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. Such applications include person and action recognition, text-to-video alignment, object co-segmentation and colocalization in videos and images. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm ba…

    Submitted 27 July, 2017; originally announced July 2017.

    Comments: To appear in ICCV 2017

  12. arXiv:1706.06905  [pdf, other]

    cs.CV

    Learnable pooling with Context Gating for video classification

    Authors: Antoine Miech, Ivan Laptev, Josef Sivic

    Abstract: Current methods for video analysis often extract frame-level features using pre-trained convolutional neural networks (CNNs). Such features are then aggregated over time e.g., by simple temporal averaging or more sophisticated recurrent neural networks such as long short-term memory (LSTM) or gated recurrent units (GRU). In this work we revise existing video representations and study alternative m…

    Submitted 5 March, 2018; v1 submitted 21 June, 2017; originally announced June 2017.

    Comments: Presented at the YouTube-8M CVPR 2017 Workshop. Kaggle winning model. Under review for TPAMI

  13. arXiv:1704.02895  [pdf, other]

    cs.CV

    ActionVLAD: Learning spatio-temporal aggregation for action classification

    Authors: Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

    Abstract: In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different…

    Submitted 10 April, 2017; originally announced April 2017.

    Comments: Accepted to CVPR 2017. Project page: https://rohitgirdhar.github.io/ActionVLAD/

  14. arXiv:1703.05593  [pdf, other]

    cs.CV cs.LG

    Convolutional neural network architecture for geometric matching

    Authors: Ignacio Rocco, Relja Arandjelović, Josef Sivic

    Abstract: We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standa…

    Submitted 13 April, 2017; v1 submitted 16 March, 2017; originally announced March 2017.

    Comments: In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)

  15. arXiv:1702.02738  [pdf, other]

    cs.CV cs.LG

    Joint Discovery of Object States and Manipulation Actions

    Authors: Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

    Abstract: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identif…

    Submitted 28 August, 2017; v1 submitted 9 February, 2017; originally announced February 2017.

    Comments: Appears in: International Conference on Computer Vision 2017 (ICCV 2017). 15 pages

    ACM Class: I.5.1; I.5.4; I.2

  16. arXiv:1511.07247  [pdf, other]

    cs.CV cs.LG

    NetVLAD: CNN architecture for weakly supervised place recognition

    Authors: Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic

    Abstract: We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this archite…

    Submitted 2 May, 2016; v1 submitted 23 November, 2015; originally announced November 2015.

    Comments: Appears in: IEEE Computer Vision and Pattern Recognition (CVPR) 2016

  17. arXiv:1506.09215  [pdf, other]

    cs.CV cs.LG

    Unsupervised Learning from Narrated Instruction Videos

    Authors: Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

    Abstract: We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering pr…

    Submitted 28 June, 2016; v1 submitted 30 June, 2015; originally announced June 2015.

    Comments: Appears in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 21 pages

    ACM Class: I.5.1; I.5.4; I.2

  18. arXiv:1408.3304  [pdf, other]

    cs.CV math.OC

    On Pairwise Costs for Network Flow Multi-Object Tracking

    Authors: Visesh Chari, Simon Lacoste-Julien, Ivan Laptev, Josef Sivic

    Abstract: Multi-object tracking has been recently approached with the min-cost network flow optimization techniques. Such methods simultaneously resolve multiple object tracks in a video and enable modeling of dependencies among tracks. Min-cost network flow methods also fit well within the "tracking-by-detection" paradigm where object trajectories are obtained by connecting per-frame outputs of an object d…

    Submitted 5 May, 2015; v1 submitted 14 August, 2014; originally announced August 2014.

    Journal ref: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5537-5545

  19. arXiv:1407.1208  [pdf, other]

    cs.CV cs.LG

    Weakly Supervised Action Labeling in Videos Under Ordering Constraints

    Authors: Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

    Abstract: We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the problem as a weakly supervised temporal assignment with…

    Submitted 4 July, 2014; originally announced July 2014.

    Comments: 17 pages, completed version of an ECCV 2014 conference paper