Showing 1–12 of 12 results for author: Zamir, A R

  1. arXiv:1910.02527  [pdf, other]

    cs.CV cs.RO

    3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

    Authors: Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, Silvio Savarese

    Abstract: A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph.…

    Submitted 6 October, 2019; originally announced October 2019.

    Comments: ICCV 2019

  2. arXiv:1905.07553  [pdf, other]

    cs.CV

    Which Tasks Should Be Learned Together in Multi-task Learning?

    Authors: Trevor Standley, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

    Abstract: Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using multi-task learning. This can save computation at inference time as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives can compete, which consequently poses the questi…

    Submitted 2 September, 2020; v1 submitted 18 May, 2019; originally announced May 2019.

    Comments: Presented at ICML 2020. See project website at http://taskgrouping.stanford.edu/

  3. arXiv:1812.11971  [pdf, other]

    cs.CV cs.AI cs.LG cs.NE cs.RO

    Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies

    Authors: Alexander Sax, Bradley Emi, Amir R. Zamir, Leonidas Guibas, Silvio Savarese, Jitendra Malik

    Abstract: How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework--see Figure 1. This skill set (hereafter mid-level perception) prov…

    Submitted 22 April, 2019; v1 submitted 31 December, 2018; originally announced December 2018.

    Comments: See project website, demos, and code at http://perceptual.actor

  4. arXiv:1807.06757  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    On Evaluation of Embodied Navigation Agents

    Authors: Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir

    Abstract: Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study emp…

    Submitted 17 July, 2018; originally announced July 2018.

    Comments: Report of a working group on empirical methodology in navigation research. Authors are listed in alphabetical order

  5. arXiv:1710.08247  [pdf, other]

    cs.CV cs.LG cs.NE cs.RO

    Generic 3D Representation via Pose Estimation and Matching

    Authors: Amir R. Zamir, Tilman Wekel, Pulkit Agrawal, Colin Weil, Jitendra Malik, Silvio Savarese

    Abstract: Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the prem…

    Submitted 23 October, 2017; originally announced October 2017.

    Comments: Published in ECCV16. See the project website http://3drepresentation.stanford.edu/ and dataset website https://github.com/amir32002/3D_Street_View

    Journal ref: ECCV 2016 535-553

  6. arXiv:1702.01105  [pdf, other]

    cs.CV cs.RO

    Joint 2D-3D-Semantic Data for Indoor Scene Understanding

    Authors: Iro Armeni, Sasha Sax, Amir R. Zamir, Silvio Savarese

    Abstract: We present a dataset of large-scale indoor spaces that provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. The dataset covers over 6,000 m² and contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in forms of both regular and 360° equi…

    Submitted 5 April, 2017; v1 submitted 3 February, 2017; originally announced February 2017.

    Comments: The dataset is available at http://3Dsemantics.stanford.edu/

  7. arXiv:1612.09508  [pdf, other]

    cs.CV

    Feedback Networks

    Authors: Amir R. Zamir, Te-Lin Wu, Lin Sun, William Shen, Jitendra Malik, Silvio Savarese

    Abstract: Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the repre…

    Submitted 20 August, 2017; v1 submitted 30 December, 2016; originally announced December 2016.

    Comments: See a video describing the method at https://youtu.be/MY5Uhv38Ttg and the website at http://feedbacknet.stanford.edu/

  8. arXiv:1605.03324  [pdf, other]

    cs.CV cs.RO stat.ML

    Unsupervised Semantic Action Discovery from Video Collections

    Authors: Ozan Sener, Amir Roshan Zamir, Chenxia Wu, Silvio Savarese, Ashutosh Saxena

    Abstract: Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, ending, and certain objective steps between them. In this paper, we consider instructional videos, of which there are tens of millions on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. O…

    Submitted 11 May, 2016; originally announced May 2016.

    Comments: First version of this paper arXiv:1506.08438 appeared in ICCV 2015. This extended version has more details on the learning algorithm and hierarchical clustering with full derivation, additional analysis on the robustness to the subtitle noise, and a novel application on robotics

  9. The THUMOS Challenge on Action Recognition for Videos "in the Wild"

    Authors: Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, Mubarak Shah

    Abstract: Automatically recognizing and localizing wide ranges of human actions has crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artifici…

    Submitted 21 April, 2016; originally announced April 2016.

    Comments: Preprint submitted to Computer Vision and Image Understanding

  10. arXiv:1511.05298  [pdf, other]

    cs.CV cs.LG cs.NE cs.RO

    Structural-RNN: Deep Learning on Spatio-Temporal Graphs

    Authors: Ashesh Jain, Amir R. Zamir, Silvio Savarese, Ashutosh Saxena

    Abstract: Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. That is while many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular tool for imposing such high-level intuitions in the formulation of real world problems. In…

    Submitted 11 April, 2016; v1 submitted 17 November, 2015; originally announced November 2015.

    Comments: CVPR 2016 (Oral)

  11. arXiv:1508.07654  [pdf, other]

    cs.CV

    Action Recognition by Hierarchical Mid-level Action Elements

    Authors: Tian Lan, Yuke Zhu, Amir Roshan Zamir, Silvio Savarese

    Abstract: Realistic videos of human actions exhibit rich spatiotemporal structures at multiple levels of granularity: an action can always be decomposed into multiple finer-grained elements in both space and time. To capture this intuition, we propose to represent videos by a hierarchy of mid-level action elements (MAEs), where each MAE corresponds to an action-related spatiotemporal segment in the video. W…

    Submitted 30 August, 2015; originally announced August 2015.

  12. arXiv:1212.0402  [pdf, other]

    cs.CV

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Authors: Khurram Soomro, Amir Roshan Zamir, Mubarak Shah

    Abstract: We introduce UCF101, which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using a standard bag of words approach with overall perform…

    Submitted 3 December, 2012; originally announced December 2012.

    Report number: CRCV-TR-12-01