Showing 1–39 of 39 results for author: Zamir, A

  1. arXiv:2406.11769  [pdf, other]

    cs.CV

    Solving Vision Tasks with Simple Photoreceptors Instead of Cameras

    Authors: Andrei Atanov, Jiawei Fu, Rishubh Singh, Isabella Yu, Andrew Spielberg, Amir Zamir

    Abstract: A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent (i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise t…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.09406  [pdf, other]

    cs.CV cs.AI cs.LG

    4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

    Authors: Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir

    Abstract: Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon the capabilities of them by training a single model on tens of highly diverse moda…

    Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page at 4m.epfl.ch

  3. arXiv:2404.07204  [pdf, other]

    cs.CV cs.AI cs.LG

    BRAVE: Broadening the visual encoding of vision-language models

    Authors: Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

    Abstract: Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we stud…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: Project page at https://brave-vlms.epfl.ch/

  4. arXiv:2403.15309  [pdf, other]

    cs.CV cs.CL cs.LG

    Controlled Training Data Generation with Diffusion Models

    Authors: Teresa Yeo, Andrei Atanov, Harold Benoit, Aleksandr Alekseev, Ruchira Ray, Pooya Esmaeil Akhoondi, Amir Zamir

    Abstract: In this work, we present a method to control a text-to-image generative model to produce training data specifically "useful" for supervised learning. Unlike previous works that employ an open-loop approach and pre-define prompts to generate new data using either a language model or human expertise, we develop an automated closed-loop system which involves two feedback mechanisms. The first mechani…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: Project page at https://adversarial-prompts.epfl.ch/

  5. arXiv:2312.16313  [pdf, other]

    cs.LG

    Unraveling the Key Components of OOD Generalization via Diversification

    Authors: Harold Benoit, Liangze Jiang, Andrei Atanov, Oğuzhan Fatih Kar, Mattia Rigotti, Amir Zamir

    Abstract: Supervised learning datasets may contain multiple cues that explain the training set equally well, i.e., learning any of them would lead to the correct predictions on the training data. However, many of them can be spurious, i.e., lose their predictive power under a distribution shift and consequently fail to generalize to out-of-distribution (OOD) data. Recently developed "diversification" method…

    Submitted 20 April, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

    Comments: ICLR 2024

  6. arXiv:2312.06647  [pdf, other]

    cs.CV cs.AI cs.LG

    4M: Massively Multimodal Masked Modeling

    Authors: David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir

    Abstract: Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a sin…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: NeurIPS 2023 Spotlight. Project page at https://4m.epfl.ch/

  7. arXiv:2309.15762  [pdf, other]

    cs.CV cs.LG

    Rapid Network Adaptation: Learning to Adapt Neural Networks Using Test-Time Feedback

    Authors: Teresa Yeo, Oğuzhan Fatih Kar, Zahra Sodagar, Amir Zamir

    Abstract: We propose a method for adapting neural networks to distribution shifts at test-time. In contrast to training-time robustness mechanisms that attempt to anticipate and counter the shift, we create a closed-loop system and make use of a test-time feedback signal to adapt a network on the fly. We show that this loop can be effectively implemented using a learning-based function, which realizes an am…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Project website at https://rapid-network-adaptation.epfl.ch/

  8. arXiv:2305.00348  [pdf, other]

    cs.CV cs.RO

    Modality-invariant Visual Odometry for Embodied Vision

    Authors: Marius Memmel, Roman Bachmann, Amir Zamir

    Abstract: Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show a solid performance without large data requirements, they are less flexible and robust w.r.t. noise and changes in the sensor suite compared…

    Submitted 29 April, 2023; originally announced May 2023.

  9. arXiv:2212.10082  [pdf, other]

    cs.LG cs.CV

    An Information-Theoretic Approach to Transferability in Task Transfer Learning

    Authors: Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, Leonidas Guibas

    Abstract: Task transfer learning is a popular technique in image processing applications that uses pre-trained models to reduce the supervision cost of related tasks. An important question is to determine task transferability, i.e. given a common input domain, estimating to what extent representations learned from a source task can help in learning a target task. Typically, transferability is either measure…

    Submitted 20 December, 2022; originally announced December 2022.

    Journal ref: 2019 IEEE International Conference on Image Processing (ICIP) (pp. 2309-2313). IEEE

  10. arXiv:2212.04581  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    PALMER: Perception-Action Loop with Memory for Long-Horizon Planning

    Authors: Onur Beker, Mohammad Mohammadi, Amir Zamir

    Abstract: To achieve autonomy in a priori unknown real-world scenarios, agents should be able to: i) act from high-dimensional sensory observations (e.g., images), ii) learn from past experience to adapt and improve, and iii) be capable of long horizon planning. Classical planning algorithms (e.g. PRM, RRT) are proficient at handling long-horizon planning. Deep learning based methods in turn can provide the…

    Submitted 8 December, 2022; originally announced December 2022.

    Comments: Website: https://palmer.epfl.ch

  11. arXiv:2212.00261  [pdf, other]

    cs.LG

    Task Discovery: Finding the Tasks that Neural Networks Generalize on

    Authors: Andrei Atanov, Andrei Filatov, Teresa Yeo, Ajay Sohmshetty, Amir Zamir

    Abstract: When developing deep learning models, we usually decide what task we want to solve then search for a model that generalizes well on the task. An intriguing question would be: what if, instead of fixing the task and searching in the model space, we fix the model and search in the task space? Can we find tasks that the model generalizes on? How do they look, or do they indicate anything? These are t…

    Submitted 30 November, 2022; originally announced December 2022.

    Comments: NeurIPS 2022, Project page at https://taskdiscovery.epfl.ch

  12. arXiv:2204.01678  [pdf, other]

    cs.CV cs.LG

    MultiMAE: Multi-modal Multi-task Masked Autoencoders

    Authors: Roman Bachmann, David Mizrahi, Andrei Atanov, Amir Zamir

    Abstract: We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can optionally accept additional modalities of information in the input besides the RGB image (hence "multi-modal"), and II) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence "multi…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Project page at https://multimae.epfl.ch

  13. arXiv:2203.01441  [pdf, other]

    cs.CV cs.LG

    3D Common Corruptions and Data Augmentation

    Authors: Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, Amir Zamir

    Abstract: We introduce a set of image transformations that can be used as corruptions to evaluate the robustness of models as well as data augmentation mechanisms for training neural networks. The primary distinction of the proposed transformations is that, unlike existing approaches such as Common Corruptions, the geometry of the scene is incorporated in the transformations -- thus leading to corruptions t…

    Submitted 29 April, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

    Comments: CVPR 2022 (Oral). Project website at https://3dcommoncorruptions.epfl.ch/

  14. arXiv:2202.05822  [pdf, other]

    cs.GR cs.AI cs.CV

    CLIPasso: Semantically-Aware Object Sketching

    Authors: Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, Ariel Shamir

    Abstract: Abstraction is at the heart of sketching due to the simple and minimal nature of line drawings. Abstraction entails identifying the essential visual properties of an object or scene, which requires semantic understanding and prior knowledge of high-level concepts. Abstract depictions are therefore challenging for artists, and even more so for machines. We present CLIPasso, an object sketching meth…

    Submitted 16 May, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: https://clipasso.github.io/clipasso/

  15. arXiv:2202.03365  [pdf, other]

    cs.CV cs.LG

    Simple Control Baselines for Evaluating Transfer Learning

    Authors: Andrei Atanov, Shijian Xu, Onur Beker, Andrei Filatov, Amir Zamir

    Abstract: Transfer learning has witnessed remarkable progress in recent years, for example, with the introduction of augmentation-based contrastive self-supervised learning methods. While a number of large-scale empirical studies on the transfer performance of such models have been conducted, there is not yet an agreed-upon set of control baselines, evaluation practices, and metrics to report, which often h…

    Submitted 7 February, 2022; originally announced February 2022.

    Comments: Project website: https://transfer-controls.epfl.ch

  16. arXiv:2201.13433  [pdf, other]

    cs.CV

    Third Time's the Charm? Image and Video Editing with StyleGAN3

    Authors: Yuval Alaluf, Or Patashnik, Zongze Wu, Asif Zamir, Eli Shechtman, Dani Lischinski, Daniel Cohen-Or

    Abstract: StyleGAN is arguably one of the most intriguing and well-studied generative models, demonstrating impressive performance in image generation, inversion, and manipulation. In this work, we explore the recent StyleGAN3 architecture, compare it to its predecessor, and investigate its unique advantages, as well as drawbacks. In particular, we demonstrate that while StyleGAN3 can be trained on unaligne…

    Submitted 31 January, 2022; originally announced January 2022.

    Comments: Project page available at https://yuval-alaluf.github.io/stylegan3-editing/

  17. arXiv:2110.04994  [pdf, other]

    cs.CV cs.AI cs.GR cs.RO

    Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans

    Authors: Ainaz Eftekhar, Alexander Sax, Roman Bachmann, Jitendra Malik, Amir Zamir

    Abstract: This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. In addition to enabling interesting lines of research, we show the tooling and generated data suffice to train robust vision models. Common…

    Submitted 11 October, 2021; originally announced October 2021.

    Comments: ICCV 2021: See project website https://omnidata.vision

  18. arXiv:2103.10919  [pdf, other]

    cs.CV cs.LG

    Robustness via Cross-Domain Ensembles

    Authors: Teresa Yeo, Oğuzhan Fatih Kar, Alexander Sax, Amir Zamir

    Abstract: We present a method for making neural network predictions robust to shifts from the training data distribution. The proposed method is based on making predictions via a diverse set of cues (called 'middle domains') and ensembling them into one strong prediction. The premise of the idea is that predictions made via different cues respond differently to a distribution shift, hence one should be able…

    Submitted 3 September, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: Project website at https://crossdomain-ensembles.epfl.ch/

  19. arXiv:2011.06698  [pdf, other]

    cs.RO cs.CV cs.LG

    Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

    Authors: Bryan Chen, Alexander Sax, Gene Lewis, Iro Armeni, Silvio Savarese, Amir Zamir, Jitendra Malik, Lerrel Pinto

    Abstract: Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training. We study the effects…

    Submitted 12 November, 2020; originally announced November 2020.

    Comments: Extended version of CoRL 2020 camera ready. Supplementary released separately

  20. arXiv:2006.04096  [pdf, other]

    cs.CV cs.GR cs.LG

    Robust Learning Through Cross-Task Consistency

    Authors: Amir Zamir, Alexander Sax, Teresa Yeo, Oğuzhan Kar, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, Leonidas Guibas

    Abstract: Visual perception entails solving a wide set of tasks, e.g., object detection, depth estimation, etc. The predictions made for multiple tasks from the same image are not independent, and therefore, are expected to be consistent. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path inv…

    Submitted 7 June, 2020; originally announced June 2020.

    Comments: CVPR 2020 (Oral). Project website, models, live demo at http://consistency.epfl.ch/

  21. arXiv:2002.09832  [pdf, other]

    cs.NI cs.LG

    Sequence Preserving Network Traffic Generation

    Authors: Sigal Shaked, Amos Zamir, Roman Vainshtein, Moshe Unger, Lior Rokach, Rami Puzis, Bracha Shapira

    Abstract: We present the Network Traffic Generator (NTG), a framework for perturbing recorded network traffic with the purpose of generating diverse but realistic background traffic for network simulation and what-if analysis in enterprise environments. The framework preserves many characteristics of the original traffic recorded in an enterprise, as well as sequences of network activities. Using the propos…

    Submitted 23 February, 2020; originally announced February 2020.

  22. arXiv:1912.13503  [pdf, other]

    cs.LG cs.CV cs.NE cs.RO

    Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks

    Authors: Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, Jitendra Malik

    Abstract: When training a neural network for a desired task, one may prefer to adapt a pre-trained network rather than starting from randomly initialized weights. Adaptation can be useful in cases when training data is scarce, when a single learner needs to perform multiple tasks, or when one wishes to encode priors in the network. The most commonly employed approaches for network adaptation are fine-tuning…

    Submitted 30 July, 2020; v1 submitted 31 December, 2019; originally announced December 2019.

    Comments: In ECCV 2020 (Spotlight). For more, see project website and code at http://sidetuning.berkeley.edu

  23. arXiv:1912.11121  [pdf, other]

    cs.CV cs.LG cs.NE cs.RO

    Learning to Navigate Using Mid-Level Visual Priors

    Authors: Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik

    Abstract: How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement le…

    Submitted 23 December, 2019; originally announced December 2019.

    Comments: In Conference on Robot Learning, 2019. See project website and demos at http://perceptual.actor/

  24. arXiv:1910.02527  [pdf, other]

    cs.CV cs.RO

    3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

    Authors: Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, Silvio Savarese

    Abstract: A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph.…

    Submitted 6 October, 2019; originally announced October 2019.

    Comments: ICCV 2019

  25. arXiv:1905.07553  [pdf, other]

    cs.CV

    Which Tasks Should Be Learned Together in Multi-task Learning?

    Authors: Trevor Standley, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

    Abstract: Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using multi-task learning. This can save computation at inference time as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives can compete, which consequently poses the questi…

    Submitted 2 September, 2020; v1 submitted 18 May, 2019; originally announced May 2019.

    Comments: Presented to ICML 2020. See project website at http://taskgrouping.stanford.edu/

  26. arXiv:1812.11971  [pdf, other]

    cs.CV cs.AI cs.LG cs.NE cs.RO

    Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies

    Authors: Alexander Sax, Bradley Emi, Amir R. Zamir, Leonidas Guibas, Silvio Savarese, Jitendra Malik

    Abstract: How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework--see Figure 1. This skill set (hereafter mid-level perception) prov…

    Submitted 22 April, 2019; v1 submitted 31 December, 2018; originally announced December 2018.

    Comments: See project website, demos, and code at http://perceptual.actor

  27. arXiv:1808.10654  [pdf, other]

    cs.AI cs.CV cs.GR cs.LG cs.RO

    Gibson Env: Real-World Perception for Embodied Agents

    Authors: Fei Xia, Amir Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese

    Abstract: Developing visual perception models for active agents and sensorimotor control are cumbersome to be done in the physical world, as existing algorithms are too slow to efficiently learn in real-time and robots are fragile and costly. This has given rise to learning-in-simulation which consequently casts a question on whether the results transfer to real-world. In this paper, we are concerned with t…

    Submitted 31 August, 2018; originally announced August 2018.

    Comments: Access the code, dataset, and project website at http://gibsonenv.vision/ . CVPR 2018

    Journal ref: CVPR 2018

  28. arXiv:1807.06757  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    On Evaluation of Embodied Navigation Agents

    Authors: Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir

    Abstract: Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study emp…

    Submitted 17 July, 2018; originally announced July 2018.

    Comments: Report of a working group on empirical methodology in navigation research. Authors are listed in alphabetical order

  29. arXiv:1804.08328  [pdf, other]

    cs.CV cs.AI cs.LG cs.NE cs.RO

    Taskonomy: Disentangling Task Transfer Learning

    Authors: Amir Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

    Abstract: Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies acros…

    Submitted 23 April, 2018; originally announced April 2018.

    Comments: CVPR 2018 (Oral). See project website and live demos at http://taskonomy.vision/

  30. arXiv:1710.08247  [pdf, other]

    cs.CV cs.LG cs.NE cs.RO

    Generic 3D Representation via Pose Estimation and Matching

    Authors: Amir R. Zamir, Tilman Wekel, Pulkit Argrawal, Colin Weil, Jitendra Malik, Silvio Savarese

    Abstract: Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the prem…

    Submitted 23 October, 2017; originally announced October 2017.

    Comments: Published in ECCV16. See the project website http://3drepresentation.stanford.edu/ and dataset website https://github.com/amir32002/3D_Street_View

    Journal ref: ECCV 2016 535-553

  31. arXiv:1702.01105  [pdf, other]

    cs.CV cs.RO

    Joint 2D-3D-Semantic Data for Indoor Scene Understanding

    Authors: Iro Armeni, Sasha Sax, Amir R. Zamir, Silvio Savarese

    Abstract: We present a dataset of large-scale indoor spaces that provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. The dataset covers over 6,000 m² and contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in forms of both regular and 360° equi…

    Submitted 5 April, 2017; v1 submitted 3 February, 2017; originally announced February 2017.

    Comments: The dataset is available at http://3Dsemantics.stanford.edu/

  32. arXiv:1612.09508  [pdf, other]

    cs.CV

    Feedback Networks

    Authors: Amir R. Zamir, Te-Lin Wu, Lin Sun, William Shen, Jitendra Malik, Silvio Savarese

    Abstract: Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the repre…

    Submitted 20 August, 2017; v1 submitted 30 December, 2016; originally announced December 2016.

    Comments: See a video describing the method at https://youtu.be/MY5Uhv38Ttg and the website at http://feedbacknet.stanford.edu/

  33. arXiv:1605.03324  [pdf, other]

    cs.CV cs.RO stat.ML

    Unsupervised Semantic Action Discovery from Video Collections

    Authors: Ozan Sener, Amir Roshan Zamir, Chenxia Wu, Silvio Savarese, Ashutosh Saxena

    Abstract: Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, ending, and certain objective steps between them. In this paper, we consider instructional videos where there are tens of millions of them on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. O…

    Submitted 11 May, 2016; originally announced May 2016.

    Comments: First version of this paper arXiv:1506.08438 appeared in ICCV 2015. This extended version has more details on the learning algorithm and hierarchical clustering with full derivation, additional analysis on the robustness to the subtitle noise, and a novel application on robotics

  34. The THUMOS Challenge on Action Recognition for Videos "in the Wild"

    Authors: Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, Mubarak Shah

    Abstract: Automatically recognizing and localizing wide ranges of human actions has crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artifici…

    Submitted 21 April, 2016; originally announced April 2016.

    Comments: Preprint submitted to Computer Vision and Image Understanding

  35. arXiv:1511.05298  [pdf, other]

    cs.CV cs.LG cs.NE cs.RO

    Structural-RNN: Deep Learning on Spatio-Temporal Graphs

    Authors: Ashesh Jain, Amir R. Zamir, Silvio Savarese, Ashutosh Saxena

    Abstract: Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. That is while many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular tool for imposing such high-level intuitions in the formulation of real world problems. In…

    Submitted 11 April, 2016; v1 submitted 17 November, 2015; originally announced November 2015.

    Comments: CVPR 2016 (Oral)

  36. arXiv:1511.00098  [pdf, other]

    cs.CV

    Semantic Cross-View Matching

    Authors: Francesco Castaldo, Amir Zamir, Roland Angst, Francesco Palmieri, Silvio Savarese

    Abstract: Matching cross-view images is challenging because the appearance and viewpoints are significantly different. While low-level features based on gradient orientations or filter responses can drastically vary with such changes in viewpoint, semantic information of images however shows an invariant characteristic in this respect. Consequently, semantically labeled regions can be used for performing cr…

    Submitted 31 October, 2015; originally announced November 2015.

  37. arXiv:1508.07654  [pdf, other]

    cs.CV

    Action Recognition by Hierarchical Mid-level Action Elements

    Authors: Tian Lan, Yuke Zhu, Amir Roshan Zamir, Silvio Savarese

    Abstract: Realistic videos of human actions exhibit rich spatiotemporal structures at multiple levels of granularity: an action can always be decomposed into multiple finer-grained elements in both space and time. To capture this intuition, we propose to represent videos by a hierarchy of mid-level action elements (MAEs), where each MAE corresponds to an action-related spatiotemporal segment in the video. W…

    Submitted 30 August, 2015; originally announced August 2015.

  38. arXiv:1506.08438  [pdf, other]

    cs.CV

    Unsupervised Semantic Parsing of Video Collections

    Authors: Ozan Sener, Amir Zamir, Silvio Savarese, Ashutosh Saxena

    Abstract: Human communication typically has an underlying structure. This is reflected in the fact that in many user generated videos, a starting point, ending, and certain objective steps between these two can be identified. In this paper, we propose a method for parsing a video into such semantic steps in an unsupervised way. The proposed method is capable of providing a semantic "storyline" of the video…

    Submitted 27 January, 2016; v1 submitted 28 June, 2015; originally announced June 2015.

  39. arXiv:1212.0402  [pdf, other]

    cs.CV

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Authors: Khurram Soomro, Amir Roshan Zamir, Mubarak Shah

    Abstract: We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall perform…

    Submitted 3 December, 2012; originally announced December 2012.

    Report number: CRCV-TR-12-01