-
AI2-THOR: An Interactive 3D Environment for Visual AI
Authors:
Eric Kolve,
Roozbeh Mottaghi,
Winson Han,
Eli VanderBilt,
Luca Weihs,
Alvaro Herrasti,
Matt Deitke,
Kiana Ehsani,
Daniel Gordon,
Yuke Zhu,
Aniruddha Kembhavi,
Abhinav Gupta,
Ali Farhadi
Abstract:
We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.
Submitted 26 August, 2022; v1 submitted 14 December, 2017;
originally announced December 2017.
-
IQA: Visual Question Answering in Interactive Environments
Authors:
Daniel Gordon,
Aniruddha Kembhavi,
Mohammad Rastegari,
Joseph Redmon,
Dieter Fox,
Ali Farhadi
Abstract:
We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98
Submitted 6 September, 2018; v1 submitted 8 December, 2017;
originally announced December 2017.
-
Structured Set Matching Networks for One-Shot Part Labeling
Authors:
Jonghyun Choi,
Jayant Krishnamurthy,
Aniruddha Kembhavi,
Ali Farhadi
Abstract:
Diagrams often depict complex phenomena and serve as a good test bed for visual and textual reasoning. However, understanding diagrams using natural image understanding approaches requires large training datasets of diagrams, which are very hard to obtain. Instead, this can be addressed as a matching problem either between labeled diagrams, images or both. This problem is very challenging since the absence of significant color and texture renders local cues ambiguous and requires global reasoning. We consider the problem of one-shot part labeling: labeling multiple parts of an object in a target image given only a single source image of that category. For this set-to-set matching problem, we introduce the Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks. The SSMN is trained using global normalization to maximize local match scores between corresponding elements and a global consistency score among all matched elements, while also enforcing a matching constraint between the two sets. The SSMN significantly outperforms several strong baselines on three label transfer scenarios: diagram-to-diagram, evaluated on a new diagram dataset of over 200 categories; image-to-image, evaluated on a dataset built on top of the Pascal Part Dataset; and image-to-diagram, evaluated on transferring labels across these datasets.
Submitted 3 April, 2018; v1 submitted 5 December, 2017;
originally announced December 2017.
-
Neural Speed Reading via Skim-RNN
Authors:
Minjoon Seo,
Sewon Min,
Ali Farhadi,
Hannaneh Hajishirzi
Abstract:
Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can be easily used instead of RNNs in existing models. In our experiments, we show that Skim-RNN can achieve significantly reduced computational cost without losing accuracy compared to standard RNNs across five different natural language tasks. In addition, we demonstrate that the trade-off between accuracy and speed of Skim-RNN can be dynamically controlled during inference time in a stable manner. Our analysis also shows that Skim-RNN running on a single CPU offers lower latency compared to standard RNNs on GPUs.
Submitted 28 March, 2018; v1 submitted 6 November, 2017;
originally announced November 2017.
-
On the Complexity of Chore Division
Authors:
Alireza Farhadi,
MohammadTaghi Hajiaghayi
Abstract:
We study the proportional chore division problem where a protocol wants to divide an undesirable object, called a chore, among $n$ different players. The goal is to find an allocation such that the cost of the chore assigned to each player is at most $1/n$ of the total cost. This problem is the dual variant of the cake cutting problem in which we want to allocate a desirable object. Edmonds and Pruhs showed that any protocol for the proportional cake cutting must use at least $\Omega(n \log n)$ queries in the worst case; however, finding a lower bound for the proportional chore division remained an interesting open problem. We show that chore division and cake cutting problems are closely related to each other and provide an $\Omega(n \log n)$ lower bound for chore division.
Submitted 7 May, 2018; v1 submitted 30 September, 2017;
originally announced October 2017.
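The proportionality requirement stated in the abstract above can be written out as a short formula. This is a sketch under assumed notation, none of which appears in the listing: the chore is modelled as the interval $[0,1]$, $c_i$ is player $i$'s additive cost function, and $(X_1, \dots, X_n)$ is a partition of the chore. It also assumes that "total cost" means each player's own cost for the whole chore.

```latex
% Assumed notation (not from the abstract): c_i is player i's cost function,
% X_i is the piece assigned to player i, and [0,1] is the whole chore.
\[
  c_i(X_i) \;\le\; \frac{1}{n}\, c_i\bigl([0,1]\bigr)
  \qquad \text{for every player } i \in \{1, \dots, n\}.
\]
% For comparison, proportional cake cutting with valuations v_i reverses the
% inequality: v_i(X_i) \ge \frac{1}{n} v_i([0,1]).
```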
-
AJILE Movement Prediction: Multimodal Deep Learning for Natural Human Neural Recordings and Video
Authors:
Nancy Xin Ru Wang,
Ali Farhadi,
Rajesh Rao,
Bingni Brunton
Abstract:
Developing useful interfaces between brains and machines is a grand challenge of neuroengineering. An effective interface has the capacity to not only interpret neural signals, but predict the intentions of the human to perform an action in the near future; prediction is made even more challenging outside well-controlled laboratory experiments. This paper describes our approach to detect and to predict natural human arm movements in the future, a key challenge in brain computer interfacing that has never before been attempted. We introduce the novel Annotated Joints in Long-term ECoG (AJILE) dataset; AJILE includes automatically annotated poses of 7 upper body joints for four human subjects over 670 total hours (more than 72 million frames), along with the corresponding simultaneously acquired intracranial neural recordings. The size and scope of AJILE greatly exceed all previous datasets with movements and electrocorticography (ECoG), making it possible to take a deep learning approach to movement prediction. We propose a multimodal model that combines deep convolutional neural networks (CNN) with long short-term memory (LSTM) blocks, leveraging both ECoG and video modalities. We demonstrate that our models are able to detect movements and predict future movements up to 800 msec before movement initiation. Further, our multimodal movement prediction models exhibit resilience to simulated ablation of input neural signals. We believe a multimodal approach to natural neural decoding that takes context into account is critical in advancing bioelectronic technologies and human neuroscience.
Submitted 1 March, 2018; v1 submitted 12 September, 2017;
originally announced September 2017.
-
Visual Semantic Planning using Deep Successor Representations
Authors:
Yuke Zhu,
Daniel Gordon,
Eric Kolve,
Dieter Fox,
Li Fei-Fei,
Abhinav Gupta,
Roozbeh Mottaghi,
Ali Farhadi
Abstract:
A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world. In this work, we address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state. Doing so entails knowledge about objects and their affordances, as well as actions and their preconditions and effects. We propose learning these through interacting with a visual and dynamic environment. Our proposed solution involves bootstrapping reinforcement learning with imitation learning. To ensure cross task generalization, we develop a deep predictive model based on successor representations. Our experimental results show near optimal results across a wide range of tasks in the challenging THOR environment.
Submitted 15 August, 2017; v1 submitted 23 May, 2017;
originally announced May 2017.
-
Re3 : Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects
Authors:
Daniel Gordon,
Ali Farhadi,
Dieter Fox
Abstract:
Robust object tracking requires knowledge and understanding of the object being tracked: its appearance, its motion, and how it changes over time. A tracker must be able to modify its underlying model and adapt to new observations. We present Re3, a real-time deep object tracker capable of incorporating temporal information into its model. Rather than focusing on a limited set of objects or training a model at test-time to track a specific instance, we pretrain our generic tracker on a large variety of objects and efficiently update on the fly; Re3 simultaneously tracks and updates the appearance model with a single forward pass. This lightweight model is capable of tracking objects at 150 FPS, while attaining competitive results on challenging benchmarks. We also show that our method handles temporary occlusion better than other comparable trackers using experiments that directly measure performance on sequences with occlusion.
Submitted 26 February, 2018; v1 submitted 17 May, 2017;
originally announced May 2017.
-
SeGAN: Segmenting and Generating the Invisible
Authors:
Kiana Ehsani,
Roozbeh Mottaghi,
Ali Farhadi
Abstract:
Objects often occlude each other in scenes; inferring their appearance beyond their visible parts plays an important role in scene understanding, depth estimation, object interaction and manipulation. In this paper, we study the challenging problem of completing the appearance of occluded objects. Doing so requires knowing which pixels to paint (segmenting the invisible parts of objects) and what color to paint them (generating the invisible parts). Our proposed novel solution, SeGAN, jointly optimizes for both segmentation and generation of the invisible parts of objects. Our experimental results show that: (a) SeGAN can learn to generate the appearance of the occluded parts of objects; (b) SeGAN outperforms state-of-the-art segmentation baselines for the invisible parts of objects; (c) trained on synthetic photo realistic images, SeGAN can reliably segment natural images; (d) by reasoning about occluder occludee relations, our method can infer depth layering.
Submitted 7 May, 2018; v1 submitted 29 March, 2017;
originally announced March 2017.
-
Fair Allocation of Indivisible Goods to Asymmetric Agents
Authors:
Alireza Farhadi,
Mohammad Ghodsi,
MohammadTaghi Hajiaghayi,
Sebastien Lahaie,
David Pennock,
Masoud Seddighin,
Saeed Seddighin,
Hadi Yami
Abstract:
We study fair allocation of indivisible goods to agents with unequal entitlements. Fair allocation has been the subject of many studies in both divisible and indivisible settings. Our emphasis is on the case where the goods are indivisible and agents have unequal entitlements. This problem is a generalization of the work by Procaccia and Wang wherein the agents are assumed to be symmetric with respect to their entitlements. Although Procaccia and Wang show an almost fair (constant approximation) allocation exists in their setting, our main result is in sharp contrast to their observation. We show that, in some cases with $n$ agents, no allocation can guarantee better than $1/n$ approximation of a fair allocation when the entitlements are not necessarily equal. Furthermore, we devise a simple algorithm that ensures a $1/n$ approximation guarantee. Our second result is for a restricted version of the problem where the valuation of every agent for each good is bounded by the total value he wishes to receive in a fair allocation. Although this assumption might seem w.l.o.g, we show it enables us to find a $1/2$ approximation fair allocation via a greedy algorithm. Finally, we run some experiments on real-world data and show that, in practice, a fair allocation is likely to exist. We also support our experiments by showing positive results for two stochastic variants of the problem, namely stochastic agents and stochastic items.
Submitted 11 April, 2017; v1 submitted 5 March, 2017;
originally announced March 2017.
-
See the Glass Half Full: Reasoning about Liquid Containers, their Volume and Content
Authors:
Roozbeh Mottaghi,
Connor Schenck,
Dieter Fox,
Ali Farhadi
Abstract:
Humans have a rich understanding of liquid containers and their contents; for example, we can effortlessly pour water from a pitcher to a cup. Doing so requires estimating the volume of the cup, approximating the amount of water in the pitcher, and predicting the behavior of water when we tilt the pitcher. Very little attention in computer vision has been paid to liquids and their containers. In this paper, we study liquid containers and their contents, and propose methods to estimate the volume of containers, approximate the amount of liquid in them, and perform comparative volume estimations all from a single RGB image. Furthermore, we show the results of the proposed model for predicting the behavior of liquids inside containers when one tilts the containers. We also introduce a new dataset of Containers Of liQuid contEnt (COQE) that contains more than 5,000 images of 10,000 liquid containers in context labelled with volume, amount of content, bounding box annotation, and corresponding similar 3D CAD models.
Submitted 6 September, 2017; v1 submitted 10 January, 2017;
originally announced January 2017.
-
YOLO9000: Better, Faster, Stronger
Authors:
Joseph Redmon,
Ali Farhadi
Abstract:
We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.
Submitted 25 December, 2016;
originally announced December 2016.
-
Asynchronous Temporal Fields for Action Recognition
Authors:
Gunnar A. Sigurdsson,
Santosh Divvala,
Ali Farhadi,
Abhinav Gupta
Abstract:
Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: For inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high-correlation between data points leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.
Submitted 24 July, 2017; v1 submitted 19 December, 2016;
originally announced December 2016.
-
Commonly Uncommon: Semantic Sparsity in Situation Recognition
Authors:
Mark Yatskar,
Vicente Ordonez,
Luke Zettlemoyer,
Ali Farhadi
Abstract:
Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects and the roles objects play within the activity. For this problem, we find empirically that most object-role combinations are rare, and current state-of-the-art models significantly underperform in this sparse data regime. We avoid many such errors by (1) introducing a novel tensor composition function that learns to share examples across role-noun combinations and (2) semantically augmenting our training data with automatically gathered examples of rarely observed outputs using web data. When integrated within a complete CRF-based structured prediction model, the tensor-based approach outperforms existing state of the art by a relative improvement of 2.11% and 4.40% on top-5 verb and noun-role accuracy, respectively. Adding 5 million images with our semantic augmentation techniques gives further relative improvements of 6.23% and 9.57% on top-5 verb and noun-role accuracy.
Submitted 2 December, 2016;
originally announced December 2016.
-
LCNN: Lookup-based Convolutional Neural Network
Authors:
Hessam Bagherinezhad,
Mohammad Rastegari,
Ali Farhadi
Abstract:
Porting state of the art deep learning algorithms to resource constrained compute platforms (e.g. VR, AR, wearables) is extremely challenging. We propose a fast, compact, and accurate model for convolutional neural networks that enables efficient learning and inference. We introduce LCNN, a lookup-based convolutional neural network that encodes convolutions by few lookups to a dictionary that is trained to cover the space of weights in CNNs. Training LCNN involves jointly learning a dictionary and a small set of linear combinations. The size of the dictionary naturally traces a spectrum of trade-offs between efficiency and accuracy. Our experimental results on ImageNet challenge show that LCNN can offer 3.2x speedup while achieving 55.1% top-1 accuracy using AlexNet architecture. Our fastest LCNN offers 37.6x speed up over AlexNet while maintaining 44.3% top-1 accuracy. LCNN not only offers dramatic speed ups at inference, but it also enables efficient training. In this paper, we show the benefits of LCNN in few-shot learning and few-iteration learning, two crucial aspects of on-device training of deep learning models.
Submitted 12 June, 2017; v1 submitted 20 November, 2016;
originally announced November 2016.
-
Bidirectional Attention Flow for Machine Comprehension
Authors:
Minjoon Seo,
Aniruddha Kembhavi,
Ali Farhadi,
Hannaneh Hajishirzi
Abstract:
Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.
Submitted 21 June, 2018; v1 submitted 5 November, 2016;
originally announced November 2016.
-
Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning
Authors:
Yuke Zhu,
Roozbeh Mottaghi,
Eric Kolve,
Joseph J. Lim,
Abhinav Gupta,
Li Fei-Fei,
Ali Farhadi
Abstract:
Two less addressed issues of deep reinforcement learning are (1) lack of generalization capability to new target goals, and (2) data inefficiency i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to be applied to real-world scenarios. In this paper, we address these two issues and apply our model to the task of target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows to better generalize. To address the second issue, we propose AI2-THOR framework, which provides an environment with high-quality 3D scenes and physics engine. Our framework enables agents to take actions and interact with objects. Hence, we can collect a huge number of training samples efficiently.
We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and across scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), (4) is end-to-end trainable and does not need feature engineering, feature matching between frames or 3D reconstruction of the environment.
The supplementary video can be accessed at the following link: https://youtu.be/SmBxMDiOrvs.
Submitted 16 September, 2016;
originally announced September 2016.
-
Much Ado About Time: Exhaustive Annotation of Temporal Data
Authors:
Gunnar A. Sigurdsson,
Olga Russakovsky,
Ali Farhadi,
Ivan Laptev,
Abhinav Gupta
Abstract:
Large-scale annotated datasets allow AI systems to learn from and build upon the knowledge of the crowd. Many crowdsourcing techniques have been developed for collecting image annotations. These techniques often implicitly rely on the fact that a new input image takes a negligible amount of time to perceive. In contrast, we investigate and determine the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos. Watching even a short 30-second video clip requires a significant time investment from a crowd worker; thus, requesting multiple annotations following a single viewing is an important cost-saving strategy. But how many questions should we ask per video? We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments). We demonstrate that while workers may not correctly answer all questions, the cost-benefit analysis nevertheless favors consensus from multiple such cheap-yet-imperfect iterations over more complex alternatives. When compared with a one-question-per-video baseline, our method is able to achieve a 10% improvement in recall (76.7% ours versus 66.7% baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline). We demonstrate the effectiveness of our method by collecting multi-label annotations of 157 human activities on 1,815 videos.
△ Less
Submitted 2 October, 2016; v1 submitted 25 July, 2016;
originally announced July 2016.
-
Query-Reduction Networks for Question Answering
Authors:
Minjoon Seo,
Sewon Min,
Ali Farhadi,
Hannaneh Hajishirzi
Abstract:
In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference.
Submitted 24 February, 2017; v1 submitted 14 June, 2016;
originally announced June 2016.
-
Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks
Authors:
Junyuan Xie,
Ross Girshick,
Ali Farhadi
Abstract:
As 3D movie viewing becomes mainstream and Virtual Reality (VR) market emerges, the demand for 3D contents is growing rapidly. Producing 3D videos, however, remains challenging. In this paper we propose to use deep neural networks for automatically converting 2D videos and images to stereoscopic 3D format. In contrast to previous automatic 2D-to-3D conversion algorithms, which have separate stages and need ground truth depth map as supervision, our approach is trained end-to-end directly on stereo pairs extracted from 3D movies. This novel training scheme makes it possible to exploit orders of magnitude more data and significantly increases performance. Indeed, Deep3D outperforms baselines in both quantitative and human subject evaluations.
Submitted 13 April, 2016;
originally announced April 2016.
-
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Authors:
Gunnar A. Sigurdsson,
Gül Varol,
Xiaolong Wang,
Ali Farhadi,
Ivan Laptev,
Abhinav Gupta
Abstract:
Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community.
Submitted 26 July, 2016; v1 submitted 6 April, 2016;
originally announced April 2016.
-
A Diagram Is Worth A Dozen Images
Authors:
Aniruddha Kembhavi,
Mike Salvato,
Eric Kolve,
Minjoon Seo,
Hannaneh Hajishirzi,
Ali Farhadi
Abstract:
Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation and reasoning, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships for over 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.
Submitted 23 March, 2016;
originally announced March 2016.
-
"What happens if..." Learning to Predict the Effect of Forces in Images
Authors:
Roozbeh Mottaghi,
Mohammad Rastegari,
Abhinav Gupta,
Ali Farhadi
Abstract:
What happens if one pushes a cup sitting on a table toward the edge of the table? How about pushing a desk against a wall? In this paper, we study the problem of understanding the movements of objects as a result of applying external forces to them. For a given force vector applied to a specific location in an image, our goal is to predict long-term sequential movements caused by that force. Doing so entails reasoning about scene geometry, objects, their attributes, and the physical rules that govern the movements of objects. We design a deep neural network model that learns long-term sequential dependencies of object movements while taking into account the geometry and appearance of the scene by combining Convolutional and Recurrent Neural Networks. Training our model requires a large-scale dataset of object movements caused by external forces. To build a dataset of forces in scenes, we reconstructed all images in SUN RGB-D dataset in a physics simulator to estimate the physical movements of objects caused by external forces applied to them. Our Forces in Scenes (ForScene) dataset contains 10,335 images in which a variety of external forces are applied to different types of objects resulting in more than 65,000 object movements represented in 3D. Our experimental evaluations show that the challenging task of predicting long-term movements of objects as their reaction to external forces is possible from a single image.
Submitted 17 March, 2016;
originally announced March 2016.
-
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
Authors:
Mohammad Rastegari,
Vicente Ordonez,
Joseph Redmon,
Ali Farhadi
Abstract:
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure). We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy.
Submitted 2 August, 2016; v1 submitted 16 March, 2016;
originally announced March 2016.
-
Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects
Authors:
Hessam Bagherinezhad,
Hannaneh Hajishirzi,
Yejin Choi,
Ali Farhadi
Abstract:
Human vision greatly benefits from the information about sizes of objects. The role of size in several visual reasoning tasks has been thoroughly explored in human perception and cognition. However, the impact of the information about sizes of objects is yet to be determined in AI. We postulate that this is mainly attributed to the lack of a comprehensive repository of size information. In this paper, we introduce a method to automatically infer object sizes, leveraging visual and textual information from web. By maximizing the joint likelihood of textual and visual observations, our method learns reliable relative size estimates, with no explicit human supervision. We introduce the relative size dataset and show that our method outperforms competitive textual and visual baselines in reasoning about size comparisons.
Submitted 1 February, 2016;
originally announced February 2016.
-
Toward a Taxonomy and Computational Models of Abnormalities in Images
Authors:
Babak Saleh,
Ahmed Elgammal,
Jacob Feldman,
Ali Farhadi
Abstract:
The human visual system can spot an abnormal image, and reason about what makes it strange. This task has not received enough attention in computer vision. In this paper we study various types of atypicalities in images in a more comprehensive way than has been done before. We propose a new dataset of abnormal images showing a wide range of atypicalities. We design human subject experiments to discover a coarse taxonomy of the reasons for abnormality. Our experiments reveal three major categories of abnormality: object-centric, scene-centric, and contextual. Based on this taxonomy, we propose a comprehensive computational model that can predict all different types of abnormality in images and outperform prior arts in abnormality recognition.
Submitted 4 December, 2015;
originally announced December 2015.
-
Actions ~ Transformations
Authors:
Xiaolong Wang,
Ali Farhadi,
Abhinav Gupta
Abstract:
What defines an action like "kicking ball"? We argue that the true meaning of an action lies in the change or transformation an action brings to the environment. In this paper, we propose a novel representation for actions by modeling an action as a transformation which changes the state of the environment before the action happens (precondition) to the state after the action (effect). Motivated by recent advancements of video representation using deep learning, we design a Siamese network which models the action as a transformation on a high-level feature space. We show that our model gives improvements on standard action recognition datasets including UCF101 and HMDB51. More importantly, our approach is able to generalize beyond learned action categories and shows significant performance improvement on cross-category generalization on our new ACT dataset.
Submitted 26 July, 2016; v1 submitted 2 December, 2015;
originally announced December 2015.
-
Unsupervised Deep Embedding for Clustering Analysis
Authors:
Junyuan Xie,
Ross Girshick,
Ali Farhadi
Abstract:
Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.
Submitted 24 May, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images
Authors:
Roozbeh Mottaghi,
Hessam Bagherinezhad,
Mohammad Rastegari,
Ali Farhadi
Abstract:
In this paper, we study the challenging problem of predicting the dynamics of objects in static images. Given a query object in an image, our goal is to provide a physical understanding of the object in terms of the forces acting upon it and its long term motion as response to those forces. Direct and explicit estimation of the forces and the motion of objects from a single image is extremely challenging. We define intermediate physical abstractions called Newtonian scenarios and introduce Newtonian Neural Network ($N^3$) that learns to map a single image to a state in a Newtonian scenario. Our experimental evaluations show that our method can reliably predict dynamics of a query object from a single image. In addition, our approach can provide physical reasoning that supports the predicted dynamics in terms of velocity and force vectors. To spur research in this direction we compiled Visual Newtonian Dynamics (VIND) dataset that includes 6806 videos aligned with Newtonian scenarios represented using game engines, and 4516 still images with their ground truth dynamics.
Submitted 12 November, 2015;
originally announced November 2015.
-
VISALOGY: Answering Visual Analogy Questions
Authors:
Fereshteh Sadeghi,
C. Lawrence Zitnick,
Ali Farhadi
Abstract:
In this paper, we study the problem of answering visual analogy questions. These questions take the form of image A is to image B as image C is to what. Answering these questions entails discovering the mapping from image A to image B and then extending the mapping to image C and searching for the image D such that the relation from A to B holds for C to D. We pose this problem as learning an embedding that encourages pairs of analogous images with similar transformations to be close together using convolutional neural networks with a quadruple Siamese architecture. We introduce a dataset of visual analogy questions in natural images, and show first results of its kind on solving analogy questions on natural images.
Submitted 30 October, 2015;
originally announced October 2015.
-
Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing
Authors:
Hamid Izadinia,
Fereshteh Sadeghi,
Santosh Kumar Divvala,
Yejin Choi,
Ali Farhadi
Abstract:
We introduce Segment-Phrase Table (SPT), a large collection of bijective associations between textual phrases and their corresponding segmentations. Leveraging recent progress in object recognition and natural language semantics, we show how we can successfully build a high-quality segment-phrase table using minimal human supervision. More importantly, we demonstrate the unique value unleashed by this rich bimodal resource, for both vision as well as natural language understanding. First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments. This feature enables us to achieve state-of-the-art segmentation results on benchmark datasets. Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases. Leveraging this feature, we motivate the problem of visual entailment and visual paraphrasing, and demonstrate its utility on a large dataset.
Submitted 27 September, 2015;
originally announced September 2015.
-
You Only Look Once: Unified, Real-Time Object Detection
Authors:
Joseph Redmon,
Santosh Divvala,
Ross Girshick,
Ali Farhadi
Abstract:
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.
Submitted 9 May, 2016; v1 submitted 8 June, 2015;
originally announced June 2015.
-
Image Classification and Retrieval from User-Supplied Tags
Authors:
Hamid Izadinia,
Ali Farhadi,
Aaron Hertzmann,
Matthew D. Hoffman
Abstract:
This paper proposes direct learning of image classification from user-supplied tags, without filtering. Each tag is supplied by the user who shared the image online. Enormous numbers of these tags are freely available online, and they give insight about the image categories important to users and to image classification. Our approach is complementary to the conventional approach of manual annotation, which is extremely costly. We analyze the Flickr 100 Million Image dataset, making several useful observations about the statistics of these tags. We introduce a large-scale robust classification algorithm, in order to handle the inherent noise in these tags, and a calibration procedure to better predict objective annotations. We show that freely available, user-supplied tags can obtain similar or superior results to large databases of costly manual annotations.
Submitted 25 November, 2014;
originally announced November 2014.
-
Abnormal Object Recognition: A Comprehensive Study
Authors:
Babak Saleh,
Ali Farhadi,
Ahmed Elgammal
Abstract:
When describing images, humans tend not to talk about the obvious, but rather mention what they find interesting. We argue that abnormalities and deviations from typicalities are among the most important components that form what is worth mentioning. In this paper we introduce the abnormality detection as a recognition problem and show how to model typicalities and, consequently, meaningful deviations from prototypical properties of categories. Our model can recognize abnormalities and report the main reasons of any recognized abnormality. We introduce the abnormality detection dataset and show interesting results on how to reason about abnormalities.
Submitted 9 November, 2014;
originally announced November 2014.
-
Semantic Understanding of Professional Soccer Commentaries
Authors:
Hannaneh Hajishirzi,
Mohammad Rastegari,
Ali Farhadi,
Jessica K. Hodgins
Abstract:
This paper presents a novel approach to the problem of semantic parsing via learning the correspondences between complex sentences and rich sets of events. Our main intuition is that correct correspondences tend to occur more frequently. Our model benefits from a discriminative notion of similarity to learn the correspondence between sentence and an event and a ranking machinery that scores the popularity of each correspondence. Our method can discover a group of events (called macro-events) that best describes a sentence. We evaluate our method on our novel dataset of professional soccer commentaries. The empirical results show that our method significantly outperforms the state-of-the-art.
Submitted 16 October, 2012;
originally announced October 2012.