Showing 101–150 of 174 results for author: Batra, D

  1. arXiv:1904.03461  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

    Authors: Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra

    Abstract: To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings.… ▽ More

    Submitted 6 April, 2019; originally announced April 2019.

  2. arXiv:1904.01201  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    Habitat: A Platform for Embodied AI Research

    Authors: Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra

    Abstract: We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast -- when rendering a scen… ▽ More

    Submitted 24 November, 2019; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: ICCV 2019

  3. arXiv:1903.03166  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

    Authors: Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

    Abstract: Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the 'state' of all images and dialogs. We deve… ▽ More

    Submitted 18 September, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

    Comments: 13 pages, 11 figures, 3 tables, accepted as a short paper at NAACL 2019

  4. arXiv:1903.01599  [pdf, other]

    stat.ML cs.LG

    Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

    Authors: Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, Dhruv Batra

    Abstract: In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates h… ▽ More

    Submitted 16 March, 2019; v1 submitted 4 March, 2019; originally announced March 2019.

    Comments: To appear at ICLR 2019

  5. arXiv:1902.07864  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering

    Authors: Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, Devi Parikh

    Abstract: We propose a new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. Firstly, the programs generated by our model are more understandable while requiring lesser num… ▽ More

    Submitted 27 June, 2019; v1 submitted 20 February, 2019; originally announced February 2019.

    Comments: ICML 2019 Camera Ready + Appendix

  6. arXiv:1902.03751  [pdf, other]

    cs.CV

    Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

    Authors: Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

    Abstract: Many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sen… ▽ More

    Submitted 28 October, 2019; v1 submitted 11 February, 2019; originally announced February 2019.

    Comments: Published at ICCV'2019

    Journal ref: The IEEE International Conference on Computer Vision (ICCV) 2019

  7. arXiv:1902.03570  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    EvalAI: Towards Better Evaluation Systems for AI Agents

    Authors: Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, Dhruv Batra

    Abstract: We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence algorithms (AI) at scale. EvalAI is built to provide a scalable solution to the research community to fulfill the critical need of evaluating machine learning models and agents acting in an environment against annotations or with a human-in-the-loop. This will help researcher… ▽ More

    Submitted 10 February, 2019; originally announced February 2019.

  8. arXiv:1902.01385  [pdf, other]

    cs.LG cs.AI cs.CL cs.RO stat.ML

    Embodied Multimodal Multitask Learning

    Authors: Devendra Singh Chaplot, Lisa Lee, Ruslan Salakhutdinov, Devi Parikh, Dhruv Batra

    Abstract: Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for different multimodal tasks, such as semantic goal navigation and embodied question answering. In this paper, we propose a multitask model capable of jointly learning these multimodal tasks, and transferring knowledge of words and their groundin… ▽ More

    Submitted 4 February, 2019; originally announced February 2019.

    Comments: See https://devendrachaplot.github.io/projects/EMML for demo videos

  9. arXiv:1901.09107  [pdf, other]

    cs.CV

    Audio-Visual Scene-Aware Dialog

    Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh

    Abstract: We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audi… ▽ More

    Submitted 8 May, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

  10. arXiv:1901.05531  [pdf, other]

    cs.CV cs.CL cs.LG

    Response to "Visual Dialogue without Vision or Dialogue" (Massiceti et al., 2018)

    Authors: Abhishek Das, Devi Parikh, Dhruv Batra

    Abstract: In a recent workshop paper, Massiceti et al. presented a baseline model and subsequent critique of Visual Dialog (Das et al., CVPR 2017) that raises what we believe to be unfounded concerns about the dataset and evaluation. This article intends to rebut the critique and clarify potential confusions for practitioners and future participants in the Visual Dialog challenge.

    Submitted 16 January, 2019; originally announced January 2019.

  11. arXiv:1901.03461  [pdf, ps, other]

    cs.CL

    Dialog System Technology Challenge 7

    Authors: Koichiro Yoshino, Chiori Hori, Julien Perez, Luis Fernando D'Haro, Lazaros Polymenakos, Chulaka Gunasekara, Walter S. Lasecki, Jonathan K. Kummerfeld, Michel Galley, Chris Brockett, Jianfeng Gao, Bill Dolan, Xiang Gao, Huda Alamari, Tim K. Marks, Devi Parikh, Dhruv Batra

    Abstract: This paper introduces the Seventh Dialog System Technology Challenges (DSTC), which use shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to various dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation and (… ▽ More

    Submitted 10 January, 2019; originally announced January 2019.

    Comments: This paper is presented at NIPS2018 2nd Conversational AI workshop

  12. arXiv:1812.08658  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    nocaps: novel object captioning at scale

    Authors: Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson

    Abstract: Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from… ▽ More

    Submitted 30 September, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

    Journal ref: IEEE International Conference on Computer Vision (ICCV) 2019

  13. arXiv:1810.11649  [pdf, other]

    cs.LG cs.AI cs.CV

    Fabrik: An Online Collaborative Neural Network Editor

    Authors: Utsav Garg, Viraj Prabhu, Deshraj Yadav, Ram Ramrakhya, Harsh Agrawal, Dhruv Batra

    Abstract: We present Fabrik, an online neural network editor that provides tools to visualize, edit, and share neural networks from within a browser. Fabrik provides a simple and intuitive GUI to import neural networks written in popular deep learning frameworks such as Caffe, Keras, and TensorFlow, and allows users to interact with, build, and edit models via simple drag and drop. Fabrik is designed to be… ▽ More

    Submitted 27 October, 2018; originally announced October 2018.

  14. arXiv:1810.11187  [pdf, other]

    cs.LG cs.AI cs.MA stat.ML

    TarMAC: Targeted Multi-Agent Communication

    Authors: Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rabbat, Joelle Pineau

    Abstract: We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments. This targeting behavior is learnt solely from downstream task-specific reward without any communication supervision. We additionally augment this with a multi-round… ▽ More

    Submitted 21 February, 2020; v1 submitted 26 October, 2018; originally announced October 2018.

    Comments: ICML 2019

  15. arXiv:1810.11181  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    Neural Modular Control for Embodied Question Answering

    Authors: Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: We present a modular approach for learning policies for navigation over long planning horizons from language input. Our hierarchical policy operates at multiple timescales, where the higher-level master policy proposes subgoals to be executed by specialized sub-policies. Our choice of subgoals is compositional and semantic, i.e. they can be sequentially combined in arbitrary orderings, and assume… ▽ More

    Submitted 2 May, 2019; v1 submitted 25 October, 2018; originally announced October 2018.

    Comments: 10 pages, 3 figures, 2 tables. Published at CoRL 2018. Webpage: https://embodiedqa.org/

  16. arXiv:1810.00912  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

    Authors: Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

    Abstract: In an open-world setting, it is inevitable that an intelligent agent (e.g., a robot) will encounter visual objects, attributes or relationships it does not recognize. In this work, we develop an agent empowered with visual curiosity, i.e. the ability to ask questions to an Oracle (e.g., human) about the contents in images (e.g., What is the object on the left side of the red cube?) and build visua… ▽ More

    Submitted 1 October, 2018; originally announced October 2018.

    Comments: 18 pages, 10 figures, Oral Presentation in Conference on Robot Learning (CoRL) 2018

  17. arXiv:1809.01816  [pdf, other]

    cs.CV cs.AI cs.CL

    Visual Coreference Resolution in Visual Dialog using Neural Module Networks

    Authors: Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

    Abstract: Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns… ▽ More

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: ECCV 2018 + results on VisDial v1.0 dataset

  18. arXiv:1808.02861  [pdf, other]

    cs.CV

    Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance

    Authors: Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee

    Abstract: Individual neurons in convolutional neural networks supervised for image-level classification tasks have been shown to implicitly learn semantically meaningful concepts ranging from simple textures and shapes to whole or partial objects - forming a "dictionary" of concepts acquired through the learning process. In this work we introduce a simple, efficient zero-shot learning approach based on this… ▽ More

    Submitted 8 August, 2018; originally announced August 2018.

    Comments: In Proceedings of ECCV 2018

  19. arXiv:1808.00191  [pdf, other]

    cs.CV cs.LG

    Graph R-CNN for Scene Graph Generation

    Authors: Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

    Abstract: We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captu… ▽ More

    Submitted 1 August, 2018; originally announced August 2018.

    Comments: 16 pages, ECCV 2018 camera ready

  20. arXiv:1807.09956  [pdf, other]

    cs.CV

    Pythia v0.1: the Winning Entry to the VQA Challenge 2018

    Authors: Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh

    Abstract: This document describes Pythia v0.1, the winning entry from Facebook AI Research (FAIR)'s A-STAR team to the VQA Challenge 2018. Our starting point is a modular re-implementation of the bottom-up top-down (up-down) model. We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, w… ▽ More

    Submitted 27 July, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

  21. arXiv:1807.03367  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    Talk the Walk: Navigating New York City through Grounded Dialogue

    Authors: Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, Douwe Kiela

    Abstract: We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a "guide" and a "tourist") that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open proble… ▽ More

    Submitted 23 December, 2018; v1 submitted 9 July, 2018; originally announced July 2018.

  22. arXiv:1806.08409  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

    Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh

    Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog dat… ▽ More

    Submitted 29 June, 2018; v1 submitted 21 June, 2018; originally announced June 2018.

    Comments: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7

  23. arXiv:1806.02934  [pdf, other]

    stat.ML cs.CL cs.CV cs.LG

    Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations

    Authors: Ashwin Kalyan, Stefan Lee, Anitha Kannan, Dhruv Batra

    Abstract: Many structured prediction problems (particularly in vision and language domains) are ambiguous, with multiple outputs being correct for an input - e.g. there are many ways of describing an image, multiple ways of translating a sentence; however, exhaustively annotating the applicability of all possible outputs is intractable due to exponentially large output spaces (e.g. all English sentences). I… ▽ More

    Submitted 7 June, 2018; originally announced June 2018.

    Comments: To be presented at ICML 2018; 10 pages 5 figures

  24. arXiv:1806.00525  [pdf, other]

    cs.CL cs.CV

    Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

    Authors: Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K. Marks, Chiori Hori

    Abstract: Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, wh… ▽ More

    Submitted 1 June, 2018; originally announced June 2018.

  25. arXiv:1804.01186  [pdf, ps, other]

    cs.AI cs.LG cs.PL

    Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples

    Authors: Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, Sumit Gulwani

    Abstract: Synthesizing user-intended programs from a small number of input-output examples is a challenging problem with several important applications like spreadsheet manipulation, data wrangling and code refactoring. Existing synthesis systems either completely rely on deductive logic techniques that are extensively hand-engineered or on purely statistical models that need massive amounts of data, and in… ▽ More

    Submitted 9 September, 2018; v1 submitted 3 April, 2018; originally announced April 2018.

    Comments: Published in ICLR 2018, International Conference on Learning Representations (2018)

  26. arXiv:1803.09845  [pdf, other]

    cs.CV cs.CL

    Neural Baby Talk

    Authors: Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

    Abstract: We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate). Our approach first generates a sentenc… ▽ More

    Submitted 26 March, 2018; originally announced March 2018.

    Comments: 12 pages, 7 figures, CVPR 2018

  27. arXiv:1712.05558  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

    Authors: Jin-Hwa Kim, Nikita Kitaev, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, Devi Parikh

    Abstract: In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pi… ▽ More

    Submitted 4 June, 2019; v1 submitted 15 December, 2017; originally announced December 2017.

    Comments: ACL 2019

  28. arXiv:1712.00377  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

    Authors: Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi

    Abstract: A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we pr… ▽ More

    Submitted 3 June, 2018; v1 submitted 1 December, 2017; originally announced December 2017.

    Comments: 15 pages, 10 figures. To appear in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  29. arXiv:1711.11543  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Embodied Question Answering

    Authors: Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ("orange"). This challenging… ▽ More

    Submitted 1 December, 2017; v1 submitted 30 November, 2017; originally announced November 2017.

    Comments: 20 pages, 13 figures, Webpage: https://embodiedqa.org/

  30. arXiv:1708.05122  [pdf, other]

    cs.HC cs.AI cs.CL cs.CV

    Evaluating Visual Conversational Agents via Cooperative Human-AI Games

    Authors: Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, Devi Parikh

    Abstract: As AI continues to advance, human-AI teams are inevitable. However, progress in AI is routinely measured in isolation, without a human in the loop. It is crucial to benchmark progress in AI, not just in isolation, but also in terms of how it translates to helping humans perform certain tasks, i.e., the performance of human-AI teams. In this work, we design a cooperative game - GuessWhich - to me… ▽ More

    Submitted 16 August, 2017; originally announced August 2017.

    Comments: HCOMP 2017

  31. arXiv:1706.08502  [pdf, other]

    cs.CL cs.AI cs.CV

    Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog

    Authors: Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

    Abstract: A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, all learned without any human supervision! In this paper, using a Task and Tell reference game between two agents as a testbed,… ▽ More

    Submitted 20 August, 2017; v1 submitted 26 June, 2017; originally announced June 2017.

    Comments: 9 pages, 7 figures, 2 tables, accepted at EMNLP 2017 as short paper

  32. arXiv:1706.05125  [pdf, ps, other]

    cs.AI cs.CL

    Deal or No Deal? End-to-End Learning for Negotiation Dialogues

    Authors: Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, Dhruv Batra

    Abstract: Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions. Negotiations require complex communication and reasoning skills, but success is easy to measure, making this an interesting task for AI. We gather a large dataset of human-human negotiations on a multi-issue bargaining task, where agents who cannot observe each other'… ▽ More

    Submitted 15 June, 2017; originally announced June 2017.

  33. arXiv:1706.01554  [pdf, other]

    cs.CV cs.AI cs.CL

    Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

    Authors: Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

    Abstract: We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the human responses. Across a variety of domains, a recurring problem with MLE trained generative neural dialog models (G) is that they tend to produce 'safe' and gen… ▽ More

    Submitted 27 October, 2017; v1 submitted 5 June, 2017; originally announced June 2017.

    Comments: 11 pages, 3 figures

  34. arXiv:1705.08759  [pdf, other]

    cs.CV

    Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning

    Authors: Qing Sun, Stefan Lee, Dhruv Batra

    Abstract: We develop the first approximate inference algorithm for 1-Best (and M-Best) decoding in bidirectional neural sequence models by extending Beam Search (BS) to reason about both forward and backward time dependencies. Beam Search (BS) is a widely used approximate inference algorithm for decoding sequences from unidirectional neural sequence models. Interestingly, approximate inference in bidirectio… ▽ More

    Submitted 24 May, 2017; originally announced May 2017.

  35. arXiv:1705.06476  [pdf, other]

    cs.CL

    ParlAI: A Dialog Research Software Platform

    Authors: Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston

    Abstract: We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai. Its goal is to provide a unified framework for sharing, training and testing of dialog models, integration of Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning; and a repository of machine learning models… ▽ More

    Submitted 8 March, 2018; v1 submitted 18 May, 2017; originally announced May 2017.

  36. Properties of rapidly rotating hot neutron stars with antikaon condensates at constant entropy per baryon

    Authors: Neelam Dhanda Batra, Krishna Prakash Nunna, Sarmistha Banik

    Abstract: We consider a neutrino-free hot neutron star that contains antikaon condensates in its core and is at finite entropy per baryon. We find the equation of state for a range of entropies and antikaon optical potentials and generate the mass profile of static as well as rotating stars. Rotation induces many changes in the stellar equilibrium, and hence its structural properties evolve. In this work, w… ▽ More

    Submitted 27 September, 2018; v1 submitted 18 May, 2017; originally announced May 2017.

    Journal ref: Phys. Rev. C 98, 035801 (2018)

  37. arXiv:1705.00601  [pdf, other]

    cs.CV cs.CL

    The Promise of Premise: Harnessing Question Premises in Visual Question Answering

    Authors: Aroma Mahendru, Viraj Prabhu, Akrit Mohapatra, Dhruv Batra, Stefan Lee

    Abstract: In this paper, we make a simple observation that questions about images often contain premises - objects and relationships implied by the question - and that reasoning about premises can help Visual Question Answering (VQA) models respond more intelligently to irrelevant or previously unseen questions. When presented with a question that is irrelevant to an image, state-of-the-art VQA models will… ▽ More

    Submitted 17 August, 2017; v1 submitted 1 May, 2017; originally announced May 2017.

    Comments: Published at EMNLP 2017

  38. arXiv:1704.08243  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

    Authors: Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, Devi Parikh

    Abstract: Visual Question Answering (VQA) has received a lot of attention over the past couple of years. A number of deep learning models have been proposed for this task. However, it has been shown that these models are heavily driven by superficial correlations in the training data and lack compositionality -- the ability to answer questions about unseen compositions of seen concepts. This compositionalit… ▽ More

    Submitted 26 April, 2017; originally announced April 2017.

  39. arXiv:1703.06585  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

    Authors: Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

    Abstract: We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixel… ▽ More

    Submitted 21 March, 2017; v1 submitted 19 March, 2017; originally announced March 2017.

    Comments: 11 pages, 4 figures, 2 tables, webpage: http://visualdialog.org/

  40. arXiv:1703.01560  [pdf, other]

    cs.CV cs.LG

    LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation

    Authors: Jianwei Yang, Anitha Kannan, Dhruv Batra, Devi Parikh

    Abstract: We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground,… ▽ More

    Submitted 1 August, 2017; v1 submitted 5 March, 2017; originally announced March 2017.

    Comments: 21 pages, 22 figures, published as a conference paper at ICLR 2017, code available on GitHub

  41. arXiv:1612.00837  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Authors: Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

    Abstract: Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capabili…

    Submitted 15 May, 2017; v1 submitted 2 December, 2016; originally announced December 2016.

  42. arXiv:1611.08669  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Visual Dialog

    Authors: Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra

    Abstract: We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a…

    Submitted 1 August, 2017; v1 submitted 26 November, 2016; originally announced November 2016.

    Comments: 23 pages, 18 figures, CVPR 2017 camera-ready, results on VisDial v0.9 dataset, Webpage: http://visualdialog.org

  43. arXiv:1611.07450  [pdf, other]

    stat.ML cs.CV cs.LG

    Grad-CAM: Why did you say that?

    Authors: Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra

    Abstract: We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize important regions. These localizations are combined with existing pixel-space v…

    Submitted 25 January, 2017; v1 submitted 22 November, 2016; originally announced November 2016.

    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems. This is an extended abstract version of arXiv:1610.02391 (CVPR format)

  44. arXiv:1610.02424  [pdf, other]

    cs.AI cs.CL cs.CV

    Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

    Authors: Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra

    Abstract: Neural sequence models are widely used to model time-series data. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion retaining only the top-B candidates - resulting in sequences that differ only slightly from each other. Producing lists of nearly identica…

    Submitted 22 October, 2018; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: 16 pages; accepted at AAAI 2018

  45. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

    Authors: Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra

    Abstract: We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach - Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept, flowing into the final convolutional layer to produce a coarse localization map highlighting important regions in the image for predicting the co…

    Submitted 2 December, 2019; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: This version was published in International Journal of Computer Vision (IJCV) in 2019; A previous version of the paper was published at International Conference on Computer Vision (ICCV'17)

  46. arXiv:1608.08974  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Towards Transparent AI Systems: Interpreting Visual Question Answering Models

    Authors: Yash Goyal, Akrit Mohapatra, Devi Parikh, Dhruv Batra

    Abstract: Deep neural networks have shown striking progress and obtained state-of-the-art results in many AI research fields in the recent years. However, it is often unsatisfying to not know why they predict what they do. In this paper, we address the problem of interpreting Visual Question Answering (VQA) models. Specifically, we are interested in finding what part of the input (pixels in images or words…

    Submitted 9 September, 2016; v1 submitted 31 August, 2016; originally announced August 2016.

  47. arXiv:1608.08716  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    Measuring Machine Intelligence Through Visual Question Answering

    Authors: C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, Devi Parikh

    Abstract: As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks for which a human excels, but one which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its li…

    Submitted 30 August, 2016; originally announced August 2016.

    Comments: AI Magazine, 2016

  48. arXiv:1606.07839  [pdf, other]

    cs.CV cs.CL

    Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles

    Authors: Stefan Lee, Senthil Purushwalkam, Michael Cogswell, Viresh Ranjan, David Crandall, Dhruv Batra

    Abstract: Many practical perception systems exist within larger processes that include interactions with users or additional components capable of evaluating the quality of predicted solutions. In these contexts, it is beneficial to provide these oracle mechanisms with multiple highly likely hypotheses rather than a single prediction. In this work, we pose the task of producing multiple outputs as a learnin…

    Submitted 5 October, 2016; v1 submitted 24 June, 2016; originally announced June 2016.

  49. arXiv:1606.07493  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Sort Story: Sorting Jumbled Images and Captions into Stories

    Authors: Harsh Agrawal, Arjun Chandrasekaran, Dhruv Batra, Devi Parikh, Mohit Bansal

    Abstract: Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing -- given a jumbled set of aligned image-caption pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, a…

    Submitted 7 November, 2016; v1 submitted 23 June, 2016; originally announced June 2016.

    Comments: EMNLP 2016

  50. arXiv:1606.07356  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Analyzing the Behavior of Visual Question Answering Models

    Authors: Aishwarya Agrawal, Dhruv Batra, Devi Parikh

    Abstract: Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA). The performance of most models is clustered around 60-70%. In this paper we propose systematic methods to analyze the behavior of these models as a first step towards recognizing their strengths and weaknesses, and identifying the most fruitful directions for progress. We analyze tw…

    Submitted 27 September, 2016; v1 submitted 23 June, 2016; originally announced June 2016.

    Comments: 13 pages, 20 figures; To appear in EMNLP 2016