Showing 1–15 of 15 results for author: Selvaraju, R R

  1. arXiv:2208.01813  [pdf, other]

    cs.CV

    TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation

    Authors: Jun Wang, Mingfei Gao, Yuqian Hu, Ramprasaath R. Selvaraju, Chetan Ramaiah, Ran Xu, Joseph F. JaJa, Larry S. Davis

    Abstract: Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that, in general, the scene text is not fully exploited in the existing datasets -- only a small portion of the text in each image participates…

    Submitted 7 October, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

    Comments: BMVC 2022

  2. arXiv:2204.11122  [pdf, other]

    cs.CV cs.LG

    Can domain adaptation make object recognition work for everyone?

    Authors: Viraj Prabhu, Ramprasaath R. Selvaraju, Judy Hoffman, Nikhil Naik

    Abstract: Despite the rapid progress in deep visual recognition, modern computer vision datasets significantly overrepresent the developed world and models trained on such datasets underperform on images from unseen geographies. We investigate the effectiveness of unsupervised domain adaptation (UDA) of such models across geographies at closing this performance gap. To do so, we first curate two shifts from…

    Submitted 23 April, 2022; originally announced April 2022.

    Comments: Published at the L3D-IVU workshop at CVPR 2022

  3. arXiv:2112.07133  [pdf, other]

    cs.CV

    CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

    Authors: Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez

    Abstract: We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information…

    Submitted 11 May, 2023; v1 submitted 13 December, 2021; originally announced December 2021.

  4. arXiv:2112.00804  [pdf, other]

    cs.CV

    PreViTS: Contrastive Pretraining with Video Tracking Supervision

    Authors: Brian Chen, Ramprasaath R. Selvaraju, Shih-Fu Chang, Juan Carlos Niebles, Nikhil Naik

    Abstract: Videos are a rich source for self-supervised learning (SSL) of visual representations due to the presence of natural temporal transformations of objects. However, current methods typically randomly sample video clips for learning, which results in an imperfect supervisory signal. In this work, we propose PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips co…

    Submitted 27 September, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: To be presented at WACV 2023

  5. arXiv:2107.07651  [pdf, other]

    cs.CV cs.AI

    Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

    Authors: Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi

    Abstract: Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interacti…

    Submitted 7 October, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

  6. arXiv:2012.04630  [pdf, other]

    cs.CV cs.AI cs.LG

    CASTing Your Model: Learning to Localize Improves Self-Supervised Representations

    Authors: Ramprasaath R. Selvaraju, Karan Desai, Justin Johnson, Nikhil Naik

    Abstract: Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success these methods have been primarily applied to unlabeled ImageNet images, and show marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images, and struggle on complex scene images with many…

    Submitted 8 December, 2020; originally announced December 2020.

  7. arXiv:2010.10038  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency

    Authors: Sameer Dharur, Purva Tendulkar, Dhruv Batra, Devi Parikh, Ramprasaath R. Selvaraju

    Abstract: Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower level visual concepts in the image that models ideally should understand to be able to answer the…

    Submitted 30 November, 2020; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted to the NeurIPS 2020 workshop on Interpretable Inductive Biases and Physically Structured Learning

  8. arXiv:2001.06927  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

    Authors: Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar

    Abstract: Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks - tasks that can only be answered through a synthesis of perception and knowledge about the…

    Submitted 16 June, 2020; v1 submitted 19 January, 2020; originally announced January 2020.

    Comments: Accepted to CVPR'20 as an Oral Presentation

  9. arXiv:1903.07820  [pdf, other]

    cs.CV

    Trick or TReAT: Thematic Reinforcement for Artistic Typography

    Authors: Purva Tendulkar, Kalpesh Krishna, Ramprasaath R. Selvaraju, Devi Parikh

    Abstract: An approach to make text visually appealing and memorable is semantic reinforcement - the use of visual cues alluding to the context or theme in which the word is being used to reinforce the message (e.g., Google Doodles). We present a computational approach for semantic reinforcement called TReAT - Thematic Reinforcement for Artistic Typography. Given an input word (e.g. exam) and a theme (e.g. e…

    Submitted 19 March, 2019; originally announced March 2019.

    Comments: 9 pages

  10. arXiv:1902.03751  [pdf, other]

    cs.CV

    Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

    Authors: Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

    Abstract: Many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding. HINT encourages deep networks to be sen…

    Submitted 28 October, 2019; v1 submitted 11 February, 2019; originally announced February 2019.

    Comments: Published at ICCV'2019

    Journal ref: The IEEE International Conference on Computer Vision (ICCV) 2019

  11. arXiv:1808.02861  [pdf, other]

    cs.CV

    Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance

    Authors: Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee

    Abstract: Individual neurons in convolutional neural networks supervised for image-level classification tasks have been shown to implicitly learn semantically meaningful concepts ranging from simple textures and shapes to whole or partial objects - forming a "dictionary" of concepts acquired through the learning process. In this work we introduce a simple, efficient zero-shot learning approach based on this…

    Submitted 8 August, 2018; originally announced August 2018.

    Comments: In Proceedings of ECCV 2018

  12. arXiv:1611.07450  [pdf, other]

    stat.ML cs.CV cs.LG

    Grad-CAM: Why did you say that?

    Authors: Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra

    Abstract: We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize important regions. These localizations are combined with existing pixel-space v…

    Submitted 25 January, 2017; v1 submitted 22 November, 2016; originally announced November 2016.

    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems. This is an extended abstract version of arXiv:1610.02391 (CVPR format)

  13. arXiv:1610.02424  [pdf, other]

    cs.AI cs.CL cs.CV

    Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

    Authors: Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra

    Abstract: Neural sequence models are widely used to model time-series data. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion retaining only the top-B candidates - resulting in sequences that differ only slightly from each other. Producing lists of nearly identica…

    Submitted 22 October, 2018; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: 16 pages; accepted at AAAI 2018

  14. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

    Authors: Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra

    Abstract: We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach - Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept, flowing into the final convolutional layer to produce a coarse localization map highlighting important regions in the image for predicting the co…

    Submitted 2 December, 2019; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: This version was published in International Journal of Computer Vision (IJCV) in 2019; A previous version of the paper was published at International Conference on Computer Vision (ICCV'17)

  15. arXiv:1604.03505  [pdf, other]

    cs.CV

    Counting Everyday Objects in Everyday Scenes

    Authors: Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R. Selvaraju, Dhruv Batra, Devi Parikh

    Abstract: We are interested in counting the number of instances of object classes in natural, everyday images. Previous counting approaches tackle the problem in restricted domains such as counting pedestrians in surveillance videos. Counts can also be estimated from outputs of other vision tasks like object detection. In this work, we build dedicated models for counting designed to tackle the large varianc…

    Submitted 8 May, 2017; v1 submitted 12 April, 2016; originally announced April 2016.