Showing 1–6 of 6 results for author: Jasani, B

Searching in archive cs.
  1. arXiv:2403.16385

    cs.CV cs.CL

    Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

    Authors: Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar

    Abstract: Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work, we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs), which have shown to have s…

    Submitted 28 March, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  2. arXiv:2211.07912

    cs.CV

    YORO -- Lightweight End to End Visual Grounding

    Authors: Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos

    Abstract: We present YORO, a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: Accepted to ECCVW on International Challenge on Compositional and Multimodal Perception

  3. arXiv:2106.11539

    cs.CV

    DocFormer: End-to-End Transformer for Document Understanding

    Authors: Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

    Abstract: We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses te…

    Submitted 20 September, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted to ICCV 2021 main conference

  4. arXiv:1911.11344

    cs.CV

    Skeleton based Zero Shot Action Recognition in Joint Pose-Language Semantic Space

    Authors: Bhavan Jasani, Afshaan Mazagonwalla

    Abstract: How does one represent an action? How does one describe an action that one has never seen before? Such questions are addressed by the Zero Shot Learning paradigm, where a model is trained on only a subset of classes and is evaluated on its ability to correctly classify an example from a class it has never seen before. In this work, we present a body pose based zero shot action recognition network…

    Submitted 26 November, 2019; originally announced November 2019.

  5. arXiv:1911.03083

    cs.CV cs.CL

    Are we asking the right questions in MovieQA?

    Authors: Bhavan Jasani, Rohit Girdhar, Deva Ramanan

    Abstract: Joint vision and language tasks like visual question answering are fascinating because they explore high-level understanding, but at the same time, can be more prone to language biases. In this paper, we explore the biases in the MovieQA dataset and propose a strikingly simple model which can exploit them. We find that using the right word embedding is of utmost importance. By using an appropriate…

    Submitted 8 November, 2019; originally announced November 2019.

    Comments: Spotlight presentation at CLVL workshop, ICCV 2019. Project page: https://bhavanj.github.io/MovieQAWithoutMovies/

  6. arXiv:1805.07641

    cs.CV cs.LG

    Learning Sampling Policies for Domain Adaptation

    Authors: Yash Patel, Kashyap Chitta, Bhavan Jasani

    Abstract: We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that lea…

    Submitted 19 May, 2018; originally announced May 2018.