Showing 1–42 of 42 results for author: Divakaran, A

Searching in archive cs.
  1. arXiv:2407.02352

    cs.CL

    Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification

    Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran

    Abstract: Large Visual Language Models (LVLMs) struggle with hallucinations in visual instruction following task(s), limiting their trustworthiness and real-world applicability. We propose Pelican, a novel framework designed to detect and mitigate hallucinations through claim verification. Pelican first decomposes the visual claim into a chain of sub-claims based on first-order predicates. These sub-claim…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2406.17963

    cs.LG cs.HC cs.SI

    Empowering Interdisciplinary Insights with Dynamic Graph Embedding Trajectories

    Authors: Yiqiao **, Andrew Zhao, Yeon-Chang Lee, Meng Ye, Ajay Divakaran, Srijan Kumar

    Abstract: We developed DyGETViz, a novel framework for effectively visualizing dynamic graphs (DGs) that are ubiquitous across diverse real-world systems. This framework leverages recent advancements in discrete-time dynamic graph (DTDG) models to adeptly handle the temporal dynamics inherent in dynamic graphs. DyGETViz effectively captures both micro- and macro-level structural shifts within these graphs,…

    Submitted 28 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 27 pages, 11 figures

  3. arXiv:2312.12716

    cs.CV cs.CL cs.LG

    BloomVQA: Assessing Hierarchical Multi-modal Comprehension

    Authors: Yunye Gong, Robik Shrestha, Jared Claypoole, Michael Cogswell, Arijit Ray, Christopher Kanan, Ajay Divakaran

    Abstract: We propose a novel VQA dataset, BloomVQA, to facilitate comprehensive evaluation of large vision-language models on comprehension tasks. Unlike current benchmarks that often focus on fact-based memorization and simple reasoning tasks without theoretical grounding, we collect multiple-choice samples based on picture stories that reflect different levels of comprehension, as laid out in Bloom's Taxo…

    Submitted 10 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted by ACL Findings (2024). Dataset available at https://huggingface.co/datasets/ygong/BloomVQA

  4. arXiv:2312.00115

    cs.CV cs.CL

    A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

    Authors: Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran

    Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of…

    Submitted 30 November, 2023; originally announced December 2023.

    Comments: 13 pages, 15 tables, 5 figures

  5. arXiv:2311.10081

    cs.CV cs.CL cs.LG

    DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

    Authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

    Abstract: We present DRESS, a large vision language model (LVLM) that innovatively exploits natural language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feed…

    Submitted 19 March, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: CVPR 2024. The feedback datasets are released at: https://huggingface.co/datasets/YangyiYY/LVLM_NLF

  6. arXiv:2310.10707

    cs.CL cs.AI

    Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning

    Authors: Anirudh Som, Karan Sikka, Helen Gent, Ajay Divakaran, Andreas Kathol, Dimitra Vergyri

    Abstract: Paraphrasing of offensive content is a better alternative to content removal and helps improve civility in a communication environment. Supervised paraphrasers, however, rely heavily on large quantities of labelled data to help preserve meaning and intent. They also often retain a large portion of the offensiveness of the original content, which raises questions on their overall usability. In this…

    Submitted 9 June, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: Accepted in Association for Computational Linguistics (ACL) 2024 Findings

  7. arXiv:2309.12510

    cs.LG

    Confidence Calibration for Systems with Cascaded Predictive Modules

    Authors: Yunye Gong, Yi Yao, Xiao Lin, Ajay Divakaran, Melinda Gervasio

    Abstract: Existing conformal prediction algorithms estimate prediction intervals at target confidence levels to characterize the performance of a regression model on new test samples. However, considering an autonomous system consisting of multiple modules, prediction intervals constructed for individual modules fall short of accommodating uncertainty propagation over different modules and thus cannot provi…

    Submitted 21 September, 2023; originally announced September 2023.

  8. arXiv:2309.04461

    cs.CL cs.CV cs.LG

    Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

    Authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

    Abstract: Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are…

    Submitted 19 March, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: NAACL 2024 Main Conference. The data is released at https://github.com/Yangyi-Chen/CoTConsistency

  9. arXiv:2308.03906

    cs.CV

    TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models

    Authors: Indranil Sur, Karan Sikka, Matthew Walmer, Kaushik Koneripalli, Anirban Roy, Xiao Lin, Ajay Divakaran, Susmit Jha

    Abstract: We present a Multimodal Backdoor Defense technique TIJO (Trigger Inversion using Joint Optimization). Recent work arXiv:2112.07668 has demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. Their dual-key backdoor trigger is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: Published as a conference paper at ICCV 2023. 13 pages, 6 figures, 7 tables

  10. Predicting Information Pathways Across Online Communities

    Authors: Yiqiao **, Yeon-Chang Lee, Kartik Sharma, Meng Ye, Karan Sikka, Ajay Divakaran, Srijan Kumar

    Abstract: The problem of community-level information pathway prediction (CLIPP) aims at predicting the transmission trajectory of content across online communities. A successful solution to CLIPP holds significance as it facilitates the distribution of valuable information to a larger audience and prevents the proliferation of misinformation. Notably, solving CLIPP is non-trivial as inter-community relation…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'23)

    ACM Class: J.4

  11. arXiv:2304.03659

    cs.CV

    Probing Conceptual Understanding of Large Visual-Language Models

    Authors: Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat

    Abstract: In recent years, large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work, we focus on conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content und…

    Submitted 26 April, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

    Comments: All code and data are available at: https://tinyurl.com/vlm-robustness. Accepted in CVPRW 2024

  12. arXiv:2302.09618

    cs.CL

    Multilingual Content Moderation: A Case Study on Reddit

    Authors: Meng Ye, Karan Sikka, Katherine Atwell, Sabit Hassan, Ajay Divakaran, Malihe Alikhani

    Abstract: Content moderation is the process of flagging content based on pre-defined platform rules. There has been a growing need for AI moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. While prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderat…

    Submitted 19 February, 2023; originally announced February 2023.

  13. System Design for an Integrated Lifelong Reinforcement Learning Agent for Real-Time Strategy Games

    Authors: Indranil Sur, Zachary Daniels, Abrar Rahman, Kamil Faber, Gianmarco J. Gallardo, Tyler L. Hayes, Cameron E. Taylor, Mustafa Burak Gurbuz, James Smith, Sahana Joshi, Nathalie Japkowicz, Michael Baron, Zsolt Kira, Christopher Kanan, Roberto Corizzo, Ajay Divakaran, Michael Piacentino, Jesse Hostetler, Aswin Raghavan

    Abstract: As Artificial and Robotic Systems are increasingly deployed and relied upon for real-world applications, it is important that they exhibit the ability to continually learn and adapt in dynamically-changing environments, becoming Lifelong Learning Machines. Continual/lifelong learning (LL) involves minimizing catastrophic forgetting of old tasks while maximizing a model's capability to learn new ta…

    Submitted 8 December, 2022; originally announced December 2022.

    Comments: The Second International Conference on AIML Systems, October 12–15, 2022, Bangalore, India

  14. arXiv:2209.15093

    cs.CL

    Unpacking Large Language Models with Conceptual Consistency

    Authors: Pritish Sahu, Michael Cogswell, Yunye Gong, Ajay Divakaran

    Abstract: If a Large Language Model (LLM) answers "yes" to the question "Are mountains tall?" then does it know what a mountain is? Can you rely on it responding correctly or incorrectly to other questions about mountains? The success of Large Language Models (LLMs) indicates they are increasingly able to answer queries like these accurately, but that ability does not necessarily imply a general understandi…

    Submitted 29 September, 2022; originally announced September 2022.

  15. arXiv:2208.05056

    cs.LG cs.AI

    Model-Free Generative Replay for Lifelong Reinforcement Learning: Application to Starcraft-2

    Authors: Zachary Daniels, Aswin Raghavan, Jesse Hostetler, Abrar Rahman, Indranil Sur, Michael Piacentino, Ajay Divakaran

    Abstract: One approach to meet the challenges of deep lifelong reinforcement learning (LRL) is careful management of the agent's learning experiences, to learn (without forgetting) and build internal meta-models (of the tasks, environments, agents, and world). Generative replay (GR) is a biologically inspired replay mechanism that augments learning experiences with self-labelled examples drawn from an inter…

    Submitted 16 August, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: Accepted to the First Conference on Lifelong Learning Agents (CoLLAs 2022)

  16. Towards Understanding Confusion and Affective States Under Communication Failures in Voice-Based Human-Machine Interaction

    Authors: Sujeong Kim, Abhinav Garlapati, Jonah Lubin, Amir Tamrakar, Ajay Divakaran

    Abstract: We present a series of two studies conducted to understand users' affective states during voice-based human-machine interactions. Emphasis is placed on the cases of communication errors or failures. In particular, we are interested in understanding "confusion" in relation to other affective states. The studies consist of two types of tasks: (1) related to communication with a voice-based virtual…

    Submitted 15 July, 2022; originally announced July 2022.

    Journal ref: 2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)

  17. Broadening AI Ethics Narratives: An Indic Art View

    Authors: Ajay Divakaran, Aparna Sridhar, Ramya Srinivasan

    Abstract: Incorporating interdisciplinary perspectives is seen as an essential step towards enhancing artificial intelligence (AI) ethics. In this regard, the field of arts is perceived to play a key role in elucidating diverse historical and cultural narratives, serving as a bridge across research communities. Most of the works that examine the interplay between the field of arts and AI ethics concern digi…

    Submitted 15 May, 2023; v1 submitted 7 April, 2022; originally announced April 2022.

    Journal ref: 2023 ACM Conference on Fairness, Accountability, and Transparency

  18. arXiv:2202.05930

    cs.CV cs.AI

    Detecting out-of-context objects using contextual cues

    Authors: Manoj Acharya, Anirban Roy, Kaushik Koneripalli, Susmit Jha, Christopher Kanan, Ajay Divakaran

    Abstract: This paper presents an approach to detect out-of-context (OOC) objects in an image. Given an image with a set of objects, our goal is to determine if an object is inconsistent with the scene context and detect the OOC object with a bounding box. In this work, we consider commonly explored contextual relations such as co-occurrence relations, the relative size of an object with respect to other obj…

    Submitted 11 February, 2022; originally announced February 2022.

    Journal ref: IJCAI-ECAI 2022

  19. arXiv:2110.11899

    cs.CV cs.CL

    Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark

    Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran

    Abstract: We focus on Multimodal Machine Reading Comprehension (M3C) where a model is expected to answer questions based on a given passage (or context), and the context and the questions can be in different modalities. Previous works such as RecipeQA have proposed datasets and cloze-style tasks for evaluation. However, we identify three critical biases stemming from the question-answer generation process and…

    Submitted 22 October, 2021; originally announced October 2021.

  20. arXiv:2106.04653

    cs.CL

    Comprehension Based Question Answering using Bloom's Taxonomy

    Authors: Pritish Sahu, Michael Cogswell, Sara Rutherford-Quach, Ajay Divakaran

    Abstract: Current pre-trained language models have lots of knowledge, but a more limited ability to use that knowledge. Bloom's Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to provide…

    Submitted 8 June, 2021; originally announced June 2021.

  21. arXiv:2104.10139

    cs.CL

    Towards Solving Multimodal Comprehension

    Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran

    Abstract: This paper targets the problem of procedural multimodal machine comprehension (M3C). This task requires an AI to comprehend given steps of multimodal instructions and then answer questions. Compared to vanilla machine comprehension tasks where an AI is required only to understand a textual input, procedural M3C is more challenging as the AI needs to comprehend both the temporal and causal factors…

    Submitted 20 April, 2021; originally announced April 2021.

  22. arXiv:2104.00742

    cs.LG cs.CV

    Confidence Calibration for Domain Generalization under Covariate Shift

    Authors: Yunye Gong, Xiao Lin, Yi Yao, Thomas G. Dietterich, Ajay Divakaran, Melinda Gervasio

    Abstract: Existing calibration algorithms address the problem of covariate shift via unsupervised domain adaptation. However, these methods suffer from the following limitations: 1) they require unlabeled data from the target domain, which may not be available at the stage of calibration in real-world applications and 2) their performance depends heavily on the disparity between the distributions of the sou…

    Submitted 19 August, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8958-8967

  23. arXiv:2104.00619

    cs.CV

    Modular Adaptation for Cross-Domain Few-Shot Learning

    Authors: Xiao Lin, Meng Ye, Yunye Gong, Giedrius Buracas, Nikoletta Basiou, Ajay Divakaran, Yi Yao

    Abstract: Adapting pre-trained representations has become the go-to recipe for learning new downstream tasks with limited examples. While literature has demonstrated great successes via representation learning, in this work, we show that substantial performance improvement of downstream tasks can also be achieved by appropriate designs of the adaptation process. Specifically, we propose a modular adaptation…

    Submitted 1 April, 2021; originally announced April 2021.

  24. arXiv:2103.14712

    cs.CV cs.AI cs.CY cs.HC

    Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models

    Authors: Arijit Ray, Michael Cogswell, Xiao Lin, Kamran Alipour, Ajay Divakaran, Yi Yao, Giedrius Burachas

    Abstract: Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations that point to relevant regions despite the model producing an incorrect answer. Hence, we…

    Submitted 25 October, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Applied AI Letters, Wiley, 25 October 2021

  25. arXiv:2012.02275

    cs.LG cs.AI cs.CV cs.IT

    Detecting Trojaned DNNs Using Counterfactual Attributions

    Authors: Karan Sikka, Indranil Sur, Susmit Jha, Anirban Roy, Ajay Divakaran

    Abstract: We target the problem of detecting Trojans or backdoors in DNNs. Such models behave normally with typical inputs but produce specific incorrect predictions for inputs poisoned with a Trojan trigger. Our approach is based on a novel observation that the trigger behavior depends on a few ghost neurons that activate on the trigger pattern and exhibit abnormally high relative attribution for wrong decis…

    Submitted 3 December, 2020; originally announced December 2020.

  26. arXiv:2011.10889

    cs.CV

    Zero-Shot Learning with Knowledge Enhanced Visual Semantic Embeddings

    Authors: Karan Sikka, Jihua Huang, Andrew Silberfarb, Prateeth Nayak, Luke Rohrer, Pritish Sahu, John Byrnes, Ajay Divakaran, Richard Rohwer

    Abstract: We improve zero-shot learning (ZSL) by incorporating common-sense knowledge in DNNs. We propose Common-Sense based Neuro-Symbolic Loss (CSNL) that formulates prior knowledge as novel neuro-symbolic loss functions that regularize visual-semantic embedding. CSNL forces visual features in the VSE to obey common-sense rules relating to hypernyms and attributes. We introduce two key novelties for impro…

    Submitted 21 November, 2020; originally announced November 2020.

  27. arXiv:2011.10082

    cs.CV

    Hybrid Consistency Training with Prototype Adaptation for Few-Shot Learning

    Authors: Meng Ye, Xiao Lin, Giedrius Burachas, Ajay Divakaran, Yi Yao

    Abstract: Few-Shot Learning (FSL) aims to improve a model's generalization capability in low data regimes. Recent FSL works have made steady progress via metric learning, meta learning, representation learning, etc. However, FSL remains challenging due to the following longstanding difficulties. 1) The seen and unseen classes are disjoint, resulting in a distribution shift between training and testing. 2) D…

    Submitted 19 November, 2020; originally announced November 2020.

  28. arXiv:2007.06918

    cs.LG cs.AI cs.CV stat.ML

    Lifelong Learning using Eigentasks: Task Separation, Skill Acquisition, and Selective Transfer

    Authors: Aswin Raghavan, Jesse Hostetler, Indranil Sur, Abrar Rahman, Ajay Divakaran

    Abstract: We introduce the eigentask framework for lifelong learning. An eigentask pairs a skill that solves a set of related tasks with a generative model that can sample from the skill's input space. The framework extends generative replay approaches, which have mainly been used to avoid catastrophic forgetting, to also address other lifelong learning goals such as forward knowledge tran…

    Submitted 14 July, 2020; originally announced July 2020.

    Comments: Accepted at the 4th Lifelong Machine Learning Workshop at the Thirty-seventh International Conference on Machine Learning (ICML) 2020

  29. arXiv:2003.07344

    cs.CV cs.AI

    Deep Adaptive Semantic Logic (DASL): Compiling Declarative Knowledge into Deep Neural Networks

    Authors: Karan Sikka, Andrew Silberfarb, John Byrnes, Indranil Sur, Ed Chow, Ajay Divakaran, Richard Rohwer

    Abstract: We introduce Deep Adaptive Semantic Logic (DASL), a novel framework for automating the generation of deep neural networks that incorporates user-provided formal knowledge to improve learning from data. We provide formal semantics that demonstrate that our knowledge representation captures all of first order logic and that finite sampling from infinite domains converges to correct truth values. DAS…

    Submitted 16 March, 2020; originally announced March 2020.

  30. arXiv:2003.03695

    cs.LG cs.NE stat.ML

    Progressive Growing of Neural ODEs

    Authors: Hammad A. Ayyubi, Yi Yao, Ajay Divakaran

    Abstract: Neural Ordinary Differential Equations (NODEs) have proven to be a powerful modeling tool for approximating (interpolation) and forecasting (extrapolation) irregularly sampled time series data. However, their performance degrades substantially when applied to real-world data, especially long-term data with complex behaviors (e.g., long-term trend across years, mid-term seasonality across months, a…

    Submitted 7 March, 2020; originally announced March 2020.

    Journal ref: ICLR Workshop on Neural Networks and Differential Equations, 2020

  31. arXiv:1909.04696

    cs.CV cs.AI

    Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation

    Authors: Arijit Ray, Karan Sikka, Ajay Divakaran, Stefan Lee, Giedrius Burachas

    Abstract: While models for Visual Question Answering (VQA) have steadily improved over the years, interacting with one quickly reveals that these models lack consistency. For instance, if a model answers "red" to "What color is the balloon?", it might answer "no" if asked, "Is the balloon red?". These responses violate simple notions of entailment and raise questions about how effectively VQA models ground…

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)

  32. arXiv:1907.06167

    cs.CV

    FoodX-251: A Dataset for Fine-grained Food Classification

    Authors: Parneet Kaur, Karan Sikka, Weijun Wang, Serge Belongie, Ajay Divakaran

    Abstract: Food classification is a challenging problem due to the large number of categories, high visual similarity between different foods, as well as the lack of datasets for training state-of-the-art deep models. Solving this problem will require advances in both computer vision models as well as datasets for evaluating these models. In this paper we focus on the second aspect and introduce FoodX-251, a…

    Submitted 14 July, 2019; originally announced July 2019.

    Comments: Published at Fine-Grained Visual Categorization Workshop, CVPR19

  33. arXiv:1905.07075

    cs.IR cs.CL cs.CV cs.SI

    Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks

    Authors: Karan Sikka, Lucas Van Bramer, Ajay Divakaran

    Abstract: There has been an explosion of multimodal content generated on social media networks in the last few years, which has necessitated a deeper understanding of social media content and user behavior. We present a novel content-independent content-user-reaction model for social multimedia content analysis. Compared to prior works that generally tackle semantic content understanding and user behavior m…

    Submitted 10 June, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

    Comments: Preprint submitted to IJCV

  34. arXiv:1905.03319

    cs.LG stat.ML

    Data-Efficient Mutual Information Neural Estimator

    Authors: Xiao Lin, Indranil Sur, Samuel A. Nastase, Ajay Divakaran, Uri Hasson, Mohamed R. Amer

    Abstract: Measuring Mutual Information (MI) between high-dimensional, continuous, random variables from observed samples has wide theoretical and practical applications. Recent work, MINE (Belghazi et al. 2018), focused on estimating tight variational lower bounds of MI using neural networks, but assumed an unlimited supply of samples to prevent overfitting. In real-world applications, data is not always avail…

    Submitted 24 May, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

  35. arXiv:1904.09073

    cs.CV

    Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

    Authors: Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, Ajay Divakaran

    Abstract: Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead, they combine, via what has been called meaning multiplication, to create a new meaning that has a more complex relation to…

    Submitted 7 November, 2019; v1 submitted 19 April, 2019; originally announced April 2019.

    Comments: Accepted at EMNLP 2019; Added dataset link

  36. arXiv:1904.03285

    cs.CY cs.CV cs.HC

    Can You Explain That? Lucid Explanations Help Human-AI Collaborative Image Retrieval

    Authors: Arijit Ray, Yi Yao, Rakesh Kumar, Ajay Divakaran, Giedrius Burachas

    Abstract: While there have been many proposals on making AI algorithms explainable, few have attempted to evaluate the impact of AI-generated explanations on human performance in conducting human-AI collaborative tasks. To bridge the gap, we propose a Twenty-Questions style collaborative image retrieval game, Explanation-assisted Guess Which (ExAG), as a method of evaluating the efficacy of explanations (vi…

    Submitted 21 September, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: 2019 AAAI Conference on Human Computation and Crowdsourcing

    Journal ref: 2019 AAAI Conference on Human Computation and Crowdsourcing

  37. arXiv:1903.11649

    cs.CV

    Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

    Authors: Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran

    Abstract: We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discri…

    Submitted 15 October, 2019; v1 submitted 27 March, 2019; originally announced March 2019.

    Comments: v2 contains phrase localization results on Flickr30k Entities. Accepted for publication at ICCV 2019

  38. arXiv:1811.10575

    cs.CV

    Stacked Spatio-Temporal Graph Convolutional Networks for Action Segmentation

    Authors: Pallabi Ghosh, Yi Yao, Larry S. Davis, Ajay Divakaran

    Abstract: We propose novel Stacked Spatio-Temporal Graph Convolutional Networks (Stacked-STGCN) for action segmentation, i.e., predicting and localizing a sequence of actions over long videos. We extend the Spatio-Temporal Graph Convolutional Network (STGCN) originally proposed for skeleton-based action recognition to enable nodes with different characteristics (e.g., scene, actor, object, action, etc.), fe…

    Submitted 2 June, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

  39. arXiv:1807.01448

    cs.CV

    Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention

    Authors: Karuna Ahuja, Karan Sikka, Anirban Roy, Ajay Divakaran

    Abstract: We tackle the problem of understanding visual ads where, given an ad image, our goal is to rank appropriate human-generated statements describing the purpose of the ad. This problem is generally addressed by jointly embedding images and candidate statements to establish correspondence. Decoding a visual ad requires inference of both semantic and symbolic nuances referenced in an image and prior met…

    Submitted 4 July, 2018; originally announced July 2018.

    Comments: Accepted at CVPR 2018 Workshop: Towards Automatic Understanding of Visual Advertisements

  40. arXiv:1804.04340

    cs.CV

    Zero-Shot Object Detection

    Authors: Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran

    Abstract: We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes which are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification. We present a principled approach by first adapting visual-semantic embeddings for ZSD.…

    Submitted 27 July, 2018; v1 submitted 12 April, 2018; originally announced April 2018.

    Comments: 17 pages. ECCV 2018

  41. arXiv:1712.08730

    cs.CV

    Combining Weakly and Webly Supervised Learning for Classifying Food Images

    Authors: Parneet Kaur, Karan Sikka, Ajay Divakaran

    Abstract: Food classification from images is a fine-grained classification problem. Manual curation of food images is prohibitive in cost, time, and scalability. On the other hand, web data is freely available but contains noise. In this paper, we address the problem of classifying food images with minimal data curation. We also tackle a key problem with food images from the web, which often have multiple…

    Submitted 23 December, 2017; originally announced December 2017.

  42. arXiv:1505.02137

    cs.CY cs.LG

    Human Social Interaction Modeling Using Temporal Deep Networks

    Authors: Mohamed R. Amer, Behjat Siddiquie, Amir Tamrakar, David A. Salter, Brian Lande, Darius Mehri, Ajay Divakaran

    Abstract: We present a novel approach to computational modeling of social interactions based on modeling of essential social interaction predicates (ESIPs) such as joint attention and entrainment. Based on sound social psychological theory and methodology, we collect a new "Tower Game" dataset consisting of audio-visual capture of dyadic interactions labeled with the ESIPs. We expect this dataset to provide…

    Submitted 28 May, 2015; v1 submitted 6 May, 2015; originally announced May 2015.